Core API

Compare by Similarity

The similarity calculation is customizable; visit the metrics section for more information.

JCompare.similarity.find_similar_files_pairwise

find_similar_files_pairwise(folder1: Folder, folder2: Folder, threshold: float, same_parent_only: bool, comparer: Similarity, mode: int) -> dict[str, list[tuple[str, float]]]

Finds similar files between two folders in a pairwise manner.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| folder1 | Folder | The first folder object, which contains the files to be compared. | required |
| folder2 | Folder | The second folder object, which contains the files to be compared. | required |
| threshold | float | The similarity threshold. Only pairs of files with a similarity score equal to or above this threshold will be included in the result. | required |
| same_parent_only | bool | If set to True, only files with the same parent directory will be compared. | required |
| comparer | Similarity | The similarity comparer object used to compare the files. | required |
| mode | int | The mode of operation. If set to SYNC, the function will use synchronous I/O. If set to ASYNC, the function will use asynchronous I/O. If set to ASYNC_AND_MULTIPROCESS, the function will use both asynchronous I/O and multiprocessing. | required |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If an invalid mode is given. |

Returns:

| Type | Description |
| --- | --- |
| dict[str, list[tuple[str, float]]] | A dictionary where each key is the relative path of a file in the first folder and each value is a list of tuples. Each tuple contains the relative path of a similar file in the second folder and the similarity score. |
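The shape of the returned dictionary can be illustrated with a minimal, self-contained sketch. It does not call JCompare: `pairwise_similar` is a hypothetical helper, in-memory strings stand in for file contents, and `difflib.SequenceMatcher` stands in for the `Similarity` comparer.

```python
import os
from difflib import SequenceMatcher


def pairwise_similar(files1, files2, threshold, same_parent_only=False):
    """Hypothetical stand-in: map each path in files1 to its (path, score) matches in files2."""
    result = {}
    for path1, text1 in files1.items():
        for path2, text2 in files2.items():
            # Optionally restrict the comparison to files in the same parent directory.
            if same_parent_only and os.path.dirname(path1) != os.path.dirname(path2):
                continue
            score = SequenceMatcher(None, text1, text2).ratio()
            # Keep only pairs whose score is at or above the threshold.
            if score >= threshold:
                result.setdefault(path1, []).append((path2, score))
    return result


files1 = {"a.txt": "hello world", "docs/b.txt": "lorem ipsum"}
files2 = {"c.txt": "hello world", "docs/d.txt": "completely different"}
similar = pairwise_similar(files1, files2, threshold=0.9)
print(similar)
```

Files in the first folder that have no match at or above the threshold do not appear as keys at all, which mirrors how the result is described above.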

Source code in JCompare/similarity.py
def find_similar_files_pairwise(folder1: Folder, folder2: Folder, threshold: float, same_parent_only: bool, comparer: Similarity, mode: int) -> dict[str, list[tuple[str, float]]]:
    """
    Finds similar files between two folders in a pairwise manner.

    Args:
        folder1 (Folder): The first folder object, which contains the files to be compared.
        folder2 (Folder): The second folder object, which contains the files to be compared.
        threshold (float): The similarity threshold. Only pairs of files with a similarity score equal to or above this threshold will be included in the result.
        same_parent_only (bool): If set to True, only files with the same parent directory will be compared.
        comparer (Similarity): The similarity comparer object used to compare the files.
        mode (int): The mode of operation. If set to SYNC, the function will use synchronous I/O. If set to ASYNC, the function will use asynchronous I/O. If set to ASYNC_AND_MULTIPROCESS, the function will use both asynchronous I/O and multiprocessing.

    Raises:
        ValueError: If an invalid mode is given.

    Returns:
        dict[str, list[tuple[str, float]]]: A dictionary where each key is the relative path of a file in the first folder and each value is a list of tuples. Each tuple contains the relative path of a similar file in the second folder and the similarity score.
    """

    files1 = folder1.list
    files2 = folder2.list

    pairs = [(file1, file2) for file1 in files1 for file2 in files2]

    if same_parent_only:
        pairs = [(file1, file2) for file1, file2 in pairs if os.path.dirname(
            file1) == os.path.dirname(file2)]

    if mode == SYNC:
        similar_files = {}

        for file1, file2 in tqdm(pairs, desc="Comparing files", unit="pair"):
            file1_fullpath = os.path.join(folder1.folder_path, file1)
            file2_fullpath = os.path.join(folder2.folder_path, file2)

            similarity = comparer.cmp(
                (file1_fullpath, file1), (file2_fullpath, file2))

            if similarity >= threshold:
                # Compute each relative path once and group matches by file1.
                file1_relpath = os.path.relpath(file1, folder1.path)
                file2_relpath = os.path.relpath(file2, folder2.path)
                similar_files.setdefault(file1_relpath, []).append(
                    (file2_relpath, similarity))

        return similar_files

    elif mode in (ASYNC, ASYNC_AND_MULTIPROCESS):
        return asyncio.run(async_find_similar_files(pairs, folder1, folder2, threshold, comparer, mode))
    else:
        raise ValueError("Invalid Mode")

JCompare.similarity.find_dissimilar_files_pairwise

find_dissimilar_files_pairwise(folder1: Folder, folder2: Folder, threshold: float, same_parent_only: bool, comparer: Similarity, mode: int) -> dict[str, Union[list[str], bool]]

Finds dissimilar files between two folders in a pairwise manner.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| folder1 | Folder | The first folder object, which contains the files to be compared. | required |
| folder2 | Folder | The second folder object, which contains the files to be compared. | required |
| threshold | float | The similarity threshold. A file is reported as dissimilar when its highest similarity score against the compared files in the other folder is below this threshold. | required |
| same_parent_only | bool | If set to True, only files with the same parent directory will be compared. | required |
| comparer | Similarity | The similarity comparer object used to compare the files. | required |
| mode | int | The mode of operation. If set to SYNC, the function will use synchronous I/O. If set to ASYNC, the function will use asynchronous I/O. If set to ASYNC_AND_MULTIPROCESS, the function will use both asynchronous I/O and multiprocessing. | required |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If an invalid mode is given. |

Returns:

| Type | Description |
| --- | --- |
| dict[str, Union[list[str], bool]] | A dictionary with three keys: 'folder1', 'folder2', and 'is_similar'. The value of 'folder1' is a list of the relative paths of the dissimilar files in the first folder. The value of 'folder2' is a list of the relative paths of the dissimilar files in the second folder. The value of 'is_similar' is a boolean indicating whether the two folders are similar. |
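As a rough illustration of the result described above, the sketch below tracks the best score seen for every file and reports the files whose best score stays below the threshold. It is not JCompare code: `pairwise_dissimilar` is a hypothetical helper, strings stand in for file contents, and `difflib.SequenceMatcher` stands in for the `Similarity` comparer.

```python
from difflib import SequenceMatcher


def pairwise_dissimilar(files1, files2, threshold):
    """Hypothetical stand-in: report files whose best match in the other folder is below threshold."""
    best1 = {path: 0.0 for path in files1}
    best2 = {path: 0.0 for path in files2}
    for path1, text1 in files1.items():
        for path2, text2 in files2.items():
            score = SequenceMatcher(None, text1, text2).ratio()
            # Remember the highest score seen so far for each file on both sides.
            best1[path1] = max(best1[path1], score)
            best2[path2] = max(best2[path2], score)
    only1 = [path for path, score in best1.items() if score < threshold]
    only2 = [path for path, score in best2.items() if score < threshold]
    return {"folder1": only1, "folder2": only2, "is_similar": not (only1 or only2)}


report = pairwise_dissimilar(
    {"a.txt": "hello world", "b.txt": "lorem ipsum"},
    {"c.txt": "hello world"},
    threshold=0.9,
)
print(report)
```

Here `b.txt` has no counterpart scoring at least 0.9, so it is listed under 'folder1' and 'is_similar' is False.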

Source code in JCompare/similarity.py
def find_dissimilar_files_pairwise(folder1: Folder, folder2: Folder, threshold: float, same_parent_only: bool, comparer: Similarity, mode: int) -> dict[str, Union[list[str], bool]]:
    """
    Finds dissimilar files between two folders in a pairwise manner.

    Args:
        folder1 (Folder): The first folder object, which contains the files to be compared.
        folder2 (Folder): The second folder object, which contains the files to be compared.
        threshold (float): The similarity threshold. A file is reported as dissimilar when its highest similarity score against the compared files in the other folder is below this threshold.
        same_parent_only (bool): If set to True, only files with the same parent directory will be compared.
        comparer (Similarity): The similarity comparer object used to compare the files.
        mode (int): The mode of operation. If set to SYNC, the function will use synchronous I/O. If set to ASYNC, the function will use asynchronous I/O. If set to ASYNC_AND_MULTIPROCESS, the function will use both asynchronous I/O and multiprocessing.

    Raises:
        ValueError: If an invalid mode is given.

    Returns:
        dict[str, Union[list[str], bool]]: A dictionary with three keys: 'folder1', 'folder2', and 'is_similar'. The value of 'folder1' is a list of the relative paths of the dissimilar files in the first folder. The value of 'folder2' is a list of the relative paths of the dissimilar files in the second folder. The value of 'is_similar' is a boolean indicating whether the two folders are similar.
    """

    if same_parent_only:
        pairs = find_common_path(folder1.tree, folder2.tree)
        files1_list = [i[0] for i in pairs]
        files2_list = [i[1] for i in pairs]
    else:
        files1_list = folder1.list
        files2_list = folder2.list
        pairs = tuple((file1, file2)
                      for file1 in files1_list for file2 in files2_list)

    dissimilar_files_folder1 = {i: 0 for i in files1_list}
    dissimilar_files_folder2 = {i: 0 for i in files2_list}

    if mode == SYNC:
        for file1, file2 in tqdm(pairs, desc="Comparing files", unit="pair"):
            if dissimilar_files_folder1[file1] >= threshold and dissimilar_files_folder2[file2] >= threshold:
                continue

            file1_fullpath = os.path.join(folder1.folder_path, file1)
            file2_fullpath = os.path.join(folder2.folder_path, file2)

            similarity = comparer.cmp(
                (file1_fullpath, file1), (file2_fullpath, file2))

            dissimilar_files_folder1[file1] = max(
                similarity, dissimilar_files_folder1[file1])
            dissimilar_files_folder2[file2] = max(
                similarity, dissimilar_files_folder2[file2])

    elif mode in (ASYNC, ASYNC_AND_MULTIPROCESS):
        dissimilar_files_folder1, dissimilar_files_folder2 = asyncio.run(async_find_dissimilar_files(pairs, dissimilar_files_folder1, dissimilar_files_folder2, folder1, folder2,
                                                                                                     comparer, mode))
    else:
        raise ValueError("Invalid Mode")

    # Keep only the files whose best similarity score stayed below the threshold.
    dissimilar_files_folder1 = [
        os.path.relpath(file, folder1.path)
        for file, score in dissimilar_files_folder1.items() if score < threshold]
    dissimilar_files_folder2 = [
        os.path.relpath(file, folder2.path)
        for file, score in dissimilar_files_folder2.items() if score < threshold]

    is_similar = not (dissimilar_files_folder1 or dissimilar_files_folder2)

    result = {
        "folder1": dissimilar_files_folder1,
        "folder2": dissimilar_files_folder2,
        "is_similar": is_similar
    }

    return result

Compare by Hash

JCompare.hash.find_identical_files

find_identical_files(folder1: Folder, folder2: Folder, same_parent_only: bool, hash_algorithm: tuple[str]) -> dict[str, list[str]]

Finds identical files between two folders based on their hash values.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| folder1 | Folder | The first folder object, which contains the files to be compared. | required |
| folder2 | Folder | The second folder object, which contains the files to be compared. | required |
| same_parent_only | bool | If set to True, only files with the same parent folder will be compared. | required |
| hash_algorithm | tuple[str] | A tuple of strings specifying the names of the hash algorithms to use. | required |

Returns:

| Type | Description |
| --- | --- |
| dict[str, list[str]] | A dictionary mapping the relative paths of the identical files in the first folder to lists of the relative paths of the identical files in the second folder. |
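The idea of matching files by hash can be sketched without JCompare as follows. `digest` and `identical_by_hash` are hypothetical helpers, byte strings stand in for file contents, and treating two files as identical only when the digests of all listed algorithms match is an assumption, since the library's `calculate_hash` is not shown here.

```python
import hashlib


def digest(data, algorithms=("sha256",)):
    """Return one hex digest per algorithm name, as a tuple."""
    return tuple(hashlib.new(name, data).hexdigest() for name in algorithms)


def identical_by_hash(files1, files2, algorithms=("sha256",)):
    """Hypothetical stand-in: map each path in files1 to the paths in files2 with equal digests."""
    hashes1 = {path: digest(data, algorithms) for path, data in files1.items()}
    hashes2 = {path: digest(data, algorithms) for path, data in files2.items()}
    result = {}
    for path1, hash1 in hashes1.items():
        for path2, hash2 in hashes2.items():
            # Assumption: every algorithm's digest must agree for a match.
            if hash1 == hash2:
                result.setdefault(path1, []).append(path2)
    return result


matches = identical_by_hash(
    {"a.bin": b"abc", "b.bin": b"xyz"},
    {"copy.bin": b"abc", "other.bin": b"123"},
    algorithms=("md5", "sha256"),
)
print(matches)
```

Each file is hashed once and the digests are compared afterwards, so the cost of reading file contents does not grow with the number of pairs.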

Source code in JCompare/hash.py
def find_identical_files(folder1: Folder, folder2: Folder, same_parent_only: bool, hash_algorithm: tuple[str]) -> dict[str, list[str]]:
    """
    Finds identical files between two folders based on their hash values.

    Args:
        folder1 (Folder): The first folder object, which contains the files to be compared.
        folder2 (Folder): The second folder object, which contains the files to be compared.
        same_parent_only (bool): If set to True, only files with the same parent folder will be compared.
        hash_algorithm (tuple[str]): A tuple of strings specifying the names of the hash algorithms to use.

    Returns:
        dict[str, list[str]]: A dictionary mapping the relative paths of the identical files in the first folder to lists of the relative paths of the identical files in the second folder.
    """

    if same_parent_only:
        pairs = find_common_path(folder1.tree, folder2.tree)
        files1_list = [i[0] for i in pairs]
        files2_list = [i[1] for i in pairs]
    else:
        files1_list = folder1.list
        files2_list = folder2.list
        pairs = tuple((file1, file2)
                      for file1 in files1_list for file2 in files2_list)

    files1_list = list(set(files1_list))
    files2_list = list(set(files2_list))

    hash_dict1, hash_dict2 = calculate_hash(
        files1_list, files2_list, folder1, folder2, hash_algorithm)

    identical_files = {}

    for file1, file2 in pairs:
        if hash_dict1[file1] == hash_dict2[file2]:
            file1_relpath = os.path.relpath(file1, folder1.path)
            file2_relpath = os.path.relpath(file2, folder2.path)

            if file1_relpath not in identical_files:
                identical_files[file1_relpath] = []
            identical_files[file1_relpath].append(file2_relpath)

    return identical_files

JCompare.hash.find_different_files

find_different_files(folder1: Folder, folder2: Folder, same_parent_only: bool, hash_algorithm: tuple[str]) -> dict[str, Union[list[str], bool]]

Finds different files between two folders based on their hash values.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| folder1 | Folder | The first folder object, which contains the files to be compared. | required |
| folder2 | Folder | The second folder object, which contains the files to be compared. | required |
| same_parent_only | bool | If set to True, only files with the same parent folder will be compared. | required |
| hash_algorithm | tuple[str] | A tuple of strings specifying the names of the hash algorithms to use. | required |

Returns:

| Type | Description |
| --- | --- |
| dict[str, Union[list[str], bool]] | A dictionary with keys "folder1", "folder2", and "is_identical". The values for "folder1" and "folder2" are lists of the relative paths of the different files in the respective folders. The value for "is_identical" is a boolean indicating whether the two folders are identical. |
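A minimal sketch of the report described above, assuming every file is compared against every file in the other folder (the case where `same_parent_only` is False). `different_by_hash` is a hypothetical helper, not part of JCompare, and byte strings stand in for file contents.

```python
import hashlib


def different_by_hash(files1, files2, algorithm="sha256"):
    """Hypothetical stand-in: list files whose digest does not occur in the other folder."""

    def digests(files):
        return {path: hashlib.new(algorithm, data).hexdigest() for path, data in files.items()}

    hashes1, hashes2 = digests(files1), digests(files2)
    seen1, seen2 = set(hashes1.values()), set(hashes2.values())
    # A file is "different" when no file on the other side shares its digest.
    only1 = [path for path, value in hashes1.items() if value not in seen2]
    only2 = [path for path, value in hashes2.items() if value not in seen1]
    return {"folder1": only1, "folder2": only2, "is_identical": not (only1 or only2)}


diff = different_by_hash({"a": b"abc", "b": b"xyz"}, {"c": b"abc"})
print(diff)
```

The folders count as identical only when both lists are empty, which matches the "is_identical" flag in the description.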

Source code in JCompare/hash.py
def find_different_files(folder1: Folder, folder2: Folder, same_parent_only: bool, hash_algorithm: tuple[str]) -> dict[str, Union[list[str], bool]]:
    """
    Finds different files between two folders based on their hash values.

    Args:
        folder1 (Folder): The first folder object, which contains the files to be compared.
        folder2 (Folder): The second folder object, which contains the files to be compared.
        same_parent_only (bool): If set to True, only files with the same parent folder will be compared.
        hash_algorithm (tuple[str]): A tuple of strings specifying the names of the hash algorithms to use.

    Returns:
        dict[str, Union[list[str], bool]]: A dictionary with keys "folder1", "folder2", and "is_identical". The values for "folder1" and "folder2" are lists of the relative paths of the different files in the respective folders. The value for "is_identical" is a boolean indicating whether the two folders are identical.
    """

    if same_parent_only:
        pairs = find_common_path(folder1.tree, folder2.tree)
        files1_list = [i[0] for i in pairs]
        files2_list = [i[1] for i in pairs]
    else:
        files1_list = folder1.list
        files2_list = folder2.list
        pairs = tuple((file1, file2)
                      for file1 in files1_list for file2 in files2_list)

    files1_list = list(set(files1_list))
    files2_list = list(set(files2_list))

    hash_dict1, hash_dict2 = calculate_hash(
        files1_list, files2_list, folder1, folder2, hash_algorithm)

    name_dict1 = {i: [] for i in files1_list}
    name_dict2 = {i: [] for i in files2_list}
    for file1, file2 in pairs:
        name_dict1[file1].append(file2)
        name_dict2[file2].append(file1)

    different_files_folder1 = []
    different_files_folder2 = []

    for file1, file1_hash in hash_dict1.items():
        if file1_hash not in [hash_dict2[i] for i in name_dict1[file1]]:
            different_files_folder1.append(
                os.path.relpath(file1, folder1.path))

    for file2, file2_hash in hash_dict2.items():
        if file2_hash not in [hash_dict1[i] for i in name_dict2[file2]]:
            different_files_folder2.append(
                os.path.relpath(file2, folder2.path))

    is_identical = not (different_files_folder1 or different_files_folder2)

    result = {
        "folder1": different_files_folder1,
        "folder2": different_files_folder2,
        "is_identical": is_identical
    }

    return result

Compare by Directory Structure

JCompare.mcs.find_identical_files_by_mcs

find_identical_files_by_mcs(folder1: Folder, folder2: Folder, ignore_directory_names: bool = False, path: None | tuple[tuple[str], tuple[str]] = None) -> list[dict[str, list[str]]]

Finds identical files between two folders based on the maximum common subtree (MCS).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| folder1 | Folder | The first folder object, which contains the files to be compared. | required |
| folder2 | Folder | The second folder object, which contains the files to be compared. | required |
| ignore_directory_names | bool | If set to True, directory names will be ignored when comparing the folder structures. | False |
| path | None \| tuple[tuple[str], tuple[str]] | A tuple of two tuples, each containing the path to a subtree in the corresponding folder. If provided, only the specified subtrees will be compared. | None |

Returns:

| Type | Description |
| --- | --- |
| list[dict[str, list[str]]] | A list of dictionaries. Each dictionary represents a set of identical files in an MCS (there might be multiple), with the keys being the relative paths of the files in the first folder and the values being lists of the relative paths of the identical files in the second folder. |
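How the `path` argument narrows the comparison can be sketched as below. The sketch assumes a folder tree is a nested dictionary in which directories are dictionaries and files map to None; the actual representation behind `Folder.tree` is not shown on this page, and `descend` is a hypothetical helper.

```python
def descend(tree, path):
    """Follow a tuple of directory names into a nested-dict tree."""
    for name in path:
        tree = tree[name]
    return tree


# Assumed representation: directories are dicts, files map to None.
tree1 = {"src": {"core": {"a.py": None, "b.py": None}}, "README.md": None}
tree2 = {"lib": {"a.py": None, "c.py": None}}

# One path per folder, in the same order as folder1 and folder2.
path = (("src", "core"), ("lib",))
sub1 = descend(tree1, path[0])
sub2 = descend(tree2, path[1])

# Entries present in both subtrees under the same name.
common = sorted(sub1.keys() & sub2.keys())
print(common)
```

Only the two selected subtrees take part in the comparison, so `README.md` at the top of the first tree is never considered.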

Source code in JCompare/mcs.py
def find_identical_files_by_mcs(folder1: Folder, folder2: Folder, ignore_directory_names: bool = False, path: None | tuple[tuple[str], tuple[str]] = None) -> list[dict[str, list[str]]]:
    """
    Finds identical files between two folders based on the maximum common subtree (MCS).

    Args:
        folder1 (Folder): The first folder object, which contains the files to be compared.
        folder2 (Folder): The second folder object, which contains the files to be compared.
        ignore_directory_names (bool, optional): If set to True, directory names will be ignored when comparing the folder structures. Defaults to False.
        path (None | tuple[tuple[str], tuple[str]], optional): A tuple of two tuples, each containing the path to a subtree in the corresponding folder. If provided, only the specified subtrees will be compared. Defaults to None.

    Returns:
        list[dict[str, list[str]]]: A list of dictionaries. Each dictionary represents a set of identical files in an MCS (there might be multiple), with the keys being the relative paths of the files in the first folder and the values being lists of the relative paths of the identical files in the second folder.
    """

    if path is None:
        tree1 = folder1.tree
        tree2 = folder2.tree
        subtrees = find_max_common_subtree(
            tree1, tree2, ignore_directory_names)
    else:
        tree1 = folder1.tree
        tree2 = folder2.tree
        for i in path[0]:
            tree1 = tree1[i]
        for i in path[1]:
            tree2 = tree2[i]
        subtrees = [[find_common_subtree(tree1, tree2, ignore_directory_names)[
            0], path[0], path[1]]]

    results = []

    if path is None:
        path = ((), ())

    for subtree in subtrees:
        tmp = {}
        for i in dict2list(subtree[0]):
            file1 = "../" + \
                "/".join(list(path[0]) + [j[0]
                         if isinstance(j, tuple) else j for j in i])
            file2 = "../" + \
                "/".join(list(path[1]) + [j[1]
                         if isinstance(j, tuple) else j for j in i])
            tmp[file1] = [file2]
        results.append(tmp)

    return results

JCompare.mcs.find_different_files_by_mcs

find_different_files_by_mcs(folder1: Folder, folder2: Folder, ignore_directory_names: bool = False, path: None | tuple[tuple[str], tuple[str]] = None) -> list[dict[str, list[str] | str | bool]]

Finds different files between two folders based on the maximum common subtree (MCS).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| folder1 | Folder | The first folder object, which contains the files to be compared. | required |
| folder2 | Folder | The second folder object, which contains the files to be compared. | required |
| ignore_directory_names | bool | If set to True, directory names will be ignored when comparing the folder structures. | False |
| path | None \| tuple[tuple[str], tuple[str]] | A tuple of two tuples, each containing the path to a subtree in the corresponding folder. If provided, only the specified subtrees will be compared. | None |

Returns:

| Type | Description |
| --- | --- |
| list[dict[str, list[str] \| str \| bool]] | A list of dictionaries, one per MCS (there might be multiple). Each dictionary has the keys "folder1", "folder2", and "is_identical". The values for "folder1" and "folder2" are the relative paths of the files in the respective folder that lie outside the common subtree. The value for "is_identical" is a boolean that is True only when no such files remain in either folder. |
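The sketch below shows one way such a report could be derived once a common subtree is known. It assumes trees are nested dictionaries with files mapping to None, and `subtract` and `to_paths` are hypothetical helpers; they are not the library's `subtract_max_common_subtree` or `tree2path`, whose behavior is not documented on this page.

```python
def subtract(tree, common):
    """Return the part of a nested-dict tree that is not covered by the common subtree."""
    rest = {}
    for name, child in tree.items():
        if name not in common:
            rest[name] = child  # entirely outside the common subtree
        elif isinstance(child, dict) and isinstance(common[name], dict):
            leftover = subtract(child, common[name])
            if leftover:
                rest[name] = leftover
        elif isinstance(child, dict) or isinstance(common[name], dict):
            rest[name] = child  # a file on one side, a directory on the other
    return rest


def to_paths(tree, prefix=""):
    """Flatten a nested-dict tree into a list of slash-separated file paths."""
    paths = []
    for name, child in tree.items():
        full = f"{prefix}/{name}" if prefix else name
        if isinstance(child, dict):
            paths.extend(to_paths(child, full))
        else:
            paths.append(full)
    return paths


tree1 = {"src": {"a.py": None, "b.py": None}}
tree2 = {"src": {"a.py": None, "c.py": None}}
common = {"src": {"a.py": None}}  # assumed to be known already

rest1, rest2 = subtract(tree1, common), subtract(tree2, common)
report = {
    "folder1": to_paths(rest1),
    "folder2": to_paths(rest2),
    "is_identical": not rest1 and not rest2,
}
print(report)
```

When both leftovers are empty, the two structures coincide with the common subtree and "is_identical" becomes True.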

Source code in JCompare/mcs.py
def find_different_files_by_mcs(folder1: Folder, folder2: Folder, ignore_directory_names: bool = False, path: None | tuple[tuple[str], tuple[str]] = None) -> list[dict[str, list[str] | str | bool]]:
    """
    Finds different files between two folders based on the maximum common subtree (MCS).

    Args:
        folder1 (Folder): The first folder object, which contains the files to be compared.
        folder2 (Folder): The second folder object, which contains the files to be compared.
        ignore_directory_names (bool, optional): If set to True, directory names will be ignored when comparing the folder structures. Defaults to False.
        path (None | tuple[tuple[str], tuple[str]], optional): A tuple of two tuples, each containing the path to a subtree in the corresponding folder. If provided, only the specified subtrees will be compared. Defaults to None.

    Returns:
        list[dict[str, list[str] | str | bool]]: A list of dictionaries, one per MCS (there might be multiple). Each dictionary has the keys "folder1", "folder2", and "is_identical". The values for "folder1" and "folder2" are the relative paths of the files in the respective folder that lie outside the common subtree. The value for "is_identical" is a boolean that is True only when no such files remain in either folder.
    """

    if path is None:
        tree1 = folder1.tree
        tree2 = folder2.tree
        subtrees = find_max_common_subtree(
            tree1, tree2, ignore_directory_names)
    else:
        tree1 = folder1.tree
        tree2 = folder2.tree
        for i in path[0]:
            tree1 = tree1[i]
        for i in path[1]:
            tree2 = tree2[i]
        subtrees = [[find_common_subtree(tree1, tree2, ignore_directory_names)[
            0], path[0], path[1]]]

    results = []

    for subtree in subtrees:
        if ignore_directory_names:
            f1_tree, f2_tree = folder1.tree, folder2.tree
        else:
            f1_tree, f2_tree = rm_common_node(folder1.tree, folder2.tree)

        tree1, tree2 = subtract_max_common_subtree(
            f1_tree, f2_tree, subtree)
        results.append({
            "folder1": tree2path(tree1),
            "folder2": tree2path(tree2),
            "is_identical": not tree1 and not tree2
        })
    return results