entity_resolution.pairwise_matching

class entity_resolution.pairwise_matching.pair_range_pairwise_matcher.PairwiseMatcher(**kwargs)

Compares pairs of records in a distributed computing environment with Spark. This class takes the pair range results into account and evens out the load across the cluster.

comparators: a list of instances of comparators.similarity.Similarity

partitions: the number of partitions (recommended: equal to, or a multiple of, the number of cores)

sqlContext: the pyspark SQL context

PairwiseMatcher.match(dataframes, pairs)

Compares pairs of records based on the identifiers in each triplet in pairs.

Parameters

dataframes – a list of pyspark dataframes that contain the extracted data (treated as one dataframe)
pairs – a list of triplets that specify which rows should be compared with which

Returns

a list of the pairs that match, each as a dict {src=id1, dst=id2}
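The relationship between this return value and the are_equal flag produced by PairComparator can be sketched in plain Python. This is only one plausible reading, not the library's code: the helper name keep_matches and the sample data are hypothetical, and the sketch assumes that match simply keeps the compared pairs whose flag is True.

```python
def keep_matches(compared_pairs):
    """Keep only matching pairs and drop the are_equal flag (assumed behaviour)."""
    return [
        {"src": pair["src"], "dst": pair["dst"]}
        for pair in compared_pairs
        if pair["are_equal"]
    ]


# Hypothetical comparison results in the documented dict shape.
compared_pairs = [
    {"src": 1, "dst": 2, "are_equal": True},
    {"src": 1, "dst": 3, "are_equal": False},
]
matches = keep_matches(compared_pairs)
print(matches)
```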

class entity_resolution.pairwise_matching.partitioning_by_pairs.PartitioningByPairs(sqlContext, partitions=4)

Partitions the records of the dataframes into one cumulative dataframe, so that each partition contains the records that should be compared on one core of the cluster's nodes.

partitions: the number of partitions (recommended: equal to, or a multiple of, the number of cores; default 4)

sqlContext: the pyspark SQL context

PartitioningByPairs.get(dataframes, pairs)

Treats the dataframes as one big dataframe and splits it into partitions, where each partition contains the records to be compared. This is not trivial, since each partition should have some overlapping records.

Parameters

dataframes – a list of pyspark dataframes that contain the extracted data (treated as one dataframe)
pairs – a list of triplets that specify which rows should be compared with which

Returns

a dataframe that is internally split into partitions
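Why partitions end up with overlapping records can be illustrated without Spark. The following is a minimal sketch, not the library's algorithm: the function partition_by_pairs is hypothetical, pairs are simplified to (src, dst) tuples because the triplet layout is not documented here, and round-robin assignment is an assumed strategy for spreading the comparisons evenly.

```python
from collections import defaultdict


def partition_by_pairs(pairs, partitions=4):
    """Assign each pair to a partition and collect the record ids it needs."""
    buckets = defaultdict(set)
    for index, (src, dst) in enumerate(pairs):
        # Round-robin assignment keeps the number of comparisons per partition even.
        bucket = index % partitions
        # Both records of a pair must be in the same partition to be compared there.
        buckets[bucket].update((src, dst))
    return {bucket: sorted(ids) for bucket, ids in sorted(buckets.items())}


# Simplified pairs: (src id, dst id).
pairs = [(1, 2), (1, 3), (2, 3), (3, 4)]
layout = partition_by_pairs(pairs, partitions=2)
print(layout)  # {0: [1, 2, 3], 1: [1, 3, 4]}
```

Records 1 and 3 appear in both partitions, because each takes part in comparisons that were assigned to different partitions.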

class entity_resolution.pairwise_matching.pair_comparator.PairComparator(comparators)

Compares pairs of records within partitions in a distributed computing environment with Spark. This class is shipped to each node for execution.

comparators: a list of instances of comparators.similarity.Similarity

PairComparator.get_compared_pairs(partition, pairs)

Compares pairs of records using the comparators.

Parameters

partition – a partition of a pyspark dataframe that contains the extracted data
pairs – a list of triplets that specify which rows should be compared with which

Yields

the compared pairs, each as a dict {src=id1, dst=id2, are_equal=True/False}, where a match means that all comparators returned True
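The "all comparators returned True" rule can be sketched as a plain Python generator. This is an illustration under assumptions rather than the library's implementation: records are held in a dict keyed by id instead of a Spark partition, pairs are simplified to (src, dst) tuples, and same_name and same_city are hypothetical stand-ins for comparators.similarity.Similarity instances.

```python
def get_compared_pairs(records, pairs, comparators):
    """Yield one dict per pair; are_equal is True only if every comparator agrees."""
    for src, dst in pairs:
        left, right = records[src], records[dst]
        are_equal = all(compare(left, right) for compare in comparators)
        yield {"src": src, "dst": dst, "are_equal": are_equal}


# Hypothetical comparators returning True or False.
def same_name(left, right):
    return left["name"].lower() == right["name"].lower()


def same_city(left, right):
    return left["city"] == right["city"]


records = {
    1: {"name": "Ann Lee", "city": "Oslo"},
    2: {"name": "ann lee", "city": "Oslo"},
    3: {"name": "Ann Lee", "city": "Bergen"},
}
compared = list(get_compared_pairs(records, [(1, 2), (1, 3)], [same_name, same_city]))
print(compared)
```

Pair (1, 3) is not a match even though the names agree, because a single comparator returning False is enough to set are_equal to False.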

class entity_resolution.pairwise_matching.pair_range_pairwise_matcher_with_average_similarity.PairwiseMatcher(**kwargs)

Compares pairs of records in a distributed computing environment with Spark. This class takes the pair range results into account and evens out the load across the cluster.

This class uses the pair_comparator_with_average_similarity module, so match returns a list of all pairs with their average similarity.

comparators: a list of instances of comparators.similarity.Similarity

partitions: the number of partitions (recommended: equal to, or a multiple of, the number of cores)

sqlContext: the pyspark SQL context

PairwiseMatcher.match(dataframes, pairs)

Compares pairs of records based on the identifiers in each triplet in pairs.

Parameters

dataframes – a list of pyspark dataframes that contain the extracted data (treated as one dataframe)
pairs – a list of triplets that specify which rows should be compared with which

Returns

a list of the pairs that match, each as a dict {src=id1, dst=id2}

class entity_resolution.pairwise_matching.pair_comparator_with_average_similarity.PairComparator(comparators)

See base class.

PairComparator.get_compared_pairs(partition, pairs)

Compares pairs of records using the comparators.

Parameters

partition – a partition of a pyspark dataframe that contains the extracted data
pairs – a list of triplets that specify which rows should be compared with which

Yields

the compared pairs, each as a dict with its average similarity, regardless of whether the comparators returned True or False: {src=id1, dst=id2, average_similarity=float}
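The averaging variant can be sketched in the same plain-Python style. The assumptions are stated up front: records live in a dict keyed by id rather than in a Spark partition, pairs are simplified to (src, dst) tuples, the hypothetical name_similarity and city_similarity functions stand in for comparators.similarity.Similarity instances, and the average is taken to be an unweighted mean, which the source does not confirm.

```python
def get_compared_pairs_with_average(records, pairs, similarities):
    """Yield one dict per pair with the unweighted mean of all similarity scores."""
    for src, dst in pairs:
        left, right = records[src], records[dst]
        scores = [similarity(left, right) for similarity in similarities]
        yield {
            "src": src,
            "dst": dst,
            "average_similarity": sum(scores) / len(scores),
        }


# Hypothetical similarity functions returning a score between 0.0 and 1.0.
def name_similarity(left, right):
    return 1.0 if left["name"].lower() == right["name"].lower() else 0.0


def city_similarity(left, right):
    return 1.0 if left["city"] == right["city"] else 0.0


records = {
    1: {"name": "Ann Lee", "city": "Oslo"},
    2: {"name": "ann lee", "city": "Oslo"},
    3: {"name": "Ann Lee", "city": "Bergen"},
}
scored = list(
    get_compared_pairs_with_average(
        records, [(1, 2), (1, 3)], [name_similarity, city_similarity]
    )
)
print(scored)
```

Every pair is reported here, including (1, 3) with a score of 0.5, so any threshold is left to the caller.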