entity_resolution.pairwise_matching¶
-
class
entity_resolution.pairwise_matching.pair_range_pairwise_matcher.PairwiseMatcher(**kwargs)¶ Compares pairs of records in a distributed computing environment with Spark. This class takes pair range results into account and evens out the load across the cluster.
-
comparators¶ a list of instances of comparators.similarity.Similarity
-
partitions¶ the number of partitions (recommended: equal to the number of cores, or a multiple of it)
-
sqlContext¶ the PySpark SQL context
-
-
PairwiseMatcher.match(dataframes, pairs)¶ Compares pairs of records based on the identifiers in each triplet of pairs
- Parameters
dataframes – a list of PySpark dataframes that contain the extracted data (treated as one dataframe)
pairs – a list of triplets specifying which rows should be compared with which
- Returns
a list of the pairs that match, each as a dict {src=id1, dst=id2}
-
class
entity_resolution.pairwise_matching.partitioning_by_pairs.PartitioningByPairs(sqlContext, partitions=4)¶ Partitions the records of the dataframes into one cumulative dataframe, such that each partition contains the records that should be compared on one core of a cluster node.
-
partitions¶ the number of partitions (recommended: equal to the number of cores, or a multiple of it; default 4)
-
sqlContext¶ the PySpark SQL context
-
-
PartitioningByPairs.get(dataframes, pairs)¶ Treats the dataframes as one big dataframe and splits it into partitions, where each partition contains the records to be compared with each other. This is not trivial, because a record that takes part in comparisons in several partitions has to be present in each of them, so partitions overlap.
- Parameters
dataframes – a list of PySpark dataframes that contain the extracted data (treated as one dataframe)
pairs – a list of triplets specifying which rows should be compared with which
- Returns
a dataframe that is partitioned internally
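The overlap requirement can be illustrated with a minimal plain-Python sketch. It is not the library's implementation: it does not use Spark, the helper name `partition_by_pairs` is hypothetical, and because the layout of the triplets is not specified here, it assumes plain `(src, dst)` id pairs and a round-robin assignment.

```python
def partition_by_pairs(records, pairs, partitions=4):
    """Hypothetical helper: spread pairs over partitions and copy the
    records each pair needs into the partition that will compare it."""
    buckets = [dict() for _ in range(partitions)]   # records per partition
    assigned = [[] for _ in range(partitions)]      # pairs per partition
    for index, (src, dst) in enumerate(pairs):
        target = index % partitions  # round-robin, an assumed load-balancing rule
        buckets[target][src] = records[src]
        buckets[target][dst] = records[dst]
        assigned[target].append((src, dst))
    return buckets, assigned


# Toy data: record ids mapped to a single field.
records = {1: "Anna Smith", 2: "Ana Smith", 3: "Bob Jones", 4: "Robert Jones"}
pairs = [(1, 2), (3, 4), (1, 3)]

buckets, assigned = partition_by_pairs(records, pairs, partitions=2)
for number, (bucket, todo) in enumerate(zip(buckets, assigned)):
    print(number, sorted(bucket), todo)
```

In this toy run record 3 is copied into both partitions, because it is compared with record 4 in one and with record 1 in the other.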
-
class
entity_resolution.pairwise_matching.pair_comparator.PairComparator(comparators)¶ Compares pairs of records within partitions in a distributed computing environment with Spark. Instances of this class are shipped to each node for execution.
-
comparators¶ a list of instances of comparators.similarity.Similarity
-
-
PairComparator.get_compared_pairs(partition, pairs)¶ Compares pairs of records using the comparators
- Parameters
partition – a partition of a PySpark dataframe that contains the extracted data
pairs – a list of triplets specifying which rows should be compared with which
- Yields
the compared pairs, each as a dict {src=id1, dst=id2, are_equal=True/False}, where a match means that all comparators returned True
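The "all comparators returned True" rule can be sketched in plain Python as follows. This is an illustration under assumptions, not the library's code: the partition is modelled as a dict of records keyed by id, pairs are plain `(src, dst)` tuples rather than triplets, and the two comparator functions are hypothetical stand-ins for `comparators.similarity.Similarity` instances, whose interface is not described here.

```python
def same_name_ignoring_case(a, b):
    # Hypothetical comparator: names are equal up to letter case.
    return a["name"].lower() == b["name"].lower()


def same_city(a, b):
    # Hypothetical comparator: exact match on the city field.
    return a["city"] == b["city"]


def get_compared_pairs(partition, pairs, comparators):
    """Yield one dict per pair; are_equal is True only if every comparator agrees."""
    for src, dst in pairs:
        are_equal = all(c(partition[src], partition[dst]) for c in comparators)
        yield {"src": src, "dst": dst, "are_equal": are_equal}


partition = {
    1: {"name": "Anna Smith", "city": "Oslo"},
    2: {"name": "anna smith", "city": "Oslo"},
    3: {"name": "Bob Jones", "city": "Oslo"},
}
results = list(
    get_compared_pairs(partition, [(1, 2), (1, 3)], [same_name_ignoring_case, same_city])
)
print(results)
```

Records 1 and 3 share a city but not a name, so one comparator returning False is enough to mark that pair as not equal.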
-
class
entity_resolution.pairwise_matching.pair_range_pairwise_matcher_with_average_similarity.PairwiseMatcher(**kwargs)¶ Compares pairs of records in a distributed computing environment with Spark. This class takes pair range results into account and evens out the load across the cluster.
This class uses the pair_comparator_with_average_similarity module, so match returns a list of all pairs together with their average similarity.
-
comparators¶ a list of instances of comparators.similarity.Similarity
-
partitions¶ the number of partitions (recommended: equal to the number of cores, or a multiple of it)
-
sqlContext¶ the PySpark SQL context
-
-
PairwiseMatcher.match(dataframes, pairs)¶ Compares pairs of records based on the identifiers in each triplet of pairs
- Parameters
dataframes – a list of PySpark dataframes that contain the extracted data (treated as one dataframe)
pairs – a list of triplets specifying which rows should be compared with which
- Returns
a list of the pairs that match, each as a dict {src=id1, dst=id2}
-
class
entity_resolution.pairwise_matching.pair_comparator_with_average_similarity.PairComparator(comparators)¶ See base class.
-
PairComparator.get_compared_pairs(partition, pairs)¶ Compares pairs of records using the comparators
- Parameters
partition – a partition of a PySpark dataframe that contains the extracted data
pairs – a list of triplets specifying which rows should be compared with which
- Yields
all compared pairs, each as a dict with its average similarity, regardless of whether the comparators returned True or False: {src=id1, dst=id2, average_similarity=float}
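A minimal plain-Python sketch of the averaging variant is given below. It rests on assumptions rather than on the library's code: each similarity is modelled as a hypothetical function returning a score between 0 and 1, the name score is computed with the standard library's `difflib.SequenceMatcher` as a stand-in, and pairs are plain `(src, dst)` tuples.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    # Stand-in similarity: character-level ratio of the lowercased names.
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def city_similarity(a, b):
    # Stand-in similarity: 1.0 on an exact city match, otherwise 0.0.
    return 1.0 if a["city"] == b["city"] else 0.0


def get_compared_pairs(partition, pairs, similarities):
    """Yield every pair with the mean of its scores; nothing is filtered out.
    Assumes at least one similarity function is supplied."""
    for src, dst in pairs:
        scores = [s(partition[src], partition[dst]) for s in similarities]
        yield {"src": src, "dst": dst, "average_similarity": sum(scores) / len(scores)}


partition = {
    1: {"name": "Anna Smith", "city": "Oslo"},
    2: {"name": "anna smith", "city": "Oslo"},
    3: {"name": "Bob Jones", "city": "Bergen"},
}
results = list(
    get_compared_pairs(partition, [(1, 2), (1, 3)], [name_similarity, city_similarity])
)
for row in results:
    print(row)
```

Because every pair is yielded with a score, the caller decides afterwards which threshold counts as a match.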