entity_resolution.clustering¶
-
class
entity_resolution.clustering.connected_component.Clustering(sqlContext, checkpointInterval=2)¶ Class that clusters data so that each cluster corresponds to one entity. This is a wrapper over ConnectedComponents from the graphframes package for pyspark.
-
sqlContext¶ sql context of pyspark
-
checkpointInterval¶ a number of checkpoints in execution where the algorithm can store temporary results; zero is not recommended, default=2
-
-
Clustering.cluster(dataframes, pairwise_matching_graph)¶ Assigns a cluster id to each record of the input dataframes
- Parameters
dataframes – a list of pyspark dataframes that contain the extracted data (treated as one dataframe)
pairwise_matching_graph – a dataframe with rows as {src=id1, dst=id2}, where each row denotes that the two records match
- Returns
a dataframe with the records from the input dataframes and a new attribute, component
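To illustrate what the component attribute means, here is a minimal pure-Python sketch of connected-component labelling over matching pairs. It is not the package's implementation (which delegates to ConnectedComponents from graphframes on Spark); the function name `assign_components` and the choice of the smallest record id as the component label are assumptions made for illustration only.

```python
def assign_components(record_ids, matching_pairs):
    """Label each record id with the smallest id in its connected component.

    Hypothetical helper: every id used in matching_pairs is assumed
    to be present in record_ids.
    """
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst in matching_pairs:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            # Keep the smaller id as root so that labels are deterministic.
            parent[max(root_src, root_dst)] = min(root_src, root_dst)

    return {rid: find(rid) for rid in record_ids}


records = [1, 2, 3, 4, 5]
pairs = [(1, 2), (2, 3), (4, 5)]  # rows of the form {src=id1, dst=id2}
components = assign_components(records, pairs)
print(components)
```

Records 1, 2 and 3 end up in one component because the matches are followed transitively, even though 1 and 3 were never matched directly.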
-
class
entity_resolution.clustering.incremental_record_linkage.Clustering(sqlContext, pairwise_matcher, pair_composer)¶ Class that clusters data, where each cluster is some entity. It addresses the velocity problem of big data by storing the previous steps of pairwise matching and clustering.
-
sqlContext¶ sql context of pyspark
-
pairwise_matcher¶ instance of a class that matches pairs (for example, entity_resolution.pairwise_matching.pair_range_pairwise_matcher_with_average_similarity.PairwiseMatcher)
-
pair_composer¶ instance of a class that composes the list of pairs to match (for example, entity_resolution.blocking.pair_range.PairRange)
-
-
Clustering.cluster(data, delta_data, pairwise_graph)¶ Assigns a cluster id to each record of the input dataframes
- Parameters
data – a dataframe that contains the records from previous iterations; each record should have a component attribute
delta_data – a dataframe that contains the new records
pairwise_matching_graph – a dataframe with rows as {src=id1, dst=id2} which denotes that pairs match
- Returns
dataframe with records from input dataframes with new attribute component
pairwise_matching_graph: a dataframe with updated rows as {src=id1, dst=id2, average_similarity=float}
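The incremental idea can be sketched in pure Python as follows. This is only an illustration under stated assumptions, not the package's Spark implementation: the function name `incremental_cluster` is hypothetical, existing component labels are assumed to be ids of existing records, new record ids are assumed not to collide with them, and the average_similarity bookkeeping of the returned graph is left out.

```python
def incremental_cluster(data, delta_ids, matching_pairs):
    """Extend an existing clustering with new records.

    data: dict mapping already-clustered record id -> component label
    delta_ids: ids of the new records
    matching_pairs: (src, dst) pairs found for the new records
    Hypothetical helper; returns a dict id -> component for all records.
    """
    label = dict(data)
    for rid in delta_ids:
        label[rid] = rid  # each new record starts in its own cluster

    # Union-find over component labels rather than over single records,
    # so earlier clustering results are reused instead of recomputed.
    parent = {c: c for c in set(label.values())}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst in matching_pairs:
        a, b = find(label[src]), find(label[dst])
        if a != b:
            # A match across two clusters merges them under the smaller label.
            parent[max(a, b)] = min(a, b)

    return {rid: find(c) for rid, c in label.items()}


previous = {1: 1, 2: 1, 3: 3}   # records with a component attribute
new_ids = [4, 5]                # delta records
new_pairs = [(4, 2), (4, 3)]    # record 4 matches both old clusters
updated = incremental_cluster(previous, new_ids, new_pairs)
print(updated)
```

In this run, record 4 matches a member of each previous cluster, so the two clusters are merged, while record 5 has no match and stays alone.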