entity_resolution.clustering

class entity_resolution.clustering.connected_component.Clustering(sqlContext, checkpointInterval=2)

Class that clusters data, where each cluster represents one entity. It is a wrapper over ConnectedComponents from the graphframes package for pyspark.

sqlContext: the pyspark SQL context

checkpointInterval: the checkpoint interval, that is, how often during execution the algorithm stores temporary results. Zero is not recommended; the default is 2.

Clustering.cluster(dataframes, pairwise_matching_graph)

Assigns a cluster id to each record of the input dataframes.

Parameters

dataframes – a list of pyspark dataframes containing the extracted data (treated as one dataframe)
pairwise_matching_graph – a dataframe with rows of the form {src=id1, dst=id2}, each denoting that the records id1 and id2 match

Returns

A dataframe containing the records of the input dataframes, with a new attribute component.
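The effect of cluster can be illustrated without Spark: treat each {src, dst} row as an undirected edge and give every record the label of the connected component it belongs to. The sketch below is plain Python using union-find; it is not the graphframes implementation. The function name connected_components, the sample ids, and the choice of the smallest id as the component label are assumptions made for illustration.

```python
def connected_components(record_ids, edges):
    """Map each record id to a component label (the smallest id in its component).

    Hypothetical helper: record_ids is an iterable of ids, edges is an
    iterable of (src, dst) pairs whose ids all appear in record_ids.
    """
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Walk up to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            # Keep the smaller id as the root so labels are deterministic.
            low, high = sorted((root_src, root_dst))
            parent[high] = low

    return {rid: find(rid) for rid in record_ids}


records = [1, 2, 3, 4, 5]
matches = [(1, 2), (2, 3), (4, 5)]  # rows of the form {src, dst}
components = connected_components(records, matches)
print(components)
```

Records 1, 2 and 3 end up in one component even though 1 and 3 were never matched directly, which is the transitive behaviour that connected components provides.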

class entity_resolution.clustering.incremental_record_linkage.Clustering(sqlContext, pairwise_matcher, pair_composer)

Class that clusters data, where each cluster represents one entity. It addresses the velocity problem of big data by storing the results of previous pairwise matching and clustering steps.

sqlContext: the pyspark SQL context

pairwise_matcher: an instance of a class that matches pairs (for example, entity_resolution.pairwise_matching.pair_range_pairwise_matcher_with_average_similarity.PairwiseMatcher)

pair_composer: an instance of a class that composes the list of pairs to match (for example, entity_resolution.blocking.pair_range.PairRange)

Clustering.cluster(data, delta_data, pairwise_graph)

Assigns a cluster id to each record of the input dataframes.

Parameters

data – a dataframe containing the records from previous iterations; each record should have a component attribute
delta_data – a dataframe containing the new records
pairwise_matching_graph – a dataframe with rows of the form {src=id1, dst=id2}, each denoting that the records id1 and id2 match

Returns

A dataframe containing the records of the input dataframes, with a new attribute component.
pairwise_matching_graph: a dataframe with updated rows of the form {src=id1, dst=id2, average_similarity=float}
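The incremental idea can be sketched in plain Python: records from earlier iterations keep their component, new records start as singletons, and only the matches found for the new batch are used to merge components. This is an illustration under assumptions, not the package's algorithm. The function name incremental_cluster and the dict-based layout are hypothetical, record ids are assumed unique, and the average_similarity bookkeeping of the returned graph is left out.

```python
def incremental_cluster(data, delta_ids, new_matches):
    """Hypothetical sketch of one incremental clustering step.

    data        -- {record_id: component} from previous iterations
    delta_ids   -- ids of the new records
    new_matches -- (src, dst) pairs found by pairwise matching for this batch
    Returns an updated {record_id: component} covering old and new records.
    """
    component = dict(data)
    for rid in delta_ids:
        # Each new record starts in its own component, labelled by its id.
        component[rid] = rid

    # Index the members of every component so merges only touch affected records.
    members = {}
    for rid, comp in component.items():
        members.setdefault(comp, set()).add(rid)

    for src, dst in new_matches:
        a, b = component[src], component[dst]
        if a == b:
            continue  # already in the same cluster
        keep, drop = min(a, b), max(a, b)
        # Relabel every record of the dropped component.
        for rid in members.pop(drop):
            component[rid] = keep
            members[keep].add(rid)

    return component


previous = {1: 1, 2: 1, 3: 3}           # result of earlier iterations
updated = incremental_cluster(previous, [4, 5], [(4, 2), (5, 3)])
print(updated)
```

A new record can also bridge two existing clusters: if record 4 matches both 2 and 3, the components of 2 and 3 are merged into one.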