entity_resolution.clustering

class entity_resolution.clustering.connected_component.Clustering(sqlContext, checkpointInterval=2)

Clusters data so that each cluster corresponds to one entity. This class is a wrapper over ConnectedComponents from the graphframes package for pyspark.

sqlContext

the pyspark SQL context

checkpointInterval

the number of checkpoints during execution at which the algorithm can store temporary results; zero is not recommended, default=2

Clustering.cluster(dataframes, pairwise_matching_graph)

Assigns a cluster id to each record of the input dataframes.

Parameters
  • dataframes – a list of pyspark dataframes that contain the extracted data (treated as one dataframe)

  • pairwise_matching_graph – a dataframe with rows of the form {src=id1, dst=id2}, each denoting that the records id1 and id2 match

Returns

a dataframe containing the records from the input dataframes with a new attribute, component
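
To make the meaning of the component attribute concrete, here is a minimal plain-Python sketch of connected components over match pairs. It is an illustration only and not the package's implementation, which delegates to graphframes on Spark; the function name assign_components and the choice of the smallest record id as the component id are assumptions made for this example.

```python
# Illustration only: plain Python, no Spark. The real class wraps graphframes.
def assign_components(record_ids, match_pairs):
    """Return {record_id: component_id} using union-find over match pairs."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for pair in match_pairs:
        root_src, root_dst = find(pair["src"]), find(pair["dst"])
        if root_src != root_dst:
            # Assumption: the smaller id becomes the root, so ids are deterministic.
            parent[max(root_src, root_dst)] = min(root_src, root_dst)

    return {rid: find(rid) for rid in record_ids}


records = [1, 2, 3, 4, 5]
pairs = [{"src": 1, "dst": 2}, {"src": 2, "dst": 3}]  # same shape as pairwise_matching_graph rows
components = assign_components(records, pairs)
```

Records 1, 2 and 3 are linked transitively and share one component, while records 4 and 5 have no matches and each remain in their own component.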

class entity_resolution.clustering.incremental_record_linkage.Clustering(sqlContext, pairwise_matcher, pair_composer)

Clusters data so that each cluster corresponds to one entity. This class addresses the velocity problem of big data: it stores the results of previous pairwise matching and clustering steps.

sqlContext

the pyspark SQL context

pairwise_matcher

an instance of a class that matches pairs (for example, entity_resolution.pairwise_matching.pair_range_pairwise_matcher_with_average_similarity.PairwiseMatcher)

pair_composer

an instance of a class that composes the list of pairs to match (for example, entity_resolution.blocking.pair_range.PairRange)

Clustering.cluster(data, delta_data, pairwise_graph)

Assigns a cluster id to each record of the input dataframes.

Parameters
  • data – a dataframe that contains the records from previous iterations; each record should have a component attribute

  • delta_data – a dataframe that contains the new records

  • pairwise_matching_graph – a dataframe with rows of the form {src=id1, dst=id2}, each denoting that the records id1 and id2 match

Returns

a dataframe containing the records from the input dataframes with a new attribute, component

pairwise_matching_graph: a dataframe with updated rows of the form {src=id1, dst=id2, average_similarity=float}
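
The incremental idea can be sketched in plain Python as follows. This is a hedged illustration under stated assumptions, not the package's algorithm: the function name incremental_cluster is hypothetical, new records are assumed to start as singleton components named after their own id (and those ids are assumed not to collide with existing component ids), a new record that matches two old components is assumed to merge them, and average_similarity is ignored.

```python
# Illustration only: plain Python, no Spark; the merge policy is an assumption.
def incremental_cluster(data, delta_data, match_pairs):
    """Extend existing component assignments with new records.

    data:        list of {"id", "component"} from previous iterations
    delta_data:  list of {"id"} for new records
    match_pairs: list of {"src", "dst"} edges between matching records
    """
    component = {rec["id"]: rec["component"] for rec in data}
    # Assumption: each new record starts in its own singleton component.
    for rec in delta_data:
        component[rec["id"]] = rec["id"]

    for pair in match_pairs:
        a, b = component[pair["src"]], component[pair["dst"]]
        if a != b:
            low, high = min(a, b), max(a, b)
            # Relabel every member of the higher-numbered component.
            for rid in list(component):
                if component[rid] == high:
                    component[rid] = low

    return [{"id": rid, "component": comp} for rid, comp in sorted(component.items())]


data = [{"id": 1, "component": 1}, {"id": 2, "component": 1}, {"id": 3, "component": 3}]
delta_data = [{"id": 4}, {"id": 5}]
pairs = [{"src": 4, "dst": 2}, {"src": 4, "dst": 3}]
result = incremental_cluster(data, delta_data, pairs)
```

Here the new record 4 matches records from both existing components, so under the merge assumption records 1 to 4 end up in a single component, while the unmatched new record 5 stays on its own. Only the edges involving new records need to be examined, which is what makes the incremental approach cheaper than reclustering everything.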