entity_resolution.blocking
-
class entity_resolution.blocking.basic_blocker.Blocker(comparators, dataframes)
Splits data into blocks using blocking functions.
-
comparators – a list of blocking functions
-
dataframes – a list of PySpark DataFrames containing the extracted data (treated as one DataFrame)
Example:
    fields = [
        {"name": "description", "minimum_score": "0.50"},
    ]
    comparators = [LevenshteinSimilarity(fields=fields)]
-
-
Blocker.get_blocks()
Returns the blocks of data produced by splitting the input with the comparators (blocking functions).
- Returns
a list of lists (list of blocks)
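The general idea of blocking can be sketched in plain Python, independently of this package. The sketch below is a simplification: it groups records by an exact key rather than by a similarity score such as the one LevenshteinSimilarity computes, and it is not the implementation used by Blocker. The names `block_by_key` and `key_func`, as well as the sample records, are hypothetical.

```python
from collections import defaultdict


def block_by_key(records, key_func):
    """Group records into blocks; records with the same key share a block.

    Hypothetical helper for illustration only, not part of entity_resolution.
    """
    blocks = defaultdict(list)
    for record in records:
        blocks[key_func(record)].append(record)
    # Return a list of lists (a list of blocks), in order of first appearance.
    return list(blocks.values())


# Made-up sample data: the first three characters of the description
# serve as a crude blocking key.
records = [
    {"id": 1, "description": "red apple"},
    {"id": 2, "description": "red apples"},
    {"id": 3, "description": "green pear"},
]
blocks = block_by_key(records, lambda r: r["description"][:3])
```

Only records within the same block need to be compared later, which is what makes blocking reduce the number of comparisons.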
-
class entity_resolution.blocking.pair_range.PairRange
Prepares pairs of IDs to compare, which helps balance the load across cluster nodes.
-
PairRange.get(blocks)
Returns triplets [<id_of_block, id1_of_data, id2_of_data>, …].
- Parameters
blocks – a list of blocks of data
- Returns
a list of triplets indicating which rows should be compared with each other
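As a rough illustration of the triplet format, the sketch below enumerates every pair of IDs within each block and then splits the resulting list into near-equal parts, one per worker. How PairRange actually enumerates and distributes pairs is not described here, so this is an assumption-laden sketch rather than the package's method. The helpers `pairs_for_blocks` and `split_evenly` are hypothetical, and each block is assumed to be a list of row IDs.

```python
from itertools import combinations


def pairs_for_blocks(blocks):
    """Build (block_id, id1, id2) triplets for all pairs within each block.

    Hypothetical helper; assumes each block is a list of row IDs.
    """
    triplets = []
    for block_id, ids in enumerate(blocks):
        for id1, id2 in combinations(ids, 2):
            triplets.append((block_id, id1, id2))
    return triplets


def split_evenly(items, n_parts):
    """Split items into n_parts contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


triplets = pairs_for_blocks([[10, 11, 12], [20, 21]])
# triplets == [(0, 10, 11), (0, 10, 12), (0, 11, 12), (1, 20, 21)]
chunks = split_evenly(triplets, 2)
```

Splitting by pairs rather than by blocks means one very large block does not leave a single node with most of the comparisons.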