entity_resolution.blocking

class entity_resolution.blocking.basic_blocker.Blocker(comparators, dataframes)

Splits data into blocks using blocking functions (comparators).

comparators: a list of comparators (blocking functions) used to split the data

dataframes: a list of PySpark DataFrames containing the extracted data; the list is treated as a single DataFrame

Example:

fields = [
    {
        "name": "description",
        "minimum_score": "0.50"
    },
]
comparators=[LevenshteinSimilarity(fields=fields)]

Blocker.get_blocks()

Returns the blocks of data produced by splitting the input with the comparators (blocking functions).
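To illustrate the general idea of blocking, here is a minimal sketch in plain Python. It is not this library's implementation: the real Blocker works on PySpark DataFrames and uses comparators, whereas the hypothetical helper `block_by_key` below simply groups records that share a key, here assumed to be the first letter of the description.

```python
from collections import defaultdict


def block_by_key(records, key_func):
    """Hypothetical helper: group records that share the same blocking key."""
    groups = defaultdict(list)
    for record in records:
        groups[key_func(record)].append(record)
    # Each group is one block; only records inside a block get compared later.
    return list(groups.values())


# Illustrative data, not taken from the library.
records = [
    {"id": 1, "description": "red apple"},
    {"id": 2, "description": "red aple"},
    {"id": 3, "description": "blue car"},
]

# Assumed blocking key: the first character of the description.
blocks = block_by_key(records, lambda r: r["description"][0])
```

Records 1 and 2 share a block and record 3 sits alone, so only one pair has to be compared in detail instead of three.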

class entity_resolution.blocking.pair_range.PairRange

Prepares the pairs of ids to compare, which helps even out the load across cluster nodes.

PairRange.get(blocks)

Returns triplets of the form [<id_of_block, id1_of_data, id2_of_data>, …].

Parameters: blocks – a list of blocks of data
Returns: a list of triplets indicating which rows should be compared with each other
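The shape of this output can be sketched in plain Python. This is an illustration under stated assumptions, not the PairRange implementation: each block is assumed to be a list of row ids, the block id is assumed to be the block's position in the list, and the helpers `pairs_from_blocks` and `split_evenly` are hypothetical names.

```python
from itertools import combinations


def pairs_from_blocks(blocks):
    """Hypothetical helper: enumerate (block_id, id1, id2) triplets."""
    triplets = []
    for block_id, row_ids in enumerate(blocks):
        # Every unordered pair inside a block is a candidate comparison.
        for id1, id2 in combinations(row_ids, 2):
            triplets.append((block_id, id1, id2))
    return triplets


def split_evenly(triplets, n_parts):
    """Hypothetical helper: cut the list into n_parts contiguous ranges
    whose sizes differ by at most one."""
    size, extra = divmod(len(triplets), n_parts)
    ranges, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        ranges.append(triplets[start:end])
        start = end
    return ranges


# Illustrative blocks of row ids: one block of three rows, one of two.
blocks = [[1, 2, 3], [4, 5]]
triplets = pairs_from_blocks(blocks)
parts = split_evenly(triplets, 2)
```

Splitting by pairs rather than by blocks means a large block does not end up on a single worker: in this example the first block contributes three of the four pairs, yet each of the two ranges holds two comparisons.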