entity_resolution.blocking

class entity_resolution.blocking.basic_blocker.Blocker(comparators, dataframes)

Splits data into blocks according to one or more blocking functions.

comparators

a list of blocking functions

dataframes

a list of PySpark dataframes containing the extracted data (treated as a single dataframe)

Example:

fields = [
    {
        "name": "description",
        "minimum_score": "0.50"
    },
]
comparators=[LevenshteinSimilarity(fields=fields)]
Blocker.get_blocks()

Returns the blocks of data produced by splitting with the comparators (blocking functions).

Returns

a list of lists (list of blocks)
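
To illustrate the general idea of blocking, here is a minimal plain-Python sketch. It is not the library's implementation: `block_by_key` is a hypothetical helper, and the key function merely stands in for a blocking function. The real `Blocker` operates on PySpark dataframes and comparator objects, so treat this only as a picture of the "list of lists" return shape.

```python
from collections import defaultdict


def block_by_key(records, key_func):
    """Group records that share a blocking key (hypothetical helper)."""
    groups = defaultdict(list)
    for record in records:
        groups[key_func(record)].append(record)
    # Return a list of lists, mirroring the documented return shape.
    return list(groups.values())


records = [
    {"id": 1, "description": "red apple"},
    {"id": 2, "description": "red apples"},
    {"id": 3, "description": "blue car"},
]

# Assumed blocking function: the first word of the description.
blocks = block_by_key(records, lambda r: r["description"].split()[0])
```

Only records that land in the same block need to be compared later, which is what keeps the number of comparisons manageable.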

class entity_resolution.blocking.pair_range.PairRange

Prepares pairs of IDs to compare, which helps balance the load across cluster nodes.

PairRange.get(blocks)

Returns triplets [<id_of_block, id1_of_data, id2_of_data>, …].

Parameters

blocks – a list of blocks of data

Returns

a list of triplets indicating which rows should be compared with each other
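
The triplet format can be pictured with a small plain-Python sketch. This is an illustration under stated assumptions, not the actual `PairRange` code: it assumes each block is a list of row IDs, and both `enumerate_pairs` and `split_evenly` are hypothetical helper names. Flattening the comparisons into a single list of triplets is what makes it possible to cut the work into equally sized ranges, regardless of how uneven the block sizes are.

```python
from itertools import combinations


def enumerate_pairs(blocks):
    """Return (block_id, id1, id2) for every unordered pair within each block."""
    triplets = []
    for block_id, block in enumerate(blocks):
        for id1, id2 in combinations(block, 2):
            triplets.append((block_id, id1, id2))
    return triplets


def split_evenly(triplets, n_parts):
    """Deal triplets round-robin into n_parts lists of near-equal size."""
    return [triplets[i::n_parts] for i in range(n_parts)]


# Assumed input: blocks given as lists of row IDs.
blocks = [[10, 11, 12], [20, 21]]
triplets = enumerate_pairs(blocks)
parts = split_evenly(triplets, 2)
```

With these inputs, the large block contributes three pairs and the small block one, yet each of the two parts receives two comparisons.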