class freediscovery.near_duplicates.IMatchNearDuplicates(n_rand_lexicons=1, rand_lexicon_ratio=0.7)[source]

Near duplicates detection using the randomized I-Match algorithm.

Classical near-duplicate detection compares all pairs of samples in the collection. For a collection of size N, this is typically an O(N^2) operation. The I-Match algorithm retrieves near duplicates with a computational effort reduced to O(N) (or O(N*log(N)) in the worst case).
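
The idea behind the linear cost can be illustrated with a minimal, standard-library sketch. It is not the code of this class: the function names, the whitespace tokenizer, and the hand-picked lexicon are assumptions made for the example. Each document is reduced to the set of its terms that appear in a lexicon, that set is hashed, and documents sharing a hash are reported together, so no pairwise comparison is needed.

```python
import hashlib
from collections import defaultdict


def imatch_signature(tokens, lexicon):
    """Hash the set of a document's tokens that belong to the lexicon."""
    kept = sorted(set(tokens) & lexicon)
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def imatch_groups(documents, lexicon):
    """Group document indices by signature in a single pass."""
    buckets = defaultdict(list)
    for idx, text in enumerate(documents):
        buckets[imatch_signature(text.lower().split(), lexicon)].append(idx)
    # Only buckets holding more than one document are near-duplicate groups.
    return [ids for ids in buckets.values() if len(ids) > 1]


docs = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown fox jumps over a lazy dog",  # differs only in filler words
    "an unrelated sentence about sparse matrices",
]
# Hypothetical lexicon, hand-picked for this example. In practice it is
# usually derived from collection statistics (e.g. by filtering out very
# common and very rare terms).
lexicon = {"quick", "brown", "fox", "jumps", "lazy", "dog", "sparse", "matrices"}

groups = imatch_groups(docs, lexicon)
print(groups)
```

The first two documents keep the same lexicon terms and therefore land in the same bucket, while the third does not.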

This class exposes a scikit-learn compatible API and currently supports only sparse CSR arrays (such as those obtained by vectorizing text documents).

Parameters:
  • n_rand_lexicons (int) – number of random lexicons used for duplicate detection. If equal to 1, no lexicon randomization is used, which is equivalent to the original I-Match implementation by Chowdhury et al. (2002).
  • rand_lexicon_ratio (float) – fraction of the vocabulary used in each random lexicon.
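
How the two parameters might interact can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `make_lexicons`, `near_duplicate_groups`, the `seed` argument, and the rule "each random lexicon is a uniform sample of the vocabulary" are all hypothetical choices made for the example. Documents are merged when their signatures agree under at least one lexicon.

```python
import hashlib
import random


def make_lexicons(vocabulary, n_rand_lexicons=1, rand_lexicon_ratio=0.7, seed=0):
    """Return the lexicons to hash against (assumed sampling scheme)."""
    vocab = sorted(vocabulary)
    if n_rand_lexicons == 1:
        return [set(vocab)]  # no randomization: a single, full lexicon
    rng = random.Random(seed)
    size = max(1, round(rand_lexicon_ratio * len(vocab)))
    return [set(rng.sample(vocab, size)) for _ in range(n_rand_lexicons)]


def signature(tokens, lexicon):
    """Hash the set of tokens that belong to the lexicon."""
    kept = sorted(set(tokens) & lexicon)
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def near_duplicate_groups(documents, lexicons):
    """Merge documents whose signatures agree under at least one lexicon."""
    parent = list(range(len(documents)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for lexicon in lexicons:
        first_seen = {}
        for idx, text in enumerate(documents):
            sig = signature(text.lower().split(), lexicon)
            if sig in first_seen:
                parent[find(idx)] = find(first_seen[sig])
            else:
                first_seen[sig] = idx

    groups = {}
    for idx in range(len(documents)):
        groups.setdefault(find(idx), []).append(idx)
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical 10-term vocabulary and toy documents.
vocabulary = {"quick", "brown", "fox", "jumps", "lazy", "dog",
              "sparse", "matrices", "unrelated", "sentence"}
docs = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown fox jumps over a lazy dog",
    "an unrelated sentence about sparse matrices",
]

lexicons = make_lexicons(vocabulary, n_rand_lexicons=3, rand_lexicon_ratio=0.7)
groups = near_duplicate_groups(docs, lexicons)
print(groups)
```

The motivation for using several lexicons is robustness: two documents that differ in a single lexicon term never match under the full lexicon, but they can still match under any random lexicon that happens to omit that term.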


fit(X, y=None)[source]
Parameters: X (array_like or sparse (CSR) matrix, shape (n_samples, n_features)) – List of n_features-dimensional data points. Each row corresponds to a single data point.
Returns: self – Returns self.
Return type: object

get_params(deep=True)
Get parameters for this estimator.

Parameters: deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any

set_params(**params)
Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Return type: self
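
The `<component>__<parameter>` convention can be illustrated with a toy class. This is a simplified sketch and not scikit-learn's implementation: the `Toy` class and its `ratio` and `inner` parameters are hypothetical, and parameter-name validation is omitted.

```python
class Toy:
    """Toy estimator following the get_params/set_params convention."""

    def __init__(self, ratio=0.7, inner=None):
        self.ratio = ratio
        self.inner = inner

    def get_params(self, deep=True):
        params = {"ratio": self.ratio, "inner": self.inner}
        # With deep=True, parameters of nested estimators are exposed
        # under "<component>__<parameter>" keys.
        if deep and hasattr(self.inner, "get_params"):
            for key, value in self.inner.get_params(deep=True).items():
                params["inner__" + key] = value
        return params

    def set_params(self, **params):
        for key, value in params.items():
            name, sep, sub_key = key.partition("__")
            if sep:
                # Delegate "<component>__<parameter>" to the nested object.
                getattr(self, name).set_params(**{sub_key: value})
            else:
                setattr(self, name, value)
        return self


outer = Toy(ratio=0.5, inner=Toy(ratio=0.9))
returned = outer.set_params(ratio=0.6, inner__ratio=0.8)
print(sorted(outer.get_params(deep=True)))
```

Because `set_params` returns the estimator itself, calls can be chained, and the keys reported by `get_params(deep=True)` are exactly the names that `set_params` accepts.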