# freediscovery.near_duplicates.IMatchNearDuplicates

class freediscovery.near_duplicates.IMatchNearDuplicates(n_rand_lexicons=1, rand_lexicon_ratio=0.7)

Near duplicates detection using the randomized I-Match algorithm.

A classical near-duplicate detection approach compares all pairs of samples in the collection. For a collection of size N, this is typically an O(N^2) operation. The I-Match algorithm retrieves near duplicates with a computational effort reduced to O(N) (or O(N*log(N)) in the worst case).
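The idea behind the reduced cost can be illustrated with a minimal, standard-library-only sketch of single-lexicon I-Match. It is not this class's implementation: the helper names (`build_lexicon`, `imatch_signature`, `imatch_groups`), the IDF thresholds, and the tokenized toy documents are assumptions made for illustration. Each document is reduced to the set of its terms that belong to a lexicon, that set is hashed, and documents with equal hashes are grouped in a single pass, so no pairwise comparison is needed.

```python
import hashlib
import math
from collections import Counter, defaultdict


def build_lexicon(docs, min_idf, max_idf):
    """Keep terms whose inverse document frequency lies in [min_idf, max_idf]."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term for term, df in doc_freq.items()
            if min_idf <= math.log(n_docs / df) <= max_idf}


def imatch_signature(tokens, lexicon):
    """SHA-1 of the sorted unique tokens that belong to the lexicon."""
    kept = sorted(set(tokens) & lexicon)
    if not kept:
        return None  # no lexicon term present: the document gets no signature
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def imatch_groups(docs, lexicon):
    """Group document indices by signature in one pass over the collection."""
    by_signature = defaultdict(list)
    for idx, tokens in enumerate(docs):
        sig = imatch_signature(tokens, lexicon)
        if sig is not None:
            by_signature[sig].append(idx)
    return [ids for ids in by_signature.values() if len(ids) > 1]


# Toy collection (hypothetical data): documents 0 and 1 differ only in
# very common words, which the IDF filter removes from the lexicon.
docs = [
    "the quick brown fox jumps over the lazy dog".split(),
    "a quick brown fox jumps over a lazy dog".split(),
    "the stock market fell sharply on the news".split(),
    "completely unrelated text about the cooking pasta".split(),
]
lexicon = build_lexicon(docs, min_idf=0.5, max_idf=1.0)
groups = imatch_groups(docs, lexicon)
print(groups)  # [[0, 1]]
```

Because the grouping relies on dictionary lookups keyed by the hash, the work grows linearly with the number of documents; a sort-based grouping would give the O(N*log(N)) bound instead.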

This class exposes a scikit-learn compatible API and currently supports only sparse CSR arrays (such as those obtained after vectorizing text documents).

Parameters:

- n_rand_lexicons – number of random lexicons used for duplicate detection (default 1). If equal to 1, no lexicon randomization is used, which is equivalent to the original I-Match implementation by Chowdhury et al. (2002).
- rand_lexicon_ratio – fraction of the vocabulary used in random lexicons (default 0.7).
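How these two parameters could interact is sketched below using only the standard library. This is an illustration under stated assumptions, not the code of this class: the helpers `random_lexicons` and `near_duplicate_pairs`, the `seed` argument, and the choice to keep the full vocabulary as the first lexicon are all hypothetical. Two documents are treated as near duplicates when they share a signature under at least one lexicon, so dropping a random part of the vocabulary lets documents that differ by a few terms still match.

```python
import hashlib
import random


def signature(tokens, lexicon):
    """SHA-1 of the sorted unique tokens that belong to the lexicon."""
    kept = sorted(set(tokens) & lexicon)
    if not kept:
        return None
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def random_lexicons(vocabulary, n_rand_lexicons, rand_lexicon_ratio, seed=0):
    """Assumed scheme: the full vocabulary plus (n_rand_lexicons - 1) random subsets."""
    rng = random.Random(seed)
    vocab = sorted(vocabulary)  # sorted for reproducible sampling
    size = max(1, int(rand_lexicon_ratio * len(vocab)))
    lexicons = [set(vocab)]
    for _ in range(n_rand_lexicons - 1):
        lexicons.append(set(rng.sample(vocab, size)))
    return lexicons


def near_duplicate_pairs(docs, lexicons):
    """Index pairs that share a signature under at least one lexicon."""
    pairs = set()
    for lexicon in lexicons:
        seen = {}
        for idx, tokens in enumerate(docs):
            sig = signature(tokens, lexicon)
            if sig is None:
                continue
            for other in seen.setdefault(sig, []):
                pairs.add((other, idx))
            seen[sig].append(idx)
    return pairs


# Toy collection (hypothetical data): documents 0 and 1 differ by one word.
docs = [
    "quick brown fox jumps over lazy dog".split(),
    "quick brown fox leaps over lazy dog".split(),
    "stock market fell sharply today".split(),
]
vocabulary = set().union(*docs)

# A single lexicon (no randomization) misses the pair.
exact = near_duplicate_pairs(docs, random_lexicons(vocabulary, 1, 0.7))

# A lexicon that happens to omit the differing words finds it.
relaxed = near_duplicate_pairs(docs, [vocabulary - {"jumps", "leaps"}])

# With several random lexicons, the pair is found whenever at least one
# sampled lexicon omits both differing words (this depends on the seed).
randomized = near_duplicate_pairs(docs, random_lexicons(vocabulary, 10, 0.7))
```

In this sketch a larger `n_rand_lexicons` raises the chance that some lexicon omits the differing terms, at the cost of one extra pass over the collection per lexicon.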

References

 [Chowdhury2002] Chowdhury, A., Frieder, O., Grossman, D., & McCabe, M. C. (2002). Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems (TOIS), 20(2), 171-191.
 [Kolcz2008] Kołcz, A., & Chowdhury, A. (2008). Lexicon randomization for near-duplicate detection with I-Match. The Journal of Supercomputing, 45(3), 255-276.
fit(X, y=None)

Parameters: X (array_like or sparse (CSR) matrix, shape (n_samples, n_features)) – List of n_features-dimensional data points. Each row corresponds to a single data point.

Returns: self (object) – Returns self.
get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns: params (mapping of string to any) – Parameter names mapped to their values.
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns: self
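
The `<component>__<parameter>` convention used by `get_params` and `set_params` can be illustrated with a small stand-in class. This is a simplified sketch, not scikit-learn's or this library's implementation: `Component` and the `wrapper`/`detector` objects are hypothetical, and only the parameter names `n_rand_lexicons` and `rand_lexicon_ratio` come from the signature documented above.

```python
class Component:
    """Hypothetical stand-in for an estimator with get_params/set_params."""

    def __init__(self, **params):
        self._param_names = list(params)
        for name, value in params.items():
            setattr(self, name, value)

    def get_params(self, deep=True):
        out = {}
        for name in self._param_names:
            value = getattr(self, name)
            out[name] = value
            # With deep=True, expose parameters of nested components
            # under "<component>__<parameter>" keys.
            if deep and hasattr(value, "get_params"):
                for sub_name, sub_value in value.get_params(deep=True).items():
                    out[f"{name}__{sub_name}"] = sub_value
        return out

    def set_params(self, **params):
        for key, value in params.items():
            # Split "component__parameter" at the first double underscore.
            name, sep, sub_key = key.partition("__")
            if name not in self._param_names:
                raise ValueError(f"Invalid parameter {name!r}")
            if sep:
                getattr(self, name).set_params(**{sub_key: value})
            else:
                setattr(self, name, value)
        return self  # returning self allows chained calls


detector = Component(n_rand_lexicons=1, rand_lexicon_ratio=0.7)
wrapper = Component(detector=detector, verbose=False)

# Update a nested parameter and a top-level one in a single call.
wrapper.set_params(detector__n_rand_lexicons=5, verbose=True)
print(detector.n_rand_lexicons)  # 5
print(sorted(wrapper.get_params(deep=False)))  # ['detector', 'verbose']
```

Returning `self` from `set_params` is what makes calls such as `estimator.set_params(...).fit(X)` possible.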