# freediscovery.near_duplicates.SimhashNearDuplicates

class freediscovery.near_duplicates.SimhashNearDuplicates(hash_func='murmurhash3_int_u32', hash_func_nbytes=32)

Near-duplicate detection using the simhash algorithm.

Classical near-duplicate detection compares all pairs of samples in the collection. For a collection of size N, this is typically an O(N^2) operation. The simhash algorithm retrieves near duplicates with significantly better computational scaling.
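To make the idea concrete, here is a minimal pure-Python sketch of a simhash-style fingerprint in the spirit of [Charikar2002]. It is not the implementation used by this class or by simhash-py: the helper names (`token_hash`, `simhash`, `hamming_distance`) are hypothetical, and MD5 is only a dependency-free stand-in for the hash function named in the signature above.

```python
import hashlib


def token_hash(token, nbits=64):
    """Hash a token to an nbits-wide integer (MD5 is an assumed stand-in)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: nbits // 8], "big")


def simhash(tokens, nbits=64):
    """Build a fingerprint by a per-bit majority vote over the token hashes."""
    counts = [0] * nbits
    for token in tokens:
        h = token_hash(token, nbits)
        for bit in range(nbits):
            # A set bit votes +1 for that position, a cleared bit votes -1.
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "the quick brown fox jumped over the lazy dog".split()
doc_c = "simhash maps similar documents to similar fingerprints".split()

# Documents sharing most tokens tend to get fingerprints that differ in few bits.
print(hamming_distance(simhash(doc_a), simhash(doc_b)))
print(hamming_distance(simhash(doc_a), simhash(doc_c)))
```

Because each document is reduced to one fixed-width integer, candidate near duplicates can be found by comparing fingerprints rather than full documents.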

Note

This estimator requires the simhash-py Python package (https://github.com/seomoz/simhash-py) to be installed.

Parameters:

hash_func (str or function, default='murmurhash3_int_u32') – the hashing function used to hash documents. Possible values are “murmurhash3_int_u32” or a custom function.

hash_func_nbytes (int, default=32) – expected size of the hash produced by hash_func.

References

[Charikar2002] Charikar, M. S. (2002, May). Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing (pp. 380-388). ACM.

fit(X, y=None)

Parameters: X ({array, sparse matrix}, shape (n_samples, n_features)) – List of n_features-dimensional data points. Each row corresponds to a single data point.

Returns: self (object) – Returns self.

get_index_by_hash(shash)

Get document index by hash.

Parameters: shash (uint64) – a simhash value.

Returns: index (int) – a document index.

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns: params (mapping of string to any) – Parameter names mapped to their values.

query(distance=2, blocks='auto')

Find all the nearest neighbours for the dataset.

Parameters:

distance (int, default=2) – maximum number of differing bits in the simhash.

blocks (int or 'auto', default='auto') – number of blocks into which the simhash is split when searching for duplicates; see https://github.com/seomoz/simhash-py.

Returns:

simhash (array) – the simhash value for all documents in the collection.

cluster_id (array) – exact duplicates (documents with the same simhash) are grouped by cluster_id.

dup_pairs (list) – list of tuples for the near-duplicates.
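The roles of the distance and blocks parameters can be pictured with a small pure-Python sketch. It is not how simhash-py implements the search: all function names here are hypothetical, the labelling scheme for exact duplicates is an assumption, and the sketch reports exact duplicates as pairs too. It relies on the pigeonhole principle: if two fingerprints differ in at most `distance` bits and are split into more than `distance` blocks, at least one block is identical in both, so only fingerprints sharing a block need to be compared.

```python
from collections import defaultdict
from itertools import combinations


def hamming_distance(a, b):
    """Number of bit positions in which two integers differ."""
    return bin(a ^ b).count("1")


def split_blocks(fingerprint, n_blocks, nbits=64):
    """Split an nbits-wide fingerprint into n_blocks contiguous bit blocks."""
    width = -(-nbits // n_blocks)  # ceiling division
    mask = (1 << width) - 1
    return [(fingerprint >> (i * width)) & mask for i in range(n_blocks)]


def near_duplicate_pairs(fingerprints, distance=2, n_blocks=None, nbits=64):
    """Return index pairs (i, j), i < j, whose fingerprints differ in <= distance bits."""
    if n_blocks is None:
        n_blocks = distance + 1  # smallest block count the pigeonhole argument allows
    if n_blocks <= distance:
        raise ValueError("n_blocks must be greater than distance")
    # Bucket document indices by (block position, block value).
    buckets = defaultdict(list)
    for idx, fp in enumerate(fingerprints):
        for pos, block in enumerate(split_blocks(fp, n_blocks, nbits)):
            buckets[(pos, block)].append(idx)
    # Only documents sharing at least one block are candidates; verify each one.
    pairs = set()
    for members in buckets.values():
        for i, j in combinations(members, 2):
            if hamming_distance(fingerprints[i], fingerprints[j]) <= distance:
                pairs.add((i, j))
    return sorted(pairs)


def exact_duplicate_labels(fingerprints):
    """Give documents with identical fingerprints the same integer label."""
    labels = {}
    return [labels.setdefault(fp, len(labels)) for fp in fingerprints]


fingerprints = [0b0, 0b11, 0b111, (1 << 63) | 0b1, 0xFFFF, 0b11]
print(exact_duplicate_labels(fingerprints))
# [0, 1, 2, 3, 4, 1]
print(near_duplicate_pairs(fingerprints, distance=2))
# [(0, 1), (0, 3), (0, 5), (1, 2), (1, 3), (1, 5), (2, 5), (3, 5)]
```

Using more blocks makes each bucket key shorter, so buckets grow and more candidate pairs need verification; the set of verified pairs stays the same.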
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns: self