Classical near-duplicate detection involves comparing all pairs of
samples in the collection. For a collection of size N, this is
typically an O(N^2) operation. The I-Match algorithm retrieves
near duplicates with the computational cost reduced to
O(N) (or O(N*log(N)) in the worst-case scenario).
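
To illustrate why a single pass is enough, here is a minimal sketch of the
I-Match idea, not this class's implementation: each document is reduced to the
set of its terms that belong to a lexicon, that set is hashed once, and
documents with equal hashes are reported together. The function name
imatch_groups, the whitespace tokenizer, and the hand-picked lexicon are
assumptions made for the example; in practice the lexicon is usually derived
from collection statistics.

```python
import hashlib
from collections import defaultdict


def imatch_groups(docs, lexicon):
    """Group document indices that share the same I-Match style signature.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for idx, text in enumerate(docs):
        # Keep only the unique tokens that belong to the lexicon.
        tokens = sorted(set(text.lower().split()) & lexicon)
        # One hash per document: equal filtered token sets give equal hashes.
        signature = hashlib.sha1(" ".join(tokens).encode("utf-8")).hexdigest()
        groups[signature].append(idx)
    # Only buckets with more than one document are duplicate groups.
    return [ids for ids in groups.values() if len(ids) > 1]


docs = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown fox jumps over a lazy dog",
    "an unrelated sentence about sparse matrices",
]
# Hand-picked lexicon (an assumption); stop words are deliberately left out.
lexicon = {"quick", "brown", "fox", "jumps", "lazy", "dog", "sparse", "matrices"}
duplicate_groups = imatch_groups(docs, lexicon)
print(duplicate_groups)  # [[0, 1]]
```

Each document is visited once and grouped through a dictionary lookup, so no
pairwise comparison is needed. Sorting the signatures instead of hashing them
into a dictionary would give the O(N*log(N)) variant.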

This class exposes a scikit-learn compatible API and currently supports
only sparse CSR arrays (such as those obtained after vectorizing text documents).

Parameters:

n_rand_lexicons (-) – number of random lexicons used for duplicate detection.
If equal to 1, no lexicon randomization is used, which is equivalent
to the original I-Match implementation by Chowdhury et al. (2002).

rand_lexicon_ratio (-) – fraction of the vocabulary used in random lexicons.
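
The interplay of the two parameters can be sketched as follows. This is a
rough illustration under stated assumptions, not the class's actual code: the
helpers signature, make_lexicons and candidate_pairs are hypothetical, the
first lexicon is assumed to be the full vocabulary, and two documents are
treated as candidate near duplicates if their signatures match under at least
one lexicon.

```python
import hashlib
import random
from collections import defaultdict


def signature(text, lexicon):
    """Hash of the document's unique in-lexicon tokens, or None if there are none."""
    tokens = sorted(set(text.lower().split()) & lexicon)
    if not tokens:
        return None  # avoid grouping documents that are empty after filtering
    return hashlib.sha1(" ".join(tokens).encode("utf-8")).hexdigest()


def make_lexicons(vocabulary, n_rand_lexicons, rand_lexicon_ratio, seed=0):
    """Full vocabulary first (assumption), then n_rand_lexicons - 1 random subsets."""
    rng = random.Random(seed)
    vocab = sorted(vocabulary)
    size = int(rand_lexicon_ratio * len(vocab))
    lexicons = [set(vocab)]
    for _ in range(n_rand_lexicons - 1):
        lexicons.append(set(rng.sample(vocab, size)))
    return lexicons


def candidate_pairs(docs, lexicons):
    """Pairs (i, j), i < j, whose signatures match under at least one lexicon."""
    pairs = set()
    for lexicon in lexicons:
        buckets = defaultdict(list)
        for idx, text in enumerate(docs):
            sig = signature(text, lexicon)
            if sig is not None:
                buckets[sig].append(idx)
        for ids in buckets.values():
            pairs.update((i, j) for i in ids for j in ids if i < j)
    return pairs


docs = [
    "alpha beta gamma delta",
    "alpha beta gamma delta epsilon",  # differs from the first by one word
    "zeta eta theta",
]
vocabulary = {tok for text in docs for tok in text.split()}

# n_rand_lexicons=1: only the full vocabulary is used, so the one-word
# difference changes the signature and no pair is reported.
exact_pairs = candidate_pairs(docs, make_lexicons(vocabulary, 1, 0.5))

# n_rand_lexicons=10: a random lexicon that happens to omit "epsilon"
# makes the first two documents collide.
lexicons = make_lexicons(vocabulary, n_rand_lexicons=10, rand_lexicon_ratio=0.5)
pairs = candidate_pairs(docs, lexicons)
```

Whether the pair (0, 1) is actually found in the second call depends on the
random draw, which is why more lexicons generally improve recall at the cost
of one extra pass over the collection per lexicon.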

The method works on simple estimators as well as on nested objects
(such as pipelines). The latter have parameters of the form
<component>__<parameter> so that it’s possible to update each
component of a nested object.
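
As an illustration of the <component>__<parameter> form, assuming the
paragraph above describes the standard scikit-learn set_params method, the
sketch below updates a plain estimator and then two steps of a pipeline. The
step names "vect" and "clf" and the chosen estimators are arbitrary choices
for the example and are unrelated to this class.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Simple estimator: parameters are addressed by their plain names.
clf = LogisticRegression()
clf.set_params(C=0.5)

# Nested object: each parameter is prefixed with the step name and "__".
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression()),
])
pipe.set_params(vect__lowercase=False, clf__C=10.0)

print(pipe.named_steps["vect"].lowercase)  # False
print(pipe.named_steps["clf"].C)           # 10.0
```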