Finding near-duplicates with Jaccard similarity and MinHash
Jaccard similarity and MinHash can be used to find near-duplicates in document collections efficiently. Comparing compact signatures of randomly sampled elements lets approximate duplicates be grouped without comparing every pair, which scales well to large datasets. Adjusting the similarity threshold helps detect fuzzier duplicates.
This article discusses the use of Jaccard similarity and MinHash for identifying near-duplicates in a large collection of documents. Jaccard similarity measures the overlap between two sets, ranging from 0 (completely dissimilar) to 1 (identical). MinHash signatures approximate Jaccard similarity efficiently by comparing randomly sampled elements from the sets, and using multiple hash functions improves the accuracy of the approximation. By grouping documents on these signatures, approximate duplicates can be identified without comparing every pair, so documents with high similarity are found without exhaustive comparisons. The article also discusses strategies for detecting fuzzier duplicates by adjusting the similarity threshold. Overall, the approach offers a practical and scalable solution for identifying near-duplicates in large document collections.
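The article's own code isn't reproduced here, but the core idea fits in a few lines of Python. This is a minimal sketch: the shingle size, the number of hash functions, and the use of Python's built-in hash as a salted hash family are illustrative choices, not the article's.

    import random

    def shingles(text, k=3):
        # Break a document into a set of k-word shingles.
        words = text.split()
        return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}

    def jaccard(a, b):
        # Exact Jaccard similarity: |intersection| / |union|.
        return len(a & b) / len(a | b) if (a or b) else 1.0

    def minhash_signature(items, num_hashes=128, seed=0):
        # One signature entry per hash function: the minimum hash value over the set.
        rng = random.Random(seed)
        salts = [rng.getrandbits(32) for _ in range(num_hashes)]
        return [min(hash((salt, x)) for x in items) for salt in salts]

    def estimated_jaccard(sig_a, sig_b):
        # The fraction of matching signature positions estimates the Jaccard similarity.
        return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

    a = shingles("the quick brown fox jumps over the lazy dog")
    b = shingles("the quick brown fox jumped over the lazy dog")
    print(jaccard(a, b), estimated_jaccard(minhash_signature(a), minhash_signature(b)))

Grouping documents whose signatures (or bands of signature positions) collide is what avoids the all-pairs comparison.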
Related
Group Actions and Hashing Unordered Multisets
Group actions are used to analyze hash functions for unordered sets and multisets, ensuring order-agnostic hashing. By leveraging group theory, particularly abelian groups, hash functions' structure is explored, emphasizing efficient and order-independent hashing techniques.
New JavaScript Set Methods
New JavaScript Set methods introduced in major browsers like Firefox 127 offer efficient set operations without polyfills. These methods simplify tasks like finding intersections, unions, and subsets, enhancing working with unique collections.
A new method of recording and searching information (1953)
Fermat's Library explores Hans Peter Luhn's method for organizing information using descriptive metadata called "legends." Luhn's system enhances search accuracy by linking terms and normalizing language, improving information retrieval efficiency.
The question of what's fair illuminates the question of what's hard
Computational complexity theorists repurpose fairness tools to analyze complex problems. Transitioning from multiaccuracy to multicalibration enhances understanding and simplifies approximating hard functions, benefiting algorithmic fairness and complexity theory.
Benchmarking Perfect Hashing in C++
Benchmarking perfect hashing functions in C++ using clang++-19 and g++-13 reveals mph as the fastest with limitations. Various hash function implementations are compared for lookup time, build time, and size, aiding system optimization.
The only complication is that you then need to choose a suitable T-Norm / T-Conorm pair, which expresses the concepts of intersection and union for fuzzy sets, and there's an infinite family of them. But that's a good thing, since you get to pick the pair with your desired semantics.
I wrote about this ([0][1]) in the context of validating medical image segmentations when both the segmentation and ground truth are probabilistic/fuzzy rather than binary masks.
Otherwise, what most people do instead is simply threshold at 0.5 to obtain binary sets for use with the binary variants of the Jaccard / Dice coefficients, which apparently decreases the precision of your validation operator by two orders of magnitude. It's like publishing your algorithm claiming it's better than SOTA by 0.001 while ignoring the fact that your validation operator has an error margin of 0.1 ...
[0] https://link.springer.com/chapter/10.1007/978-3-319-46723-8_...
[1] https://ora.ox.ac.uk/objects/uuid:dc352697-c804-4257-8aec-08...
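As a concrete illustration of one such pair (my own sketch, not the code from those papers): represent each fuzzy set as a map from element to membership degree in [0, 1], use min as the T-norm for intersection and max as the T-conorm for union, and take the ratio of their sums.

    def fuzzy_jaccard(a, b):
        # a and b map elements to membership degrees in [0, 1].
        # min/max is the Goedel T-norm/T-conorm pair; other pairs such as
        # product / probabilistic sum give different semantics.
        keys = a.keys() | b.keys()
        intersection = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
        union = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
        return intersection / union if union else 1.0

    # Example: two fuzzy segmentations over three pixels
    print(fuzzy_jaccard({"p1": 0.9, "p2": 0.4}, {"p1": 0.8, "p2": 0.6, "p3": 0.1}))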
Of course, nowadays I would probably just tell them to use datasketch (https://pypi.org/project/datasketch/).
With this trip down memory lane, I looked around a little and noticed people are still creating new stuff on the topic, e.g.:
https://pypi.org/project/rensa/
It's basically a more specialized but faster version of datasketch's MinHash, written in Rust with a little Python on top.
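For anyone who hasn't tried it, the datasketch workflow looks roughly like this (based on its documented MinHash and MinHashLSH classes; the num_perm and threshold values are just illustrative):

    from datasketch import MinHash, MinHashLSH

    def minhash_of(tokens, num_perm=128):
        # Build a MinHash from an iterable of string tokens.
        m = MinHash(num_perm=num_perm)
        for t in tokens:
            m.update(t.encode("utf8"))
        return m

    m1 = minhash_of("the quick brown fox jumps over the lazy dog".split())
    m2 = minhash_of("the quick brown fox jumped over a lazy dog".split())
    print("estimated Jaccard:", m1.jaccard(m2))

    # LSH index: query returns keys whose estimated similarity clears the threshold.
    lsh = MinHashLSH(threshold=0.5, num_perm=128)
    lsh.insert("doc1", m1)
    print(lsh.query(m2))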
The problem is to find (pseudo) inverses of many different proper submatrices of a big square matrix.
Certain matrix identities (Woodbury, Banachiewicz) allow you to update the inverse of a "close" submatrix to cheaply calculate a new one.
So you store the inverses you’ve calculated already, with their row/column indices as keys. Then for each new submatrix you want to find a close already-computed inverse to update from.
This is solved with minhashing. You minwise hash the indices so that close matrices are likely to have the same hash.
In my particular implementation I did a "multi-resolution" hash so that I could change the selectivity of the search as the number of previously computed inverses grew.
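To make that concrete, here is a toy sketch of how such a cache lookup might look (my own illustration of the idea, not the parent's code; the function names and the agreement-counting heuristic are made up for the example):

    import random

    def min_hashes(index_set, num_hashes=8, seed=0):
        # A few min-wise hashes of a row/column index set; similar sets tend to agree.
        rng = random.Random(seed)
        salts = [rng.getrandbits(32) for _ in range(num_hashes)]
        return [min(hash((salt, i)) for i in index_set) for salt in salts]

    # Each cache entry: (index set, its min-hash signature, the stored inverse)
    inverse_cache = []

    def closest_cached_inverse(index_set):
        # Pick the cached inverse whose index set agrees on the most min-hash
        # positions, as a proxy for Jaccard similarity; the caller would then
        # apply a Woodbury/Banachiewicz-style update starting from it.
        sig = min_hashes(index_set)
        best, best_agreement = None, -1
        for other_set, other_sig, inv in inverse_cache:
            agreement = sum(a == b for a, b in zip(sig, other_sig))
            if agreement > best_agreement:
                best, best_agreement = (other_set, inv), agreement
        return best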
You can find the relevant material for free by searching "chapter 3 pdf mmds ullman".
Enjoy!
Edit: oh no, I'm wrong! According to Wikipedia it was invented at DEC for AltaVista: https://en.wikipedia.org/wiki/MinHash Either way, there's a nice description in the Ullman book, and they do describe how it was used at Google as well.
https://websla.sh/tools/minhash
It's not really complete; I'd like to show Jaccard similarity calculations among other things, but it already lets you enter multiple strings and see with your own eyes what a "minhash" actually is.
There is a nice Google project using a 0.5M-parameter RETSim model and the USearch engine for that: https://github.com/google/unisim
Here's the repo: https://github.com/NVIDIA/NeMo-Curator/
Some documentation on the fuzzy dedup scripts: https://docs.nvidia.com/nemo-framework/user-guide/latest/dat...
And a Python example: https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/fu...
Would be interested in any feedback from the folks here.
I first learned about this technique from Douglas Eck (https://research.google/people/douglas-eck/), who used it for song clustering at Google. I remember him saying something about hashing and random vectors, and I was puzzled because I figured that less random optimization would work better.
Typo:
It’s worth noticing that this definition is not in general transitive: we may well have three documents A, B, C such that S(A,B) ≥ S_{crit} and S(B,C) ≥ S_{crit} but S(A,B) < S_{crit}
That last S(A,B) should be S(A,C).
(Unless I'm very much mistaken ... and I could be)
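For a quick numeric check of the non-transitivity the quoted passage describes (my own example), take A = {1,2,3}, B = {2,3,4}, C = {3,4,5} with a threshold of 0.5:

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    A, B, C = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}
    print(jaccard(A, B))  # 0.5 -> meets the 0.5 threshold
    print(jaccard(B, C))  # 0.5 -> meets the 0.5 threshold
    print(jaccard(A, C))  # 0.2 -> does not, so "is similar to" is not transitive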
This simple shuffle was a poor man's similarity sort that I used to find sound-alike words. The thing about the quality of similarity is that it's hard to know for sure if it worked, but this was sufficient.