July 4th, 2024

Finding near-duplicates with Jaccard similarity and MinHash

Jaccard similarity and MinHash are used to find near-duplicates in document collections efficiently. By comparing sampled elements, approximate duplicates are grouped, simplifying detection and scaling well for large datasets. Adjusting similarity thresholds helps detect fuzzier duplicates.

This article discusses the use of Jaccard similarity and MinHash for identifying near-duplicates in a large collection of documents. Jaccard similarity measures the overlap between two sets, ranging from 0 (no shared elements) to 1 (identical sets). MinHash signatures approximate Jaccard similarity efficiently by comparing randomly sampled elements from each set, and using more hash functions improves the accuracy of the approximation. By grouping documents on their signatures, approximate duplicates can be identified without comparing every pair, which keeps the method simple and lets it scale to large datasets. The article also discusses strategies for detecting fuzzier duplicates by adjusting the similarity threshold. Overall, the approach offers a practical and scalable solution for identifying near-duplicates in large document collections.
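To make the idea concrete, here is a minimal Python sketch (not the article's code): each document becomes a set of tokens, each of k hash seeds contributes the minimum hash over the set, and the fraction of positions where two signatures agree estimates the Jaccard similarity.

    import random

    def jaccard(a: set, b: set) -> float:
        # Exact Jaccard similarity: |A ∩ B| / |A ∪ B|.
        return len(a & b) / len(a | b) if a | b else 1.0

    def minhash_signature(tokens: set, seeds: list) -> list:
        # One entry per seed: the minimum hash over all tokens in the set.
        return [min(hash((seed, t)) for t in tokens) for seed in seeds]

    def estimate_jaccard(sig_a: list, sig_b: list) -> float:
        # The chance that two min-hashes agree approximately equals the Jaccard
        # similarity, so the fraction of matching positions estimates it.
        return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

    seeds = [random.randrange(2**32) for _ in range(128)]
    a = set("the quick brown fox jumps over the lazy dog".split())
    b = set("the quick brown fox leaps over the lazy cat".split())
    print(jaccard(a, b))
    print(estimate_jaccard(minhash_signature(a, seeds), minhash_signature(b, seeds)))

A real pipeline would typically shingle the text into n-grams rather than split on whitespace, and would group documents by bands of the signature instead of comparing every pair of signatures.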

Related

Group Actions and Hashing Unordered Multisets

Group actions are used to analyze hash functions for unordered sets and multisets, ensuring order-agnostic hashing. Leveraging group theory, particularly abelian groups, the post explores the structure of such hash functions, emphasizing efficient, order-independent hashing techniques.
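As a small illustration of order-independent hashing (one common scheme, not the construction from the linked post): combine per-element hashes with addition modulo 2^64, an abelian group operation, so the result ignores ordering but respects multiplicity.

    import hashlib

    MOD = 2**64

    def element_hash(x) -> int:
        # Map each element to a pseudorandom 64-bit value.
        return int.from_bytes(
            hashlib.blake2b(repr(x).encode(), digest_size=8).digest(), "big")

    def multiset_hash(items) -> int:
        # Addition mod 2**64 is commutative, so the hash is order-independent.
        h = 0
        for x in items:
            h = (h + element_hash(x)) % MOD
        return h

    assert multiset_hash(["a", "b", "b"]) == multiset_hash(["b", "a", "b"])
    assert multiset_hash(["a", "b"]) != multiset_hash(["a", "b", "b"])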

New JavaScript Set Methods

New JavaScript Set methods, now available in major browsers as of releases like Firefox 127, offer efficient set operations without polyfills. These methods simplify tasks like finding intersections, unions, and subsets, making it easier to work with collections of unique values.

A new method of recording and searching information (1953)

Fermat's Library explores Hans Peter Luhn's method for organizing information using descriptive metadata called "legends." Luhn's system enhances search accuracy by linking terms and normalizing language, improving information retrieval efficiency.

The question of what's fair illuminates the question of what's hard

Computational complexity theorists repurpose fairness tools to analyze complex problems. Transitioning from multiaccuracy to multicalibration enhances understanding and simplifies approximating hard functions, benefiting algorithmic fairness and complexity theory.

Benchmarking Perfect Hashing in C++

Benchmarking perfect hashing functions in C++ using clang++-19 and g++-13 reveals mph as the fastest, albeit with limitations. Various hash function implementations are compared on lookup time, build time, and size, aiding system optimization.

15 comments
By @tpoacher - 5 months
One thing many people miss about set-based metrics like the Jaccard similarity (aka Tanimoto coefficient) and F1 score (aka Dice coefficient) is that they can also be used identically with fuzzy sets.

The only complication is that you then need to choose a suitable T-Norm / T-Conorm pair, which express the concept of intersection and union for fuzzy sets, and there's an infinite family of them. But that's a good thing, since you get to pick the pair with your desired semantics.

I wrote about this ([0][1]) in the context of validating medical image segmentations when both the segmentation and ground truth are probabilistic/fuzzy rather than binary masks.

Otherwise what most people do instead is to simply threshold at 0.5 to obtain binary sets for use with the binary variants of the jaccard / dice coefficients. Which apparently decreases the precision of your validation operator by 2 orders of magnitude. It's like, you publish your algorithm claiming it's better than SOTA by 0.001, ignoring the fact that your validation operator has an error margin of 0.1 ...

[0] https://link.springer.com/chapter/10.1007/978-3-319-46723-8_...

[1] https://ora.ox.ac.uk/objects/uuid:dc352697-c804-4257-8aec-08...
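For readers who want to see this concretely, here is a sketch of a fuzzy Jaccard using the min/max T-norm/T-conorm pair (one choice among the family mentioned above; not code from the linked papers), over membership grades in [0, 1] instead of binary membership:

    def fuzzy_jaccard(a: dict, b: dict) -> float:
        # min plays the role of intersection, max of union, over membership grades.
        keys = set(a) | set(b)
        inter = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
        union = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
        return inter / union if union else 1.0

    # e.g. a probabilistic segmentation vs. a fuzzy ground truth, per pixel
    seg   = {"px1": 0.9, "px2": 0.4, "px3": 0.0}
    truth = {"px1": 1.0, "px2": 0.5, "px3": 0.2}
    print(fuzzy_jaccard(seg, truth))   # ≈ 0.76

Thresholding both inputs at 0.5 before computing the binary Jaccard, as described above, discards exactly the information the fuzzy version keeps.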

By @BiteCode_dev - 5 months
I worked with a client that implemented their own Python version of this to deduplicate citizen entries in a big French gov database. It worked well.

Of course, nowadays I would probably just tell them to use datasketch (https://pypi.org/project/datasketch/).

With this trip down memory lane, I looked around a little and noticed people are still creating new stuff on the topic, e.g.:

https://pypi.org/project/rensa/

Which is basically a more specialized but faster version of datasketch minhash, written in rust, with a little python on top.
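For anyone curious what that looks like in practice, a minimal datasketch sketch (the documents and parameters here are illustrative): build a MinHash per document and query an LSH index for candidates above a Jaccard threshold.

    from datasketch import MinHash, MinHashLSH

    def signature(text: str, num_perm: int = 128) -> MinHash:
        m = MinHash(num_perm=num_perm)
        for token in text.lower().split():
            m.update(token.encode("utf8"))
        return m

    docs = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumps over the lazy dog tonight",
        "c": "something else entirely",
    }
    sigs = {k: signature(v) for k, v in docs.items()}

    lsh = MinHashLSH(threshold=0.8, num_perm=128)
    for key, sig in sigs.items():
        lsh.insert(key, sig)

    print(lsh.query(sigs["a"]))          # candidate near-duplicates of "a" (includes "a" itself)
    print(sigs["a"].jaccard(sigs["b"]))  # estimated Jaccard similarity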

By @pkeenan11 - 5 months
Holy synchronicity Batman! I just implemented a minhash system that some may find interesting.

The problem is to find (pseudo) inverses of many different proper submatrices of a big square matrix.

Certain matrix identities (Woodbury, Banachiewicz) allow you to update the inverse of a “close” submatrix to cheaply calculate a new one.

So you store the inverses you’ve calculated already, with their row/column indices as keys. Then for each new submatrix you want to find a close already-computed inverse to update from.

This is solved with minhashing. You minwise hash the indices so that close matrices are likely to have the same hash.

In my particular implementation I did a “multi-resolution” hash so that I can change the selectiveness of the search as the number of previously computed inverses grows.
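A hypothetical sketch of that lookup step (my own naming, not the commenter's code): key each stored inverse by its row/column index set and bucket on per-seed minimum hashes, so index sets with high Jaccard overlap land in at least one shared bucket.

    import random
    from collections import defaultdict

    SEEDS = [random.randrange(2**32) for _ in range(8)]
    buckets = defaultdict(set)   # (seed position, min hash) -> stored index sets

    def min_hashes(indices: frozenset):
        return [(pos, min(hash((seed, i)) for i in indices))
                for pos, seed in enumerate(SEEDS)]

    def store(indices: frozenset):
        # Called after computing (and caching) the inverse for this index set.
        for key in min_hashes(indices):
            buckets[key].add(indices)

    def close_candidates(indices: frozenset):
        # Index sets sharing any per-seed minimum are likely to be close;
        # their cached inverses are good starting points for an update.
        found = set()
        for key in min_hashes(indices):
            found |= buckets[key]
        return found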

By @a-dub - 5 months
a little backstory (that seems to be missing from the blog post). this is a technique that was invented in the early days of google for deduplicating the crawl set (turns out that building an llm and building a plain ol' web text index look awfully similar; i wonder why that is?!). you can read about it in depth in jeffrey ullman's free book "mining massive datasets" which describes many of the really cool and impressive techniques that were used to prepare an index build for the entire internet when such a thing was really hard.

you can find the relevant material for free by searching "chapter 3 pdf mmds ullman"

enjoy!

edit: oh no! i'm wrong! according to wikipedia it was invented at dec for altavista. https://en.wikipedia.org/wiki/MinHash either way there's a nice description in the ullman book and they do describe how it was used at google as well.

By @carnewal - 5 months
As I got into Minhash & its variants I found it hard to wrap my head around this, so I'm building an online visualizer

https://websla.sh/tools/minhash

It's not really complete, I'd like to show Jaccard Similarity calculations among other things, but this already allows you to enter multiple strings and see with your own eyes what a "minhash" actually is.

By @ashvardanian - 5 months
Hashing or tiny neural nets combined with a Vector Search engine with Tanimoto/Jaccard is a very common deduplication strategy for large datasets. It might be wiser than using linear-complexity MapReduce operations.

There is a nice Google project using 0.5 M parameter RETSim model and the USearch engine for that: https://github.com/google/unisim

By @vivzkestrel - 5 months
I have this problem right now in postgres. I have 600000 feed_items with the schema (feed_item_id uuid, author varchar, content text, guid varchar, link varchar, title varchar, summary text, feed_id integer). The content and summary columns especially for some of the news items are highly similar but not equal. For any given 2 such news items, I am trying to cut them down to 1. Any ideas?

By @ryantwolf - 5 months
Love the article! My team at nvidia recently released a GPU accelerated version of the fuzzy deduplication algorithm described, and I figure this community might be interested.

Here's the repo: https://github.com/NVIDIA/NeMo-Curator/

Some documentation on the fuzzy dedup scripts: https://docs.nvidia.com/nemo-framework/user-guide/latest/dat...

And a Python example: https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/fu...

Would be interested in any feedback from the folks here.

By @dekhn - 5 months
This is one of those techniques I don't understand when I read it as text, but will instantly absorb when I have a working code example and pass some of my own data through it a few times, inspecting the working process.

I first learned about this technique from Douglas Eck (https://research.google/people/douglas-eck/), who used it for song clustering at Google. I remember him saying something about hashing and random vectors, and was puzzled because I figured that less random optimization would work better.

By @derefr - 5 months
As a document clustering / dataset deduplication technique, how well does "throwing ML at the problem" (e.g. using a pre-trained LLM encoder to generate vector embeddings of your documents, and then sticking those vectors into a vector DB and doing k-means to cluster the vectors) compare, quality-wise and performance-wise, to these simpler discrete-algorithm methods?

By @ColinWright - 5 months
In case the author is following this:

Typo:

It’s worth noticing that this definition is not in general transitive: we may well have three documents A,B,C such that S(A,B)≥S_{crit} and S(B,C)≥S_{crit} but S(A,B)<S_{crit}

That last S(A,B) should be S(A,C).

(Unless I'm very much mistaken ... and I could be)

By @is_true - 5 months
I had to do this with sports teams, but I used Levenshtein for the names. I ended up creating a vector with other data (country, gender) and using that vector to calculate the distance (similarity).

By @motohagiography - 5 months
there was a really simple string sort by similarity oneliner I can't find in my history, but the basic idea was to take a wordlist, split each word in half, reverse the halves, concatenate them back together, sort the resulting strings, then flip the strings back to their original words.

this simple shuffle was a poor man's similarity sort that I used to find sound-alike words. the thing about the quality of similarity is that it's hard to know for sure if it worked, but this was sufficient.
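One possible reconstruction of that shuffle (the original one-liner may well have differed): swap each word's halves, sort on the swapped form so words with similar endings cluster, then read back the original words.

    words = ["nation", "station", "ration", "battle", "bottle", "rattle", "random"]

    def swap_halves(w: str) -> str:
        mid = len(w) // 2
        return w[mid:] + w[:mid]   # e.g. "nation" -> "ionnat"

    # Sort by the swapped form, keeping the original word alongside.
    for _, original in sorted((swap_halves(w), w) for w in words):
        print(original)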

By @ptspts - 5 months
Do the minhash-based algorithms in the article give exactly the same results (but faster) as pairwise calculation of J(a, b), or are the minhash-based results approximate?