November 20th, 2024

Understanding the BM25 full text search algorithm

BM25 is a widely used full-text search algorithm that ranks documents by relevance using query terms, inverse document frequency (IDF), term frequency, and document length normalization; it is also a common component of hybrid search systems.


BM25, or Best Match 25, is a prominent algorithm used for full-text search, particularly in systems like Lucene, Elasticsearch, and SQLite. It ranks documents based on their relevance to a search query by utilizing several components: query terms, Inverse Document Frequency (IDF), term frequency within documents, and document length normalization.

The algorithm operates under the Probability Ranking Principle, which suggests that documents should be ordered by their likelihood of relevance. BM25 calculates scores for each query term, adjusting for how common or rare the terms are across the document collection, and normalizes scores based on document length to prevent bias towards longer documents. The algorithm's cleverness lies in its ability to rank documents without needing to calculate exact probabilities, instead focusing on relative scores.

While BM25 scores can indicate a document's relevance within a specific collection, they cannot be directly compared across different queries or collections. This limitation arises because BM25 does not produce a standardized score range, but rather ranks documents based on their likelihood of relevance to the query at hand. The exploration of BM25 highlights its effectiveness in hybrid search systems that combine full-text and vector similarity searches, enhancing the retrieval of relevant content.

- BM25 is a widely used algorithm for full-text search, especially in Lucene and Elasticsearch.

- It ranks documents based on query terms, IDF, term frequency, and document length normalization.

- The algorithm operates under the Probability Ranking Principle, focusing on relative scores rather than exact probabilities.

- BM25 scores cannot be directly compared across different queries or collections.

- It is effective in hybrid search systems that integrate full-text and vector similarity searches.
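The components listed above (IDF, term-frequency saturation, and document length normalization) can be combined into a short scoring sketch. This is a minimal illustration, not the article's code: the function name, the toy corpus, and the choice of the +0.5-smoothed IDF with default parameters k1=1.5 and b=0.75 are assumptions that follow common Okapi BM25 conventions.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`.

    A minimal BM25 sketch: docs is a list of token lists; k1 controls
    term-frequency saturation, b controls length normalization.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # mean document length
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in docs:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF: rare terms contribute more than common ones.
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Saturating term frequency, scaled by relative document length.
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            )
            score += idf * norm
        scores.append(score)
    return scores
```

Note how the scores are only meaningful relative to each other within one collection and one query, which is exactly the comparability limitation the summary describes.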

9 comments
By @DavidPP - 2 months
We use https://typesense.org/ for regular search, but it now has support for doing hybrid search, curious if anyone has tried it yet?
By @hubraumhugo - 2 months
Given the recent advances in vector-based semantic search, what's the SOTA search stack that people are using for hybrid keyword + semantic search these days?
By @jll29 - 2 months
Nice write-up.

A few more details/background that are harder to find: "BM25" stands for "Best Matching 25" - "best matching" because it is a formula for ranking and term weighting (the matching refers to the terms in the query versus the document), and the number 25 is simply a running number (there were 24 earlier formula variants and some later ones, but #25 turned out to work best, so it was the one that was published).

It was conceived by Stephen Robertson and Karen Spärck Jones (the latter of IDF fame) and first implemented in the former's OKAPI information retrieval (research) system. The OKAPI system was benchmarked for a number of years at the annual US NIST TREC (Text Retrieval Conference), the international "World Championship" of search engine methods (although the event is not about winning but about comparing notes and learning from each other; a highly recommended annual event held every November in Gaithersburg, Maryland, attended by global academic and industry teams that conduct research on improving search - see trec.nist.gov).

Besides the "bag of words" Vector Space Model (sparse vectors of terms) and the Probabilistic Models (to which BM25 belongs), there is a surprising and still-growing number of other theoretical frameworks for how to rank a set of documents, given a query ("Divergence from Randomness", "Statistical Language Modeling", "Learning to Rank", "Quantum Information Retrieval", "Neural Ranking", etc.). Conferences like ICTIR and SIGIR still occasionally publish entirely new paradigms for search. Note that the "Statistical Language Modeling" paradigm is not about the Large Language Models that are in vogue now (those are covered under the "Neural Retrieval" umbrella), and that searching for "Quantum IR" will not get you a tutorial about Quantum Information Retrieval but rather methods of infrared spectroscopy, or a company with the same name that produces cement; such are the intricacies of search technology, even in the 21st century.

If you want to play with BM25 and compare it with some of the alternatives, I recommend the research platform Terrier, an open-source search engine developed at the University of Glasgow (today, perhaps the epicenter of search research).

BM25 is over a quarter century old, but has proven to be a hard baseline to beat (it is still often used as a reference point for comparing new methods against), and a more recent variant, BM25F, can deal with multiple fields and hypertext (e.g. title, body of documents, hyperlinks).

The recommended paper to read is: Spärck Jones, K.; Walker, S.; Robertson, S. E. (2000). "A probabilistic model of information retrieval: Development and comparative experiments: Part 1". Information Processing & Management 36(6): 779–808, and its successor, Part 2. (Sadly they are not open access.)

By @MPSimmons - 2 months
Does anyone know if the average document length mentioned in the document length normalization is median? It seems like it would need to be to properly deweight excessively long documents, otherwise the excessively long documents would unfairly weight the average, right?
By @sidcool - 2 months
Good article. I am genuinely interested to learn about how to think of problems in such a mathematical form. And how to test it. Any resources?
By @tselvaraj - 2 months
Hybrid search solves the long-standing challenge of relevance with search results. We can use ranking fusion between keyword and vector to create a hybrid search that works in most scenarios.
By @RA_Fisher - 2 months
BM25 is an ancient algo developed in the 1970s. It’s basically a crappy statistical model and statisticians can do far better today. Search is strictly dominated by learning (that yes, can use search as an input). Not many folks realize that yet, and / or are incentivized to keep the old tech going as long as possible, but market pressures will change that.