Understanding the BM25 full text search algorithm
BM25 is a widely used algorithm for full-text search that ranks documents by relevance using query terms, IDF, term frequency, and document length normalization; it also underpins many hybrid search systems.
BM25, or Best Match 25, is a prominent algorithm used for full-text search, particularly in systems like Lucene, Elasticsearch, and SQLite. It ranks documents based on their relevance to a search query by utilizing several components: query terms, Inverse Document Frequency (IDF), term frequency within documents, and document length normalization. The algorithm operates under the Probability Ranking Principle, which suggests that documents should be ordered by their likelihood of relevance.

BM25 calculates a score for each query term, adjusting for how common or rare the term is across the document collection, and normalizes scores based on document length to prevent bias towards longer documents. The algorithm's cleverness lies in its ability to rank documents without needing to calculate exact probabilities, focusing instead on relative scores.

While BM25 scores can indicate a document's relevance within a specific collection, they cannot be directly compared across different queries or collections. This limitation arises because BM25 does not produce a standardized score range, but rather ranks documents by their likelihood of relevance to the query at hand. The exploration of BM25 highlights its effectiveness in hybrid search systems that combine full-text and vector similarity searches, enhancing the retrieval of relevant content.
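To make those components concrete, here is a minimal sketch of BM25 scoring in Python. It assumes a small in-memory list of tokenized documents; the function name, the parameter defaults k1 = 1.2 and b = 0.75, and the Lucene-style IDF smoothing are illustrative choices, not the exact formula used by any particular engine.

```python
import math
from collections import Counter

def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score every document against the query with a minimal BM25.

    documents: list of token lists; query_terms: list of tokens.
    Illustrative only -- real engines (Lucene, Elasticsearch, SQLite FTS5)
    work on inverted indexes and use their own IDF variants.
    """
    N = len(documents)
    avgdl = sum(len(d) for d in documents) / N

    # Document frequency: how many documents contain each query term.
    df = {t: sum(1 for d in documents if t in d) for t in set(query_terms)}

    scores = []
    for doc in documents:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            # Rare terms get a larger IDF weight (the smoothing keeps the
            # value positive even for very common terms).
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation plus length normalization: repeated
            # occurrences help less and less, and documents longer than
            # average are penalized via len(doc) / avgdl.
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * (tf[term] * (k1 + 1)) / denom
        scores.append(score)
    return scores

# e.g. bm25_scores(["full", "text"],
#                  [["full", "text", "search"], ["vector", "search"]])
```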
- BM25 is a widely used algorithm for full-text search, especially in Lucene and Elasticsearch.
- It ranks documents based on query terms, IDF, term frequency, and document length normalization.
- The algorithm operates under the Probability Ranking Principle, focusing on relative scores rather than exact probabilities.
- BM25 scores cannot be directly compared across different queries or collections.
- It is effective in hybrid search systems that integrate full-text and vector similarity searches.
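To illustrate the hybrid-search point, the sketch below merges a BM25 ranking with a vector-similarity ranking using reciprocal rank fusion, the merging method mentioned in the pgvector write-up further down; the function name and the constant k = 60 are assumptions made purely for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranking.

    rankings: list of lists, each ordered best-first (e.g. one from BM25,
    one from vector similarity). k = 60 is a commonly cited default,
    used here only as an illustrative choice.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); appearing near the top
            # of any list boosts a document's fused score.
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# e.g. fuse a BM25 ranking with a vector-similarity ranking:
# reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d7", "d2"]])
```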
Related
BM42 – a new baseline for hybrid search
Qdrant introduces BM42, combining BM25 with embeddings to enhance text retrieval. Addressing SPLADE's limitations, it leverages transformer models for semantic information extraction, promising improved retrieval quality and adaptability across domains.
Hybrid Search in CrateDB - ranking and scoring calculations in pure SQL
CrateDB's hybrid search enhances relevancy using kNN, BM25, and geospatial search. It integrates semantic and lexical searches, improving results in contexts like e-commerce through advanced query structuring and ranking techniques.
BMX: A Freshly Baked Take on BM25
Researchers have developed BMX, a new lexical search algorithm that enhances BM25 by integrating similarity and semantic understanding. Extensive tests show BMX outperforms BM25 across various datasets and languages.
Hybrid Search with PostgreSQL and Pgvector
Hybrid search improves relevancy in vector similarity searches by combining methods in PostgreSQL with pgvector. It enhances recall, index size, and query latency, utilizing reciprocal ranked fusion for result merging.
Phrase Matching in Marginalia Search
Marginalia Search has updated its engine to support phrase matching, improving accuracy by storing exact term positions. Funded by nlnet, it enhances indexing and ranking algorithms for better search results.
A few more details/background that are harder to find: "BM25" stands for "Best Matching 25": "best matching" because it is a formula for ranking and term weighting (the matching refers to the terms in the query versus the document), and the number 25 is simply a running number (there were 24 earlier formula variants and some later ones, but #25 turned out to work best, so it was the one that was published).
It was conceived by Stephen Robertson and Karen Spärck Jones (the latter of IDF fame) and first implemented in the former's Okapi information retrieval (research) system. The Okapi system was benchmarked for a number of years at the annual US NIST TREC (Text Retrieval Conference), the international "World Championship" of search engine methods (although the event is not about winning, but about comparing notes and learning from each other; a highly recommended annual event held every November in Gaithersburg, Maryland, attended by global academic and industry teams that conduct research on improving search - see trec.nist.gov).
Besides the "bag of words" Vector Space Model (sparse vectors of terms), the Probabilistic Modles (that BM25 belongs to), there are suprising and still growing number of other theoretical frameworks how to rank a set of documents, given a query ("Divergence from Randomness", "Statistical Language Modeling, "Learning to Rank", "Quantum Information Retrieval", "Neural Ranking" etc.). Conferences like ICTIR and SIGIR still publish occasionaly entirely new paradigms for search. Note that the "Statistical Language Modeling" paradigm is not about Large Language Models that are on vogue now (that's covered under the "Neural Retrieval" umbrella), and that "Quantum IR" is not going to get you to a tutorial about Quantum Information Retrieval but to methods of infrared spectroscopy or a company with the same name that produces cement; such are the intricacies of search technology, even in the 21st century.
If you want to play with BM25 and compare it with some of the alternatives, I recommend the research platform Terrier, an open-source search engine developed at the University of Glasgow (today, perhaps the epicenter of search research).
BM25 is over a quarter century old but has proven to be a hard baseline to beat (it is still often used as a reference point for comparing new methods against), and a more recent variant, BM25F, can deal with multiple fields and hypertext (e.g. title, body of documents, hyperlinks).
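As a rough sketch of the fielded idea (a simplification, not Robertson's exact BM25F formulation), per-field term frequencies can be folded into one weighted, length-normalized pseudo-frequency before the usual saturation step; the field names, weights, and shared b parameter below are illustrative assumptions.

```python
def bm25f_term_score(field_tfs, field_lens, avg_field_lens,
                     field_weights, idf, k1=1.2, b=0.75):
    """Sketch of BM25F scoring for a single query term in one document.

    field_tfs / field_lens map field names (e.g. "title", "body") to the
    term's frequency and the field's length; field_weights boosts, say,
    title matches over body matches. Simplified: one b for all fields.
    """
    # Fold the fields into a single length-normalized pseudo term frequency.
    pseudo_tf = 0.0
    for field, tf in field_tfs.items():
        norm = 1 - b + b * field_lens[field] / avg_field_lens[field]
        pseudo_tf += field_weights.get(field, 1.0) * tf / norm

    # Then apply the usual BM25 saturation and IDF weighting once.
    return idf * pseudo_tf / (k1 + pseudo_tf)
```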
The recommended paper to read is: Spärck Jones, K.; Walker, S.; Robertson, S. E. (2000). "A probabilistic model of information retrieval: Development and comparative experiments: Part 1". Information Processing & Management 36(6): 779–808, and its successor, Part 2. (Sadly they are not open access.)