BMX: A Freshly Baked Take on BM25
Researchers have developed BMX, a new lexical search algorithm that enhances BM25 by integrating similarity and semantic understanding. Extensive tests show BMX outperforms BM25 across various datasets and languages.
Researchers from Mixedbread and the Hong Kong Polytechnic University have introduced BMX, a lexical search algorithm that aims to improve on the widely used BM25. BMX addresses BM25's limitations by incorporating both similarity and semantic understanding into its scoring mechanism. It has two key innovations: entropy-weighted similarity, which adjusts scores according to the significance of each query token, and weighted query augmentation (WQA), which lets BMX process the original and augmented queries simultaneously for greater efficiency. Benchmarks on several datasets, including BEIR and BRIGHT, show BMX consistently outperforming BM25 in retrieval quality across multiple domains. BMX also outperforms BM25 in multilingual tasks involving languages such as Chinese, Japanese, and German. The Baguetter library provides an open-source implementation of BMX for practical use. BMX could improve the user experience in applications that rely on search, and could also benefit natural language processing tasks. The Mixedbread team encourages community feedback and discussion about potential applications of BMX.
- BMX is a new lexical search algorithm that improves upon BM25.
- It incorporates similarity and semantic understanding to enhance search accuracy.
- Extensive benchmarking shows BMX outperforms BM25 across various datasets and languages.
- BMX is available through the open-source Baguetter library for practical implementation.
- The algorithm aims to improve user experience in search-related applications and natural language processing.
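To make the weighted query augmentation idea more concrete, here is a minimal Python sketch under stated assumptions. It uses classic BM25 as the base scorer and adds the scores of augmented queries, each scaled by a weight. This is not the BMX formula from the paper or the Baguetter API: the entropy-weighted similarity term is omitted because the summary does not give its definition, each query is scored in a separate pass rather than simultaneously, and the function names, example documents, augmented query, and weight of 0.5 are all hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with classic BM25."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (the "+ 1" inside the log).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def augmented_scores(query, augmented_queries, weights, docs):
    """Hypothetical combination: original-query score plus weighted
    scores of each augmented query (an assumption, not the BMX formula)."""
    total = bm25_scores(query, docs)
    for aug_query, weight in zip(augmented_queries, weights):
        for i, s in enumerate(bm25_scores(aug_query, docs)):
            total[i] += weight * s
    return total


# Toy corpus, pre-tokenized by whitespace.
docs = [
    "bm25 is a lexical ranking function".split(),
    "dense embeddings capture semantic similarity".split(),
    "sourdough bread needs a long fermentation".split(),
]
query = "lexical search ranking".split()
# A hand-written paraphrase standing in for a generated augmented query.
augmented_queries = ["keyword retrieval ranking function".split()]

base = bm25_scores(query, docs)
scores = augmented_scores(query, augmented_queries, [0.5], docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

In this toy run only the first document shares tokens with either query, so it ranks first, and the augmented query raises its score through the extra match on "function". A real augmentation step would generate the paraphrases and their weights rather than hard-coding them.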
Related
BM42 – a new baseline for hybrid search
Qdrant introduces BM42, combining BM25 with embeddings to enhance text retrieval. Addressing SPLADE's limitations, it leverages transformer models for semantic information extraction, promising improved retrieval quality and adaptability across domains.
How we improved search results in 1Password
1Password has improved its search functionality by integrating large language models, enhancing accuracy and flexibility while ensuring user privacy. The update retains original search options for user preference.
Unsafe Read Beyond of Death
The article details the "Unsafe Read Beyond of Death" optimization for the GxHash algorithm, enhancing performance through SIMD instructions and achieving over tenfold speed increases for small payloads while ensuring safety.
Hybrid Search in CrateDB - ranking and scoring calculations in pure SQL
CrateDB's hybrid search enhances relevancy using kNN, BM25, and geospatial search. It integrates semantic and lexical searches, improving results in contexts like e-commerce through advanced query structuring and ranking techniques.
Usearch: Single-File Similarity Search
USearch is a high-performance similarity search engine optimized for vectors and text, supporting multiple programming languages and platforms, claiming to be up to 10 times faster than FAISS.
Sounds a lot like BM25 weighted word embeddings (e.g. fastText).
https://blogs.perficient.com/2012/09/25/a-mathematical-model...