August 14th, 2024

BMX: A Freshly Baked Take on BM25

Researchers have developed BMX, a new lexical search algorithm that enhances BM25 by integrating similarity and semantic understanding. Extensive tests show BMX outperforms BM25 across various datasets and languages.


Researchers from Mixedbread and the Hong Kong Polytechnic University have introduced a new lexical search algorithm called BMX, which aims to improve upon the widely used BM25 algorithm. BMX addresses BM25's limitations by incorporating both similarity and semantic understanding into its scoring mechanism. Key innovations include entropy-weighted similarity, which adjusts scores based on the significance of query tokens, and weighted query augmentation (WQA), which allows BMX to process original and augmented queries simultaneously for enhanced efficiency.

Extensive benchmarking on various datasets, including BEIR and BRIGHT, demonstrates that BMX consistently outperforms BM25, achieving superior retrieval quality across multiple domains. BMX also shows promise in multilingual contexts, outperforming BM25 in tasks involving languages such as Chinese, Japanese, and German. The Baguetter library provides an open-source implementation of BMX, making it accessible for practical use.

The development of BMX could significantly enhance user experience in applications relying on search algorithms, while also benefiting natural language processing tasks. The Mixedbread team encourages community feedback and discussions regarding the potential applications of BMX.
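The entropy-weighting idea can be illustrated with a small sketch. This is not the paper's exact formula or Baguetter's API: the functions `token_entropy` and `entropy_weighted_score`, and the `1 - H/H_max` down-weighting of high-entropy (uninformative) tokens layered on a standard BM25 term score, are illustrative assumptions.

```python
import math

def token_entropy(token, docs):
    """Shannon entropy of a token's count distribution over documents.
    High entropy: the token is spread evenly (less discriminative);
    low entropy: it is concentrated in a few documents (more discriminative)."""
    counts = [doc.count(token) for doc in docs]
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def bm25_idf(token, docs):
    """Standard BM25 inverse document frequency."""
    n = sum(1 for doc in docs if token in doc)
    return math.log((len(docs) - n + 0.5) / (n + 0.5) + 1)

def entropy_weighted_score(query_tokens, doc, docs, k1=1.5, b=0.75):
    """BM25 term scores, each scaled by an assumed entropy-based weight."""
    avgdl = sum(len(d) for d in docs) / len(docs)
    max_h = math.log2(len(docs))  # entropy upper bound for this corpus
    score = 0.0
    for t in query_tokens:
        tf = doc.count(t)
        if tf == 0:
            continue
        bm25_term = bm25_idf(t, docs) * tf * (k1 + 1) / (
            tf + k1 * (1 - b + b * len(doc) / avgdl))
        # hypothetical weighting: tokens spread across the whole corpus
        # (high entropy) contribute less than concentrated ones
        weight = 1.0 - token_entropy(t, docs) / max_h if max_h > 0 else 1.0
        score += weight * bm25_term
    return score
```

Documents here are pre-tokenized lists of strings; a real implementation would work over an inverted index rather than rescanning documents per query token.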

- BMX is a new lexical search algorithm that improves upon BM25.

- It incorporates similarity and semantic understanding to enhance search accuracy.

- Extensive benchmarking shows BMX outperforms BM25 across various datasets and languages.

- BMX is available through the open-source Baguetter library for practical implementation.

- The algorithm aims to improve user experience in search-related applications and natural language processing.

9 comments
By @leobg - 6 months
> Entropy-weighted similarity: We adjust the similarity scores between query tokens and related documents based on the entropy of each token.

Sounds a lot like BM25 weighted word embeddings (e.g. fastText).

By @antman - 6 months
How about computational complexity? There seems to be a small improvement in metrics, but I'm not sure if it is enough to switch to BMX.
By @intalentive - 6 months
Neat. I wonder how GPT-4’s query expansion might compare with SPLADE or similar masked BERT methods. Also if you really want to go nuts you can apply term expansion to the document corpus.
By @deepsquirrelnet - 6 months
Very cool! Glad to see continued research in this direction. I’ve really enjoyed reading the Mixedbread blog. If you’re interested in retrieval topics, they’re doing some cool stuff.
By @bernihackernews - 6 months
baguetter library for the win!
By @yokee - 6 months
Super cool! It is definitely a good choice for the RAG system.
By @timsuchanek - 6 months
Amazing! When will we have this in the major databases?
By @herrmannfield - 6 months
By @flawn - 6 months
Gemischtes Brot! (German: "Mixed bread!")