August 14th, 2024

BMX: A Freshly Baked Take on BM25

Researchers have developed BMX, a new lexical search algorithm that enhances BM25 by integrating similarity and semantic understanding. Extensive tests show BMX outperforms BM25 across various datasets and languages.


Researchers from Mixedbread and the Hong Kong Polytechnic University have introduced a new lexical search algorithm called BMX, which aims to improve upon the widely used BM25 algorithm. BMX addresses BM25's limitations by incorporating both similarity and semantic understanding into its scoring mechanism. Key innovations include entropy-weighted similarity, which adjusts scores based on the significance of query tokens, and weighted query augmentation (WQA), which allows BMX to process original and augmented queries simultaneously for enhanced efficiency.

Extensive benchmarking on various datasets, including BEIR and BRIGHT, demonstrates that BMX consistently outperforms BM25, achieving superior retrieval quality across multiple domains. BMX also shows promise in multilingual contexts, outperforming BM25 in tasks involving languages such as Chinese, Japanese, and German. The Baguetter library provides an open-source implementation of BMX, making it accessible for practical use.

The development of BMX could significantly enhance user experience in applications relying on search algorithms, while also benefiting natural language processing tasks. The Mixedbread team encourages community feedback and discussions regarding the potential applications of BMX.
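The entropy-weighting idea can be illustrated with a small sketch. This is not the paper's exact formula or Baguetter's API: the functions `token_entropy` and `entropy_weighted_score`, and the `1 - H/H_max` down-weighting of high-entropy (uninformative) tokens layered on a standard BM25 term score, are illustrative assumptions.

```python
import math

def token_entropy(token, docs):
    """Shannon entropy of a token's count distribution over documents.
    High entropy: the token is spread evenly (less discriminative);
    low entropy: it is concentrated in a few documents (more discriminative)."""
    counts = [doc.count(token) for doc in docs]
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def bm25_idf(token, docs):
    """Standard BM25 inverse document frequency."""
    n = sum(1 for doc in docs if token in doc)
    return math.log((len(docs) - n + 0.5) / (n + 0.5) + 1)

def entropy_weighted_score(query_tokens, doc, docs, k1=1.5, b=0.75):
    """BM25 term scores, each scaled by an assumed entropy-based weight."""
    avgdl = sum(len(d) for d in docs) / len(docs)
    max_h = math.log2(len(docs))  # entropy upper bound for this corpus
    score = 0.0
    for t in query_tokens:
        tf = doc.count(t)
        if tf == 0:
            continue
        bm25_term = bm25_idf(t, docs) * tf * (k1 + 1) / (
            tf + k1 * (1 - b + b * len(doc) / avgdl))
        # hypothetical weighting: tokens spread across the whole corpus
        # (high entropy) contribute less than concentrated ones
        weight = 1.0 - token_entropy(t, docs) / max_h if max_h > 0 else 1.0
        score += weight * bm25_term
    return score
```

Documents here are pre-tokenized lists of strings; a real implementation would work over an inverted index rather than rescanning documents per query token.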

- BMX is a new lexical search algorithm that improves upon BM25.

- It incorporates similarity and semantic understanding to enhance search accuracy.

- Extensive benchmarking shows BMX outperforms BM25 across various datasets and languages.

- BMX is available through the open-source Baguetter library for practical implementation.

- The algorithm aims to improve user experience in search-related applications and natural language processing.

9 comments
By @leobg - 6 months
> Entropy-weighted similarity: We adjust the similarity scores between query tokens and related documents based on the entropy of each token.

Sounds a lot like BM25 weighted word embeddings (e.g. fastText).

By @antman - 6 months
How about computational complexity? There seems to be a small improvement in metrics, but I'm not sure if it is enough to switch to BMX.
By @intalentive - 6 months
Neat. I wonder how GPT-4’s query expansion might compare with SPLADE or similar masked BERT methods. Also if you really want to go nuts you can apply term expansion to the document corpus.
By @deepsquirrelnet - 6 months
Very cool! Glad to see continued research in this direction. I’ve really enjoyed reading the Mixedbread blog. If you’re interested in retrieval topics, they’re doing some cool stuff.
By @bernihackernews - 6 months
baguetter library for the win!
By @yokee - 6 months
Super cool! It is definitely a good choice for the RAG system.
By @timsuchanek - 6 months
Amazing! When will we have this in the major databases?
By @herrmannfield - 6 months
By @flawn - 6 months
Gemischtes Brot! (German: "Mixed bread!")