Hybrid Search in CrateDB - ranking and scoring calculations in pure SQL
CrateDB's hybrid search enhances relevancy using kNN, BM25, and geospatial search. It integrates semantic and lexical searches, improving results in contexts like e-commerce through advanced query structuring and ranking techniques.
CrateDB's hybrid search combines multiple search algorithms to enhance relevancy and accuracy. It supports three primary search functions: k-nearest neighbors (kNN) search, BM25 (full-text) search, and geospatial search. Hybrid search is particularly effective when integrating semantic search, which understands context, with lexical search, which focuses on keyword frequency. For instance, in an e-commerce context, a user searching for "gpu ASUS" would benefit from results that match both the product type and brand.
BM25 is a bag-of-words algorithm that ranks documents based on keyword occurrences, document length, and average document length. CrateDB utilizes Lucene's capabilities to allow for various search customizations, including fuzziness and analyzers. Vector search transforms data into dense vectors, enabling similarity calculations based on vector proximity. This method is useful for clustering and recommendations.
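To make the BM25 inputs concrete, here is a minimal Python sketch of the textbook BM25 formula over a toy corpus. It is not Lucene's or CrateDB's implementation: the whitespace tokenizer, the function name `bm25_scores`, the sample documents, and the defaults `k1=1.2` and `b=0.75` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with textbook BM25.

    Assumptions: lowercase whitespace tokenization, no stemming or
    analyzers, and commonly quoted defaults for k1 and b.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical product descriptions, not taken from the article.
docs = [
    "ASUS gpu with 12 GB memory",
    "gaming laptop by ASUS with a bright display and long battery life",
    "mechanical keyboard",
]
scores = bm25_scores("gpu ASUS", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

The first document matches both query terms and is short, so it scores highest. The second matches only the brand and is longer than average, so it scores lower. The third matches nothing and scores zero.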
Hybrid search can be implemented through techniques like convex combination, which applies weighted scores from different search methods, or reciprocal rank fusion (RRF), which merges ranks without considering specific scores. Both methods aim to produce a single, more relevant result set by combining the strengths of different search approaches.
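The two fusion techniques can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the article's SQL: the document ids and scores are made up, min-max normalization is one possible way to bring both score scales into the same range before weighting, and the RRF constant `k=60` is a commonly used default rather than something the summary specifies.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping into the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def convex_combination(bm25, vector, alpha=0.5):
    """Weighted sum of normalized scores; alpha weights the vector side."""
    bm25_n = min_max_normalize(bm25)
    vector_n = min_max_normalize(vector)
    doc_ids = set(bm25_n) | set(vector_n)
    # A document missing from one result set contributes 0 for that side.
    return {
        doc_id: alpha * vector_n.get(doc_id, 0.0)
        + (1 - alpha) * bm25_n.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


def reciprocal_rank_fusion(rankings, k=60):
    """Sum 1 / (k + rank) over every ranking a document appears in."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused


# Hypothetical result sets: BM25 scores and vector similarities.
bm25 = {"a": 7.2, "b": 3.1, "c": 0.4}
vector = {"b": 0.91, "c": 0.88, "d": 0.35}

convex = convex_combination(bm25, vector, alpha=0.5)
rrf = reciprocal_rank_fusion([
    sorted(bm25, key=bm25.get, reverse=True),
    sorted(vector, key=vector.get, reverse=True),
])
rrf_order = sorted(rrf, key=rrf.get, reverse=True)
print(max(convex, key=convex.get), rrf_order)
```

Convex combination needs comparable score scales and a tuned weight, while RRF only looks at positions, which makes it easier to apply when the two scores have very different ranges.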
To execute hybrid searches, users can write one common table expression (CTE) per search method and join the two result sets in a final query that computes the combined score or rank. This approach is particularly beneficial for applications requiring nuanced search capabilities, such as those found in e-commerce and data analytics.
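The query shape can be sketched with Python's built-in `sqlite3` module standing in for CrateDB. Everything here is an assumption made for illustration: the tables `bm25_hits` and `vector_hits` hold precomputed, made-up scores in place of real full-text and kNN queries, the CrateDB-specific search functions are deliberately left out, and the two CTEs are merged with `UNION ALL` plus `GROUP BY` rather than an explicit join, which produces the same reciprocal rank fusion result. It needs an SQLite build with window functions (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-ins for the result sets of a full-text and a kNN query.
    CREATE TABLE bm25_hits (id TEXT, score REAL);
    CREATE TABLE vector_hits (id TEXT, score REAL);
    INSERT INTO bm25_hits VALUES ('a', 7.2), ('b', 3.1), ('c', 0.4);
    INSERT INTO vector_hits VALUES ('b', 0.91), ('c', 0.88), ('d', 0.35);
""")

HYBRID_QUERY = """
WITH bm25 AS (
    -- Rank the lexical hits by descending score.
    SELECT id, ROW_NUMBER() OVER (ORDER BY score DESC) AS rnk
    FROM bm25_hits
),
vec AS (
    -- Rank the vector hits by descending similarity.
    SELECT id, ROW_NUMBER() OVER (ORDER BY score DESC) AS rnk
    FROM vector_hits
),
combined AS (
    SELECT id, rnk FROM bm25
    UNION ALL
    SELECT id, rnk FROM vec
)
-- Reciprocal rank fusion with the commonly used constant 60.
SELECT id, SUM(1.0 / (60 + rnk)) AS rrf_score
FROM combined
GROUP BY id
ORDER BY rrf_score DESC, id
"""

rows = conn.execute(HYBRID_QUERY).fetchall()
for doc_id, rrf_score in rows:
    print(doc_id, round(rrf_score, 5))
```

Documents that appear in both result sets collect two rank contributions and rise to the top, while documents found by only one method still appear further down the list.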
Related
Redis Alternative at Apache Software Foundation Now Supports RediSearch and SQL
A new query engine, KQIR, supports SQL and RediSearch queries for Apache Kvrocks, a Redis-compatible database. It aims to combine performance with transaction guarantees and complex query support, utilizing an intermediate language for consistency. Future plans include expanding field types and enhancing transaction guarantees.
BM42 – a new baseline for hybrid search
Qdrant introduces BM42, combining BM25 with embeddings to enhance text retrieval. Addressing SPLADE's limitations, it leverages transformer models for semantic information extraction, promising improved retrieval quality and adaptability across domains.
DuckDB: Vector Similarity Search Extension
The vss extension in DuckDB enhances vector similarity search with HNSW indexing for ARRAY columns. Users can optimize queries with distance metrics but should be cautious due to limitations and experimental features.
turbopuffer: Fast Search on Object Storage
Simon Hørup Eskildsen founded turbopuffer in 2023 to offer a cost-efficient search engine using object storage and SSD caching. Notable customers experienced 10x cost reduction and improved latency. Access is granted by application.
Korvus: Single-Query RAG with Postgres
Korvus is a search SDK merging RAG pipeline into a Postgres query, using Python, JavaScript, and Rust bindings. It streamlines search processes, minimizes infrastructure needs, and offers detailed documentation on GitHub.