September 13th, 2024

PGVector's Missing Features

Trieve's blog post outlines PGVector's limitations in vector search, including issues with required words, performance, and support for sparse vectors, suggesting dedicated solutions like Trieve for advanced search needs.

Read original article

Trieve's recent blog post discusses the limitations of PGVector, a tool used for vector search within PostgreSQL. While PGVector simplifies infrastructure, it lacks several key features that are essential for effective search solutions. The blog highlights four main shortcomings: the inability to handle required and negated words effectively, the lack of explainability through keyword highlights, slow performance on filter and order by queries, and limited support for sparse vectors and full-text search modes. These issues can lead to poor user experiences, especially for those accustomed to traditional keyword searches. Trieve emphasizes that while PGVector may be suitable for some applications due to its simplicity, it may not be the best choice for those needing advanced search capabilities. The post encourages users to consider dedicated search solutions like Trieve, which offer better performance and features tailored for semantic search. Trieve also invites readers to engage further through meetings or discussions on their platform.

- PGVector has limitations in handling required and negated words in search queries.

- It lacks keyword highlighting, which affects user experience and query refinement.

- Performance issues with filter and order by queries make PGVector less efficient.

- PGVector does not support sparse vectors or full-text search, limiting its functionality.

- Trieve offers a more robust solution for developers needing advanced search capabilities.

Is an All-in-One Database the Future?

Specialized databases are emerging to tackle complex data challenges, leading to intricate infrastructures. A universal, all-in-one database remains unfulfilled due to optimization issues and unique challenges of different database types.

What Postgres Full Text Search Is Missing

Companies are evaluating Elasticsearch versus native Postgres full text search for text data management. Postgres FTS offers simplicity, while Elasticsearch provides advanced features but lacks reliability as a primary data store.

MariaDB Introduces Open-Source Vector Preview

MariaDB 11.6 introduces a public preview of its open-source Vector search feature, utilizing the HNSW algorithm to support large language models, aiming to compete with MySQL and attract developers.

Postgres as a Search Engine

Postgres can function as a search engine by integrating full-text, semantic, and fuzzy search techniques, enhancing retrieval quality and allowing for effective ranking and relevance tuning within existing databases.

300μs typo correction for 1.3M words

Trieve has enhanced its typo correction system, achieving 300 microseconds for correct queries and 5 milliseconds for misspellings, utilizing ClickHouse, BKTree, and Redis for improved search efficiency.

6 comments

By @refibrillator - 5 months

> Pgvector is very slow, seconds to 10’s of seconds, on filter and order by queries

This is an absurd claim to make with no qualification on index type, number of rows, etc. Really undermines the authority of the post, even though the other points are fair albeit obvious from the docs (pgvector has no keyword or hybrid search, negated words, or match highlighting).

In comparison pgvecto.rs does look more promising in some of these aspects, some docs here: https://docs.pgvecto.rs/faqs/comparison-pgvector.html

By @nikcub - 5 months

pgvectorscale from the timescaledb team introduces StreamingDiskANN indexes and compression to pgvector:

https://github.com/timescale/pgvectorscale

the sales blurb:

> On a benchmark dataset of 50 million Cohere embeddings with 768 dimensions each, PostgreSQL with pgvector and pgvectorscale achieves 28x lower p95 latency and 16x higher query throughput compared to Pinecone's storage optimized (s1) index for approximate nearest neighbor queries at 99% recall, all at 75% less cost when self-hosted on AWS EC2.

been using it in dev (~2.5M embeddings to date) and so far, pretty great

edit: HN announcement + thread from a couple of months ago:

https://news.ycombinator.com/item?id=40646276

By @viraptor - 5 months

I'm confused by this one:

> Postgres pgvector does not come up with a way to highlight the keyword snippets being matched on.

Yeah... That's just not what it's for. But you can do that post processing yourself if you want to - there's lots of different ways to achieve it. I'm not sure that's really a missing feature, since it doesn't belong to a vector database.

By @hakanderyal - 5 months

Filtering issue also prevents shared database multi-tenant setups, which is a blocker by itself for many.

I would love to not having to run another service for vector search, but it's just not there yet.

By @avthar - 5 months

While I value the depth in which the blog authors went into, some of these are not problems with pgvector but problems with similarity search itself and would be problems with other vector databases as well. There are also solutions to the problems describe that the authors do not mention that I think any reader should take into account.

> 1. Required and Negated Words

This is not a bug of pgvector but a bug in embedding models and simple similarity search itself. You'd run into this issue on when doing RAG with any vectordb or ANN search library. You could probably solve this with query expansion, running multiple similarity searches in parallel, and doing filtering of results containing the required theme vs trying to get this all in a single search.

> 2. Explainability With Highlights

This is a misunderstanding of the purpose of embeddings based semantic search. The point of semantic search is not to match on exact keywords but to match on the /meaning/. If you want exact keyword matching, use full text search. This is something that can also be solved by hybrid search, combining full text search and semantic search and using a re-ranker. PostgreSQL has built in FTS with tsvector.

> 3. Performant Filters and Order By’s

The authors do not disclose any details about what index they use here or what kind of filtering they are trying to do. As other commenters point out, the StreamingDiskANN in the pgvectorscale extension [0] (complement to pgvector, you can use them together) improves on performance and accuracy of filtering vs pgvector HNSW (see details in [1]).

>4. Support for sparse vectors, BM25, and other inverse document frequency search modes

This is probably the most fair point in the post. But there exists projects like pg_search from ParadeDB which bring bm_25 to Postgres and help solve this [2]

Lastly, I respect companies trying to provide real-world examples of the trade-offs of different systems, and so thank the authors for sharing their experience and spurring discussion.

Disclaimer: I work at Timescale, where we offer pgvector, and also made other extensions for AI/ vector workloads on PostgreSQL, namely pgvectorscale, and pgai. I've tried to be as even in my analysis as possible but as with everything on the internet, you can make up your own mind and decide for yourself.

[0]: https://github.com/timescale/pgvectorscale/ [1]: https://www.timescale.com/blog/how-we-made-postgresql-as-fas... [2] https://github.com/paradedb/paradedb