What Postgres Full Text Search Is Missing
Companies are evaluating Elasticsearch versus native Postgres full text search for text data management. Postgres FTS offers simplicity, while Elasticsearch provides advanced features but lacks reliability as a primary data store.
Over the past six months, discussions with over 50 companies have highlighted the choice between Elasticsearch and native Postgres full text search (FTS) for managing text data in Postgres. Full text search identifies entries based on keywords, using algorithms like BM25 for ranking. Postgres FTS is integrated into the database, offering simplicity, real-time search capabilities, and reliable results thanks to its ACID transactions. However, it has limitations, including a lack of advanced features and performance issues with large datasets.
In contrast, Elasticsearch provides a comprehensive feature set and high performance, capable of querying billions of rows quickly. Yet it is not a reliable primary data store due to its lack of ACID compliance, necessitating complex ETL processes that can lead to data freshness issues and increased operational costs. Alternative search engines like Algolia and Meilisearch are available but face similar challenges when integrated with Postgres. ParadeDB emerges as a potential solution, combining the strengths of both Postgres FTS and Elasticsearch by embedding an advanced search engine within Postgres, thus avoiding the need for additional infrastructure. Compatibility with managed Postgres services is forthcoming.
- Companies are weighing Elasticsearch against native Postgres FTS for text search solutions.
- Postgres FTS offers simplicity and real-time search but lacks advanced features and struggles with large datasets.
- Elasticsearch excels in performance and features but poses risks as a primary data store and incurs high operational costs.
- Alternative search engines face similar integration challenges with Postgres.
- ParadeDB aims to provide a hybrid solution by embedding advanced search capabilities directly into Postgres.
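For readers who haven't used it, native Postgres FTS in its simplest form looks roughly like this from application code. This is a minimal sketch, assuming psycopg2, a running Postgres instance, and a hypothetical articles(title, body) table; it is not code from the article.

```python
# Minimal sketch of a native Postgres FTS query from Python (hypothetical table/DSN).
import psycopg2

conn = psycopg2.connect("dbname=app")  # assumed connection string
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT title,
               ts_rank(to_tsvector('english', body),
                       plainto_tsquery('english', %s)) AS rank
        FROM articles
        WHERE to_tsvector('english', body) @@ plainto_tsquery('english', %s)
        ORDER BY rank DESC
        LIMIT 10
        """,
        ("charities in new orleans", "charities in new orleans"),
    )
    for title, rank in cur.fetchall():
        print(f"{rank:.4f}  {title}")
```

Everything happens inside the database, which is the simplicity the article credits Postgres FTS with; the limitations discussed below are about how that rank is computed.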
Related
Just Use Postgres for Everything
The article promotes using Postgres extensively in tech stacks to simplify development, improve scalability, and reduce operational complexity. By replacing various technologies with Postgres, developers can enhance productivity, focus on customer value, and potentially cut costs.
Just Use Postgres for Everything
The blog post advocates for using PostgreSQL extensively in tech stacks to simplify development, improve productivity, and reduce complexity. It highlights benefits like scalability, efficiency, and cost-effectiveness, promoting a consolidated approach.
DuckDB Meets Postgres
Organizations shift historical Postgres data to S3 with Apache Iceberg, enhancing query capabilities. ParadeDB integrates Iceberg with S3 and Google Cloud Storage, replacing DataFusion with DuckDB for improved analytics in pg_lakehouse.
Postgres vs. Pinecone
Postgres and Pinecone differ in performance and cost. Pinecone criticizes Postgres for index issues, while Postgres showcases superior performance with tweaks, specialized indexes, and cost-effectiveness, offering transparency and customization.
Is an All-in-One Database the Future?
Specialized databases are emerging to tackle complex data challenges, leading to intricate infrastructures. A universal, all-in-one database remains unfulfilled due to optimization issues and unique challenges of different database types.
- Elasticsearch is favored for its advanced search capabilities and features, but concerns about its reliability as a primary data store persist.
- Postgres is seen as limited for complex search needs, often requiring additional tools for better search quality.
- Many users advocate for using dedicated search tools like Solr or Meilisearch instead of relying solely on Postgres or Elasticsearch.
- Consultants emphasize the importance of understanding search quality and its impact on business models.
- There is a growing interest in alternatives and integrations, such as using Elasticsearch alongside Postgres or exploring new technologies like ParadeDB.
BM25 is similar to TF/IDF. In both cases, the key idea is to consider statistics of the overall corpus as part of relevance calculations. If the user searches for "charities in new orleans" in a corpus where "new orleans" is only represented in a few documents, those should clearly rank highly. If the corpus has "new orleans" in almost every document then the term "charity" is more important.
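As a toy illustration of that corpus-statistics idea (an IDF-style weight only, not the full BM25 formula, computed over a made-up corpus):

```python
# Terms that appear in fewer documents get a higher weight; toy corpus for illustration.
import math

corpus = [
    "charity gala in new orleans",
    "new orleans jazz festival",
    "new orleans restaurant guide",
    "national charity registry",
]
N = len(corpus)

def idf(term):
    # Document frequency: in how many documents does the term appear?
    df = sum(term in doc.split() for doc in corpus)
    # BM25-style IDF: rare terms score higher than ubiquitous ones.
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

for term in ("charity", "orleans"):
    print(term, round(idf(term), 3))
# "orleans" is in 3 of 4 documents here, so "charity" carries more weight.
```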
PostgreSQL FTS cannot do this, because it doesn't maintain statistics for word frequencies across the entire corpus. This severely limits what it can implement in terms of relevance scoring - each result is scored based purely on whether the search terms are present or not.
For comparison, SQLite FTS (which a lot of people are unaware of) actually does implement full index statistics, and SQLite FTS5 implements BM25 out of the box.
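For instance, a minimal sketch using Python's bundled sqlite3 module shows bm25() available out of the box (assuming the interpreter's SQLite was compiled with FTS5, as most builds are):

```python
# SQLite FTS5 exposes BM25 ranking directly; sample documents are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE VIRTUAL TABLE docs USING fts5(body);
INSERT INTO docs(body) VALUES
  ('charity gala in new orleans'),
  ('new orleans jazz festival'),
  ('national charity registry');
""")
# bm25(docs) returns lower (more negative) values for better matches,
# so sorting ascending puts the best matches first.
rows = conn.execute(
    "SELECT body, bm25(docs) FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)",
    ("charity",),
).fetchall()
for body, score in rows:
    print(f"{score:8.4f}  {body}")
```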
https://web.archive.org/web/20200815141031/https://roamanaly...
pg_search [0], not to be confused with the Ruby gem of the same name [1]:
[0] - https://github.com/paradedb/paradedb/tree/dev/pg_search#over...
If the order of the search results matters to you, you might want something that gives you some more tools to control that. And if you are not measuring search quality to begin with, you probably don't care enough to even know that you are missing the tools to do a better job.
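For a sense of what "measuring search quality" can mean at its simplest, here is a sketch of mean reciprocal rank over a handful of judged queries; the judgments and the stubbed-out search function are made up for illustration.

```python
# Mean reciprocal rank: for each query, 1 / rank of the first relevant result, averaged.
def mrr(judged_queries, search):
    total = 0.0
    for query, relevant_ids in judged_queries.items():
        results = search(query)  # your search backend, returning ranked doc ids
        rank = next((i for i, doc_id in enumerate(results, start=1)
                     if doc_id in relevant_ids), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(judged_queries)

# Hypothetical usage with hand-labelled judgments and a stubbed search function.
judgments = {"red running shoes": {"sku-12", "sku-98"}, "rain jacket": {"sku-7"}}
print(mrr(judgments, lambda q: ["sku-98", "sku-3", "sku-7"]))
```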
I consult clients on this stuff professionally and I've seen companies do all sorts of silly shit. Mostly it's because they simply lack the in-house expertise, which is usually how they end up talking to me.
I've actually had to sit clients down and explain to them their own business model. Usually I come in for some technical problem and then end up talking to product managers or senior managers about stuff like this, because the real problem is usually at that level. The technical issues are just a symptom.
Here's a discussion I had with a client fairly recently (paraphrasing/exaggerating, obviously):
"Me: So your business model is that your users find shit on your web site (it has a very prominent search box at the top) and then some transaction happens that causes you to make money? Customer: yes. Me: so you make more money if your search works better and users find stuff they want. Customer: yes, we want to make more money! Me: congratulations, you are a search company! Customer: LOL whut?! Me: So, why aren't you doing the kinds of things that other search companies do to ensure you maximize profit? Like measuring how good your search is or generally giving a shit whether users can actually find what they are looking for. Customer: uhhhhh ???? Me: where's your search team? Customer: oh we don't have one, you do it! Me: how did you end up with what you currently have. Customer: oh that guy (some poor, overworked dev) over there picked solution X 3 years ago and we never gave it a second thought."
Honestly, some companies can't be helped, and this was an example of a company that was kind of hopelessly flailing around and doing very suboptimal things at all levels, wasting lots of time and money in the process. Not realizing your revenue and competitiveness are literally defined by your search quality is never a good sign. You take different decisions if you do.
- the full-text tool can and should hold only 'active' data
- as it has only active data, the data size is usually much, much smaller
- as the data size is smaller, it fits better in RAM
- as the data size is smaller, it can probably run on weaker hardware than the full ACID DB
- as the indexed data are mostly read-only, the VM it runs on can be cloned relatively easily (I've never seen a corruption so far)
- as FTS tools are usually schema-less, there is no outage during schema changes (compared to doing changes in an ACID DB)
- as the indexed data are mostly read-only, they can be backed up easily
- as the backups are smaller, restoring a backup can be very fast
- and there is no such thing as a database upgrade outage: you just spin up a new version, feed it the new data and then switch the backends over
- functionality and extensibility
There is probably more, but if you don't need to do full-text search over the whole database (and you usually don't), then it's IMHO better to use a separate tool that doesn't come with all the ACID constraints. Probably the only downside is that you need to format the data for the FTS tool and index it, but if you want to run a serious full-text search, you will have to take almost the same steps in the database anyway.
On a 15-year-old side project, I use SOLR for full-text search, serving 20-30k requests per day on a cheap VM, with PostgreSQL as the primary data source. PostgreSQL has had several longer outages - during major upgrades, because of disk corruption, because of failed schema migrations, because of 'problems' between the chair and the keyboard, etc. During those outages the full-text search always worked - it didn't have the most recent data, but most users probably never noticed.
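For a sense of what the "format the data and index it" step looks like in practice, here is a minimal sketch of pushing active rows from Postgres into an external Elasticsearch index. The products table, index name, and connection details are assumptions for illustration; the same shape applies to SOLR or any other external engine.

```python
# Sketch of syncing "active" rows from Postgres into an external full-text index.
import psycopg2
from elasticsearch import Elasticsearch, helpers

pg = psycopg2.connect("dbname=app")          # assumed connection string
es = Elasticsearch("http://localhost:9200")  # assumed search cluster

def active_products():
    with pg.cursor() as cur:
        cur.execute("SELECT id, name, description FROM products WHERE active")
        for row_id, name, description in cur:
            # Shape each row into the document the search engine will index.
            yield {
                "_index": "products",
                "_id": row_id,
                "_source": {"name": name, "description": description},
            }

# Bulk-index; in a real pipeline this runs on a schedule or off a change feed,
# which is exactly the ETL/freshness trade-off the article describes.
helpers.bulk(es, active_products())
```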
I’ll admit I haven’t kept up with this, but is it still the case that Elasticsearch is “not a reliable data store”?
I remember there used to be a line in the Elasticsearch docs saying that Elasticsearch shouldn’t be your primary data store, or something to that effect. At some point they removed that verbiage, seemingly indicating more confidence in its reliability, but I still hear people sticking with the previous guidance.
One clarification question - the blog post lists "lack of ACID transactions and MVCC can lead to data inconsistencies and loss, while its lack of relational properties and real-time consistency makes many database queries challenging" as the downside of Elasticsearch. What is pg_bm25's consistency model? It had been mentioned previously as offering "weak consistency" [0], which I interpret to mean it has the same problems with transactions, MVCC, etc.?
Elasticsearch/OpenSearch offer more than just search functionality; they come with extensive ecosystems, including ELK for logging and processors for ETL pipelines. These platforms are powerful and provide out-of-the-box solutions for developers. However, as the article mentions, scaling them up or out can be costly.
ParadeDB looks like it is based on Tantivy (https://github.com/quickwit-oss/tantivy), which is an impressive project. We leverage it to implement full-text indexing for GreptimeDB too (https://github.com/GreptimeTeam/greptimedb). Nevertheless, building full-text indexes with Tantivy is still resource-intensive. This is why Grafana Loki, which only indexes tags instead of building full-text indexes, is popular in the observability space. In GreptimeDB, we offer users the flexibility to choose whether to build full-text indexes for text fields, while always creating inverted indexes for tags.
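As a toy illustration of the tag-only versus full-text trade-off described here (purely illustrative, not GreptimeDB, Loki, or Tantivy code), compare how many postings each approach creates for the same two log lines:

```python
# Toy comparison of a tag-only inverted index vs a full-text inverted index.
from collections import defaultdict

logs = [
    {"tags": {"app": "checkout", "level": "error"},
     "body": "payment gateway timeout after 30s for order 8812"},
    {"tags": {"app": "checkout", "level": "info"},
     "body": "order 8813 confirmed and receipt emailed"},
]

tag_index = defaultdict(set)   # (key, value) -> log positions
text_index = defaultdict(set)  # token -> log positions

for i, log in enumerate(logs):
    for kv in log["tags"].items():
        tag_index[kv].add(i)
    for token in log["body"].split():
        text_index[token].add(i)

# The full-text index has an entry per distinct token, the tag index only per label.
print(len(tag_index), "tag postings vs", len(text_index), "token postings")
```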
oh no…