August 6th, 2024

What Postgres Full Text Search Is Missing

Companies are evaluating Elasticsearch versus native Postgres full text search for text data management. Postgres FTS offers simplicity, while Elasticsearch provides advanced features but lacks reliability as a primary data store.

Read original articleLink Icon
ConcernFrustrationSkepticism
What Postgres Full Text Search Is Missing

Over the past six months, discussions with over 50 companies have highlighted the choice between Elasticsearch and native Postgres full text search (FTS) for managing text data in Postgres. Full text search identifies entries based on keywords, utilizing algorithms like BM25 for ranking. Postgres FTS is integrated into the database, offering simplicity, real-time search capabilities, and reliable results due to its ACID transactions. However, it has limitations, including a lack of advanced features and performance issues with large datasets. In contrast, Elasticsearch provides a comprehensive feature set and high performance, capable of querying billions of rows quickly. Yet, it is not a reliable primary data store due to its lack of ACID compliance, necessitating complex ETL processes that can lead to data freshness issues and increased operational costs. Alternative search engines like Algolia and Meilisearch are available but face similar challenges when integrated with Postgres. ParadeDB emerges as a potential solution, combining the strengths of both Postgres FTS and Elasticsearch by embedding an advanced search engine within Postgres, thus avoiding the need for additional infrastructure. Compatibility with managed Postgres services is forthcoming.

- Companies are weighing Elasticsearch against native Postgres FTS for text search solutions.

- Postgres FTS offers simplicity and real-time search but lacks advanced features and struggles with large datasets.

- Elasticsearch excels in performance and features but poses risks as a primary data store and incurs high operational costs.

- Alternative search engines face similar integration challenges with Postgres.

- ParadeDB aims to provide a hybrid solution by embedding advanced search capabilities directly into Postgres.

AI: What people are saying
The discussion around Elasticsearch versus Postgres for full-text search reveals several key insights and common themes.
  • Elasticsearch is favored for its advanced search capabilities and features, but concerns about its reliability as a primary data store persist.
  • Postgres is seen as limited for complex search needs, often requiring additional tools for better search quality.
  • Many users advocate for using dedicated search tools like Solr or Meilisearch instead of relying solely on Postgres or Elasticsearch.
  • Consultants emphasize the importance of understanding search quality and its impact on business models.
  • There is a growing interest in alternatives and integrations, such as using Elasticsearch alongside Postgres or exploring new technologies like ParadeDB.
Link Icon 22 comments
By @simonw - 6 months
The BM25 thing is actually a really big deal.

BM25 is similar to TF/IDF. In both cases, the key idea is to consider statistics of the overall corpus as part of relevance calculations. If the user searches for "charities in new orleans" in a corpus where "new orleans" is only represented in a few documents, those should clearly rank highly. If the corpus has "new orleans" in almost every document then the term "charity" is more important.

PostgreSQL FTS cannot do this, because it doesn't maintain statistics for word frequencies across the entire corpus. This severely limits what it can implement in terms of relevance scoring - each result is scored based purely on if the search terms are present or not.

For comparison, SQLite FTS (which a lot of people are unaware of) actually does implement full index statistics, and SQLite FTS5 implements BM25 out of the box.

By @ashconnor - 6 months
Facets are possible in Postgres but it looks complex:

https://web.archive.org/web/20200815141031/https://roamanaly...

pg_search [0] not to be confused with the ruby gem of the same name [1]:

[0] - https://github.com/paradedb/paradedb/tree/dev/pg_search#over...

[1] - https://github.com/Casecommons/pg_search

By @jillesvangurp - 5 months
If you care about search quality, postgres ends up being fairly limited. I've done stuff with it; it's alright for really simple stuff. But it's the wrong tool for the job generally.

If the order of the search results matters to you, you might want something that gives you some more tools to control that. And if you are not measuring search quality to begin with, you probably don't care enough to even know that you are missing the tools to do a better job.

I consult clients on this stuff professionally and I've seen companies do all sorts of silly shit. Mostly it's because they simply lack the in house expertise which is usually how they end up talking to me.

I've actually had to sit clients down and explain them their own business model. Usually I come in for some technical problem and then end up talking to product managers or senior managers about stuff like this because usually the real problem is at that level. The technical issues are just a symptom.

Here's a discussion I had with a client fairly recently (paraphrasing/exaggerating, obviously):

"Me: So your business model is that your users find shit on your web site (it has a very prominent search box at the top) and then some transaction happens that causes you to make money? Customer: yes. Me: so you make more money if your search works better and users find stuff they want. Customer: yes, we want to make more money! Me: congratulations, you are a search company! Customer: LOL whut?! Me: So, why aren't you doing the kinds of things that other search companies do to ensure you maximize profit? Like measuring how good your search is or generally giving a shit whether users can actually find what they are looking for. Customer: uhhhhh ???? Me: where's your search team? Customer: oh we don't have one, you do it! Me: how did you end up with what you currently have. Customer: oh that guy (some poor, overworked dev) over there picked solution X 3 years ago and we never gave it a second thought."

Honestly, some companies can't be helped and this was an example of a company that was kind of hopelessly flailing around and doing very sub optimal things at all levels in the company. And wasting lots of time and money in the process. Not realizing your revenue and competitiveness are literally defined by your search quality is never a good sign. You take different decisions if you do.

By @samsk - 6 months
For internet user facing full-text search I would always prefer to use a separate tool and not a SQL database, because

- the fulltext tool, can and should hold only 'active' data

- as it has only active data, data size is usually much much smaller

- as data size is smaller, it better fits in RAM

- as data size is smaller, it can be probably run on poorer HW the full ACID db

- as the indexed data are mostly read-only, the VM where it runs can be relatively easily cloned (never seen a corruption till now)

- as FTS tools are usually schema-less, there is no outage during schema changes (compared to doing changes in ACID db)

- as the indexed data are mostly read-only, the can be easily backup-ed

- as the backups are smaller, restoring a backup can be very fast

- and there is no such thing as database upgrade outage, you just spin a new version, feed it with new data and than change the backends

- functionality and extensibility

There is probably more, but if one doesn't needs to do a fulltext search on whole database (and you usually don't), than its IMHO better to use separate tool, that doesn't comes with all the ACID constraints. Probably only downside is that you need to format data for the FTS and index them, but if you want run a serious full-text search, you will have to take almost the same steps in the database.

On a 15y old side project, I use SOLR for full-text search, serving 20-30k/request per day on a cheap VM, and PostgreSQL is used as primary data source. The PostgreSQL has had several longer outages - during major upgrades, because of disk corruption, because of failed schema migrations, because of 'problems' between the chair and keyboard etc... During that outages the full-text search always worked - it didn't had most recent data, but most users probably never noticed.

By @radpanda - 6 months
> Because Elasticsearch is not a reliable data store, organizations that use Postgres typically extract, transform, and load (ETL) data from Postgres to Elasticsearch

I’ll admit haven’t kept up with this but is it still the case that Elasticsearch is “not a reliable data store”?

I remember there used to be a line in the Elasticsearch docs saying that Elasticseach shouldn’t be your primary data store or something to that effect. At some point they removed that verbiage, seemingly indicating more confidence in their reliability but I still hear people sticking with the previous guidance.

By @sgift - 5 months
Solr. The correct answer is Solr. All the search features of Elasticsearch (both are built on Lucene, which provides the actual search implementation) without the problems that Elastic (the company) or Elasticsearch (the product) brings with it. 99% of companies using Elasticsearch would be far better of with Solr.
By @glintik - 5 months
Postgres has very limited features related to FTS, so it is useful only for very simple cases or for very limited budget. Elasticsearch is stable enough(but good devops and devs needed) and has numerous of features. Anyway Elasticsearch is not a "plug and play" solution and requires much work to be done for good search quality. Why I know this all? I'm running ecommerce search startup, based on Elasticsearch as primary storage.
By @alanwli - 6 months
Always great to see Postgres-based alternatives.

One clarification question - the blog post lists "lack of ACID transactions and MVCC can lead to data inconsistencies and loss, while its lack of relational properties and real-time consistency makes many database queries challenging" as the bad for ElasticSearch. What is pg_bm25's consistency model? It had been mentioned previously as offering "weak consistency" [0], which I interpret to have the same problems with transactions, MVCC, etc?

[0]: https://news.ycombinator.com/item?id=37864089

By @dewey - 5 months
I've had good experiences with both Postgres + ES and also Postgres + Meilisearch. The latter one I only used for small side projects but it was very easy to get running (With a RoR app).
By @coding123 - 6 months
The analytical part is what usually makes the Elasticsearch (or opensearch) the choice - it's not that you have to search just the text, its that with that text you need to show 5 different categories some of the top values and the number of occurrences and permutations thereof if the user decides to clicky clicky and filter it more.
By @sideway - 5 months
Semi-related: I guess some companies have already started augmenting their search with the use of LLMs. If you've worked on a similar project, what is your (very) high-level architecture? Did you see a noticeable difference in relevance and query time?
By @killme2008 - 6 months
I believe the primary issue with using full-text search in PostgreSQL is its performance and scalability limitations.

Elasticsearch/OpenSearch offers more than just search functionality; they come with extensive ecosystems, including ELK for logging and processors for ETL pipelines. These platforms are powerful and provide out-of-the-box solutions for developers. However, as the article mentioned, scaling them up or out can be costly.

ParadeDB looks like is based on tantivy(https://github.com/quickwit-oss/tantivy) which is an impressive project. We leverage it to implement full-text indexing for GreptimeDB too (https://github.com/GreptimeTeam/greptimedb). Nevertheless, building full-text indexes with Tantivy is still resource-intensive. This is why Grafana Loki, which only indexes tags instead of building full-text indexes, is popular in the observability space. In GreptimeDB, we offer users the flexibility to choose whether to build full-text indexes for text fields, while always creating inverted indexes for tags.

By @demilich - 5 months
https://github.com/infiniflow/infinity, dense vector + sparse vector + fulltext search(BM25) + late interact reranker(Colbert)
By @j45 - 6 months
Solr still seems to be an option
By @Sytten - 5 months
AWS RDS support is key IMO, I have done a lot of consulting on infrastructure and most clients I had don't want to have third party services and don't want to manage their DBs.
By @haolez - 6 months
Are there success cases of using ElasticSearch/OpenSearch as the primary data store in production? Just curious.
By @izietto - 6 months
Is this distributed as a PostgreSQL extension as well?
By @corytheboyd - 6 months
> We’ve talked to many companies who have tried and regretted their decision to use Elasticsearch as their primary data store

oh no…