What Postgres Full Text Search Is Missing
Companies are evaluating Elasticsearch versus native Postgres full text search for text data management. Postgres FTS offers simplicity, while Elasticsearch provides advanced features but lacks reliability as a primary data store.
Over the past six months, discussions with over 50 companies have highlighted the choice between Elasticsearch and native Postgres full text search (FTS) for managing text data in Postgres. Full text search identifies entries based on keywords, using algorithms like BM25 for ranking. Postgres FTS is integrated into the database, offering simplicity, real-time search capabilities, and reliable results thanks to its ACID transactions. However, it has limitations, including a lack of advanced features and performance issues with large datasets.
In contrast, Elasticsearch provides a comprehensive feature set and high performance, capable of querying billions of rows quickly. Yet it is not a reliable primary data store due to its lack of ACID compliance, necessitating complex ETL processes that can lead to data freshness issues and increased operational costs. Alternative search engines like Algolia and Meilisearch are available but face similar challenges when integrated with Postgres. ParadeDB emerges as a potential solution, combining the strengths of both Postgres FTS and Elasticsearch by embedding an advanced search engine within Postgres, thus avoiding the need for additional infrastructure. Compatibility with managed Postgres services is forthcoming.
- Companies are weighing Elasticsearch against native Postgres FTS for text search solutions.
- Postgres FTS offers simplicity and real-time search but lacks advanced features and struggles with large datasets.
- Elasticsearch excels in performance and features but poses risks as a primary data store and incurs high operational costs.
- Alternative search engines face similar integration challenges with Postgres.
- ParadeDB aims to provide a hybrid solution by embedding advanced search capabilities directly into Postgres.
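For readers who haven't used it, native Postgres FTS in its simplest form looks roughly like this from application code. This is a minimal sketch, assuming psycopg2, a running Postgres instance, and a hypothetical articles(title, body) table; it is not code from the article.

```python
# Minimal sketch of a native Postgres FTS query from Python (hypothetical table/DSN).
import psycopg2

conn = psycopg2.connect("dbname=app")  # assumed connection string
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT title,
               ts_rank(to_tsvector('english', body),
                       plainto_tsquery('english', %s)) AS rank
        FROM articles
        WHERE to_tsvector('english', body) @@ plainto_tsquery('english', %s)
        ORDER BY rank DESC
        LIMIT 10
        """,
        ("charities in new orleans", "charities in new orleans"),
    )
    for title, rank in cur.fetchall():
        print(f"{rank:.4f}  {title}")
```

Everything happens inside the database, which is the simplicity the article credits Postgres FTS with; the limitations discussed below are about how that rank is computed.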
Related
Just Use Postgres for Everything
The article promotes using Postgres extensively in tech stacks to simplify development, improve scalability, and reduce operational complexity. By replacing various technologies with Postgres, developers can enhance productivity, focus on customer value, and potentially cut costs.
Just Use Postgres for Everything
The blog post advocates for using PostgreSQL extensively in tech stacks to simplify development, improve productivity, and reduce complexity. It highlights benefits like scalability, efficiency, and cost-effectiveness, promoting a consolidated approach.
DuckDB Meets Postgres
Organizations shift historical Postgres data to S3 with Apache Iceberg, enhancing query capabilities. ParadeDB integrates Iceberg with S3 and Google Cloud Storage, replacing DataFusion with DuckDB for improved analytics in pg_lakehouse.
Postgres vs. Pinecone
Postgres and Pinecone differ in performance and cost. Pinecone criticizes Postgres for index issues, while Postgres showcases superior performance with tweaks, specialized indexes, and cost-effectiveness, offering transparency and customization.
Is an All-in-One Database the Future?
Specialized databases are emerging to tackle complex data challenges, leading to intricate infrastructures. A universal, all-in-one database remains unfulfilled due to optimization issues and unique challenges of different database types.
- Elasticsearch is favored for its advanced search capabilities and features, but concerns about its reliability as a primary data store persist.
- Postgres is seen as limited for complex search needs, often requiring additional tools for better search quality.
- Many users advocate for using dedicated search tools like Solr or Meilisearch instead of relying solely on Postgres or Elasticsearch.
- Consultants emphasize the importance of understanding search quality and its impact on business models.
- There is a growing interest in alternatives and integrations, such as using Elasticsearch alongside Postgres or exploring new technologies like ParadeDB.
BM25 is similar to TF/IDF. In both cases, the key idea is to consider statistics of the overall corpus as part of relevance calculations. If the user searches for "charities in new orleans" in a corpus where "new orleans" is only represented in a few documents, those should clearly rank highly. If the corpus has "new orleans" in almost every document then the term "charity" is more important.
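As a toy illustration of that corpus-statistics idea (an IDF-style weight only, not the full BM25 formula, computed over a made-up corpus):

```python
# Terms that appear in fewer documents get a higher weight; toy corpus for illustration.
import math

corpus = [
    "charity gala in new orleans",
    "new orleans jazz festival",
    "new orleans restaurant guide",
    "national charity registry",
]
N = len(corpus)

def idf(term):
    # Document frequency: in how many documents does the term appear?
    df = sum(term in doc.split() for doc in corpus)
    # BM25-style IDF: rare terms score higher than ubiquitous ones.
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

for term in ("charity", "orleans"):
    print(term, round(idf(term), 3))
# "orleans" is in 3 of 4 documents here, so "charity" carries more weight.
```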
PostgreSQL FTS cannot do this, because it doesn't maintain statistics for word frequencies across the entire corpus. This severely limits what it can implement in terms of relevance scoring - each result is scored based purely on whether the search terms are present or not.
For comparison, SQLite FTS (which a lot of people are unaware of) actually does implement full index statistics, and SQLite FTS5 implements BM25 out of the box.
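For instance, a minimal sketch using Python's bundled sqlite3 module shows bm25() available out of the box (assuming the interpreter's SQLite was compiled with FTS5, as most builds are):

```python
# SQLite FTS5 exposes BM25 ranking directly; sample documents are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE VIRTUAL TABLE docs USING fts5(body);
INSERT INTO docs(body) VALUES
  ('charity gala in new orleans'),
  ('new orleans jazz festival'),
  ('national charity registry');
""")
# bm25(docs) returns lower (more negative) values for better matches,
# so sorting ascending puts the best matches first.
rows = conn.execute(
    "SELECT body, bm25(docs) FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)",
    ("charity",),
).fetchall()
for body, score in rows:
    print(f"{score:8.4f}  {body}")
```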
https://web.archive.org/web/20200815141031/https://roamanaly...
pg_search [0], not to be confused with the Ruby gem of the same name [1]:
[0] - https://github.com/paradedb/paradedb/tree/dev/pg_search#over...
If the order of the search results matters to you, you might want something that gives you some more tools to control that. And if you are not measuring search quality to begin with, you probably don't care enough to even know that you are missing the tools to do a better job.
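For a sense of what "measuring search quality" can mean at its simplest, here is a sketch of mean reciprocal rank over a handful of judged queries; the judgments and the stubbed-out search function are made up for illustration.

```python
# Mean reciprocal rank: for each query, 1 / rank of the first relevant result, averaged.
def mrr(judged_queries, search):
    total = 0.0
    for query, relevant_ids in judged_queries.items():
        results = search(query)  # your search backend, returning ranked doc ids
        rank = next((i for i, doc_id in enumerate(results, start=1)
                     if doc_id in relevant_ids), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(judged_queries)

# Hypothetical usage with hand-labelled judgments and a stubbed search function.
judgments = {"red running shoes": {"sku-12", "sku-98"}, "rain jacket": {"sku-7"}}
print(mrr(judgments, lambda q: ["sku-98", "sku-3", "sku-7"]))
```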
I consult clients on this stuff professionally and I've seen companies do all sorts of silly shit. Mostly it's because they simply lack the in-house expertise, which is usually how they end up talking to me.
I've actually had to sit clients down and explain to them their own business model. Usually I come in for some technical problem and then end up talking to product managers or senior managers about stuff like this, because the real problem is usually at that level. The technical issues are just a symptom.
Here's a discussion I had with a client fairly recently (paraphrasing/exaggerating, obviously):
"Me: So your business model is that your users find shit on your web site (it has a very prominent search box at the top) and then some transaction happens that causes you to make money? Customer: yes. Me: so you make more money if your search works better and users find stuff they want. Customer: yes, we want to make more money! Me: congratulations, you are a search company! Customer: LOL whut?! Me: So, why aren't you doing the kinds of things that other search companies do to ensure you maximize profit? Like measuring how good your search is or generally giving a shit whether users can actually find what they are looking for. Customer: uhhhhh ???? Me: where's your search team? Customer: oh we don't have one, you do it! Me: how did you end up with what you currently have. Customer: oh that guy (some poor, overworked dev) over there picked solution X 3 years ago and we never gave it a second thought."
Honestly, some companies can't be helped, and this was an example of a company that was kind of hopelessly flailing around and doing very suboptimal things at all levels, wasting lots of time and money in the process. Not realizing your revenue and competitiveness are literally defined by your search quality is never a good sign. You take different decisions if you do.
- the full-text tool can and should hold only 'active' data
- as it has only active data, the data size is usually much, much smaller
- as the data size is smaller, it fits better in RAM
- as the data size is smaller, it can probably run on weaker hardware than the full ACID DB
- as the indexed data are mostly read-only, the VM it runs on can be cloned relatively easily (I've never seen a corruption so far)
- as FTS tools are usually schema-less, there is no outage during schema changes (compared to doing changes in an ACID DB)
- as the indexed data are mostly read-only, they can be backed up easily
- as the backups are smaller, restoring a backup can be very fast
- and there is no such thing as a database upgrade outage: you just spin up a new version, feed it the new data and then switch the backends over
- functionality and extensibility
There is probably more, but if you don't need to do full-text search over the whole database (and you usually don't), then it's IMHO better to use a separate tool that doesn't come with all the ACID constraints. Probably the only downside is that you need to format the data for the FTS tool and index it, but if you want to run a serious full-text search, you will have to take almost the same steps in the database anyway.
On a 15-year-old side project, I use SOLR for full-text search, serving 20-30k requests per day on a cheap VM, with PostgreSQL as the primary data source. PostgreSQL has had several longer outages - during major upgrades, because of disk corruption, because of failed schema migrations, because of 'problems' between the chair and the keyboard, etc. During those outages the full-text search always worked - it didn't have the most recent data, but most users probably never noticed.
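For a sense of what the "format the data and index it" step looks like in practice, here is a minimal sketch of pushing active rows from Postgres into an external Elasticsearch index. The products table, index name, and connection details are assumptions for illustration; the same shape applies to SOLR or any other external engine.

```python
# Sketch of syncing "active" rows from Postgres into an external full-text index.
import psycopg2
from elasticsearch import Elasticsearch, helpers

pg = psycopg2.connect("dbname=app")          # assumed connection string
es = Elasticsearch("http://localhost:9200")  # assumed search cluster

def active_products():
    with pg.cursor() as cur:
        cur.execute("SELECT id, name, description FROM products WHERE active")
        for row_id, name, description in cur:
            # Shape each row into the document the search engine will index.
            yield {
                "_index": "products",
                "_id": row_id,
                "_source": {"name": name, "description": description},
            }

# Bulk-index; in a real pipeline this runs on a schedule or off a change feed,
# which is exactly the ETL/freshness trade-off the article describes.
helpers.bulk(es, active_products())
```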
I’ll admit I haven’t kept up with this, but is it still the case that Elasticsearch is “not a reliable data store”?
I remember there used to be a line in the Elasticsearch docs saying that Elasticsearch shouldn’t be your primary data store, or something to that effect. At some point they removed that verbiage, seemingly indicating more confidence in its reliability, but I still hear people sticking with the previous guidance.
One clarification question - the blog post lists "lack of ACID transactions and MVCC can lead to data inconsistencies and loss, while its lack of relational properties and real-time consistency makes many database queries challenging" as the downside of Elasticsearch. What is pg_bm25's consistency model? It had been mentioned previously as offering "weak consistency" [0], which I interpret to mean it has the same problems with transactions, MVCC, etc.?
Elasticsearch/OpenSearch offer more than just search functionality; they come with extensive ecosystems, including ELK for logging and processors for ETL pipelines. These platforms are powerful and provide out-of-the-box solutions for developers. However, as the article mentions, scaling them up or out can be costly.
ParadeDB looks like it is based on Tantivy (https://github.com/quickwit-oss/tantivy), which is an impressive project. We leverage it to implement full-text indexing for GreptimeDB too (https://github.com/GreptimeTeam/greptimedb). Nevertheless, building full-text indexes with Tantivy is still resource-intensive. This is why Grafana Loki, which only indexes tags instead of building full-text indexes, is popular in the observability space. In GreptimeDB, we offer users the flexibility to choose whether to build full-text indexes for text fields, while always creating inverted indexes for tags.
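As a toy illustration of the tag-only versus full-text trade-off described here (purely illustrative, not GreptimeDB, Loki, or Tantivy code), compare how many postings each approach creates for the same two log lines:

```python
# Toy comparison of a tag-only inverted index vs a full-text inverted index.
from collections import defaultdict

logs = [
    {"tags": {"app": "checkout", "level": "error"},
     "body": "payment gateway timeout after 30s for order 8812"},
    {"tags": {"app": "checkout", "level": "info"},
     "body": "order 8813 confirmed and receipt emailed"},
]

tag_index = defaultdict(set)   # (key, value) -> log positions
text_index = defaultdict(set)  # token -> log positions

for i, log in enumerate(logs):
    for kv in log["tags"].items():
        tag_index[kv].add(i)
    for token in log["body"].split():
        text_index[token].add(i)

# The full-text index has an entry per distinct token, the tag index only per label.
print(len(tag_index), "tag postings vs", len(text_index), "token postings")
```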
oh no…