Postgres Just Cracked the Top 10 Fastest Databases for Analytics
PostgreSQL has been optimized for analytics with pg_mooncake, a new extension that has achieved a Top 10 ClickBench ranking. The extension pairs a columnstore format with DuckDB execution to rival specialized analytics databases.
PostgreSQL has emerged as a strong contender in the analytics database space following the development of pg_mooncake, which has achieved a Top 10 ranking on ClickBench, a benchmark traditionally dominated by specialized analytics databases. Databases like ClickHouse have historically outperformed general-purpose databases such as PostgreSQL and MySQL by significant margins. The creators of pg_mooncake, however, have optimized PostgreSQL to close that gap. They built pg_mooncake as a native PostgreSQL extension that stores data in a columnstore format, so analytical scans read only the columns they need in large sequential chunks rather than row by row. They also embedded DuckDB as the execution engine to provide vectorized execution, processing column data in batches for better throughput. The architecture keeps metadata management directly within PostgreSQL, which speeds up query planning and execution. These advancements show that PostgreSQL can deliver analytics performance comparable to specialized databases while keeping its flexibility and ecosystem benefits. The initial version, v0.1, launched recently and is available on platforms such as Neon Postgres and Supabase.
- PostgreSQL has been optimized to perform well in analytics, achieving a Top 10 ranking on ClickBench.
- pg_mooncake is a native PostgreSQL extension that utilizes a columnstore format for improved data access.
- DuckDB is embedded as the execution engine to enhance query performance through vectorized execution.
- Efficient metadata management within PostgreSQL contributes to faster query planning and execution.
- The advancements position PostgreSQL as a viable alternative to specialized analytics databases.
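As a rough illustration of the workflow described above, here is a minimal sketch of creating and querying a pg_mooncake columnstore table from Python. The "USING columnstore" clause reflects the extension's v0.1 documentation, but the connection string and the events table/columns are hypothetical, not details from the article; check the pg_mooncake README for the exact syntax.

```python
# Minimal sketch: creating and querying a pg_mooncake columnstore table.
# Assumptions: pg_mooncake v0.1 is installed on the server; the DSN and the
# "events" table/columns are illustrative placeholders.
import psycopg2

conn = psycopg2.connect("dbname=analytics user=postgres")  # hypothetical DSN
conn.autocommit = True

with conn.cursor() as cur:
    # Enable the extension (it must already be installed on the server).
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_mooncake;")

    # Columnstore table: data is laid out by column, so analytical scans
    # read only the columns they need, in large sequential chunks.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            user_id    bigint,
            event_type text,
            created_at timestamptz
        ) USING columnstore;
    """)

    # A typical analytical query; with pg_mooncake, execution is expected to
    # be vectorized (DuckDB processes the column data in batches).
    cur.execute("""
        SELECT event_type, count(*) AS n
        FROM events
        WHERE created_at >= now() - interval '7 days'
        GROUP BY event_type
        ORDER BY n DESC;
    """)
    for event_type, n in cur.fetchall():
        print(event_type, n)

conn.close()
```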
Related
Does PostgreSQL respond to the challenge of analytical queries?
PostgreSQL has advanced in handling analytical queries with foreign data wrappers and partitioning, improving efficiency through optimizer enhancements, while facing challenges in pruning and statistical data. Ongoing community discussions aim for further improvements.
pg_duckdb: Splicing Duck and Elephant DNA
MotherDuck launched pg_duckdb, an open-source extension integrating DuckDB with Postgres to enhance analytical capabilities while maintaining transactional efficiency, supported by a consortium of companies and community contributions.
ClickHouse Data Modeling for Postgres Users
ClickHouse acquired PeerDB to enhance PostgreSQL data replication. The article offers data modeling tips, emphasizing the ReplacingMergeTree engine, duplicate management, ordering key selection, and the use of Nullable types.
Crunchy Data Warehouse: Postgres with Iceberg for High Performance Analytics
Crunchy Data has launched Crunchy Data Warehouse, a PostgreSQL-based analytics database that integrates Apache Iceberg, supports ACID transactions, offers over 10x performance improvement, and is available as a managed AWS service.
Reliably Replicating Data Between PostgreSQL and ClickHouse
The article explains how to replicate data from PostgreSQL to ClickHouse using PeerDB, focusing on Change Data Capture for efficient synchronization and offering both self-managed and managed solutions.
We tried using Postgres for large-scale analytics at my previous company https://threekit.com but it was an absolute pain. Basically we started collecting detailed analytics and thus had a rapidly growing table of around 2B records of user events during their sessions. As it grew past 500 million records it turned out to be impossible to query this table in anything close to real-time - it was basically untouchable because it was so slow.
I know I could have used some type of daily aggregation combined with a weekly aggregation, etc., to roll up the data incrementally. A dev tried this and yeah, it hid the slow queries, but then it became inflexible in terms of reporting. And writing and maintaining these cronjobs is a lot of work.
But then I tried using BigQuery on my https://web3dsurvey.com side project and I just recorded raw data and then wrote queries to do real-time aggregation (with a caching layer) in a large variety of ways. And it was near instant and easy and also very cheap.
So then I deployed this strategy over at Threekit.com and it was also easy and fast and cheap. Even more flexible queries than Web3DSurvey at basically no developer or infrastructure cost. No more need for aggregating cron-jobs or trying to decide ahead of time how users want to slice and dice the data. Real-time, flexible queries on the fly via BigQuery is the way.
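For readers unfamiliar with the pattern this comment describes (store raw events, aggregate on demand, cache the result), a rough sketch might look like the following. The project/dataset/table names, the schema, and the cache TTL are hypothetical, not details from the comment.

```python
# Rough sketch of "raw events + on-the-fly aggregation + caching" on BigQuery.
# The project/dataset/table names, columns, and cache TTL are hypothetical.
import time
from google.cloud import bigquery

client = bigquery.Client()          # uses default application credentials
_cache = {}                         # query text -> (timestamp, rows)
CACHE_TTL_SECONDS = 300

def cached_query(sql: str):
    """Run a BigQuery query, reusing a recent result if one is cached."""
    now = time.time()
    hit = _cache.get(sql)
    if hit and now - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    rows = [dict(row) for row in client.query(sql).result()]
    _cache[sql] = (now, rows)
    return rows

# Slice the raw event table any way the report needs, at query time.
rows = cached_query("""
    SELECT browser, COUNT(*) AS sessions
    FROM `my-project.analytics.raw_events`
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
    GROUP BY browser
    ORDER BY sessions DESC
""")
for r in rows:
    print(r["browser"], r["sessions"])
```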
Also, the BigQuery bill for https://web3dsurvey.com is something like $0.25/month and it is dealing with millions of records in its 3-month window of stored data. Whereas just running the cheapest Postgres server on Google Cloud is >$25/month, and it is a slow one.
I would never go back to traditional SQL for analytics - it was hard, slow, expensive and inflexible. Worst of all worlds.
For analytics on transactional data, it looks like you'd still need to use logical replication (https://github.com/Mooncake-Labs/pg_mooncake/issues/90). Logical replication is somewhat similar to an ETL/CDC experience, though it's more Postgres-native. Managing logical replication at a production grade isn't trivial — it's quite common for customers to use PeerDB for homogeneous replication due to performance, manageability and observability issues with logical replication.
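For context on the setup this refers to, a bare-bones logical replication pipeline between two Postgres instances looks roughly like the sketch below. The connection strings and the orders table are placeholders, and the production concerns (slot monitoring, conflict handling, initial sync of large tables) are exactly the parts the comment says are non-trivial.

```python
# Bare-bones Postgres logical replication between two instances (sketch).
# Connection strings and the "orders" table are placeholders; the publisher
# must have wal_level=logical and a user with replication privileges.
import psycopg2

# Transactional (publisher) side: publish the tables to replicate.
pub = psycopg2.connect("host=oltp-db dbname=app user=postgres")
pub.autocommit = True
with pub.cursor() as cur:
    cur.execute("CREATE PUBLICATION analytics_pub FOR TABLE orders;")

# Analytics (subscriber) side: the table must already exist with a matching
# schema; the subscription then streams changes continuously.
sub = psycopg2.connect("host=olap-db dbname=analytics user=postgres")
sub.autocommit = True  # CREATE SUBSCRIPTION cannot run inside a transaction
with sub.cursor() as cur:
    cur.execute("""
        CREATE SUBSCRIPTION analytics_sub
        CONNECTION 'host=oltp-db dbname=app user=replicator password=...'
        PUBLICATION analytics_pub;
    """)
```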
One potential value of extensions is Postgres compatibility for your queries. However, I'm curious about the SQL coverage on columnar tables — for example, advanced constructs like partitioned tables, prepared statements, correlated subqueries, RCTEs, triggers, and more. While it seems there’s ongoing work to improve compatibility, achieving full Postgres compatibility is challenging. In some cases, you might need to fall back to Postgres (instead of DuckDB), sacrificing performance for certain queries.
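To make the coverage question concrete, these are the kinds of constructs in play: a correlated subquery and a recursive CTE over hypothetical users/events tables. Running EXPLAIN is one way to inspect the plan; whether a given query is handled by the DuckDB engine or falls back to the row-based Postgres executor is extension-specific and not claimed here.

```python
# Sketch of the "advanced" SQL whose coverage the comment questions.
# The users/events tables are hypothetical; EXPLAIN only shows the plan,
# and which executor handles it depends on the extension.
import psycopg2

conn = psycopg2.connect("dbname=analytics user=postgres")
with conn.cursor() as cur:
    # Correlated subquery: latest event per user.
    cur.execute("""
        EXPLAIN
        SELECT u.user_id,
               (SELECT max(e.created_at)
                FROM events e
                WHERE e.user_id = u.user_id) AS last_seen
        FROM users u;
    """)
    print("\n".join(row[0] for row in cur.fetchall()))

    # Recursive CTE: a 30-day date series joined against the events table.
    cur.execute("""
        EXPLAIN
        WITH RECURSIVE days AS (
            SELECT date_trunc('day', now()) - interval '30 days' AS d
            UNION ALL
            SELECT d + interval '1 day' FROM days
            WHERE d < date_trunc('day', now())
        )
        SELECT d, count(e.*)
        FROM days
        LEFT JOIN events e ON date_trunc('day', e.created_at) = d
        GROUP BY d;
    """)
    print("\n".join(row[0] for row in cur.fetchall()))
conn.close()
```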
The biggest challenge we faced at Citus was the constant trade-off between Postgres compatibility and performance; both are complex and constantly moving targets. This was the key reason Citus couldn't compete with purpose-built databases like Snowflake, ClickHouse, and SingleStore. While DuckDB didn't exist when we built Citus, it's still fundamentally different from Postgres. Even though the DuckDB dialect is somewhat similar to Postgres (as is Snowflake's), retrofitting two distinct databases, each evolving independently toward the world's best performance, isn't trivial.
In short, relying on ETL (logical replication) without providing full Postgres compatibility raises the question: is it better to lean on a purpose-built database with a laser focus on performance, or adopt a more generalized approach?
Anyway, I come from a different school of thought — using the right tool for the right job. That said, I love seeing all the progress and evolution in the Postgres community — Postgres will always be my first love!
So is it Postgres or DuckDB that cracked the analytics top 10?
I'm trying to understand if there is any potential performance impact on the OLTP workload from running OLAP in the same process.
Unfortunately, using an extension that's not "part" of Postgres (the way, say, pg_stat_statements is) isn't trivial for most people, since there are both technical and legal issues.