pg_duckdb: Splicing Duck and Elephant DNA
MotherDuck launched pg_duckdb, an open-source extension integrating DuckDB with Postgres to enhance analytical capabilities while maintaining transactional efficiency, supported by a consortium of companies and community contributions.
MotherDuck has announced the launch of pg_duckdb, an open-source extension that integrates DuckDB's analytics engine with Postgres. While Postgres excels at transactional workloads, it struggles with complex analytical queries; DuckDB addresses this with a vectorized query engine that processes data efficiently. The collaboration involves a consortium of companies, including DuckDB Labs, MotherDuck, Hydra, Neon, and Microsoft, each contributing expertise to ensure the extension works seamlessly within the Postgres environment. The project aims to give users both transactional and analytical functionality without compromising performance. The pg_duckdb extension will support querying data stored in the cloud as well as in local databases, simplifying data management and analytics. Key features in development include compatibility with Postgres data types and functions, as well as improved integration with data lakes. The initiative emphasizes open collaboration and invites community contributions to refine the extension.
- MotherDuck has launched pg_duckdb, an extension combining DuckDB with Postgres.
- The extension aims to enhance Postgres's analytical capabilities while maintaining transactional efficiency.
- A consortium of companies is collaborating to ensure seamless integration and functionality.
- Key features include cloud data querying and compatibility with Postgres data types and functions.
- The project encourages community contributions and feedback for ongoing development.
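As a rough illustration of the cloud-querying feature described above, here is a hypothetical sketch of reading a Parquet file in S3 from a Postgres session with pg_duckdb loaded. The read_parquet table function, its column-definition list, and the bucket path are assumptions modeled on DuckDB's API, not confirmed pg_duckdb syntax:

```sql
-- Hypothetical sketch: run an analytical query over a Parquet file in S3
-- without copying it into Postgres first. read_parquet and its
-- column-definition list are assumptions modeled on DuckDB's API.
SELECT count(*) AS orders, avg(amount) AS avg_amount
FROM read_parquet('s3://example-bucket/orders/2024.parquet')
     AS (order_id int, amount numeric);
```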
Related
DuckDB Community Extensions
The DuckDB team launched the DuckDB Community Extensions repository for easy extension installation. Users benefit from a simplified process, while developers can streamline publication tasks. Security measures include code vetting options.
DuckDB Meets Postgres
Organizations shift historical Postgres data to S3 with Apache Iceberg, enhancing query capabilities. ParadeDB integrates Iceberg with S3 and Google Cloud Storage, replacing DataFusion with DuckDB for improved analytics in pg_lakehouse.
Major Developments in Postgres Extension Discovery and Distribution
The article covers advancements in Postgres extension discovery and distribution. Postgres extensions enhance database capabilities with features like query hints and encryption. PGXN facilitates extension access. A summit in Vancouver will address extension challenges, encouraging developer involvement for ecosystem enhancement.
Memory Management in DuckDB
DuckDB optimizes query processing with effective memory management, using a streaming execution engine and disk spilling for large datasets. Its buffer manager enhances performance by caching frequently accessed data.
Does PostgreSQL respond to the challenge of analytical queries?
PostgreSQL has advanced in handling analytical queries with foreign data wrappers and partitioning, improving efficiency through optimizer enhancements, while facing challenges in pruning and statistical data. Ongoing community discussions aim for further improvements.
- Users express enthusiasm for integrating DuckDB with Postgres but highlight concerns about the complexity and maturity of the Postgres extension framework.
- There are requests for improved SQL functionalities in Postgres, such as support for time_bucket and AsOf, which are favored in DuckDB.
- Questions arise regarding the data handling of pg_duckdb, specifically whether it duplicates data or enhances Postgres capabilities.
- Some commenters emphasize the need for stability and production readiness of pg_duckdb, comparing it to other analytical databases.
- Suggestions for future enhancements include better integration with external data formats like Parquet and Iceberg, and the development of columnstore capabilities within Postgres.
I don't enjoy SQL and never have, and I much prefer the ergonomics of time_bucket to date_bin.
For example, I would do this in DuckDB:
SELECT
count(*) as y
, time_bucket(interval '2 weeks', at::timestamp) as x
FROM analytics
WHERE some_bool AND some_haystack = 'needle'
GROUP BY x
ORDER BY x
In Postgres it looks more like:
WITH counts AS (
  SELECT
    date_bin('1 hour'::interval, at, (now() - interval '2 weeks')::timestamp)
    , count(*) c
  FROM analytics a
  WHERE some_bool AND some_haystack = 'needle'
  GROUP BY date_bin
)
SELECT series AS x, coalesce(counts.c, 0) AS y
FROM generate_series(
  (now() - interval '2 weeks')::timestamp,
  now()::timestamp,
  interval '1 hour'
) series
LEFT JOIN counts
  ON counts.date_bin = series;
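For what it's worth, the generate_series/LEFT JOIN gap-filling dance above is what functions like TimescaleDB's time_bucket_gapfill collapse into a single grouped query. A sketch, assuming TimescaleDB is installed and using its documented signature:

```sql
-- Sketch assuming the TimescaleDB extension: time_bucket_gapfill emits a
-- row for every hourly bucket in the WHERE-clause time range, with NULL
-- aggregates for empty buckets (hence the coalesce).
SELECT time_bucket_gapfill(interval '1 hour', at) AS x,
       coalesce(count(*), 0) AS y
FROM analytics
WHERE some_bool
  AND some_haystack = 'needle'
  AND at >= now() - interval '2 weeks'
  AND at < now()
GROUP BY x
ORDER BY x;
```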
We recently submitted our benchmark results for Crunchy Bridge for Analytics (at the broadest level, built on the same idea) to ClickBench, ClickHouse's benchmark (https://benchmark.clickhouse.com/), which puts us at #6 overall among managed service providers and makes Postgres a genuinely viable option as an analytics database (at least per ClickBench). Also of note, there are a number of other Postgres variants, such as ParadeDB, that are definitely not 1000x slower than ClickHouse or DuckDB.
https://github.com/duckdb/pg_duckdb
Sounds like it would be useful for Postgres users to interact with Parquet and CSV data within a single SQL query and in a performant way (due to DuckDB's vectorization).
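To make that concrete, a single query joining Parquet against CSV might look something like the sketch below. The read_parquet/read_csv table functions and their column lists are assumptions modeled on DuckDB's table functions, not confirmed pg_duckdb API:

```sql
-- Illustrative only: join a Parquet file against a CSV file in one
-- statement. Function names, signatures, and paths are assumptions.
SELECT c.region, sum(o.amount) AS revenue
FROM read_parquet('s3://example-bucket/orders.parquet')
     AS o(customer_id int, amount numeric)
JOIN read_csv('s3://example-bucket/customers.csv')
     AS c(customer_id int, region text)
  ON o.customer_id = c.customer_id
GROUP BY c.region
ORDER BY revenue DESC;
```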
Here are the scenarios and how to address them:
1. Query Parquet and Iceberg from Postgres. When Parquet files are stored in S3, Postgres should be able to run analytical queries on them.
2. Postgres should allow creation of columnstore tables inside the Postgres storage subsystem. Analytical queries on top of these tables should be FAST: top 10 on ClickBench fast. This allows running analytics without S3, with super low latency.
3. Postgres should allow creation of secondary columnstore indexes to speed up analytical queries in mixed workloads. This is super useful for Oracle migrations, since Oracle has had this feature for a while.
So how do we get there? Ten years ago it would have been a MASSIVE project, but today we have @duckdb, a super fast analytical engine with an open license. The work is still not trivial, but it is much, much simpler.
First you need to integrate an analytical query processor into Postgres and today @duckdblabs announced github.com/duckdb/pg_duck…. Yay and congrats!
This plugin runs DuckDB alongside Postgres and integrates Postgres syntax with the @duckdb query processor (QP).
With that it now can trivially query external files from S3. This addresses scenario 1.
Building columnar tables requires either implementing columnar storage from scratch or integrating DuckDB storage into the Postgres storage subsystem. You can of course let DuckDB create DuckDB files on local disk, but then none of the Postgres machinery (replication, backup, recovery) will work.
DuckDB tables have to be mapped onto 8kb Postgres pages and pushed through the Postgres WAL for replication, recovery, and transactionality. This will give us scenario 2.
Scenario 3 is even more work. You need secondary index maintenance, and it will require hybrid query execution. We will need to modify the Postgres executor so that it can mix and match regular Postgres query operators with "vectorized" query operators from DuckDB, or build vectorized operators into Postgres directly.
Scenarios 2 and 3 will take some time, but I'm excited for this roadmap: this will unlock a huge world for millions of Postgres users and simplify the lives of many developers dealing with moving data between transactional and analytical systems.
As a Postgres fan: good luck and best wishes with the effort here!
It would be great if one could have a diversity of Postgres instances in a data mesh while executing the exact same SQL on all of them.