pg_parquet: An extension to connect Postgres and Parquet
Crunchy Data released pg_parquet, an open-source PostgreSQL extension for reading and writing Parquet files, enabling data export/import, cloud storage integration, and schema inspection for enhanced analytical capabilities.
Crunchy Data has announced pg_parquet, an open-source extension for PostgreSQL that reads and writes Parquet files directly from the database. The extension lets users export tables or query results to Parquet files, ingest data from Parquet files into PostgreSQL, and inspect the schema and metadata of existing Parquet files. Parquet is a columnar file format with efficient compression, which makes it well suited to analytics and to sharing data between systems. pg_parquet extends PostgreSQL so that it integrates with Parquet without additional data pipelines: the COPY command moves data between PostgreSQL and Parquet files stored locally or in cloud object storage such as S3, and dedicated functions describe Parquet schemas and retrieve detailed metadata for data management and analytics. With pg_parquet, PostgreSQL aims to expand its role beyond transactional workloads toward analytical tasks.
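The workflow described above reduces to a few COPY invocations. A minimal sketch based on the announcement's examples (the table and bucket names are hypothetical, and the assumption that S3 credentials are picked up from the standard AWS configuration follows the announcement):

```sql
-- Enable the extension (assumes pg_parquet is installed on the server).
CREATE EXTENSION pg_parquet;

-- Export a table to a local Parquet file.
COPY products TO '/tmp/products.parquet' WITH (format 'parquet');

-- Export directly to S3; credentials come from the standard AWS configuration.
COPY products TO 's3://mybucket/products.parquet' WITH (format 'parquet');

-- Ingest a Parquet file back into a table with a matching schema.
COPY products FROM 's3://mybucket/products.parquet' WITH (format 'parquet');
```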
- pg_parquet is an open-source extension for PostgreSQL to work with Parquet files.
- It allows exporting and importing data between PostgreSQL and Parquet files.
- The extension supports cloud storage integration, particularly with S3.
- Users can inspect Parquet file schemas and metadata directly from PostgreSQL (see the sketch after this list).
- pg_parquet enhances PostgreSQL's capabilities for both transactional and analytical workloads.
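For the inspection piece, the announcement exposes Parquet introspection as set-returning functions in a `parquet` schema. A sketch of how that looks (the URI is hypothetical; treat the exact function names as indicative of the announcement rather than a stable API):

```sql
-- Column names, types, and structure of a Parquet file.
SELECT * FROM parquet.schema('s3://mybucket/products.parquet');

-- Row-group and column-chunk metadata (row counts, compressed sizes, and so on).
SELECT * FROM parquet.metadata('s3://mybucket/products.parquet');
```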
Related
DuckDB Meets Postgres
Organizations shift historical Postgres data to S3 with Apache Iceberg, enhancing query capabilities. ParadeDB integrates Iceberg with S3 and Google Cloud Storage, replacing DataFusion with DuckDB for improved analytics in pg_lakehouse.
Major Developments in Postgres Extension Discovery and Distribution
The article covers advancements in Postgres extension discovery and distribution. Postgres extensions enhance database capabilities with features like query hints and encryption. PGXN facilitates extension access. A summit in Vancouver will address extension challenges, encouraging developer involvement for ecosystem enhancement.
Does PostgreSQL respond to the challenge of analytical queries?
PostgreSQL has advanced in handling analytical queries with foreign data wrappers and partitioning, improving efficiency through optimizer enhancements, while facing challenges in pruning and statistical data. Ongoing community discussions aim for further improvements.
PostGIS Meets DuckDB: Crunchy Bridge for Analytics Goes Spatial
Crunchy Data's update to Crunchy Bridge for Analytics introduces geospatial analytics, allowing users to create analytics tables from datasets via URLs, supporting formats like GeoParquet, and integrating with DuckDB and QGIS.
Show HN: Squey, an open-source GPU-accelerated data visualization software
Squey 5.0 is an open-source visualization software that introduces a Parquet plugin, enabling data import/export, real-time visualizations, and tools for data quality assessment and anomaly detection across various fields.
Comments
A lot of other commenters are talking about `pg_duckdb`, which maybe could also have solved my problem, but this looks quite simple and clean.
I hope for some kind of near-term future where there's some standardish analytics-friendly data archival format. I think Parquet is the closest thing we have now.
Also, how does it compare to pg_duckdb (which adds DuckDB execution to Postgres, including reading Parquet and Iceberg), or duckdb_fdw (which wraps a DuckDB database, which can be in-memory and only pass through Iceberg/Parquet tables)?
What would be the recommended way to regularly export old data to S3 as Parquet files? A cron job that launches a second Postgres process connecting to the database and extracting the data, or using the regular database instance? Doesn't that slow down the instance too much?
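One way to answer this within Postgres itself: since the export is just a COPY, it can be scheduled off-peak with pg_cron instead of a separate process. A hedged sketch, not the extension's documented method (the table, bucket, and retention window are hypothetical; assumes both pg_parquet and pg_cron are installed):

```sql
-- Nightly at 02:00: archive rows older than 90 days to S3 as Parquet.
-- In practice the object key would include a date; shown fixed for brevity.
SELECT cron.schedule(
  'archive-old-events',
  '0 2 * * *',
  $$COPY (
      SELECT * FROM events
      WHERE created_at < now() - interval '90 days'
    ) TO 's3://mybucket/archive/events.parquet' WITH (format 'parquet')$$
);
```

Since the COPY is read-only, pointing the job at a read replica is another way to keep the export load off the primary, which speaks to the slowdown concern.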