Reliably Replicating Data Between PostgreSQL and ClickHouse
The article explains how to replicate data from PostgreSQL to ClickHouse using PeerDB, focusing on Change Data Capture for efficient synchronization and offering both self-managed and managed solutions.
This article discusses how to reliably replicate data from PostgreSQL to ClickHouse using PeerDB, an open-source tool. The series will cover several methods, starting with PeerDB for self-managed setups and later exploring ClickPipes for a fully managed solution. PostgreSQL is favored for transactional workloads, while ClickHouse excels at analytics, making the combination increasingly popular for data warehousing and analytics applications.

PeerDB specializes in Change Data Capture (CDC) for PostgreSQL, providing a scalable solution for data replication. It was acquired by ClickHouse in July 2024, tightening the integration between the two platforms; users can choose between a self-hosted version of PeerDB or a managed version integrated into ClickHouse Cloud.

The article outlines the setup process for PeerDB, including creating a PostgreSQL table, generating test data, and configuring PeerDB through a web UI or SQL interface. The initial data load is efficient, after which changes flow from PostgreSQL to ClickHouse in near real time. The article emphasizes that CDC delivers lower latency and puts less load on the source database than query-based replication, which repeatedly polls the source with SELECTs.
- PeerDB is an open-source tool for replicating data from PostgreSQL to ClickHouse.
- The series will explore both self-managed and fully managed solutions for data replication.
- PeerDB specializes in Change Data Capture (CDC) for efficient data synchronization.
- Users can configure PeerDB through a web UI or a SQL interface for flexibility (see the sketch after this list).
- The initial data replication process is quick, supporting near real-time updates.
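
To make that concrete, here is a minimal sketch of the setup the article walks through, using a hypothetical events table. The CREATE PEER and CREATE MIRROR statements come from PeerDB's SQL interface, but treat the option names as approximate and check the PeerDB docs for the current syntax:

    -- PostgreSQL side: a hypothetical source table plus generated test data.
    -- CDC also requires wal_level = 'logical' on the source (needs a restart).
    CREATE TABLE public.events (
        id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        user_id    bigint      NOT NULL,
        event_type text        NOT NULL,
        created_at timestamptz NOT NULL DEFAULT now()
    );

    INSERT INTO public.events (user_id, event_type)
    SELECT (random() * 1000)::bigint,
           (ARRAY['click', 'view', 'purchase'])[1 + floor(random() * 3)::int]
    FROM generate_series(1, 100000);

    -- PeerDB side: register the source peer, then start a CDC mirror.
    -- (A ClickHouse peer, ch_target here, is registered the same way.)
    CREATE PEER pg_source FROM POSTGRES WITH (
        host = 'pg.example.com',
        port = 5432,
        user = 'peerdb_user',
        password = '<password>',
        database = 'app'
    );

    CREATE MIRROR events_cdc FROM pg_source TO ch_target
    WITH TABLE MAPPING (public.events:events);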
Related
ClickHouse acquires PeerDB to expand its Postgres support
ClickHouse has acquired PeerDB to enhance Postgres support, improving speed and capabilities for enterprise customers. PeerDB's team will expand change data capture, while existing services remain available until July 2025.
Change Data Capture (CDC) Tools should be database specialized not generalized
PeerDB focuses on Postgres for Change Data Capture, minimizing pipeline failures and load impacts. Their customers manage data sizes from 300GB to 20TB, necessitating ongoing improvements as Postgres evolves.
ClickHouse Data Modeling for Postgres Users
ClickHouse acquired PeerDB to enhance PostgreSQL data replication. The article offers data modeling tips, emphasizing the ReplacingMergeTree engine, duplicate management, ordering key selection, and the use of Nullable types.
I spent 5 hours learning how ClickHouse built their internal data warehouse
ClickHouse developed an internal data warehouse processing 470 TB from 19 sources, utilizing ClickHouse Cloud, Airflow, and AWS S3, supporting batch and real-time analytics, enhancing user experience and sales integration.
Postgres Meets Analytics: CDC from Neon to ClickHouse via PeerDB
Neon and ClickHouse integration improves real-time analytics by enabling low-latency transactions and fast processing. PeerDB facilitates data replication, supporting various applications like customer analytics and data warehousing.
It works well. Their team is great. I feel a bit spoiled by how much access we had to the engineering team during the private beta.
It's great for use cases where it makes sense to sync Postgres tables across to ClickHouse without denormalizing them. PeerDB can transform rows in a single table sent via CDC using Lua scripts, but it can't (yet!) denormalize data that Postgres stores in 3NF across multiple tables into a single ClickHouse table.
On the ClickHouse query side, we end up wanting denormalized data for query performance and to avoid JOINs. It's frequently not a great idea to query ClickHouse using the same table structure as your transactional DB.
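
To illustrate, two 3NF Postgres tables (say orders and users) often land in ClickHouse as one flattened table, with ReplacingMergeTree absorbing upserts from the sync job. This is a hypothetical sketch, not the commenter's actual schema:

    -- Flattened analytics table: user attributes are copied onto each order
    -- row at sync time, so queries never need to JOIN back to users.
    CREATE TABLE orders_denorm
    (
        order_id     UInt64,
        user_id      UInt64,
        user_country String,         -- copied from the users table
        user_plan    String,         -- copied from the users table
        amount       Decimal(18, 2),
        created_at   DateTime,
        _version     UInt64          -- bumped on every upsert by the sync job
    )
    ENGINE = ReplacingMergeTree(_version)
    ORDER BY order_id;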
In our experience, we sync a few tables with PeerDB but mostly end up using app-level custom code to sync denormalized data into ClickHouse for our core use cases. Most of the PeerDB-synced tables end up as ClickHouse dictionaries, which we then use in our queries.
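
That dictionary pattern looks roughly like this: a small synced table backs a ClickHouse dictionary, and dictGet lookups stand in for JOINs (table and column names here are hypothetical):

    -- Assumes a synced users table already exists in ClickHouse.
    CREATE DICTIONARY users_dict
    (
        user_id UInt64,
        country String
    )
    PRIMARY KEY user_id
    SOURCE(CLICKHOUSE(TABLE 'users'))
    LAYOUT(HASHED())
    LIFETIME(MIN 300 MAX 600);

    -- The lookup replaces a JOIN against users:
    SELECT dictGet('users_dict', 'country', user_id) AS country,
           count() AS orders
    FROM orders
    GROUP BY country;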
PeerDB works well and I like it for what it is. Just don't expect to be satisfied querying ClickHouse with the same table structure you've got in Postgres unless your data size is tiny.
Curious to know about how others are using it and the architectures you've developed.
Don't be confused by the timeseries branding.
The Parquet file is in a columnar-friendly format that can then simply be inserted into ClickHouse or DuckDB, or even queried directly.
This script and a cron job are enough for my (not very complex) needs, replicating my Postgres data into ClickHouse for fast queries.
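
For reference, loading a Parquet dump like that is a one-liner on either engine; the file and table names below are hypothetical:

    -- ClickHouse: file() reads from the server's user_files directory.
    INSERT INTO events SELECT * FROM file('events.parquet', 'Parquet');

    -- DuckDB: no load step needed, query the file in place.
    SELECT count(*) FROM read_parquet('events.parquet');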
I'm curious if you have data that backs this up, or if it's more of a "gut feeling" sort of thing. At first blush I agree with you, but doing it at the application level opens up so many more possibilities, such as writing "pre-coalesced" data to the data warehouse or pre-enriching the data that goes into it.