I spent 5 hours learning how ClickHouse built their internal data warehouse
ClickHouse built an internal data warehouse that holds 470 TB of compressed data from 19 sources. It runs on ClickHouse Cloud, Airflow, and AWS S3, supports both batch and real-time analytics, and exports data to Salesforce for the sales team.
ClickHouse, a high-performance open-source columnar database, built an internal data warehouse to better understand customer usage and improve its cloud product. The warehouse ingests data from 19 sources, holds 470 TB of compressed data in total, and handles 50 TB of data daily. Internal users initially relied on manual analysis in Excel. The new system uses ClickHouse Cloud as its core database, with Airflow for scheduling, AWS S3 for data storage, and Superset for business intelligence. Ingestion collects metrics from several sources, including AWS and GCP billing, customer information from Salesforce, and marketing data. ClickHouse uses the ReplicatedReplacingMergeTree engine to keep data processing idempotent and consistent. As the number of data sources grew, the team adopted dbt to centralize transformation logic. The system now supports both batch and real-time analytics, and users can query data through a native SQL console, which has improved the user experience compared to the previous tools. Integration with GrowthBook enables A/B testing, and data exported to Salesforce gives the sales team direct access.
- ClickHouse's internal data warehouse processes data from 19 sources, managing 470 TB of compressed data.
- The system utilizes ClickHouse Cloud, Airflow, AWS S3, and Superset for data management and analytics.
- Adoption of dbt has centralized transformation logic, enhancing efficiency as data sources increased.
- The warehouse supports both batch and real-time analytics, improving user access and experience.
- Integration with GrowthBook allows for A/B testing, and data export to Salesforce aids the sales team.
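The summary does not explain how a ReplacingMergeTree-style engine makes loads idempotent, so here is a minimal sketch in plain Python that only simulates the idea. It assumes each row has a sorting key and a version, and that a merge keeps the highest-versioned row per key. The `Row` fields, the key layout, and the helper names are hypothetical and are not taken from ClickHouse's pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    key: tuple      # hypothetical sorting key, e.g. (source, org_id, day)
    version: int    # hypothetical version column, e.g. a load batch number
    value: float


def insert(table: list, batch: list) -> None:
    """Append a batch as-is; duplicates are allowed at insert time."""
    table.extend(batch)


def merge(table: list) -> list:
    """Keep one row per key: the highest version, last inserted on ties."""
    latest = {}
    for row in table:
        current = latest.get(row.key)
        if current is None or row.version >= current.version:
            latest[row.key] = row
    return sorted(latest.values(), key=lambda r: r.key)


batch = [
    Row(("aws_billing", "org-1", "2024-01-01"), 1, 10.0),
    Row(("aws_billing", "org-2", "2024-01-01"), 1, 25.5),
]

table = []
insert(table, batch)
insert(table, batch)  # a retried job loads the same batch again

# After merging, the retry leaves no trace: same result as a single load.
once = merge(batch)
twice = merge(table)

# A corrected row with a higher version replaces the earlier one.
insert(table, [Row(("aws_billing", "org-1", "2024-01-01"), 2, 12.0)])
corrected = merge(table)
```

In ClickHouse itself, merges run in the background at unspecified times, so queries that need deduplicated results typically use `FINAL` or aggregate by key rather than assume the merge has already happened.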
Related
Materialized views in ClickHouse: The data transformation Swiss Army knife
Materialized views in ClickHouse enhance query performance by storing results on disk and updating automatically. They improve efficiency but increase storage use and risk insert errors. Incremental updates optimize performance.
ClickHouse acquires PeerDB to expand its Postgres support
ClickHouse has acquired PeerDB to enhance Postgres support, improving speed and capabilities for enterprise customers. PeerDB's team will expand change data capture, while existing services remain available until July 2025.
Why Did Databricks Open-Source Unity Catalog?
Databricks has open-sourced Unity Catalog and acquired Tabular, signaling a shift towards open-source solutions in lakehouse architecture, with support from major companies and potential impacts on Apache Iceberg.
Show HN: Storing and Analyzing 160 billion Quotes in ClickHouse
ClickHouse is effective for managing large financial datasets, offering fast query execution, efficient compression, and features like data deduplication and date partitioning, while alternatives like KDB and Shakti are also considered.
ClickHouse Data Modeling for Postgres Users
ClickHouse acquired PeerDB to enhance PostgreSQL data replication. The article offers data modeling tips, emphasizing the ReplacingMergeTree engine, duplicate management, ordering key selection, and the use of Nullable types.
Aspects of the post seem to borrow quite heavily from the original write-ups, which are worth a read.