September 24th, 2024

I spent 5 hours learning how ClickHouse built their internal data warehouse

ClickHouse built an internal data warehouse that processes 470 TB of data from 19 sources using ClickHouse Cloud, Airflow, and AWS S3. It supports both batch and real-time analytics, improving the user experience and integrating with the sales workflow.


ClickHouse, a high-performance open-source columnar database, has developed an internal data warehouse to better understand customer usage and improve its cloud product. The warehouse ingests data from 19 sources, manages a total of 470 TB of compressed data, and handles 50 TB of new data daily. Internal users initially relied on manual analysis in Excel; the new system uses ClickHouse Cloud as its core database, with Airflow for scheduling, AWS S3 for data storage, and Superset for business intelligence.

Ingestion collects metrics from a variety of sources, including AWS and GCP billing, customer information from Salesforce, and marketing data. ClickHouse uses the ReplicatedReplacingMergeTree engine to keep data processing idempotent and consistent. As the number of data sources grew, the team adopted dbt to centralize transformation logic.

The system now supports both batch and real-time analytics, and users access data through a native SQL console, which has improved the experience compared to previous tools. Integration with GrowthBook enables A/B testing, and data export to Salesforce gives the sales team direct access. Together, these pieces form a robust internal data warehouse that meets the evolving needs of ClickHouse's stakeholders.
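The idempotency point can be sketched in ClickHouse SQL. The table and column names below are illustrative assumptions, not the actual internal schema; the key idea is that ReplicatedReplacingMergeTree deduplicates rows sharing the same sorting key during background merges, keeping the row with the highest version value, so replaying a failed or repeated ingestion batch does not double-count data.

```sql
-- Illustrative sketch only; not ClickHouse's real internal schema.
-- Rows with the same ORDER BY key are collapsed during background
-- merges, keeping the one with the highest `ingested_at` version.
CREATE TABLE usage_metrics
(
    event_date   Date,
    service_id   String,
    metric_name  String,
    metric_value Float64,
    ingested_at  DateTime DEFAULT now()  -- version column
)
ENGINE = ReplicatedReplacingMergeTree(
    '/clickhouse/tables/{shard}/usage_metrics',  -- Keeper path
    '{replica}',
    ingested_at
)
ORDER BY (event_date, service_id, metric_name);
```

Because merges happen asynchronously, queries that need exact results before deduplication completes can use `SELECT ... FINAL` to force merge-time semantics at read time, at some performance cost.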

- ClickHouse's internal data warehouse processes data from 19 sources, managing 470 TB of compressed data.

- The system utilizes ClickHouse Cloud, Airflow, AWS S3, and Superset for data management and analytics.

- Adoption of dbt has centralized transformation logic, enhancing efficiency as data sources increased.

- The warehouse supports both batch and real-time analytics, improving user access and experience.

- Integration with GrowthBook allows for A/B testing, and data export to Salesforce aids the sales team.

4 comments
By @JosephRedfern - 7 months
It's linked to at the end, but I'm surprised that this article didn't feature/mention ClickHouse's own write-up of this undertaking. Part 1 here: https://clickhouse.com/blog/building-a-data-warehouse-with-c..., part 2 here: https://clickhouse.com/blog/building-a-data-warehouse-with-c....

Aspects of the post seem to borrow quite heavily from the original write-ups, which are worth a read.

By @omgwtfusb - 7 months
This looks close to the typical DWH setup many large corporations have, except for the ClickHouse-specific settings and the Docker usage (in my experience, most DEs seem skeptical of anything that isn't SQL).

By @tlarkworthy - 7 months
If you built this out of AWS managed services, what would you use?

By @simplegeek - 7 months
The drawings on the blog are really nice; which tools did you use to make them?