Building and scaling Notion's data lake
Notion copes with rapid data growth by transitioning to a sharded architecture and developing an in-house data lake built on Kafka, Hudi, and S3, with Spark as the main processing engine, improving scalability, speed, and cost-effectiveness.
Notion has experienced a significant increase in data over the past three years, with its data doubling every 6-12 months due to user growth. To manage this growth and meet the demands of new features like Notion AI, the company built and scaled its data lake. Notion's data model treats all elements as "blocks" stored in a Postgres database, which has grown to hundreds of terabytes. To handle this, Notion transitioned to a sharded architecture and built a dedicated data infrastructure. Challenges arose with Snowflake's ability to handle Notion's update-heavy workload, leading to the development of an in-house data lake using technologies like Kafka, Hudi, and S3. Spark was chosen as the main processing engine for its scalability and cost-efficiency. Incremental ingestion from Postgres to S3 was preferred over full snapshots for fresher data at lower costs. The data lake setup involved Debezium CDC connectors, Kafka, and Hudi for efficient data processing and storage. By offloading heavy workloads to S3 and focusing on offline tasks, Notion improved scalability, speed, and cost-effectiveness in managing its expanding data volume.
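For readers who want a concrete picture of the pipeline described above, here is a minimal PySpark sketch of the Debezium-to-Kafka-to-Hudi shape — not Notion's actual code. The broker address, topic name, row schema, and S3 path are all hypothetical, and it assumes the Kafka connector and Hudi Spark bundle are on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("cdc-to-hudi").getOrCreate()

# Debezium wraps each change as {"before": ..., "after": ..., "op": ...};
# this sketch keeps only the latest row image ("after") for inserts/updates.
row_schema = StructType([
    StructField("id", StringType()),
    StructField("parent_id", StringType()),
    StructField("properties", StringType()),
    StructField("updated_at", TimestampType()),
])
envelope = StructType([
    StructField("after", row_schema),
    StructField("op", StringType()),
])

changes = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # hypothetical broker
    .option("subscribe", "postgres.public.block")      # hypothetical topic
    .option("startingOffsets", "earliest")
    .load()
    .select(from_json(col("value").cast("string"), envelope).alias("v"))
    .select("v.after.*", "v.op")
    .where(col("op").isin("c", "u"))                   # skip deletes for brevity
)

# Upsert the changed rows into a Hudi table on S3, keyed by block id;
# the precombine field picks the newest version when keys collide.
hudi_options = {
    "hoodie.table.name": "block",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
}
changes.write.format("hudi").options(**hudi_options).mode("append").save(
    "s3a://example-data-lake/block"                    # hypothetical bucket
)
```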
Related
Our great database migration
Shepherd, an insurance pricing company, migrated from SQLite to Postgres to boost performance and scalability for their pricing engine, "Alchemist." The process involved code changes, adopting Neon database, and optimizing performance post-migration.
Graph-Based Ceramics
The article explores managing ceramic glazes in a kiln and developing an app. It compares Firebase, Supabase, and Instant databases, highlighting Instant's efficiency in handling complex relational data for ceramic management.
We sped up Notion in the browser with WASM SQLite
Notion improved web performance with WebAssembly SQLite, enhancing navigation by 20% in modern browsers. SharedWorker architecture managed SQLite queries efficiently, overcoming initial challenges for a seamless user experience.
DuckDB Meets Postgres
Organizations shift historical Postgres data to S3 with Apache Iceberg, enhancing query capabilities. ParadeDB integrates Iceberg with S3 and Google Cloud Storage, replacing DataFusion with DuckDB for improved analytics in pg_lakehouse.
Notion about their usage of WASM SQLite
Notion enhanced browser performance by integrating WebAssembly SQLite, OPFS, and Web Workers technologies. Overcoming challenges, they improved page navigation by 20%, optimizing SQLite usage for efficient cross-tab queries and compatibility.
- Notion's EM clarifies that user data is not sold and elaborates on the data lake's use for efficient reindexing of search clusters.
- Comments highlight the potential cost savings from moving away from expensive services like Fivetran and Snowflake.
- Suggestions are made to consider alternative query engines like Trino or StarRocks for better performance in interactive data analysis tasks.
- Questions arise about the practical benefits of the data lake for Notion users and its potential use in training AI models.
- Technical inquiries about data replication formats and updates in S3, as well as comparisons with other data management solutions, are discussed.
Notion does not sell its users' data.
Instead, I want to expand on one of the first use-cases for the Notion data lake, which was by my team. This is an elaboration of the description in TFA under the heading "Use case support".
As is described there, Notion's block permissions are highly normalized at the source of truth. This is usually quite efficient and generally brings along all the benefits of normalization in application databases. However, we need to _denormalize_ all the permissions that relate to a specific document when we index it into our search index.
When we transactionally reindex a document "online", this is no problem. However, when we need to reindex an entire search cluster from scratch, loading every ancestor of each page in order to collect all of its permissions is far too expensive.
Thus, one of the primary needs my team had from the new data lake was "tree traversal and permission data construction for each block". We rewrote our "offline" reindexer to read from the data lake instead of reading from RDS instances serving database snapshots. This allowed us to dramatically reduce the impact of iterating through every page when spinning up a new cluster (not to mention save a boatload on spinning up those ad-hoc RDS instances).
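To make the "collect permissions from every ancestor" step concrete, here is a toy Python sketch of the traversal. The field names and in-memory layout are illustrative, not Notion's actual schema; in the offline reindexer this walk runs over data lake tables rather than Python dicts.

```python
from typing import Dict, Optional, Set


class Block:
    def __init__(self, block_id: str, parent_id: Optional[str], permissions: Set[str]):
        self.block_id = block_id
        self.parent_id = parent_id
        self.permissions = permissions  # permissions granted directly on this block


def denormalized_permissions(block_id: str, blocks: Dict[str, Block]) -> Set[str]:
    """Union the permissions of a block and all of its ancestors."""
    collected: Set[str] = set()
    current: Optional[str] = block_id
    while current is not None:
        block = blocks[current]
        collected |= block.permissions
        current = block.parent_id
    return collected


# Tiny usage example: the page inherits the workspace-level permission.
blocks = {
    "root": Block("root", None, {"workspace:read"}),
    "page": Block("page", "root", {"alice:edit"}),
}
assert denormalized_permissions("page", blocks) == {"workspace:read", "alice:edit"}
```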
I hope this miniature deep dive gives a little bit more color on the uses of this data store—as it is emphatically _not_ to sell our users' data!
What does a backing data lake afford a Notion user that can’t be done in a similar product, like Obsidian?
This isn't something I would like to hear if I were still using Notion. It's very bold to publish something like this on their own website.
Also, we see how self-hosting within a startup can make perfect sense. :)
DevOps that abstracts things away to the cloud might, in some cases, just add to architectural and technical debt later, without the history of learning that comes from working through the challenges.
Still, it might have been a great opportunity to figure out offline-first use of Notion.
I have been forced to use Anytype instead of Notion for the offline-first reason. Time to check out the source code to learn how they handle storage.
Are they using this new data lake to train new AI models on?
Or has Notion signed a deal with another LLM provider to provide customer data as a source for training data?
At ParadeDB, we're seeing more and more users who want to maintain the Postgres interface while offloading data to S3 for cost and scalability reasons, which was the main motivation behind the creation of pg_lakehouse.
And I'm wondering if it's possible to update the S3 files to reflect the latest incoming changes on the DB table?
Versus what Notion is doing:
> We ingest incrementally updated data from Postgres to Kafka using Debezium CDC connectors, then use Apache Hudi, an open-source data processing and storage framework, to write these updates from Kafka to S3.
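On the question above about whether the S3 files can reflect the latest changes: Hudi keys each record, so re-writing a changed row is merged into the existing table rather than appended as a duplicate. A minimal sketch under assumed table, key, and path names (not Notion's configuration), assuming the Hudi Spark bundle is available:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-demo").getOrCreate()

hudi_options = {
    "hoodie.table.name": "block",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
}
path = "s3a://example-data-lake/block"   # hypothetical bucket

# Initial state of a row.
spark.createDataFrame(
    [("block-1", "draft", 1)], ["id", "status", "updated_at"]
).write.format("hudi").options(**hudi_options).mode("overwrite").save(path)

# A later CDC event for the same key: Hudi merges it into the table on S3,
# so a fresh read shows only the latest version, not both.
spark.createDataFrame(
    [("block-1", "published", 2)], ["id", "status", "updated_at"]
).write.format("hudi").options(**hudi_options).mode("append").save(path)

spark.read.format("hudi").load(path).select("id", "status").show()
# -> block-1 | published   (the later write wins)
```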
Feels like it would work about the same with Bufstream, replacing both Kafka & Hudi. I've heard great things about Hudi but it does seem to have significantly less adoption so far.
"when we considered them in 2022" is significant here because both Iceberg and Delta Lake have made rapid progress since then. I talk to a lot of companies making this decision and the consensus is swinging towards Iceberg. If they're already heavy Databricks users, then Delta is the obvious choice.
For anyone that missed it, Databricks acquired Tabular[0] (which was founded by the creators of Iceberg). The public facing story is that both projects will continue independently and I really hope that's true.
Shameless plug: this is the same infrastructure we're using at Definite[1] and we're betting a lot of companies want a setup like this, but can't afford to build it themselves. It's radically cheaper than the standard Snowflake + Fivetran + Looker stack and works from day one. A lot of companies just want dashboards, and it's pretty ridiculous the hoops you need to jump through to get them running.
We use Iceberg for storage, DuckDB as a query engine, and a few open-source projects for ETL, and we built a frontend to manage it all and create dashboards.
0 - https://www.definite.app/blog/databricks-tabular-acquisition
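As a rough illustration of that Iceberg-plus-DuckDB combination (not Definite's actual code), DuckDB's iceberg and httpfs extensions can query an Iceberg table in S3 directly. The bucket path and query are hypothetical, and S3 credential setup is omitted.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg; LOAD iceberg;")
con.execute("INSTALL httpfs; LOAD httpfs;")
# S3 credentials would normally be configured here (omitted for brevity).

# Scan an Iceberg table stored in S3 and run an aggregate over it.
rows = con.execute(
    """
    SELECT date_trunc('day', created_at) AS day, count(*) AS events
    FROM iceberg_scan('s3://example-warehouse/analytics/events')
    GROUP BY 1
    ORDER BY 1
    """
).fetchall()
print(rows)
```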