July 14th, 2024

Building and scaling Notion's data lake

Notion copes with rapid data growth by transitioning to a sharded Postgres architecture and building an in-house data lake on Kafka, Hudi, and S3, with Spark as the main processing engine. The move improved scalability, speed, and cost-effectiveness.

Building and scaling Notion's data lake

Notion has experienced a significant increase in data over the past three years, with its data doubling every 6-12 months due to user growth. To manage this growth and meet the demands of new features like Notion AI, the company built and scaled its data lake. Notion's data model treats all elements as "blocks" stored in a Postgres database, which has grown to hundreds of terabytes. To handle this, Notion transitioned to a sharded architecture and built a dedicated data infrastructure. Challenges arose with Snowflake's ability to handle Notion's update-heavy workload, leading to the development of an in-house data lake using technologies like Kafka, Hudi, and S3. Spark was chosen as the main processing engine for its scalability and cost-efficiency. Incremental ingestion from Postgres to S3 was preferred over full snapshots for fresher data at lower costs. The data lake setup involved Debezium CDC connectors, Kafka, and Hudi for efficient data processing and storage. By offloading heavy workloads to S3 and focusing on offline tasks, Notion improved scalability, speed, and cost-effectiveness in managing its expanding data volume.
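
The post itself has no code, but the Kafka-to-Hudi-to-S3 leg of a pipeline like this looks roughly like the following PySpark sketch. It is not Notion's actual code: the bucket, table, and field names are illustrative, and it assumes the Hudi Spark bundle and the S3A connector are available.

    # Minimal sketch: upsert CDC-derived rows into a Hudi table on S3.
    # Paths, table names, and fields are placeholders, not Notion's.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hudi-upsert-sketch")
             # Hudi's Spark bundle must be on the classpath, e.g. via --packages.
             .getOrCreate())

    # Assume `updates` already holds parsed Debezium change events:
    # one row per changed block, keyed by `id`, with `updated_at`
    # used to pick the newest version of each record.
    updates = spark.read.parquet("s3a://example-bucket/staging/block_updates/")

    (updates.write.format("hudi")
        .option("hoodie.table.name", "blocks")
        .option("hoodie.datasource.write.recordkey.field", "id")
        .option("hoodie.datasource.write.precombine.field", "updated_at")
        .option("hoodie.datasource.write.operation", "upsert")  # update-heavy workload
        .mode("append")
        .save("s3a://example-bucket/lake/blocks"))

The upsert operation is what makes Hudi a fit for an update-heavy workload: rather than rewriting full snapshots, only the changed records are merged into the table.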

AI: What people are saying
The article on Notion's transition to a sharded architecture and in-house data lake using Kafka, Hudi, and S3 has sparked various discussions.
  • Notion's EM clarifies that user data is not sold and elaborates on the data lake's use for efficient reindexing of search clusters.
  • Comments highlight the potential cost savings from moving away from expensive services like Fivetran and Snowflake.
  • Suggestions are made to consider alternative query engines like Trino or StarRocks for better performance in interactive data analysis tasks.
  • Questions arise about the practical benefits of the data lake for Notion users and its potential use in training AI models.
  • Technical inquiries about data replication formats and updates in S3, as well as comparisons with other data management solutions, are discussed.
18 comments
By @crux - 4 months
Hi all—I'm the EM for the Search team at Notion, and I want to chime in to clear up one unfortunate misconception I've seen a few times in this thread.

Notion does not sell its users' data.

Instead, I want to expand on one of the first use-cases for the Notion data lake, which was by my team. This is an elaboration of the description in TFA under the heading "Use case support".

As is described there, Notion's block permissions are highly normalized at the source of truth. This is usually quite efficient and generally brings along all the benefits of normalization in application databases. However, we need to _denormalize_ all the permissions that relate to a specific document when we index it into our search index.

When we transactionally reindex a document "online", this is no problem. However, when we need to reindex an entire search cluster from scratch, loading every ancestor of each page in order to collect all of its permissions is far too expensive.

Thus, one of the primary needs that my team had from the new data lake is "tree traversal and permission data construction for each block". We rewrote our "offline" reindexer to read from the data lake instead of reading from RDS instances serving database snapshots. This allowed us to dramatically reduce the impact of iterating through every page when spinning up a new cluster (not to mention save a boatload in spinning up those ad-hoc RDS instances).
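
To make "tree traversal and permission data construction" concrete, here's a toy sketch of the idea in plain Python over in-memory maps. The real job runs as an offline Spark job over the lake, and all names here are made up.

    # Toy illustration of permission denormalization, not production code.
    # parent_of maps block_id -> parent block_id (None at the root);
    # perms_of maps block_id -> permissions set directly on that block.
    def denormalized_permissions(block_id, parent_of, perms_of):
        """Collect permissions from the block and every ancestor up to the root."""
        collected = []
        current = block_id
        while current is not None:
            collected.extend(perms_of.get(current, []))
            current = parent_of.get(current)
        return collected

    # Example: a page inherits the workspace-level grant from its ancestor.
    parent_of = {"page": "teamspace", "teamspace": None}
    perms_of = {"page": ["alice:read"], "teamspace": ["workspace:everyone:read"]}
    print(denormalized_permissions("page", parent_of, perms_of))
    # -> ['alice:read', 'workspace:everyone:read']

Doing this ancestor lookup block-by-block against live database snapshots is what gets prohibitively expensive at full-reindex scale, which is why the denormalized result is computed from the lake instead.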

I hope this miniature deep dive gives a little bit more color on the uses of this data store—as it is emphatically _not_ to sell our users' data!

By @SOLAR_FIELDS - 4 months
They didn’t say the quiet part out loud, which is almost certainly that the Fivetran and Snowflake bills for what they were doing were probably enormous and those were undoubtedly what got management’s attention about fixing this.
By @adolph - 4 months
They seem to be doing lots of work but I don’t understand what customer value this creates.

What does a backing data lake afford a Notion user that can’t be done in a similar product, like Obsidian?

By @methou - 4 months
> Data lake > Data warehouse

These aren't things I'd like to hear if I'm still using Notion. It's very bold to publish something like this on their own website.

By @j45 - 4 months
This was a nice read, interesting to see how far Postgres (largely alone) can get you.

Also, we see how self-hosting within a startup can make perfect sense. :)

DevOps that abstracts things away to the cloud might, in some cases, just add architectural and technical debt later, without the history of learning that comes from working through the challenges yourself.

Still, it might have been a great opportunity to figure out offline-first use of Notion.

I have been forced to use Anytype instead of Notion for that offline-first reason. Time to check out the source code to learn how they handle storage.

By @hobobaggins - 4 months
> Managing this rapid growth while meeting the ever-increasing data demands of critical product and analytics use cases, especially our recent Notion AI features, meant building and scaling Notion’s data lake.

Are they using this new data lake to train new AI models on?

Or has Notion signed a deal with another LLM provider to provide customer data as a source for training data?

By @philippemnoel - 4 months
This is one of the best blog posts I've seen that showcase the UPDATE-heavy, "surface data lake data to users" type of workload.

At ParadeDB, we're seeing more and more users want to maintain the Postgres interface while offloading data to S3 for cost and scalability reasons, which was the main reason behind the creation of pg_lakehouse.

By @wejick - 4 months
I'm not familiar with data lake setups on S3. When replicating a DB table to S3, what format is used?

And I'm wondering whether it's possible to update the S3 files to reflect the latest incoming changes to the DB table.

By @HermitX - 4 months
Great article, thank you for sharing! I have a question I’d like to discuss with the author. Spark SQL is a great product and works perfectly for batch processing tasks. However, for handling ad hoc query tasks or more interactive data analysis tasks, Spark SQL might have some performance issues. If you have such workloads, I suggest trying data lake query engines like Trino or StarRocks, which offer faster speeds and a better query experience.
By @jauntywundrkind - 4 months
Side-ish note, I really enjoyed a submission on Bufstream recently, a Kafka mq replacement. One of the things they mentioned is that they are working on building in Iceberg materialization, so Bufstream can automatically handle building a big analytics data lake out of incoming data. It feels like that could potentially tackle a bunch of the stack here. https://buf.build/blog/bufstream-kafka-lower-cost https://news.ycombinator.com/item?id=40919279

Versus what Notion is doing:

> We ingest incrementally updated data from Postgres to Kafka using Debezium CDC connectors, then use Apache Hudi, an open-source data processing and storage framework, to write these updates from Kafka to S3.

Feels like it would work about the same with Bufstream, replacing both Kafka & Hudi. I've heard great things about Hudi but it does seem to have significantly less adoption so far.
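
For reference, the Debezium leg of that quoted pipeline is typically just a connector registered with Kafka Connect's REST API, along the lines of this sketch. Hosts, credentials, and table names are placeholders, and the exact option names vary a bit between Debezium versions.

    # Sketch: register a Debezium Postgres connector with Kafka Connect.
    # Everything below (host, database, table list) is a placeholder.
    import json
    import urllib.request

    connector = {
        "name": "postgres-blocks-cdc",
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",            # logical decoding plugin
            "database.hostname": "postgres.internal",
            "database.port": "5432",
            "database.user": "debezium",
            "database.password": "secret",
            "database.dbname": "example_shard",
            "table.include.list": "public.block",
            "topic.prefix": "shard1",             # Debezium 2.x-style naming
        },
    }

    req = urllib.request.Request(
        "http://localhost:8083/connectors",       # Kafka Connect REST endpoint
        data=json.dumps(connector).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

With Bufstream the Kafka-facing side would presumably stay the same; it's the Hudi write path that the built-in Iceberg materialization would replace.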

By @whinvik - 4 months
Is there any advantage to having both a data lake setup and Snowflake? Why would one also want Snowflake after doing such an extensive data lake setup?
By @CyberDildonics - 4 months
What's the difference between a data lake and a database with a filesystem?
By @DataDaemon - 4 months
OK, thanks. When E2EE?
By @mritchie712 - 4 months
> Iceberg and Delta Lake, on the other hand, weren’t optimized for our update-heavy workload when we considered them in 2022

"when we considered them in 2022" is significant here because both Iceberg and Delta Lake have made rapid progress since then. I talk to a lot of companies making this decision and the consensus is swinging towards Iceberg. If they're already heavy Databricks users, then Delta is the obvious choice.

For anyone that missed it, Databricks acquired Tabular[0] (which was founded by the creators of Iceberg). The public facing story is that both projects will continue independently and I really hope that's true.

Shameless plug: this is the same infrastructure we're using at Definite[1] and we're betting a lot of companies want a setup like this, but can't afford to build it themselves. It's radically cheaper than the standard Snowflake + Fivetran + Looker stack and works from day one. A lot of companies just want dashboards, and it's pretty ridiculous the hoops you need to jump through to get them running.

We use iceberg for storage, duckdb as a query engine, a few open source projects for ETL and built a frontend to manage it all and create dashboards.

0 - https://www.definite.app/blog/databricks-tabular-acquisition

1 - https://www.youtube.com/watch?v=7FAJLc3k2Fo
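
As a rough illustration of the DuckDB-over-Iceberg piece described above (bucket and table paths are placeholders; it assumes DuckDB's iceberg and httpfs extensions and S3 credentials configured for the session):

    # Sketch: ad hoc query over an Iceberg table on S3 with DuckDB.
    import duckdb

    con = duckdb.connect()
    con.sql("INSTALL iceberg")
    con.sql("LOAD iceberg")
    con.sql("INSTALL httpfs")   # S3 access; credentials configured separately
    con.sql("LOAD httpfs")

    rows = con.sql("""
        SELECT count(*) AS row_count
        FROM iceberg_scan('s3://example-bucket/warehouse/events')
    """).fetchall()
    print(rows)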

By @alexliu518 - 4 months
Thank you for the clarification! It's great to hear more about the efficient data management practices at Notion. Your team's innovative use of the data lake to streamline the reindexing process while ensuring user data privacy is impressive. Keep up the excellent work!