June 20th, 2024

Lessons Learned from Scaling to Multi-Terabyte Datasets

Insights on scaling to multi-terabyte datasets, emphasizing algorithm evaluation before scaling. Tools like Joblib and GNU Parallel for single machine scaling, transitioning to multiple machines, and comparing performance/cost implications. Recommendations for parallel workloads and analytical tasks using AWS Batch, Dask, and Spark. Considerations for tool selection based on team size and workload.

Read original articleLink Icon
Lessons Learned from Scaling to Multi-Terabyte Datasets

This post provides insights into scaling to multi-terabyte datasets, focusing on lessons learned and tools for efficient scaling. It emphasizes the importance of evaluating algorithms before scaling and offers guidance on scaling on single machines using tools like Joblib and GNU Parallel. Joblib is highlighted for parallelizing workloads, while GNU Parallel is recommended for CLI data processing tasks. The post also delves into scaling to multiple machines, discussing when to transition to multiple machines and the benefits of horizontal scaling. It compares the performance and cost implications of using multiple smaller instances versus a single large instance. Additionally, it touches on different computing models for embarrassingly parallel workloads and analytical workloads, recommending tools like AWS Batch, Dask, and Spark for scaling up computations. The post concludes by suggesting considerations for selecting the appropriate tool based on team size, infrastructure, and workload requirements.

Related

Link Icon 7 comments
By @jakozaur - 7 months
Servers grow much bigger. 256GB of RAM is pretty much a standard rack server on a major cloud. On AWS, ultra memory with 24TB of RAM is one API call away from you :-). The article took note of that, but tooling on one node puzzles me.

So, multi TB can still be a single node with DuckDB. It's also rather a small ClickHouse cluster. It sounds easier to use DuckDB than the proposed tools.

In many use cases, if it's not 24/7, you can use a data lake (e.g. Iceberg) and query it if needed. Databricks seems a way to go, since the author uses Spark.

By @atemerev - 7 months
A “multi-terabyte dataset” is something that fits into a single machine, and can be loaded into Clickhouse in a few minutes.
By @sponaugle - 7 months
When I first read the title I had in my head that it said multi-petabyte. When I think Multi-terabyte, I think 'yea, that is mostly in RAM'. 4TBs of RAM is sort of the new lower norm in DB machines, and GCP and AWS have high RAM machines at 12TB and 24TB. Not that these are not good ideas about scaling to multi-terabytes.. it is just interesting how much data has grown in size.
By @fifilura - 7 months
Since you are already running on AWS, I wonder if you would consider AWS serverless EMR an option?

https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGu...

I guess the pricing will be cheaper than renting entire machines since you are doing something like micro-spot allocations for map or reduce jobs.

We have done a lot of analytical and also batched workflows with AWS Athena. This is really cheap and also, once you get a hang of it, you get surprised how much you can achieve with only SQL.

I am hoping serverless EMR would be roughly equivalent, just with another language...

By @sammysidhu - 7 months
https://github.com/Eventual-Inc/Daft Is also great at these types of workloads since it’s both distributed and vectorized!