Lessons Learned from Scaling to Multi-Terabyte Datasets
Insights on scaling to multi-terabyte datasets, emphasizing algorithm evaluation before scaling. Tools like Joblib and GNU Parallel for single-machine scaling, guidance on transitioning to multiple machines, and a comparison of the performance and cost implications. Recommendations for embarrassingly parallel and analytical workloads using AWS Batch, Dask, and Spark, plus considerations for tool selection based on team size and workload.
This post provides insights into scaling to multi-terabyte datasets, focusing on lessons learned and tools for efficient scaling. It emphasizes evaluating algorithms before scaling and offers guidance on scaling on a single machine with tools like Joblib and GNU Parallel: Joblib is highlighted for parallelizing workloads, while GNU Parallel is recommended for CLI data-processing tasks. The post then turns to scaling out, discussing when to transition to multiple machines and the benefits of horizontal scaling, and compares the performance and cost implications of many smaller instances versus a single large one. It also distinguishes computing models for embarrassingly parallel versus analytical workloads, recommending tools like AWS Batch, Dask, and Spark for scaling up computations. The post concludes with considerations for selecting the appropriate tool based on team size, infrastructure, and workload requirements.
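As a rough illustration of the single-machine approach the post recommends, fanning a function out over data shards with Joblib might look like the sketch below. The per-file task, directory, and file pattern are invented for illustration, not taken from the article.

```python
# Minimal sketch of single-machine parallelism with Joblib.
# The work function and file layout are hypothetical placeholders.
from pathlib import Path

from joblib import Parallel, delayed


def process_shard(path: Path) -> int:
    """Toy per-file task: count the lines in one shard."""
    with open(path) as f:
        return sum(1 for _ in f)


shards = sorted(Path("data").glob("*.csv"))

# n_jobs=-1 uses every available core; each shard is handled
# by a separate worker process.
counts = Parallel(n_jobs=-1)(delayed(process_shard)(p) for p in shards)
print(sum(counts), "total lines")
```

GNU Parallel fills the same niche from the shell, spreading a CLI command across files or arguments, which is why the post points to it for command-line data processing.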
Related
So, multi-TB can still be handled by a single node with DuckDB; by ClickHouse standards it would also be a rather small cluster. DuckDB sounds easier to use than the proposed tools.
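As a rough sketch of that single-node route (the file path and column names below are invented, not from the thread), DuckDB's Python API can aggregate over Parquet shards in place:

```python
# Sketch of the single-node approach the comment suggests: DuckDB
# querying Parquet files where they sit. Paths/columns are hypothetical.
import duckdb

con = duckdb.connect()  # in-memory catalog; the data stays on disk

# DuckDB scans the Parquet files directly and can spill intermediate
# state to disk, so the dataset does not have to fit in RAM.
top_users = con.execute("""
    SELECT user_id, count(*) AS events
    FROM read_parquet('data/events/*.parquet')
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""").fetchdf()
print(top_users)
```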
In many use cases, if the workload isn't 24/7, you can keep the data in a data lake (e.g. Iceberg) and query it only when needed. Databricks seems like the way to go, since the author already uses Spark.
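A sketch of that query-on-demand pattern with PySpark, assuming an Iceberg catalog is already configured on the Spark session (the catalog, namespace, and table names here are invented):

```python
# Sketch of querying an Iceberg table from Spark only when needed.
# Assumes an Iceberg catalog named "lake" is wired into the session.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("adhoc-lake-query").getOrCreate()

# With Iceberg, the table is just files plus metadata in object
# storage; no cluster has to run while nobody is querying.
events = spark.table("lake.analytics.events")

(events
 .where(F.col("event_date") >= "2024-01-01")
 .groupBy("event_type")
 .count()
 .show())
```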
https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGu...
I guess the pricing will be cheaper than renting entire machines since you are doing something like micro-spot allocations for map or reduce jobs.
We have done a lot of analytical and batch workflows with AWS Athena. It is really cheap, and once you get the hang of it, you'd be surprised how much you can achieve with only SQL.
I am hoping serverless EMR would be roughly equivalent, just with another language...
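To make the Athena workflow described above concrete, here is a minimal way to drive it from Python with boto3; the database, query, and S3 output location are placeholders, not details from the comment.

```python
# Sketch of running an ad-hoc Athena query via boto3.
# Database, table, and output bucket are hypothetical.
import time

import boto3

athena = boto3.client("athena")

qid = athena.start_query_execution(
    QueryString="""
        SELECT day, count(*) AS events
        FROM analytics.page_views
        GROUP BY day
        ORDER BY day
    """,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

print(state)
```

Since Athena bills per data scanned rather than per instance-hour, partitioning and columnar formats like Parquet keep the scans, and hence the bill, small.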