Amazon's Exabyte-Scale Migration from Apache Spark to Ray on Amazon EC2
Amazon's BDT team is migrating BI datasets from Apache Spark to Ray on EC2 to enhance efficiency and reduce costs, achieving significant performance improvements and addressing scalability issues with their data catalog.
Amazon's Business Data Technologies (BDT) team is migrating its business intelligence (BI) datasets from Apache Spark to Ray on Amazon EC2 to enhance data processing efficiency and reduce costs. This transition is significant as it involves managing exabytes of data with minimal downtime. The decision to switch to Ray, an open-source framework primarily known for machine learning, stems from its ability to address scalability and performance issues that arose with their existing Apache Spark compactor, which struggled with the growing size of their data catalog.
The migration process began after Amazon's successful transition from Oracle's data warehouse to a more scalable architecture using Amazon S3 for storage and various compute services. However, as their data needs expanded, the limitations of Spark became apparent, prompting BDT to explore alternatives. After evaluating Ray, they conducted a proof of concept that demonstrated Ray's superior performance: it handled datasets 12 times larger than Spark could and improved cost efficiency by 91%.
By 2021, BDT had developed a distributed system design for managing jobs on Ray, integrating it with other AWS services for effective job tracking. They tested this system in 2022, focusing on data quality insights across their catalog, which allowed them to identify and resolve issues before implementing Ray in critical data processing tasks. The migration to Ray represents a strategic shift in how Amazon manages its vast data resources, aiming for improved performance and cost-effectiveness in data processing.
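The compaction workload at the heart of this migration is, at its core, parallel across table partitions, which is the kind of job Ray's task API handles naturally. The snippet below is a hedged illustration of that pattern only, not BDT's actual compactor; the bucket paths and merge step are placeholders.

```python
# Minimal sketch of partition-parallel compaction with Ray tasks.
# Not BDT's implementation: the S3 paths and merge logic are placeholders.
import ray

ray.init()

@ray.remote
def compact_partition(partition_uri: str) -> str:
    # A real compactor would read delta files from the partition, merge and
    # deduplicate records, and write back a consolidated file. Stubbed here.
    return partition_uri.rstrip("/") + "/compacted.parquet"

partitions = [f"s3://example-bucket/table/dt=2024-01-{d:02d}" for d in range(1, 8)]
# One task per partition; Ray schedules them across the cluster's workers.
compacted = ray.get([compact_partition.remote(p) for p in partitions])
print(compacted)
```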
Related
Lessons Learned from Scaling to Multi-Terabyte Datasets
Insights on scaling to multi-terabyte datasets, emphasizing algorithm evaluation before scaling. Tools like Joblib and GNU Parallel for single machine scaling, transitioning to multiple machines, and comparing performance/cost implications. Recommendations for parallel workloads and analytical tasks using AWS Batch, Dask, and Spark. Considerations for tool selection based on team size and workload.
Raycast (YC W20) Is Hiring a Senior Product Designer (UTC ± 3 Hours)
Raycast seeks a Senior Product Designer with a €132,000 salary. Responsibilities include UI design for macOS and web, brand enhancement, and contributing to product aspects. Benefits include stock options and health insurance.
Big Tech's playbook for swallowing the AI industry
Amazon strategically hires Adept AI team to sidestep antitrust issues. Mimicking Microsoft's Inflection move, Amazon's "reverse acquihire" trend absorbs AI startups, like Adept, facing financial struggles. Big Tech adapts to regulatory challenges by emphasizing talent acquisition and tech licensing.
Building and scaling Notion's data lake
Notion copes with rapid data growth by transitioning to a sharded architecture and developing an in-house data lake using technologies like Kafka, Hudi, and S3. Spark is the main processing engine for scalability and cost-efficiency, improving scalability, speed, and cost-effectiveness.
The Rise of the Analytics Pretendgineer
The article examines dbt's role in data modeling, highlighting its accessibility for analytics professionals while noting the need for a structured framework to prevent disorganized and fragile projects.
- Several commenters highlight the impressive performance improvements and cost savings achieved with Ray, suggesting it may represent a significant shift in data processing technologies.
- There is a discussion about the suitability of Ray for different types of workloads, with some noting that Ray excels in unstructured data processing and machine learning tasks.
- Users express curiosity about the practical applications of Ray and how it compares to Spark, particularly in terms of scalability and efficiency.
- Some commenters share personal experiences with Ray, emphasizing its Python-native capabilities and real-time query performance.
- Concerns are raised about the long-term sustainability of data processing as data volumes grow, questioning whether current improvements will suffice.
1. This is truly impressive work from AWS. Patrick Ames began speaking about this a couple years ago, though at this point the blog post is probably the best reference. https://www.youtube.com/watch?v=h7svj_oAY14
2. This is not a "typical" Ray use case. I'm not aware of any other exabyte scale data processing workloads. Our bread and butter is ML workloads: training, inference, and unstructured data processing.
3. We have a data processing library called Ray Data for ingesting and processing data, often done in conjunction with training and inference. However, I believe in this particular use case, the heavy lifting is largely done with Ray's core APIs (tasks & actors), which are lower level and more flexible, making them a good fit for highly custom use cases. Most Ray users use the Ray libraries (train, data, serve), but power users often use the Ray core APIs (see the sketch after this list).
4. Since people often ask about data processing with Ray and Spark, Spark use cases tend to be more geared toward structured data and CPU processing. If you are joining a bunch of tables together or running SQL queries, Spark is going to be way better. If you're working with unstructured data (images, text, video, audio, etc), need mixed CPU & GPU compute, are doing deep learning and running inference, etc, then Ray is going to be much better.
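For readers unfamiliar with the core primitives mentioned in point 3, here is a minimal, self-contained illustration of a stateless task and a stateful actor; the functions and values are toy examples, not anything from Amazon's workload.

```python
# Toy illustration of Ray's two core primitives: tasks and actors.
import ray

ray.init()

@ray.remote
def square(x: int) -> int:
    # Task: stateless function, scheduled on any available worker.
    return x * x

@ray.remote
class Counter:
    # Actor: stateful object pinned to a single worker process.
    def __init__(self):
        self.total = 0

    def add(self, value: int) -> int:
        self.total += value
        return self.total

counter = Counter.remote()
futures = [square.remote(i) for i in range(4)]  # runs in parallel
for result in ray.get(futures):                 # [0, 1, 4, 9]
    counter.add.remote(result)
print(ray.get(counter.add.remote(0)))           # prints 14
```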
This is not badmouthing either project, just an observation: a system architected for one kind of task will naturally be better at attacking that kind of problem.
As a layman, I imagine lots of it loses relevance very quickly; e.g., Amazon sales data from 5 years ago is only marginally useful for determining future trends and analyzing new consumer behavior regimes?
I had no idea anything at AWS had that long of an attention span.
It's funny and telling that in the end, it's all backed by CSVs in s3. Long live CSV!
We love Ray, and are excited about the awesome ecosystem of useful + scalable tools that run on it for model training and serving. We hope that Daft can complement the rest of the Ray ecosystem to enable large scale ETL/analytics to also run on your existing Ray clusters. If you have an existing Ray cluster setup, you absolutely should have access to best-in-class ETL/analytics without having to run a separate Spark cluster.
Also, on the nerdier side of things - the primitives that Ray provides give us a real opportunity to build a solid non-JVM based, vectorized distributed query engine. We're already seeing extremely good performance improvements here vs Spark, and are really excited about some of the upcoming work to get even better performance and memory stability.
This collaboration with Amazon really battle-tested our framework :) happy to answer any questions if folks have them.
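As a rough sketch of what running ETL on an existing Ray cluster with Daft can look like (the cluster address, bucket path, and column names below are assumptions for illustration, not details from the original discussion):

```python
# Hedged sketch: a Daft DataFrame query executing on a Ray cluster.
# The Ray address, S3 path, and columns are illustrative placeholders.
import daft
from daft import col

# Point Daft's execution at an existing Ray cluster instead of the local runner.
daft.context.set_runner_ray(address="ray://head-node:10001")

df = daft.read_parquet("s3://example-bucket/events/")
df = df.where(col("event_type") == "purchase").select("user_id", "amount")
df.show()
```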
E.g. Ray, Databricks, Notebooks, etc.
I did not even realize it was that viable a substitute for straight data processing tasks.
Does the sales team know about this? /jk
Like, if the current latency is ~60 minutes for 90% of updates, will it ever be better than that? Won’t it just slowly degrade until the next multi-year migration?
PS: this article was infuriating to read on iPad - it kept jumping back to the top of the page and I couldn't figure out why