Amazon's Exabyte-Scale Migration from Apache Spark to Ray on Amazon EC2
Amazon's BDT team is migrating BI datasets from Apache Spark to Ray on EC2 to enhance efficiency and reduce costs, achieving significant performance improvements and addressing scalability issues with their data catalog.
Amazon's Business Data Technologies (BDT) team is migrating its business intelligence (BI) datasets from Apache Spark to Ray on Amazon EC2 to enhance data processing efficiency and reduce costs. This transition is significant as it involves managing exabytes of data with minimal downtime. The decision to switch to Ray, an open-source framework primarily known for machine learning, stems from its ability to address scalability and performance issues that arose with their existing Apache Spark compactor, which struggled with the growing size of their data catalog.
The migration process began after Amazon's successful transition from Oracle's data warehouse to a more scalable architecture using Amazon S3 for storage and various compute services. However, as their data needs expanded, the limitations of Spark became apparent, prompting BDT to explore alternatives. After evaluating Ray, they conducted a proof of concept that demonstrated Ray's superior performance: it handled datasets 12 times larger than Spark could and improved cost efficiency by 91%.
By 2021, BDT had developed a distributed system design for managing jobs on Ray, integrating it with other AWS services for effective job tracking. They tested this system in 2022, focusing on data quality insights across their catalog, which allowed them to identify and resolve issues before implementing Ray in critical data processing tasks. The migration to Ray represents a strategic shift in how Amazon manages its vast data resources, aiming for improved performance and cost-effectiveness in data processing.
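The compaction workload at the heart of this migration is, at its core, parallel across table partitions, which is the kind of job Ray's task API handles naturally. The snippet below is a hedged illustration of that pattern only, not BDT's actual compactor; the bucket paths and merge step are placeholders.

```python
# Minimal sketch of partition-parallel compaction with Ray tasks.
# Not BDT's implementation: the S3 paths and merge logic are placeholders.
import ray

ray.init()

@ray.remote
def compact_partition(partition_uri: str) -> str:
    # A real compactor would read delta files from the partition, merge and
    # deduplicate records, and write back a consolidated file. Stubbed here.
    return partition_uri.rstrip("/") + "/compacted.parquet"

partitions = [f"s3://example-bucket/table/dt=2024-01-{d:02d}" for d in range(1, 8)]
# One task per partition; Ray schedules them across the cluster's workers.
compacted = ray.get([compact_partition.remote(p) for p in partitions])
print(compacted)
```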
Related
Lessons Learned from Scaling to Multi-Terabyte Datasets
Insights on scaling to multi-terabyte datasets, emphasizing algorithm evaluation before scaling. Tools like Joblib and GNU Parallel for single machine scaling, transitioning to multiple machines, and comparing performance/cost implications. Recommendations for parallel workloads and analytical tasks using AWS Batch, Dask, and Spark. Considerations for tool selection based on team size and workload.
Raycast (YC W20) Is Hiring a Senior Product Designer (UTC ± 3 Hours)
Raycast seeks a Senior Product Designer with a €132,000 salary. Responsibilities include UI design for macOS and web, brand enhancement, and contributing to product aspects. Benefits include stock options and health insurance.
Big Tech's playbook for swallowing the AI industry
Amazon strategically hires Adept AI team to sidestep antitrust issues. Mimicking Microsoft's Inflection move, Amazon's "reverse acquihire" trend absorbs AI startups, like Adept, facing financial struggles. Big Tech adapts to regulatory challenges by emphasizing talent acquisition and tech licensing.
Building and scaling Notion's data lake
Notion copes with rapid data growth by transitioning to a sharded architecture and developing an in-house data lake using technologies like Kafka, Hudi, and S3. Spark is the main processing engine for scalability and cost-efficiency, improving scalability, speed, and cost-effectiveness.
The Rise of the Analytics Pretendgineer
The article examines dbt's role in data modeling, highlighting its accessibility for analytics professionals while noting the need for a structured framework to prevent disorganized and fragile projects.
- Several commenters highlight the impressive performance improvements and cost savings achieved with Ray, suggesting it may represent a significant shift in data processing technologies.
- There is a discussion about the suitability of Ray for different types of workloads, with some noting that Ray excels in unstructured data processing and machine learning tasks.
- Users express curiosity about the practical applications of Ray and how it compares to Spark, particularly in terms of scalability and efficiency.
- Some commenters share personal experiences with Ray, emphasizing its Python-native capabilities and real-time query performance.
- Concerns are raised about the long-term sustainability of data processing as data volumes grow, questioning whether current improvements will suffice.
1. This is truly impressive work from AWS. Patrick Ames began speaking about this a couple years ago, though at this point the blog post is probably the best reference. https://www.youtube.com/watch?v=h7svj_oAY14
2. This is not a "typical" Ray use case. I'm not aware of any other exabyte scale data processing workloads. Our bread and butter is ML workloads: training, inference, and unstructured data processing.
3. We have a data processing library called Ray Data for ingesting and processing data, often done in conjunction with training and inference. However, I believe in this particular use case, the heavy lifting is largely done with Ray's core APIs (tasks & actors), which are lower level and more flexible, making them a good fit for highly custom use cases. Most Ray users use the Ray libraries (train, data, serve), but power users often use the Ray core APIs (see the sketch after this list).
4. Since people often ask about data processing with Ray and Spark, Spark use cases tend to be more geared toward structured data and CPU processing. If you are joining a bunch of tables together or running SQL queries, Spark is going to be way better. If you're working with unstructured data (images, text, video, audio, etc), need mixed CPU & GPU compute, are doing deep learning and running inference, etc, then Ray is going to be much better.
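For readers unfamiliar with the core primitives mentioned in point 3, here is a minimal, self-contained illustration of a stateless task and a stateful actor; the functions and values are toy examples, not anything from Amazon's workload.

```python
# Toy illustration of Ray's two core primitives: tasks and actors.
import ray

ray.init()

@ray.remote
def square(x: int) -> int:
    # Task: stateless function, scheduled on any available worker.
    return x * x

@ray.remote
class Counter:
    # Actor: stateful object pinned to a single worker process.
    def __init__(self):
        self.total = 0

    def add(self, value: int) -> int:
        self.total += value
        return self.total

counter = Counter.remote()
futures = [square.remote(i) for i in range(4)]  # runs in parallel
for result in ray.get(futures):                 # [0, 1, 4, 9]
    counter.add.remote(result)
print(ray.get(counter.add.remote(0)))           # prints 14
```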
This is not badmouthing either project, just an observation: a system architected for one kind of task will naturally be better at attacking that kind of problem.
As a layman, I imagine lots of it loses relevance very quickly; e.g., Amazon sales data from 5 years ago is only marginally useful for determining future trends and analyzing new consumer behavior regimes?
I had no idea anything at AWS had that long of an attention span.
It's funny and telling that in the end, it's all backed by CSVs in s3. Long live CSV!
We love Ray, and are excited about the awesome ecosystem of useful + scalable tools that run on it for model training and serving. We hope that Daft can complement the rest of the Ray ecosystem to enable large scale ETL/analytics to also run on your existing Ray clusters. If you have an existing Ray cluster setup, you absolutely should have access to best-in-class ETL/analytics without having to run a separate Spark cluster.
Also, on the nerdier side of things - the primitives that Ray provides give us a real opportunity to build a solid non-JVM based, vectorized distributed query engine. We're already seeing extremely good performance improvements here vs Spark, and are really excited about some of the upcoming work to get even better performance and memory stability.
This collaboration with Amazon really battle-tested our framework :) happy to answer any questions if folks have them.
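As a rough sketch of what running ETL on an existing Ray cluster with Daft can look like (the cluster address, bucket path, and column names below are assumptions for illustration, not details from the original discussion):

```python
# Hedged sketch: a Daft DataFrame query executing on a Ray cluster.
# The Ray address, S3 path, and columns are illustrative placeholders.
import daft
from daft import col

# Point Daft's execution at an existing Ray cluster instead of the local runner.
daft.context.set_runner_ray(address="ray://head-node:10001")

df = daft.read_parquet("s3://example-bucket/events/")
df = df.where(col("event_type") == "purchase").select("user_id", "amount")
df.show()
```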
E.g. Ray, Databricks, Notebooks, etc.
I did not even realize it was that viable a substitute for straight data processing tasks.
Does the sales team know about this? /jk
Like, if the current latency is ~60 minutes for 90% of updates, will it ever be better than that? Won’t it just slowly degrade until the next multi-year migration?
PS: this article was infuriating to read on iPad - it kept jumping back to the top of the page and I couldn't figure out why