September 19th, 2024

We accidentally burned through 200GB of proxy bandwidth in 6 hours

Skyvern's AI agent consumed 200GB of proxy bandwidth in six hours, costing $500, due to repeated downloads of a Google machine learning model. Solutions include local caching and URL blocking.

Read original article

FrustrationConcernCuriosity

We accidentally burned through 200GB of proxy bandwidth in 6 hours

Skyvern, an AI agent designed to automate browser workflows, experienced a significant issue when it unexpectedly consumed 200GB of proxy bandwidth in just six hours, costing approximately $500. The incident was discovered when the founder noticed a spike in failure rates and bandwidth alerts. Initial concerns about potential account abuse were dismissed after reviewing usage stats. Further investigation revealed that repeated calls to a Google URL, specifically for downloading a machine learning model, were responsible for the excessive bandwidth usage. The problem stemmed from Skyvern not persisting browser state between sessions, causing the system to repeatedly download the model. To address this, the team decided to implement two solutions: running Chrome locally to save the user data directory, which would cache the model, and blocking the specific Google URL to prevent future downloads. These measures aimed to mitigate the issue and ensure more efficient bandwidth usage moving forward.

- Skyvern consumed 200GB of proxy bandwidth in six hours, costing around $500.

- The excessive usage was due to repeated downloads of a Google machine learning model.

- The lack of persistent browser state led to continuous uncached downloads.

- Solutions included caching the model locally and blocking the problematic URL.

- The incident highlights the importance of monitoring and managing proxy bandwidth effectively.

AI crawlers need to be more respectful

Read the Docs has reported increased abusive AI crawling, leading to high bandwidth costs. They are blocking offenders and urging AI companies to adopt respectful practices and improve crawler efficiency.

iFixit CEO takes shots at Anthropic for hitting servers a million times in 24h

iFixit CEO Kyle Wiens criticized Anthropic for making excessive requests to their servers, violating terms of service. This incident highlights concerns about AI companies ignoring website policies and ethical data scraping issues.

Tracking supermarket prices with Playwright

In December 2022, the author created a price tracking website for Greek supermarkets, utilizing Playwright for scraping, cloud services for automation, and Tailscale to bypass IP restrictions, optimizing for efficiency.

We survived 10k requests/second: Switching to signed asset URLs in an emergency

Hardcover experienced a surge in Google Cloud expenses due to unauthorized access to their public storage. They implemented signed URLs via a Ruby on Rails proxy, reducing costs and enhancing security.

My Cloud Billing Screw-Up

Matt Gowie recounts a cloud billing error from a Dockerfile change that led to nearly $1000 in AWS charges. He emphasizes validating changes before deployment and suggests using a terraform module for cost management.

AI: What people are saying

The discussion around Skyvern's bandwidth issue reveals several key themes and insights.

Many commenters suggest exploring alternative proxy solutions, including unlimited bandwidth options and static residential ISP proxies.
There is a consensus on the need for better management of external dependencies, particularly regarding reliance on Google services.
Some users express skepticism about the cost of bandwidth, questioning the pricing models of cloud services.
Several comments highlight the importance of implementing measures to prevent unauthorized downloads and manage bandwidth usage effectively.
Technical misunderstandings about bandwidth and data measurement are noted, with calls for clearer definitions.

21 comments

By @patmcc - 7 months

I'm now expecting we'll see a couple things in the next few years:

1. An explosion of residential proxy networks and other stuff to circumvent blocking of cloud IP ranges, for all the various AI scraping tools to use.

2. A corresponding explosion of countermeasures to the above. Instead of blocking suspicious IPs, maybe they get a 3GB file on their request to /scrape-target.html

By @metadat - 7 months

200GB is nothing since 2018 when AT&T mass introduced their 1-gig symmetric fiber. Any single common gigabit link can run 200GB in 15 minutes.

On any gig link, over the course of 6 hours you can transmit a little more than 4TB one way.. which is 40x more.

By @omoikane - 7 months

The discussion linked in the post is from 2022, and the corresponding issue has already been fixed:

https://issues.chromium.org/issues/40220332

I wonder if there is a more recent bug related to this?

By @sam0x17 - 7 months

Gosh I regularly burn through that much just updating games in steam :D. Not proxy bandwidth of course but isn't it funny that the the line between regular usage and $$$ can be what is using the bandwidth. Or rather, isn't it funny that regular consumers expect to be able to use multiple terabytes of data for < $100/mo but the same can still be thousands in other enterprise domains

By @perks_12 - 7 months

200GB for $500? What cloud is this?

By @tristor - 7 months

I would have liked to see a bit more of 5 Whys here. It seems like a consistent lesson that startups have to learn over and over is how to manage external dependencies, and particularly the dangers of having Google as a dependency. This is new Chrom(e|ium) behavior, and it has a real cost, both for this company and for users, which may or may not be worth the ROI, but this is what happens when you have a large scale external dependency: stuff moves without your knowledge, consent, or control.

Instead of Always. Be. Closing. it should be Always. Be. Mitigating. Dependencies. for startups.

By @8organicbits - 7 months

What infrastructure is this using? Bandwidth seems pretty pricy

By @dusted - 7 months

" 200GB of proxy bandwidth was approximately $500 burned over the course of 6 hours"

The fuck ? So Internet is literally more expensive than buying a drive at amazon, paying for shipping, filling it up putting it on a truck towards a destination anywhere in the world.

By @hkon - 7 months

Literally means cloudprotection in Norwegian. Thought for a second we had gotten our own cloudflare.

By @tcfhgj - 7 months

Please, gigabyte isn't a unit of bandwidth.

Bandwidth is measured in data/time

By @bradley13 - 7 months

"We run leverage proxy networks and run headful browser instances"

Um...say what? I'm pretty broadly based in IT, and I have no idea what that means.

By @elphinstone - 7 months

Skyvern is a great name, very evocative. Typical arrogant Google, downloading trash to the user without consent.

By @olliej - 7 months

Honestly given many of these stories, $500 seems to be getting off pretty lightly.

It’s still absurd to me that many (most?) of these hosting/bandwidth providers don’t seems to allow automatic cut offs and such

By @tim_at_ping - 7 months

Hello,

A (different) proxy company owner here. This sucks! Sorry that you lost out on so much bandwidth.

Feel free to reach out to me at tim@pingproxies.com and I'd be happy to get you set up on our service and credit you with 100GB of free bandwidth to help soften the blow. I'll also be able to get you pricing alittle better than you're currently on if you are interested ;)

Within the next few months we're also releasing a bunch of tools to help stop things like this happening on our residential network such as some intelligent routing logic, spend controls and a few other things.

You may also want to look into Static Residential ISP Proxies - we charge these per IP address rather than bandwidth and they often end up more economical. We work with carriers like Spectrum, Comcast & AT&T directly to get IP addresses on their networks so they look like residential connections but host them in datacenters - this way you get 99.99%+ availability, 1G+ throughput, stable IP addresses and have unlimited bandwidth.

@ everyone else in the thread; if you run a start-up and need proxies then email me - happy to credit you with 50GB free residential bandwidth + give some advice on infra if needed.

Cheers, Tim at Ping

By @ang_cire - 7 months

Blocking Google from downloading anything onto your computer without consent is always a good idea.

By @meindnoch - 7 months

>200GB of proxy bandwidth

Gigabyte is a measure of information.

Bandwidth is information transmitted over time.

By @keepamovin - 7 months

you shouldn’t be paying by the terabyte. Colocate and just pay for the maximum throughout. Far better rates

We accidentally burned through 200GB of proxy bandwidth in 6 hours

Related

AI crawlers need to be more respectful

iFixit CEO takes shots at Anthropic for hitting servers a million times in 24h

Tracking supermarket prices with Playwright

We survived 10k requests/second: Switching to signed asset URLs in an emergency

My Cloud Billing Screw-Up

Related

AI crawlers need to be more respectful

iFixit CEO takes shots at Anthropic for hitting servers a million times in 24h

Tracking supermarket prices with Playwright

We survived 10k requests/second: Switching to signed asset URLs in an emergency

My Cloud Billing Screw-Up