We accidentally burned through 200GB of proxy bandwidth in 6 hours
Skyvern's AI agent consumed 200GB of proxy bandwidth in six hours, costing $500, due to repeated downloads of a Google machine learning model. Solutions include local caching and URL blocking.
Skyvern, an AI agent designed to automate browser workflows, experienced a significant issue when it unexpectedly consumed 200GB of proxy bandwidth in just six hours, costing approximately $500. The incident was discovered when the founder noticed a spike in failure rates and bandwidth alerts. Initial concerns about potential account abuse were dismissed after reviewing usage stats. Further investigation revealed that repeated calls to a Google URL, specifically for downloading a machine learning model, were responsible for the excessive bandwidth usage. The problem stemmed from Skyvern not persisting browser state between sessions, causing the system to repeatedly download the model. To address this, the team decided to implement two solutions: running Chrome locally to save the user data directory, which would cache the model, and blocking the specific Google URL to prevent future downloads. These measures aimed to mitigate the issue and ensure more efficient bandwidth usage moving forward.
- Skyvern consumed 200GB of proxy bandwidth in six hours, costing around $500.
- The excessive usage was due to repeated downloads of a Google machine learning model.
- The lack of persistent browser state led to continuous uncached downloads.
- Solutions included caching the model locally and blocking the problematic URL.
- The incident highlights the importance of monitoring and managing proxy bandwidth effectively.
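For readers who want to see what those two mitigations look like in practice, here is a minimal sketch using Playwright for Python (the kind of browser automation layer a tool like Skyvern sits on). The profile path, proxy address, and blocked host pattern are illustrative assumptions, not taken from Skyvern's code; note also that Chrome-internal component downloads may bypass page-level routing, in which case the URL block has to live at the proxy instead.

```python
import re
from playwright.sync_api import sync_playwright

# Placeholder pattern: the post does not name the exact Google URL.
BLOCKED_HOSTS = re.compile(r"dl\.google\.com")

with sync_playwright() as p:
    # Persistent context: the user data dir survives between sessions, so the
    # browser can cache the downloaded model instead of re-fetching it each run.
    ctx = p.chromium.launch_persistent_context(
        user_data_dir="./chrome-profile",               # illustrative path
        headless=True,
        proxy={"server": "http://proxy.example:8080"},  # illustrative proxy
    )

    # Best-effort block of the offending host for page-level requests.
    ctx.route(BLOCKED_HOSTS, lambda route: route.abort())

    page = ctx.new_page()
    page.goto("https://example.com")
    ctx.close()
```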
Related
AI crawlers need to be more respectful
Read the Docs has reported increased abusive AI crawling, leading to high bandwidth costs. They are blocking offenders and urging AI companies to adopt respectful practices and improve crawler efficiency.
iFixit CEO takes shots at Anthropic for hitting servers a million times in 24h
iFixit CEO Kyle Wiens criticized Anthropic for making excessive requests to their servers, violating terms of service. This incident highlights concerns about AI companies ignoring website policies and ethical data scraping issues.
Tracking supermarket prices with Playwright
In December 2022, the author created a price tracking website for Greek supermarkets, utilizing Playwright for scraping, cloud services for automation, and Tailscale to bypass IP restrictions, optimizing for efficiency.
We survived 10k requests/second: Switching to signed asset URLs in an emergency
Hardcover experienced a surge in Google Cloud expenses due to unauthorized access to their public storage. They implemented signed URLs via a Ruby on Rails proxy, reducing costs and enhancing security.
My Cloud Billing Screw-Up
Matt Gowie recounts a cloud billing error from a Dockerfile change that led to nearly $1000 in AWS charges. He emphasizes validating changes before deployment and suggests using a terraform module for cost management.
- Many commenters suggest exploring alternative proxy solutions, including unlimited bandwidth options and static residential ISP proxies.
- There is a consensus on the need for better management of external dependencies, particularly regarding reliance on Google services.
- Some users express skepticism about the cost of bandwidth, questioning the pricing models of cloud services.
- Several comments highlight the importance of implementing measures to prevent unauthorized downloads and manage bandwidth usage effectively.
- Technical misunderstandings about bandwidth and data measurement are noted, with calls for clearer definitions.
1. An explosion of residential proxy networks and other stuff to circumvent blocking of cloud IP ranges, for all the various AI scraping tools to use.
2. A corresponding explosion of countermeasures to the above. Instead of blocking suspicious IPs, maybe they get a 3GB file on their request to /scrape-target.html
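A toy sketch of the decoy-payload idea in point 2, assuming a hypothetical list of flagged client IPs; a real deployment would do this at the CDN or reverse-proxy layer rather than in an application server:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

SUSPICIOUS_IPS = {"203.0.113.7"}     # hypothetical flagged addresses
JUNK_CHUNK = b"x" * 65536            # 64 KiB of filler
TOTAL_JUNK_BYTES = 3 * 1024**3       # ~3 GB decoy payload

class DecoyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.client_address[0] in SUSPICIOUS_IPS:
            # Instead of blocking, stream an enormous response so the scraper
            # burns its own (metered) proxy bandwidth on junk.
            self.send_response(200)
            self.send_header("Content-Type", "application/octet-stream")
            self.end_headers()
            sent = 0
            while sent < TOTAL_JUNK_BYTES:
                try:
                    self.wfile.write(JUNK_CHUNK)
                except BrokenPipeError:
                    break
                sent += len(JUNK_CHUNK)
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<html>real content</html>")

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), DecoyHandler).serve_forever()
```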
On any gigabit link, over the course of 6 hours you can transmit roughly 2.7TB one way, which is more than 13x as much.
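For reference, the back-of-envelope math behind that comparison, assuming "gig link" means 1 Gbit/s:

```python
link_bps = 1_000_000_000              # 1 Gbit/s
seconds = 6 * 3600                    # six hours
total_bytes = link_bps * seconds / 8  # bits -> bytes

print(total_bytes / 1e12)             # ~2.7 TB moved in six hours
print(total_bytes / 200e9)            # ~13.5x the 200 GB that was burned
```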
https://issues.chromium.org/issues/40220332
I wonder if there is a more recent bug related to this?
Instead of Always. Be. Closing. it should be Always. Be. Mitigating. Dependencies. for startups.
The fuck? So the Internet is literally more expensive than buying a drive at Amazon, paying for shipping, filling it up, and putting it on a truck to a destination anywhere in the world.
Bandwidth is measured in data/time
Um...say what? I'm pretty broadly based in IT, and I have no idea what that means.
It’s still absurd to me that many (most?) of these hosting/bandwidth providers don’t seem to allow automatic cutoffs and the like.
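Absent provider-side controls, a crude client-side cutoff is straightforward to sketch. The snippet below is a generic illustration (not any particular provider's API): meter the bytes pulled through the proxy and stop issuing requests once a budget is exhausted.

```python
class BandwidthBudget:
    """Client-side kill switch: refuse further proxied transfers once a
    byte budget is used up (a stand-in for the provider-side cutoffs the
    comment wishes existed)."""

    def __init__(self, limit_bytes: int) -> None:
        self.limit = limit_bytes
        self.used = 0

    def charge(self, nbytes: int) -> None:
        self.used += nbytes
        if self.used > self.limit:
            raise RuntimeError(
                f"proxy bandwidth budget exceeded: {self.used}/{self.limit} bytes"
            )

# Usage sketch: account for every proxied download before processing it.
budget = BandwidthBudget(limit_bytes=20 * 1024**3)   # e.g. hard stop at 20 GiB
# body = fetch_via_proxy(url)    # hypothetical fetch helper
# budget.charge(len(body))
```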
A (different) proxy company owner here. This sucks! Sorry that you lost out on so much bandwidth.
Feel free to reach out to me at tim@pingproxies.com and I'd be happy to get you set up on our service and credit you with 100GB of free bandwidth to help soften the blow. I'll also be able to get you pricing a little better than what you're currently on if you are interested ;)
Within the next few months we're also releasing a bunch of tools to help stop things like this from happening on our residential network, such as intelligent routing logic, spend controls, and a few other things.
You may also want to look into Static Residential ISP Proxies - we charge these per IP address rather than by bandwidth, and they often end up more economical. We work with carriers like Spectrum, Comcast & AT&T directly to get IP addresses on their networks so they look like residential connections but host them in datacenters - this way you get 99.99%+ availability, 1G+ throughput, stable IP addresses, and unlimited bandwidth.
@ everyone else in the thread: if you run a start-up and need proxies, email me - happy to credit you with 50GB of free residential bandwidth + give some advice on infra if needed.
Cheers, Tim at Ping
A gigabyte is a measure of information (an amount of data).
Bandwidth is information transmitted over time (a rate, e.g. gigabits per second).