September 10th, 2024

My Cloud Billing Screw-Up

Matt Gowie recounts a cloud billing error in which a Dockerfile change led to nearly $1000 in AWS charges. He emphasizes validating changes before deployment and suggests using a Terraform module for cost management.

Matt Gowie shares a personal account of a significant cloud billing mistake he made while working as a solo consultant. A code change to a Dockerfile caused a container to fail on startup within an AWS ECS cluster. Each repeated attempt to pull the container image went from a private subnet out to the internet through a NAT Gateway, and over a weekend this added up to nearly $1000 in data processing charges. Fortunately, his client was understanding once he explained the situation, and he obtained a credit from AWS to cover the unexpected costs. Gowie emphasizes the importance of validating changes before deployment and shares a tip for managing cloud costs, particularly in test environments: a Terraform module designed to remove unnecessary resources.

- Matt Gowie experienced a significant cloud billing error due to a code change that caused repeated container failures.

- The incident resulted in nearly $1000 in charges from AWS for data processing over a weekend.

- Gowie was able to resolve the issue with his client and received a credit from AWS.

- He highlights the importance of validating code changes before deployment to avoid similar mistakes.

- Gowie offers a solution for managing cloud costs in test environments through a Terraform module.
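To get a feel for how repeated image pulls through a NAT Gateway reach a bill of this size, here is a rough back-of-the-envelope sketch. The per-GB price is an assumption (the us-east-1 list price for NAT Gateway data processing at the time of writing; check your own region), and the image size and pull rate in the usage note are purely hypothetical, since the summary does not state them.

```python
# Assumed price, not taken from the article: NAT Gateway data processing
# per GB in us-east-1. Verify against current AWS pricing for your region.
NAT_PROCESSING_USD_PER_GB = 0.045


def nat_processing_cost(image_size_gb, pulls_per_hour, hours,
                        usd_per_gb=NAT_PROCESSING_USD_PER_GB):
    """Estimate data processing charges for repeated image pulls via NAT."""
    total_gb = image_size_gb * pulls_per_hour * hours
    return total_gb * usd_per_gb


def gb_for_cost(cost_usd, usd_per_gb=NAT_PROCESSING_USD_PER_GB):
    """Invert the estimate: how many GB must pass through NAT for this bill."""
    return cost_usd / usd_per_gb


if __name__ == "__main__":
    # Hypothetical inputs: a 1 GB image pulled 100 times an hour for 60 hours.
    print(f"Estimated charge: ${nat_processing_cost(1.0, 100, 60):.2f}")
    print(f"GB behind a $1000 bill: {gb_for_cost(1000):.0f}")
```

Under the assumed price, a $1000 bill corresponds to roughly 22,000 GB of traffic through the gateway, which a crash-looping container can plausibly reach over a weekend if the image is large and the retries are frequent.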

Is Cloudflare overcharging us for their images service?

Jérôme Petazzoni reported unexpectedly high charges for Cloudflare's Images service, exceeding $400 instead of the anticipated $110, due to confusing billing practices. He is considering alternatives like Amazon S3.

How to save $13.27 on your SaaS bill

The author discusses managing costs with Vercel's analytics, converting images to reduce charges, and building a custom API using SQLite. They faced deployment challenges but plan future enhancements.

How HashiCorp evolved its cloud infrastructure

Michael Galloway discusses HashiCorp's cloud infrastructure evolution, emphasizing the need for clear objectives, deadlines, and executive buy-in to successfully redesign and expand their services amid growing demands.

We survived 10k requests/second: Switching to signed asset URLs in an emergency

Hardcover experienced a surge in Google Cloud expenses due to unauthorized access to their public storage. They implemented signed URLs via a Ruby on Rails proxy, reducing costs and enhancing security.

Admins wonder if the cloud was such a good idea after all

Many organizations find cloud services from major providers have not met cost-saving expectations, with significant price increases attributed to rising electricity and labor costs, prompting calls for better ROI assessments.

11 comments

By @aliasxneo - 8 months

At one company, we used Grafana Cloud for the full monitoring stack. They charge by unique Prometheus series for metrics. I wrote a rather small Go API to allow users to access some otherwise hidden data and added rate-limiting because the data itself was large. To figure out costs, I wanted to add monitoring to the handlers, so I added a middleware that caught all requests and logged things like the request path, response time, etc.

Sounds perfectly fine until you realize the internet is a vast space full of people constantly scraping. I too left it over the weekend and came back to 70k unique series in our cloud account, pushing the bill well over $1k. What's worse is that Grafana is kind enough to not charge for these spikes if you catch them within 48 hours. I caught it approx 50 hours later.

Like the OP, though, Grafana was nice enough to make it fall off after I explained the situation. Lesson learned!
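The series explosion described above is what happens when a raw request path becomes a metric label: every scraped URL creates a new series. A minimal sketch of one common mitigation follows, written in Python rather than the commenter's Go, and without any metrics library. The route table and the `"other"` bucket are hypothetical; the idea is only that labels are drawn from a small fixed set.

```python
import re

# Hypothetical route table: the only paths this API is assumed to serve.
KNOWN_ROUTES = [
    (re.compile(r"^/items/\d+$"), "/items/{id}"),
    (re.compile(r"^/items$"), "/items"),
    (re.compile(r"^/health$"), "/health"),
]


def route_label(path):
    """Map a raw request path to a bounded label; unknown paths share one bucket."""
    for pattern, label in KNOWN_ROUTES:
        if pattern.match(path):
            return label
    return "other"


def count_series(paths, labeler):
    """Count the distinct label values (i.e. series) a set of requests would create."""
    return len({labeler(p) for p in paths})


if __name__ == "__main__":
    scraped = ["/items/1", "/items/2", "/wp-login.php", "/.env", "/health"]
    print("raw paths as labels:", count_series(scraped, lambda p: p))
    print("normalized labels:  ", count_series(scraped, route_label))
```

With normalization, the number of series is capped by the size of the route table plus one, no matter how many distinct URLs scrapers request.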

By @miningape - 8 months

I was in my first year of uni, coding an online multiplayer game in C++, and I wanted to test play with one of my friends, mostly for shits and giggles to see how badly it would break.

I had deployed basic websites / servers with more managed platforms before, but I needed (?) more control to be able to host the C++ server.

So I found GCP, created a docker image, and got the server up and running somehow. We played for maybe 10 minutes before we ran out of stuff to do, and stopped playing. What I didn't realise at the time was that auto-scaling was a concept. I thought that when there was no traffic the server wouldn't run, and I forgot I had ever deployed it.

Anyways, a month later I got a $400 bill, not nearly as much as some people have lost but for a broke college student it was a lot - especially considering I only used it for 10 minutes.

By @ezekg - 8 months

One time I was a happy customer of Raygun for error monitoring, and I enabled Raygun's new-at-the-time APM product on my API to monitor performance. I can't remember if the high sample-rate was the default or if I was just dumb, but I had a really high sample-rate configured. I was super happy to get so much visibility into my application's performance, especially query performance! However, I wasn't happy when I saw I racked up a $14k bill in just a few days.

Thankfully, they forgave the bill (thanks jdt!), but it still scared me. I was still at the point back then where a bill like that could've killed my company, or at the very least got me into a lot of trouble.

After that, I pretty much ruled out usage-based billing for my company as too risky. This was quite a few years ago, but to this day I still have no major dependencies that offer usage-based billing.

By @hdjjhhvvhga - 8 months

I see a couple of problems here:

1. "It was late and I was done for the evening so I didn't validate the change." - if I had to explain my value as a DevOps engineer in one sentence, it would be putting these safety pins in place. You shouldn't need to validate anything by hand - it should be part of the pipeline.

2. AWS is using extortion fees for things like NAT Gateway processing, egress traffic etc. Knowing that, and being aware that container images need to be pulled frequently, it does make sense to use ECR or any other internally hosted container registry. If you don't do that, you will spend that $1000 anyway, just over a longer period than a weekend.

3. Any changes on Friday evening - just don't.

By @m_ke - 8 months

I wasted two days a few months ago when Amazon randomly started charging us $1K a day for static S3 buckets with a few GB of data that hadn't been touched in like 4 months.

Turned out it was a billing error on their side that they would have probably completely ignored if we didn't notice it.

By @hdjjhhvvhga - 8 months

This is not really a screw-up story, this is a "NAT Gateway is a racket" story.

By @delduca - 8 months

One time I had the brilliant idea of writing recursive cloud functions, and boom! $4,000 in just a few hours due to a bug.

Luckily, my credit card had a reduced limit, and later Google Cloud forgave the debt as long as I promised not to do it again.
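The comment does not say what the bug was, but a common safeguard for functions that invoke themselves is to carry a hop counter in the payload and refuse to continue past a limit. The sketch below is generic Python, not tied to any cloud SDK; `MAX_DEPTH`, the `depth` field, and `next_event` are hypothetical names chosen for illustration.

```python
MAX_DEPTH = 3  # hypothetical limit; pick one that fits the real workload


class RecursionLimitExceeded(Exception):
    """Raised instead of scheduling yet another self-invocation."""


def next_event(event):
    """Build the payload for a follow-up invocation, incrementing the hop counter.

    Raises RecursionLimitExceeded once the chain is deeper than MAX_DEPTH,
    so a buggy handler fails loudly instead of fanning out indefinitely.
    """
    depth = event.get("depth", 0) + 1
    if depth > MAX_DEPTH:
        raise RecursionLimitExceeded(f"refusing to recurse past depth {MAX_DEPTH}")
    return {**event, "depth": depth}


if __name__ == "__main__":
    event = {"job": "resize"}
    try:
        while True:
            event = next_event(event)
            print("invoking with depth", event["depth"])
    except RecursionLimitExceeded as exc:
        print("stopped:", exc)
```

A guard like this limits the blast radius of a bug; it complements, rather than replaces, a billing alert or budget cap on the account.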

By @DataDaemon - 8 months

Strange. My Hetzner never Screw-Up.

By @minkles - 8 months

I can't wait until we're all on IPv6 then NGW can go away!

By @eschneider - 8 months

Yeah... If age has taught me anything, it's that it's rarely a great idea to push changes at the end of the day. There's always that motivation to shortcut the validation a bit, and every so often... this happens. If schedules permit, I find giving changes a final look first thing in the morning lets me catch things before they get committed.
