September 10th, 2024

My Cloud Billing Screw-Up

Matt Gowie recounts a cloud billing error from a Dockerfile change that led to nearly $1000 in AWS charges. He emphasizes validating changes before deployment and suggests using a Terraform module for cost management.

Matt Gowie shares a significant cloud billing mistake he made while working as a solo consultant. A change to a Dockerfile caused a container to fail on startup in an AWS ECS cluster. Because of the repeated attempts to pull the container image from a private subnet over the internet through a NAT Gateway, he incurred nearly $1000 in data processing charges over a weekend. Fortunately, his client was understanding when he explained the situation, and he obtained a credit from AWS to cover the unexpected costs. Gowie emphasizes the importance of validating changes before deployment and shares a tip for managing cloud costs, particularly in test environments: a Terraform module designed to remove unnecessary resources.

- Matt Gowie experienced a significant cloud billing error due to a code change that caused repeated container failures.

- The incident resulted in nearly $1000 in charges from AWS for data processing over a weekend.

- Gowie was able to resolve the issue with his client and received a credit from AWS.

- He highlights the importance of validating code changes before deployment to avoid similar mistakes.

- Gowie offers a solution for managing cloud costs in test environments through a Terraform module.
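
For a rough sense of scale, the arithmetic behind a bill like this can be sketched as below. The per-GB rate is an assumption (close to the published NAT Gateway data processing price in some US regions, but check current AWS pricing), and the image size and pull frequency are hypothetical numbers chosen for illustration, not figures from the article.

```python
# Rough estimate of NAT Gateway data processing charges caused by a
# container that keeps failing and re-pulling its image.
# All constants are illustrative assumptions, not figures from the article.

NAT_PRICE_PER_GB = 0.045  # assumed USD per GB processed; check current AWS pricing


def nat_processing_cost(image_gb: float, pulls_per_hour: float, hours: float,
                        price_per_gb: float = NAT_PRICE_PER_GB) -> float:
    """Return the estimated data processing charge in USD."""
    total_gb = image_gb * pulls_per_hour * hours
    return total_gb * price_per_gb


def implied_traffic_gb(bill_usd: float,
                       price_per_gb: float = NAT_PRICE_PER_GB) -> float:
    """Return how many GB must cross the NAT Gateway to produce a given bill."""
    return bill_usd / price_per_gb


if __name__ == "__main__":
    # Hypothetical: a 2 GB image pulled 180 times an hour for 60 hours.
    print(round(nat_processing_cost(2.0, 180, 60), 2))
    # Traffic implied by a $1000 bill at the assumed rate.
    print(round(implied_traffic_gb(1000)))
```

At the assumed rate, a four-figure bill implies tens of terabytes crossing the gateway, which is plausible only when something is pulling in a tight loop.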

11 comments
By @aliasxneo - 8 months
At one company, we used Grafana Cloud for the full monitoring stack. They charge by unique Prometheus series for metrics. I wrote a rather small Go API to allow users to access some otherwise hidden data and added rate-limiting because the data itself was large. To figure out costs, I wanted to add monitoring to the handlers, so I added a middleware that caught all requests and logged things like the request path, response time, etc.

Sounds perfectly fine until you realize the internet is a vast space for people constantly scraping. I too left it over the weekend and came back to 70k unique series in our cloud account, pushing the bill well over $1k. What's worse is that Grafana is kind enough to not charge for these spikes, if you catch them before 48hrs. I caught it approx 50 hours later.

Like the OP, though, Grafana was nice enough to make it fall off after I explained the situation. Lesson learned!
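
One common way to keep the series count bounded in a situation like this is to map raw request paths onto a small fixed set of labels before recording a metric. The sketch below is not the commenter's code (their API was written in Go); the route table and the `path_label` helper are hypothetical names used only for illustration.

```python
import re

# Hypothetical route table for illustration; a real service would list its own routes.
KNOWN_ROUTES = [
    (re.compile(r"^/users/\d+$"), "/users/{id}"),
    (re.compile(r"^/health$"), "/health"),
]


def path_label(raw_path: str) -> str:
    """Map a raw request path to a label drawn from a small, fixed set."""
    path = raw_path.split("?", 1)[0]  # drop any query string
    for pattern, label in KNOWN_ROUTES:
        if pattern.match(path):
            return label
    # Scrapers probing random URLs all collapse into a single series.
    return "other"


if __name__ == "__main__":
    paths = ["/users/1", "/users/2?x=1", "/wp-login.php", "/.env", "/health"]
    print(sorted({path_label(p) for p in paths}))
```

With this shape, the number of distinct label values is capped at the number of known routes plus one, no matter how many unique URLs are requested.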

By @miningape - 8 months
I was in my first year of uni, I was coding an online multiplayer game in C++ and I wanted to test play with one of my friends mostly for shits and giggles to see how badly it would break.

I had deployed basic websites / servers with more managed platforms before, but I needed? more control to be able to host the C++ server.

So I found GCP, created a docker image, and got the server up and running somehow. We played for maybe 10 minutes before we ran out of stuff to do, and stopped playing. What I didn't realise at the time was that auto-scaling was a concept. I thought when there was no traffic then the server wouldn't work, and I forgot I ever deployed it.

Anyways, a month later I got a $400 bill, not nearly as much as some people have lost but for a broke college student it was a lot - especially considering I only used it for 10 minutes.

By @ezekg - 8 months
One time I was a happy customer of Raygun for error monitoring, and I enabled Raygun's new-at-the-time APM product on my API to monitor performance. I can't remember if the high sample-rate was the default or if I was just dumb, but I had a really high sample-rate configured. I was super happy to get so much visibility into my application's performance, especially query performance! However, I wasn't happy when I saw I racked up a $14k bill in just a few days.

Thankfully, they forgave the bill (thanks jdt!), but it still scared me. I was still at the point back then where a bill like that could've killed my company, or at the very least got me into a lot of trouble.

After that, I pretty much ruled out usage-based billing for my company as too risky. This was quite a few years ago, but to this day I still have no major dependencies that offer usage-based billing.
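
To illustrate why the sample rate matters so much under usage-based pricing, here is a minimal sketch of head-based sampling and the volume it produces. The function names, request counts, and rates are hypothetical; nothing here reflects Raygun's actual API or pricing.

```python
import random


def should_sample(sample_rate: float, rng: random.Random) -> bool:
    """Decide whether to record a trace for one request."""
    return rng.random() < sample_rate


def expected_traces(requests: int, sample_rate: float) -> float:
    """Return the expected number of recorded (and billed) traces."""
    return requests * sample_rate


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs behave the same
    requests = 100_000      # hypothetical traffic volume
    for rate in (1.0, 0.01):
        recorded = sum(should_sample(rate, rng) for _ in range(requests))
        print(rate, recorded, expected_traces(requests, rate))
```

Billed volume scales linearly with the rate, so going from a rate of 1.0 to 0.01 cuts recorded traces by roughly a factor of 100.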

By @hdjjhhvvhga - 8 months
I see a couple of problems here:

1. "It was late and I was done for the evening so I didn't validate the change." - if I could use one sentence to explain my value as a DevOps engineer, it would be putting these safety pins in place. You shouldn't need to validate anything - it should be part of the pipeline.

2. AWS is using extortion fees for things like NAT Gateway processing, egress traffic etc. Knowing that, and being aware that container images need to be pulled frequently, it does make sense to use ECR or any other internally hosted container registry. If you don't do that, you will spend that $1000 anyway, just over a longer period than a weekend.

3. Any changes on Friday evening - just don't.

By @m_ke - 8 months
I wasted two days a few months ago when Amazon randomly started charging us $1K a day for static s3 buckets with a few GB of data that hasn't been touched in like 4 months.

Turned out it was a billing error on their side that they would have probably completely ignored if we didn't notice it.

By @hdjjhhvvhga - 8 months
This is not really a screw-up story, this is a "NAT Gateway is a racket" story.

By @delduca - 8 months
One time I had the brilliant idea of writing recursive cloud functions, and boom! $4,000 in just a few hours due to a bug.

Luckily, my credit card had a reduced limit, and later Google Cloud forgave the debt as long as I promised not to do it again.
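
A common guard against this kind of runaway is to carry a hop counter in the event payload and refuse to re-invoke past a ceiling. The sketch below is a generic illustration, not Google Cloud's API: the `hops` field, `MAX_HOPS`, and the `invoke` callback are all hypothetical.

```python
MAX_HOPS = 5  # assumed ceiling; pick one that fits the workload


class HopLimitExceeded(Exception):
    """Raised when an event has been re-triggered too many times."""


def handle(event: dict, invoke) -> None:
    """Process an event and re-invoke, refusing once MAX_HOPS is reached."""
    hops = event.get("hops", 0)
    if hops >= MAX_HOPS:
        raise HopLimitExceeded(f"stopping after {hops} hops")
    # ... real work would go here ...
    # Pass the incremented counter along with the next invocation.
    invoke({**event, "hops": hops + 1})


if __name__ == "__main__":
    calls = []

    def buggy_invoke(event: dict) -> None:
        # Simulates the bug: every invocation triggers another one.
        calls.append(event)
        handle(event, buggy_invoke)

    try:
        handle({}, buggy_invoke)
    except HopLimitExceeded as exc:
        print(len(calls), exc)
```

The counter turns an unbounded chain into a bounded one, so a bug costs a handful of invocations instead of hours of them.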

By @DataDaemon - 8 months
Strange. My Hetzner never Screw-Up.

By @minkles - 8 months
I can't wait until we're all on IPv6 then NGW can go away!

By @eschneider - 8 months
Yeah... If age has taught me anything, it's that it's rarely a great idea to push changes at the end of the day. There's always that motivation to shortcut the validation a bit and every so often... this happens. If schedules permit, I find giving changes a final look first thing in the morning lets me catch things before they get committed.