July 12th, 2024

We saved $5k a month with a single Grafana query

After migrating from AWS ECS to Kubernetes, Checkly's platform team cut pod startup times by 300ms by optimizing their AWS SDK usage. This led to a 25% decrease in pod usage, saving roughly $5.5k monthly.


In 2024, Checkly's platform team set out to cut costs by reducing compute time. Having transitioned from AWS ECS to Kubernetes pods, they focused on improving startup times. By analyzing CPU time through log lines, they identified inefficiencies in their AWS SDK usage. Switching to modular AWS SDK v3 packages and aligning SDK dependencies significantly reduced pod startup times, and rearranging the order of operations improved performance further. Leveraging Grafana Loki for log analysis, they turned unstructured logs into actionable metrics. Together these optimizations cut pod startup times by 300ms, resulting in a 25% decrease in pod usage and saving approximately $5.5k monthly. The journey showcased the value of hands-on code optimization and the impact of minor adjustments on overall efficiency and cost.
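The post's code isn't reproduced here, but the two changes the summary describes are easy to sketch. Below is a minimal, hypothetical TypeScript sketch (the package choice, region, log shape, and names are illustrative assumptions, not Checkly's actual code) of importing a single modular AWS SDK v3 client instead of the monolithic v2 bundle, and logging the startup duration as a structured line that a Grafana Loki query could later turn into a metric:

```typescript
// Hypothetical sketch: load one modular AWS SDK v3 client instead of the whole v2 SDK.
// Before (v2, monolithic): `import AWS from "aws-sdk"` pulls in every service client.
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

const startedAt = Date.now();
const s3 = new S3Client({ region: "eu-west-1" }); // region is an illustrative value

// Emit a structured log line with the measured startup cost. A Loki/LogQL query
// can parse the JSON and aggregate `startup_ms` into a metric over time.
console.log(JSON.stringify({ msg: "aws-client-ready", startup_ms: Date.now() - startedAt }));

export async function fetchObject(bucket: string, key: string) {
  // Only the S3 command classes actually used get parsed at startup.
  return s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
}
```

Per the summary, Grafana Loki was what turned unstructured log lines like this into actionable startup metrics; the exact LogQL query from the article is not reproduced here.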

Related

Cubernetes

Justin Garrison built "Cubernetes," a visually appealing Kubernetes hardware lab for training and content creation. The $6310 setup included unique parts like Mac Cube cases and LP-179 computers with Intel AMT support. Creative solutions like 3D printing and magnetic connectors were used. Lights were controlled by attiny85 and Raspberry Pi Pico for visualizations. The project prioritized functionality and education.

Meta Sees ~5% Performance Gains to Optimizing the Linux Kernel with Bolt

Meta uses BOLT to optimize the Linux kernel's binary layout, yielding roughly a 5% performance boost. The benefit varies with how the kernel is used; workloads like databases and networking benefit most. Engineer Maksim Panchenko shares an optimization guide.

From Cloud Chaos to FreeBSD Efficiency

A client shifted from expensive Kubernetes setups on AWS and GCP to cost-effective FreeBSD jails and VMs, improving control, cost savings, and performance. Real-world tests favored FreeBSD over cloud solutions, emphasizing efficient resource management.

How we tamed Node.js event loop lag: a deepdive

Trigger.dev team resolved Node.js app performance issues caused by event loop lag. Identified Prisma timeouts, network congestion from excessive traffic, and nested loop inefficiencies. Fixes reduced event loop lag instances, aiming to optimize payload handling for enhanced reliability.
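The blurb doesn't show how the lag was measured; as a point of reference, Node.js exposes an event loop delay histogram in `perf_hooks`. A minimal sketch follows (the sampling resolution, reporting interval, and 100ms threshold are arbitrary choices, not taken from the Trigger.dev post):

```typescript
// Minimal sketch using Node's built-in event loop delay histogram.
import { monitorEventLoopDelay } from "node:perf_hooks";

const histogram = monitorEventLoopDelay({ resolution: 20 }); // sample every 20ms
histogram.enable();

// Periodically report lag; histogram values are in nanoseconds, so convert to ms.
setInterval(() => {
  const meanMs = histogram.mean / 1e6;
  const p99Ms = histogram.percentile(99) / 1e6;
  if (p99Ms > 100) {
    console.warn(`event loop lag: mean=${meanMs.toFixed(1)}ms p99=${p99Ms.toFixed(1)}ms`);
  }
  histogram.reset();
}, 10_000);
```

Sustained high percentiles here usually point at synchronous work, such as the nested loops mentioned above, blocking the loop.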

Runs-on: Self hosted, on-premise, GitHub Action runners

A new self-hosted runner solution, RunsOn, integrates with AWS to offer GitHub Actions users cost-effective and efficient CI/CD processes. Users benefit from faster builds, cost savings up to 80%, and customizable runner specifications.

10 comments
By @Ekrekr - 4 months
I really enjoyed this read!

One thing that wasn't clear to me: if running npm to install dependencies on pod startup is slow, why not pre-build an image with the dependencies already installed and deploy that instead?

By @mrits - 4 months
Without proper telemetry and performance metrics, you'll get to do this again in a few more months.
By @throwthrow5643 - 4 months
The 'one weird trick' could've been spotted in a graphical bundle analyser. But are they not caching npm packages somewhere? It seems like an awful waste downloading from the npm registry over and over. I would think it was parsing four different versions of the AWS SDK that was so slow.
By @roboben - 4 months
Sadly, Grafana (Cloud) comes at a cost too. Does anyone else struggle with this horrible active-metrics-based pricing? It's not only Grafana Cloud; others price it that way too.

We moved shitloads to self hosted Thanos. While this comes with its own drawbacks obv, I think it was worth it.

By @zug_zug - 4 months
I'm really surprised that 300ms at startup would result in 25% fewer pods.... What % reduction in the total startup time is that?

Is it possible the prior measurement happened during a high traffic period and the post measurement happened in a low traffic period?

By @sebstefan - 4 months
I really don't understand spinning up a whole pod just for a request

Wouldn't it be cheaper to just keep a pod up with a service running?

If scalability is an issue, just plop a load balancer in front of it and scale them up with load, but surely you can't need a whole pod for every single one of those millions of requests, right?

> Checkly is a synthetic monitoring tool that lets teams monitor their API’s and sites continually, and find problems faster.

> With some users sending *millions of requests a day*, that 300ms added up to massive overall compute savings

No shit, right?

By @BobbyTables2 - 4 months
I do not understand how cloud proponents talk about the costs of self-hosting but then get into situations like this.

Spending serious engineering time to wrangle with the complexities of cloud orchestration is not something that should be taken lightly.

Cloud services should be required to have a black-box Surgeon’s General warning.

By @dxbydt - 4 months
Many of the tricks we learned in the late 90s - 2000s can no longer be pulled off. We used to download jar files over the net. Running a major prop trading platform meant 1000s of dependencies. You'd have Swing and friends for front-end tables, SAX XML parsers, various numerical libraries, logging modules - all of this shit downloaded in the jar while the customer impatiently waited to trade some 100MM worth of FX. We learned how to cut down on dependencies. Built tools to massively compress class files. Traded off one big jar for lots of little jars that downloaded on demand. Better yet, cached most of these jars so they wouldn't need to download every single time. It became a fine art at one point - the difference between a rookie and a professional was that the latter could not just write a spiffy Java frontend, but actually deploy it in prod so customers wouldn't even know there was a startup time - it would just start, like, instantly. Then that whole industry just vanished overnight - poof!

Now I write ML code and deploy it in a Docker container on GCP, and it's the same issues all over again. You import pandas-gbq and pretty much the entire Google BigQuery set of libraries becomes part of the build. Throw in a few standard ML libs and soon you're looking at upwards of 2 seconds of Cloud Run startup time. You pay a premium for autoscaling, for keeping one instance warm at all times, for your monitoring and metrics, on and on. I have yet to see startup times below 500ms. You can slice the cake any which way, you still pay the startup cost penalty. Quite sad.