September 17th, 2024

We are self-hosting our GPUs

Gumlet has shifted from cloud GPU rentals to self-hosting due to increased demand, building custom machines for cost efficiency and planning future expansion to a dedicated data center.

Read original article

Gumlet, a SaaS provider for image and video processing, has transitioned from cloud-based GPU rentals to self-hosting their GPUs due to an eightfold increase in demand since January 2024. The company began utilizing GPUs for video processing in 2020, recognizing their superior performance compared to CPUs. After evaluating their options, Gumlet built custom machines using commodity hardware, including an AMD 5700x processor and Nvidia RTX 4000 ADA SFF GPUs, which are suitable for both video encoding and machine learning tasks. The total cost of each machine is approximately $2,300, significantly lower than the monthly rental cost of comparable cloud services. Initially, they faced challenges in sourcing the required GPUs, which were not readily available in India, but managed to import them through a supplier. Instead of using a traditional data center, Gumlet opted to host their servers at a WeWork co-working space, which provided necessary infrastructure at a lower cost. They utilize Kubernetes for deployment and Grafana for monitoring, ensuring efficient management of their operations. Looking ahead, Gumlet plans to expand their server capacity and may eventually move to a dedicated data center.

- Gumlet has shifted from cloud GPU rentals to self-hosting due to increased demand.

- Custom-built machines were created using commodity hardware for cost efficiency.

- The Nvidia RTX 4000 ADA SFF GPU was chosen for its performance and compatibility.

- Hosting is done at a WeWork location, providing necessary infrastructure at a lower cost.

- The company plans to expand its server capacity and may transition to a dedicated data center in the future.

Show HN: We made glhf.chat – run almost any open-source LLM, including 405B

The platform allows running various large language models via Hugging Face repo links using vLLM and GPU scheduler. Offers free beta access with plans for competitive pricing post-beta using multi-tenant model running.

Tensorfuse (YC W24) Is Hiring

Tensorfuse, a Y Combinator-backed startup in Bengaluru, seeks a Systems Engineer to develop a serverless GPU runtime. The role offers ₹2M - ₹3M salary, requiring skills in Rust or Go and Kubernetes.

Show HN: Attaching to a Virtual GPU over TCP

Thunder Compute provides a flexible, cost-efficient cloud-based GPU service with instant scaling, pay-per-use billing, high utilization rates, and strong security, benefiting enterprises by minimizing idle GPU time.

We're Cutting L40S Prices in Half

Fly.io has reduced L40S GPU prices to $1.25 per hour, targeting developers for AI workloads. The L40S offers A100-like performance, focusing on inference tasks and integrating with fast networking and storage.

dstack (K8s alternative) adds support for AMD accelerators on RunPod

dstack has introduced support for AMD accelerators on RunPod, enabling efficient AI container orchestration with MI300X GPUs, which offer higher VRAM and memory bandwidth, enhancing model deployment capabilities.

19 comments

By @godelski - 7 months

As a ML person who's also worked on HPC stuff, you will most certainly save money by doing this and there are plenty of benefits. It is generally a good idea, but there is a bit more barrier to entry and you need in house expertise.

So important piece of advice. If you can, hire an admin with HPC experience. If you can't, find ML people with HPC experience. Things you can ask about are slurm, environment modules (this clear sign!), what a flash buffer is, zfs, what they know about pytorch DDP, their linux experience, if they've built a cluster before, adminning linux, and so on. If you need a test, ask them to write a simple bash script to run some task and see if everything has functions and if they know how to do variable defaults. With these guys, they won't know everything but they'll be able to pick up the slack and probably enjoy it. As long as you have more than one. Adminning is a shitty job so if you only have one they'll hate their life.

There are plenty of ML people who have this experience[0], and you'll really reap rewards for having a few people with even a bit of this knowledge. Without this knowledge it is easy to buy the wrong things or have your system run far from efficient and end up with frustrated engineers/researchers. Even with only a handful of people running experiments schedulers (like slurm) still have huge benefits. You can do more complicated sweeps than wandb, batch submit jobs, track usage, allocate usage, easily cut up your nodes or even a single machine into {dev,prod,train,etc} spaces, and much more. Most importantly, a scheduler (slurm) will help prevent your admin from quitting as it'll help prevent them from going into a spiral of frustration.

[0] At least in my experience these tend to be higher quality ML people too, but not always. I think we can infer why there would be a correlation (details).

By @perryh2 - 7 months

> We however found that our co-working space - WeWork has an excellent server hosting solution. We could put the servers on the same floor as our office and they would provide redundant power supply, cooling and internet connection. This entire package is available at a much cheaper rate and we immediately jumped on this. Right now all servers are securely running in our office.

Nice! How much does this cost?

By @CommieBobDole - 7 months

I think generally the benefit of cloud is either where your demands are very elastic, or if you are essentially a fractional user - a single server or GPU would be overkill.

Once you have heavy and/or unconventional compute needs, it's likely cheaper to self-host or colo purchased hardware.

By @ThinkBeat - 7 months

This does not make sense to me.

They are processing 2.5 Billion images and videos in a single day. They decided to self host their GPUs.

The solution uses off-the-shelf hardware, with GPU per "server", add it all together into a single rack? And that is the GPU compute needed to process all the videos 24/7?

Then they have this rack in the office, but they cant find a place to put it. That might be a decent thing to start out with, before the build. Where do we put it?

But no. Planning for multiple network links, multiple redundant power, cooling, security, monitoring, and backup generators, handling backups, fire suppression, and failover to a different region if something fails was not necessary.

Because Google book?

But our (insert ad here) WeWork let us put our servers in a room on the same floor, (their data centerish capabilities seem limited)

There are so many additional costs that are not factored into the article.

I am sure once they accrue serious downtime a few times and irate customers, then paying for hosting in a proper data center might start making sense.

Now I am basing this comment on the assumption that the company is providing continuous real-time operations for their clients.

If it is more batch operated, where downtime is fine as long as results are delivered let us say within 12 hours.

By @dangoodmanUT - 7 months

How did you expose the servers to the internet, if at all?

I'd personally have these on tailscale, not exposed to the internet, but at some point in self hosting, clients have to be able to talk to something.

I know tailscale has their endpoints but I can't expect this to be able to server a production API at scale.

By @rorra - 7 months

It would be nice if you can add numbers, like what would be the cost in your cloud provider, what was the total investment made, how much are you saving, which other options did you have in mind and why were discarded Still it was a nice post to read

By @teaearlgraycold - 7 months

At my last job we did the same thing but for AI training hardware. It was definitely the right call cost-wise, with our little cluster breaking even after 8 months. We found a cheap data center in Texas.

By @not_your_vase - 7 months

  > AMD 5700x processor

I find it to be an odd choice. I mean the CPU itself is perfectly fine (typing this myself on a 5600G, which I very much like), but AM4 socket is pretty much over - there is no upgrade path anymore once it starts getting long on the tooth. (Unlike the other parts, which can be bumped: RAM, GPU, storage...)

By @p0w3n3d - 7 months

Shouldn't they be named VPU (vector processing units) as they are no longer to produce graphics?

By @rurban - 7 months

We also do, and you'd need to add a couple more zero's for the cost. For administration it paid out that I'm a trained architect, because all the work is in cooling the room. Lots of temperature shielding and air and water flow, monitors, ...

By @kendallgclark - 7 months

https://www.stardog.com/blog/skathe-is-a-private-gpu-cloud/

By @rkwasny - 7 months

RTX 4000 ADA? That's a very under powered card: https://github.com/mag-/gpu_benchmark

By @BonoboIO - 7 months

Hetzner has RTX 4000 for 185€ per month. Is your solution cheaper?

By @qmarchi - 7 months

Tangential to the post:

Was going to toss an application your way since it sounds like interesting work, but it looks like the Google Form on your Careers page was deleted.

By @drio - 7 months

Do you mind sharing the details of the rack mount you use?

By @LarsDu88 - 7 months

How many GPU servers are we talking about here exactly?

By @erichileman - 7 months

Why not run something like 8 x L40's for $4,750 a month from a bare metal provider like latitude.sh? This seems far more cost efficient and flexible.

By @briandilley - 7 months

I skimmed to the part about "We host it in our WeWork office" and thought WTF?

We are self-hosting our GPUs

Related

Show HN: We made glhf.chat – run almost any open-source LLM, including 405B

Tensorfuse (YC W24) Is Hiring

Show HN: Attaching to a Virtual GPU over TCP

We're Cutting L40S Prices in Half

dstack (K8s alternative) adds support for AMD accelerators on RunPod

Related

Show HN: We made glhf.chat – run almost any open-source LLM, including 405B

Tensorfuse (YC W24) Is Hiring

Show HN: Attaching to a Virtual GPU over TCP

We're Cutting L40S Prices in Half

dstack (K8s alternative) adds support for AMD accelerators on RunPod