August 23rd, 2024

Benchmarks show even an old Nvidia RTX 3090 is enough to serve LLMs to thousands

An analysis by Backprop shows the Nvidia RTX 3090 can effectively serve large language models to thousands of users, sustaining 12.88 tokens per second per request with 100 concurrent requests.

Read original article

A recent analysis by the Estonian GPU cloud startup Backprop shows that an older Nvidia RTX 3090 graphics card can effectively serve large language models (LLMs) to thousands of users. The RTX 3090, which debuted in late 2020, handled 100 concurrent requests against the Llama 3.1 8B model while sustaining 12.88 tokens per second per request, slightly above the average human reading speed and meeting the minimum acceptable generation rate for AI chatbots. Because only a small fraction of users make requests at any given moment, Backprop argues a single RTX 3090 could support thousands of end users. The card's 24GB of memory prevents it from running larger models, but it remains a viable option for smaller ones. The analysis also notes that quantizing models could further raise throughput, though at a possible cost in accuracy. Backprop's findings challenge the notion that only high-end enterprise GPUs are needed to scale LLM serving, suggesting that older consumer-grade hardware can be sufficient for many applications. The company is also exploring A100 PCIe cards for users who need more performance.
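
The headline figure is per-request throughput under load, which is straightforward to sanity-check against any OpenAI-compatible serving stack. The following is a minimal sketch only; the endpoint URL, model id, prompt, and token limit are assumptions rather than Backprop's actual benchmark harness:

```python
# Minimal reproduction sketch: fire N concurrent completion requests at an
# OpenAI-compatible endpoint (e.g. a local vLLM server) and report the mean
# per-request token throughput. Endpoint, model id, prompt, and max_tokens
# are placeholder assumptions, not Backprop's actual harness.
import asyncio
import time

import aiohttp

ENDPOINT = "http://localhost:8000/v1/completions"   # assumed local server
MODEL = "meta-llama/Llama-3.1-8B-Instruct"           # assumed model id
CONCURRENCY = 100
MAX_TOKENS = 256


async def one_request(session: aiohttp.ClientSession) -> float:
    """Send one completion request and return its tokens-per-second rate."""
    payload = {
        "model": MODEL,
        "prompt": "Explain what a GPU does, in two sentences.",
        "max_tokens": MAX_TOKENS,
    }
    start = time.perf_counter()
    async with session.post(ENDPOINT, json=payload) as resp:
        data = await resp.json()
    elapsed = time.perf_counter() - start
    return data["usage"]["completion_tokens"] / elapsed


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        rates = await asyncio.gather(
            *(one_request(session) for _ in range(CONCURRENCY))
        )
    print(f"mean per-request throughput: {sum(rates) / len(rates):.2f} tokens/s")


asyncio.run(main())
```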

- An Nvidia RTX 3090 can serve LLMs to thousands of users effectively.

- The card sustained 12.88 tokens per second per request with 100 concurrent requests.

- It is suitable for smaller models due to its 24GB memory limit.

- Quantization can improve throughput but may reduce accuracy.

- Backprop's findings challenge the need for high-end enterprise GPUs for LLM scaling.
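
On the quantization point, serving stacks such as vLLM can load a pre-quantized checkpoint directly; the sketch below illustrates the idea, with the specific model repository and settings as assumptions rather than anything Backprop tested:

```python
# Minimal sketch of the quantization trade-off: load a pre-quantized (AWQ/INT4)
# checkpoint in vLLM so the 8B model leaves more of the 3090's 24GB for KV cache,
# raising throughput at a possible cost in accuracy. The repository name and
# memory setting below are assumptions; verify quality on your own prompts.
from vllm import LLM, SamplingParams

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",  # assumed quantized repo
    quantization="awq",           # matches the checkpoint's quantization scheme
    gpu_memory_utilization=0.90,  # leave a little headroom on a 24GB card
)

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Summarize what quantization does to an LLM."], params)
print(outputs[0].outputs[0].text)
```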

6 comments
By @Cordiali - 3 months
> "even an old Nvidia RTX 3090"

Saying that like it's mediocre... Maybe I'll have to benchmark my old 1050, see what it can do!

By @metadat - 3 months
How does 12 tokens/sec equate to satisfactorily serving thousands of end-customers?

I did enjoy the headline, and The Register was formerly an incredibly good news outlet, prior to the founder passing away last week.

By @stevenhuang - 3 months
This is wildly misleading as the benchmarks make use of batching. It will entirely fall apart in real workloads where each prompt is different. If you're doing batch processing with a fixed prompt, the results will be more applicable.
By @fooblaster - 3 months
Guess what Nvidia won't let you deploy in a data center!
By @iAkashPaul - 3 months
Pretty sure this was never questioned for batched requests; sglang/lmdeploy/TensorRT-LLM will have nearly twice the reported speeds with INT8 (fp16 A100 benched here: https://github.com/sgl-project/sglang?tab=readme-ov-file#ben...)
By @Havoc - 3 months
Bought a 3090 because they are good value for this, but this logic is frankly a little ridiculous:

> Since only a small fraction of users are likely to be making requests at any given moment

So what if 5 out of the thousands happen to coincide?