Benchmarks show even an old Nvidia RTX 3090 is enough to serve LLMs to thousands
An analysis by Backprop shows the Nvidia RTX 3090 can effectively serve large language models to thousands of users, sustaining 12.88 tokens per second per request with 100 concurrent requests.
A recent analysis by the Estonian GPU cloud startup Backprop reveals that an older Nvidia RTX 3090 graphics card can effectively serve large language models (LLMs) to thousands of users. The RTX 3090, which debuted in late 2020, handled 100 concurrent requests against the Llama 3.1 8B model while sustaining 12.88 tokens per second per request. That rate is slightly above the average human reading speed and meets the minimum acceptable generation rate for AI chatbots.

Because only a small fraction of users are making requests at any given moment, Backprop's testing suggests a single RTX 3090 could support thousands of end users. The card's 24GB of memory prevents it from running larger models, but it remains a viable option for smaller ones. The analysis also notes that quantizing models could further raise throughput, though possibly at some cost to accuracy.

Backprop's findings challenge the notion that only high-end enterprise GPUs can scale LLM serving, suggesting that older consumer-grade hardware is sufficient for many applications. The company is also exploring the deployment of A100 PCIe cards for users needing higher performance.
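The article doesn't include Backprop's benchmark harness; the sketch below is only a minimal illustration of this kind of concurrency test, assuming a local OpenAI-compatible server (for example, vLLM serving Llama 3.1 8B). The URL, model name, and prompt are placeholders, not details from the analysis:

```python
# Minimal concurrency-benchmark sketch -- not Backprop's actual harness.
# Assumes an OpenAI-compatible server is already running locally, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
import asyncio
import time

import httpx

URL = "http://localhost:8000/v1/completions"  # assumed endpoint
MODEL = "meta-llama/Llama-3.1-8B-Instruct"    # assumed model name
CONCURRENCY = 100
PROMPT = "Explain the difference between latency and throughput."

async def one_request(client: httpx.AsyncClient) -> float:
    """Fire one completion request and return its tokens/sec."""
    start = time.perf_counter()
    r = await client.post(URL, json={
        "model": MODEL,
        "prompt": PROMPT,
        "max_tokens": 256,
    }, timeout=120)
    r.raise_for_status()
    elapsed = time.perf_counter() - start
    completion_tokens = r.json()["usage"]["completion_tokens"]
    return completion_tokens / elapsed

async def main() -> None:
    # Launch all requests at once so the server batches them together.
    async with httpx.AsyncClient() as client:
        rates = await asyncio.gather(
            *[one_request(client) for _ in range(CONCURRENCY)]
        )
    print(f"mean per-request throughput: {sum(rates) / len(rates):.2f} tok/s")

asyncio.run(main())
```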
- An Nvidia RTX 3090 can serve LLMs to thousands of users effectively.
- The card sustained 12.88 tokens per second per request with 100 concurrent requests.
- It is suitable for smaller models due to its 24GB memory limit.
- Quantization can improve throughput but may reduce accuracy (see the sketch after this list).
- Backprop's findings challenge the need for high-end enterprise GPUs for LLM scaling.
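On the quantization point: one common route (an illustration only; the article doesn't say which method, if any, Backprop tried) is serving a pre-quantized AWQ build of the model through vLLM, which cuts weight memory to roughly a quarter of FP16 and leaves more of the 24GB for KV cache and batching. The checkpoint name below is an assumed community build, not one named in the analysis:

```python
# Illustrative only: serving a pre-quantized (AWQ) Llama 3.1 8B variant with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",  # assumed checkpoint
    quantization="awq",   # tell vLLM the weights are AWQ-quantized
    max_model_len=4096,   # keep the KV cache small enough for a 24GB card
)

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["What does quantization trade away?"], params)
print(outputs[0].outputs[0].text)
```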
Related
Saying that like it's mediocre... Maybe I'll have to benchmark my old 1050, see what it can do!
I did enjoy the headline, and The Register was formerly an incredibly good news outlet, prior to the founder passing away last week.
> Since only a small fraction of users are likely to be making requests at any given moment
So what if 5 out of the thousands happen to coincide?
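A back-of-the-envelope answer, with assumed numbers rather than anything from the article: if each of N users is actively generating only a small fraction p of the time, concurrent load is roughly Poisson with mean N·p, so five requests coinciding is routine, and the card only struggles once that mean approaches its tested 100-request capacity:

```python
# Back-of-the-envelope sketch with assumed numbers (not from the article):
# 5,000 users, each issuing a request ~1% of the time -> expected concurrency 50.
from math import exp, lgamma, log

N, p, capacity = 5000, 0.01, 100
lam = N * p  # Poisson approximation to Binomial(N, p)

def poisson_tail(lam: float, k: int) -> float:
    """P(X > k) for X ~ Poisson(lam), summed in log space for stability."""
    cdf = sum(exp(i * log(lam) - lam - lgamma(i + 1)) for i in range(k + 1))
    return 1.0 - cdf

print(f"expected concurrent requests: {lam:.0f}")
print(f"P(more than {capacity} coincide): {poisson_tail(lam, capacity):.2e}")
# ~1e-10 with these numbers: 5 coinciding is normal operation,
# 100+ coinciding is the rare event.
```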