Kagi LLM Benchmarking Project
The Kagi LLM Benchmarking Project assesses large language models on reasoning, coding, and instruction-following, reporting metrics such as accuracy, latency, and cost across a range of models.
The Kagi LLM Benchmarking Project evaluates major large language models (LLMs) based on their reasoning, coding, and instruction-following capabilities. It employs a unique benchmark that frequently changes and includes diverse, challenging tasks to avoid overfitting and provide a rigorous assessment of the models. The project aims to measure the adaptability and potential of LLMs, focusing on features essential for Kagi Search, particularly reasoning and instruction-following abilities.
The benchmarking results include metrics such as accuracy, total tokens, cost, median latency, and speed in tokens per second. For instance, OpenAI's GPT-4o achieved an accuracy of 52% with a median latency of 1.60 seconds, while Meta's Llama 3.1 405B and Anthropic's Claude 3.5 Sonnet scored slightly lower. The benchmark questions are designed to be challenging, testing both knowledge and reasoning skills.
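To make the metric definitions concrete, here is a minimal sketch of how accuracy, median latency, and tokens per second could be aggregated from timed benchmark runs. This is not Kagi's actual harness; the per-question records and field names are assumptions for illustration.

```python
import statistics

# Hypothetical per-question results from one model's benchmark run:
# whether the answer was correct, wall-clock latency in seconds, and
# how many tokens the model produced.
results = [
    {"correct": True,  "latency_s": 1.4, "output_tokens": 210},
    {"correct": False, "latency_s": 1.9, "output_tokens": 350},
    {"correct": True,  "latency_s": 1.6, "output_tokens": 180},
]

accuracy = 100 * sum(r["correct"] for r in results) / len(results)
median_latency = statistics.median(r["latency_s"] for r in results)
tokens_per_second = (
    sum(r["output_tokens"] for r in results)
    / sum(r["latency_s"] for r in results)
)

print(f"accuracy: {accuracy:.0f}%")
print(f"median latency: {median_latency:.2f} s")
print(f"speed: {tokens_per_second:.1f} tokens/s")
```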
Additionally, the project compares the pricing of contemporary LLMs, detailing costs per input and output tokens. This information is regularly updated to reflect the latest data. The Kagi LLM Benchmarking Project draws inspiration from other benchmarking initiatives and aims to provide a comprehensive evaluation of LLM capabilities in a competitive landscape.
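As a rough illustration of how per-token pricing turns into a per-call cost, here is a small sketch; the prices below are made-up placeholders, not any vendor's actual rates.

```python
# Hypothetical pricing in dollars per million input / output tokens.
PRICE_IN_PER_M = 5.00    # placeholder, not a real quote
PRICE_OUT_PER_M = 15.00  # placeholder, not a real quote

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars for a single request under the placeholder prices."""
    return (input_tokens / 1_000_000) * PRICE_IN_PER_M \
         + (output_tokens / 1_000_000) * PRICE_OUT_PER_M

# A 1,000-token prompt with a 500-token reply:
print(f"${call_cost(1_000, 500):.4f}")  # $0.0050 + $0.0075 = $0.0125
```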
The gold standard for LLM evaluation would have the following qualities:
1. Categorized (e.g. coding, reasoning, general knowledge)
2. Multimodal (at least text and image)
3. Multiple difficulty tiers (something like "GPT-4 saturates or scores >90%" a la MMLU, "GPT-4 scores 20-80%", and "GPT-4 scores <10%")
4. Hidden (under 10% of the dataset publicly available, enough methodological detail to inspire confidence but not enough to design to the test set)
The standard model card suite with MMLU, HumanEval etc. has already been optimized to the point of diminishing value: Goodhart's law in action. Meanwhile, arena Elo (https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...) is extremely useful, but it has the drawback of reflecting median-voter preferences that won't necessarily correlate with true intelligence as capabilities continue to advance, in the same way that the doctor with the best bedside manner is not necessarily the best doctor.
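For reference, the textbook Elo update is the kind of machinery behind arena-style leaderboards. This is only a minimal sketch of how a single pairwise human vote moves two ratings; the leaderboard's exact rating methodology may differ from this plain update.

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Apply one pairwise vote: score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    expected_b = 1.0 - expected_a
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - expected_b)
    return new_a, new_b

# Two models start at 1000; model A wins one human preference vote.
print(elo_update(1000.0, 1000.0, 1.0))  # -> (1016.0, 984.0)
```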
Until that happens, I'll pay attention to every eval I can find, but am also stuck asking "how many r's are in strawberry?" and "draw a 7-sided stop sign" to get a general impression of intelligence independent of gameable or overly general benchmarks.
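Those ad-hoc probes are trivial to script, for what it's worth. A minimal sketch using the OpenAI Python SDK (assumes the openai package is installed and OPENAI_API_KEY is set; swap in whichever client and model name you actually use):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROBES = [
    "How many r's are in strawberry?",
    "Draw a 7-sided stop sign.",
]

for probe in PROBES:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any chat model you have access to
        messages=[{"role": "user", "content": probe}],
    )
    print(probe, "->", response.choices[0].message.content)
```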
But all that aside:
Model             | Score
------------------|------
GPT-4o            |    52
Llama 3.1 405B    |    50
Claude 3.5 Sonnet |    46
Mistral Large     |    44
Gemini 1.5 Pro    |    12
What an incredible contrast to MMLU, where all of these models score in the 80-90% range! For what it's worth, these scores also fall much closer to my impressions from daily use. Gemini is awful, Sonnet and 4o are amazing, and the new Llama puts fine-tunable, open-source 4o in the hands of anyone with a mini-cluster.