June 29th, 2024

Open-LLM performances are plateauing

The blog post discusses the plateauing of scores on the Open LLM Leaderboard and suggests strategies to restore its competitiveness. The author proposes replacing the now-saturated benchmarks with harder ones, aiming to revitalize interest and engagement within the open-model community and to encourage participants to strive for higher performance. The post emphasizes the importance of keeping the leaderboard a dynamic and stimulating environment that fosters continuous improvement and innovation.

Related

Run the strongest open-source LLM model: Llama3 70B with just a single 4GB GPU

The article discusses the release of the open-source Llama3 70B model, comparing its performance to GPT-4 and Claude 3 Opus. It emphasizes training enhancements, data quality, and the competition between open- and closed-source models.

LLMs on the Command Line

Simon Willison presented LLM, a Python command-line utility for working with large language models. It supports OpenAI models out of the box and other providers via plugins, and it can run prompts, manage conversations, access specific models such as Claude 3, and log every interaction to a SQLite database. Willison highlighted using it for tasks like summarizing discussions, and emphasized embeddings for semantic search, showcasing the tool's support for content-similarity queries and its extensibility through plugins and OpenAI API compatibility.
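For context, here is a minimal sketch of that workflow using the Python API bundled with Willison's llm package. The model alias below is an assumption; any installed model works, and plugins add aliases for other providers such as Anthropic.

```python
import llm

# Load a model by alias; assumes an OpenAI API key is configured.
# ("gpt-4o-mini" is an assumed alias here; `llm models` lists what is
# actually available, and plugins add aliases for e.g. Claude 3.)
model = llm.get_model("gpt-4o-mini")

# Run a one-off prompt; the CLI equivalent is `llm "<prompt>"`.
response = model.prompt("Summarize this discussion in three bullet points.")
print(response.text())
```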

Claude 3.5 Sonnet

Anthropic introduces Claude 3.5 Sonnet, a fast and cost-effective large language model, alongside new features like Artifacts. Human evaluations show significant improvements, and privacy and safety evaluations were conducted. The post explores Claude 3.5 Sonnet's impact on engineering and coding capabilities, along with recursive self-improvement in AI development.

Large Language Models are not a search engine

Large language models (LLMs) from Google and Meta generate content algorithmically and sometimes produce nonsensical "hallucinations." Companies struggle to catch these errors after generation, since they stem from factors like training data and temperature settings. LLMs aim to improve user interactions but raise skepticism about their ability to deliver factual information.
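To make the temperature point concrete, here is a minimal, self-contained sketch of temperature-scaled sampling (a generic illustration, not any vendor's actual decoding code): low temperature concentrates probability on the most likely token, while high temperature flattens the distribution and makes unlikely, potentially hallucinated, continuations more probable.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample a token index from temperature-scaled softmax probabilities."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    scaled -= scaled.max()              # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Toy logits for three candidate tokens: low temperature almost always
# picks token 0; high temperature spreads probability across all three,
# raising the odds of sampling an implausible continuation.
logits = [3.0, 1.0, 0.5]
print(sample_with_temperature(logits, temperature=0.1))
print(sample_with_temperature(logits, temperature=2.0))
```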

LLMs now write lots of science. Good

Large language models (LLMs) are significantly shaping scientific papers, influencing up to 20% of computer-science abstracts overall and as many as a third of those from China. Debate persists over the impact of LLMs on research quality and progress.

6 comments
By @azeemba - 4 months
I feel like the post doesn't relate to the title?

The post seems to be about changing the Leaderboard and doesn't comment too much about whether the actual real-life performance of LLMs is plateauing and what can be done about it.

By @aubanel - 4 months
Just to be sure: the post does not say that the performance of Open LLMs is plateauing (that would be false; e.g., Google just released Gemma 2, which blows all previous open models of the same size out of the water).

Its true title is "Performances are plateauing, let's make the leaderboard steep again", which means "on the Open LLM leaderboard, top models have basically reached a point where they've all grokked the benchmarks which makes it harder to distinguish them, so let's change the benchmarks for harder ones to make a difference again."
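A toy illustration of that saturation effect (hypothetical numbers, not actual leaderboard scores): once every top model sits near a benchmark's ceiling, the spread between them collapses to within evaluation noise, and swapping in a harder benchmark separates them again.

```python
# Hypothetical scores for illustration only -- not real leaderboard numbers.
saturated = {"model_a": 0.91, "model_b": 0.90, "model_c": 0.89}  # grokked benchmark
harder = {"model_a": 0.62, "model_b": 0.51, "model_c": 0.38}     # harder replacement

for name, scores in (("saturated", saturated), ("harder", harder)):
    spread = max(scores.values()) - min(scores.values())
    print(f"{name}: spread = {spread:.2f}")
# saturated: spread = 0.02 -> within typical eval noise; ranking is uninformative
# harder:    spread = 0.24 -> the leaderboard is "steep" again
```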

By @pclmulqdq - 4 months
LLM performance in general is plateauing. That is natural for any technology. Progress slows down as it reaches maturity.

By @cgearhart - 4 months
Even if they’re plateauing, there’s still a lot of value to be had in what they already do. I think the mistake so far has been aiming too high or too low: either products that require AGI-like LLMs, or unimaginative “low-hanging fruit” ideas that are obvious from the function of an LLM. The former have been wishful thinking, and the latter have no moat. The Goldilocks area is in understanding what current LLMs can do well enough that you can either do something complicated that we couldn’t do reliably without LLMs, or do something simple that wasn’t worth doing without LLMs. In both cases the products need to be built in a way that naturally incorporates the expected failure modes of the tech. (For example, I don’t need it to write all my code; there’s a lot of value in just using ChatGPT to help me write one-off bash scripts.)

By @forgotmypw17 - 4 months
Meta question, does anyone know why arrow keys (and vimium J and K) don't work for scrolling this page until it is clicked?