June 27th, 2024

How to think about creating a dataset for LLM fine-tuning evaluation

Alex Strick van Linschoten emphasizes objective evaluation of LLM fine-tuning, focusing on accuracy, out-of-domain data, information gradations, spelling variations, and structured data tasks. He plans systematic model comparisons for performance enhancement.

Read original article

Alex Strick van Linschoten discusses the process of creating a dataset for evaluating LLM fine-tuning, focusing on core evaluations for accuracy, handling out-of-domain data, interpreting gradations of information, addressing spelling variations, and dealing with complex stories in structured data generation tasks. He emphasizes the importance of assessing model performance objectively rather than relying on intuition. By detailing various evaluation criteria such as measuring correct predictions, adapting to new data, handling ambiguous information, and ensuring consistency in output, he aims to enhance the accuracy and reliability of fine-tuned language models. Van Linschoten plans to implement these evaluations to compare different models and improve their performance systematically.
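
The post breaks evaluation down into concrete categories, which lend themselves to a small, versioned test set plus a scorer. Below is a minimal sketch of that idea in Python; the case structure, field names, and the run_model callable are hypothetical placeholders for illustration, not the author's actual harness.

```python
# Minimal sketch of an evaluation set for a structured-data-extraction fine-tune.
# Case categories mirror the post (accuracy, out-of-domain data, gradations of
# information, spelling variations); the schema and `run_model` are hypothetical.
import json
from typing import Callable

EVAL_CASES = [
    # Core accuracy: a well-formed input with a known gold answer.
    {"category": "accuracy", "input": "...", "expected": {"name": "...", "date": "..."}},
    # Out-of-domain: input unlike anything seen during fine-tuning.
    {"category": "out_of_domain", "input": "...", "expected": {"name": None, "date": None}},
    # Gradations of information: partially specified or ambiguous facts.
    {"category": "gradation", "input": "...", "expected": {"name": "...", "date": None}},
    # Spelling variation: same entity, noisy surface form.
    {"category": "spelling_variation", "input": "...", "expected": {"name": "...", "date": "..."}},
]

def score(run_model: Callable[[str], str]) -> dict[str, float]:
    """Exact-match accuracy per category, parsing model output as JSON."""
    totals: dict[str, int] = {}
    correct: dict[str, int] = {}
    for case in EVAL_CASES:
        cat = case["category"]
        totals[cat] = totals.get(cat, 0) + 1
        try:
            prediction = json.loads(run_model(case["input"]))
        except json.JSONDecodeError:
            prediction = None  # malformed output counts as a miss
        if prediction == case["expected"]:
            correct[cat] = correct.get(cat, 0) + 1
    return {cat: correct.get(cat, 0) / totals[cat] for cat in totals}
```

Scoring per category rather than in aggregate makes it easy to compare fine-tuned models on exactly the failure modes the post cares about.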

Related

Researchers describe how to tell if ChatGPT is confabulating

Researchers at the University of Oxford devised a method to detect confabulation in large language models like ChatGPT. By assessing semantic equivalence, they aim to reduce false answers and enhance model accuracy.
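
The core of the method is deciding when two sampled answers "mean the same thing". A minimal sketch of that check, using an off-the-shelf NLI model for bidirectional entailment, is below; the model choice (roberta-large-mnli), the input encoding, and the hard label test are illustrative assumptions, not the researchers' code.

```python
# Sketch of a bidirectional-entailment check for semantic equivalence.
from transformers import pipeline

# roberta-large-mnli labels a premise/hypothesis pair as
# ENTAILMENT / NEUTRAL / CONTRADICTION.
nli = pipeline("text-classification", model="roberta-large-mnli")

def entails(premise: str, hypothesis: str) -> bool:
    # The "</s></s>" separator encodes the pair the way the MNLI model expects.
    result = nli(f"{premise} </s></s> {hypothesis}")[0]
    return result["label"] == "ENTAILMENT"

def semantically_equivalent(answer_a: str, answer_b: str) -> bool:
    # Treat two answers as equivalent only if each entails the other.
    return entails(answer_a, answer_b) and entails(answer_b, answer_a)
```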

Detecting hallucinations in large language models using semantic entropy

Researchers devised a method to detect hallucinations in large language models like ChatGPT and Gemini by measuring semantic entropy. This approach enhances accuracy by filtering unreliable answers, improving model performance significantly.
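
With an equivalence check in hand (for example, the bidirectional-entailment sketch above), the semantic-entropy idea reduces to: sample several answers to the same question, group the answers that mean the same thing, and compute entropy over the groups, where a high value suggests confabulation. The greedy clustering and the sampling/threshold choices below are illustrative assumptions, not the paper's implementation.

```python
# Sketch of semantic entropy over clusters of equivalent answers.
import math
from typing import Callable

def semantic_entropy(answers: list[str], equivalent: Callable[[str, str], bool]) -> float:
    # Greedily cluster answers by semantic equivalence to a cluster representative.
    clusters: list[list[str]] = []
    for answer in answers:
        for cluster in clusters:
            if equivalent(answer, cluster[0]):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    # Entropy over the empirical distribution of clusters.
    total = len(answers)
    probs = [len(cluster) / total for cluster in clusters]
    return -sum(p * math.log(p) for p in probs)

# Usage idea: sample ~10 answers at temperature 1.0, then flag the question
# as unreliable if semantic_entropy(samples, semantically_equivalent) is high.
```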

LLMs on the Command Line

Simon Willison presented a Python command-line utility for accessing Large Language Models (LLMs) efficiently, supporting OpenAI models and plugins for various providers. The tool enables running prompts, managing conversations, accessing specific models like Claude 3, and logging interactions to a SQLite database. Willison highlighted using LLM for tasks like summarizing discussions and emphasized the importance of embeddings for semantic search, showcasing LLM's support for content similarity queries and extensibility through plugins and OpenAI API compatibility.
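
The same operations the talk demonstrates on the command line are also available from Python. The sketch below uses the library's documented get_model/prompt and embedding calls as I understand them; the specific model IDs are examples and depend on which plugins and API keys you have installed.

```python
# Sketch of the llm tool's Python API (https://llm.datasette.io/).
import llm

# Run a prompt. OpenAI models work out of the box; others (e.g. Claude 3)
# come from plugins such as llm-claude-3.
model = llm.get_model("gpt-4o")  # example model ID
response = model.prompt("Summarize this discussion in three bullet points: ...")
print(response.text())

# Embeddings for semantic search / content-similarity queries.
embedding_model = llm.get_embedding_model("ada-002")  # example embedding model ID
vector = embedding_model.embed("a paragraph to find similar content for")
print(len(vector))  # dimensionality of the embedding vector
```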

Mozilla.ai did what? When silliness goes dangerous

Mozilla.ai, a Mozilla Foundation project, faced criticism for using biased statistical models to summarize qualitative data, raising doubts about its scientific rigor and competence in AI. Critics deemed the approach ineffective and damaging to the project's credibility.

Claude 3.5 Sonnet

Anthropic introduces Claude 3.5 Sonnet, a fast and cost-effective large language model with new features like Artifacts. Human evaluations show significant improvements, and privacy and safety assessments were conducted. The discussion explores Claude 3.5 Sonnet's impact on engineering and coding capabilities, along with recursive self-improvement in AI development.

3 comments
By @avgbusinessuser - 4 months
great series of posts, i went down a similar path recently for a slightly different use case - i did not use axolotl though, i was worried about missing out on understanding some details due to potential abstractions. it's great to see documentation on how others tackle similar problems, i documented the process i went through here - https://atredis.com/blog/2024/6/3/how-to-train-your-large-la...
By @msp26 - 4 months
For tasks like data extraction, are people doing full finetunes or training a LoRA? Is it any different for classification?
By @hinkley - 4 months
When you get good enough at filtering the dataset for training, do you still need an AI, or do you understand the problem domain and can use a deterministic system?