eBook on building LLM system evals
The Forest Friends Zine is a $20 digital guide for AI engineers exploring Large Language Model evaluations. Authored by Sridatta Thatipamala and Wil Chung, it offers strategies for evaluating and improving LLM-powered systems.
The Forest Friends Zine is a digital guide for AI engineers working on Large Language Model (LLM) system evaluations. Priced at $20 for pre-orders, it offers 30 pages of downloadable content with more than 50 color illustrations. Set in Brightwood Forest, the zine follows forest creatures as they put an LLM named Shoggoth to work on various tasks and grapple with the challenges of integrating it. The guide advocates a systematic approach to evaluations: start simple, apply diverse evaluation techniques, design custom metrics, and build a golden dataset to compare against. By turning vague feelings into actionable data, it equips readers to improve their LLM implementations with confidence. Authored by Sridatta Thatipamala and Wil Chung, the zine is aimed at developers and product managers who want to drive improvements in LLM-powered systems through effective evaluations. It covers the value of evaluations, designing a first evaluation, reproducibility, testing, and selecting quality measures. Subscribers can stay updated on upcoming issues.
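As a concrete illustration of the workflow the zine describes (not code from the zine itself), a golden-dataset eval can start as a handful of hand-checked input/output pairs plus one simple metric. Everything below, `call_llm` and the exact-match scorer included, is a hypothetical sketch showing the shape of the loop.

```python
# Minimal sketch of a golden-dataset eval (illustrative; not from the zine).
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str
    expected: str

# A golden dataset: hand-checked input/output pairs used as the fixed
# yardstick that every prompt or model change is compared against.
golden_dataset = [
    Example("What is 2 + 2?", "4"),
    Example("Capital of France?", "Paris"),
]

def call_llm(prompt: str) -> str:
    """Stand-in for the model under test; replace with a real API call."""
    return "4"  # placeholder response so the sketch runs end to end

def exact_match(output: str, expected: str) -> float:
    """A deliberately simple custom metric: start simple, refine later."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def run_eval(dataset: list[Example]) -> float:
    scores = [exact_match(call_llm(ex.prompt), ex.expected) for ex in dataset]
    return sum(scores) / len(scores)  # one number to track across changes

print(run_eval(golden_dataset))  # 0.5 with the placeholder model
```

Each change to the system is then judged by whether the single `run_eval` score moves, which is how vague feelings become actionable data.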
Related
Run the strongest open-source LLM model: Llama3 70B with just a single 4GB GPU
The article discusses the release of the open-source Llama3 70B model, highlighting its performance compared to GPT-4 and Claude 3 Opus. It emphasizes training enhancements, data quality, and the competition between open- and closed-source models.
Show HN: Python lib to run evals across providers: OpenAI, Anthropic, etc.
The GitHub repository provides details on LLM Safety Evals, accessible at evals.gg. It features a bar chart, a Twitter post, setup guidelines, and code execution commands, with contact information for further support.
Show HN: FiddleCube – Generate Q&A to test your LLM
FiddleCube on GitHub helps create question-answer datasets for Large Language Models. It includes a guide, examples, and details on generating ideal datasets for testing, evaluating, and training LLMs. For more information, visit the GitHub page.
Claude 3.5 Sonnet
Anthropic introduces Claude 3.5 Sonnet, a fast and cost-effective large language model, alongside new features like Artifacts. Human evaluations show significant improvements, and privacy and safety evaluations are conducted. The model's impact on engineering and coding capabilities is explored, along with recursive self-improvement in AI development.
RouteLLM: A framework for serving and evaluating LLM routers
RouteLLM is a cost-effective framework for LLM routers, reducing costs by 85% while preserving 95% of GPT-4 performance. It optimizes query routing between models without sacrificing response quality. Various resources are available in the linked GitHub repository.
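The routing idea behind a framework like RouteLLM can be sketched generically: estimate how hard a query is, and pay for the strong model only when it looks necessary. The sketch below is not RouteLLM's actual API; `estimate_difficulty` and both model callables are hypothetical stand-ins (real routers typically train a classifier on preference data rather than using a heuristic).

```python
# Hypothetical sketch of the LLM-routing pattern: cheap queries go to a
# weak model, hard ones to a strong model. All names here are illustrative.
from typing import Callable

def estimate_difficulty(query: str) -> float:
    """Toy difficulty score in [0, 1]; real routers use a trained model."""
    return min(len(query) / 500, 1.0)

def make_router(
    weak_model: Callable[[str], str],
    strong_model: Callable[[str], str],
    threshold: float = 0.5,
) -> Callable[[str], str]:
    def route(query: str) -> str:
        # Spend on the strong model only when the query looks hard enough.
        model = strong_model if estimate_difficulty(query) >= threshold else weak_model
        return model(query)
    return route

# Usage: router = make_router(cheap_llm, expensive_llm); router("short question")
```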
Actually, I host a podcast called Book Overflow ([YouTube link here](https://www.youtube.com/@BookOverflowPod), but we're on all major platforms). Each week we read and discuss a new software engineering book. We also love to interview the authors when possible. Our [interview with Brian Kernighan](https://youtu.be/_QQ7k5sn2-o?si=bi3omgmNW7bs50NQ) actually went viral here on HN last week, peaking at #3.
If you're willing to provide us with an advance copy and one/some of the authors are willing to sit down for a digital interview, we'd love to devote a discussion episode and bonus interview episode to the book. We could even time the release to line up with the release of the book.
Let me know if you're interested. We can work out the details either here in the thread or you can reach us at contact at bookoverflow.io.
I spent >50% of my time designing and advising on evals at one point.