A pragmatic guide to LLM evals for devs
One word that keeps cropping up when I talk with software engineers who build large language model (LLM)-based solutions is “evals”. They use evaluations to verify that LLM solutions work well enough, because LLMs are non-deterministic: there’s no guarantee they’ll give the same answer to the same question twice. This makes verifying that things work according to spec more complicated than it is with other software, where deterministic automated tests can do the job.
Evals feel like they are becoming a core part of the AI engineering toolset. And because they are also becoming part of CI/CD pipelines, we, software engineers, should understand them better — especially because we might need to use them sooner rather than later! So, what do good evals look like, and how should this non-deterministic-testing space be approached?
For directions, I turned to an expert on the topic, Hamel Husain. He’s worked as a Machine Learning engineer at companies including Airbnb and GitHub, and teaches the online course AI Evals For Engineers & PMs — the upcoming cohort starts in January. Hamel is currently writing a book, Evals for AI Engineers, to be published by O’Reilly next year.
In today’s issue, we cover:
Vibe-check development trap. An agent appears to work well, but as soon as it’s modified, there’s no way to establish whether it still works correctly.
Core workflow: error analysis. Error analysis has been a key part of machine learning for decades and is useful for building LLM applications.
Building evals: the right tools for the job. Use code-based evals for deterministic failures, and an LLM-as-judge for subjective cases (see the first sketch after this list).
Building an LLM-as-judge. Prevent your LLM judge from memorizing answers by partitioning your data and measuring how well the judge generalizes to unfamiliar data.
Align the judge, keep trust. The LLM judge’s expertise needs to be validated against human expertise. Consider metrics like True Positive Rate (TPR) and True Negative Rate (TNR); the second sketch after this list shows how they’re computed.
Evals in practice: from CI/CD to production monitoring. Use evals in the CI/CD pipeline, but also use production data to continuously validate that they keep working as expected.
Flywheel of improvement. Analyze → Measure → Improve → Automate → start again.
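Two short sketches make the bullets above more concrete. First, a code-based eval for deterministic failures. The JSON shape, the required `answer` field, and the `INTERNAL-\d+` pattern are invented for illustration and not from the article, but they show the kind of failure plain code can assert on without involving an LLM judge:

```python
# Hypothetical code-based eval: deterministic failure modes (malformed JSON,
# missing fields, leaked internal identifiers) caught with plain assertions.
import json
import re

def check_response(response: str) -> list[str]:
    """Return a list of failure reasons; an empty list means the response passes."""
    failures: list[str] = []
    try:
        payload = json.loads(response)
    except json.JSONDecodeError:
        return ["response is not valid JSON"]
    if "answer" not in payload:
        failures.append("missing required 'answer' field")
    if re.search(r"INTERNAL-\d+", response):
        failures.append("leaked an internal ticket identifier")
    return failures

print(check_response('{"answer": "Your order ships Monday."}'))  # []
print(check_response('Sorry, see INTERNAL-42 for details.'))     # ['response is not valid JSON']
```

Second, a minimal sketch of judge alignment: it scores an LLM judge’s pass/fail verdicts against human expert labels on a held-out set of traces and computes TPR and TNR. The `LabeledTrace` shape and the sample data are assumptions for illustration, not part of Hamel’s method:

```python
# Hypothetical judge-alignment check: compare LLM-judge labels with human labels
# on traces the judge was never tuned on, then report TPR and TNR.
from dataclasses import dataclass

@dataclass
class LabeledTrace:
    trace_id: str
    human_pass: bool   # ground-truth verdict from a human expert
    judge_pass: bool   # verdict produced by the LLM judge

def alignment_metrics(traces: list[LabeledTrace]) -> dict[str, float]:
    tp = sum(t.human_pass and t.judge_pass for t in traces)
    fn = sum(t.human_pass and not t.judge_pass for t in traces)
    tn = sum(not t.human_pass and not t.judge_pass for t in traces)
    fp = sum(not t.human_pass and t.judge_pass for t in traces)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # agreement when humans say "pass"
    tnr = tn / (tn + fp) if (tn + fp) else 0.0  # agreement when humans say "fail"
    return {"TPR": tpr, "TNR": tnr}

held_out = [  # held-out split: never used while writing or tuning the judge prompt
    LabeledTrace("t1", human_pass=True,  judge_pass=True),
    LabeledTrace("t2", human_pass=False, judge_pass=True),
    LabeledTrace("t3", human_pass=False, judge_pass=False),
]
print(alignment_metrics(held_out))  # {'TPR': 1.0, 'TNR': 0.5}
```

A low TNR like the one above is the typical warning sign: the judge is more lenient than the human reviewer, so its verdicts can’t yet be trusted as a CI/CD gate.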
With that, it’s over to Hamel:
1. Vibe-check development trap
Organizations are embedding LLMs into applications from customer service to content creation. Yet, unlike traditional software, LLM pipelines don’t produce deterministic outputs; their responses are often subjective ...