Evals

Systematic measurement of model quality. Offline + online + LLM-as-judge. The discipline that turns 'feels better' into 'measurably better.'

Every model change is justified by a measured improvement. Eval suites run in CI. Golden sets evolve. Quality has a number. Status: STUB (to be promoted to OUTLINE in Y5 Phase 41).

What this pattern is

Evals are the systematic measurement of model quality. Three layers compose: offline evals (against held-out test sets with ground truth — accuracy, F1, BLEU, ROUGE, custom task metrics); online evals (A/B tests in production — click-through rate, user satisfaction, downstream business metrics); LLM-as-judge evals (a different LLM scores outputs against rubrics — useful for open-ended generation where ground truth doesn’t exist). Promptfoo and Inspect are emerging OSS evaluation frameworks; internal tools at frontier AI labs follow similar shapes.
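
As a minimal sketch of the offline layer, assuming a binary classification task and a tiny hypothetical golden set (the labels and predictions below are illustrative, not from any real system), the classical metrics can be computed directly against ground truth:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OfflineReport:
    accuracy: float
    precision: float
    recall: float
    f1: float


def offline_eval(gold: list, predicted: list) -> OfflineReport:
    """Score binary predictions (1 = positive) against ground-truth labels."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equal length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # false negatives
    correct = sum(1 for g, p in pairs if g == p)
    # Guard against division by zero when a class never appears.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return OfflineReport(correct / len(gold), precision, recall, f1)


# Hypothetical golden set and model predictions (illustrative values only).
GOLD = [1, 0, 1, 1, 0, 0, 1, 0]
PRED = [1, 0, 0, 1, 0, 1, 1, 0]
report = offline_eval(GOLD, PRED)
print(report)
```

In practice a maintained metric library would usually replace the hand-rolled arithmetic; the point here is only that an offline eval reduces to a golden set, a set of predictions, and a scoring function.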

The pattern is the difference between ML systems that improve deliberately and those that improve by vibes. Without evals: “this model feels better” is the justification; regressions go unnoticed. With evals: every change has a measured delta; promotions require eval pass; regressions block deploy. The discipline is most load-bearing for LLM systems where the output space is open and subjective.

Evals shape the entire ML development workflow. They gate promotion in model-registry. They inform experiment iteration (“this hyperparameter change improved F1 by 0.03”). They provide the baseline for drift-detection (compare the production distribution against the distribution the evals were run on). They anchor conversations with stakeholders (“the model improved from 0.72 to 0.78 on the benchmark”). Without evals, none of these are grounded.

For LLM systems specifically, evals become both more important and more difficult. More important because the output space is open (there’s no single correct answer); more difficult because ground truth is fuzzy or unavailable. LLM-as-judge partially fills this gap: a judge model, typically a stronger one, scores the system’s outputs against rubrics. This isn’t perfect (judges have their own biases, such as favoring longer answers or whichever answer is shown first) but is dramatically better than “no evals.” The frontier is moving toward richer eval frameworks with human-in-the-loop for the most important decisions.
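
A rough sketch of the LLM-as-judge loop follows. Everything named here is an assumption for illustration: the rubric wording, the 1 to 5 scale, the "SCORE: <n>" reply format, and `fake_judge`, which stands in for a real model call so the example runs without any provider API.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical rubric; a real one would be tuned to the task.
RUBRIC = (
    "Score the answer to the question from 1 (poor) to 5 (excellent) for "
    "factual accuracy and completeness. Reply in the form 'SCORE: <n>'."
)


def build_judge_prompt(question: str, answer: str) -> str:
    return f"{RUBRIC}\n\nQUESTION: {question}\nANSWER: {answer}"


def parse_score(reply: str) -> Optional[int]:
    """Extract a 1-5 score; return None when the judge reply is malformed."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def mean_judge_score(
    cases: Iterable[Tuple[str, str]], judge: Callable[[str], str]
) -> float:
    """Mean score over (question, answer) pairs, skipping malformed replies."""
    scores = []
    for question, answer in cases:
        score = parse_score(judge(build_judge_prompt(question, answer)))
        if score is not None:
            scores.append(score)
    if not scores:
        raise ValueError("judge produced no parseable scores")
    return sum(scores) / len(scores)


def fake_judge(prompt: str) -> str:
    # Stand-in for a real judge model: rewards answers that mention "Paris".
    answer = prompt.rsplit("ANSWER:", 1)[1]
    return "SCORE: 5" if "Paris" in answer else "SCORE: 2"


CASES = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of France?", "Lyon"),
]
mean_score = mean_judge_score(CASES, fake_judge)
print(mean_score)
```

Passing the judge in as a plain callable keeps the scoring logic testable offline and makes it easy to swap judge models or compare a judge against human labels.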

Concrete instances in the wild

  • Promptfoo. OSS LLM eval framework. YAML-based test definitions. Common for LLM apps.
  • Inspect. OSS eval framework from the UK AI Security Institute (formerly the AI Safety Institute). Powerful for structured LLM evals.
  • LangSmith Evals. LangChain’s commercial eval tooling.
  • OpenAI Evals. OpenAI’s OSS eval framework. Original inspiration for many later tools.
  • Anthropic’s model-evaluation guides. Anthropic publishes best practices for LLM evaluation.
  • HELM (Stanford CRFM). Holistic evaluation benchmark for language models.
  • MMLU, BBH, HumanEval, TruthfulQA. Standard benchmark datasets used across LLM research.
  • Traditional ML metrics. Accuracy, precision, recall, F1, AUC-ROC, RMSE, MAE — the classical ML metric library.
  • Weights & Biases evaluation tracking. Commercial tooling for tracking eval metrics across runs.
  • basecamp eval setup (Y5 Phase 41). Promptfoo + custom eval sets + CI integration.

Why this pattern matters

Without evals, ML development is guessing. “This change makes the model better” is unfalsifiable. Regressions accumulate silently. The team that ships models never really knows if they’re improving or degrading, because there’s no measurement. The organizational cost is engineering time spent on changes that don’t help, and worse, changes that actively hurt.

With evals, ML development becomes empirical. Every change has a hypothesis and a measurement. Regressions block promotion automatically. The team can articulate what “better” means in numbers, and they can defend that number to stakeholders. Engineering time compounds because every experiment produces knowledge (this worked / this didn’t) that the team keeps.
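
One way to check that a measured delta is more than noise is a paired bootstrap over per-example scores. The sketch below is illustrative under stated assumptions: the 100-example eval set, the 0/1 correctness scores, and the 95% percentile interval are all choices made for the example, not part of any basecamp setup.

```python
import random


def paired_bootstrap_delta(baseline, candidate, n_resamples=2000, seed=0):
    """Return (observed_delta, ci_low, ci_high) for mean(candidate) - mean(baseline).

    Both inputs are per-example scores on the SAME eval examples, in the same
    order. The interval is a 95% percentile bootstrap over resampled examples.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(baseline)
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = sum(diffs) / n
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    low = means[(n_resamples * 25) // 1000]
    high = means[(n_resamples * 975) // 1000 - 1]
    return observed, low, high


# Hypothetical per-example correctness (1 = correct) on 100 shared examples:
# 68 both right, 4 only baseline right, 10 only candidate right, 18 both wrong.
BASELINE = [1] * 68 + [1] * 4 + [0] * 10 + [0] * 18   # mean 0.72
CANDIDATE = [1] * 68 + [0] * 4 + [1] * 10 + [0] * 18  # mean 0.78

observed, ci_low, ci_high = paired_bootstrap_delta(BASELINE, CANDIDATE)
print(f"delta={observed:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

Whether the interval excludes zero depends on the eval set size and on how many examples actually flipped, which is why the headline delta alone is a weak basis for a promotion decision on a small set.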

The pattern also enables the CI/CD discipline that modern MLops requires. Model changes go through PRs. PRs run eval suites. Passing evals is a merge requirement. Failing evals block the change. This mirrors software CI — you don’t merge code that fails tests, you don’t merge model changes that fail evals. The parallel is what makes MLops feel operationally like DevOps.
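
The gate itself can be very small. The following is a sketch of the idea, not a description of any particular CI system: the metric names, the values, and the 0.01 tolerance are hypothetical, and it assumes higher is better for every metric.

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.01) -> list:
    """Return metric names where the candidate falls below baseline - tolerance.

    Assumes higher is better for every metric. A metric missing from the
    candidate counts as a regression, so an eval cannot be silently dropped.
    """
    failures = []
    for name, base_value in baseline.items():
        cand_value = candidate.get(name)
        if cand_value is None or cand_value < base_value - tolerance:
            failures.append(name)
    return failures


def gate_exit_code(baseline: dict, candidate: dict, tolerance: float = 0.01) -> int:
    """0 when the candidate may be promoted, 1 when any metric regressed."""
    failures = find_regressions(baseline, candidate, tolerance)
    for name in failures:
        print(f"REGRESSION {name}: {baseline[name]} -> {candidate.get(name)}")
    return 1 if failures else 0


# Hypothetical metric snapshots for the current and proposed model.
BASELINE_METRICS = {"f1": 0.72, "exact_match": 0.61}
CANDIDATE_METRICS = {"f1": 0.78, "exact_match": 0.58}

exit_code = gate_exit_code(BASELINE_METRICS, CANDIDATE_METRICS)
# In a CI job, passing exit_code to sys.exit() would fail the pipeline step.
```

Here the candidate improves F1 but drops exact match by more than the tolerance, so the gate fails. Gating on every tracked metric, not only the headline one, is what catches this kind of trade.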

For LLM systems, evals are especially important because there are few other checks on quality. Software has type systems, static analysis, and tests that catch errors. LLM outputs are just strings; nothing in the toolchain prevents them from being wrong. Evals are the primary mechanism for measuring whether an LLM change made things better. Skipping them means flying blind.

The pattern also matters for stakeholder communication. Executives don’t understand model architecture; they understand numbers. “The model improved from 0.72 to 0.78 on our benchmark” is a defensible statement. “The model feels better to the team” is not. Evals give ML teams the vocabulary to communicate progress in terms other stakeholders can evaluate.

Modern tooling makes evals easier than they used to be. Promptfoo makes LLM evals declarative. LangSmith integrates evals into LLM apps automatically. HELM and similar benchmarks provide off-the-shelf evaluation on standard tasks. What used to require internal tooling investment is now largely available OSS.

The failure modes to know: eval sets that don’t reflect production distribution (evals pass, production fails); eval sets that the team overfits to, or that leak into training data, over iterations (Goodhart’s Law: once a measure becomes a target, it stops being a good measure); eval suites that are too slow to run frequently (developers skip them); eval judgments that are too subjective (LLM-as-judge biases). Each has known mitigations, but operating evals means owning them.
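
As one example of a mitigation, a pairwise judge’s sensitivity to answer order can be reduced by judging both orderings and keeping only verdicts that agree. This is a sketch under assumptions: the judge interface (a callable returning "first" or "second") and both toy judges are hypothetical stand-ins for a real model call.

```python
from typing import Callable

# Hypothetical interface: (question, first_answer, second_answer) -> "first" | "second"
PairwiseJudge = Callable[[str, str, str], str]


def debiased_preference(
    question: str, answer_a: str, answer_b: str, judge: PairwiseJudge
) -> str:
    """Return "A", "B", or "tie" after judging both presentation orders."""
    forward = judge(question, answer_a, answer_b)   # A shown first
    backward = judge(question, answer_b, answer_a)  # B shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    # The verdict flipped with the ordering: treat it as order bias, not signal.
    return "tie"


def always_first_judge(question: str, first: str, second: str) -> str:
    # Toy judge with maximal position bias.
    return "first"


def longer_answer_judge(question: str, first: str, second: str) -> str:
    # Toy judge with verbosity bias: prefers whichever answer is longer.
    return "first" if len(first) >= len(second) else "second"


biased_verdict = debiased_preference("q", "short", "a much longer answer", always_first_judge)
verbose_verdict = debiased_preference("q", "short", "a much longer answer", longer_answer_judge)
print(biased_verdict, verbose_verdict)
```

Order swapping neutralizes the position-biased judge (its verdict becomes a tie) but not the verbosity-biased one, which still picks the longer answer in both orders. Each bias needs its own check, usually backed by spot comparisons against human labels.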

Depth progression

STUB     ← you are here.
OUTLINE  Promoted when Y5 Phase 41 stands an eval framework on basecamp.
DEEP     Promoted after Y5 end — eval suites operational across multiple models,
         CI-gated promotion working, at least one regression caught pre-deploy.

Preview: what OUTLINE will answer

When Y5 Phase 41 promotes this entry to OUTLINE, it will name:

  • PROBLEM. How do you measure model quality systematically and gate deployment on measured improvement?
  • PRINCIPLES. Offline evals catch obvious issues before deployment. Online evals catch what offline misses. LLM-as-judge fills the gap for open-ended generation. Evals gate promotion. Eval sets evolve with the product. Evals are both quantitative and qualitative.
  • TRADE-OFFS. Offline (fast, static, may not reflect prod) vs online (slow, dynamic, reflects prod). LLM-as-judge (scalable, biased) vs human eval (accurate, expensive). Standard benchmarks (comparable) vs custom (relevant). CI-integrated (safe) vs manual (flexible).
  • TOOLS (time-stamped as of 2026-06): Promptfoo, Inspect, LangSmith Evals, OpenAI Evals, HELM, MMLU/HumanEval/TruthfulQA benchmarks, Weights & Biases, custom eval scripts, traditional ML metric libraries.

The DEEP promotion, after Y5 end with CI-gated eval promotion and a caught regression, will add MASTERY (operating eval suites across basecamp models), COMPARE (Promptfoo vs Inspect vs custom evals), OPERATE (a specific regression caught by evals), and CONTRIBUTE (an eval framework contribution or public case study).

Canonical references

  • Chip Huyen, Designing Machine Learning Systems, chapter on evaluation.
  • OpenAI’s “How we evaluate GPT” blog series. Free.
  • Anthropic’s evaluation guides. Free.
  • Stanford CRFM’s HELM paper and benchmark. Free at crfm.stanford.edu/helm.
  • Eugene Yan’s blog on ML evaluation patterns. Free at eugeneyan.com.

Cross-references