Why AI evals are the hottest new skill for product builders

A summary of “Why AI evals are the hottest new skill for product builders” with Hamel Husain & Shreya Shankar.

most important concepts (quick view)

  • Evals = systematic measurement + improvement of an AI product using your own logs/traces, not just benchmarks.
  • Start with manual error analysis (read real traces, write one concise note per trace), then cluster notes into failure modes and count them to prioritize; a counting sketch follows this list.
  • Prefer code-based evaluators; reserve LLM-as-judge for narrow, subjective failure modes, and make the judge binary (pass/fail) rather than a Likert scale (1–5 ratings).
  • Validate the judge against human labels (confusion matrix), then run evaluators in CI and on production samples (weekly/daily) to watch drift.
  • Treat evals as living PRDs: explicit behavior rules you enforce continuously.
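
To make the “cluster and count” step above concrete, here is a minimal Python sketch. The failure-mode labels are hypothetical stand-ins for your own axial codes; the point is that a simple frequency count is enough to rank what to fix first.

from collections import Counter

# One axial-coded failure-mode label per reviewed trace
# (hypothetical labels; yours come from clustering your open codes).
labels = [
    "missed human handoff",
    "hallucinated feature",
    "missed human handoff",
    "conversation flow break",
    "missed human handoff",
    "hallucinated feature",
]

# Count and rank failure modes to decide what to fix first.
for failure_mode, count in Counter(labels).most_common():
    print(f"{count:>3}  {failure_mode}")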

core summary

  • What evals are: a lightweight, repeatable way to quantify how well your AI app behaves in the wild and to improve it without relying on “vibes.”
  • Begin with data, not tests: sample ~40–100 traces, write one “open code” (the first upstream error) per trace, and stop at theoretical saturation (when new issues stop appearing).
  • Synthesize failure modes: cluster open codes into a handful of actionable axial codes (e.g., “missed human handoff,” “hallucinated feature,” “conversation flow break”) and count them to pick the top problems to fix first.
  • Fix obvious issues fast: prompt/format/engineering bugs may not need evaluators.
  • Automate evaluators where it matters:
    • Code-based checks (cheap/deterministic) for structure/format/guardrails.
    • LLM-as-judge for a single, specific failure mode with a binary verdict.
  • Align the judge to humans: compare judge vs human labels; iterate the judge prompt until disagreements shrink (especially on rare errors). Don’t trust raw “% agreement” alone.
  • Operationalize: run evaluators in unit tests/CI and on real production samples to catch regressions and monitor drift, with ~30 minutes/week of maintenance; a sketch follows this list.
  • Relationship to PRDs & A/B: evaluators become enforceable PRDs; A/B tests complement them by measuring product/business impact at runtime.
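
To illustrate the code-based checks and the CI wiring above, here is a minimal sketch. It assumes each trace stores the model’s raw output as a string, and that outputs are expected to be JSON with a “reply” field under a length limit; the field name, the limit, and the pytest setup are assumptions for illustration, not details from the source.

import json

import pytest

# A deterministic, code-based evaluator: structure, format, and a length guardrail.
def valid_structured_reply(raw_output: str, max_chars: int = 1200) -> bool:
    try:
        payload = json.loads(raw_output)  # structure: must be valid JSON
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    reply = payload.get("reply")
    if not isinstance(reply, str) or not reply.strip():  # format: required, non-empty field
        return False
    return len(reply) <= max_chars  # guardrail: length limit

# Tiny regression set; in practice these come from saved traces.
REGRESSION_CASES = [
    ('{"reply": "Sure, a 2-bedroom is available from July 1."}', True),
    ("not json at all", False),
    ('{"reply": ""}', False),
]

@pytest.mark.parametrize("raw_output, expected", REGRESSION_CASES)
def test_reply_structure(raw_output, expected):
    assert valid_structured_reply(raw_output) == expected

Run it with pytest locally and in CI; the same evaluator function can be applied to a nightly or weekly sample of production traces.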

a simple 6-step checklist (you can run this week)

  1. Sample traces: pull 60–100 recent, diverse conversations/runs.
  2. Open code: write one short note per trace (first upstream error only).
  3. Axial code + count: cluster notes into 5–8 failure modes; pivot/count to prioritize.
  4. Quick fixes first: repair obvious prompt/UX/engineering issues immediately.
  5. Add evaluators: code-based where possible; add 1–3 binary LLM-judges for subjective modes.
  6. Validate + automate: align judges to human labels, then run in CI + nightly/weekly prod sampling; review dashboards weekly.
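
For step 6, here is a minimal sketch of checking judge-vs-human alignment in plain Python, with made-up labels. It shows why raw agreement is not enough: overall agreement can look high while the judge still misses most of the rare failing class.

from collections import Counter

# Paired binary labels for the same traces (hypothetical data).
# True = "a handoff should have occurred", the rare failure class.
human = [False] * 90 + [True] * 10
judge = [False] * 90 + [True] * 3 + [False] * 7

# Confusion matrix: (human label, judge label) -> count.
matrix = Counter(zip(human, judge))
tp, fn = matrix[(True, True)], matrix[(True, False)]
tn, fp = matrix[(False, False)], matrix[(False, True)]

agreement = (tp + tn) / len(human)
recall_on_failures = tp / (tp + fn)  # how often the judge catches true failures

print(f"raw agreement:           {agreement:.0%}")           # 93%: looks fine
print(f"recall on rare failures: {recall_on_failures:.0%}")  # 30%: judge misses most of them

Iterate the judge prompt until both classes look good, not just the overall agreement number.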

implementation tips

  • Keep judges binary; avoid “1–5” ratings. They’re slower, fuzzier, and harder to interpret.
  • Label only the first upstream error per trace to stay fast and consistent.
  • Create a “none of the above” bucket while clustering; it reveals missing failure modes.
  • Small team? Appoint a benevolent dictator (domain expert) to make final labeling calls quickly.
  • Build a minimal data review UI (or use existing observability tools) to remove friction from trace review.
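
If a dedicated review UI feels like too much to start, a bare terminal loop over exported traces can cover the first pass. The file name and trace fields below are assumptions about your export format; the workflow is the one described above: one pass, one short note (the first upstream error) per trace.

import json

# Assumed export format: one JSON object per line with "id" and "messages" fields.
with open("traces.jsonl") as f:
    traces = [json.loads(line) for line in f]

open_codes = {}
for trace in traces:
    print("=" * 60)
    for message in trace["messages"]:
        print(f'{message["role"]}: {message["content"]}')
    # One concise open code per trace: the first upstream error only.
    note = input("first upstream error (Enter if none): ").strip()
    if note:
        open_codes[trace["id"]] = note

with open("open_codes.json", "w") as f:
    json.dump(open_codes, f, indent=2)

Swap the input() loop for a spreadsheet or an existing observability tool as soon as the volume justifies it.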

glossary

  • Likert scale: a psychometric response scale with ordered options (e.g., 1–5 from “strongly disagree” to “strongly agree”). In evals, avoid such graded scores for judges; prefer binary pass/fail to make results crisp, comparable, and automatable.
  • Open coding: free-form note taking on traces to capture observed issues without a predefined taxonomy.
  • Axial coding: grouping open-code notes into a small set of actionable failure modes used for counting and prioritization.
  • Theoretical saturation: the point in analysis when reviewing more traces stops producing new issue types—your cue to move on.
  • LLM-as-judge: using a model to make a narrow, binary evaluation about a specific failure mode (e.g., “Should this have been handed to a human? TRUE/FALSE”).
  • Code-based evaluator: a deterministic check written in code (e.g., JSON validity, length limits, schema adherence).
  • Benevolent dictator: a single domain expert empowered to make fast, consistent labeling and taxonomy decisions.

minimal judge prompt (example, binary)

Given the full trace, output only TRUE if a human handoff should have occurred; else FALSE. Handoff required if any of: (1) user explicitly requests a human; (2) policy-mandated topics; (3) sensitive complaints/escalations; (4) missing/failed tool data; (5) same-day tour scheduling. Return exactly TRUE or FALSE.
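
A sketch of wiring this prompt up as a binary judge, assuming the OpenAI Python SDK; the model name, client setup, and the way the trace is passed in are placeholders, and any provider that returns text would work the same way: the judge must output exactly TRUE or FALSE, which is then mapped to a boolean.

from openai import OpenAI

JUDGE_PROMPT = (
    "Given the full trace, output only TRUE if a human handoff should have "
    "occurred; else FALSE. Handoff required if any of: (1) user explicitly "
    "requests a human; (2) policy-mandated topics; (3) sensitive complaints/"
    "escalations; (4) missing/failed tool data; (5) same-day tour scheduling. "
    "Return exactly TRUE or FALSE."
)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def handoff_required(trace_text: str) -> bool:
    """Binary LLM-as-judge: should this trace have been handed off to a human?"""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": trace_text},
        ],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict == "TRUE"

The binary output can be compared directly against human pass/fail labels and dropped into the same CI and production-sampling loops as the code-based checks.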
