Why AI evals are the hottest new skill for product builders

A summary of “Why AI evals are the hottest new skill for product builders” with Hamel Husain & Shreya Shankar.

most important concepts (quick view)

  • Evals = systematic measurement + improvement of an AI product using your own logs/traces, not just benchmarks.
  • Start with manual error analysis (read real traces, write one concise note per trace), then cluster notes into failure modes and count them to prioritize; a counting sketch follows this list.
  • Prefer code-based evaluators; reserve LLM-as-judge for narrow, subjective failure modes, and make the judge binary (pass/fail) rather than a Likert scale (1–5 ratings).
  • Validate the judge against human labels (confusion matrix), then run evaluators in CI and on production samples (weekly/daily) to watch drift.
  • Treat evals as living PRDs: explicit behavior rules you enforce continuously.
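
To make the “cluster and count” step above concrete, here is a minimal Python sketch. The failure-mode labels are hypothetical stand-ins for your own axial codes; the point is that a simple frequency count is enough to rank what to fix first.

from collections import Counter

# One axial-coded failure-mode label per reviewed trace
# (hypothetical labels; yours come from clustering your open codes).
labels = [
    "missed human handoff",
    "hallucinated feature",
    "missed human handoff",
    "conversation flow break",
    "missed human handoff",
    "hallucinated feature",
]

# Count and rank failure modes to decide what to fix first.
for failure_mode, count in Counter(labels).most_common():
    print(f"{count:>3}  {failure_mode}")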

core summary

  • What evals are: a lightweight, repeatable way to quantify how well your AI app behaves in the wild and to improve it without relying on “vibes.”
  • Begin with data, not tests: sample ~40–100 traces, write one “open code” (the first upstream error) per trace, and stop at theoretical saturation (when new issues stop appearing).
  • Synthesize failure modes: cluster open codes into a handful of actionable axial codes (e.g., “missed human handoff,” “hallucinated feature,” “conversation flow break”) and count them to pick the top problems to fix first.
  • Fix obvious issues fast: prompt/format/engineering bugs may not need evaluators.
  • Automate evaluators where it matters:
    • Code-based checks (cheap/deterministic) for structure/format/guardrails.
    • LLM-as-judge for a single, specific failure mode with a binary verdict.
  • Align the judge to humans: compare judge vs human labels; iterate the judge prompt until disagreements shrink (especially on rare errors). Don’t trust raw “% agreement” alone.
  • Operationalize: run evaluators in unit tests/CI and on real production samples to catch regressions and monitor drift, with ~30 minutes/week of maintenance; a sketch follows this list.
  • Relationship to PRDs & A/B: evaluators become enforceable PRDs; A/B tests complement them by measuring product/business impact at runtime.
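
To illustrate the code-based checks and the CI wiring above, here is a minimal sketch. It assumes each trace stores the model’s raw output as a string, and that outputs are expected to be JSON with a “reply” field under a length limit; the field name, the limit, and the pytest setup are assumptions for illustration, not details from the source.

import json

import pytest

# A deterministic, code-based evaluator: structure, format, and a length guardrail.
def valid_structured_reply(raw_output: str, max_chars: int = 1200) -> bool:
    try:
        payload = json.loads(raw_output)  # structure: must be valid JSON
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    reply = payload.get("reply")
    if not isinstance(reply, str) or not reply.strip():  # format: required, non-empty field
        return False
    return len(reply) <= max_chars  # guardrail: length limit

# Tiny regression set; in practice these come from saved traces.
REGRESSION_CASES = [
    ('{"reply": "Sure, a 2-bedroom is available from July 1."}', True),
    ("not json at all", False),
    ('{"reply": ""}', False),
]

@pytest.mark.parametrize("raw_output, expected", REGRESSION_CASES)
def test_reply_structure(raw_output, expected):
    assert valid_structured_reply(raw_output) == expected

Run it with pytest locally and in CI; the same evaluator function can be applied to a nightly or weekly sample of production traces.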

a simple 6-step checklist (you can run this week)

  1. Sample traces: pull 60–100 recent, diverse conversations/runs.
  2. Open code: write one short note per trace (first upstream error only).
  3. Axial code + count: cluster notes into 5–8 failure modes; pivot/count to prioritize.
  4. Quick fixes first: repair obvious prompt/UX/engineering issues immediately.
  5. Add evaluators: code-based where possible; add 1–3 binary LLM-judges for subjective modes.
  6. Validate + automate: align judges to human labels, then run in CI + nightly/weekly prod sampling; review dashboards weekly.
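
For step 6, here is a minimal sketch of checking judge-vs-human alignment in plain Python, with made-up labels. It shows why raw agreement is not enough: overall agreement can look high while the judge still misses most of the rare failing class.

from collections import Counter

# Paired binary labels for the same traces (hypothetical data).
# True = "a handoff should have occurred", the rare failure class.
human = [False] * 90 + [True] * 10
judge = [False] * 90 + [True] * 3 + [False] * 7

# Confusion matrix: (human label, judge label) -> count.
matrix = Counter(zip(human, judge))
tp, fn = matrix[(True, True)], matrix[(True, False)]
tn, fp = matrix[(False, False)], matrix[(False, True)]

agreement = (tp + tn) / len(human)
recall_on_failures = tp / (tp + fn)  # how often the judge catches true failures

print(f"raw agreement:           {agreement:.0%}")           # 93%: looks fine
print(f"recall on rare failures: {recall_on_failures:.0%}")  # 30%: judge misses most of them

Iterate the judge prompt until both classes look good, not just the overall agreement number.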

implementation tips

  • Keep judges binary; avoid “1–5” ratings. They’re slower, fuzzier, and harder to interpret.
  • Label only the first upstream error per trace to stay fast and consistent.
  • Create a “none of the above” bucket while clustering; it reveals missing failure modes.
  • Small team? Appoint a benevolent dictator (domain expert) to make final labeling calls quickly.
  • Build a minimal data review UI (or use existing observability tools) to remove friction from trace review.
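
If a dedicated review UI feels like too much to start, a bare terminal loop over exported traces can cover the first pass. The file name and trace fields below are assumptions about your export format; the workflow is the one described above: one pass, one short note (the first upstream error) per trace.

import json

# Assumed export format: one JSON object per line with "id" and "messages" fields.
with open("traces.jsonl") as f:
    traces = [json.loads(line) for line in f]

open_codes = {}
for trace in traces:
    print("=" * 60)
    for message in trace["messages"]:
        print(f'{message["role"]}: {message["content"]}')
    # One concise open code per trace: the first upstream error only.
    note = input("first upstream error (Enter if none): ").strip()
    if note:
        open_codes[trace["id"]] = note

with open("open_codes.json", "w") as f:
    json.dump(open_codes, f, indent=2)

Swap the input() loop for a spreadsheet or an existing observability tool as soon as the volume justifies it.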

glossary

  • Likert scale: a psychometric response scale with ordered options (e.g., 1–5 from “strongly disagree” to “strongly agree”). In evals, avoid such graded scores for judges; prefer binary pass/fail to make results crisp, comparable, and automatable.
  • Open coding: free-form note taking on traces to capture observed issues without a predefined taxonomy.
  • Axial coding: grouping open-code notes into a small set of actionable failure modes used for counting and prioritization.
  • Theoretical saturation: the point in analysis when reviewing more traces stops producing new issue types—your cue to move on.
  • LLM-as-judge: using a model to make a narrow, binary evaluation about a specific failure mode (e.g., “Should this have been handed to a human? TRUE/FALSE”).
  • Code-based evaluator: a deterministic check written in code (e.g., JSON validity, length limits, schema adherence).
  • Benevolent dictator: a single domain expert empowered to make fast, consistent labeling and taxonomy decisions.

minimal judge prompt (example, binary)

Given the full trace, output only TRUE if a human handoff should have occurred; else FALSE. Handoff required if any of: (1) user explicitly requests a human; (2) policy-mandated topics; (3) sensitive complaints/escalations; (4) missing/failed tool data; (5) same-day tour scheduling. Return exactly TRUE or FALSE.
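
A sketch of wiring this prompt up as a binary judge, assuming the OpenAI Python SDK; the model name, client setup, and the way the trace is passed in are placeholders, and any provider that returns text would work the same way: the judge must output exactly TRUE or FALSE, which is then mapped to a boolean.

from openai import OpenAI

JUDGE_PROMPT = (
    "Given the full trace, output only TRUE if a human handoff should have "
    "occurred; else FALSE. Handoff required if any of: (1) user explicitly "
    "requests a human; (2) policy-mandated topics; (3) sensitive complaints/"
    "escalations; (4) missing/failed tool data; (5) same-day tour scheduling. "
    "Return exactly TRUE or FALSE."
)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def handoff_required(trace_text: str) -> bool:
    """Binary LLM-as-judge: should this trace have been handed off to a human?"""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": trace_text},
        ],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict == "TRUE"

The binary output can be compared directly against human pass/fail labels and dropped into the same CI and production-sampling loops as the code-based checks.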
