Design a tiered AI eval program for a launched feature
You shipped an LLM-backed feature on vibe checks and now real users are surfacing failures you never tested for. This prompt builds a tiered eval program for the feature: human review for the judgment calls, code-based evals for deterministic checks, and an LLM judge for the open-ended outputs you cannot grade by hand.
Vibe checks are a launch strategy, not an operating strategy
When teams ship AI features on vibe checks (a handful of internal demos and a Slack thread of "looks good"), the failures show up later, in production, where they cost more and erode trust. Evals are how teams replace vibes with a measurement contract that survives launch. Eval design is rapidly becoming a defining skill for AI PMs, because prompts can be debated indefinitely while evals turn the debate into numbers.
The ask is not perfect coverage. The ask is enough signal to decide what to ship and what to roll back. A working eval program tells you whether the change you just made improved the feature or quietly broke an existing case.
Why a single tier never covers the surface
Most teams reach for one eval style and try to make it cover everything. Human reviews scale poorly. Code evals miss subjective failures. LLM judges drift if not calibrated. The fix is not to pick the best tier; the fix is to assign each failure mode to the tier that actually fits.
Human evals catch judgment. Tone, brand voice, "is this actually helpful," edge case appropriateness. Inline thumbs up/down plus a small weekly expert review. The data is rich but expensive and sparse, so save it for the questions only humans can answer.
Code evals catch deterministic failures. Format breaks, schema mismatches, latency budgets, cost ceilings, PII patterns. Cheap, fast, hard pass and fail. They cover a narrow slice of the problem space and they cover it perfectly.
LLM judges catch open-ended quality. Hallucination, retrieval relevance, summary coherence. They scale, they explain their grades in natural language, and they are calibrated against a small labeled set so the score actually means something.
Anthropic's evaluation tool documentation walks through the LLM-judge pattern, including how to validate the judge against human labels before trusting its scores at scale. Anthropic's research page collects related work on alignment and evaluation methodology that informs how production teams design grading rubrics.
How the "Design a tiered AI eval program" prompt works
The prompt runs in six steps. Step 1 forces the team to list 5-8 ways the feature can be wrong from a user's perspective, with one production example per failure mode. The "from a user's perspective" framing matters; teams that list failure modes from the model's perspective end up with abstract metrics that no user complained about.
Step 2 assigns a tier to each failure mode. Tone and brand voice get human review. Format and latency get code evals. Hallucination and helpfulness get an LLM judge. The justification line per assignment is the discipline; teams that skip it tend to default to whichever tier they already had infrastructure for.
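In practice the output of this step is small enough to live next to the eval code. A minimal sketch in Python, with hypothetical failure modes and justifications standing in for the team's own:

```python
# Illustrative Step 2 output: each failure mode gets exactly one tier and a
# one-line justification. The entries below are hypothetical examples.
TIER_ASSIGNMENTS = [
    {"failure_mode": "off-brand tone in replies",
     "tier": "human",
     "justification": "brand voice needs expert judgment; no rubric automates it cleanly"},
    {"failure_mode": "response is not valid JSON",
     "tier": "code",
     "justification": "deterministic; json.loads either succeeds or it does not"},
    {"failure_mode": "summary states facts missing from the source",
     "tier": "llm_judge",
     "justification": "open-ended; needs grading against the retrieved context"},
]
```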
Step 3 builds the golden set: 50-150 inputs that mirror real traffic, weighted 60 percent typical, 25 percent known-failure, 15 percent adversarial. Sourcing from production logs (anonymized) is non-negotiable. Synthetic-only golden sets test the team's imagination, not the product's reality.
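A minimal sketch of the sampling step, assuming the anonymized production logs have already been bucketed into typical, known-failure, and adversarial pools; the function name, target size, and fixed seed are illustrative:

```python
import random

def build_golden_set(typical, known_failure, adversarial, size=100, seed=7):
    """Sample a golden set weighted 60/25/15 across the three buckets."""
    rng = random.Random(seed)  # fixed seed so the set is reproducible
    counts = {
        "typical": round(size * 0.60),
        "known_failure": round(size * 0.25),
        "adversarial": round(size * 0.15),
    }
    golden = (
        rng.sample(typical, min(counts["typical"], len(typical)))
        + rng.sample(known_failure, min(counts["known_failure"], len(known_failure)))
        + rng.sample(adversarial, min(counts["adversarial"], len(adversarial)))
    )
    rng.shuffle(golden)
    return golden
```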
Step 4 writes the LLM judge rubric. The prompt forces explicit pass and fail definitions, a fixed score scale, and the exact text snippet fed to the judge. Vague rubrics produce inconsistent grades, and inconsistent grades make every score a debate.
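A sketch of what the rubric and the judge call can look like, assuming the team wraps its model client in a `call_model(prompt) -> str` callable that the function receives as an argument; the rubric wording, score scale, and pass threshold are illustrative, not the prompt's own:

```python
# Hypothetical Step 4 rubric for grading the factual grounding of a summary.
JUDGE_PROMPT = """You are grading a summary for factual grounding.

<source>{source}</source>
<summary>{summary}</summary>

Score from 1 to 5:
5 = every claim in the summary is supported by the source
3 = one minor unsupported detail that does not change the meaning
1 = the summary states facts the source does not contain

Reply with one line: SCORE: <1-5> | REASON: <one sentence>"""

def judge(source, summary, call_model, pass_score=4):
    """call_model is whatever client wrapper the team already has: prompt in, text out."""
    reply = call_model(JUDGE_PROMPT.format(source=source, summary=summary))
    score = int(reply.split("SCORE:")[1].split("|")[0].strip())
    return {"score": score, "passed": score >= pass_score, "raw": reply}
```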
Step 5 wires the cadence. Code evals on every PR, weekly human review of 30 production samples, monthly golden-set refresh. The cadence is what keeps the program alive; teams that set up an eval suite once and walk away watch it decay within a quarter.
Step 6 sets the release-gate thresholds. Without a number, every regression turns into a judgment call, and most judgment calls favor shipping.
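A sketch of what that gate can look like in code; the threshold values are placeholders to be replaced with the feature's own baseline numbers:

```python
# Illustrative release gate for Step 6. Numbers below are examples, not targets.
THRESHOLDS = {
    "code_eval_pass_rate": 1.00,    # deterministic checks: no regressions allowed
    "judge_pass_rate": 0.92,        # LLM-judge pass rate on the golden set
    "p95_latency_ms": 2500,
    "cost_per_request_usd": 0.015,
}

def release_gate(results):
    """Return True only if every metric clears its threshold; block the release otherwise."""
    return (
        results["code_eval_pass_rate"] >= THRESHOLDS["code_eval_pass_rate"]
        and results["judge_pass_rate"] >= THRESHOLDS["judge_pass_rate"]
        and results["p95_latency_ms"] <= THRESHOLDS["p95_latency_ms"]
        and results["cost_per_request_usd"] <= THRESHOLDS["cost_per_request_usd"]
    )
```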
When code evals beat LLM judges
It is tempting to use the LLM judge for everything because it scales. The cheaper move is often a code eval. If the failure mode has a deterministic correctness signal (valid JSON, response length under a cap, latency below a budget, no PII pattern matched), a code eval catches it for free and runs in milliseconds. Reserve the LLM judge for the cases where deterministic logic cannot decide.
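A sketch of those deterministic checks; the length cap, latency budget, and PII pattern below are placeholder values, not recommendations:

```python
import json
import re

PII_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # example: US SSN-style pattern

def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

def code_evals(output, latency_ms):
    """Hard pass/fail checks that run in milliseconds on every PR."""
    return {
        "valid_json": is_valid_json(output),
        "under_length_cap": len(output) <= 4000,
        "within_latency_budget": latency_ms <= 2000,
        "no_pii_match": PII_PATTERN.search(output) is None,
    }
```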
Amplitude's writing on SaaS metrics frames the same principle for product analytics: instrument what is cheap to measure first, and save expensive measurement for the questions cheap signals cannot answer. The same heuristic applies to AI quality.
When to use it
- You shipped an AI feature on vibe checks and you are now hearing about failures you did not test for.
- Two consecutive PRs broke production behavior that was working last week and nothing flagged it.
- A new model version is available and you cannot tell whether to upgrade because you have no comparison data.
- Cost per request is climbing and you have no eval gate that catches the spike before release.
- A regulator, customer, or exec is asking how you measure quality and you do not have a one-page answer.
Common pitfalls
- Single-tier coverage. One tier never covers all failure modes. Assign tiers per mode.
- Synthetic-only golden set. Without production logs, the eval tests imagination, not reality. Anonymize and include real traffic.
- Unlabeled LLM judge. A judge that has not been calibrated against 30-50 human-labeled examples is producing numbers, not signal; a minimal agreement check is sketched after this list.
- Eval suite without a kill switch. If a regression cannot block a release, the suite is decoration.
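For the judge-calibration pitfall, the agreement check itself is small. A sketch, assuming each calibration example carries both a human pass/fail label and the judge's verdict:

```python
def judge_agreement(calibration_set):
    """Fraction of labeled examples where the judge's pass/fail matches the human label."""
    matches = sum(1 for ex in calibration_set if ex["judge_passed"] == ex["human_passed"])
    return matches / len(calibration_set)

# Pick an agreement floor up front, e.g. 0.85, and recalibrate the rubric
# before trusting the judge at scale if the number comes in below it.
```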
Sources
- Anthropic eval tool documentation - Anthropic
- Anthropic Research - Anthropic
- SaaS metrics that matter - Amplitude
- The most important PM skill - Silicon Valley Product Group
- Retention engagement growth: the silent killer - Reforge