
Design a tiered AI eval program for a launched feature

AI & Automation

Description

You shipped an LLM-backed feature on vibe checks, and now real users are surfacing failures you never tested for. This prompt builds a tiered eval program for the feature: human review for the judgment calls, code-based evals for deterministic checks, and an LLM judge for the open-ended outputs you cannot grade by hand.

Example Usage

You are a senior PM building a tiered eval program for {{feature_name}} after shipping it on vibe checks. Current pain: {{user_pain}}. Stack: {{ai_stack}}.

## Step 1. Map the failure modes
List 5-8 ways the feature can be wrong from a user's perspective. Write a one-line production example for each:
- Wrong answer (factually incorrect output)
- Hallucination (output not grounded in provided context)
- Tone or safety issue
- Format or schema break
- Latency or timeout
- Refusal when the answer is available
- Cost spike per request
- Privacy or PII leak

## Step 2. Pick the eval tier per failure mode
Three tiers, each with a clear job:
1. Human eval. Inline thumbs up/down plus a weekly 30-sample expert review. Use for tone, brand voice, and edge-case judgment.
2. Code eval. Deterministic checks (regex, schema validators, latency budget, cost cap). Use for format breaks, latency, cost, and PII patterns (see the sketch after this step).
3. LLM judge. A grader prompt that scores outputs on a rubric. Use for hallucination, retrieval relevance, helpfulness.

For each failure mode, assign a tier and justify the choice in one line.
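
To make tier 2 concrete, here is a minimal sketch of the code-eval checks, assuming the feature returns a JSON string with an `answer` field. The field name, budgets, and PII regex are placeholder assumptions, not values prescribed by this prompt.

```python
import json
import re

# Placeholder budgets and pattern; swap in your own numbers.
LATENCY_BUDGET_MS = 2000
COST_BUDGET_USD = 0.01
PII_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # e.g. US SSN-shaped strings

def run_code_evals(raw_output: str, latency_ms: float, cost_usd: float) -> dict:
    """Deterministic checks: schema, latency, cost, and a simple PII regex."""
    results = {}

    # Format / schema check: output must parse as JSON and contain an "answer" string.
    try:
        parsed = json.loads(raw_output)
        results["format_ok"] = isinstance(parsed, dict) and isinstance(parsed.get("answer"), str)
    except json.JSONDecodeError:
        results["format_ok"] = False

    results["latency_ok"] = latency_ms <= LATENCY_BUDGET_MS
    results["cost_ok"] = cost_usd <= COST_BUDGET_USD
    results["pii_ok"] = not PII_PATTERN.search(raw_output)
    return results
```

Each check returns a plain boolean, so the same function can run on every production request and on the golden set.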

## Step 3. Build the golden set
Assemble 50-150 input examples that mirror real production traffic:
- 60 percent typical cases
- 25 percent known failure cases
- 15 percent adversarial or edge cases

Source from production logs (anonymized) plus 10-20 synthetic examples for cases users have not yet hit.
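
One way to keep that mix honest is to tag every case with a category and check the split in code. The JSONL entry shape and field names below are assumptions for illustration, not part of the prompt.

```python
import json
from collections import Counter

# Hypothetical golden-set entry: one JSON object per line in golden_set.jsonl.
EXAMPLE_CASE = {
    "id": "case-001",
    "input": "Summarize this support ticket ...",
    "expected": {"must_mention": ["refund"], "format": "json"},
    "category": "typical",       # typical | known_failure | adversarial
    "source": "production_log",  # production_log | synthetic
}

TARGET_MIX = {"typical": 0.60, "known_failure": 0.25, "adversarial": 0.15}

def check_golden_set(path: str, tolerance: float = 0.05) -> None:
    """Print whether the golden set matches the 60/25/15 target mix."""
    with open(path) as f:
        cases = [json.loads(line) for line in f]
    counts = Counter(c["category"] for c in cases)
    for category, target in TARGET_MIX.items():
        actual = counts[category] / len(cases)
        status = "OK" if abs(actual - target) <= tolerance else "OFF TARGET"
        print(f"{category}: {actual:.0%} (target {target:.0%}) {status}")
```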

## Step 4. Write the LLM judge rubric
For each LLM-judged eval, write the prompt with explicit pass criteria:
- Definition of pass (positive examples)
- Definition of fail (negative examples)
- Score scale (binary or 1-5)
- The exact text snippet to feed the judge

Calibrate the judge against 30-50 human-labeled examples before trusting its scores.
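
Below is a minimal sketch of a binary hallucination judge plus the calibration step, assuming a generic `call_llm(prompt) -> str` client; the rubric wording and field names are illustrative, not prescribed by this prompt.

```python
# Binary judge: PASS only if every claim in the answer is grounded in the context.
JUDGE_PROMPT = """You are grading an AI answer for hallucination.

PASS means every factual claim in the answer is supported by the provided context.
FAIL means at least one claim is not supported by the context.

Context:
{context}

Answer to grade:
{answer}

Reply with exactly one word: PASS or FAIL."""

def judge(call_llm, context: str, answer: str) -> bool:
    """Return True if the judge passes the answer."""
    reply = call_llm(JUDGE_PROMPT.format(context=context, answer=answer))
    return reply.strip().upper().startswith("PASS")

def calibrate(call_llm, labeled_examples: list[dict]) -> float:
    """Agreement rate between the judge and 30-50 human labels.

    Each example: {"context": str, "answer": str, "human_pass": bool}.
    """
    agree = sum(
        judge(call_llm, ex["context"], ex["answer"]) == ex["human_pass"]
        for ex in labeled_examples
    )
    return agree / len(labeled_examples)
```

If agreement with the human labels is low, tighten the rubric or add positive and negative examples to the judge prompt before letting its scores gate anything.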

## Step 5. Wire the cadence
- On every PR that touches the AI surface: run code evals plus the LLM judge on the golden set
- Weekly: human review of 30 random production samples
- Monthly: refresh golden set with 10 new cases from production logs
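
For the per-PR run, a pytest sketch like the one below can load the golden set and fail the build on deterministic breaks; `run_feature` and the `code_evals` module are stand-ins for your own harness and the Step 2 sketch.

```python
import json

import pytest

from code_evals import run_code_evals  # the Step 2 sketch, saved as code_evals.py

def run_feature(prompt: str):
    """Stand-in for calling your real feature; returns (raw_output, latency_ms, cost_usd)."""
    raise NotImplementedError("wire this to your AI surface")

with open("golden_set.jsonl") as f:
    GOLDEN_SET = [json.loads(line) for line in f]

@pytest.mark.parametrize("case", GOLDEN_SET, ids=lambda c: c["id"])
def test_code_evals(case):
    raw_output, latency_ms, cost_usd = run_feature(case["input"])
    results = run_code_evals(raw_output, latency_ms, cost_usd)
    assert all(results.values()), f"failed checks on {case['id']}: {results}"
```

The LLM-judge pass over the golden set can run in the same job but feed the aggregate thresholds in Step 6 rather than asserting per case, so one flaky judgment does not block a PR.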

## Step 6. Set release-gate thresholds
For each metric, name the number that gates a release:
- Format pass rate at 100 percent
- Hallucination rate below 2 percent on the golden set
- Latency p95 below the budget
- Cost per request below the budget
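
A release-gate sketch that turns those thresholds into a single pass/fail exit code follows; the metric names and numbers mirror the list above and are starting points, not universal values.

```python
import sys

# ("min", x) means the metric must be >= x; ("max", x) means it must be <= x.
THRESHOLDS = {
    "format_pass_rate":   ("min", 1.00),   # 100 percent
    "hallucination_rate": ("max", 0.02),   # below 2 percent on the golden set
    "latency_p95_ms":     ("max", 2000),   # your latency budget
    "cost_per_request":   ("max", 0.01),   # your cost budget, in dollars
}

def gate(metrics: dict[str, float]) -> bool:
    """Return True only if every metric clears its threshold."""
    ok = True
    for name, (kind, limit) in THRESHOLDS.items():
        value = metrics[name]
        passed = value >= limit if kind == "min" else value <= limit
        print(f"{name}: {value} ({kind} {limit}) {'PASS' if passed else 'FAIL'}")
        ok = ok and passed
    return ok

if __name__ == "__main__":
    # Example numbers only; in practice these come from your eval run.
    sample = {"format_pass_rate": 1.0, "hallucination_rate": 0.01,
              "latency_p95_ms": 1800, "cost_per_request": 0.008}
    sys.exit(0 if gate(sample) else 1)
```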

## Output
1. Failure mode list with one-line examples
2. Tier assignment table (failure mode to tier with one-line justification)
3. Golden set composition (counts per category)
4. LLM judge rubric prompts
5. Eval cadence
6. Release-gate thresholds
7. The single failure mode you currently have no signal on, plus the cheapest test to install for it
