
AI Feature Launch Readiness

Workflow
5 steps · 35 min · advanced

Ship an AI feature without regret — from ROI justification to evals, disclosure, and safety nets.


Steps

1

ROI Comparison

Conduct an AI vs. non-AI ROI comparison

You are running an AI ROI comparison for {{feature_name}}. Non-AI baseline: {{non_ai_baseline}}.

## Dimensions
| Dimension | Non-AI baseline | AI option | Delta |
|-----------|----------------|-----------|-------|
| Accuracy (correct outcome) | | | |
| Consistency (same output for same input) | | | |
| Cost (per call, per user, per month) | | | |
| Latency (p50, p99) | | | |
| Development cost (eng weeks to ship) | | | |
| Ongoing cost (eval, monitoring, model drift) | | | |
| Failure mode severity (what happens when it's wrong) | | | |
| Reversibility (can user recover from a bad output) | | | |

## Decision rules
- AI wins cleanly: accuracy delta >20% and cost delta <10%
- Cost-prohibitive: accuracy is comparable but cost delta is >10x; keep the non-AI baseline
- Demo-only: impressive in demo but ROI is negative — ship the non-AI version first
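The decision rules above can be sketched as a small classifier. The threshold values and tier labels are illustrative assumptions for this recipe, not fixed policy:

```python
def classify_roi(accuracy_delta_pct: float, cost_multiplier: float,
                 demo_roi_positive: bool) -> str:
    """Apply the decision rules above. Thresholds are illustrative."""
    if not demo_roi_positive:
        return "demo-only: ship the non-AI version first"
    if accuracy_delta_pct > 20 and cost_multiplier < 10:
        return "AI wins cleanly"
    if cost_multiplier >= 10:
        return "cost-prohibitive: keep the non-AI baseline"
    return "marginal: consider a hybrid or constrained scope"

print(classify_roi(accuracy_delta_pct=25, cost_multiplier=3,
                   demo_roi_positive=True))  # AI wins cleanly
```

Encoding the rules this way forces the team to agree on concrete numbers before the debate starts, rather than arguing from vibes after the demo.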

## Quality floor
If AI quality is below non-AI on any load-bearing dimension:
- Is there a constrained scope where AI wins?
- Is there a human-in-the-loop path that combines both?

## Output
1. Filled comparison table
2. Recommendation (AI / hybrid / non-AI)
3. The one assumption most likely to change over 12 months (model costs, capability)
4. The reversal trigger (what new data would change the call)
Before building AI, prove it's actually better than the non-AI version.
2

Eval Rubric

Design an AI feature evaluation rubric before shipping

You are an AI eval designer building the pre-launch rubric for {{feature_name}} on {{product_name}}. User task: {{user_task}}.

## Step 1 — Task set
Build a test set of 30-100 real user inputs spanning:
- Happy path (clear input, obvious intent)
- Edge cases (ambiguous input, unusual phrasing)
- Adversarial (jailbreaks, prompt injections, out-of-scope)
- Representative error modes (typos, unsupported languages, missing context)

## Step 2 — Scoring criteria
Define 3-5 dimensions, each 1-5:
- Correctness (did it answer the right thing?)
- Completeness (did it answer fully?)
- Faithfulness (did it hallucinate?)
- Safety (did it refuse appropriately?)
- User-facing quality (would a user be satisfied?)

## Step 3 — Golden answers
For the 30 happy-path tasks:
- Write the ideal answer (or 2-3 acceptable variants)
- Annotate what makes it good/bad

## Step 4 — Regression guardrails
- Minimum score per dimension to ship
- Automated diff on every model/prompt change
- Human review cadence for random samples
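A regression guardrail like "minimum score per dimension to ship" can be automated as a gate in CI. A minimal sketch, assuming 1-5 averaged scores per dimension; the dimension names and floor values are placeholders to calibrate per feature:

```python
MIN_SCORES = {  # illustrative per-dimension ship floors (1-5 scale)
    "correctness": 4.0,
    "faithfulness": 4.5,
    "safety": 4.8,
}

def ship_gate(avg_scores: dict[str, float]) -> list[str]:
    """Return the dimensions that fail their floor; empty means OK to ship."""
    return [dim for dim, floor in MIN_SCORES.items()
            if avg_scores.get(dim, 0.0) < floor]

failures = ship_gate({"correctness": 4.2, "faithfulness": 4.1, "safety": 4.9})
print(failures)  # ['faithfulness']
```

Run this on every model or prompt change so a regression in one dimension blocks the release even when the overall average looks fine.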

## Step 5 — Output
1. Task set spec
2. Rubric with scoring examples
3. Pass/fail ship criteria
4. The one failure mode we'd tolerate (and why)
5. The monitoring plan for post-launch drift
Define the quality bar before you ship, not after users complain.
3

Golden Dataset Eval

Run an AI feature eval with golden dataset

You are running a formal eval for {{ai_feature}}. Goal: produce a defensible quality number.

## Step 1 — Dataset
- 100-500 real production inputs
- Balanced across use cases
- Ground-truth labels from domain experts (not PMs)
- Held-out test split (not used for prompt iteration)

## Step 2 — Scoring
### Automatic metrics
- Exact match (where applicable)
- ROUGE/BLEU for text generation
- Custom task-specific metrics

### LLM-judge
- Calibrate LLM judge on 50 labeled examples first
- Spot-check judge agreement with human labels (should be >85%)
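The judge-agreement spot check is a simple match rate over a labeled sample. A minimal sketch, assuming each item gets one judge label and one human label:

```python
def judge_agreement(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of items where the LLM judge matches the human label."""
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

# 50 calibration items: judge disagrees with humans on 2 of them
agreement = judge_agreement(["pass"] * 45 + ["fail"] * 5,
                            ["pass"] * 43 + ["fail"] * 7)
print(agreement >= 0.85)  # True
```

If agreement falls below the 85% bar, fix the judge prompt (or the rubric it scores against) before trusting any judge-produced numbers.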

### Human scoring
- Blind (judge doesn't know which model/prompt)
- 3 annotators per item, aggregate
- 5-point rubric with anchor examples

## Step 3 — Statistical rigor
- Confidence intervals (bootstrap)
- Sample size for 95% confidence
- Multiple comparison correction if testing >2 variants
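The bootstrap confidence interval above can be computed with the standard library alone. A percentile-bootstrap sketch over per-item pass/fail scores; the resample count and seed are illustrative:

```python
import random

def bootstrap_ci(scores: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean eval score."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

scores = [1.0] * 80 + [0.0] * 20  # 80% pass rate on 100 items
lo, hi = bootstrap_ci(scores)
print(f"pass rate 0.80, 95% CI ({lo:.2f}, {hi:.2f})")
```

The point of reporting the interval: an 80% pass rate on 100 items carries roughly ±8 points of uncertainty, which is often wider than the delta between the variants being compared.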

## Step 4 — Reporting
- Overall quality score (0-100 or pass rate)
- Per-category scores
- Confidence intervals
- Variant comparison if applicable
- The one category where the model is unreliable

## Output
1. Eval spec (dataset, metrics, judge setup)
2. Scoring rubric
3. Report template
4. The one eval decision (metric choice) that most affects the headline number
Test against a golden dataset so you know what 'good' actually means.
4

User Disclosure

Design an AI product disclosure pattern for users

You are designing AI disclosure for {{ai_feature}}.

## Step 1 — Disclosure mechanisms
- Label output as AI-generated (visual + text)
- Confidence indicator (explicit high/med/low or numeric)
- Source citation (where the output came from)
- Refusal surface (what the AI won't do and why)
- Override affordance (let the user edit or reject)

## Step 2 — Placement
| Mechanism | Where | When |
|-----------|-------|------|
| AI label | Every AI output | Always |
| Confidence | Before user acts on output | When model is uncertain |
| Citation | Inline in output | Whenever claims are made |
| Refusal | When user asks out-of-scope | On attempt |
| Override | On every output | Always available |

## Step 3 — Anti-patterns to avoid
- Over-confident certainty language ("definitely," "always")
- Fake humility (confidence shown low when AI is actually reliable)
- Hidden AI (disguising AI output as human)
- Disclosure fatigue (so many warnings users ignore them)

## Step 4 — Testing
Run 10 users through the flow:
- Did they notice the AI disclosure?
- Did they calibrate trust appropriately?
- Did they know how to override?
- Did the confidence signal match their experience?

## Output
1. Disclosure mechanism spec
2. Placement rules
3. 3 anti-patterns we'd specifically audit for
4. User testing plan
Tell users when they're seeing AI output, and what they can trust.
5

Safety Net

Design a human-in-the-loop safety net for an AI feature

You are designing the HITL safety net for {{ai_feature}}. Stakes if wrong: {{stakes}}.

## Step 1 — Task taxonomy
Classify tasks into tiers:
- **Tier A (Auto)**: low stakes + high model confidence → AI acts independently
- **Tier B (Review)**: medium stakes OR medium confidence → AI proposes, human approves
- **Tier C (Hybrid)**: high stakes + needs human judgment → AI drafts, human rewrites
- **Tier D (Human-only)**: too risky or judgment-heavy for AI

## Step 2 — Confidence thresholds
Define numeric thresholds per tier:
- Model confidence score
- Output length / complexity heuristic
- User-signal checks (first-time user, sensitive account, etc.)
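The tier taxonomy plus thresholds reduce to a routing function. A minimal sketch; the stakes categories and confidence cutoffs are illustrative assumptions to calibrate against real data:

```python
def route_task(stakes: str, confidence: float) -> str:
    """Route a task to a HITL tier. Thresholds are illustrative."""
    if stakes == "high":
        # High stakes: AI drafts at best; drop to human-only when uncertain.
        return "C" if confidence >= 0.5 else "D"
    if stakes == "low" and confidence >= 0.9:
        return "A"  # low stakes + high confidence: AI acts independently
    return "B"      # everything else: AI proposes, human approves

print(route_task("low", 0.95))     # A
print(route_task("medium", 0.95))  # B
print(route_task("high", 0.8))     # C
```

Note the deliberate asymmetry: Tier B is the default, so an unrecognized stakes label fails safe into human review rather than into autonomy.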

## Step 3 — Review UI design
For Tier B tasks:
- Surface AI proposal clearly marked
- One-click accept / edit / reject
- Capture feedback signal for training data

## Step 4 — Escalation triggers
- Model flagged uncertain
- User flagged "not confident"
- Pattern detected (3 consecutive rejections) → auto-pause AI
- Adversarial input detected
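The consecutive-rejection trigger above is worth making explicit in code, since an off-by-one here means the pause fires too late. A minimal sketch; the class name and the 3-rejection default are illustrative:

```python
class RejectionMonitor:
    """Auto-pause the AI after N consecutive user rejections (illustrative)."""

    def __init__(self, pause_after: int = 3):
        self.pause_after = pause_after
        self.streak = 0
        self.paused = False

    def record(self, accepted: bool) -> None:
        # Any acceptance resets the streak; rejections accumulate.
        self.streak = 0 if accepted else self.streak + 1
        if self.streak >= self.pause_after:
            self.paused = True  # escalate: route this user's tasks to Tier B/C

monitor = RejectionMonitor()
for accepted in [True, False, False, False]:
    monitor.record(accepted)
print(monitor.paused)  # True
```

Once paused, the monitor stays paused until a human clears it; an automatic un-pause would defeat the point of the escalation.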

## Step 5 — Feedback loop
- Reject data → eval set expansion
- Edit data → prompt/model fine-tune candidates
- Accept data → confidence threshold calibration

## Output
1. Task tier table
2. Confidence threshold rules
3. Review UI spec
4. The one task tier we'd expand from B to A first (and why)
5. The one tier we'd keep in C indefinitely
Design the fallback for when the AI gets it wrong. And it will.
