
AI Feature Launch Readiness

Workflow
5 steps · 35 min · advanced

Ship an AI feature without regret — from ROI justification to evals, disclosure, and safety nets.


Steps

1

ROI Comparison

Conduct an AI vs. non-AI ROI comparison

You are running an AI ROI comparison for {{feature_name}}. Non-AI baseline: {{non_ai_baseline}}.

## Dimensions
| Dimension | Non-AI baseline | AI option | Delta |
|-----------|----------------|-----------|-------|
| Accuracy (correct outcome) | | | |
| Consistency (same output for same input) | | | |
| Cost (per call, per user, per month) | | | |
| Latency (p50, p99) | | | |
| Development cost (eng weeks to ship) | | | |
| Ongoing cost (eval, monitoring, model drift) | | | |
| Failure mode severity (what happens when it's wrong) | | | |
| Reversibility (can user recover from a bad output) | | | |

## Decision rules
- AI wins cleanly: accuracy delta >20% and cost delta <10%
- Cost-prohibitive: accuracy is comparable but cost delta is >10x; keep the non-AI baseline
- Demo-only: impressive in demo but ROI is negative — ship the non-AI version first
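The decision rules above can be sketched as a small classifier. The threshold values and tier labels are illustrative assumptions for this recipe, not fixed policy:

```python
def classify_roi(accuracy_delta_pct: float, cost_multiplier: float,
                 demo_roi_positive: bool) -> str:
    """Apply the decision rules above. Thresholds are illustrative."""
    if not demo_roi_positive:
        return "demo-only: ship the non-AI version first"
    if accuracy_delta_pct > 20 and cost_multiplier < 10:
        return "AI wins cleanly"
    if cost_multiplier >= 10:
        return "cost-prohibitive: keep the non-AI baseline"
    return "marginal: consider a hybrid or constrained scope"

print(classify_roi(accuracy_delta_pct=25, cost_multiplier=3,
                   demo_roi_positive=True))  # AI wins cleanly
```

Encoding the rules this way forces the team to agree on concrete numbers before the debate starts, rather than arguing from vibes after the demo.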

## Quality floor
If AI quality is below non-AI on any load-bearing dimension:
- Is there a constrained scope where AI wins?
- Is there a human-in-the-loop path that combines both?

## Output
1. Filled comparison table
2. Recommendation (AI / hybrid / non-AI)
3. The one assumption most likely to change over 12 months (model costs, capability)
4. The reversal trigger (what new data would change the call)
Before building AI, prove it's actually better than the non-AI version.
2

Eval Rubric

Design an AI feature evaluation rubric before shipping

You are an AI eval designer building the pre-launch rubric for {{feature_name}} on {{product_name}}. User task: {{user_task}}.

## Step 1 — Task set
Build a test set of 30-100 real user inputs spanning:
- Happy path (clear input, obvious intent)
- Edge cases (ambiguous input, unusual phrasing)
- Adversarial (jailbreaks, prompt injections, out-of-scope)
- Representative error modes (typos, unsupported languages, missing context)

## Step 2 — Scoring criteria
Define 3-5 dimensions, each 1-5:
- Correctness (did it answer the right thing?)
- Completeness (did it answer fully?)
- Faithfulness (did it hallucinate?)
- Safety (did it refuse appropriately?)
- User-facing quality (would a user be satisfied?)

## Step 3 — Golden answers
For the 30 happy-path tasks:
- Write the ideal answer (or 2-3 acceptable variants)
- Annotate what makes it good/bad

## Step 4 — Regression guardrails
- Minimum score per dimension to ship
- Automated diff on every model/prompt change
- Human review cadence for random samples
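A regression guardrail like "minimum score per dimension to ship" can be automated as a gate in CI. A minimal sketch, assuming 1-5 averaged scores per dimension; the dimension names and floor values are placeholders to calibrate per feature:

```python
MIN_SCORES = {  # illustrative per-dimension ship floors (1-5 scale)
    "correctness": 4.0,
    "faithfulness": 4.5,
    "safety": 4.8,
}

def ship_gate(avg_scores: dict[str, float]) -> list[str]:
    """Return the dimensions that fail their floor; empty means OK to ship."""
    return [dim for dim, floor in MIN_SCORES.items()
            if avg_scores.get(dim, 0.0) < floor]

failures = ship_gate({"correctness": 4.2, "faithfulness": 4.1, "safety": 4.9})
print(failures)  # ['faithfulness']
```

Run this on every model or prompt change so a regression in one dimension blocks the release even when the overall average looks fine.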

## Step 5 — Output
1. Task set spec
2. Rubric with scoring examples
3. Pass/fail ship criteria
4. The one failure mode we'd tolerate (and why)
5. The monitoring plan for post-launch drift
Define the quality bar before you ship, not after users complain.
3

Golden Dataset Eval

Run an AI feature eval with golden dataset

You are running a formal eval for {{ai_feature}}. Goal: produce a defensible quality number.

## Step 1 — Dataset
- 100-500 real production inputs
- Balanced across use cases
- Ground-truth labels from domain experts (not PMs)
- Held-out test split (not used for prompt iteration)

## Step 2 — Scoring
### Automatic metrics
- Exact match (where applicable)
- ROUGE/BLEU for text generation
- Custom task-specific metrics

### LLM-judge
- Calibrate LLM judge on 50 labeled examples first
- Spot-check judge agreement with human labels (should be >85%)
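The judge-agreement spot check is a simple match rate over a labeled sample. A minimal sketch, assuming each item gets one judge label and one human label:

```python
def judge_agreement(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of items where the LLM judge matches the human label."""
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

# 50 calibration items: judge disagrees with humans on 2 of them
agreement = judge_agreement(["pass"] * 45 + ["fail"] * 5,
                            ["pass"] * 43 + ["fail"] * 7)
print(agreement >= 0.85)  # True
```

If agreement falls below the 85% bar, fix the judge prompt (or the rubric it scores against) before trusting any judge-produced numbers.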

### Human scoring
- Blind (judge doesn't know which model/prompt)
- 3 annotators per item, aggregate
- 5-point rubric with anchor examples

## Step 3 — Statistical rigor
- Confidence intervals (bootstrap)
- Sample size for 95% confidence
- Multiple comparison correction if testing >2 variants
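The bootstrap confidence interval above can be computed with the standard library alone. A percentile-bootstrap sketch over per-item pass/fail scores; the resample count and seed are illustrative:

```python
import random

def bootstrap_ci(scores: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean eval score."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

scores = [1.0] * 80 + [0.0] * 20  # 80% pass rate on 100 items
lo, hi = bootstrap_ci(scores)
print(f"pass rate 0.80, 95% CI ({lo:.2f}, {hi:.2f})")
```

The point of reporting the interval: an 80% pass rate on 100 items carries roughly ±8 points of uncertainty, which is often wider than the delta between the variants being compared.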

## Step 4 — Reporting
- Overall quality score (0-100 or pass rate)
- Per-category scores
- Confidence intervals
- Variant comparison if applicable
- The one category where the model is unreliable

## Output
1. Eval spec (dataset, metrics, judge setup)
2. Scoring rubric
3. Report template
4. The one eval decision (metric choice) that most affects the headline number
Test against a golden dataset so you know what 'good' actually means.
4

User Disclosure

Design an AI product disclosure pattern for users

You are designing AI disclosure for {{ai_feature}}.

## Step 1 — Disclosure mechanisms
- Label output as AI-generated (visual + text)
- Confidence indicator (explicit high/med/low or numeric)
- Source citation (where the output came from)
- Refusal surface (what the AI won't do and why)
- Override affordance (let the user edit or reject)

## Step 2 — Placement
| Mechanism | Where | When |
|-----------|-------|------|
| AI label | Every AI output | Always |
| Confidence | Before user acts on output | When model is uncertain |
| Citation | Inline in output | Whenever claims are made |
| Refusal | When user asks out-of-scope | On attempt |
| Override | On every output | Always available |

## Step 3 — Anti-patterns to avoid
- Over-confident certainty language ("definitely," "always")
- Fake humility (confidence shown low when AI is actually reliable)
- Hidden AI (disguising AI output as human)
- Disclosure fatigue (so many warnings users ignore them)

## Step 4 — Testing
Run 10 users through the flow:
- Did they notice the AI disclosure?
- Did they calibrate trust appropriately?
- Did they know how to override?
- Did the confidence signal match their experience?

## Output
1. Disclosure mechanism spec
2. Placement rules
3. 3 anti-patterns we'd specifically audit for
4. User testing plan
Tell users when they're seeing AI output, and what they can trust.
5

Safety Net

Design a human-in-the-loop safety net for an AI feature

You are designing the HITL safety net for {{ai_feature}}. Stakes if wrong: {{stakes}}.

## Step 1 — Task taxonomy
Classify tasks into tiers:
- **Tier A (Auto)**: low stakes + high model confidence → AI acts independently
- **Tier B (Review)**: medium stakes OR medium confidence → AI proposes, human approves
- **Tier C (Hybrid)**: high stakes + needs human judgment → AI drafts, human rewrites
- **Tier D (Human-only)**: too risky or judgment-heavy for AI

## Step 2 — Confidence thresholds
Define numeric thresholds per tier:
- Model confidence score
- Output length / complexity heuristic
- User-signal checks (first-time user, sensitive account, etc.)
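The tier taxonomy plus thresholds reduce to a routing function. A minimal sketch; the stakes categories and confidence cutoffs are illustrative assumptions to calibrate against real data:

```python
def route_task(stakes: str, confidence: float) -> str:
    """Route a task to a HITL tier. Thresholds are illustrative."""
    if stakes == "high":
        # High stakes: AI drafts at best; drop to human-only when uncertain.
        return "C" if confidence >= 0.5 else "D"
    if stakes == "low" and confidence >= 0.9:
        return "A"  # low stakes + high confidence: AI acts independently
    return "B"      # everything else: AI proposes, human approves

print(route_task("low", 0.95))     # A
print(route_task("medium", 0.95))  # B
print(route_task("high", 0.8))     # C
```

Note the deliberate asymmetry: Tier B is the default, so an unrecognized stakes label fails safe into human review rather than into autonomy.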

## Step 3 — Review UI design
For Tier B tasks:
- Surface AI proposal clearly marked
- One-click accept / edit / reject
- Capture feedback signal for training data

## Step 4 — Escalation triggers
- Model flagged uncertain
- User flagged "not confident"
- Pattern detected (3 consecutive rejections) → auto-pause AI
- Adversarial input detected
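The consecutive-rejection trigger above is worth making explicit in code, since an off-by-one here means the pause fires too late. A minimal sketch; the class name and the 3-rejection default are illustrative:

```python
class RejectionMonitor:
    """Auto-pause the AI after N consecutive user rejections (illustrative)."""

    def __init__(self, pause_after: int = 3):
        self.pause_after = pause_after
        self.streak = 0
        self.paused = False

    def record(self, accepted: bool) -> None:
        # Any acceptance resets the streak; rejections accumulate.
        self.streak = 0 if accepted else self.streak + 1
        if self.streak >= self.pause_after:
            self.paused = True  # escalate: route this user's tasks to Tier B/C

monitor = RejectionMonitor()
for accepted in [True, False, False, False]:
    monitor.record(accepted)
print(monitor.paused)  # True
```

Once paused, the monitor stays paused until a human clears it; an automatic un-pause would defeat the point of the escalation.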

## Step 5 — Feedback loop
- Reject data → eval set expansion
- Edit data → prompt/model fine-tune candidates
- Accept data → confidence threshold calibration

## Output
1. Task tier table
2. Confidence threshold rules
3. Review UI spec
4. The one task tier we'd expand from B to A first (and why)
5. The one tier we'd keep in C indefinitely
Design the fallback for when the AI gets it wrong. And it will.
