PM Interview Prep Kit
Get PM-interview ready — rewrite your resume, build a portfolio, research the company, run mock interviews, and negotiate offers.
5 prompts · 40 min · beginner
Conduct an AI vs. non-AI ROI comparison
You are running an AI ROI comparison for {{feature_name}}. Non-AI baseline: {{non_ai_baseline}}.
## Dimensions
| Dimension | Non-AI baseline | AI option | Delta |
|-----------|----------------|-----------|-------|
| Accuracy (correct outcome) | | | |
| Consistency (same output for same input) | | | |
| Cost (per call, per user, per month) | | | |
| Latency (p50, p99) | | | |
| Development cost (eng weeks to ship) | | | |
| Ongoing cost (eval, monitoring, model drift) | | | |
| Failure mode severity (what happens when it's wrong) | | | |
| Reversibility (can user recover from a bad output) | | | |
## Decision rules
- AI wins cleanly: accuracy delta >20% and cost delta under 10x
- Demo-only: impressive in demo but ROI is negative — ship the non-AI version first
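The decision rules above can be sketched as a small function. This is an illustrative sketch only: the field names (`accuracy`, `cost_per_call`) and the exact thresholds are assumptions, not part of any standard framework.

```python
def compare_roi(ai, non_ai):
    """Return a recommendation given per-option metrics.

    `ai` and `non_ai` are dicts with 'accuracy' (0-1) and
    'cost_per_call' (dollars). Thresholds mirror the rules above.
    """
    accuracy_delta = ai["accuracy"] - non_ai["accuracy"]
    cost_ratio = ai["cost_per_call"] / non_ai["cost_per_call"]

    if accuracy_delta > 0.20 and cost_ratio < 10:
        return "ai"        # AI wins cleanly
    if accuracy_delta <= 0:
        return "non-ai"    # no quality gain: ship the non-AI version
    return "hybrid"        # marginal gain: constrain scope or add review


print(compare_roi({"accuracy": 0.92, "cost_per_call": 0.02},
                  {"accuracy": 0.65, "cost_per_call": 0.01}))  # ai
```

In practice each dimension in the table would feed this decision, not just accuracy and cost; the sketch shows the shape of the rule, not a complete model.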
## Quality floor
If AI quality is below non-AI on any load-bearing dimension:
- Is there a constrained scope where AI wins?
- Is there a human-in-the-loop path that combines both?
## Output
1. Filled comparison table
2. Recommendation (AI / hybrid / non-AI)
3. The one assumption most likely to change over 12 months (model costs, capability)
4. The reversal trigger (what new data would change the call)

Design an AI feature evaluation rubric before shipping
You are an AI eval designer building the pre-launch rubric for {{feature_name}} on {{product_name}}. User task: {{user_task}}.
## Step 1 — Task set
Build a test set of 30-100 real user inputs spanning:
- Happy path (clear input, obvious intent)
- Edge cases (ambiguous input, unusual phrasing)
- Adversarial (jailbreaks, prompt injections, out-of-scope)
- Representative error modes (typos, unsupported languages, missing context)
## Step 2 — Scoring criteria
Define 3-5 dimensions, each 1-5:
- Correctness (did it answer the right thing?)
- Completeness (did it answer fully?)
- Faithfulness (did it avoid hallucinating?)
- Safety (did it refuse appropriately?)
- User-facing quality (would a user be satisfied?)
## Step 3 — Golden answers
For the 30 happy-path tasks:
- Write the ideal answer (or 2-3 acceptable variants)
- Annotate what makes it good/bad
## Step 4 — Regression guardrails
- Minimum score per dimension to ship
- Automated diff on every model/prompt change
- Human review cadence for random samples
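A minimum-score-per-dimension gate can be a few lines of automation. A minimal sketch, assuming a 1-5 rubric per dimension; the floor values here are illustrative, not prescribed by the rubric above.

```python
# Hypothetical ship floors on a 1-5 rubric (illustrative values).
SHIP_FLOORS = {"correctness": 4.0, "faithfulness": 4.5, "safety": 4.8}


def ship_gate(mean_scores):
    """Return the dimensions whose mean score is below its ship floor."""
    return [dim for dim, floor in SHIP_FLOORS.items()
            if mean_scores.get(dim, 0.0) < floor]


failures = ship_gate({"correctness": 4.2, "faithfulness": 4.1, "safety": 4.9})
print(failures)  # ['faithfulness']
```

Wired into CI, a non-empty result blocks the model/prompt change from shipping.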
## Step 5 — Output
1. Task set spec
2. Rubric with scoring examples
3. Pass/fail ship criteria
4. The one failure mode we'd tolerate (and why)
5. The monitoring plan for post-launch drift

Run an AI feature eval with a golden dataset
You are running a formal eval for {{ai_feature}}. Goal: produce a defensible quality number.
## Step 1 — Dataset
- 100-500 real production inputs
- Balanced across use cases
- Ground-truth labels from domain experts (not PMs)
- Held-out test split (not used for prompt iteration)
## Step 2 — Scoring
### Automatic metrics
- Exact match (where applicable)
- ROUGE/BLEU for text generation
- Custom task-specific metrics
### LLM-judge
- Calibrate LLM judge on 50 labeled examples first
- Spot-check judge agreement with human labels (should be >85%)
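The judge calibration check is just label agreement on the calibration set. A minimal sketch using raw agreement (Cohen's kappa would correct for chance agreement, but plain agreement keeps the example short):

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the LLM judge matches the human label."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


rate = agreement_rate(["pass", "fail", "pass", "pass"],
                      ["pass", "fail", "fail", "pass"])
print(rate)          # 0.75
print(rate >= 0.85)  # False: below threshold, so recalibrate the judge
```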
### Human scoring
- Blind (judge doesn't know which model/prompt)
- 3 annotators per item, aggregate
- 5-point rubric with anchor examples
## Step 3 — Statistical rigor
- Confidence intervals (bootstrap)
- Sample size for 95% confidence
- Multiple comparison correction if testing >2 variants
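A percentile bootstrap is enough to put a confidence interval on a pass-rate metric. A stdlib-only sketch (resample counts and the fixed seed are illustrative choices):

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-item scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


scores = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # 1 = pass, observed rate 0.80
lo, hi = bootstrap_ci(scores)
print(f"95% CI [{lo:.2f}, {hi:.2f}]")
```

A wide interval on a small eval set is itself a finding: it tells you the headline number is not yet defensible and the dataset needs to grow.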
## Step 4 — Reporting
- Overall quality score (0-100 or pass rate)
- Per-category scores
- Confidence intervals
- Variant comparison if applicable
- The one category where model is unreliable
## Output
1. Eval spec (dataset, metrics, judge setup)
2. Scoring rubric
3. Report template
4. The one eval decision (metric choice) that most affects the headline number

Design an AI product disclosure pattern for users
You are designing AI disclosure for {{ai_feature}}.
## Step 1 — Disclosure mechanisms
- Label output as AI-generated (visual + text)
- Confidence indicator (explicit high/med/low or numeric)
- Source citation (where the output came from)
- Refusal surface (what the AI won't do and why)
- Override affordance (let the user edit or reject)
## Step 2 — Placement
| Mechanism | Where | When |
|-----------|-------|------|
| AI label | Every AI output | Always |
| Confidence | Before user acts on output | When model is uncertain |
| Citation | Inline in output | Whenever claims are made |
| Refusal | When user asks out-of-scope | On attempt |
| Override | On every output | Always available |
## Step 3 — Anti-patterns to avoid
- Over-confident certainty language ("definitely," "always")
- Fake humility (confidence shown low when AI is actually reliable)
- Hidden AI (disguising AI output as human)
- Disclosure fatigue (so many warnings users ignore them)
## Step 4 — Testing
Run 10 users through the flow:
- Did they notice the AI disclosure?
- Did they calibrate trust appropriately?
- Did they know how to override?
- Did the confidence signal match their experience?
## Output
1. Disclosure mechanism spec
2. Placement rules
3. 3 anti-patterns we'd specifically audit for
4. User testing plan

Design a human-in-the-loop safety net for an AI feature
You are designing the HITL safety net for {{ai_feature}}. Stakes if wrong: {{stakes}}.
## Step 1 — Task taxonomy
Classify tasks into tiers:
- **Tier A (Auto)**: low stakes + high model confidence → AI acts independently
- **Tier B (Review)**: medium stakes OR medium confidence → AI proposes, human approves
- **Tier C (Hybrid)**: high stakes + needs human judgment → AI drafts, human rewrites
- **Tier D (Human-only)**: too risky or judgment-heavy for AI
## Step 2 — Confidence thresholds
Define numeric thresholds per tier:
- Model confidence score
- Output length / complexity heuristic
- User-signal checks (first-time user, sensitive account, etc.)
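Steps 1 and 2 together amount to a routing function. A hypothetical sketch: the tier names follow the taxonomy above, but the stakes labels and the 0.9 confidence cutoff are illustrative assumptions.

```python
def route(stakes, confidence, human_only=False):
    """Map a task to a review tier (A auto, B review, C hybrid, D human)."""
    if human_only:
        return "D"              # too risky or judgment-heavy for AI
    if stakes == "high":
        return "C"              # AI drafts, human rewrites
    if stakes == "low" and confidence >= 0.9:
        return "A"              # AI acts independently
    return "B"                  # AI proposes, human approves


print(route("low", 0.95))     # A
print(route("medium", 0.95))  # B
print(route("high", 0.99))    # C
```

Note the asymmetry: high stakes routes to Tier C regardless of confidence, while Tier A requires both low stakes and high confidence.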
## Step 3 — Review UI design
For Tier B tasks:
- Surface AI proposal clearly marked
- One-click accept / edit / reject
- Capture feedback signal for training data
## Step 4 — Escalation triggers
- Model flagged uncertain
- User flagged "not confident"
- Pattern detected (3 consecutive rejections) → auto-pause AI
- Adversarial input detected
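The "3 consecutive rejections → auto-pause" trigger is a small piece of state. A sketch under stated assumptions (the class name and the default limit of 3 are illustrative):

```python
class RejectionMonitor:
    """Pause the AI after N consecutive user rejections."""

    def __init__(self, limit=3):
        self.limit = limit
        self.streak = 0
        self.paused = False

    def record(self, accepted):
        # Any acceptance resets the streak; rejections accumulate.
        self.streak = 0 if accepted else self.streak + 1
        if self.streak >= self.limit:
            self.paused = True  # stop auto-serving AI output
        return self.paused


m = RejectionMonitor()
for accepted in [True, False, False, False]:
    m.record(accepted)
print(m.paused)  # True
```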
## Step 5 — Feedback loop
- Reject data → eval set expansion
- Edit data → prompt/model fine-tune candidates
- Accept data → confidence threshold calibration
## Output
1. Task tier table
2. Confidence threshold rules
3. Review UI spec
4. The one task tier we'd expand from B to A first (and why)
5. The one tier we'd keep in C indefinitely