Design an autonomous experiment loop for product optimization
Product Strategy
Updated 3/27/2026
Description
You know something in your product could be better — onboarding copy, pricing page layout, notification timing — but running A/B tests manually is slow and you never get through enough variants. Apply Karpathy's autoresearch pattern (https://github.com/karpathy/autoresearch) to set up a structured experiment loop where each iteration builds on the last.
Example Usage
You are a product strategist applying the "Karpathy Loop" — an autonomous experiment pattern where you iteratively modify a single artifact, measure it against one objective metric, and keep or discard each change. The goal is to turn gut-feel product optimization into a structured, repeatable experiment machine.
## Context
- Product: {{product_name}}
- What you want to optimize: {{optimization_target}}
(e.g., onboarding completion rate, landing page conversion, email open rate, feature adoption)
- Current baseline metric: {{current_metric_value}}
- Available measurement method: {{measurement_method}}
(e.g., analytics dashboard, LLM-as-judge scoring, user testing, click-through data)
## Step 1: Define the Experiment Architecture
Map your optimization to the autoresearch framework:
### The Artifact (your "train.py")
- What single artifact will the agent modify each iteration?
- Examples: landing page copy, onboarding flow script, push notification templates, pricing tier descriptions
- Write out the current version of this artifact verbatim
### The Objective Metric (your "val_bpb")
- Define ONE unambiguous number that tells you if a variant is better or worse
- It must be measurable within your time budget per experiment
- If your real metric (e.g., conversion rate) is too slow to measure, define a proxy:
- LLM clarity score (1-10) as proxy for user comprehension
- Time-to-first-action as proxy for onboarding quality
- Engagement prediction score as proxy for retention
- **Critical rule:** If you cannot define a single metric, you are not ready for this pattern. Pick one and commit.
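If you need a proxy while your real metric matures, it helps to make the proxy executable. The sketch below is a purely illustrative heuristic stand-in for an "LLM clarity score": it rewards short sentences and short words. The function name and the scoring formula are assumptions, not part of the autoresearch pattern; swap in your real measurement method before trusting the numbers.

```python
import re

def clarity_proxy(text: str) -> float:
    """Heuristic stand-in for an LLM clarity score on a 1-10 scale.

    Rewards short sentences and short words. Illustrative only --
    replace with your actual measurement method.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    if not sentences or not words:
        return 1.0
    avg_sentence_len = len(words) / len(sentences)   # words per sentence
    avg_word_len = sum(len(w) for w in words) / len(words)
    # Penalize averages above a readable baseline, clamp to 1-10.
    score = 10 - 0.25 * max(0, avg_sentence_len - 8) - 1.0 * max(0, avg_word_len - 4)
    return max(1.0, min(10.0, score))
```

Whatever proxy you pick, the point is that it returns one comparable number per variant, so the keep/discard rule stays mechanical.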
### The Constraints (your "prepare.py")
List everything that CANNOT change across experiments:
- Brand voice and tone guidelines
- Legal/compliance requirements
- Technical constraints (character limits, format requirements)
- Target audience definition
### The Time Budget
- How long does each experiment cycle take? (measurement + analysis)
- How many experiments can you run per day/week?
- What is your total experiment window before you need to ship?
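The budget questions above reduce to simple arithmetic, which is worth writing down explicitly so you know how many hypotheses you can actually afford. The numbers below are illustrative placeholders, not recommendations.

```python
# Back-of-envelope budget check with illustrative numbers -- replace
# with your own cycle time and shipping deadline.
cycle_hours = 4          # measurement + analysis per experiment
hours_per_week = 40      # time you can devote to the loop
weeks_to_ship = 3        # total experiment window before shipping

experiments_per_week = hours_per_week // cycle_hours
total_experiments = experiments_per_week * weeks_to_ship
print(experiments_per_week, total_experiments)  # 10 per week, 30 total
```

If the total comes out under ten or so, a faster proxy metric matters more than a better hypothesis list.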
## Step 2: Write Your program.md
Draft the plain-English instructions that would guide an AI agent (or your future self) through the experiment loop:
1. **Hypothesis formation** — What should the agent try changing? List 5 initial experiment directions:
- Variation in tone (formal vs. conversational)
- Structural changes (order of information, length)
- Emphasis shifts (different value props, different pain points)
- Format changes (bullets vs. paragraphs, with/without social proof)
- Radical departures (completely different approach)
2. **Keep/discard rule** — "If the new variant scores higher than the current best on {{metric_name}}, keep it. Otherwise, revert."
3. **Simplicity criterion** — "All else being equal, simpler is better. A small improvement that adds complexity is not worth it."
4. **Persistence rule** — "If you run out of obvious ideas, try combining elements from the top 3 previous variants, or try the opposite of what has been working."
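The rules above amount to a greedy hill-climbing loop. Here is a minimal sketch of that loop; `score` and `propose_variant` are hypothetical stand-ins for your measurement method and your hypothesis generator (a human, an LLM call, or a template), not functions from any real library.

```python
def run_loop(artifact, score, propose_variant, n_experiments=5):
    """Greedy keep/discard loop: keep a variant only if it beats the best.

    `score(artifact) -> float` is your objective metric; `propose_variant`
    generates the next hypothesis from the current best artifact.
    """
    best, best_score = artifact, score(artifact)
    log = []
    for i in range(1, n_experiments + 1):
        candidate = propose_variant(best)
        candidate_score = score(candidate)
        keep = candidate_score > best_score  # the keep/discard rule
        log.append((i, candidate, candidate_score, keep))
        if keep:
            best, best_score = candidate, candidate_score
    return best, best_score, log
```

Each tuple in `log` maps directly onto one row of the experiment table in Step 3.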
## Step 3: Run the First 5 Experiments
For each experiment:
| # | Hypothesis | Change Made | Metric Before | Metric After | Keep/Discard |
|---|-----------|-------------|---------------|--------------|--------------|
| 1 | | | | | |
| 2 | | | | | |
| 3 | | | | | |
| 4 | | | | | |
| 5 | | | | | |
## Step 4: Analyze the Pattern
After 5+ experiments:
1. Which experiment direction produced the biggest improvement?
2. What do the "kept" variants have in common?
3. What assumptions were proven wrong?
4. Where are diminishing returns setting in?
5. Is the proxy metric still correlating with the real outcome?
## Output
1. **Optimized artifact** — The current best version after all experiments
2. **Experiment log** — Full table of what was tried and what worked
3. **Key insight** — The single most surprising finding from the experiment loop
4. **Next experiment batch** — 3 hypotheses for the next round of iteration
5. **Pattern applicability** — Where else in your product could this same loop be applied?