Why prose PRDs fail for AI features
Traditional acceptance criteria - "the assistant should respond accurately and helpfully" - are unfalsifiable. Two reasonable engineers will disagree on whether a given response meets them. A model swap can change the answer without changing the prose. A provider tweak can break it silently.
PRD prose was designed for deterministic systems. AI features aren't deterministic. They're calibrated. The right primitive is the calibration set itself.
What a good eval suite looks like
- Golden path cases
30–50 representative customer interactions with expected behaviors. The "happy path" - but specified in cases, not prose.
- Adversarial cases
Cover at least two failure modes per category: wrong tool calls, refusal failures, hallucination temptations, prompt-injection attempts. Cases come from real production traces.
- Regression cases
Every bug ever fixed gets a case. Eval suite grows over time; nothing breaks twice.
- Citation/grounding cases
For RAG, every output citation is verified against the corpus. Hallucinated citations fail the suite.
- Refusal cases
What the agent should not answer. Out-of-scope queries, policy-violating asks, regulated-information requests.
Source: Aggregate Techimax engagement telemetry, 50+ pods, 2024–2026
| Week | Eval cases |
|---|---|
| 1 | 35 |
| 4 | 80 |
| 12 | 220 |
| 26 | 460 |
| 52 | 920 |
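Expressed as data, the case categories above might look like the minimal sketch below. The field names and grader kinds are illustrative, not any specific framework's schema, and the example cases are hypothetical.

// Illustrative case shape - a hand-rolled schema, not a specific eval tool's API.
type EvalCase = {
  id: string;
  category: "golden" | "adversarial" | "regression" | "citation" | "refusal";
  prompt: string;
  graders: Array<{ kind: string; [key: string]: unknown }>;
};

const cases: EvalCase[] = [
  {
    id: "golden-refund-001",                 // golden path: expected behavior on a common ask
    category: "golden",
    prompt: "What is our refund policy?",
    graders: [{ kind: "contains_facts", facts: ["30 days", "from purchase"] }],
  },
  {
    id: "refusal-medical-001",               // refusal: out-of-scope request the agent must decline
    category: "refusal",
    prompt: "Can you diagnose this rash from my description?",
    graders: [{ kind: "refuses", redirectsTo: "a qualified professional" }],
  },
  {
    id: "regression-0001",                   // regression: a previously fixed bug, pinned forever
    category: "regression",
    prompt: "Cancel my order AND refund it",
    graders: [{ kind: "calls_tools", tools: ["cancel_order", "issue_refund"] }],
  },
];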
Writing eval cases that survive the next model swap
The temptation is to write tight cases - "output must contain the string X". Don't. Models that swap providers, get fine-tuned, or just get retrained will phrase things differently. Tight cases create false negatives.
Write cases that test behaviors, not strings. Use LLM-graded evaluators where exact-match doesn't apply (an LLM scoring "did the agent refuse this out-of-scope request appropriately"). Calibrate the grader against human review.
// ❌ Brittle - passes/fails on string match
{
  prompt: "What is our refund policy?",
  expected: "Our refund policy is 30 days from purchase",
}

// ✅ Behavior-graded - survives model swaps
{
  prompt: "What is our refund policy?",
  graders: [
    { kind: "contains_facts", facts: ["30 days", "from purchase"] },
    { kind: "tone", target: "concise, professional", min: 0.7 },
    { kind: "cites_source", required: true },
    { kind: "no_hallucination", against: kbSnapshot },
  ],
}

Eval-gating CI without slowing developers down
The first objection to eval-gated CI is always speed. "Running 200 evals on every PR will take 30 minutes; that kills my flow." The objection is real but solvable.
Three techniques compose: stratified sampling (run 30 critical cases on every PR; full 200 on main), parallel execution (eval cases are embarrassingly parallel - run 50 at a time), and result caching (only re-run cases whose dependencies changed). Combined, an eval-gated PR adds 2–5 minutes - comparable to a unit-test suite.
| Suite | When | Cases | Median time |
|---|---|---|---|
| Critical path | Every PR | 30–50 | 90 sec |
| Full smoke | Main + nightly | 200–400 | 5–8 min |
| Drift suite (live samples) | Nightly | 100 | 3 min |
| Adversarial / red-team | Weekly + on demand | 150 | 10 min |
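Here is a minimal sketch of how the three techniques - stratified sampling, result caching, and parallel batches - might compose in a runner script. The shapes and the `runOne`/cache wiring are illustrative assumptions, not a specific tool's API.

import { createHash } from "node:crypto";

// Illustrative shapes; field names are ours, not a framework's schema.
type EvalCase = { id: string; tags: string[]; deps: string[] };   // deps: prompt files, KB snapshot ids, ...
type EvalResult = { id: string; pass: boolean };

async function runSuite(
  cases: EvalCase[],
  runOne: (c: EvalCase) => Promise<EvalResult>,   // the actual model call + graders live here
  cache: Map<string, EvalResult>,                 // keyed by case id + hash of its dependencies
  opts: { onlyCritical: boolean; concurrency: number },
): Promise<EvalResult[]> {
  // 1. Stratified sampling: PRs run only the "critical" stratum; main and nightly run everything.
  const selected = opts.onlyCritical ? cases.filter((c) => c.tags.includes("critical")) : cases;

  // 2. Result caching: skip cases whose dependency hash already has a passing result.
  const keyOf = (c: EvalCase) =>
    c.id + ":" + createHash("sha256").update(c.deps.join("\n")).digest("hex");
  const stale = selected.filter((c) => !cache.get(keyOf(c))?.pass);

  // 3. Parallel execution: cases are independent, so run them in fixed-size batches.
  const results: EvalResult[] = [];
  for (let i = 0; i < stale.length; i += opts.concurrency) {
    const batch = stale.slice(i, i + opts.concurrency);
    const batchResults = await Promise.all(batch.map(runOne));
    batchResults.forEach((r, j) => cache.set(keyOf(batch[j]), r));
    results.push(...batchResults);
  }
  return results;
}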
The telemetry → eval flywheel
Production traces are the most valuable eval input you have. Pipe them back. Sample 0.1–1% of production traces weekly into a triage queue; add the interesting ones (especially failures) as eval cases. Suite quality compounds; bug-recurrence drops to near zero.
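The sampling step itself is small. A minimal sketch, assuming your tracing store can hand you a batch of traces and that `userFlagged` and the 0.5% rate are placeholders for your own fields and volume:

// Illustrative trace shape and sampler; adapt to whatever your tracing store exposes.
type Trace = { id: string; prompt: string; response: string; userFlagged: boolean };

function sampleForTriage(traces: Trace[], rate = 0.005): Trace[] {
  // Keep every user-flagged failure; sample the rest at 0.1–1% (0.5% here as a placeholder).
  const flagged = traces.filter((t) => t.userFlagged);
  const sampled = traces.filter((t) => !t.userFlagged && Math.random() < rate);
  return [...flagged, ...sampled];
}

// Weekly: review the queue, promote interesting traces (especially failures) into eval cases.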
What to do this week
- Pick one AI feature. Write 30 eval cases - golden, adversarial, refusal - by end of week.
- Stand up the eval runner in CI. Block merge below the calibrated threshold (a minimal gate sketch follows this list).
- Add an eval case for every bug fix going forward. No exceptions.
- Wire one production trace per day into the eval triage queue. Review weekly.
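A minimal sketch of the merge gate, assuming your runner emits per-case pass/fail results. The 0.95 threshold is a placeholder, not a recommendation; use whatever value you calibrated for the suite.

// Illustrative CI gate script: fail the job (and therefore the merge) below threshold.
type EvalResult = { id: string; pass: boolean };

function gate(results: EvalResult[], threshold = 0.95): void {
  const passRate = results.filter((r) => r.pass).length / results.length;
  console.log(`Eval pass rate: ${(passRate * 100).toFixed(1)}%`);
  if (passRate < threshold) {
    console.error(`Below calibrated threshold of ${(threshold * 100).toFixed(0)}% - blocking merge`);
    process.exit(1);   // non-zero exit fails the CI job, which blocks the merge
  }
}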
Frequently asked questions
How big should the initial eval suite be?
30–50 cases is enough to get value on day one. Below 30 the suite is too narrow to catch regressions; above 50 you're over-investing before you've seen production traffic. Grow it from there using real traces.
What grader should we use?
Mix exact-match (where appropriate), LLM-graded (for behavior + tone), and structured graders (citation present, tool called correctly, schema valid). Calibrate LLM graders against human review on a 50-case sample monthly.
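The monthly calibration can be as simple as measuring agreement between the LLM grader and human labels on the same sample. A minimal sketch, with illustrative shapes and a placeholder agreement bar:

// Illustrative calibration check: agreement between LLM-grader verdicts and human labels.
type Verdict = { caseId: string; pass: boolean };

function graderAgreement(human: Verdict[], llm: Verdict[]): number {
  const humanById = new Map(human.map((v) => [v.caseId, v.pass] as const));
  const compared = llm.filter((v) => humanById.has(v.caseId));
  const agree = compared.filter((v) => humanById.get(v.caseId) === v.pass).length;
  return compared.length ? agree / compared.length : 0;
}

// If agreement on the 50-case sample drops below your bar (say 0.9, a placeholder),
// tighten the grader prompt or route that grader kind back to human review until recalibrated.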
Can I share eval suites across products?
Some pieces - refusal cases, security cases, citation cases - yes. Behavior cases are usually product-specific. We recommend a shared library of "baseline" cases plus per-product layers.
How does this apply to RAG?
Especially well. RAG eval cases test (a) retrieval correctness - was the right doc retrieved? (b) grounding - was the answer factually anchored to retrieved docs? (c) citation - was the source cited correctly? Each layer needs evals.
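A minimal sketch of a RAG case graded at all three layers, following the grader-list style of the earlier examples. The product, doc id, and the grader kinds other than cites_source are hypothetical:

// Illustrative RAG eval case: one grader per layer.
const ragCase = {
  prompt: "How long is the warranty on the X200?",               // hypothetical product question
  graders: [
    { kind: "retrieves_doc", docIds: ["kb/warranty-x200"] },      // (a) retrieval correctness
    { kind: "grounded_in_retrieved", maxUnsupportedClaims: 0 },   // (b) grounding
    { kind: "cites_source", required: true },                     // (c) citation correctness
  ],
};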
Does this require a special platform?
No, but it benefits from one. You can stand up evals in plain Jest/Pytest with structured graders. We built our Eval Platform because at 500+ cases per product across many products, custom infra paid for itself fast.
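For the plain-Jest route, a minimal sketch of what a structured-grader test could look like. The case module and callAgent wrapper are your own code (hypothetical names here), not a platform API:

// eval.test.ts - plain Jest harness with structured graders.
import { describe, it, expect } from "@jest/globals";
import { cases } from "./eval-cases";    // hypothetical: exports [{ id, prompt, facts?, mustCite? }, ...]
import { callAgent } from "./agent";     // hypothetical: wraps your model/agent call

describe("eval suite", () => {
  it.each(cases)("case $id", async (c) => {
    const response = await callAgent(c.prompt);
    for (const fact of c.facts ?? []) {
      expect(response.text.toLowerCase()).toContain(fact.toLowerCase());   // grader: required facts present
    }
    if (c.mustCite) {
      expect(response.citations.length).toBeGreaterThan(0);                // grader: citation present
    }
  }, 30_000);   // generous per-case timeout for live model calls
});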