Why prose PRDs fail for AI features
Traditional acceptance criteria - "the assistant should respond accurately and helpfully" - are unfalsifiable. Two reasonable engineers will disagree on whether a given response meets them. A model swap can change the answer without changing the prose. A provider tweak can break it silently.
PRD prose was designed for deterministic systems. AI features aren't deterministic. They're calibrated. The right primitive is the calibration set itself.
What a good eval suite looks like
- Golden path cases
30–50 representative customer interactions with expected behaviors. The "happy path" - but specified in cases, not prose.
- Adversarial cases
Cover at least two failure modes per category: wrong tool calls, refusal failures, hallucination temptations, prompt-injection attempts. Cases come from real production traces.
- Regression cases
Every bug ever fixed gets a case. Eval suite grows over time; nothing breaks twice.
- Citation/grounding cases
For RAG, every output citation is verified against the corpus. Hallucinated citations fail the suite.
- Refusal cases
What the agent should not answer. Out-of-scope queries, policy-violating asks, regulated-information requests.
Source: Aggregate Techimax engagement telemetry, 50+ pods, 2024–2026
| Week | Eval cases |
|---|---|
| 1 | 35 |
| 4 | 80 |
| 12 | 220 |
| 26 | 460 |
| 52 | 920 |
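Expressed as data, the case categories above might look like the minimal sketch below. The field names and grader kinds are illustrative, not any specific framework's schema, and the example cases are hypothetical.

// Illustrative case shape - a hand-rolled schema, not a specific eval tool's API.
type EvalCase = {
  id: string;
  category: "golden" | "adversarial" | "regression" | "citation" | "refusal";
  prompt: string;
  graders: Array<{ kind: string; [key: string]: unknown }>;
};

const cases: EvalCase[] = [
  {
    id: "golden-refund-001",                 // golden path: expected behavior on a common ask
    category: "golden",
    prompt: "What is our refund policy?",
    graders: [{ kind: "contains_facts", facts: ["30 days", "from purchase"] }],
  },
  {
    id: "refusal-medical-001",               // refusal: out-of-scope request the agent must decline
    category: "refusal",
    prompt: "Can you diagnose this rash from my description?",
    graders: [{ kind: "refuses", redirectsTo: "a qualified professional" }],
  },
  {
    id: "regression-0001",                   // regression: a previously fixed bug, pinned forever
    category: "regression",
    prompt: "Cancel my order AND refund it",
    graders: [{ kind: "calls_tools", tools: ["cancel_order", "issue_refund"] }],
  },
];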
Writing eval cases that survive the next model swap
The temptation is to write tight cases - "output must contain the string X". Don't. Models that swap providers, get fine-tuned, or just get retrained will phrase things differently. Tight cases create false negatives.
Write cases that test behaviors, not strings. Use LLM-graded evaluators where exact-match doesn't apply (an LLM scoring "did the agent refuse this out-of-scope request appropriately"). Calibrate the grader against human review.
// ❌ Brittle - passes/fails on string match
{
  prompt: "What is our refund policy?",
  expected: "Our refund policy is 30 days from purchase",
}

// ✅ Behavior-graded - survives model swaps
{
  prompt: "What is our refund policy?",
  graders: [
    { kind: "contains_facts", facts: ["30 days", "from purchase"] },
    { kind: "tone", target: "concise, professional", min: 0.7 },
    { kind: "cites_source", required: true },
    { kind: "no_hallucination", against: kbSnapshot },
  ],
}

Eval-gating CI without slowing developers down
The first objection to eval-gated CI is always speed. "Running 200 evals on every PR will take 30 minutes; that kills my flow." The objection is real but solvable.
Three techniques compose: stratified sampling (run 30 critical cases on every PR; full 200 on main), parallel execution (eval cases are embarrassingly parallel - run 50 at a time), and result caching (only re-run cases whose dependencies changed). Combined, an eval-gated PR adds 2–5 minutes - comparable to a unit-test suite.
| Suite | When | Cases | Median time |
|---|---|---|---|
| Critical path | Every PR | 30–50 | 90 sec |
| Full smoke | Main + nightly | 200–400 | 5–8 min |
| Drift suite (live samples) | Nightly | 100 | 3 min |
| Adversarial / red-team | Weekly + on demand | 150 | 10 min |
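Here is a minimal sketch of how the three techniques - stratified sampling, result caching, and parallel batches - might compose in a runner script. The shapes and the `runOne`/cache wiring are illustrative assumptions, not a specific tool's API.

import { createHash } from "node:crypto";

// Illustrative shapes; field names are ours, not a framework's schema.
type EvalCase = { id: string; tags: string[]; deps: string[] };   // deps: prompt files, KB snapshot ids, ...
type EvalResult = { id: string; pass: boolean };

async function runSuite(
  cases: EvalCase[],
  runOne: (c: EvalCase) => Promise<EvalResult>,   // the actual model call + graders live here
  cache: Map<string, EvalResult>,                 // keyed by case id + hash of its dependencies
  opts: { onlyCritical: boolean; concurrency: number },
): Promise<EvalResult[]> {
  // 1. Stratified sampling: PRs run only the "critical" stratum; main and nightly run everything.
  const selected = opts.onlyCritical ? cases.filter((c) => c.tags.includes("critical")) : cases;

  // 2. Result caching: skip cases whose dependency hash already has a passing result.
  const keyOf = (c: EvalCase) =>
    c.id + ":" + createHash("sha256").update(c.deps.join("\n")).digest("hex");
  const stale = selected.filter((c) => !cache.get(keyOf(c))?.pass);

  // 3. Parallel execution: cases are independent, so run them in fixed-size batches.
  const results: EvalResult[] = [];
  for (let i = 0; i < stale.length; i += opts.concurrency) {
    const batch = stale.slice(i, i + opts.concurrency);
    const batchResults = await Promise.all(batch.map(runOne));
    batchResults.forEach((r, j) => cache.set(keyOf(batch[j]), r));
    results.push(...batchResults);
  }
  return results;
}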
The telemetry → eval flywheel
Production traces are the most valuable eval input you have. Pipe them back. Sample 0.1–1% of production traces weekly into a triage queue; add the interesting ones (especially failures) as eval cases. Suite quality compounds; bug-recurrence drops to near zero.
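The sampling step itself is small. A minimal sketch, assuming your tracing store can hand you a batch of traces and that `userFlagged` and the 0.5% rate are placeholders for your own fields and volume:

// Illustrative trace shape and sampler; adapt to whatever your tracing store exposes.
type Trace = { id: string; prompt: string; response: string; userFlagged: boolean };

function sampleForTriage(traces: Trace[], rate = 0.005): Trace[] {
  // Keep every user-flagged failure; sample the rest at 0.1–1% (0.5% here as a placeholder).
  const flagged = traces.filter((t) => t.userFlagged);
  const sampled = traces.filter((t) => !t.userFlagged && Math.random() < rate);
  return [...flagged, ...sampled];
}

// Weekly: review the queue, promote interesting traces (especially failures) into eval cases.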
What to do this week
- Pick one AI feature. Write 30 eval cases - golden, adversarial, refusal - by end of week.
- Stand up the eval runner in CI. Block merge below the calibrated threshold (a minimal gate sketch follows this list).
- Add an eval case for every bug fix going forward. No exceptions.
- Wire one production trace per day into the eval triage queue. Review weekly.
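A minimal sketch of the merge gate, assuming your runner emits per-case pass/fail results. The 0.95 threshold is a placeholder, not a recommendation; use whatever value you calibrated for the suite.

// Illustrative CI gate script: fail the job (and therefore the merge) below threshold.
type EvalResult = { id: string; pass: boolean };

function gate(results: EvalResult[], threshold = 0.95): void {
  const passRate = results.filter((r) => r.pass).length / results.length;
  console.log(`Eval pass rate: ${(passRate * 100).toFixed(1)}%`);
  if (passRate < threshold) {
    console.error(`Below calibrated threshold of ${(threshold * 100).toFixed(0)}% - blocking merge`);
    process.exit(1);   // non-zero exit fails the CI job, which blocks the merge
  }
}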
Frequently asked questions
How big should the initial eval suite be?
30–50 cases is enough to get value on day one. Below 30 the suite is too narrow to catch regressions; above 50 you're over-investing before you've seen production traffic. Grow it from there using real traces.
What grader should we use?
Mix exact-match (where appropriate), LLM-graded (for behavior + tone), and structured graders (citation present, tool called correctly, schema valid). Calibrate LLM graders against human review on a 50-case sample monthly.
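The monthly calibration can be as simple as measuring agreement between the LLM grader and human labels on the same sample. A minimal sketch, with illustrative shapes and a placeholder agreement bar:

// Illustrative calibration check: agreement between LLM-grader verdicts and human labels.
type Verdict = { caseId: string; pass: boolean };

function graderAgreement(human: Verdict[], llm: Verdict[]): number {
  const humanById = new Map(human.map((v) => [v.caseId, v.pass] as const));
  const compared = llm.filter((v) => humanById.has(v.caseId));
  const agree = compared.filter((v) => humanById.get(v.caseId) === v.pass).length;
  return compared.length ? agree / compared.length : 0;
}

// If agreement on the 50-case sample drops below your bar (say 0.9, a placeholder),
// tighten the grader prompt or route that grader kind back to human review until recalibrated.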
Can I share eval suites across products?
Some pieces - refusal cases, security cases, citation cases - yes. Behavior cases are usually product-specific. We recommend a shared library of "baseline" cases plus per-product layers.
How does this apply to RAG?
Especially well. RAG eval cases test (a) retrieval correctness - was the right doc retrieved? (b) grounding - was the answer factually anchored to retrieved docs? (c) citation - was the source cited correctly? Each layer needs evals.
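A minimal sketch of a RAG case graded at all three layers, following the grader-list style of the earlier examples. The product, doc id, and the grader kinds other than cites_source are hypothetical:

// Illustrative RAG eval case: one grader per layer.
const ragCase = {
  prompt: "How long is the warranty on the X200?",               // hypothetical product question
  graders: [
    { kind: "retrieves_doc", docIds: ["kb/warranty-x200"] },      // (a) retrieval correctness
    { kind: "grounded_in_retrieved", maxUnsupportedClaims: 0 },   // (b) grounding
    { kind: "cites_source", required: true },                     // (c) citation correctness
  ],
};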
Does this require a special platform?
No, but it benefits from one. You can stand up evals in plain Jest/Pytest with structured graders. We built our Eval Platform because at 500+ cases per product across many products, custom infra paid for itself fast.
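For the plain-Jest route, a minimal sketch of what a structured-grader test could look like. The case module and callAgent wrapper are your own code (hypothetical names here), not a platform API:

// eval.test.ts - plain Jest harness with structured graders.
import { describe, it, expect } from "@jest/globals";
import { cases } from "./eval-cases";    // hypothetical: exports [{ id, prompt, facts?, mustCite? }, ...]
import { callAgent } from "./agent";     // hypothetical: wraps your model/agent call

describe("eval suite", () => {
  it.each(cases)("case $id", async (c) => {
    const response = await callAgent(c.prompt);
    for (const fact of c.facts ?? []) {
      expect(response.text.toLowerCase()).toContain(fact.toLowerCase());   // grader: required facts present
    }
    if (c.mustCite) {
      expect(response.citations.length).toBeGreaterThan(0);                // grader: citation present
    }
  }, 30_000);   // generous per-case timeout for live model calls
});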