
Cost models for production agentic AI - what to budget, what to instrument

Per-agent cost isn't "$X per token" - it's a stack of model, retrieval, tool calls, storage, ops. The budgeting framework and the telemetry that catches cost surprises before finance does.

Techimax Engineering · Forward-deployed engineering team · 13 min read · Updated May 10, 2026

How do you forecast LLM costs before you ship?

Cost forecasting before launch is hard but tractable. We model three drivers: traffic (interactions per month), shape (mean tokens in and out per interaction), and routing (which provider serves each use-case). Multiply by current rate cards [1][2][3], add a 25% buffer for retrieval bloat and retry overhead, then validate against the first week of canary traffic.

The number CFOs care about is cost-per-resolved-interaction, not cost-per-token. Resolved-interaction is the unit of business value; tokens are a leaky proxy. Build the cost model in resolved-interaction units; show finance the decomposition that gets to that unit.

| Driver | Assumption | Per-interaction cost | Monthly @ 500K |
|---|---|---|---|
| Model spend (router + main) | 1.4 LLM calls × 800 input / 250 output tokens | $0.052 | $26K |
| Retrieval (vector + re-rank) | 1.2 retrieval calls × 8 docs | $0.014 | $7K |
| Tool / API calls | 0.6 calls × $0.02 average | $0.012 | $6K |
| Storage + indexing (amortized) | 1M doc corpus, monthly delta | $0.005 | $2.5K |
| Ops + observability (amortized) | Eval suite + alarms + traces | $0.011 | $5.5K |
| Total per resolved interaction | - | $0.094 | $47K |
| + 25% buffer | - | $0.118 | $59K |
Cost forecasting worksheet - typical customer-care agent (mid-scope)
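
The worksheet above is small enough to keep as code next to the telemetry that validates it. A minimal sketch, using the worksheet's own per-interaction figures; swap in your provider's rate card and your canary measurements.

```python
# Cost-forecast sketch: traffic × shape × routing, plus a 25% buffer.
# The per-driver figures come from the worksheet above and are illustrative;
# replace them with your own rate-card math and canary measurements.

MONTHLY_INTERACTIONS = 500_000

drivers = {
    "model":     0.052,   # 1.4 LLM calls × 800 in / 250 out tokens
    "retrieval": 0.014,   # 1.2 retrieval calls × 8 docs
    "tools":     0.012,   # 0.6 tool calls × $0.02 average
    "storage":   0.005,   # 1M-doc corpus, monthly delta, amortized
    "ops":       0.011,   # eval suite + alarms + traces, amortized
}

BUFFER = 0.25  # retrieval bloat + retry overhead

per_interaction = sum(drivers.values())
per_interaction_buffered = per_interaction * (1 + BUFFER)
monthly = per_interaction_buffered * MONTHLY_INTERACTIONS

print(f"per resolved interaction: ${per_interaction:.3f}")
print(f"with {BUFFER:.0%} buffer:     ${per_interaction_buffered:.3f}")
print(f"monthly @ {MONTHLY_INTERACTIONS:,}:  ${monthly:,.0f}")
```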

The cost stack, decomposed

Chart: Per-interaction cost decomposition for a typical customer-care agent (averaged across 8 enterprise rollouts). Source: Techimax engagement telemetry, 2024–2026.

| Cost layer | Share of per-interaction cost (%) |
|---|---|
| Model spend (LLM) | 52 |
| Retrieval (vector + re-rank) | 17 |
| Tool / external API calls | 12 |
| Storage + indexing | 6 |
| Ops + observability | 13 |

What to instrument per layer

| Layer | Metrics | Drift signal |
|---|---|---|
| Model spend | Tokens (in/out), cost, latency, model name | p99 spike → loop or context bloat |
| Retrieval | Calls, retrieved tokens, re-rank cost, recall@k | Recall drop → corpus stale or chunking changed |
| Tool calls | Calls/action, retry rate, upstream 4xx/5xx | Retry rate up → upstream contract drift |
| Storage | Index size, write volume, embedding cost | Linear growth OK; stepwise growth → re-embed pipeline |
| Ops | Engineer hours, on-call pages, eval-suite runtime | Page volume → drift in production behavior |
Cost telemetry per layer
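
To make those drift signals queryable, every model, retrieval, and tool step should carry cost attributes on its trace span. A minimal sketch, assuming an OpenTelemetry tracer is already configured; the span and attribute names are our own convention, not a standard, so align them with whatever your dashboards expect.

```python
# Sketch: attach cost telemetry to each model/retrieval step so the drift
# signals in the table above can be queried in the same APM as everything else.
from opentelemetry import trace

tracer = trace.get_tracer("agent.cost")

def record_llm_call(model_name: str, tokens_in: int, tokens_out: int,
                    usd_cost: float, latency_ms: float) -> None:
    """Record one model call as a span with cost attributes."""
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("llm.model", model_name)
        span.set_attribute("llm.tokens.input", tokens_in)
        span.set_attribute("llm.tokens.output", tokens_out)
        span.set_attribute("llm.cost.usd", usd_cost)
        span.set_attribute("llm.latency_ms", latency_ms)

def record_retrieval(calls: int, retrieved_tokens: int,
                     rerank_cost_usd: float) -> None:
    """Record one retrieval step; retrieved-token p95 is the bloat signal.
    recall@k comes from the offline eval suite, not from per-call spans."""
    with tracer.start_as_current_span("retrieval") as span:
        span.set_attribute("retrieval.calls", calls)
        span.set_attribute("retrieval.tokens", retrieved_tokens)
        span.set_attribute("retrieval.rerank.cost.usd", rerank_cost_usd)
```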

Cost controls: hard vs soft

  • Hard caps at the gateway (per-trace token + cost limits enforced before the model call, as sketched after this list). Application code can't bypass them.
  • Soft warnings at 50% / 80% of the cap with prominent telemetry. Engineer sees them in the same APM as everything else.
  • Per-tenant budgets for multi-tenant systems. One tenant's runaway query can't bill another tenant's account.
  • Spend forecasting based on rolling 7-day averages. Surfaces a 3× spike before it becomes a 30× spike.
  • Per-user spend alarms for high-blast-radius use cases. A single power user can rack up unusual cost; alarm on per-user p99 cost-per-day.
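
A minimal sketch of the hard-cap check from the first bullet, assuming the gateway keeps a running per-trace total and runs this check before each model call; the thresholds and the warning mechanism are illustrative.

```python
# Hard-cap sketch: enforce per-trace token and cost ceilings at the gateway,
# before the model call, so application code cannot bypass them.
# Thresholds and names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class TraceBudget:
    max_tokens: int = 50_000     # per-trace token ceiling
    max_cost_usd: float = 1.00   # per-trace cost ceiling
    tokens_used: int = 0
    cost_used_usd: float = 0.0

class BudgetExceeded(Exception):
    """Raised by the gateway; the trace is killed, not retried."""

def check_and_reserve(budget: TraceBudget, est_tokens: int, est_cost: float) -> None:
    """Run before every model call. Raises on the hard cap; warns at 50% / 80%."""
    projected_tokens = budget.tokens_used + est_tokens
    projected_cost = budget.cost_used_usd + est_cost

    if projected_tokens > budget.max_tokens or projected_cost > budget.max_cost_usd:
        raise BudgetExceeded(
            f"trace budget exceeded: {projected_tokens} tokens, ${projected_cost:.2f}"
        )

    # Soft warnings; in production these go to your APM, not stdout.
    for threshold in (0.5, 0.8):
        if projected_cost >= budget.max_cost_usd * threshold > budget.cost_used_usd:
            print(f"warn: trace at {threshold:.0%} of cost cap")

    budget.tokens_used = projected_tokens
    budget.cost_used_usd = projected_cost
```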

Where production AI costs leak - the top failure modes

Cost leakage in production agents follows a small set of patterns. We've seen each one often enough to predict where the next leak will appear. The good news: every pattern has a known fix that takes hours, not weeks. The bad news: without the right telemetry, the leak runs invisibly until the invoice arrives.

| Pattern | Typical multiplier | Detection signal | Fix |
|---|---|---|---|
| Tool-call retry loop | 5–40× | Retry count distribution per tool span | Bound retries; circuit-break |
| Retrieval context bloat | 8–15× | Retrieved-tokens p95 climbing | Re-rank + truncate cap |
| Streaming abandonment billing | 2–4× | Tokens-billed > tokens-rendered ratio | Cancel-aware streaming |
| Background agent loops | 10–100× | Per-trace token cap exceedances | Hard kill at threshold |
| Prompt regression on model swap | 1.5–3× | Token / call ratio after swap | Eval-gate model swaps |
| Reviewer-bypass auto-escalation | 1.5–2× | Auto-vs-review ratio drift | Recalibrate routing |
Top cost-leak patterns and their fixes
Chart: Cost-per-action distribution shift after a tool-call regression (production rollout, weeks 1–10). Source: Techimax engagement telemetry, anonymized.

| Week | USD per action |
|---|---|
| 1 | 0.09 |
| 2 | 0.10 |
| 3 | 0.11 |
| 4 | 0.13 |
| 5 | 0.18 |
| 6 | 0.34 |
| 7 | 0.41 |
| 8 | 0.12 |
| 9 | 0.10 |
| 10 | 0.10 |
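
The chart above is the first pattern in the table playing out in production. A minimal sketch of the bounded-retry-plus-circuit-breaker fix; the retry budget, failure threshold, backoff, and cooldown are illustrative and should match your tool's actual error contract.

```python
# Sketch: bound tool-call retries and circuit-break a failing upstream so a
# retry loop can't become the 5–40× multiplier in the table above.
import time

class CircuitOpen(Exception):
    """The upstream is failing consistently; stop spending on it."""

class ToolCircuitBreaker:
    def __init__(self, max_retries=2, failure_threshold=5, cooldown_s=30.0):
        self.max_retries = max_retries            # retries per call, not per trace
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.consecutive_failures = 0
        self.opened_at = None                     # monotonic timestamp when opened

    def call(self, tool_fn, *args, **kwargs):
        # While open, refuse calls until the cooldown has passed (then half-open).
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                raise CircuitOpen("tool circuit open; degrade gracefully or escalate")
            self.opened_at = None

        last_error = None
        for attempt in range(self.max_retries + 1):
            try:
                result = tool_fn(*args, **kwargs)
                self.consecutive_failures = 0
                return result
            except Exception as exc:              # in practice, catch the tool's error types
                last_error = exc
                self.consecutive_failures += 1
                if self.consecutive_failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()
                    raise CircuitOpen("failure threshold reached") from exc
                time.sleep(0.5 * (2 ** attempt))  # bounded exponential backoff
        raise last_error
```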

What's a realistic cost-per-action target?

Cost-per-action targets are workload-specific and should be set against business value, not against arbitrary low-cost ceilings. Customer-care agents typically land at $0.04–$0.20 per resolved interaction; complex workflows (multi-step research, long document drafting, multi-agent orchestration) at $0.30–$2.00; deep-reasoning agents (legal review, code review, financial analysis) up to $5–$15.

The right ceiling is whatever leaves a healthy margin against the value of the resolved action. A customer-care interaction that displaces a $4 human handle is fine at $0.20. A research interaction that produces a $400 piece of work is fine at $5. We caution against arbitrary 'cheap LLM' targets - they push teams toward inferior models, which raises eval failure rates, which raises human-handoff rates, which raises total cost [4].

Cost-per-resolved-interaction is the unit. Cost-per-token is a leaky proxy. Build the model in business units; show finance the decomposition.
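
To keep that discipline concrete: a minimal sketch of the value-margin check from this section, in resolved-interaction units. The figures are the article's examples; the "keep at least 80% of the value" floor is an illustrative threshold, not a recommendation.

```python
# Sketch: convert spend into cost-per-resolved-interaction and check the
# margin against the value of the resolved action. Threshold is illustrative.
def cost_per_resolved(total_spend_usd: float, resolved_interactions: int) -> float:
    return total_spend_usd / resolved_interactions

def margin_ok(cost_per_resolved_usd: float, value_per_resolved_usd: float,
              min_margin: float = 0.8) -> bool:
    """True if at least `min_margin` of the action's value is retained."""
    return cost_per_resolved_usd <= value_per_resolved_usd * (1 - min_margin)

# Customer-care: $0.20 per resolution against a $4 displaced human handle.
print(margin_ok(0.20, 4.00))    # True
# Research: $5 per resolution against a $400 piece of produced work.
print(margin_ok(5.00, 400.00))  # True
```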

Self-host vs hosted: the cost crossover math

Self-hosting open-weight models becomes cost-favorable around 5–10M monthly interactions for a 70B-class workload, depending on traffic shape and model quality requirements [5]. Below that, hosted APIs are cheaper because GPU time is amortized across the provider's customer base. Above that, self-hosted with proper utilization can be 2–5× cheaper.

Operational caveat: self-hosting requires GPU capacity planning, autoscaling, model-update cycles, and eval-gated promotion. Budget 0.5–1.0 of a platform engineer to operate a self-hosted stack. Below the crossover the platform cost dominates; above it, the savings dominate. Run the math against your actual traffic profile - peak-to-mean ratio matters more than total volume.
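
A minimal sketch of that crossover math. It deliberately bakes in no numbers: feed it your blended hosted rate, committed GPU price, throughput measured at your own token shape, and fully loaded platform cost, and it returns the monthly volume where self-hosting breaks even.

```python
# Sketch: hosted-vs-self-host break-even volume. All inputs are assumptions
# to replace with your own quotes and measurements.
def break_even_monthly_interactions(
    hosted_cost_per_interaction: float,   # blended $ per interaction at hosted rates
    gpu_hourly_rate: float,               # $ per GPU-hour, reserved/committed
    interactions_per_gpu_hour: float,     # measured throughput at your token shape
    mean_utilization: float,              # driven down by a spiky peak-to-mean ratio
    fixed_monthly_usd: float,             # always-on GPU floor + platform engineering
) -> float:
    """Monthly volume above which self-hosting is cheaper than hosted APIs."""
    selfhost_variable = gpu_hourly_rate / (interactions_per_gpu_hour * mean_utilization)
    saving_per_interaction = hosted_cost_per_interaction - selfhost_variable
    if saving_per_interaction <= 0:
        return float("inf")  # hosted stays cheaper at any volume
    return fixed_monthly_usd / saving_per_interaction
```

The mean_utilization input is where the peak-to-mean ratio shows up: a spiky traffic profile lowers effective utilization and pushes the break-even volume higher.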

References

  [1] AWS Bedrock pricing - AWS (2025)
  [2] Anthropic API pricing + prompt caching - Anthropic (2025)
  [3] OpenAI API pricing - OpenAI (2025)
  [4] Total cost of ownership for production LLM applications - Andreessen Horowitz (2024)
  [5] vLLM: efficient memory management for LLM serving - vLLM project (2025)
  [6] FinOps for AI: framework and KPIs - FinOps Foundation (2025)
  [7] Google Cloud Vertex AI pricing - Google Cloud (2025)

Frequently asked questions

What's a sensible per-interaction cost ceiling?

Depends on the use case - customer care typically lands at $0.04–$0.20 per resolved interaction; complex workflows (multi-step research, document drafting) at $0.30–$2. Budget against business value, not against "low cost."

Are open-weight models cheaper?

Sometimes - depends on your traffic profile. At low volumes, hosted-API providers are cheaper because GPU time is amortized. At high volumes (millions of interactions/month), self-hosted open-weight models can be 2–5× cheaper. The crossover is workload-specific.

Should we negotiate enterprise pricing with providers?

Yes, at $50K+/month spend. Anthropic, OpenAI, and Bedrock all offer committed-use discounts at scale. Don't pay rack rate at production volume.

How does prompt caching change the cost picture?

Anthropic's prompt caching and OpenAI's prompt caching can drop input-token cost on long, stable system prompts by 50–90% [1][2]. Apply to system prompts and large retrieved corpora that recur across calls; don't cache user-specific or short-lived content. Re-validate cache hit rate weekly - context drift can silently break caching.
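
A minimal sketch of the caching arithmetic, with the hit rate and cached-read discount left as inputs because they differ by provider and by how stable your prefix really is; the figures in the example call are illustrative, and cache-write premiums are omitted.

```python
# Sketch: effective input-token cost under prompt caching.
# Discount, hit rate, and rates are illustrative; some providers also charge
# a premium on cache writes, which is omitted here for brevity.
def effective_input_cost(
    tokens_cacheable: int,      # stable system prompt + recurring retrieved context
    tokens_uncacheable: int,    # user-specific / short-lived content
    rate_per_mtok: float,       # base input price, $ per million tokens
    cache_hit_rate: float,      # fraction of calls where the cached prefix is reused
    cached_discount: float,     # e.g. 0.1 means cached reads bill at 10% of base
) -> float:
    cached_portion = tokens_cacheable * (
        cache_hit_rate * cached_discount + (1 - cache_hit_rate) * 1.0
    )
    return (cached_portion + tokens_uncacheable) / 1_000_000 * rate_per_mtok

# e.g. 3,000 stable tokens, 500 per-user tokens, $3/MTok, 90% hit rate, 10% cached price
print(f"${effective_input_cost(3000, 500, 3.00, 0.9, 0.1):.5f} per call input")
```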

What about cost on multimodal (vision, audio) workloads?

Multimodal interactions cost 1.5–4× text-equivalent tokens depending on resolution and modality. Budget separately; instrument modality on the cost span. Voice in particular has streaming considerations that affect billing - read the provider docs carefully on tokens billed during long-running connections.

How do we model fine-tuning cost vs prompting cost?

Fine-tuning has a one-time training cost (typically $1K–$50K depending on model size and dataset) plus ongoing inference at hosted rates. Pays off when you can reduce per-call tokens or model size for a high-volume workload. Below ~1M monthly interactions on the fine-tuned task, prompting is usually cheaper. Run the math; don't fine-tune by default.
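
A minimal sketch of that break-even; the training cost and per-call savings are inputs because they vary widely with model size, dataset, and how much prompt you can actually remove. The example call uses illustrative figures.

```python
# Sketch: fine-tuning vs prompting break-even. All inputs are assumptions.
def months_to_recoup_finetune(
    training_cost_usd: float,          # one-time fine-tuning cost
    monthly_interactions: int,
    prompt_cost_per_call: float,       # $ per call with the prompted baseline
    finetuned_cost_per_call: float,    # $ per call after shrinking prompt / model
) -> float:
    monthly_saving = monthly_interactions * (prompt_cost_per_call - finetuned_cost_per_call)
    if monthly_saving <= 0:
        return float("inf")            # fine-tuning never pays back
    return training_cost_usd / monthly_saving

# e.g. $20K training, 2M calls/month, $0.030 prompted vs $0.022 fine-tuned per call
print(months_to_recoup_finetune(20_000, 2_000_000, 0.030, 0.022))  # ≈ 1.25 months
```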

What's a reasonable AI infrastructure budget as % of engineering OpEx?

For AI-forward enterprises, model and inference spend lands at 4–8% of engineering OpEx in production. Higher means leaks (instrument); lower often means under-investment in retrieval / evals (which shows up as quality drift later).

How do we track FinOps for AI workloads?

Tag every model call with use-case, agent, tenant, and team. Aggregate weekly into a FinOps dashboard alongside other cloud spend [6]. Monthly rollup for finance; weekly for engineering. Don't let AI spend live in a separate spreadsheet - that's how it grows uninspected.
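
A minimal sketch of that tag set, assuming you emit one spend record per model call into the same pipeline as your cloud billing exports; the field names are our convention, not a standard, so align them with your existing tagging policy.

```python
# Sketch: the minimum tag set that keeps AI spend in the same FinOps views
# as the rest of cloud spend, rather than in a separate spreadsheet.
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class SpendTags:
    use_case: str     # e.g. "customer-care"
    agent: str        # e.g. "order-status-agent"
    tenant: str       # billing boundary for multi-tenant systems
    team: str         # owning engineering team

def record_spend(tags: SpendTags, usd_cost: float, tokens: int) -> dict:
    """Build one spend record; ship it to the same pipeline as cloud billing data."""
    return {**asdict(tags), "usd_cost": usd_cost, "tokens": tokens}
```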

Talk to engineering

Ready to ship the patterns from this post?

Tell us where you are. A senior forward-deployed engineer replies within 24 hours with a written plan tailored to your stack - never an SDR.

  • Practical engineering review of your current setup
  • Eval discipline + observability + cost controls
  • Free 60-min working session, no sales pitch


Drop your details and we'll match you with an engineer who's shipped in your industry.
