How do you forecast LLM costs before you ship?
Cost forecasting before launch is hard but tractable. We model three drivers: traffic (interactions per month), shape (mean tokens in and out per interaction), and routing (which model and provider serves each use case). Multiply those drivers by current rate cards [1][2][3], add a 25% buffer for retrieval bloat and retry overhead, then validate against the first week of canary traffic.
The number CFOs care about is cost-per-resolved-interaction, not cost-per-token. Resolved-interaction is the unit of business value; tokens are a leaky proxy. Build the cost model in resolved-interaction units; show finance the decomposition that gets to that unit.
| Driver | Assumption | Per-interaction cost | Monthly @ 500K interactions |
|---|---|---|---|
| Model spend (router + main) | 1.4 LLM calls × 800 input / 250 output tokens | $0.052 | $26K |
| Retrieval (vector + re-rank) | 1.2 retrieval calls × 8 docs | $0.014 | $7K |
| Tool / API calls | 0.6 calls × $0.02 average | $0.012 | $6K |
| Storage + indexing (amortized) | 1M doc corpus, monthly delta | $0.005 | $2.5K |
| Ops + observability (amortized) | Eval suite + alarms + traces | $0.011 | $5.5K |
| Total per resolved interaction | - | $0.094 | $47K |
| + 25% buffer | - | $0.118 | $59K |
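A minimal sketch of that model in Python. The rate-card prices are placeholder assumptions rather than quoted rates, so the output tracks the table above without reproducing it exactly; swap in your providers' current pricing and your own per-layer estimates.

```python
# Minimal pre-launch cost model: traffic x shape x routing x rate card, plus buffer.
# The rate-card prices below are placeholders, not quotes.

def llm_cost(calls_per_interaction, tokens_in, tokens_out, usd_per_1k_in, usd_per_1k_out):
    """Blended model spend per resolved interaction."""
    per_call = tokens_in / 1000 * usd_per_1k_in + tokens_out / 1000 * usd_per_1k_out
    return calls_per_interaction * per_call

per_interaction = {
    # Call counts and token shape from the table above; prices are assumptions.
    "model": llm_cost(1.4, 800, 250, usd_per_1k_in=0.015, usd_per_1k_out=0.075),
    "retrieval": 0.014,
    "tools": 0.012,
    "storage": 0.005,
    "ops": 0.011,
}

subtotal = sum(per_interaction.values())
forecast = subtotal * 1.25        # 25% buffer for retrieval bloat and retry overhead
monthly = forecast * 500_000      # traffic assumption: 500K interactions / month

print(f"${subtotal:.3f} / interaction, ${forecast:.3f} with buffer, ${monthly:,.0f} / month")
```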
The cost stack, decomposed
Source: Techimax engagement telemetry, 2024–2026
| Cost layer | Share of spend (%) |
|---|---|
| Model spend (LLM) | 52 |
| Retrieval (vector + re-rank) | 17 |
| Tool / external API calls | 12 |
| Storage + indexing | 6 |
| Ops + observability | 13 |
What to instrument per layer
| Layer | Metrics | Drift signal |
|---|---|---|
| Model spend | Tokens (in/out), cost, latency, model name | p99 spike → loop or context bloat |
| Retrieval | Calls, retrieved tokens, re-rank cost, recall@k | Recall drop → corpus stale or chunking changed |
| Tool calls | Calls/action, retry rate, upstream 4xx/5xx | Retry rate up → upstream contract drift |
| Storage | Index size, write volume, embedding cost | Linear growth OK; stepwise growth → re-embed pipeline |
| Ops | Engineer hours, on-call pages, eval-suite runtime | Page volume → drift in production behavior |
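A minimal sketch of the model-spend layer instrumented with the OpenTelemetry Python API. The `client.complete` call, its response fields, and the span attribute names are illustrative conventions, not a standard; the same pattern extends to the retrieval and tool layers.

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm-cost")  # assumes an OpenTelemetry SDK is configured elsewhere

def call_model(client, prompt, model="main-model", usd_per_1k_in=0.003, usd_per_1k_out=0.015):
    """Wrap every model call in a span carrying the model-spend metrics from the table.
    The client, its response fields, and the attribute names are illustrative."""
    with tracer.start_as_current_span("llm.call") as span:
        response = client.complete(model=model, prompt=prompt)  # hypothetical client API
        cost = (
            response.tokens_in / 1000 * usd_per_1k_in
            + response.tokens_out / 1000 * usd_per_1k_out
        )
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.tokens_in", response.tokens_in)
        span.set_attribute("llm.tokens_out", response.tokens_out)
        span.set_attribute("llm.cost_usd", round(cost, 6))
        return response
```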
Cost controls: hard vs soft
- Hard caps at the gateway (per-trace token and cost limits enforced before the model call). Application code can't bypass them; a minimal sketch follows this list.
- Soft warnings at 50% / 80% of the cap with prominent telemetry. Engineers see them in the same APM view as everything else.
- Per-tenant budgets for multi-tenant systems. One tenant's runaway query can't bill another tenant's account.
- Spend forecasting based on rolling 7-day averages. Surfaces a 3× spike before it becomes a 30× spike.
- Per-user spend alarms for high-blast-radius use cases. A single power user can rack up unusual cost; alarm on per-user p99 cost-per-day.
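A minimal sketch of the first two controls, assuming the gateway maintains a per-trace cost accumulator. The cap value, warning thresholds, and `emit_warning` hook are illustrative.

```python
# Gateway-side enforcement: hard cap plus soft warnings. All thresholds illustrative.
HARD_CAP_USD = 0.50
SOFT_WARN_FRACTIONS = (0.5, 0.8)

class BudgetExceeded(Exception):
    pass

def check_budget(trace_cost_usd, next_call_estimate_usd, emit_warning):
    """Run before every model call; the gateway maintains trace_cost_usd per trace."""
    projected = trace_cost_usd + next_call_estimate_usd
    for fraction in SOFT_WARN_FRACTIONS:
        # Soft warning the first time the trace crosses 50% / 80% of the cap.
        if trace_cost_usd < fraction * HARD_CAP_USD <= projected:
            emit_warning(f"trace at {int(fraction * 100)}% of ${HARD_CAP_USD:.2f} cap")
    if projected > HARD_CAP_USD:
        raise BudgetExceeded(f"projected ${projected:.2f} exceeds ${HARD_CAP_USD:.2f} cap")
    return projected
```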
Where production AI costs leak - the top failure modes
Cost leakage in production agents follows a small set of patterns. We've seen each one often enough to predict where the next leak will appear. The good news: every pattern has a known fix that takes hours, not weeks. The bad news: without the right telemetry, the leak runs invisibly until the invoice arrives.
| Pattern | Typical multiplier | Detection signal | Fix |
|---|---|---|---|
| Tool-call retry loop | 5–40× | Retry count distribution per tool span | Bound retries; circuit-break |
| Retrieval context bloat | 8–15× | Retrieved-tokens p95 climbing | Re-rank + truncate cap |
| Streaming abandonment billing | 2–4× | Tokens-billed > tokens-rendered ratio | Cancel-aware streaming |
| Background agent loops | 10–100× | Per-trace token cap exceedances | Hard kill at threshold |
| Prompt regression on model swap | 1.5–3× | Token / call ratio after swap | Eval-gate model swaps |
| Reviewer-bypass auto-escalation | 1.5–2× | Auto-vs-review ratio drift | Recalibrate routing |
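The first fix in the table - bound retries and circuit-break - is the one we reach for most often. A minimal sketch, with the retry bound, backoff, and failure threshold as illustrative assumptions:

```python
import time

# Bounded retries plus a simple circuit breaker for the most common leak:
# the tool-call retry loop. Bounds and thresholds are illustrative.
MAX_RETRIES = 3
BACKOFF_SECONDS = 2.0

class CircuitOpen(Exception):
    pass

def call_tool_bounded(tool, payload, failure_log, failure_threshold=10):
    """Retry a flaky tool a bounded number of times, then stop calling it entirely.
    `failure_log` is any shared mutable counter (a Redis key in production);
    a plain list is enough for the sketch."""
    if len(failure_log) >= failure_threshold:
        raise CircuitOpen(f"{tool.__name__} is failing repeatedly; circuit open")
    for attempt in range(MAX_RETRIES):
        try:
            return tool(payload)
        except Exception:
            failure_log.append(time.time())
            time.sleep(BACKOFF_SECONDS * (attempt + 1))  # linear backoff
    raise CircuitOpen(f"{tool.__name__} failed after {MAX_RETRIES} retries")
```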
Source: Techimax engagement telemetry, anonymized
| Week | Cost per action (USD) |
|---|---|
| Wk 1 | 0.09 |
| Wk 2 | 0.1 |
| Wk 3 | 0.11 |
| Wk 4 | 0.13 |
| Wk 5 | 0.18 |
| Wk 6 | 0.34 |
| Wk 7 | 0.41 |
| Wk 8 | 0.12 |
| Wk 9 | 0.1 |
| Wk 10 | 0.1 |
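A sketch of the rolling-baseline forecast from the controls list, run over the weekly series above. The window length and threshold are assumptions chosen so this short series flags the weeks 6-7 leak; tune both against your own spend granularity.

```python
# Rolling-baseline spike detection over the weekly series above. Window and
# threshold are illustrative; production alarms run on daily or hourly spend.
weekly_cost = [0.09, 0.10, 0.11, 0.13, 0.18, 0.34, 0.41, 0.12, 0.10, 0.10]

def spikes(series, window=4, multiplier=2.0):
    flagged = []
    for i in range(window, len(series)):
        baseline = sum(series[i - window:i]) / window
        if series[i] > multiplier * baseline:
            flagged.append((i + 1, series[i], round(baseline, 3)))
    return flagged

print(spikes(weekly_cost))  # flags week 6 (0.34 vs 0.13 baseline) and week 7 (0.41 vs 0.19)
```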
What's a realistic cost-per-action target?
Cost-per-action targets are workload-specific and should be set against business value, not against arbitrary low-cost ceilings. Customer-care agents typically land at $0.04–$0.20 per resolved interaction; complex workflows (multi-step research, long document drafting, multi-agent orchestration) at $0.30–$2.00; deep-reasoning agents (legal review, code review, financial analysis) up to $5–$15.
The right ceiling is whatever leaves a healthy margin against the value of the resolved action. A customer-care interaction that displaces $4 of human handle time is fine at $0.20. A research interaction that produces a $400 piece of work is fine at $5. We caution against arbitrary 'cheap LLM' targets - they push teams toward inferior models, which raises eval failure rates, which raises human-handoff rates, which raises total cost [4].
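A worked version of that margin framing, using the two examples above:

```python
# Margin left after AI cost, using the two examples above.
def margin(value_usd, cost_per_action_usd):
    return (value_usd - cost_per_action_usd) / value_usd

print(f"{margin(4.0, 0.20):.0%}")   # customer care: $4 of handle time displaced at $0.20
print(f"{margin(400.0, 5.0):.1%}")  # research: $400 deliverable produced at $5
```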
Cost-per-resolved-interaction is the unit. Cost-per-token is a leaky proxy. Build the model in business units; show finance the decomposition.
Self-host vs hosted: the cost crossover math
Self-hosting open-weight models becomes cost-favorable around 5–10M monthly interactions for a 70B-class workload, depending on traffic shape and model quality requirements [5]. Below that, hosted APIs are cheaper because GPU time is amortized across the provider's customer base. Above that, self-hosted with proper utilization can be 2–5× cheaper.
Operational caveat: self-hosting requires GPU capacity planning, autoscaling, model-update cycles, and eval-gated promotion. Budget 0.5–1.0 FTE of platform engineering to operate a self-hosted stack. Below the crossover the platform cost dominates; above it, the savings dominate. Run the math against your actual traffic profile - peak-to-mean ratio matters more than total volume.
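A back-of-envelope sketch of the crossover. Every input - blended hosted rate, GPU price, throughput, peak-to-mean ratio, fixed platform cost - is an assumption to replace with your own numbers; with these placeholders the crossover lands near 5M interactions per month.

```python
# Back-of-envelope hosted-vs-self-hosted crossover. Every input is an assumption.
HOSTED_PER_INTERACTION = 0.010      # blended API cost at your traffic shape
GPU_HOURLY = 2.50                   # reserved per-GPU-hour price
INTERACTIONS_PER_GPU_HOUR = 1_500   # measured throughput for a 70B-class deployment
PEAK_TO_MEAN = 2.5                  # capacity sized to peak is billed even off-peak
FIXED_MONTHLY = 30_000              # minimum reserved fleet + 0.5-1.0 platform FTE

def hosted(interactions):
    return interactions * HOSTED_PER_INTERACTION

def self_hosted(interactions):
    variable = interactions * PEAK_TO_MEAN * GPU_HOURLY / INTERACTIONS_PER_GPU_HOUR
    return variable + FIXED_MONTHLY

for m in (1_000_000, 5_000_000, 10_000_000, 20_000_000):
    print(f"{m:>10,}  hosted ${hosted(m):>9,.0f}   self-hosted ${self_hosted(m):>9,.0f}")
```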
References
- [1] AWS Bedrock pricing - AWS (2025)
- [2] Anthropic API pricing + prompt caching - Anthropic (2025)
- [3] OpenAI API pricing - OpenAI (2025)
- [4] Total cost of ownership for production LLM applications - Andreessen Horowitz (2024)
- [5] vLLM: efficient memory management for LLM serving - vLLM project (2025)
- [6] FinOps for AI: framework and KPIs - FinOps Foundation (2025)
- [7] Google Cloud Vertex AI pricing - Google Cloud (2025)
Frequently asked questions
What's a sensible per-interaction cost ceiling?
Depends on the use case - customer care typically lands at $0.04–$0.20 per resolved interaction; complex workflows (multi-step research, document drafting) at $0.30–$2. Budget against business value, not against "low cost."
Are open-weight models cheaper?
Sometimes - depends on your traffic profile. At low volumes, hosted-API providers are cheaper because GPU time is amortized. At high volumes (millions of interactions/month), self-hosted open-weight models can be 2–5× cheaper. The crossover is workload-specific.
Should we negotiate enterprise pricing with providers?
Yes, at $50K+/month spend. Anthropic, OpenAI, and Bedrock all offer committed-use discounts at scale. Don't pay rack rate at production volume.
How does prompt caching change the cost picture?
Anthropic's prompt caching and OpenAI's prompt caching can drop input-token cost on long, stable system prompts by 50–90% [1][2]. Apply to system prompts and large retrieved corpora that recur across calls; don't cache user-specific or short-lived content. Re-validate cache hit rate weekly - context drift can silently break caching.
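A rough savings model for a long, stable system prompt. The hit rate, read discount, and write premium are assumptions that mirror one provider's published structure; check the current pricing pages before relying on them [1][2].

```python
# Rough monthly savings from caching a long, stable system prompt.
# Every figure here is an assumption; verify against current pricing pages.
def input_spend(calls, cached_tokens, fresh_tokens, usd_per_1k_in,
                cache_hit_rate=0.9, read_discount=0.9, write_premium=0.25):
    baseline = calls * (cached_tokens + fresh_tokens) / 1000 * usd_per_1k_in
    hits, misses = calls * cache_hit_rate, calls * (1 - cache_hit_rate)
    with_cache = (
        hits * cached_tokens / 1000 * usd_per_1k_in * (1 - read_discount)      # cache reads
        + misses * cached_tokens / 1000 * usd_per_1k_in * (1 + write_premium)  # cache writes
        + calls * fresh_tokens / 1000 * usd_per_1k_in                          # uncached tail
    )
    return baseline, with_cache

base, cached = input_spend(500_000, cached_tokens=3_000, fresh_tokens=500, usd_per_1k_in=0.003)
print(f"${base:,.0f} -> ${cached:,.0f} monthly input spend")
```

With these placeholder inputs, input-token spend drops by roughly two-thirds, inside the 50–90% range quoted above.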
What about cost on multimodal (vision, audio) workloads?
Multimodal interactions cost 1.5–4× text-equivalent tokens depending on resolution and modality. Budget separately; instrument modality on the cost span. Voice in particular has streaming considerations that affect billing - read the provider docs carefully on tokens billed during long-running connections.
How do we model fine-tuning cost vs prompting cost?
Fine-tuning has a one-time training cost (typically $1K–$50K depending on model size and dataset) plus ongoing inference at hosted rates. Pays off when you can reduce per-call tokens or model size for a high-volume workload. Below ~1M monthly interactions on the fine-tuned task, prompting is usually cheaper. Run the math; don't fine-tune by default.
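A cumulative break-even sketch, with the training cost and per-call prices as placeholder assumptions; with these numbers it lands around one million interactions.

```python
# Cumulative break-even for fine-tuning vs prompting. All inputs are placeholders.
def breakeven_interactions(training_cost_usd, prompt_cost_per_call, finetuned_cost_per_call):
    saving_per_call = prompt_cost_per_call - finetuned_cost_per_call
    if saving_per_call <= 0:
        return float("inf")  # fine-tuning never pays off on cost alone
    return training_cost_usd / saving_per_call

# e.g. an $8K training run, prompting at $0.020/call, fine-tuned smaller model at $0.012/call
print(f"{breakeven_interactions(8_000, 0.020, 0.012):,.0f} interactions to break even")
```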
What's a reasonable AI infrastructure budget as % of engineering OpEx?
For AI-forward enterprises, model and inference spend lands at 4–8% of engineering OpEx in production. Higher means leaks (instrument); lower often means under-investment in retrieval / evals (which shows up as quality drift later).
How do we track FinOps for AI workloads?
Tag every model call with use-case, agent, tenant, and team. Aggregate weekly into a FinOps dashboard alongside other cloud spend [6]. Monthly rollup for finance; weekly for engineering. Don't let AI spend live in a separate spreadsheet - that's how it grows uninspected.
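A minimal weekly roll-up over tagged call records. The field names and example rows are our own convention, carried on the same span attributes sketched earlier:

```python
from collections import defaultdict

# Weekly roll-up over tagged call records exported from your tracing backend.
calls = [
    {"use_case": "customer-care", "agent": "triage", "tenant": "acme", "team": "support-ai", "cost_usd": 0.052},
    {"use_case": "research", "agent": "drafting", "tenant": "acme", "team": "insights", "cost_usd": 0.84},
    # ...one record per model call
]

def rollup(records, keys=("team", "use_case")):
    totals = defaultdict(float)
    for record in records:
        totals[tuple(record[k] for k in keys)] += record["cost_usd"]
    return dict(totals)

print(rollup(calls))
```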