Why this matters less than your team thinks
Provider choice felt high-stakes in 2023 because tooling was provider-specific. In 2026 every serious provider exposes an OpenAI-compatible interface or is one library wrapper away from one. Your gateway can route traffic per-use-case based on eval pass-rate, latency, and cost - and you can swap providers in days, not quarters.
What's left is matching provider strengths to use-case needs. We do this empirically with the eval suite, not via vendor briefings.
How often does the leaderboard actually change?
Across the workloads we monitor in production (n = 30+ enterprise agents), the best-performing provider for a given workload changes roughly every 4–6 months. Sometimes Anthropic's reasoning steps ahead; sometimes OpenAI's tool-call calibration improves; sometimes a Llama or Mistral variant becomes cheaper for a specific eval profile. Locking in a single provider means accepting whichever curve was steeper at signing.
What this means for procurement: don't sign multi-year exclusive commitments. Negotiate volume discounts that reset at quarterly milestones. Keep a second provider warm at 5–10% of traffic so the swap path is always tested. Anthropic, OpenAI, AWS Bedrock, GCP Vertex, and Azure OpenAI all offer enterprise pricing with this flexibility [1][2][3].
| Provider | Strong fit | Watch out |
|---|---|---|
| Anthropic Claude (Opus / Sonnet) | Long-context reasoning, tool use, refusal calibration | Higher cost on long contexts |
| OpenAI GPT-5 / o-series | General-purpose throughput, multilingual coverage | More verbose by default - eval suite must enforce concision |
| Open-weight (Llama, Mistral) | Self-hosting, cost-bound workloads, predictable spend | Ops cost; capability lag for frontier reasoning |
| Google Gemini | Multimodal (vision-heavy workflows), GCP-native deployments | Less mature tool-use ecosystem |
Build the gateway, not the lock-in
- Define a provider-agnostic agent interface in your code (text-in, tools, structured-out). Don't leak vendor SDKs above the gateway.
- Implement adapters for at least two providers from day one. Even if you start with one, the second adapter is the cheap insurance.
- Wire eval-gated provider routing. The eval suite picks the provider for each use-case; you don't.
- Track per-provider pass-rate, latency, and cost in your APM. Your CFO and your AI lead read the same dashboard.
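To make the last bullet concrete, here is a minimal telemetry sketch. The record shape and the emitMetric sink are illustrative assumptions rather than a specific APM SDK; map the fields onto whatever your observability stack already ingests.

// Hypothetical per-call telemetry record; field names are illustrative
// and not tied to any particular APM SDK.
export interface ProviderCallMetric {
  useCase: string;      // intent class, e.g. "billing-dispute"
  providerId: string;   // e.g. "anthropic.claude-opus-4.5"
  evalPassed: boolean;  // did the response clear the eval gate?
  latencyMs: number;    // provider round-trip time
  costUsd: number;      // tokens in/out priced at the provider's rate
}

// One record per provider call; dashboards aggregate pass-rate, p50/p99
// latency, and cost-per-resolution by provider and by use-case.
export function emitMetric(metric: ProviderCallMetric): void {
  // Replace with your APM client (OpenTelemetry, Datadog, Prometheus, ...).
  console.log(JSON.stringify(metric));
}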
Routing strategies that work in production
A gateway isn't useful unless the routing logic is calibrated. The strategies we deploy fall into four patterns, often layered; a sketch of the cascading-fallback pattern follows the table. The right combination is empirical - driven by eval data - but the patterns themselves are stable.
| Strategy | Mechanism | Best for | Watch out |
|---|---|---|---|
| Static per-use-case | One provider per intent class | Stable workloads with clear capability gaps | Re-test quarterly; capability shifts |
| Bandit routing | Multi-armed bandit on eval pass-rate | Workloads with overlapping providers | Needs eval signal at scale (>10K/day) |
| Cascading fallback | Cheap provider first; escalate on low confidence | Cost-sensitive workloads | Tail latency increases on escalation |
| Self-hosted + cloud burst | Open-weight on owned GPUs; cloud for spike traffic | Predictable-spend high-volume | Ops cost; capacity planning |
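As a concrete example of the cascading-fallback row, here is a sketch written against the ModelProvider interface shown in the code further down. The isConfident predicate is an assumption; in practice the escalation signal is an eval-derived heuristic such as logprobs, a lightweight verifier, or a self-reported score.

// Sketch of cascading fallback: cheap provider first, escalate to the
// stronger one when the caller-supplied confidence check fails.
export async function cascadedGenerate(
  req: AgentRequest,
  cheap: ModelProvider,
  strong: ModelProvider,
  isConfident: (res: AgentResponse) => boolean,
): Promise<AgentResponse> {
  const first = await cheap.generate(req);
  if (isConfident(first)) return first;
  // Escalation path: expect higher tail latency and budget for it.
  return strong.generate(req);
}

Keeping the confidence check outside the function keeps the routing logic provider-agnostic; the eval suite owns the threshold.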
Eval pass rate by model. Source: Techimax engagement eval suites; aggregated, anonymized.
| Model | Eval pass rate (%) |
|---|---|
| Claude Sonnet 4.5 | 94 |
| GPT-5 | 92 |
| Claude Opus 4.5 | 96 |
| Gemini 2.5 Pro | 89 |
| Llama 3.3 70B (self-hosted) | 81 |
| Mistral Large 3 | 84 |
// One interface; many providers. The eval suite picks the
// provider per use-case; the orchestrator doesn't care.
export interface ModelProvider {
generate(req: AgentRequest): Promise<AgentResponse>;
capabilities: ProviderCapabilities;
costPerKToken: { input: number; output: number };
}
const providers: Record<string, ModelProvider> = {
"anthropic.claude-opus-4.5": new AnthropicAdapter("claude-opus-4-5"),
"openai.gpt-5": new OpenAIAdapter("gpt-5"),
"google.gemini-2.5-pro": new GeminiAdapter("gemini-2.5-pro"),
"self-hosted.llama-3.3-70b": new VllmAdapter("llama-3.3-70b"),
};
// Routing decision per use-case, driven by eval pass-rate +
// cost-per-resolution + latency budget. Reviewed monthly.
export function routeFor(useCase: UseCaseId): ModelProvider {
return providers[ROUTING_TABLE[useCase].providerId];
}

Open-weight vs hosted: when self-hosting wins
Self-hosting open-weight models (Llama 3.x, Mistral, Phi) is a real option for cost-bound, predictable-spend workloads. The crossover point is roughly 5–10M interactions per month for a 70B-class model on a typical inference setup; below that, hosted APIs are cheaper because the provider amortizes GPU time across its entire customer base. Above that, self-hosting can be 2–5× cheaper [4].
But: self-hosting carries operational cost - autoscaling, eval-gated promotion, GPU capacity planning, model update cycles. We typically advise customers to start with hosted, cross over to self-hosted on the highest-volume workload first (where the savings are biggest), and gradually expand only after telemetry proves the operational maturity is in place.
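The crossover arithmetic is worth sanity-checking with your own numbers. Every figure below is an illustrative assumption, not a quote; only the shape of the calculation carries over.

// Back-of-envelope crossover between hosted API spend and a self-hosted node.
// All constants are assumptions; substitute your own token volumes and rates.
const hostedUsdPerMillionTokens = 3.0;     // assumed blended input+output rate
const tokensPerInteraction = 2_000;        // assumed average for the workload
const selfHostedFixedUsdPerMonth = 25_000; // assumed GPUs + ops, amortized

const hostedUsdPerInteraction =
  (tokensPerInteraction / 1_000_000) * hostedUsdPerMillionTokens; // ≈ $0.006

// Monthly volume at which hosted spend matches the self-hosted fixed cost.
const crossoverPerMonth = selfHostedFixedUsdPerMonth / hostedUsdPerInteraction;

console.log(Math.round(crossoverPerMonth)); // ≈ 4.2M; ops overhead pushes the
                                            // practical crossover toward 5–10M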
The right answer to 'which model provider?' is 'all of them, behind a gateway, picked per use-case by eval pass-rate.' Provider lock-in is a 2023 problem.
References
- [1] Anthropic platform docs - Anthropic (2025)
- [2] OpenAI platform docs - OpenAI (2025)
- [3] AWS Bedrock docs - AWS (2025)
- [4] vLLM: efficient memory management for LLM serving - vLLM project (2025)
- [5] LMSYS Chatbot Arena leaderboard - LMSYS Org (2025)
- [6] Google Cloud Vertex AI model gallery - Google Cloud (2025)
Frequently asked questions
Doesn't a gateway add latency?
5–20ms typically. Negligible against LLM round-trip times of 600–3000ms. The latency cost is real but small; the option value is large.
Should we self-host?
Sometimes - for cost-bound, predictable-spend workloads where capability is met by an open-weight model. Operational cost is real (GPUs, autoscaling, eval-gated promotion); we typically advise self-hosting for one workload first and graduating only once you have telemetry.
What about region / data residency?
Cloud-native deployments (Anthropic on Bedrock, GPT on Azure OpenAI, Gemini on Vertex) cover most data-residency requirements. EU and India residencies are well-covered as of 2026.
How do we negotiate enterprise pricing with providers?
Volume commitments at $50K+/month spend get 10–25% discounts on most providers. Avoid multi-year exclusive commitments - quarterly resets keep your switching cost low. AWS Bedrock and Azure OpenAI bundle these into existing cloud agreements; Anthropic and OpenAI direct contracts are negotiable separately.
What about model versioning and deprecation?
Pin model versions in production. Anthropic, OpenAI, and Bedrock all support pinned versions with deprecation notices typically 6–12 months out. Re-run your eval suite against the new version before promoting; never auto-upgrade frontier models without an eval-gate.
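A sketch of what that eval-gate can look like. The runEvalSuite harness, the model IDs, and the pass-rate comparison are illustrative assumptions; the invariant is that the pin only moves when the candidate matches or beats the pinned baseline.

// Pinned model versions per use-case; never "latest" in production.
const PINNED_MODELS: Record<string, string> = {
  "support-triage": "claude-opus-4-5",
  "order-lookup": "gpt-5",
};

// Illustrative eval harness; in practice this calls your real suite.
interface EvalRun { passRate: number }
async function runEvalSuite(useCase: string, modelId: string): Promise<EvalRun> {
  // ...invoke the eval harness against the named model; stubbed for the sketch.
  return { passRate: 0 };
}

// Promotion is eval-gated: the candidate must match or beat the pinned
// baseline before the pin moves. No silent auto-upgrades.
export async function promoteIfPassing(useCase: string, candidateId: string): Promise<boolean> {
  const baseline = await runEvalSuite(useCase, PINNED_MODELS[useCase]);
  const candidate = await runEvalSuite(useCase, candidateId);
  if (candidate.passRate < baseline.passRate) return false;
  PINNED_MODELS[useCase] = candidateId;
  return true;
}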
Does Anthropic's MCP or OpenAI's Realtime API change the gateway calculus?
Not fundamentally. MCP standardizes tool discovery; Realtime adds streaming voice. Both fit cleanly behind a provider-agnostic interface - wrap them like any other capability. The gateway pattern absorbs new modalities; the routing logic gets one new variable.
What's the gateway latency budget look like in practice?
We measure 8–15ms p50, 25–40ms p99 for our standard gateway implementation; that budget covers eval-tag lookup, provider routing, PII redaction, and the cost-cap check. Provider call latency dominates; the gateway is comfortably below the perception threshold.
Do we need different gateways for chat vs agentic workloads?
No - one gateway, two routes. Chat is a single-call subset of the agentic interface. We use the same gateway for both with a different routing-table entry per use-case.