What regulators actually ask
We've sat in dozens of model risk reviews across BFSI (banking, financial services, insurance) and healthcare. The questions are predictable. Not because regulators are reading from the same script - they're not - but because the underlying principle is shared: prove that this system is bounded, observable, and reversible.
The 2024–2026 wave of AI-specific frameworks - EU AI Act [2], NIST AI Risk Management Framework [5], the FDA's draft guidance on AI/ML-enabled medical devices, and the OCC/Fed/FDIC joint guidance applying SR 11-7 [1] to LLMs - all converge on the same deliverables. The regulator vocabulary differs; the engineering artifacts are nearly identical.
- Per-release eval pass-rate logs
Calibrated eval suite re-run on every release. Pass-rate, regression deltas, failure-mode breakdown - versioned and queryable.
- Prompt + retrieval lineage per output
For any agent output, the auditor can reconstruct: what prompt was sent, which docs retrieved, which model version answered, what the cost and outcome were. Stored immutably for the regulatory retention period.
- Reviewer queues with calibrated SLAs
Decisions above a defined risk threshold route to human reviewers. Queue depth, review time, and override rate are tracked and reported.
- Immutable audit trail
Append-only log of every decision, model swap, prompt change, eval result. Cryptographic hashes optional but increasingly common in BFSI.
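The four deliverables above share one substrate: an append-only log. A minimal sketch of the hash-chaining variant, assuming an in-memory store and illustrative names (`AuditLog`, `AuditEntry`) rather than any specific product API:

```typescript
import { createHash } from "node:crypto";

// Sketch only: entry shape and class are illustrative, not a product API.
interface AuditEntry {
  kind: string;                      // e.g. "prompt_change", "model_swap", "eval_result"
  payload: Record<string, unknown>;
  prevHash: string;                  // hash of the preceding entry
  hash: string;
}

class AuditLog {
  readonly entries: AuditEntry[] = [];

  append(kind: string, payload: Record<string, unknown>): AuditEntry {
    const prevHash = this.entries.length
      ? this.entries[this.entries.length - 1].hash
      : "genesis";
    const hash = createHash("sha256")
      .update(prevHash + kind + JSON.stringify(payload))
      .digest("hex");
    const entry: AuditEntry = { kind, payload, prevHash, hash };
    this.entries.push(entry);
    return entry;
  }

  // Recompute every hash from the start; any edited entry breaks the chain.
  verify(): boolean {
    let prev = "genesis";
    for (const e of this.entries) {
      const expected = createHash("sha256")
        .update(prev + e.kind + JSON.stringify(e.payload))
        .digest("hex");
      if (e.hash !== expected || e.prevHash !== prev) return false;
      prev = e.hash;
    }
    return true;
  }
}
```

Editing any historical entry changes its hash and invalidates every subsequent `prevHash` link, which is what makes the trail auditable rather than merely logged.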
The eight questions every auditor asks
We've codified the eight questions that come up in nearly every model risk review. None of them ask whether the model is correct. All of them ask whether you can prove control. If you can answer all eight with a queryable artifact (not a slide), the audit becomes a routine review rather than a remediation cycle.
- Show me the eval suite. What's the calibration data, who reviewed it, when was it last refreshed?
- For this specific output [auditor picks one from a sample]: reconstruct prompt, retrieved context, model version, and reviewer trail.
- What's the change-management process for prompts? Who approves prompt changes; where's the diff trail?
- What happens when a model provider deprecates a version? How does promotion to a new version get validated?
- Where does PII / PHI / payment data flow? Who has access; how is access logged?
- What's the kill-switch? Who can pull it; under what conditions; how is it tested?
- For high-risk decisions: what's the human review SLA; what's the override rate; what's reviewed if the override rate is anomalous?
- What's the post-incident playbook? When was the last incident; what changed afterwards?
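The second question - reconstructing a specific output - is the one teams most often fail on the spot. A sketch of a record shape that makes it answerable, with hypothetical field names and an in-memory map standing in for the immutable store:

```typescript
// Illustrative shape of a per-output lineage record; field names are
// ours, not a standard schema. One output ID reconstructs the full chain.
interface LineageRecord {
  outputId: string;
  promptVersion: string;        // exact prompt sent, by version
  retrievedDocIds: string[];    // which documents retrieval pulled in
  modelVersion: string;
  costUsd: number;
  reviewerTrail: { reviewer: string; action: "approved" | "overridden" }[];
  timestamp: string;            // ISO 8601
}

// In-memory stand-in; production storage is append-only and write-once
// for the regulatory retention period.
const lineageStore = new Map<string, LineageRecord>();

function recordLineage(rec: LineageRecord): void {
  if (lineageStore.has(rec.outputId)) {
    throw new Error(`lineage for ${rec.outputId} already written`); // write-once
  }
  lineageStore.set(rec.outputId, rec);
}

// The auditor picks an output; everything hangs off one key.
function reconstruct(outputId: string): LineageRecord {
  const rec = lineageStore.get(outputId);
  if (!rec) throw new Error(`no lineage for output ${outputId}`);
  return rec;
}
```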
Source: Techimax compliance engagement data 2023–2026; cross-referenced with public OCC/FDIC bulletins
| Audit finding | % of reviews flagged |
|---|---|
| Missing per-output lineage | 71 |
| Eval suite not calibrated | 64 |
| Reviewer-queue SLA undocumented | 53 |
| Prompt change log incomplete | 47 |
| Sub-processor list stale | 38 |
| Kill-switch untested | 31 |
| Drift alarms absent | 27 |
NIST AI RMF: the framework auditors quietly defer to
NIST's AI Risk Management Framework [5] is voluntary in the US but functions as a de facto baseline. Auditors increasingly cite it when asking how a system was governed; insurance underwriters reference it when pricing AI liability; the EU AI Act's high-risk obligations map cleanly onto its Govern–Map–Measure–Manage structure.
Practical implication: scope your governance documentation against the NIST AI RMF Playbook [5]. The mapping to engineering deliverables (eval suite → Measure; lineage → Manage; reviewer queue → Manage; risk register → Govern) is straightforward and saves you from rebuilding documentation per regulator.
| RMF function | Engineering deliverable | Owner | Audit cadence |
|---|---|---|---|
| Govern | Risk register, model inventory, policy doc | Compliance + engineering | Quarterly review |
| Map | Use-case classification, blast-radius scoring | Product + risk | Per release + annual |
| Measure | Calibrated eval suite, per-release pass-rate | Engineering | Continuous (CI gate) |
| Manage | Reviewer queues, lineage logs, kill-switch | Engineering + operations | Continuous + drill quarterly |
Regulators don't care that your model is right 95% of the time - they care about the 5%. Make the 5% queryable, reviewable, and reversible, and the audit conversation gets shorter every cycle.
| Framework | Domain | Key engineering implication |
|---|---|---|
| HIPAA | Healthcare, US | PHI redaction at SDK boundary; BAA-covered providers; access logs |
| SOX | Public-company financial reporting, US | Immutable audit on financial-data agents; SoD for change approvals |
| EU AI Act (high-risk) | EU regulated decisions | Risk management system; data governance; human oversight; transparency |
| NERC CIP | US/CA bulk electric system | Cyber asset categorization; per-action access controls; change management |
| PCI DSS | Payment data | PAN tokenization before LLM; gateway-level redaction; access logs |
| DPDP | India | Consent management; processing notice; data subject rights |
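For the HIPAA and PCI DSS rows, the common engineering move is redaction at the gateway before any text reaches the model. A deliberately simplified sketch - the patterns and function name are ours, and production systems tokenize PANs with a Luhn check rather than relying on regex masking alone:

```typescript
// Simplified patterns for illustration only; real gateways combine
// format detection with a Luhn check and tokenize rather than mask.
const PAN_PATTERN = /\b\d(?:[ -]?\d){12,15}\b/g; // 13-16 digit card numbers
const SSN_PATTERN = /\b\d{3}-\d{2}-\d{4}\b/g;

function redactForLlm(text: string): string {
  return text
    .replace(PAN_PATTERN, "[PAN_REDACTED]")
    .replace(SSN_PATTERN, "[SSN_REDACTED]");
}
```

Redacting at the gateway rather than in app code mirrors the kill-switch design below it: the control sits where application code can't bypass it.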
Model risk management: the SR 11-7 reality
US banking supervisors apply SR 11-7 model risk guidance to every model that affects financial decisions. LLMs are models. The implication: every production LLM in a US bank touches the model risk inventory, gets a model risk rating, and undergoes a periodic model validation [1]. This is not optional and it's not light.
What works: treat the eval suite as the validation artifact. Calibrated, versioned, re-run every release. The model risk team gets a documented validation cadence; the engineering team gets the eval-gated CI they wanted. Same artifact, two stakeholders.
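What that shared artifact can look like as a CI gate, sketched with illustrative thresholds (the real values come out of calibration, not this example):

```typescript
// Sketch of an eval-gated promotion decision; types and thresholds are ours.
interface EvalResult {
  caseId: string;
  passed: boolean;
  failureMode?: string; // e.g. "hallucination", "refusal", "format"
}

interface GateDecision {
  passRate: number;
  regressionDelta: number;               // drop vs. previous release
  promote: boolean;
  breakdown: Record<string, number>;     // failure-mode counts
}

function evalGate(
  current: EvalResult[],
  previousPassRate: number,
  minPassRate = 0.95,   // illustrative floor
  maxRegression = 0.02, // illustrative allowed drop per release
): GateDecision {
  const passRate = current.filter((r) => r.passed).length / current.length;
  const regressionDelta = previousPassRate - passRate;
  const breakdown: Record<string, number> = {};
  for (const r of current) {
    if (!r.passed && r.failureMode) {
      breakdown[r.failureMode] = (breakdown[r.failureMode] ?? 0) + 1;
    }
  }
  return {
    passRate,
    regressionDelta,
    promote: passRate >= minPassRate && regressionDelta <= maxRegression,
    breakdown,
  };
}
```

Run in CI, this blocks promotion on either an absolute floor or a regression against the last release, and the returned record is exactly the versioned, queryable artifact the model risk team files as validation evidence.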
EU AI Act: high-risk systems and what changes
The EU AI Act categorizes systems by risk. High-risk systems (employment, credit, healthcare diagnostics, education) carry obligations: risk management, data governance, transparency, human oversight, and post-market monitoring. The high-risk obligations apply from 2026 [2].
What this means in engineering: the audit deliverables on this page already cover most of it. The remaining work is governance documentation - risk register, impact assessment, conformity declaration. Real work, but not engineering work; we usually pair with the customer's legal team on this rather than try to own it.
Kill-switch design: the control regulators test
Every regulated AI system needs a documented, tested kill-switch. "We can disable the API key" is not a kill-switch - it's a hope. A real kill-switch is a feature flag that disables the agent surface within seconds, fails over to a documented fallback (human queue or static response), and emits a SEV-1 page. It's tested quarterly with a documented drill.
What we ship: gateway-level flag controlling agent traffic, fallback routes for each surface (e.g., "send to human queue with 4-hour SLA"), drill runbook stored in the on-call wiki, last-drill-date field on the model inventory. Auditors love that last field - it's evidence the control is alive [4].
// Gateway-level flag check on every request. App code can't bypass.
// Fallback path is explicit; "degrade gracefully" is a behavior we test.
export async function handleAgentRequest(req: AgentRequest) {
  const flagState = await flags.get("agent.kill_switch", { agent: req.agentId });
  if (flagState.enabled) {
    audit.log({
      kind: "kill_switch_engaged",
      agent: req.agentId,
      reason: flagState.reason,
      operator: flagState.engagedBy,
    });
    return await fallback.route(req); // human queue or static path
  }
  return await agent.invoke(req);
}

What 'good' looks like at the model risk meeting
A well-prepared engineering team walks into the model risk meeting with five artifacts on a single page: model inventory entry (with risk rating), eval pass-rate trend chart, lineage query example, reviewer-queue stats, and last kill-switch drill date. Prep for the first review takes 2–3 weeks; every review after that, under an hour.
We've watched the same model risk team go from a 6-week back-and-forth on a customer's first agent to a 45-minute standing review on the fifth. The difference isn't a relaxed approval threshold - it's that the engineering team learned which artifact answers which question.
References
- [1] SR 11-7: Guidance on Model Risk Management - US Federal Reserve (2011, still active)
- [2] EU AI Act final text - European Commission (2024)
- [3] HIPAA Security Rule guidance - HHS Office for Civil Rights (2024)
- [4] OWASP Top 10 for LLM Applications - OWASP (2024)
- [5] AI Risk Management Framework 1.0 + Generative AI Profile - NIST (2024)
- [6] Good Machine Learning Practice for medical devices - FDA / Health Canada / MHRA (2024)
Frequently asked questions
Are LLMs models under SR 11-7?
US supervisors are applying it to LLMs that materially affect financial decisions. We default to assuming yes for any LLM-assisted workflow that touches a regulated outcome (credit decision, advice, claim adjudication).
Does HIPAA forbid LLMs?
No. It governs how PHI flows. We deploy in BAA-covered environments (Anthropic, OpenAI, AWS Bedrock all offer BAA tiers); redact PHI when it doesn't need to leave; and log every access. HIPAA workloads are entirely shippable when engineered for the standard.
Are sub-processors a risk?
Yes. Track them. We maintain a per-engagement sub-processor list with the categories of data each touches. Customers get notice before changes.
How does the EU AI Act treat foundation models?
General-purpose AI (GPAI) models carry transparency, technical documentation, and copyright-policy obligations [2]. If you're deploying a third-party foundation model, those obligations sit with the provider; if you fine-tune or significantly modify, you may inherit them. Document who owns what at the contract level before deployment.
What's the audit cost differential between built-in vs retrofit compliance?
Across our regulated engagements, retrofit audit work runs 5–10× the cost of built-in. Building lineage, eval calibration, and kill-switch into the original engineering adds maybe 15% to scope; bolting them on after a failed audit can cost more than the original build.
Do generative AI policies need their own approval cycle?
Yes. Most enterprise AI governance committees now have a separate review track for generative systems - typically faster than traditional model approval but with mandatory red-team and prompt-injection evidence. Build the red-team artifact early; it's gating in 2026.
How do regulators view agentic systems vs single-call LLMs?
More skeptically. Agents carry compounding risk because tool-call chains create state changes the user didn't explicitly approve. We default to higher-risk classification for any agent with state-mutating tools and recommend a human review queue for the top-blast-radius decisions until the eval suite covers them at >99% pass rate.
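That default can be encoded as a one-line classification rule. Tool metadata and the 99% threshold mirror the answer above; the function and field names are our own sketch:

```typescript
// Sketch of the default-to-higher-risk rule for agents with
// state-mutating tools. Names and thresholds are illustrative.
interface ToolSpec {
  name: string;
  mutatesState: boolean; // writes, payments, record updates
}

type RiskClass = "standard" | "high";

function classifyAgent(
  tools: ToolSpec[],
  evalPassRate: number, // eval coverage on the agent's decision paths
): { risk: RiskClass; requireHumanReview: boolean } {
  const mutating = tools.some((t) => t.mutatesState);
  const risk: RiskClass = mutating ? "high" : "standard";
  // Human review stays mandatory until the eval suite covers the
  // mutating paths at >99% pass rate.
  const requireHumanReview = mutating && evalPassRate <= 0.99;
  return { risk, requireHumanReview };
}
```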