Model Exfiltration Throttling Governance requires that organisations deploying AI agents implement detection and throttling mechanisms to prevent adversaries from systematically extracting proprietary model behaviour, system prompts, fine-tuning data, decision logic, safety guardrails, or confidential business rules through repeated, structured interactions. Model exfiltration — also called model extraction or model stealing, with model inversion as a closely related technique that reconstructs training inputs — is an adversarial technique in which an attacker submits carefully crafted queries and analyses the agent's responses to reconstruct the agent's internal logic, training data characteristics, or operational constraints. Unlike a single prompt injection (which seeks to cause one harmful action), exfiltration is a sustained campaign, typically requiring hundreds or thousands of queries to extract meaningful intellectual property. This dimension mandates that organisations detect exfiltration patterns, throttle suspicious query sequences, and protect the proprietary elements embedded in their agent deployments.
Scenario A — System Prompt Extraction via Iterative Probing: A competitor targets a customer-facing insurance underwriting agent deployed by a major insurer. The competitor creates 14 fake customer accounts and submits a total of 4,200 carefully crafted queries over 6 weeks. The queries systematically probe the agent's decision boundaries: "Would a 35-year-old male smoker with a BMI of 31 qualify for standard rates?" followed by "What about BMI 30.9?" and "What about BMI 30.5?" — progressively narrowing in on the exact threshold. Other query sequences probe the agent's pricing model: "What would the monthly premium be for..." with systematically varied parameters. Through this campaign, the competitor reconstructs the insurer's underwriting decision tree with 94% accuracy, including proprietary risk thresholds, pricing coefficients, and exclusion criteria that represent 3 years and £2.8 million of actuarial development. The competitor launches a competing product that precisely undercuts the insurer on every profitable segment while avoiding unprofitable segments — a strategy only possible with knowledge of the proprietary model. The insurer loses £4.2 million in annual premium volume before detecting the competitive anomaly and tracing it to the extraction campaign.
What went wrong: No mechanism detected the systematic probing pattern — 4,200 queries with methodically varied parameters, executed from 14 accounts with identical query structures, over 6 weeks. Each individual query was a legitimate customer inquiry. The extraction was invisible at the single-query level and only detectable through pattern analysis across the full query sequence. No throttling mechanism limited the rate at which decision boundary probing could occur. No anomaly detection flagged the statistical improbability of 14 "customers" independently asking nearly identical parameter-sweep questions.
Scenario B — Safety Guardrail Mapping for Downstream Exploitation: A threat actor targets a financial advisory agent to map its safety guardrails before launching a social engineering campaign against the agent's users. Over 3 weeks, the actor submits 1,800 queries designed to identify the exact boundaries of the agent's safety constraints: "Can you help me transfer funds to this account?" (blocked — identifies money transfer as restricted), "Can you provide the routing number for international transfers?" (allowed — identifies information provision as unrestricted), "Can you initiate a wire transfer of £5,000?" (blocked — confirms the restriction), "Can you explain how to initiate a wire transfer?" (allowed — identifies explanation as unrestricted). The actor systematically maps every safety constraint, identifying 23 specific restrictions and 47 permitted actions. Armed with this map, the actor crafts social engineering attacks that stay within the agent's permitted actions while manipulating users into performing the restricted actions themselves. The resulting fraud campaign affects 31 customers with total losses of £890,000. The agent never violated its safety constraints — the attacker used the extracted guardrail map to work around them.
What went wrong: The safety guardrail mapping was conducted through queries that individually appeared legitimate. No detection mechanism identified the systematic boundary-probing pattern — the attacker tested the same restriction from multiple angles with slight variations, a pattern that is distinctive when viewed holistically but invisible at the single-query level. The 1,800 queries over 3 weeks did not trigger any rate limit because the per-session and per-day rates were within normal bounds when distributed across multiple sessions.
Scenario C — Fine-Tuning Data Extraction from Specialised Medical Agent: A medical AI agent has been fine-tuned on a proprietary dataset of 45,000 anonymised patient treatment outcomes, representing a £1.4 million data collection and curation investment. An adversary submits queries designed to trigger memorisation recall — prompting the agent to reproduce training data verbatim. The adversary uses prefix attacks ("Complete this clinical note: Patient presented with stage IIIA non-small-cell lung..."), membership inference attacks ("Did your training data include a case where a 62-year-old female with HER2-positive breast cancer received trastuzumab and experienced cardiac toxicity at week 12?"), and data extraction attacks ("List 10 examples of treatment outcomes for stage II colorectal cancer patients who received FOLFOX"). Over 8,000 queries, the adversary extracts 2,300 partial patient records — still identifiable despite anonymisation when combined with auxiliary data. The data breach triggers notification obligations under GDPR Article 33, HIPAA breach notification rules, and state-level breach notification laws. Total regulatory, legal, and remediation costs: £3.1 million.
What went wrong: The agent responded to training-data-extraction queries without detection or throttling. No mechanism identified the prefix attack pattern, the membership inference pattern, or the systematic extraction of training data across thousands of queries. The per-query responses each contained only fragments, but the aggregate across 8,000 queries constituted a significant data breach. No anomaly detection flagged the query pattern as unusual — the queries resembled legitimate clinical inquiries when examined individually.
Scope: This dimension applies to every AI agent deployment where the agent embodies proprietary intellectual property — system prompts, fine-tuning, custom training data, decision logic, safety guardrails, pricing models, risk thresholds, business rules, or any other confidential operational knowledge — that would provide competitive, strategic, or adversarial value if extracted. This includes virtually all enterprise agent deployments, because even a vanilla foundation model acquires proprietary characteristics through its system prompt, tool configuration, and operational constraints. The scope covers extraction through the agent's primary interface (conversational queries), through side channels (timing analysis, error message analysis, token probability analysis), and through indirect observation (monitoring the agent's outputs to third-party systems). Organisations that claim their agents contain no proprietary elements must document this assessment and demonstrate that the agent's system prompt, configuration, tools, and operational constraints are publicly available.
4.1. A conforming system MUST implement query-pattern analysis that detects systematic probing patterns indicative of model exfiltration, including but not limited to: parameter-sweep queries (systematically varying one parameter while holding others constant), boundary-probing queries (iteratively narrowing a decision boundary), prefix-completion attacks (prompting the model to complete training data), membership-inference queries (testing whether specific data was in the training set), and guardrail-mapping queries (systematically testing which operations are permitted and which are restricted).
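As a non-normative illustration of 4.1, the sketch below detects the simplest signature on the list, a parameter sweep, assuming an upstream parser has already reduced each query to named numeric parameters. `QueryRecord`, the window size, and the sweep threshold are hypothetical and would need per-deployment tuning.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class QueryRecord:
    """One parsed query: parameter name mapped to its extracted value."""
    params: dict[str, float]

def detect_parameter_sweep(history: list[QueryRecord],
                           window: int = 10,
                           min_distinct: int = 5) -> str | None:
    """Flag a sweep when, within the last `window` queries, exactly one
    parameter takes `min_distinct` or more values while every other
    parameter stays constant. Both thresholds are assumptions."""
    recent = history[-window:]
    if len(recent) < min_distinct:
        return None
    values: dict[str, set[float]] = defaultdict(set)
    for record in recent:
        for name, value in record.params.items():
            values[name].add(value)
    varying = [n for n, v in values.items() if len(v) >= min_distinct]
    constant = [n for n, v in values.items() if len(v) == 1]
    if len(varying) == 1 and len(constant) == len(values) - 1:
        return varying[0]  # e.g. "bmi" in Scenario A, with age/sex/smoker pinned
    return None
```

Applied to Scenario A's query stream, the BMI sweep with age, sex, and smoker status held constant would surface within a single window.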
4.2. A conforming system MUST implement throttling mechanisms that reduce the information yield of detected exfiltration patterns, including but not limited to: rate limiting on queries matching exfiltration signatures, response perturbation that adds controlled noise to outputs when exfiltration patterns are detected, progressive delay that increases response latency as exfiltration confidence increases, and session termination for confirmed high-confidence exfiltration attempts.
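A sketch of the escalation ladder 4.2 describes, assuming the detection layer produces a per-session exfiltration score in [0, 1]; the thresholds, delays, and `session` hooks are illustrative assumptions, not prescribed values.

```python
import asyncio
import random

async def throttle(session, exfil_score: float) -> bool:
    """Escalate friction as exfiltration confidence rises. Returns
    False when the session should be terminated. All thresholds are
    assumptions to tune per deployment."""
    if exfil_score < 0.3:
        return True                                       # normal service
    if exfil_score < 0.6:
        await asyncio.sleep(exfil_score * 10)             # progressive delay (3-6 s)
        return True
    if exfil_score < 0.9:
        session.enable_response_perturbation(noise=0.05)  # hypothetical hook
        await asyncio.sleep(random.uniform(10, 30))
        return True
    session.escalate_to_security_review()                 # hypothetical hook; see 4.6
    return False                                          # terminate per 4.2
```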
4.3. A conforming system MUST maintain exfiltration detection state across sessions, accounts, and IP addresses to prevent adversaries from circumventing per-session detection by distributing queries across multiple sessions, accounts, or network origins.
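One way to satisfy 4.3 is to key detection state on the structure of queries rather than on their origin. The sketch below is an assumed approach: numeric values are templated out of each query, and accounts sharing a template are linked. A production system would persist this state in a shared store with decay rather than in process memory.

```python
import hashlib
import re
from collections import defaultdict

def structural_fingerprint(query: str) -> str:
    """Normalise a query to its template so parameter sweeps issued
    from different accounts hash to the same value."""
    template = re.sub(r"\d+(?:\.\d+)?", "<NUM>", query.lower())
    template = re.sub(r"\s+", " ", template).strip()
    return hashlib.sha256(template.encode()).hexdigest()[:16]

class CrossSessionCorrelator:
    """Detection state that survives session, account, and IP churn:
    accounts sharing a query template are candidate members of one
    distributed campaign (requirement 4.3)."""

    def __init__(self) -> None:
        self._accounts_by_template: dict[str, set[str]] = defaultdict(set)

    def observe(self, account_id: str, query: str) -> set[str]:
        fp = structural_fingerprint(query)
        self._accounts_by_template[fp].add(account_id)
        # Multiple "independent" accounts per template is the statistical
        # improbability that Scenario A's insurer never tested for.
        return self._accounts_by_template[fp]
```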
4.4. A conforming system MUST protect system prompts from direct extraction attacks, ensuring that the agent does not reproduce its system prompt, operational constraints, or safety guardrails verbatim in response to queries requesting this information.
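For the output-filtering half of 4.4, a minimal verbatim-leak check is sketched below; the 8-token window is an assumption balancing over-blocking against missed partial leaks.

```python
def _token_windows(text: str, n: int) -> set[tuple[str, ...]]:
    """All n-token windows of a normalised text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def leaks_system_prompt(response: str, system_prompt: str, n: int = 8) -> bool:
    """True when the response reproduces any n-token run of the system
    prompt verbatim. Short windows over-block; long windows miss
    partial leaks. n is a tuning assumption."""
    return bool(_token_windows(response, n) & _token_windows(system_prompt, n))
```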
4.5. A conforming system MUST log all detected exfiltration patterns with sufficient detail to support forensic analysis, including the query sequence, the detection signature matched, the throttling action taken, the source identity (account, IP, session), and timestamps, in a tamper-evident log consistent with AG-006.
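A hash-chained log entry of the kind 4.5 and AG-006 contemplate might look as follows; the field names are illustrative.

```python
import hashlib
import json
import time

def append_exfiltration_event(log: list[dict], prev_hash: str, *,
                              signature: str, action: str, account: str,
                              ip: str, session_id: str,
                              query_sequence: list[str]) -> str:
    """Append a detection record whose hash chains to the previous
    entry, making post-hoc tampering evident. Returns the new hash
    for use as the next entry's prev_hash."""
    entry = {
        "timestamp": time.time(),
        "detection_signature": signature,      # e.g. "parameter_sweep"
        "throttling_action": action,           # e.g. "progressive_delay"
        "source": {"account": account, "ip": ip, "session": session_id},
        "query_sequence": query_sequence,
        "prev_hash": prev_hash,
    }
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()).hexdigest()
    log.append(entry)
    return entry["hash"]
```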
4.6. A conforming system MUST define exfiltration risk thresholds that trigger escalation to human security review when sustained probing is detected, ensuring that automated throttling is supplemented by human analysis for sophisticated campaigns.
4.7. A conforming system MUST test exfiltration detection and throttling mechanisms against known extraction techniques at least quarterly, verifying that the mechanisms detect and throttle current attack methodologies.
4.8. A conforming system SHOULD implement response diversity mechanisms that prevent identical queries from producing identical responses, reducing the signal-to-noise ratio for statistical extraction attacks that rely on response consistency to filter noise.
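A minimal sketch of 4.8 for numeric outputs is shown below. Symmetric noise can still be averaged away, so diversity does not prevent extraction on its own; it multiplies the query budget required, which the detection and throttling layers then act on. The 1% band is an assumption bounded by what legitimate users can tolerate.

```python
import random

def diversify(value: float, rel_noise: float = 0.01,
              rng: random.Random | None = None) -> float:
    """Bounded multiplicative noise on a numeric output (e.g. a quoted
    premium) so identical queries do not yield identical responses.
    The noise band is an illustrative assumption."""
    rng = rng or random.Random()
    return value * (1.0 + rng.uniform(-rel_noise, rel_noise))
```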
4.9. A conforming system SHOULD implement economic friction — mechanisms that increase the cost (time, compute, money) of sustained query campaigns, making extraction economically unattractive relative to the value of the extracted information.
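The friction 4.9 describes can be as blunt as volume- and suspicion-sensitive pricing. The sketch below is a placeholder model, assuming a per-account daily query count and the exfiltration score from the detection layer; all coefficients are assumptions standing in for a real economic analysis.

```python
def marginal_query_cost(base_price: float, exfil_score: float,
                        queries_today: int, free_quota: int = 200) -> float:
    """Per-query price grows with daily volume and exfiltration
    suspicion, so a 10,000-query campaign costs orders of magnitude
    more than legitimate use. Coefficients are placeholders."""
    volume_factor = max(1.0, (queries_today / free_quota) ** 2)
    suspicion_factor = 1.0 + 9.0 * exfil_score  # up to 10x at score 1.0
    return base_price * volume_factor * suspicion_factor
```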
4.10. A conforming system MAY implement honeypot responses — deliberately planted false information in responses to detected exfiltration queries — that serve as canary data, enabling detection if the extracted information appears in competitor products or adversarial campaigns.
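A sketch of canary planting under 4.10. Real canaries must be domain-plausible decoy facts rather than random tokens, and the in-memory registry shown here would live in durable storage.

```python
import secrets

CANARY_REGISTRY: dict[str, str] = {}  # canary value -> detection campaign id

def plant_canary(response: str, campaign_id: str) -> str:
    """Embed a unique false detail in a response to a detected
    exfiltration query. If the value later surfaces in a competitor
    product, the registry establishes provenance."""
    canary = f"REF-{secrets.token_hex(4).upper()}"  # illustrative format
    CANARY_REGISTRY[canary] = campaign_id
    return f"{response} (internal reference: {canary})"
```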
AI agents in enterprise deployments are not generic — they embody significant proprietary investment. The system prompt alone may contain months of prompt engineering refinement, encoding business rules, compliance constraints, tone guidelines, and operational boundaries that represent competitive advantage. Fine-tuned models encode proprietary training data that may have cost millions to collect and curate. Decision logic derived from years of domain expertise is embedded in the agent's behaviour. This proprietary investment is vulnerable to extraction because the agent's interface — accepting queries and producing responses — is also the extraction vector.
Model exfiltration is not a theoretical concern. Academic research has demonstrated that model extraction attacks can reconstruct decision boundaries of commercial ML models with high fidelity using feasible query budgets (thousands to tens of thousands of queries). In the AI agent context, the attack surface is even larger because agents respond in natural language with rich detail, provide reasoning explanations that reveal decision logic, and maintain conversational context that can be exploited for iterative refinement of extraction queries.
The economic incentive for exfiltration is substantial. An organisation that has invested £2 million in developing a specialised agent — through data collection, model training, prompt engineering, safety testing, and compliance validation — faces a competitor who can extract the essential logic through a £5,000 query campaign (compute costs for 10,000 API calls). This asymmetry between development cost and extraction cost makes exfiltration attacks economically rational for adversaries and existential for the organisations being targeted.
The regulatory dimension is equally significant. When fine-tuned models contain training data derived from personal information (Scenario C), extraction of that data constitutes a data breach subject to GDPR, HIPAA, and other privacy regulations. The organisation is liable not because it disclosed the data intentionally, but because it failed to implement reasonable technical measures to prevent extraction. GDPR Article 32 requires "appropriate technical and organisational measures" to protect personal data — an AI agent that can be queried into reproducing personal training data has inadequate technical measures.
From a security perspective, guardrail mapping (Scenario B) is a reconnaissance technique that enables more severe downstream attacks. An adversary who knows exactly what the agent will and will not do can craft attacks that exploit the gaps between restrictions. Preventing guardrail extraction is therefore not just an intellectual property concern — it is a security prerequisite for the effectiveness of all other agent controls.
The NIST AI Risk Management Framework addresses model theft and intellectual property protection under MANAGE 2.4 (mechanisms for tracking risks from third-party entities) and MAP 5.1 (identifying AI system impacts). DORA Article 9 requires financial entities to protect the confidentiality and integrity of their ICT systems — an agent's decision logic is an ICT asset whose confidentiality must be protected. The EU AI Act Article 15 requires robustness against adversarial manipulation — sustained exfiltration is an adversarial manipulation of the agent's query interface.
Model Exfiltration Throttling Governance requires a layered defence combining query-level analysis, session-level pattern detection, cross-session correlation, and response-level controls. No single mechanism is sufficient because sophisticated adversaries adapt their techniques to evade individual controls. The defence must impose cumulative friction that makes extraction progressively more difficult and expensive as the adversary persists.
Recommended patterns:
Anti-patterns to avoid:
Financial Services. Financial agents embed proprietary pricing models, risk assessment algorithms, credit scoring logic, and trading strategies. Extraction of these models provides direct competitive advantage and can enable market manipulation. Financial firms should implement the most aggressive throttling thresholds — low tolerance for parameter-sweep queries on pricing or risk models, immediate escalation for boundary-probing sequences on credit thresholds, and zero tolerance for system prompt extraction attempts. The economic value of the embedded logic justifies substantial investment in extraction prevention.
Healthcare. Healthcare agents fine-tuned on clinical data face the dual risk of intellectual property theft and personal data breach. Extraction of fine-tuning data may reveal patient information subject to GDPR and HIPAA. Healthcare deployments should implement memorisation detection — monitoring for responses that appear to reproduce training data verbatim — and aggressive throttling on prefix-completion and membership-inference attack patterns. The regulatory liability for training data extraction is severe and includes mandatory breach notification.
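A sketch of the memorisation detection described above, assuming the fine-tuning corpus is available at deployment time. The 12-token window is an assumption, and a production system would use a space-efficient index (e.g. a Bloom filter or suffix array) rather than an in-memory set.

```python
def build_training_index(corpus: list[str], n: int = 12) -> set[tuple[str, ...]]:
    """Index every n-token window of the fine-tuning corpus. Twelve
    tokens is an assumed threshold below which overlap with clinical
    boilerplate is likely coincidental."""
    index: set[tuple[str, ...]] = set()
    for doc in corpus:
        toks = doc.lower().split()
        index.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return index

def reproduces_training_data(response: str, index: set[tuple[str, ...]],
                             n: int = 12) -> bool:
    """True when any n-token window of the response appears verbatim in
    the corpus, the verbatim-recall signal from Scenario C's prefix attacks."""
    toks = response.lower().split()
    return any(tuple(toks[i:i + n]) in index
               for i in range(len(toks) - n + 1))
```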
Insurance. Insurance underwriting and claims agents embed actuarial models and claims assessment logic that represent core competitive advantage. Scenario A demonstrates the direct competitive harm from extraction. Insurers should implement decision-boundary protection — perturbing or generalising responses that reveal exact thresholds — and monitor for systematic parameter-sweep query patterns.
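Decision-boundary protection can be as simple as banding, sketched below; the band width is a business decision that trades boundary secrecy against quote precision for legitimate customers.

```python
def banded_response(value: float, band: float) -> str:
    """Report a range instead of an exact figure so boundary probing
    recovers only the band a threshold falls in, not the threshold."""
    low = (value // band) * band
    return f"between {low:g} and {low + band:g}"
```

For example, `banded_response(31.4, 5.0)` returns "between 30 and 35", so Scenario A's BMI probes would recover only the band containing the threshold.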
Technology and SaaS. AI-powered SaaS products where the agent's behaviour is the product face the highest extraction risk because every customer interaction is a potential extraction query. Implement per-customer exfiltration scoring, contractual prohibitions on systematic extraction with technical monitoring for violations, and response diversification to increase the query budget required for meaningful extraction.
Basic Implementation — The organisation has identified the proprietary elements embedded in each agent (system prompts, fine-tuning data, decision logic, business rules). System prompt extraction defences are implemented (instruction-level and output-filtering). Basic rate limiting prevents high-volume query campaigns. Exfiltration detection logs are maintained. Quarterly testing validates that system prompts cannot be extracted through known techniques. This level meets the minimum mandatory requirements.
Intermediate Implementation — All basic capabilities plus: statistical query-pattern analysis detects parameter sweeps, boundary probing, and membership inference patterns. Cross-session correlation links extraction attempts distributed across sessions and accounts. Progressive throttling imposes escalating friction on detected extraction patterns. Response perturbation adds controlled noise to decision-boundary outputs. Human escalation triggers for high-confidence extraction detection. Testing includes known academic extraction attack methodologies.
Advanced Implementation — All intermediate capabilities plus: real-time exfiltration scoring for all active sessions with dashboard visibility. Honeypot responses enable detection of extracted information in the wild. Economic analysis quantifies the cost-to-extract for each agent's proprietary elements, demonstrating that extraction cost exceeds the value of the extracted information. Advanced identity correlation uses query-style fingerprinting to link accounts operated by the same adversary. Memorisation detection identifies responses that reproduce training data. The organisation can demonstrate through independent red-team testing that no known extraction technique succeeds within an economically viable query budget.
Required artefacts:
Retention requirements:
Access requirements:
Test 8.1: System Prompt Extraction Defence
Test 8.2: Parameter-Sweep Detection
Test 8.3: Cross-Session Exfiltration Detection
Test 8.4: Response Perturbation Effectiveness
Test 8.5: Membership Inference Attack Detection
Test 8.6: Escalation to Human Security Review
Test 8.7: Quarterly Extraction Technique Validation
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 15 (Accuracy, Robustness and Cybersecurity) | Direct requirement |
| EU AI Act | Article 9 (Risk Management System) | Supports compliance |
| SOX | Section 404 (Internal Controls Over Financial Reporting) | Supports compliance |
| FCA SYSC | 6.1.1R (Systems and Controls) | Supports compliance |
| NIST AI RMF | MANAGE 2.4, MAP 5.1 | Direct requirement |
| ISO 42001 | Clause 6.1 (Actions to Address Risks and Opportunities) | Supports compliance |
| DORA | Article 9 (Protection and Prevention) | Direct requirement |
Article 15 requires that high-risk AI systems are resilient to attempts by unauthorised third parties to exploit system vulnerabilities, including adversarial manipulation. Model exfiltration is adversarial manipulation of the query interface — the adversary exploits the system's vulnerability (responding to queries with information that reveals internal logic) to extract proprietary knowledge. The requirement for cybersecurity measures "appropriate to the circumstances and the risks" directly covers exfiltration defence. Organisations must demonstrate that their AI systems resist systematic extraction attempts, not merely individual prompt injection attacks. The sustained nature of exfiltration campaigns — thousands of queries over weeks — means that cybersecurity measures must include pattern detection and throttling, not just per-query defences.
Financial agents that embed proprietary pricing models, risk assessment logic, or trading strategies represent intellectual property that materially affects the organisation's financial position. Extraction of this IP by competitors directly impacts revenue (Scenario A: £4.2 million annual premium loss). SOX auditors will assess whether the organisation has adequate controls to protect IP-bearing systems from extraction. AG-432's exfiltration detection and throttling constitute these controls. Material weaknesses in IP protection for revenue-generating AI systems are reportable under SOX.
The FCA expects firms to protect their systems from exploitation. A financial agent that can be systematically queried to reveal its credit scoring logic, risk thresholds, or trading strategies is an exploitable system. The FCA's supervisory approach to technology risk includes the expectation that firms protect proprietary algorithms and models from extraction. AG-432 provides the control framework that demonstrates this protection.
MANAGE 2.4 addresses mechanisms for managing AI risks related to third-party entities — model exfiltration is a risk originating from third-party adversaries interacting with the AI system. MAP 5.1 addresses the identification of AI system impacts, including impacts on the organisation's intellectual property. The NIST framework explicitly recognises model theft and extraction as AI-specific risks requiring dedicated risk management measures. AG-432's exfiltration throttling is a direct implementation of the risk management measures the framework calls for.
DORA requires financial entities to protect the confidentiality and integrity of ICT assets. An AI agent's decision logic, system prompt, and fine-tuning data are ICT assets whose confidentiality must be protected under Article 9 (protection and prevention). Article 10's requirement for mechanisms to promptly detect anomalous activities covers sustained exfiltration campaigns — these are ICT security incidents that must be detected, logged, and responded to. DORA's emphasis on resilience testing (Article 25) aligns with AG-432's requirement for quarterly exfiltration testing.
ISO 42001 requires organisations to determine and address risks and opportunities related to AI systems. Model exfiltration is a risk to the organisation's AI investment and, where training data contains personal information, a risk to data subjects. The standard requires risk treatment plans proportionate to the identified risks. AG-432's layered defence — detection, throttling, perturbation, and escalation — constitutes a comprehensive risk treatment plan for the exfiltration risk class.
| Field | Value |
|---|---|
| Severity Rating | High |
| Blast Radius | Organisation-wide — extracted proprietary logic affects competitive position across all markets and products that rely on the compromised agent's embedded knowledge; training data extraction may constitute a data breach affecting all data subjects represented in the training set |
Consequence chain: An adversary initiates a sustained exfiltration campaign against an AI agent, submitting thousands of carefully crafted queries over days or weeks. Without detection and throttling, the agent responds to each query with information that incrementally reveals its internal logic. The immediate technical consequence is information leakage — each response provides a data point that contributes to the adversary's reconstruction of the agent's proprietary elements.

The aggregate information leakage over the campaign constitutes extraction of one or more proprietary elements: the system prompt (enabling the adversary to replicate the agent), the decision boundaries (enabling competitive undercutting as in Scenario A), the safety guardrails (enabling adversarial exploitation as in Scenario B), or the training data (constituting a data breach as in Scenario C).

The business consequences vary by the element extracted. System prompt extraction enables rapid replication of the agent by competitors, destroying the competitive advantage that justified the development investment. Decision boundary extraction enables precision competitive attacks targeting the most profitable segments. Safety guardrail extraction enables adversarial campaigns that exploit the gaps between restrictions, causing harm to end users. Training data extraction triggers regulatory obligations (GDPR breach notification, HIPAA breach notification) and potential enforcement actions, with costs in the millions.

The severity is High rather than Critical because exfiltration is a gradual process — unlike an execution sink failure (AG-431) where a single unvalidated output causes immediate harm, exfiltration requires sustained adversary effort, providing a window for detection and response. However, the cumulative impact of successful exfiltration can exceed the impact of any single execution sink failure, because it affects the organisation's competitive position and intellectual property portfolio rather than a single transaction or operation.
Cross-references: AG-004 (Action Rate Governance), AG-012 (Credential & Secret Lifecycle Governance), AG-095 (Prompt Integrity Governance), AG-404 (Network Egress and DNS Control Governance), AG-430 (Prompt Injection Sink Hardening Governance), AG-434 (Covert Channel Detection Governance), AG-436 (Abuse-at-Scale Detection Governance), AG-437 (Economic Abuse Resistance Governance).