AG-436

Abuse-at-Scale Detection Governance

Security, Adversarial Abuse & Threat Operations · ~25 min read · AGS v2.1 · April 2026
EU AI Act · GDPR · SOX · FCA · NIST · ISO 42001

2. Summary

Abuse-at-Scale Detection Governance requires that organisations operating AI agent estates implement detection capabilities specifically designed to identify coordinated, high-volume, or automated exploitation of agent systems by botnets, coordinated human actor networks, or hybrid attack campaigns that combine automated and human-directed activity. Individual agent abuse — a single attacker probing a single agent — is addressed by prompt integrity, input validation, and rate governance controls. Abuse-at-scale is qualitatively different: it involves systematic exploitation across multiple agents, sessions, accounts, or time periods, often using automation to achieve volume that a single human attacker could not. Without scale-aware detection, an organisation may successfully defend each individual agent session while failing to recognise that thousands of sessions constitute a coordinated campaign extracting training data, mapping safety boundaries, generating prohibited content at industrial volume, or exploiting agent capabilities to conduct fraud across the entire estate.

3. Example

Scenario A — Botnet-Driven Credential Stuffing Through Agent Conversational Interface: A financial services firm deploys a customer-facing agent that can check account balances, initiate transfers, and reset passwords when customers provide identity verification information. An attacker operates a botnet of 12,000 compromised residential IP addresses. Each bot initiates a conversational session with the agent, presents a different stolen identity (from a breached database of 2.3 million credentials), and attempts the identity verification flow. The botnet is engineered to mimic human conversational patterns — variable typing speeds, natural pauses, occasional typos — to evade simple bot detection. Each individual session appears legitimate: a customer asking to check their account balance. But across the estate, 12,000 sessions within a 4-hour window are all following the same conversational pattern with different identities. The per-session rate limiter (AG-004) is satisfied because each session generates only 3-5 requests. The per-account rate limiter is satisfied because each identity is used only once. No individual session triggers any anomaly. Over the 4-hour window, 847 sessions successfully pass identity verification, and the attacker initiates £2.1 million in transfers before the fraud team detects the activity through downstream transaction monitoring — 14 hours after the campaign began. The transfers are distributed across 340 recipient accounts. Recovery is limited to £380,000; net loss is £1.72 million. The regulatory investigation reveals the firm had no mechanism to detect that 12,000 sessions with statistically similar conversational patterns constituted a coordinated attack rather than normal customer activity.

What went wrong: Per-session and per-account controls were satisfied because the attack was distributed. Rate limiting at the individual session level could not detect a campaign operating at the estate level. The firm lacked aggregate pattern detection that would identify: (a) statistically similar conversational flows across thousands of sessions, (b) an anomalous spike in identity verification attempts, (c) geographic or temporal clustering inconsistent with normal customer behaviour, and (d) the coordination signal — 12,000 sessions executing near-identical conversational strategies within a compressed time window.

Scenario B — Distributed Jailbreak Campaign for Prohibited Content Generation: A content generation agent deployed by a media platform has safety guardrails preventing the generation of extremist recruitment material. A coordinated network of 3,200 accounts — some automated, some operated by human participants recruited through extremist forums — conducts a systematic campaign to map the agent's safety boundaries and discover jailbreak techniques that bypass the guardrails. Phase 1 (weeks 1-2): 800 accounts each submit 20-30 prompts testing variations of prohibited requests, systematically probing the boundary between permitted and prohibited content. Phase 2 (weeks 3-4): accounts share successful partial bypasses and refine techniques. Phase 3 (weeks 5-6): 2,400 accounts use the discovered jailbreak techniques to generate extremist content at industrial volume — approximately 45,000 pieces of content over 14 days. Each individual account generates modest volumes (14-19 pieces), well below the per-account content generation limit. No single account's activity triggers any alert. The platform discovers the campaign only when external researchers identify a pattern of extremist content traceable to the platform, 8 weeks after the campaign began. The reputational damage is severe: press coverage of "AI platform mass-produces extremist content," advertiser withdrawal worth £8.3 million in lost revenue, and regulatory investigation under the EU Digital Services Act.

What went wrong: Per-account monitoring detected no anomaly because individual account activity was within normal bounds. The organisation had no mechanism to detect: (a) the systematic boundary-probing behaviour in Phase 1, where hundreds of accounts submitted structurally similar prompt variations targeting the same safety boundary; (b) the temporal progression from probing to exploitation across the account network; (c) the statistical similarity of generated content across 2,400 accounts — all producing variations on the same prohibited themes using the same jailbreak technique. The attack exploited the gap between per-session security (which was functional) and estate-level pattern detection (which did not exist).

Scenario C — Coordinated Model Extraction Through Agent API: A Crypto/Web3 agent provides market analysis and trading signal generation. A competing firm operates a coordinated extraction campaign using 5,600 accounts created with synthetic identities across 47 jurisdictions. Each account submits carefully crafted queries designed to elicit the agent's proprietary trading logic — not by directly requesting the logic, but by submitting thousands of hypothetical market scenarios and recording the agent's recommended actions. The queries are engineered to be maximally informative: each query probes a different region of the decision space, and the collective query set constitutes a systematic sampling of the agent's decision boundary. Over 6 months, the campaign submits 2.8 million queries across the 5,600 accounts (an average of 2.8 queries per account per day — indistinguishable from normal user behaviour). The extracted decision boundary is used to reconstruct a competing model that replicates 89% of the original agent's trading decisions. The organisation discovers the extraction only when the competing product launches with suspiciously similar performance characteristics. The estimated value of the extracted intellectual property is £14 million. Litigation costs an additional £3.2 million, and the competitive advantage of the proprietary model is permanently destroyed.

What went wrong: Per-account query volume was normal. Per-session behaviour was unremarkable. No individual query was suspicious. The attack was detectable only at the aggregate level: (a) the query distribution across all accounts showed a systematic, maximally-informative sampling pattern inconsistent with natural user curiosity; (b) the 5,600 accounts showed registration patterns suggesting synthetic identity generation (similar registration timing, jurisdiction distribution inconsistent with customer demographics); (c) the collective query set, when analysed as a single corpus, revealed an obvious systematic extraction methodology. The organisation had no aggregate query analysis that would detect these estate-level patterns.
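One way to surface signal (a) is to compare a cohort's query-topic distribution against the estate baseline: a systematic sampler spreads probability mass far more evenly across the decision space than organic users, who concentrate on a few popular topics. A minimal sketch, assuming an upstream classifier has already bucketed queries into topics (the topic names, counts, and any alert threshold are illustrative assumptions, not part of this protocol):

```python
import math
from collections import Counter

def kl_divergence(observed: dict[str, float], baseline: dict[str, float]) -> float:
    """KL(observed || baseline) over query-topic shares; higher means the
    cohort's sampling deviates more from organic user behaviour."""
    eps = 1e-9  # smoothing so topics unseen in the baseline don't divide by zero
    return sum(p * math.log((p + eps) / (baseline.get(topic, 0.0) + eps))
               for topic, p in observed.items() if p > 0)

def topic_shares(topic_counts: Counter) -> dict[str, float]:
    total = sum(topic_counts.values())
    return {t: c / total for t, c in topic_counts.items()}

# Illustrative data: an extraction cohort samples the decision space uniformly,
# while organic traffic concentrates on a handful of popular topics.
baseline = topic_shares(Counter({"btc_price": 900, "eth_price": 600, "portfolio": 300,
                                 "edge_case_a": 5, "edge_case_b": 5}))
cohort = topic_shares(Counter({"btc_price": 60, "eth_price": 55, "portfolio": 58,
                               "edge_case_a": 62, "edge_case_b": 57}))

score = kl_divergence(cohort, baseline)
print(f"divergence={score:.2f}")  # flag the cohort if the score exceeds a tuned threshold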

4. Requirement Statement

Scope: This dimension applies to any organisation operating more than one AI agent instance or any single agent accessible to more than one user or session. The scope is the agent estate — the totality of agent instances, sessions, accounts, and interactions across the organisation. Per-session and per-account security controls are assumed to exist (per AG-004, AG-003, and related dimensions) and are not replaced by this dimension. AG-436 addresses the detection gap that exists between individual session security and estate-level pattern recognition. The scope includes all interaction channels — conversational interfaces, API endpoints, tool invocations, inter-agent communications, and any other pathway through which external actors interact with the agent estate. The scope extends to detection of coordinated human actors, automated botnets, hybrid human-bot campaigns, and internal abuse by authorised users operating at anomalous scale. Organisations operating a single agent instance accessible to a single user may defer this dimension, but must re-evaluate applicability whenever access is expanded.

4.1. A conforming system MUST implement estate-level aggregate monitoring that analyses interaction patterns across all agents, sessions, accounts, and time periods to detect coordinated or automated abuse campaigns that are invisible at the individual session level.

4.2. A conforming system MUST define baseline behavioural profiles for normal interaction patterns at the estate level, including: aggregate session volume by time period, conversational pattern distributions, query topic distributions, account creation and activity rates, geographic and temporal access patterns, and content generation volumes by category.

4.3. A conforming system MUST implement anomaly detection that identifies statistically significant deviations from estate-level baselines, including: (a) volume anomalies — unusual spikes in sessions, queries, or content generation; (b) pattern anomalies — clusters of sessions following statistically similar interaction flows; (c) account anomalies — registration patterns suggesting synthetic identity generation or coordinated account creation; (d) temporal anomalies — activity patterns inconsistent with human behaviour (e.g., sustained high-frequency interaction with no breaks); and (e) content anomalies — generated content clustering around specific prohibited or sensitive topics across multiple accounts.
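A minimal sketch of the volume-anomaly case in (a), scoring each time window's session count as a z-score against a rolling baseline of recent windows. The window size, history length, and 4-sigma threshold are illustrative, and the pattern, account, temporal, and content categories each need their own detectors:

```python
import statistics
from collections import deque

class VolumeAnomalyDetector:
    """Estate-level volume anomaly check: compare the current window's session
    count against a rolling baseline of recent windows."""

    def __init__(self, history_windows: int = 168, z_threshold: float = 4.0):
        self.history = deque(maxlen=history_windows)  # e.g. 168 hourly windows = 7 days
        self.z_threshold = z_threshold

    def observe(self, window_count: int) -> bool:
        """Record one window's count; return True if it is anomalous."""
        anomalous = False
        if len(self.history) >= 24:  # need enough history to form a baseline
            mean = statistics.fmean(self.history)
            std = statistics.pstdev(self.history) or 1.0  # avoid division by zero
            anomalous = (window_count - mean) / std > self.z_threshold
        self.history.append(window_count)
        return anomalous

detector = VolumeAnomalyDetector()
for count in [410, 395, 402, 388, 420, 405] * 5:  # warm-up on normal traffic
    detector.observe(count)
print(detector.observe(3_000))  # True: a Scenario A-style spike in verification sessions
```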

4.4. A conforming system MUST implement automated alerting when estate-level anomaly detection thresholds are breached, with alerts routed to security operations within a maximum latency defined by risk tier: 15 minutes for critical-tier agents, 60 minutes for high-tier agents, and 4 hours for standard-tier agents.
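The tiered latency budget reduces naturally to declarative configuration. A sketch assuming the three tier labels from this clause (the function and variable names are placeholders):

```python
from datetime import datetime, timedelta, timezone

# Maximum alert latency by risk tier, per requirement 4.4.
ALERT_SLA = {
    "critical": timedelta(minutes=15),
    "high": timedelta(minutes=60),
    "standard": timedelta(hours=4),
}

def alert_deadline(agent_tier: str, detected_at: datetime) -> datetime:
    """Latest time by which security operations must receive the alert."""
    return detected_at + ALERT_SLA[agent_tier]

now = datetime.now(timezone.utc)
print(alert_deadline("critical", now))  # detection time + 15 minutes
```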

4.5. A conforming system MUST maintain the capability to correlate activity across accounts, sessions, IP addresses, client fingerprints, and temporal patterns to identify coordinated campaigns operating through distributed identities.
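Correlation across distributed identities is essentially connected components over shared indicators: accounts that touch the same IP address or client fingerprint collapse into one cluster. A minimal union-find sketch (indicator formats are illustrative; a production system would also weight links by time proximity and indicator rarity):

```python
class IdentityGraph:
    """Union-find over identity indicators: accounts sharing an IP address or
    client fingerprint merge into one cluster, exposing distributed campaigns."""

    def __init__(self):
        self.parent: dict[str, str] = {}

    def find(self, x: str) -> str:
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: str, b: str) -> None:
        self.parent[self.find(a)] = self.find(b)

graph = IdentityGraph()
# Each session links an account to the indicators it was observed with.
for account, indicators in [
    ("acct_1", ["ip:203.0.113.7", "fp:ab12"]),
    ("acct_2", ["ip:203.0.113.7"]),              # shares an IP with acct_1
    ("acct_3", ["fp:ab12", "ip:198.51.100.9"]),  # shares a fingerprint with acct_1
]:
    for ind in indicators:
        graph.union(account, ind)

print(graph.find("acct_2") == graph.find("acct_3"))  # True: same coordination cluster
```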

4.6. A conforming system MUST implement graduated response capabilities that can be activated when abuse-at-scale is detected, including: (a) enhanced monitoring for the affected agent or interaction pattern; (b) increased authentication or verification requirements; (c) rate reduction or temporary suspension for affected accounts or session clusters; and (d) estate-wide defensive posture escalation.
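Graduated response implies an ordered escalation ladder. A sketch of one possible encoding, assuming escalation proceeds one step at a time while the campaign persists (the step names map to (a) through (d) above; everything else is an assumption):

```python
from enum import IntEnum

class Posture(IntEnum):
    """Graduated responses from 4.6, ordered from least to most disruptive."""
    ENHANCED_MONITORING = 1   # (a) watch the affected agent or pattern more closely
    STEP_UP_VERIFICATION = 2  # (b) require stronger authentication
    RATE_REDUCTION = 3        # (c) throttle or suspend the affected session cluster
    ESTATE_LOCKDOWN = 4       # (d) estate-wide defensive posture escalation

def next_posture(current: Posture, campaign_still_active: bool) -> Posture:
    """Escalate one step at a time while the campaign persists; de-escalation
    (not shown) should be a deliberate human decision, not automatic."""
    if campaign_still_active and current < Posture.ESTATE_LOCKDOWN:
        return Posture(current + 1)
    return current

posture = Posture.ENHANCED_MONITORING
posture = next_posture(posture, campaign_still_active=True)
print(posture.name)  # STEP_UP_VERIFICATION
```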

4.7. A conforming system MUST conduct post-incident analysis for every detected abuse-at-scale campaign, producing a documented analysis that includes: campaign scope (accounts, sessions, time period, volume), attack methodology, detection timeline (when the campaign began versus when it was detected), impact assessment, and defensive improvement recommendations.

4.8. A conforming system SHOULD implement behavioural clustering algorithms that group sessions by interaction pattern similarity, enabling detection of coordinated campaigns where individual sessions are benign but the cluster pattern reveals coordination.
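A minimal sketch of behavioural clustering using density-based clustering (DBSCAN from scikit-learn) over per-session feature vectors; the features, normalisation, and DBSCAN parameters are illustrative choices, not prescribed by this clause:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Each session is reduced to a small behavioural feature vector; the features
# here (turn count, mean inter-message gap in seconds, fraction of turns that
# enter the identity-verification flow) are illustrative.
sessions = np.array([
    # organic traffic: varied behaviour
    [12, 45.0, 0.10], [3, 120.0, 0.00], [25, 20.0, 0.05], [7, 80.0, 0.00],
    # campaign traffic: near-identical scripted flows (Scenario A style)
    [4, 9.8, 0.75], [4, 10.1, 0.75], [4, 9.9, 0.74], [4, 10.0, 0.76],
    [4, 10.2, 0.75], [4, 9.7, 0.75],
])

# Normalise features so no single scale dominates the distance metric.
normalised = (sessions - sessions.mean(axis=0)) / sessions.std(axis=0)

labels = DBSCAN(eps=0.5, min_samples=4).fit_predict(normalised)
for cluster in set(labels) - {-1}:  # -1 is DBSCAN's noise label
    members = (labels == cluster).sum()
    print(f"coordination cluster {cluster}: {members} near-identical sessions")
```

Organic sessions land in the noise bucket because no four of them are mutually similar; the scripted sessions form a dense cluster even though each one, inspected alone, looks like a short, normal conversation.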

4.9. A conforming system SHOULD implement cross-agent correlation that detects campaigns spanning multiple agent types — e.g., an attacker probing safety boundaries on a low-risk internal copilot and applying discovered techniques against a high-risk customer-facing agent.

4.10. A conforming system SHOULD integrate abuse-at-scale detection with external threat intelligence feeds to identify known botnet infrastructure, compromised credential databases, and coordinated attack campaign indicators.

4.11. A conforming system MAY implement predictive detection that identifies emerging campaigns in their early phases (e.g., the boundary-probing Phase 1 in Scenario B) before the campaign reaches the exploitation phase, enabling pre-emptive defensive action.

5. Rationale

Abuse-at-scale is the natural evolution of adversarial attacks against AI agents as agent deployments grow from experimental single-instance deployments to production estates serving millions of users. The transition from individual attacks to coordinated campaigns mirrors the evolution of cybersecurity threats against traditional web applications — the same progression from manual exploitation to automated botnet-driven attacks, from single-account fraud to distributed credential stuffing, from individual vulnerability probing to systematic attack surface mapping.

AI agent estates face a specific vulnerability to scale attacks because of the fundamental asymmetry between per-session defences and estate-level visibility. Per-session controls — rate limiting, input validation, prompt integrity, output filtering — are designed to protect individual interactions. They evaluate each session in isolation. An attacker who distributes an attack across thousands of sessions, each of which individually appears benign, can bypass every per-session control while conducting a devastating campaign at the aggregate level. This is not a failure of per-session controls; it is a limitation of their architectural scope. AG-436 addresses this limitation by requiring detection capabilities that operate at the estate level, analysing patterns that only become visible when individual sessions are correlated.

The economic incentive for abuse-at-scale is substantial. A single jailbreak produces one piece of prohibited content; a coordinated campaign produces thousands. A single credential-stuffing session compromises one account; a botnet campaign compromises hundreds. A single extraction query yields minimal intellectual property; 2.8 million coordinated queries can reconstruct an entire proprietary model. The attacker's return on investment from scale attacks is orders of magnitude higher than from individual attacks, making scale attacks the preferred methodology for sophisticated adversaries.

Three categories of abuse-at-scale require distinct detection approaches. First, automated abuse by botnets: high-volume, machine-speed attacks using compromised infrastructure. Detection relies on temporal pattern analysis, client fingerprinting, and behavioural indicators of automation (consistent timing, identical error patterns, lack of human behavioural noise). Second, coordinated human abuse: networks of human operators conducting a campaign with shared objectives but individual execution. Detection relies on content similarity analysis, conversational pattern clustering, and temporal correlation that reveals coordination. Third, hybrid campaigns: automated infrastructure directed by human operators, combining the volume of bots with the adaptability of humans. Detection requires combining the indicators from both automated and coordinated human detection.
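For the first category, one cheap automation indicator is timing regularity: scripted clients tend to produce near-constant inter-message gaps, whereas humans are bursty. A sketch (the interpretation of low variation as automation is a heuristic; Scenario A shows that engineered botnets inject noise precisely to defeat it, so treat this as one signal among several):

```python
import statistics

def automation_score(inter_arrival_seconds: list[float]) -> float:
    """Coefficient of variation of inter-message gaps. Human conversations are
    irregular (high CV); naive scripted clients are suspiciously regular
    (CV near zero)."""
    mean = statistics.fmean(inter_arrival_seconds)
    return statistics.pstdev(inter_arrival_seconds) / mean

human = [4.2, 19.7, 7.1, 31.0, 12.5, 3.3]  # bursty, with long pauses
bot = [5.0, 5.1, 4.9, 5.0, 5.1, 5.0]       # metronomic pacing

print(f"human CV={automation_score(human):.2f}")  # high variability
print(f"bot   CV={automation_score(bot):.2f}")    # near zero: likely automated
```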

Regulatory expectations for scale-attack detection are increasing. The EU AI Act requires that high-risk AI systems be resilient to adversarial manipulation — a requirement that encompasses manipulation at scale, not only manipulation of individual sessions. Financial regulators expect fraud detection capabilities proportionate to the attack surface — an agent accessible to millions of users has an attack surface that demands estate-level monitoring. The EU Digital Services Act requires platforms to implement measures to address systemic risks, including the misuse of AI systems for the generation of prohibited content at scale.

6. Implementation Guidance

Abuse-at-scale detection requires a fundamentally different architectural approach from per-session security. Per-session controls operate inline — they evaluate each request as it arrives and make accept/reject decisions in real time. Estate-level detection operates on aggregated data — it collects interaction telemetry from across the estate, analyses patterns over time windows, and identifies anomalies that are invisible in any single session. The two approaches are complementary, not competing: per-session controls provide the first line of defence, and estate-level detection identifies campaigns that per-session controls cannot see.
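The architectural split can be seen in miniature below: the inline path only emits telemetry records, and a separate aggregator maintains estate-wide state over a sliding window for detectors to query, adding no latency to any individual request. A sketch with illustrative field names and a 5-minute window:

```python
from collections import defaultdict, deque
from dataclasses import dataclass

@dataclass
class Interaction:
    """One telemetry record emitted by the inline path (fields illustrative)."""
    timestamp: float  # epoch seconds
    account: str
    agent_id: str
    topic: str

class SlidingWindowAggregator:
    """Out-of-band aggregation: maintains estate-wide counts over a sliding
    window so detectors can query aggregate state off the request path."""

    def __init__(self, window_seconds: float = 300.0):  # 5-minute window
        self.window = window_seconds
        self.events: deque[Interaction] = deque()
        self.by_topic: defaultdict[str, int] = defaultdict(int)

    def ingest(self, event: Interaction) -> None:
        self.events.append(event)
        self.by_topic[event.topic] += 1
        self._evict(event.timestamp)

    def _evict(self, now: float) -> None:
        while self.events and self.events[0].timestamp < now - self.window:
            old = self.events.popleft()
            self.by_topic[old.topic] -= 1

    def topic_counts(self) -> dict[str, int]:
        return {t: c for t, c in self.by_topic.items() if c > 0}

agg = SlidingWindowAggregator()
agg.ingest(Interaction(0.0, "acct_1", "support_agent", "password_reset"))
agg.ingest(Interaction(120.0, "acct_2", "support_agent", "password_reset"))
print(agg.topic_counts())  # {'password_reset': 2} -> fed to estate-level detectors
```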

Recommended patterns:

- Centralise interaction telemetry from every channel (conversational, API, tool invocation, inter-agent) into a single analytics pipeline before attempting pattern detection.
- Run detection over sliding windows at multiple time scales: minutes for fast-moving botnet campaigns (Scenario A), weeks for slow extraction campaigns (Scenario C).
- Cluster sessions by behavioural similarity rather than volume alone; coordination usually manifests as similarity across sessions, not as volume in any one of them.
- Analyse account lifecycles (registration timing, jurisdiction distribution, activation patterns) to surface synthetic identity generation.
- Pre-define graduated response playbooks so that detection can trigger containment without waiting for a bespoke decision.
- Integrate external threat intelligence covering known botnet infrastructure and breached credential sets.

Anti-patterns to avoid:

- Treating per-session and per-account controls as sufficient: every scenario in Section 3 passes those controls while the campaign succeeds.
- Alerting on aggregate volume alone: distributed campaigns are engineered to keep each account and session within normal bounds.
- Static baselines that are never recalibrated, which drift into noise or go blind to slow-moving campaigns.
- Detection without pre-authorised response: identifying a campaign has little value if containment requires an improvised decision chain.
- Calibrating thresholds without regard to populations whose legitimate activity naturally clusters (see Public Sector below), producing discriminatory false positives.

Industry Considerations

Financial Services. Financial agent estates are primary targets for credential stuffing, account takeover, and distributed fraud campaigns. Financial institutions should implement the most aggressive detection timelines (15-minute alerting latency), mandatory account creation anomaly detection, and integration with existing fraud detection systems. Cross-correlation between agent interaction patterns and traditional transaction fraud indicators provides powerful detection of campaigns that use agent conversational interfaces as the entry point for financial fraud.

Content Platforms. Platforms deploying content generation agents face the specific risk of coordinated content generation campaigns (Scenario B). Content similarity analysis across the estate — detecting clusters of generated content converging on specific topics or themes — is essential. Integration with content moderation systems and external content intelligence feeds enables detection of campaigns producing harmful content that individually passes content filters but collectively constitutes a systematic campaign.
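A lightweight form of content similarity analysis is Jaccard similarity over word n-gram shingles: outputs generated from the same jailbreak template share structure that independently produced content does not. A sketch (at estate scale one would replace exact set comparison with MinHash/LSH; all text, the shingle size, and any flagging threshold are illustrative):

```python
def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Word n-grams used as the similarity fingerprint (n=3 is illustrative)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

# Illustrative: two outputs from different accounts produced by the same
# template share far more shingles than unrelated content would.
doc_a = "join our movement today the cause needs strong believers like you"
doc_b = "join our movement today the cause needs loyal believers like you"
doc_c = "here is a summary of this week's local sports results and fixtures"

print(f"{jaccard(shingles(doc_a), shingles(doc_b)):.2f}")  # high: likely same template
print(f"{jaccard(shingles(doc_a), shingles(doc_c)):.2f}")  # ~0: unrelated content
```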

Crypto/Web3. Crypto agents face heightened extraction risk because trading strategies and market analysis represent high-value intellectual property. Query distribution analysis is particularly important for detecting systematic extraction campaigns. Additional consideration: on-chain transaction analysis can reveal coordination among accounts that interact with both the agent and related blockchain protocols.

Public Sector. Government agents face the risk of coordinated manipulation campaigns that seek to influence benefits decisions, immigration processing, or other rights-affecting outcomes at scale. Scale detection must account for the possibility that coordinated campaigns target vulnerable populations whose applications may already appear similar, requiring careful calibration to avoid discriminatory false positives.

Maturity Model

Basic Implementation — The organisation aggregates interaction telemetry from all agents into a centralised analytics pipeline. Estate-level baselines are established for session volume, account creation rate, and query topic distribution. Anomaly detection triggers alerts when metrics deviate beyond defined thresholds. Alerts are routed to security operations within the required latency. Post-incident analysis is conducted for detected campaigns. Graduated response capabilities include manual account suspension and rate adjustment. This level detects volume-based attacks and obvious coordination patterns.

Intermediate Implementation — All basic capabilities plus: behavioural clustering groups sessions by interaction pattern similarity and detects coordination clusters. Account creation and lifecycle anomaly detection identifies synthetic identity patterns. Query distribution analysis detects systematic extraction campaigns. Cross-agent correlation detects campaigns spanning multiple agent types. Near-real-time detection operates on 5-minute sliding windows. Graduated response is partially automated with pre-defined playbooks. External threat intelligence feeds are integrated. Detection is tested quarterly against simulated scale attack scenarios.

Advanced Implementation — All intermediate capabilities plus: predictive detection identifies emerging campaigns in early phases before exploitation begins. Machine learning models continuously adapt baselines to seasonal and contextual variation while maintaining sensitivity to adversarial patterns. Full automation of graduated response with human-in-the-loop for escalation decisions. Real-time dashboards provide estate-level threat visibility. Red team exercises simulate novel scale attack methodologies not covered by known patterns. Cross-organisational threat intelligence sharing contributes to and benefits from collective detection capabilities. Detection latency for fast-moving campaigns (Scenario A) is under 30 minutes from campaign onset.

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Distributed Session Pattern Detection

Test 8.2: Account Creation Anomaly Detection

Test 8.3: Query Distribution Extraction Detection

Test 8.4: Graduated Response Activation

Test 8.5: Cross-Agent Correlation Detection

Test 8.6: Baseline Calibration and False Positive Measurement

Test 8.7: Post-Incident Analysis Completeness

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Direct requirement
EU AI Act | Article 15 (Accuracy, Robustness and Cybersecurity) | Direct requirement
SOX | Section 404 (Internal Controls Over Financial Reporting) | Supports compliance
FCA SYSC | 6.1.1R (Systems and Controls) | Direct requirement
NIST AI RMF | MANAGE 2.2, MANAGE 4.1 | Supports compliance
ISO 42001 | Clause 6.1 (Actions to Address Risks) | Supports compliance
DORA | Article 9 (ICT Risk Management Framework), Article 17 (ICT-Related Incident Management Process) | Direct requirement

EU AI Act — Article 9 (Risk Management System) and Article 15 (Accuracy, Robustness and Cybersecurity)

The EU AI Act requires that high-risk AI systems be protected against adversarial manipulation. Abuse-at-scale — coordinated campaigns that exploit AI agent estates using botnets, synthetic accounts, or organised human networks — is a form of adversarial manipulation that operates at a level of sophistication and volume that individual-session defences cannot address. Article 9 requires a risk management system that identifies and mitigates risks throughout the AI system's lifecycle. The risk of coordinated exploitation is a foreseeable risk for any AI system accessible to multiple users, and the absence of estate-level detection means this risk is neither identified nor mitigated. Article 15's cybersecurity requirement explicitly encompasses the resilience of the AI system against coordinated attacks, not only individual exploitation attempts. An organisation that demonstrates per-session security but cannot detect or respond to a 12,000-session botnet campaign has not met Article 15's robustness standard.

FCA SYSC — 6.1.1R (Systems and Controls)

The FCA expects that financial firms maintain systems and controls proportionate to the risks they face. For firms deploying AI agents accessible to customers, the risk of coordinated abuse campaigns — credential stuffing, distributed fraud, systematic extraction of proprietary models — is material and well-documented. The FCA's operational resilience framework requires that firms can detect and respond to threats to their critical services, including threats that operate through their customer-facing AI systems. AG-436 provides the specific detection and response controls for abuse-at-scale against agent estates. Firms that implement per-session controls but lack estate-level detection face supervisory challenge when a coordinated campaign causes customer harm that could have been detected earlier.

DORA — Article 9 (ICT Risk Management Framework) and Article 17 (ICT-Related Incident Management Process)

DORA requires financial entities to implement ICT risk management that includes the detection of anomalous activities and ICT-related incidents. A coordinated abuse-at-scale campaign against an AI agent estate is an ICT-related incident that must be detected, managed, and reported. Article 17 requires that entities have processes for detecting, classifying, and responding to ICT-related incidents. AG-436's graduated response playbooks and post-incident analysis requirements directly support DORA Article 17 compliance by ensuring that abuse-at-scale campaigns are detected as incidents, responded to through defined processes, and analysed for continuous improvement. The alerting latency requirements (15 minutes for critical-tier agents) align with DORA's expectation of timely incident detection.

SOX — Section 404 (Internal Controls Over Financial Reporting)

Coordinated campaigns against financial agents can directly affect financial reporting accuracy — fraudulent transactions initiated through botnet credential stuffing (Scenario A), manipulation of financial decision-making at scale, or extraction of proprietary financial models. SOX requires that internal controls are effective at preventing material misstatement. Estate-level detection that identifies and arrests coordinated fraud campaigns is a necessary component of the internal control environment for organisations whose financial processing relies on AI agents. The absence of scale detection means that a coordinated fraud campaign could process thousands of fraudulent transactions before detection, potentially creating material financial statement impact.

10. Failure Severity

Field | Value
Severity Rating | Critical
Blast Radius | Estate-wide; affects all agents, accounts, and users within the exploited scope; may extend to partner organisations, financial counterparties, and downstream systems that process agent outputs

Consequence chain: The absence of abuse-at-scale detection creates a blind spot between per-session security and estate-level visibility. An adversary — whether a botnet operator, organised fraud network, or state-sponsored actor — identifies that individual agent sessions are well-protected but that no correlation exists across sessions. The adversary designs a campaign that distributes activity across thousands of sessions, each individually benign, collectively devastating. The campaign executes over hours (Scenario A: £1.72 million in financial losses within 4 hours), weeks (Scenario B: 45,000 pieces of extremist content over 14 days with £8.3 million in revenue impact), or months (Scenario C: £14 million in intellectual property extraction over 6 months). Per-session controls pass every individual interaction. The organisation's security team sees no alerts. The campaign is discovered only through secondary effects — downstream fraud detection, external researcher reports, or competitor product launches — by which time the damage is complete and largely irreversible. The business consequences cascade: direct financial losses from fraud or IP theft, regulatory enforcement for inadequate controls (GDPR fines, FCA enforcement, DORA incident reporting failures), reputational damage from public disclosure of the exploitation, customer remediation costs, and the operational cost of emergency response and system hardening. For organisations with large agent estates serving millions of users, a single undetected abuse-at-scale campaign can produce losses exceeding £10 million and trigger regulatory actions that constrain future AI deployment. The severity is rated Critical because the failure is both high-impact and silent — the organisation has no mechanism to detect the damage while it is occurring, and the delay between exploitation and discovery maximises the attacker's return and the organisation's loss.

Cross-references: AG-004 (Action Rate Governance), AG-003 (Adversarial Coordination Detection), AG-429 (Social Engineering Attack Simulation Governance), AG-432 (Model Exfiltration Throttling Governance), AG-434 (Covert Channel Detection Governance), AG-437 (Economic Abuse Resistance Governance), AG-438 (Jailbreak Pattern Library Governance), AG-414 (Alert Deduplication Governance), AG-418 (Cross-System Trace Correlation Governance).

Cite this protocol
AgentGoverning. (2026). AG-436: Abuse-at-Scale Detection Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-436