AG-408

Infrastructure Drift Detection Governance

Infrastructure, Platform & Network · ~20 min read · AGS v2.1 · April 2026

EU AI Act · GDPR · SOX · FCA · NIST · HIPAA · ISO 42001

2. Summary

Infrastructure Drift Detection Governance requires that organisations operating AI agent systems continuously monitor their deployed infrastructure for unauthorised divergence from approved baselines — declared, version-controlled specifications of what the infrastructure should look like. Infrastructure drift occurs when the actual state of compute resources, network configurations, storage parameters, access policies, or runtime environments deviates from the approved state without a corresponding approved change record. Drift can be introduced through manual ad-hoc modifications, failed automation, partial rollbacks, external compromise, or resource provider changes. For AI agent environments, drift is particularly dangerous because it can silently alter the security posture, performance characteristics, or compliance status of the infrastructure on which governance controls depend. This dimension mandates systematic baseline management, continuous drift scanning, and policy-driven remediation workflows that ensure detected drift is classified, reported, and resolved before it undermines agent governance.

3. Example

Scenario A — Firewall Rule Drift Exposes Model Inference Endpoint: A financial-services AI agent serves trade-recommendation inference through an internal API. The infrastructure baseline specifies that the inference endpoint is accessible only from the application tier's private subnet (10.0.4.0/24) via port 8443. A network engineer, troubleshooting a connectivity issue at 02:00 during an incident, adds a temporary firewall rule permitting inbound traffic from 0.0.0.0/0 on port 8443. The incident is resolved; the temporary rule is not removed. No drift detection is in place. The inference endpoint is now publicly accessible. Three weeks later, an external actor discovers the endpoint through a port scan, submits crafted inference requests, and extracts proprietary trading signals from the model's responses. Over 19 days of exploitation, the attacker front-runs 340 trades using the extracted signals, generating an estimated £4.7 million in illicit profit at the organisation's clients' expense.

What went wrong: A manual infrastructure change was made outside the change management process — a common occurrence during incident response. Without drift detection, the deviation from the approved baseline was never identified. The infrastructure state diverged from the declared state on day 1 but was not detected for 40 days. Consequence: 19 days of model exploitation, £4.7 million in client losses attributed to front-running, FCA enforcement action for inadequate market abuse controls, and reputational damage requiring client notification.

Scenario B — Resource Scaling Drift Degrades Safety-Critical Latency: A safety-critical AI agent monitors industrial gas pressures and issues emergency shutdown commands when thresholds are exceeded. The infrastructure baseline specifies dedicated compute instances with 32 vCPUs, 128 GB RAM, and a maximum inference latency of 50 milliseconds at the 99th percentile. During a cloud cost optimisation exercise, an automated right-sizing tool reduces the instance type to 8 vCPUs and 32 GB RAM based on average utilisation metrics — it does not account for burst requirements during emergency scenarios. The change is applied through the cloud provider's API but does not update the infrastructure-as-code baseline. No drift detection flags the discrepancy. Four months later, a simultaneous multi-sensor alarm condition generates 47 concurrent inference requests. The under-provisioned infrastructure delivers 99th-percentile latency of 340 milliseconds — nearly seven times the 50-millisecond requirement. Three shutdown commands arrive too late. Pressure exceedance causes a valve failure and a controlled release of industrial gas, requiring facility evacuation. No injuries occur, but the incident results in a £2.1 million remediation cost and a 14-day operational shutdown.

What went wrong: An automated tool modified infrastructure outside the governed change process. The actual infrastructure state diverged from the baseline specification without detection. The drift was not malicious but was equally dangerous: it silently degraded the safety margin on which the agent's real-time performance depended. Consequence: Safety incident with facility evacuation, £2.1 million remediation, 14-day shutdown, regulatory investigation by the safety authority.

Scenario C — IAM Policy Drift Grants Excessive Agent Permissions: A customer-facing AI agent operates with a narrowly scoped identity and access management (IAM) policy permitting read access to the product catalogue and write access to the order queue — and nothing else. During a feature development cycle, a developer grants the agent's service identity temporary administrative access to a data lake containing customer financial records, for the purpose of testing a new recommendation feature. The feature is abandoned, but the IAM policy change persists. Drift detection is not implemented. The agent now has read access to 2.3 million customer financial records — a violation of the principle of least privilege and a data protection breach waiting to happen. Six months later, a prompt injection attack exploits the agent's excessive permissions to exfiltrate 12,000 customer records through the agent's response channel.

What went wrong: A temporary development change to the infrastructure configuration was never reverted. Without drift detection comparing the actual IAM policy to the approved baseline, the excessive permission persisted undetected for 6 months. When the agent was compromised through an unrelated vector (prompt injection), the drifted IAM policy amplified a contained application-layer incident into a data breach. Consequence: 12,000 customer records exfiltrated, GDPR Article 33 breach notification to the supervisory authority, £1.8 million in notification, credit monitoring, and remediation costs, and class-action litigation.

4. Requirement Statement

Scope: This dimension applies to all infrastructure supporting AI agent systems, including but not limited to: compute resources (virtual machines, containers, serverless functions), network configurations (firewalls, load balancers, DNS records, VPN tunnels, network segmentation rules), storage systems (databases, object stores, file systems, caching layers), identity and access management policies (IAM roles, service accounts, access control lists), runtime configurations (environment variables, feature flags, resource limits), and platform services (message queues, API gateways, monitoring endpoints). The dimension covers infrastructure managed through any mechanism: infrastructure-as-code, cloud provider consoles, APIs, CLI tools, or automated optimisation systems. It applies to infrastructure in any environment that serves production AI agent traffic, including production, disaster recovery, and any staging environment from which production promotions occur. Organisations using managed or shared infrastructure must ensure drift detection covers the configuration surfaces they control.

4.1. A conforming system MUST maintain a version-controlled, authoritative infrastructure baseline that declares the approved state of all infrastructure components supporting AI agent production environments, with each baseline version linked to a change approval record.

4.2. A conforming system MUST perform automated drift scans comparing the actual infrastructure state to the approved baseline at intervals not exceeding 24 hours for all infrastructure components, and not exceeding 1 hour for security-critical components (network access rules, IAM policies, encryption settings).
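The comparison at the heart of a drift scan can be sketched as a diff between the declared baseline and the observed state. The following is a minimal illustration, not a mandated implementation; the component names and record fields are assumptions for this sketch.

```python
# Minimal sketch of a baseline-vs-actual drift scan (requirement 4.2).
# Component names and record fields are illustrative only.

def detect_drift(baseline: dict, actual: dict) -> list:
    """Compare observed infrastructure state to the approved baseline.

    Returns one drift record per component whose actual value differs
    from the baseline, is missing, or was never declared.
    """
    drift = []
    for component, approved in baseline.items():
        observed = actual.get(component)
        if observed != approved:
            drift.append({
                "component": component,
                "baseline_value": approved,
                "actual_value": observed,  # None means the component is missing
            })
    for component in actual.keys() - baseline.keys():
        drift.append({
            "component": component,
            "baseline_value": None,  # undeclared component found in production
            "actual_value": actual[component],
        })
    return drift


# Scenario A expressed as a drift record: the firewall rule has been
# widened from the private subnet to 0.0.0.0/0.
baseline = {"fw/inference-ingress": {"source": "10.0.4.0/24", "port": 8443}}
actual = {"fw/inference-ingress": {"source": "0.0.0.0/0", "port": 8443}}
events = detect_drift(baseline, actual)
```

Real scanners would read the observed state from provider APIs and the baseline from version control; the diff logic is the same.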

4.3. A conforming system MUST classify detected drift by severity — at minimum distinguishing between security-critical drift (changes to access controls, network exposure, encryption, or authentication), performance-critical drift (changes to compute resources, scaling parameters, or latency-affecting configurations), and operational drift (changes to non-security, non-performance configurations) — with documented classification criteria.
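The documented classification criteria required by 4.3 can be encoded as version-controlled rules. A simple prefix-based sketch follows; the category prefixes ("fw/", "iam/", and so on) are assumptions for illustration, not part of this dimension.

```python
# Illustrative severity classification rules per requirement 4.3.
# The component naming scheme is an assumption for this sketch.

SECURITY_CRITICAL = ("fw/", "iam/", "kms/", "auth/")      # access, network, encryption
PERFORMANCE_CRITICAL = ("compute/", "autoscaling/", "lb/")  # latency-affecting resources

def classify_drift(component: str) -> str:
    """Map a drifted component to one of the three minimum severity classes."""
    if component.startswith(SECURITY_CRITICAL):
        return "security-critical"
    if component.startswith(PERFORMANCE_CRITICAL):
        return "performance-critical"
    return "operational"
```

Keeping the rule set in version control means the classification criteria themselves are auditable and subject to change approval.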

4.4. A conforming system MUST generate alerts for all detected drift, with security-critical drift alerts delivered to the security operations function within 15 minutes of detection, and performance-critical drift alerts delivered to the operations function within 60 minutes.

4.5. A conforming system MUST record every drift event in a tamper-evident log (per AG-006), capturing: the component affected, the baseline value, the actual value, the detection timestamp, the classification severity, and the resolution action taken.
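One common way to make a drift log tamper-evident, in the spirit of AG-006, is a hash chain: each entry carries the hash of its predecessor, so any retroactive edit breaks verification. This is a sketch of the idea, not a prescribed log format.

```python
# Hash-chained drift event log (sketch of requirement 4.5).
import hashlib
import json

def append_drift_event(log: list, event: dict) -> dict:
    """Append a drift event, chaining it to the previous entry's hash."""
    prev_hash = log[-1]["entry_hash"] if log else "0" * 64
    body = dict(event, prev_hash=prev_hash)
    entry_hash = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()
    ).hexdigest()
    entry = dict(body, entry_hash=entry_hash)
    log.append(entry)
    return entry

def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = "0" * 64
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if body["prev_hash"] != prev:
            return False
        recomputed = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if recomputed != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True
```

In production the chain head would additionally be anchored in an external system so the whole log cannot be silently rewritten.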

4.6. A conforming system MUST enforce a remediation policy requiring that security-critical drift is resolved (either reverted to baseline or formally approved and the baseline updated) within 4 hours of detection, and all other drift within 72 hours.
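The alert windows in 4.4 and the remediation windows in 4.6 reduce to deadline arithmetic on the detection timestamp. A sketch, with the SLA values taken directly from those clauses:

```python
# SLA deadline calculation per requirements 4.4 (alerting) and 4.6 (remediation).
from datetime import datetime, timedelta, timezone

ALERT_SLA = {
    "security-critical": timedelta(minutes=15),
    "performance-critical": timedelta(minutes=60),
}
REMEDIATION_SLA = {
    "security-critical": timedelta(hours=4),
    "default": timedelta(hours=72),  # all other drift
}

def sla_deadlines(severity: str, detected_at: datetime) -> dict:
    """Return the alert and remediation deadlines for a drift event."""
    deadlines = {
        "remediate_by": detected_at
        + REMEDIATION_SLA.get(severity, REMEDIATION_SLA["default"])
    }
    if severity in ALERT_SLA:
        deadlines["alert_by"] = detected_at + ALERT_SLA[severity]
    return deadlines
```

Tracking these deadlines per event makes SLA breaches themselves detectable and reportable.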

4.7. A conforming system MUST prevent baseline modifications without a corresponding change approval record — the baseline cannot be updated to match drifted state as a substitute for remediation unless the change has been formally reviewed and approved.

4.8. A conforming system SHOULD implement real-time drift detection through event-driven monitoring of infrastructure change APIs, supplementing periodic scans with immediate detection of changes as they occur.

4.9. A conforming system SHOULD implement automated remediation for predefined low-risk drift categories, with automatic reversion to baseline for changes that match known safe-to-revert patterns (e.g., tag modifications, non-functional metadata changes).
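The safe-to-revert patterns in 4.9 can be expressed as a version-controlled allowlist against which each drift event is matched: matches are reverted automatically, everything else is escalated to a human. The patterns below are illustrative assumptions.

```python
# Automated remediation partition (sketch of requirement 4.9).
import fnmatch

# Patterns considered safe to auto-revert; illustrative examples only.
SAFE_TO_REVERT = ["tags/*", "metadata/*"]

def auto_remediate(drift_events: list) -> tuple[list, list]:
    """Split drift events into auto-reverted and human-escalated sets."""
    reverted, escalated = [], []
    for event in drift_events:
        component = event["component"]
        if any(fnmatch.fnmatch(component, p) for p in SAFE_TO_REVERT):
            reverted.append(component)   # restore baseline value automatically
        else:
            escalated.append(component)  # open a remediation ticket
    return reverted, escalated
```

Keeping the allowlist narrow is deliberate: auto-reverting a security-critical component could itself mask an incident that warrants investigation.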

4.10. A conforming system MAY implement predictive drift analysis, identifying infrastructure components that are statistically likely to drift based on historical patterns, and applying enhanced monitoring to those components.
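The predictive analysis in 4.10 can begin with something as simple as counting historical drift events per component and flagging repeat offenders for enhanced monitoring. The threshold here is an arbitrary assumption for the sketch.

```python
# Naive predictive drift analysis (sketch of requirement 4.10): flag
# components with repeated historical drift. Threshold is an assumption.
from collections import Counter

def high_risk_components(history: list, threshold: int = 3) -> set:
    """Return components whose historical drift count meets the threshold."""
    counts = Counter(event["component"] for event in history)
    return {c for c, n in counts.items() if n >= threshold}
```

More sophisticated implementations would weight by severity and recency, but even frequency counting surfaces the components where manual changes habitually recur.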

5. Rationale

Infrastructure is the foundation on which all AI agent governance controls rest. Access controls, audit logging, network isolation, encryption, latency guarantees, and resource availability are all infrastructure-layer properties. When the infrastructure drifts from its approved state, these foundational properties may silently change — and every governance control that depends on them is compromised. Drift is particularly dangerous because it is often invisible: the agent continues to operate, the application continues to serve requests, and no error is raised. The system appears normal while its security posture, performance profile, or compliance status has fundamentally changed.

Four forces drive infrastructure drift. First, manual ad-hoc changes made during incidents, debugging, or experimentation. These changes are made with legitimate intent but bypass change management processes. They are the most common source of drift and the hardest to prevent entirely — incident response sometimes requires immediate infrastructure modification. The governance response is not to prevent all manual changes but to detect them promptly. Second, automated optimisation tools that modify resource allocations based on utilisation metrics without consulting governance baselines. Cloud cost optimisation, autoscaling adjustments, and provider-side maintenance can all modify infrastructure configuration. Third, partial failures in infrastructure-as-code application — where an automation run partially succeeds, modifying some components but failing on others, leaving the infrastructure in a state that matches neither the old nor the new baseline. Fourth, malicious modification by external attackers or insider threats, where infrastructure changes are the attack vector rather than the consequence.

The regulatory case for drift detection is direct. DORA Article 9 requires financial entities to identify and manage ICT risks, including risks arising from configuration changes. The EU AI Act Article 15 requires robustness — infrastructure drift that degrades security or performance is a robustness failure. SOX Section 404 requires internal controls including change management and configuration management. ISO 27001 Clause A.8.9 (Configuration Management) and A.8.32 (Change Management) explicitly require that configuration deviations are detected and addressed. The FCA's SYSC 6.1.1R requires adequate systems and controls — infrastructure that silently deviates from its approved state is not adequately controlled.

For AI agent systems, drift has compound effects. A network rule change may expose an inference endpoint (Scenario A). A compute scaling change may violate latency constraints on which safety depends (Scenario B). An IAM drift may amplify the impact of an application-layer attack from a contained incident to a data breach (Scenario C). Each of these failures originated in infrastructure drift but manifested as an AI governance failure. Drift detection is therefore not merely an infrastructure operation — it is a governance control that protects the integrity of all other controls.

6. Implementation Guidance

Infrastructure Drift Detection Governance requires a continuous closed loop: declare the desired state, measure the actual state, compare, classify, alert, and remediate. The core principle is that the infrastructure baseline is the single source of truth, and any divergence from it is a governance event requiring investigation and resolution.
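One iteration of that closed loop can be sketched as a pipeline over pluggable stages. The stage functions passed in below are illustrative stand-ins, not prescribed interfaces.

```python
# One iteration of the closed loop: measure, compare, classify, alert,
# remediate. The stage callables are illustrative stand-ins.

def run_drift_cycle(baseline: dict, fetch_actual, classify, alert, remediate) -> list:
    actual = fetch_actual()                       # measure the live state
    drift = [
        {"component": c, "baseline": v, "actual": actual.get(c)}
        for c, v in baseline.items()
        if actual.get(c) != v                     # compare to the baseline
    ]
    for event in drift:
        event["severity"] = classify(event["component"])  # classify
        alert(event)                              # alert the right function
        remediate(event)                          # revert or formalise
    return drift


# Minimal usage: a Scenario-B-style compute downgrade surfaces as
# performance-critical drift.
baseline = {"compute/inference": "32vcpu-128gb"}
drift = run_drift_cycle(
    baseline,
    fetch_actual=lambda: {"compute/inference": "8vcpu-32gb"},
    classify=lambda c: "performance-critical" if c.startswith("compute/") else "operational",
    alert=print,             # stand-in for paging the operations function
    remediate=lambda e: None,
)
```

The declare step is deliberately absent from the loop: the baseline is an input, maintained only through the governed change process.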

Recommended patterns:

Anti-patterns to avoid:

Industry Considerations

Financial Services. Financial regulators expect firms to maintain controlled, auditable infrastructure environments. Drift in network configurations can expose trading systems or customer data (Scenario A). Drift in compute resources can affect transaction processing latency, potentially violating best execution obligations. Firms should implement real-time drift detection for all security-critical components and daily scanning for all other components. The FCA's operational resilience expectations (PS21/3) specifically require firms to remain within impact tolerances — infrastructure drift that pushes a service outside its impact tolerance is a regulatory breach.

Safety-Critical and Cyber-Physical Systems. Infrastructure drift in safety-critical environments can have life-safety consequences (Scenario B). Compute resource drift may violate real-time processing requirements. Network drift may disrupt safety communication paths. Organisations operating safety-critical AI agents should implement the tightest drift detection intervals (continuous or hourly for all components) and the shortest remediation SLAs (immediate reversion for any drift that could affect safety functions).

Healthcare. Healthcare AI infrastructure must comply with data protection regulations (HIPAA, GDPR Article 32) that require appropriate technical measures. Drift in encryption settings, access controls, or network isolation for systems processing patient data constitutes a technical measures failure. Drift detection must cover all infrastructure supporting healthcare AI agents with particular attention to data access and encryption configurations.

Crypto and Web3. Decentralised agent infrastructure often includes blockchain nodes, key management infrastructure, and bridge components where drift can result in irrecoverable financial loss. A drifted configuration on a validator node or a signing service can expose private keys or enable unauthorised transactions. Real-time drift detection with automated reversion is strongly recommended for all security-critical Web3 infrastructure components.

Maturity Model

Basic Implementation — The organisation maintains a version-controlled infrastructure baseline covering all production AI agent infrastructure components. Automated drift scans run at least daily for all components and at least hourly for security-critical components. Detected drift is classified by severity, alerts are generated, and remediation tickets are created. Drift events are logged in a tamper-evident log. Baseline modifications require change approval. This meets the minimum mandatory requirements.

Intermediate Implementation — All basic capabilities plus: event-driven detection supplements periodic scans, providing near-real-time drift detection through infrastructure change event feeds. A drift classification engine applies version-controlled rules to categorise drift automatically. Remediation workflows include one-click reversion and formalisation options. Automated remediation handles low-risk drift categories. Drift metrics (detection count, mean time to remediation, repeat drift rates) are tracked and reported to governance stakeholders monthly.

Advanced Implementation — All intermediate capabilities plus: continuous real-time monitoring of all infrastructure layers with sub-minute detection latency. Predictive drift analysis identifies high-risk components and applies enhanced monitoring. Drift detection is integrated with the build pipeline attestation system (AG-407) to correlate infrastructure changes with deployment events. Cross-environment baseline comparison detects configuration inconsistencies between production and disaster recovery environments. The organisation can demonstrate through independent audit that no infrastructure drift persists undetected beyond the defined scan interval.

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Baseline Declaration and Version Control

Test 8.2: Drift Detection — Security-Critical Component

Test 8.3: Drift Detection — Performance-Critical Component

Test 8.4: Remediation SLA Enforcement — Security-Critical

Test 8.5: Drift Event Logging Integrity

Test 8.6: Baseline Modification Without Approval — Rejection

Test 8.7: Multi-Layer Drift Coverage Verification

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Supports compliance
EU AI Act | Article 15 (Accuracy, Robustness and Cybersecurity) | Direct requirement
SOX | Section 404 (Internal Controls Over Financial Reporting) | Supports compliance
FCA SYSC | 6.1.1R (Systems and Controls) | Direct requirement
NIST AI RMF | GOVERN 1.4, MANAGE 2.2, MANAGE 4.1 | Supports compliance
ISO 42001 | Clause 6.1 (Actions to Address Risks), Clause 9.1 (Monitoring, Measurement, Analysis and Evaluation) | Supports compliance
DORA | Article 9 (ICT Risk Management Framework), Article 10 (ICT Risk Management Tools) | Direct requirement

EU AI Act — Article 15 (Accuracy, Robustness and Cybersecurity)

Article 15 requires that high-risk AI systems are resilient as regards attempts by unauthorised third parties to alter their use, outputs, or performance by exploiting system vulnerabilities. Infrastructure drift is a system vulnerability: it creates gaps between the approved security posture and the actual security posture. An AI system running on drifted infrastructure cannot be considered robust because its operational characteristics may differ from those that were tested and approved. Organisations must demonstrate continuous monitoring for infrastructure drift as part of their Article 15 robustness assurance.

DORA — Article 9 and Article 10

DORA Article 9 requires financial entities to establish ICT risk management frameworks that identify and protect against ICT risks. Infrastructure drift is an ICT risk — it represents an uncontrolled change to the ICT environment. Article 10 specifically requires tools for detecting anomalous activities including ICT-related incidents that may materialise. Infrastructure drift detection is a direct implementation of this requirement. Financial entities must demonstrate continuous drift monitoring as part of their DORA compliance.

SOX — Section 404 (Internal Controls Over Financial Reporting)

For AI agents that participate in financial reporting processes, infrastructure drift can undermine the controls that ensure reporting integrity. A drifted access control may permit unauthorised data modification. A drifted compute configuration may degrade processing accuracy. SOX auditors will assess whether infrastructure configurations are monitored and controlled. Drift detection provides the evidence that infrastructure controls remain effective between audit cycles.

FCA SYSC — 6.1.1R (Systems and Controls)

The FCA requires firms to take reasonable care to establish and maintain systems and controls appropriate to their business. Infrastructure that silently deviates from approved configurations is not adequately controlled. The FCA's operational resilience framework (PS21/3) expects firms to identify and manage risks to important business services — infrastructure drift is a risk that can push services outside impact tolerances. Drift detection is a core systems and controls requirement for any firm operating AI agents in regulated financial services.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | Environment-wide — drift in shared infrastructure components (network rules, IAM policies, platform configurations) can affect every AI agent and service running in the affected environment simultaneously

Consequence chain: Infrastructure silently diverges from its approved baseline, causing the actual operational environment to differ from the governed, tested, and approved environment. The immediate technical failure is a gap between documented state and actual state: the organisation believes its infrastructure matches the baseline, but it does not. The operational impact varies by the nature of the drift: network rule drift can expose internal services to external attack (Scenario A: £4.7 million in client losses from front-running), compute resource drift can degrade real-time safety performance (Scenario B: £2.1 million remediation and 14-day shutdown from latency exceedance), and IAM policy drift can amplify the impact of other attacks by providing excessive permissions (Scenario C: 12,000 customer records exfiltrated due to drifted permissions). The business consequence includes regulatory enforcement for inadequate controls, financial loss from exploited drift, safety incidents from degraded infrastructure, and systemic risk when drift accumulates across multiple components — each individual drift may appear low-risk, but the combination creates emergent vulnerabilities that were never assessed. The compound nature of drift is its most dangerous property: it accumulates silently, each small deviation appearing harmless, until the aggregate state diverges so far from the baseline that fundamental governance assumptions are invalid.

Cross-references: AG-006 (Tamper-Evident Record Integrity), AG-007 (Governance Configuration Control), AG-022 (Behavioural Drift Detection), AG-399 (Infrastructure Identity & Access Governance), AG-401 (Network Segmentation Governance), AG-403 (Secret & Credential Lifecycle Governance), AG-406 (Dependency Supply-Chain Governance), AG-407 (Build Pipeline Attestation Governance).

Cite this protocol
AgentGoverning. (2026). AG-408: Infrastructure Drift Detection Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-408