AG-408

Infrastructure Drift Detection Governance

Infrastructure, Platform & Network · ~20 min read · AGS v2.1 · April 2026

EU AI Act · GDPR · SOX · FCA · NIST · HIPAA · ISO 42001

2. Summary

Infrastructure Drift Detection Governance requires that organisations operating AI agent systems continuously monitor their deployed infrastructure for unauthorised divergence from approved baselines — declared, version-controlled specifications of what the infrastructure should look like. Infrastructure drift occurs when the actual state of compute resources, network configurations, storage parameters, access policies, or runtime environments deviates from the approved state without a corresponding approved change record. Drift can be introduced through manual ad-hoc modifications, failed automation, partial rollbacks, external compromise, or resource provider changes. For AI agent environments, drift is particularly dangerous because it can silently alter the security posture, performance characteristics, or compliance status of the infrastructure on which governance controls depend. This dimension mandates systematic baseline management, continuous drift scanning, and policy-driven remediation workflows that ensure detected drift is classified, reported, and resolved before it undermines agent governance.

3. Example

Scenario A — Firewall Rule Drift Exposes Model Inference Endpoint: A financial-services AI agent serves trade-recommendation inference through an internal API. The infrastructure baseline specifies that the inference endpoint is accessible only from the application tier's private subnet (10.0.4.0/24) via port 8443. A network engineer, troubleshooting a connectivity issue at 02:00 during an incident, adds a temporary firewall rule permitting inbound traffic from 0.0.0.0/0 on port 8443. The incident is resolved; the temporary rule is not removed. No drift detection is in place. The inference endpoint is now publicly accessible. Three weeks later, an external actor discovers the endpoint through a port scan, submits crafted inference requests, and extracts proprietary trading signals from the model's responses. Over 19 days of exploitation, the attacker front-runs 340 trades using the extracted signals, generating an estimated £4.7 million in illicit profit at the organisation's clients' expense.

What went wrong: A manual infrastructure change was made outside the change management process — a common occurrence during incident response. Without drift detection, the deviation from the approved baseline was never identified. The infrastructure state diverged from the declared state on day 1 but was not detected for 40 days. Consequence: 19 days of model exploitation, £4.7 million in client losses attributed to front-running, FCA enforcement action for inadequate market abuse controls, and reputational damage requiring client notification.

Scenario B — Resource Scaling Drift Degrades Safety-Critical Latency: A safety-critical AI agent monitors industrial gas pressures and issues emergency shutdown commands when thresholds are exceeded. The infrastructure baseline specifies dedicated compute instances with 32 vCPUs, 128 GB RAM, and a maximum inference latency of 50 milliseconds at the 99th percentile. During a cloud cost optimisation exercise, an automated right-sizing tool reduces the instance type to 8 vCPUs and 32 GB RAM based on average utilisation metrics — it does not account for burst requirements during emergency scenarios. The change is applied through the cloud provider's API but does not update the infrastructure-as-code baseline. No drift detection flags the discrepancy. Four months later, a simultaneous multi-sensor alarm condition generates 47 concurrent inference requests. The under-provisioned infrastructure delivers 99th-percentile latency of 340 milliseconds — nearly seven times the 50-millisecond requirement. Three shutdown commands arrive too late. Pressure exceedance causes a valve failure and a controlled release of industrial gas, requiring facility evacuation. No injuries occur, but the incident results in a £2.1 million remediation cost and a 14-day operational shutdown.

What went wrong: An automated tool modified infrastructure outside the governed change process. The actual infrastructure state diverged from the baseline specification without detection. The drift was not malicious but was equally dangerous: it silently degraded the safety margin on which the agent's real-time performance depended. Consequence: Safety incident with facility evacuation, £2.1 million remediation, 14-day shutdown, regulatory investigation by the safety authority.

Scenario C — IAM Policy Drift Grants Excessive Agent Permissions: A customer-facing AI agent operates with a narrowly scoped identity and access management (IAM) policy permitting read access to the product catalogue and write access to the order queue — and nothing else. During a feature development cycle, a developer grants the agent's service identity temporary administrative access to a data lake containing customer financial records, for the purpose of testing a new recommendation feature. The feature is abandoned, but the IAM policy change persists. Drift detection is not implemented. The agent now has read access to 2.3 million customer financial records — a violation of the principle of least privilege and a data protection breach waiting to happen. Six months later, a prompt injection attack exploits the agent's excessive permissions to exfiltrate 12,000 customer records through the agent's response channel.

What went wrong: A temporary development change to the infrastructure configuration was never reverted. Without drift detection comparing the actual IAM policy to the approved baseline, the excessive permission persisted undetected for 6 months. When the agent was compromised through an unrelated vector (prompt injection), the drifted IAM policy amplified a contained application-layer incident into a data breach. Consequence: 12,000 customer records exfiltrated, GDPR Article 33 breach notification to the supervisory authority, £1.8 million in notification, credit monitoring, and remediation costs, and class-action litigation.

4. Requirement Statement

Scope: This dimension applies to all infrastructure supporting AI agent systems, including but not limited to: compute resources (virtual machines, containers, serverless functions), network configurations (firewalls, load balancers, DNS records, VPN tunnels, network segmentation rules), storage systems (databases, object stores, file systems, caching layers), identity and access management policies (IAM roles, service accounts, access control lists), runtime configurations (environment variables, feature flags, resource limits), and platform services (message queues, API gateways, monitoring endpoints). The dimension covers infrastructure managed through any mechanism: infrastructure-as-code, cloud provider consoles, APIs, CLI tools, or automated optimisation systems. It applies to infrastructure in any environment that serves production AI agent traffic, including production, disaster recovery, and any staging environment from which production promotions occur. Organisations using managed or shared infrastructure must ensure drift detection covers the configuration surfaces they control.

4.1. A conforming system MUST maintain a version-controlled, authoritative infrastructure baseline that declares the approved state of all infrastructure components supporting AI agent production environments, with each baseline version linked to a change approval record.

4.2. A conforming system MUST perform automated drift scans comparing the actual infrastructure state to the approved baseline at intervals not exceeding 24 hours for all infrastructure components, and not exceeding 1 hour for security-critical components (network access rules, IAM policies, encryption settings).
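The comparison at the heart of a drift scan can be sketched as a diff between the declared baseline and the observed state. The following is a minimal illustration, not a mandated implementation; the component names and record fields are assumptions for this sketch.

```python
# Minimal sketch of a baseline-vs-actual drift scan (requirement 4.2).
# Component names and record fields are illustrative only.

def detect_drift(baseline: dict, actual: dict) -> list:
    """Compare observed infrastructure state to the approved baseline.

    Returns one drift record per component whose actual value differs
    from the baseline, is missing, or was never declared.
    """
    drift = []
    for component, approved in baseline.items():
        observed = actual.get(component)
        if observed != approved:
            drift.append({
                "component": component,
                "baseline_value": approved,
                "actual_value": observed,  # None means the component is missing
            })
    for component in actual.keys() - baseline.keys():
        drift.append({
            "component": component,
            "baseline_value": None,  # undeclared component found in production
            "actual_value": actual[component],
        })
    return drift


# Scenario A expressed as a drift record: the firewall rule has been
# widened from the private subnet to 0.0.0.0/0.
baseline = {"fw/inference-ingress": {"source": "10.0.4.0/24", "port": 8443}}
actual = {"fw/inference-ingress": {"source": "0.0.0.0/0", "port": 8443}}
events = detect_drift(baseline, actual)
```

Real scanners would read the observed state from provider APIs and the baseline from version control; the diff logic is the same.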

4.3. A conforming system MUST classify detected drift by severity — at minimum distinguishing between security-critical drift (changes to access controls, network exposure, encryption, or authentication), performance-critical drift (changes to compute resources, scaling parameters, or latency-affecting configurations), and operational drift (changes to non-security, non-performance configurations) — with documented classification criteria.
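The documented classification criteria required by 4.3 can be encoded as version-controlled rules. A simple prefix-based sketch follows; the category prefixes ("fw/", "iam/", and so on) are assumptions for illustration, not part of this dimension.

```python
# Illustrative severity classification rules per requirement 4.3.
# The component naming scheme is an assumption for this sketch.

SECURITY_CRITICAL = ("fw/", "iam/", "kms/", "auth/")      # access, network, encryption
PERFORMANCE_CRITICAL = ("compute/", "autoscaling/", "lb/")  # latency-affecting resources

def classify_drift(component: str) -> str:
    """Map a drifted component to one of the three minimum severity classes."""
    if component.startswith(SECURITY_CRITICAL):
        return "security-critical"
    if component.startswith(PERFORMANCE_CRITICAL):
        return "performance-critical"
    return "operational"
```

Keeping the rule set in version control means the classification criteria themselves are auditable and subject to change approval.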

4.4. A conforming system MUST generate alerts for all detected drift, with security-critical drift alerts delivered to the security operations function within 15 minutes of detection, and performance-critical drift alerts delivered to the operations function within 60 minutes.

4.5. A conforming system MUST record every drift event in a tamper-evident log (per AG-006), capturing: the component affected, the baseline value, the actual value, the detection timestamp, the classification severity, and the resolution action taken.
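One common way to make a drift log tamper-evident, in the spirit of AG-006, is a hash chain: each entry carries the hash of its predecessor, so any retroactive edit breaks verification. This is a sketch of the idea, not a prescribed log format.

```python
# Hash-chained drift event log (sketch of requirement 4.5).
import hashlib
import json

def append_drift_event(log: list, event: dict) -> dict:
    """Append a drift event, chaining it to the previous entry's hash."""
    prev_hash = log[-1]["entry_hash"] if log else "0" * 64
    body = dict(event, prev_hash=prev_hash)
    entry_hash = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()
    ).hexdigest()
    entry = dict(body, entry_hash=entry_hash)
    log.append(entry)
    return entry

def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = "0" * 64
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if body["prev_hash"] != prev:
            return False
        recomputed = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if recomputed != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True
```

In production the chain head would additionally be anchored in an external system so the whole log cannot be silently rewritten.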

4.6. A conforming system MUST enforce a remediation policy requiring that security-critical drift is resolved (either reverted to baseline or formally approved and the baseline updated) within 4 hours of detection, and all other drift within 72 hours.
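The alert windows in 4.4 and the remediation windows in 4.6 reduce to deadline arithmetic on the detection timestamp. A sketch, with the SLA values taken directly from those clauses:

```python
# SLA deadline calculation per requirements 4.4 (alerting) and 4.6 (remediation).
from datetime import datetime, timedelta, timezone

ALERT_SLA = {
    "security-critical": timedelta(minutes=15),
    "performance-critical": timedelta(minutes=60),
}
REMEDIATION_SLA = {
    "security-critical": timedelta(hours=4),
    "default": timedelta(hours=72),  # all other drift
}

def sla_deadlines(severity: str, detected_at: datetime) -> dict:
    """Return the alert and remediation deadlines for a drift event."""
    deadlines = {
        "remediate_by": detected_at
        + REMEDIATION_SLA.get(severity, REMEDIATION_SLA["default"])
    }
    if severity in ALERT_SLA:
        deadlines["alert_by"] = detected_at + ALERT_SLA[severity]
    return deadlines
```

Tracking these deadlines per event makes SLA breaches themselves detectable and reportable.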

4.7. A conforming system MUST prevent baseline modifications without a corresponding change approval record — the baseline cannot be updated to match drifted state as a substitute for remediation unless the change has been formally reviewed and approved.

4.8. A conforming system SHOULD implement real-time drift detection through event-driven monitoring of infrastructure change APIs, supplementing periodic scans with immediate detection of changes as they occur.

4.9. A conforming system SHOULD implement automated remediation for predefined low-risk drift categories, with automatic reversion to baseline for changes that match known safe-to-revert patterns (e.g., tag modifications, non-functional metadata changes).
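The safe-to-revert patterns in 4.9 can be expressed as a version-controlled allowlist against which each drift event is matched: matches are reverted automatically, everything else is escalated to a human. The patterns below are illustrative assumptions.

```python
# Automated remediation partition (sketch of requirement 4.9).
import fnmatch

# Patterns considered safe to auto-revert; illustrative examples only.
SAFE_TO_REVERT = ["tags/*", "metadata/*"]

def auto_remediate(drift_events: list) -> tuple[list, list]:
    """Split drift events into auto-reverted and human-escalated sets."""
    reverted, escalated = [], []
    for event in drift_events:
        component = event["component"]
        if any(fnmatch.fnmatch(component, p) for p in SAFE_TO_REVERT):
            reverted.append(component)   # restore baseline value automatically
        else:
            escalated.append(component)  # open a remediation ticket
    return reverted, escalated
```

Keeping the allowlist narrow is deliberate: auto-reverting a security-critical component could itself mask an incident that warrants investigation.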

4.10. A conforming system MAY implement predictive drift analysis, identifying infrastructure components that are statistically likely to drift based on historical patterns, and applying enhanced monitoring to those components.
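The predictive analysis in 4.10 can begin with something as simple as counting historical drift events per component and flagging repeat offenders for enhanced monitoring. The threshold here is an arbitrary assumption for the sketch.

```python
# Naive predictive drift analysis (sketch of requirement 4.10): flag
# components with repeated historical drift. Threshold is an assumption.
from collections import Counter

def high_risk_components(history: list, threshold: int = 3) -> set:
    """Return components whose historical drift count meets the threshold."""
    counts = Counter(event["component"] for event in history)
    return {c for c, n in counts.items() if n >= threshold}
```

More sophisticated implementations would weight by severity and recency, but even frequency counting surfaces the components where manual changes habitually recur.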

5. Rationale

Infrastructure is the foundation on which all AI agent governance controls rest. Access controls, audit logging, network isolation, encryption, latency guarantees, and resource availability are all infrastructure-layer properties. When the infrastructure drifts from its approved state, these foundational properties may silently change — and every governance control that depends on them is compromised. Drift is particularly dangerous because it is often invisible: the agent continues to operate, the application continues to serve requests, and no error is raised. The system appears normal while its security posture, performance profile, or compliance status has fundamentally changed.

Four forces drive infrastructure drift. First, manual ad-hoc changes made during incidents, debugging, or experimentation. These changes are made with legitimate intent but bypass change management processes. They are the most common source of drift and the hardest to prevent entirely — incident response sometimes requires immediate infrastructure modification. The governance response is not to prevent all manual changes but to detect them promptly. Second, automated optimisation tools that modify resource allocations based on utilisation metrics without consulting governance baselines. Cloud cost optimisation, autoscaling adjustments, and provider-side maintenance can all modify infrastructure configuration. Third, partial failures in infrastructure-as-code application — where an automation run partially succeeds, modifying some components but failing on others, leaving the infrastructure in a state that matches neither the old nor the new baseline. Fourth, malicious modification by external attackers or insider threats, where infrastructure changes are the attack vector rather than the consequence.

The regulatory case for drift detection is direct. DORA Article 9 requires financial entities to identify and manage ICT risks, including risks arising from configuration changes. The EU AI Act Article 15 requires robustness — infrastructure drift that degrades security or performance is a robustness failure. SOX Section 404 requires internal controls including change management and configuration management. ISO 27001 Clause A.8.9 (Configuration Management) and A.8.32 (Change Management) explicitly require that configuration deviations are detected and addressed. The FCA's SYSC 6.1.1R requires adequate systems and controls — infrastructure that silently deviates from its approved state is not adequately controlled.

For AI agent systems, drift has compound effects. A network rule change may expose an inference endpoint (Scenario A). A compute scaling change may violate latency constraints on which safety depends (Scenario B). An IAM drift may amplify the impact of an application-layer attack from a contained incident to a data breach (Scenario C). Each of these failures originated in infrastructure drift but manifested as an AI governance failure. Drift detection is therefore not merely an infrastructure operation — it is a governance control that protects the integrity of all other controls.

6. Implementation Guidance

Infrastructure Drift Detection Governance requires a continuous closed loop: declare the desired state, measure the actual state, compare, classify, alert, and remediate. The core principle is that the infrastructure baseline is the single source of truth, and any divergence from it is a governance event requiring investigation and resolution.
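One iteration of that closed loop can be sketched as a pipeline over pluggable stages. The stage functions passed in below are illustrative stand-ins, not prescribed interfaces.

```python
# One iteration of the closed loop: measure, compare, classify, alert,
# remediate. The stage callables are illustrative stand-ins.

def run_drift_cycle(baseline: dict, fetch_actual, classify, alert, remediate) -> list:
    actual = fetch_actual()                       # measure the live state
    drift = [
        {"component": c, "baseline": v, "actual": actual.get(c)}
        for c, v in baseline.items()
        if actual.get(c) != v                     # compare to the baseline
    ]
    for event in drift:
        event["severity"] = classify(event["component"])  # classify
        alert(event)                              # alert the right function
        remediate(event)                          # revert or formalise
    return drift


# Minimal usage: a Scenario-B-style compute downgrade surfaces as
# performance-critical drift.
baseline = {"compute/inference": "32vcpu-128gb"}
drift = run_drift_cycle(
    baseline,
    fetch_actual=lambda: {"compute/inference": "8vcpu-32gb"},
    classify=lambda c: "performance-critical" if c.startswith("compute/") else "operational",
    alert=print,             # stand-in for paging the operations function
    remediate=lambda e: None,
)
```

The declare step is deliberately absent from the loop: the baseline is an input, maintained only through the governed change process.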

Recommended patterns:

Anti-patterns to avoid:

Industry Considerations

Financial Services. Financial regulators expect firms to maintain controlled, auditable infrastructure environments. Drift in network configurations can expose trading systems or customer data (Scenario A). Drift in compute resources can affect transaction processing latency, potentially violating best execution obligations. Firms should implement real-time drift detection for all security-critical components and daily scanning for all other components. The FCA's operational resilience expectations (PS21/3) specifically require firms to remain within impact tolerances — infrastructure drift that pushes a service outside its impact tolerance is a regulatory breach.

Safety-Critical and Cyber-Physical Systems. Infrastructure drift in safety-critical environments can have life-safety consequences (Scenario B). Compute resource drift may violate real-time processing requirements. Network drift may disrupt safety communication paths. Organisations operating safety-critical AI agents should implement the tightest drift detection intervals (continuous or hourly for all components) and the shortest remediation SLAs (immediate reversion for any drift that could affect safety functions).

Healthcare. Healthcare AI infrastructure must comply with data protection regulations (HIPAA, GDPR Article 32) that require appropriate technical measures. Drift in encryption settings, access controls, or network isolation for systems processing patient data constitutes a technical measures failure. Drift detection must cover all infrastructure supporting healthcare AI agents with particular attention to data access and encryption configurations.

Crypto and Web3. Decentralised agent infrastructure often includes blockchain nodes, key management infrastructure, and bridge components where drift can result in irrecoverable financial loss. A drifted configuration on a validator node or a signing service can expose private keys or enable unauthorised transactions. Real-time drift detection with automated reversion is strongly recommended for all security-critical Web3 infrastructure components.

Maturity Model

Basic Implementation — The organisation maintains a version-controlled infrastructure baseline covering all production AI agent infrastructure components. Automated drift scans run at least daily for all components and at least hourly for security-critical components. Detected drift is classified by severity, alerts are generated, and remediation tickets are created. Drift events are logged in a tamper-evident log. Baseline modifications require change approval. This meets the minimum mandatory requirements.

Intermediate Implementation — All basic capabilities plus: event-driven detection supplements periodic scans, providing near-real-time drift detection through infrastructure change event feeds. A drift classification engine applies version-controlled rules to categorise drift automatically. Remediation workflows include one-click reversion and formalisation options. Automated remediation handles low-risk drift categories. Drift metrics (detection count, mean time to remediation, repeat drift rates) are tracked and reported to governance stakeholders monthly.

Advanced Implementation — All intermediate capabilities plus: continuous real-time monitoring of all infrastructure layers with sub-minute detection latency. Predictive drift analysis identifies high-risk components and applies enhanced monitoring. Drift detection is integrated with the build pipeline attestation system (AG-407) to correlate infrastructure changes with deployment events. Cross-environment baseline comparison detects configuration inconsistencies between production and disaster recovery environments. The organisation can demonstrate through independent audit that no infrastructure drift persists undetected beyond the defined scan interval.

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Baseline Declaration and Version Control

Test 8.2: Drift Detection — Security-Critical Component

Test 8.3: Drift Detection — Performance-Critical Component

Test 8.4: Remediation SLA Enforcement — Security-Critical

Test 8.5: Drift Event Logging Integrity

Test 8.6: Baseline Modification Without Approval — Rejection

Test 8.7: Multi-Layer Drift Coverage Verification

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Supports compliance
EU AI Act | Article 15 (Accuracy, Robustness and Cybersecurity) | Direct requirement
SOX | Section 404 (Internal Controls Over Financial Reporting) | Supports compliance
FCA SYSC | 6.1.1R (Systems and Controls) | Direct requirement
NIST AI RMF | GOVERN 1.4, MANAGE 2.2, MANAGE 4.1 | Supports compliance
ISO 42001 | Clause 6.1 (Actions to Address Risks), Clause 9.1 (Monitoring, Measurement, Analysis and Evaluation) | Supports compliance
DORA | Article 9 (ICT Risk Management Framework), Article 10 (ICT Risk Management Tools) | Direct requirement

EU AI Act — Article 15 (Accuracy, Robustness and Cybersecurity)

Article 15 requires that high-risk AI systems are resilient as regards attempts by unauthorised third parties to alter their use, outputs, or performance by exploiting system vulnerabilities. Infrastructure drift is a system vulnerability: it creates gaps between the approved security posture and the actual security posture. An AI system running on drifted infrastructure cannot be considered robust because its operational characteristics may differ from those that were tested and approved. Organisations must demonstrate continuous monitoring for infrastructure drift as part of their Article 15 robustness assurance.

DORA — Article 9 and Article 10

DORA Article 9 requires financial entities to establish ICT risk management frameworks that identify and protect against ICT risks. Infrastructure drift is an ICT risk — it represents an uncontrolled change to the ICT environment. Article 10 specifically requires tools for detecting anomalous activities including ICT-related incidents that may materialise. Infrastructure drift detection is a direct implementation of this requirement. Financial entities must demonstrate continuous drift monitoring as part of their DORA compliance.

SOX — Section 404 (Internal Controls Over Financial Reporting)

For AI agents that participate in financial reporting processes, infrastructure drift can undermine the controls that ensure reporting integrity. A drifted access control may permit unauthorised data modification. A drifted compute configuration may degrade processing accuracy. SOX auditors will assess whether infrastructure configurations are monitored and controlled. Drift detection provides the evidence that infrastructure controls remain effective between audit cycles.

FCA SYSC — 6.1.1R (Systems and Controls)

The FCA requires firms to take reasonable care to establish and maintain systems and controls appropriate to their business. Infrastructure that silently deviates from approved configurations is not adequately controlled. The FCA's operational resilience framework (PS21/3) expects firms to identify and manage risks to important business services — infrastructure drift is a risk that can push services outside impact tolerances. Drift detection is a core systems and controls requirement for any firm operating AI agents in regulated financial services.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | Environment-wide — drift in shared infrastructure components (network rules, IAM policies, platform configurations) can affect every AI agent and service running in the affected environment simultaneously

Consequence chain: Infrastructure silently diverges from its approved baseline, causing the actual operational environment to differ from the governed, tested, and approved environment. The immediate technical failure is a gap between documented state and actual state: the organisation believes its infrastructure matches the baseline, but it does not. The operational impact varies by the nature of the drift: network rule drift can expose internal services to external attack (Scenario A: £4.7 million in client losses from front-running), compute resource drift can degrade real-time safety performance (Scenario B: £2.1 million remediation and 14-day shutdown from latency exceedance), and IAM policy drift can amplify the impact of other attacks by providing excessive permissions (Scenario C: 12,000 customer records exfiltrated due to drifted permissions). The business consequence includes regulatory enforcement for inadequate controls, financial loss from exploited drift, safety incidents from degraded infrastructure, and systemic risk when drift accumulates across multiple components — each individual drift may appear low-risk, but the combination creates emergent vulnerabilities that were never assessed. The compound nature of drift is its most dangerous property: it accumulates silently, each small deviation appearing harmless, until the aggregate state diverges so far from the baseline that fundamental governance assumptions are invalid.

Cross-references: AG-006 (Tamper-Evident Record Integrity), AG-007 (Governance Configuration Control), AG-022 (Behavioural Drift Detection), AG-399 (Infrastructure Identity & Access Governance), AG-401 (Network Segmentation Governance), AG-403 (Secret & Credential Lifecycle Governance), AG-406 (Dependency Supply-Chain Governance), AG-407 (Build Pipeline Attestation Governance).

Cite this protocol
AgentGoverning. (2026). AG-408: Infrastructure Drift Detection Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-408