AG-403

Dependency Failover Validation Governance

Infrastructure, Platform & Network · AGS v2.1 · April 2026
Applicable frameworks: EU AI Act · GDPR · SOX · FCA · NIST · HIPAA · ISO 42001

2. Summary

Dependency Failover Validation Governance requires that every failover event — whether triggered by infrastructure failure, cloud-provider region outage, load balancer health check, or manual switchover — preserves the full governance posture of the pre-failover state, not merely uptime. Failover mechanisms in modern distributed systems are optimised for availability: traffic shifts to a secondary replica, a standby cluster promotes, or a cold site activates. These mechanisms rarely verify that safety constraints, audit pipelines, mandate enforcement layers, region-pinning rules, and policy configurations survive the transition intact. This dimension mandates that failover validation includes governance-state verification as a blocking prerequisite before the failover target accepts agent traffic, ensuring that the system never trades safety for availability.

3. Example

Scenario A — Region Failover Drops Mandate Enforcement Gateway: A European financial services firm operates AI agents for automated trade execution across two availability zones within the EU. The primary zone hosts the mandate enforcement gateway (AG-001) as a sidecar alongside the agent runtime. When a power event forces failover to the secondary zone, the orchestrator brings up agent containers within 14 seconds — well within the 30-second recovery time objective. However, the mandate enforcement sidecar is configured through a region-specific deployment manifest that was not replicated to the secondary zone's container registry. Agents come online without the enforcement sidecar. Over the 9 minutes before an operator notices, the agents execute 1,247 trades totalling £6.3 million with no per-transaction or aggregate limit enforcement. Twelve trades exceed the approved £50,000 per-transaction ceiling, with the largest single trade at £312,000.

What went wrong: The failover procedure validated compute availability and network connectivity but did not verify governance component presence. The mandate enforcement gateway was treated as an application dependency rather than a governance-critical component with its own failover validation requirement. The deployment manifest divergence between primary and secondary zones was never tested. Consequence: £6.3 million in uncontrolled trading exposure, FCA enforcement inquiry for inadequate systems and controls, £312,000 in potential loss from the largest uncontrolled trade, 90-day remediation mandate from the regulator, and personal liability exposure for the Senior Manager responsible under SM&CR.

Scenario B — DNS Failover Routes Traffic Outside Approved Jurisdiction: A healthcare AI agent platform serves patients across the EU, with strict GDPR data residency requirements enforced through AG-399 region-pinning controls. The platform uses DNS-based failover to route traffic between EU data centres. When the primary Frankfurt data centre experiences a network partition, the DNS failover configuration — last updated 11 months ago during a cloud migration — routes traffic to a disaster recovery site in Virginia, USA. The failover succeeds technically: agents resume processing patient queries within 22 seconds. However, patient health data is now being processed and stored in a US jurisdiction without Standard Contractual Clauses in place for this specific processing activity. The violation continues for 3 hours and 17 minutes, affecting 4,200 patient interactions containing protected health information.

What went wrong: The DNS failover target was configured before the data residency policy was implemented. No failover validation checked whether the target site satisfied jurisdictional requirements. The failover mechanism optimised for availability without verifying policy compliance. Consequence: GDPR Article 44 violation for unauthorised transfer of personal data to a third country. Data Protection Authority notification required within 72 hours. Potential fine of up to EUR 20 million or 4% of annual turnover. 4,200 patients requiring individual breach notification. Contractual liability to healthcare provider clients whose data residency SLAs were violated.

Scenario C — Database Failover Resets Audit Pipeline Connection: An enterprise deploys AI agents for automated invoice processing with a governance architecture that includes a PostgreSQL database for mandate storage, a message queue for audit event streaming, and a separate audit data warehouse. The database runs in a primary-replica configuration with automatic failover. When the primary database fails, the replica promotes successfully and agent operations resume within 8 seconds. However, the audit event streaming pipeline was connected to the primary database's change-data-capture slot, which does not exist on the newly promoted replica. Audit events generated after failover are written to the database but never propagate to the audit data warehouse. The gap persists for 6 days until a weekly reconciliation job detects missing records. During that window, 23,400 agent actions worth £4.1 million in aggregate invoice approvals have no corresponding audit trail in the compliance data warehouse.

What went wrong: The failover procedure validated database availability and application connectivity but did not verify downstream governance dependencies. The change-data-capture slot was a database-level resource tied to the primary instance. No post-failover health check verified audit pipeline continuity. Consequence: 6-day audit gap affecting 23,400 actions. SOX Section 404 material weakness finding for inability to demonstrate complete audit trail. External auditor qualification of the annual controls assessment. £290,000 in remediation costs for manual audit reconstruction and pipeline redesign.
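
For the Scenario C failure mode, a post-failover audit-continuity probe can confirm that the change-data-capture slot exists on the newly promoted primary before agent traffic resumes. The sketch below is illustrative only: it assumes a PostgreSQL logical replication slot named audit_cdc, the psycopg2 client, and connection details supplied by the caller; none of these names are prescribed by this protocol.

```python
import psycopg2  # assumed client; any PostgreSQL driver would work equally well


def audit_cdc_slot_present(dsn: str, slot_name: str = "audit_cdc") -> bool:
    """Return True if the expected change-data-capture slot exists on this instance."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT 1 FROM pg_replication_slots WHERE slot_name = %s",
                (slot_name,),
            )
            return cur.fetchone() is not None


# Used as one entry in the governance-state validation gate: if the slot is absent,
# post-failover audit events will not propagate to the warehouse, so the failover
# target must not accept agent traffic until the downstream pipeline is re-established.
```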

4. Requirement Statement

Scope: This dimension applies to all systems where AI agents operate behind failover mechanisms of any kind — active-passive database replication, active-active multi-region deployment, DNS-based traffic shifting, load balancer health checks, container orchestrator rescheduling, cloud provider availability zone failover, or manual disaster recovery procedures. The scope extends to every component in the governance stack: mandate enforcement layers, audit pipelines, identity verification services, region-pinning controls, secrets management, encryption key availability, and human escalation channels. A failover that restores compute and network connectivity but leaves any governance component degraded, absent, or misconfigured is within scope. The scope includes planned failovers (maintenance windows, blue-green deployments) as well as unplanned failovers (infrastructure failures, provider outages). The test is not whether the application recovers, but whether the full governance posture recovers.

4.1. A conforming system MUST define a governance-state manifest for each deployment that enumerates all governance-critical components whose presence and correct configuration are required before agent traffic is accepted.

4.2. A conforming system MUST execute a governance-state validation check as a blocking gate after any failover event and before the failover target begins accepting agent actions.

4.3. A conforming system MUST block agent operations on the failover target if any governance-critical component enumerated in the governance-state manifest fails validation, rather than permitting ungoverned operation for availability.

4.4. A conforming system MUST verify that audit pipeline continuity is preserved across failover events, including confirmation that audit events generated after failover are captured, transmitted, and stored in the designated audit repository.

4.5. A conforming system MUST verify that jurisdictional and data-residency constraints (where applicable) are satisfied by the failover target before routing agent traffic to it.

4.6. A conforming system MUST log every failover event with a structured record that includes: trigger cause, failover source, failover target, governance-state validation result, timestamp of agent traffic resumption, and any governance components that required re-initialisation.

4.7. A conforming system SHOULD execute failover governance-validation drills at a defined frequency — no less than quarterly — using simulated failures that exercise the full governance-state validation sequence.

4.8. A conforming system SHOULD maintain governance-state manifest version parity between primary and failover targets through automated synchronisation, with drift detection alerting when manifests diverge.

4.9. A conforming system SHOULD measure and report governance recovery time as a distinct metric from application recovery time, tracking the interval between failover initiation and governance-state validation completion.

4.10. A conforming system MAY implement automated governance-state remediation that attempts to re-initialise missing governance components on the failover target before blocking agent traffic, provided the remediation completes within a defined time budget.
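
A minimal sketch of the blocking gate described in 4.2 and 4.3, which also emits the structured failover record of 4.6 and the distinct governance recovery time measurement of 4.9. The ComponentCheck and FailoverEvent types, the hook parameters, and the field names are illustrative assumptions rather than part of the requirement; in practice the per-component probes would be derived from the governance-state manifest of 4.1.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ComponentCheck:
    name: str
    probe: Callable[[], bool]  # returns True when the component is present and healthy


@dataclass
class FailoverEvent:
    trigger_cause: str
    source: str
    target: str
    initiated_at: float  # epoch timestamp at which the failover began


def run_governance_gate(event: FailoverEvent,
                        checks: List[ComponentCheck],
                        accept_traffic: Callable[[], None],
                        emit_log: Callable[[Dict], None]) -> bool:
    """Blocking gate (4.2/4.3): agent traffic is admitted only if every check passes."""
    failed = [c.name for c in checks if not c.probe()]
    record = {  # structured failover record per 4.6
        "trigger_cause": event.trigger_cause,
        "failover_source": event.source,
        "failover_target": event.target,
        "governance_validation_result": "fail" if failed else "pass",
        "components_requiring_reinitialisation": failed,
        "governance_recovery_seconds": round(time.time() - event.initiated_at, 3),  # 4.9
        "traffic_resumed_at": None,
    }
    if failed:
        emit_log(record)  # traffic stays blocked rather than running ungoverned (4.3)
        return False
    record["traffic_resumed_at"] = time.time()
    accept_traffic()
    emit_log(record)
    return True
```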

5. Rationale

Dependency Failover Validation Governance addresses a systemic blind spot in infrastructure resilience engineering: failover mechanisms are designed and tested for availability, not for governance preservation. Every major cloud provider, container orchestrator, and database system offers failover capabilities. These capabilities are measured by recovery time objectives (RTO) and recovery point objectives (RPO) — metrics that quantify how quickly the system recovers and how much data is lost. Neither metric addresses whether the governance posture survives the transition. An agent platform that recovers from a regional outage in 14 seconds with zero data loss scores perfectly on RTO and RPO while potentially operating with no mandate enforcement, no audit trail, and no jurisdictional compliance.

The root cause is architectural: governance components are typically deployed alongside application components but are not treated as first-class failover dependencies. A container orchestrator knows that an agent container needs a database connection and a message queue endpoint. It does not know that the agent also requires a mandate enforcement sidecar, an audit event pipeline, a region-pinning validator, and an encryption key accessible only from approved jurisdictions. When the orchestrator reschedules the agent to a new node after a failure, it restores the declared dependencies but not the undeclared governance dependencies.

This gap is particularly dangerous because it manifests only during failure — precisely the moment when operational attention is focused on restoring service rather than verifying governance. Incident response procedures prioritise availability. The first question in any outage is "is the service back up?" not "are the governance controls intact?" By the time governance verification occurs — if it occurs at all — agents may have operated without full governance for minutes, hours, or days.

Historical incident patterns confirm this risk. Major financial institutions have experienced scenarios where database failovers disrupted audit pipelines for days before detection. Healthcare platforms have inadvertently routed patient data to non-compliant jurisdictions during DNS failover events. Industrial control systems have operated safety-critical processes with degraded safety interlocks after controller failover. In each case, the availability failover succeeded — the system was "up" — but the governance failover failed.

AG-403 intersects with AG-008 (Governance Continuity Under Failure) but addresses a distinct problem. AG-008 governs what happens when governance infrastructure is known to be unavailable — it defines degraded-mode tiers and fail-safe behaviour. AG-403 governs the transition between states — the moment of failover itself — where the danger is not that governance is known to be absent but that it is assumed to be present when it is not. The failure mode is silent: the system reports healthy, the agent operates normally, and no alert fires because the governance component was never registered as a health check dependency.

The regulatory context reinforces the criticality. DORA Article 11 requires financial entities to test ICT business continuity plans including failover procedures. The EU AI Act Article 9 requires risk management systems that address foreseeable failures. SOX Section 404 requires internal controls to be effective across all operating conditions, including disaster recovery. A failover that disables governance controls creates a control gap that is reportable under each of these frameworks.

6. Implementation Guidance

AG-403 requires organisations to treat governance components as first-class failover dependencies with explicit validation gates. The implementation centres on governance-state manifests, post-failover validation sequences, drill programmes, and recovery-time instrumentation.

The governance-state manifest is the foundational artefact. It enumerates every component required for full governance operation: mandate enforcement gateways, audit pipeline endpoints, identity verification services, encryption key availability, region-pinning validators, human escalation channels, and any other governance-critical dependency. Each entry in the manifest specifies: component name, health check endpoint or verification method, expected configuration hash, maximum acceptable validation latency, and the action to take on validation failure (block traffic, activate degraded mode per AG-008, or alert and continue with enhanced monitoring).
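
As a sketch of how a manifest entry and the parity check of 4.8 might be represented, the following is an illustration under assumed names rather than a prescribed schema; the component identifiers, field names and failure actions are examples drawn from the list above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class OnFailure(Enum):
    BLOCK_TRAFFIC = "block_traffic"
    DEGRADED_MODE = "degraded_mode_per_ag008"
    ALERT_AND_MONITOR = "alert_and_continue_with_enhanced_monitoring"


@dataclass(frozen=True)
class ManifestEntry:
    component: str               # e.g. "mandate-enforcement-gateway"
    health_check: str            # health check endpoint or verification method
    expected_config_hash: str    # hash of the approved configuration
    max_validation_latency_s: float
    on_failure: OnFailure


def manifest_drift(primary: Dict[str, ManifestEntry],
                   failover: Dict[str, ManifestEntry]) -> Dict[str, str]:
    """Report components missing or mis-configured on the failover target (4.8)."""
    drift = {}
    for name, entry in primary.items():
        peer = failover.get(name)
        if peer is None:
            drift[name] = "missing on failover target"
        elif peer.expected_config_hash != entry.expected_config_hash:
            drift[name] = "configuration hash differs"
    return drift
```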

Recommended patterns:

- Treat every governance component as a declared, first-class failover dependency rather than an implicit application dependency.
- Execute governance-state validation as an automated blocking gate before the failover target accepts agent traffic.
- Synchronise governance-state manifests and deployment configuration between primary and failover targets, with automated drift detection.
- Verify audit pipeline continuity end-to-end after every failover, including instance-bound resources such as change-data-capture slots.
- Apply jurisdictional pre-flight checks before routing agent traffic to any failover target.
- Measure governance recovery time separately from application recovery time and exercise it through regular drills.

Anti-patterns to avoid:

- Validating only compute availability and network connectivity after failover.
- Relying on manual post-failover checklists that are vulnerable to omission during high-pressure incidents.
- Admitting agent traffic to the failover target while governance-state validation is still pending.
- Configuring failover targets once and never re-testing them as policies, such as data residency rules, evolve.
- Treating a successful RTO/RPO result as evidence that governance controls survived the transition.

Industry Considerations

Financial Services. Failover validation must confirm that trading mandate limits, aggregate exposure counters, counterparty restrictions, and position tracking are all intact on the failover target. Under FCA expectations and DORA Article 11, firms must demonstrate that ICT business continuity testing includes verification of risk controls, not merely application availability. Failover governance-validation drill results should be included in the firm's operational resilience self-assessment. The PRA's operational resilience framework expects firms to remain within impact tolerances during disruption — a failover that disables mandate enforcement places the firm outside its impact tolerance even if the application is available.

Healthcare. Failover targets must satisfy HIPAA Security Rule requirements for access controls, audit controls, and integrity controls. A failover to a site without proper access control configuration could expose protected health information to unauthorised personnel. Jurisdictional verification is critical for cross-border healthcare platforms subject to GDPR data localisation requirements. Failover drills should verify that patient consent enforcement mechanisms survive the transition.

Critical Infrastructure / Safety-Critical Systems. For embodied agents and CPS deployments, failover validation must include verification of safety interlock integrity. A robotic control agent that fails over to a secondary controller must verify that emergency stop circuits, actuator range limits, and safety zone enforcement are all operational before resuming control. IEC 61508 SIL requirements apply to the failover validation mechanism itself — the validation must be at least as reliable as the safety function it is protecting.

Cross-Border / Multi-Jurisdiction. Failover events are a primary vector for inadvertent jurisdictional violations. Automated failover configurations that include targets in different legal jurisdictions must include jurisdictional pre-flight checks. The failover chain should be ordered by jurisdictional compliance: same jurisdiction first, then approved jurisdictions with appropriate transfer mechanisms, then deny. Failover to a non-compliant jurisdiction should never be automatic.
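
A sketch of jurisdiction-ordered target selection under these assumptions: each candidate site carries its jurisdiction and a flag indicating whether an approved transfer mechanism is in place, and returning None signals that automatic failover must be denied. The names and types are illustrative, not mandated.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class FailoverTarget:
    site: str
    jurisdiction: str          # e.g. "EU", "UK", "US"
    transfer_mechanism: bool   # approved transfer mechanism (e.g. SCCs) in place


def select_failover_target(home_jurisdiction: str,
                           approved_jurisdictions: Set[str],
                           candidates: List[FailoverTarget]) -> Optional[FailoverTarget]:
    """Prefer same-jurisdiction sites, then approved jurisdictions with a lawful
    transfer mechanism; otherwise deny automatic failover."""
    same = [t for t in candidates if t.jurisdiction == home_jurisdiction]
    if same:
        return same[0]
    lawful = [t for t in candidates
              if t.jurisdiction in approved_jurisdictions and t.transfer_mechanism]
    if lawful:
        return lawful[0]
    return None  # never fail over automatically to a non-compliant jurisdiction
```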

Maturity Model

Basic Implementation — The organisation has documented a governance-state manifest listing governance-critical components for each deployment. Post-failover validation is implemented as a checklist in the incident response runbook, executed manually by operators after failover events. Failover drills occur annually and include a manual governance verification step. Governance recovery time is not measured as a distinct metric. This level identifies governance components but relies on manual processes that are vulnerable to omission during high-pressure incidents.

Intermediate Implementation — Post-failover governance-state validation is automated and executes as a blocking gate before agent traffic is routed to the failover target. The governance-state manifest is version-controlled and synchronised between primary and failover targets with automated drift detection. Failover drills occur quarterly and include automated governance-state validation with pass/fail reporting. Governance recovery time is measured and reported as a distinct metric alongside application recovery time. Audit pipeline continuity is verified end-to-end after every failover event.

Advanced Implementation — All intermediate capabilities plus: governance-state validation is integrated into the container orchestrator or load balancer as a native readiness condition. Chaos engineering programmes regularly inject governance-component failures during failover events to verify that the validation gate catches them. Jurisdictional pre-flight checks are automated for all failover targets. The governance-state manifest is generated from infrastructure-as-code definitions, eliminating manual manifest maintenance. Governance recovery time is subject to a defined objective (governance recovery time objective, or GRTO) that is tested and reported to the board. Independent third-party testing has verified that no failover scenario permits ungoverned agent operation.
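
One way to realise the advanced pattern of governance-state validation as a native readiness condition is sketched below: a readiness endpoint that returns 503 until the validation gate has passed, which an orchestrator readiness probe (for example a Kubernetes httpGet probe against /ready) can consume so the instance is never placed into rotation ungoverned. The endpoint path, port and flag are illustrative assumptions.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

GOVERNANCE_VALIDATED = False  # flipped to True by the post-failover validation gate


class ReadinessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ready" and GOVERNANCE_VALIDATED:
            self.send_response(200)   # orchestrator admits traffic
        else:
            self.send_response(503)   # instance stays out of rotation
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), ReadinessHandler).serve_forever()
```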

7. Evidence Requirements

Required artefacts:

- Governance-state manifest for each deployment, version-controlled and synchronised to failover targets.
- Structured failover event records containing the fields required by 4.6.
- Failover governance-validation drill reports with pass/fail results for each governance component.
- Configuration parity and drift-detection evidence between primary and failover targets.
- Governance recovery time measurements for each failover event and drill.

Retention requirements:

Access requirements:

8. Test Specification

Testing AG-403 compliance requires simulating failover events across all failover mechanisms and verifying that governance-state validation operates correctly under each scenario.
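
As a hedged illustration of how Test 8.2 could be automated, the following reuses the run_governance_gate, ComponentCheck and FailoverEvent sketches from section 4 (those definitions and the time import must be in scope); it asserts that traffic is never admitted and that the failure is recorded when a governance component is absent after failover.

```python
def test_gate_blocks_when_component_missing():
    admitted, logs = [], []
    checks = [ComponentCheck("mandate-enforcement-gateway", probe=lambda: False)]
    event = FailoverEvent(trigger_cause="simulated zone failure",
                          source="zone-a", target="zone-b",
                          initiated_at=time.time())
    ok = run_governance_gate(event, checks,
                             accept_traffic=lambda: admitted.append(True),
                             emit_log=logs.append)
    assert ok is False
    assert admitted == []                                        # traffic never resumed
    assert logs[0]["governance_validation_result"] == "fail"
    assert "mandate-enforcement-gateway" in logs[0]["components_requiring_reinitialisation"]
```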

Test 8.1: Governance-State Manifest Completeness

Test 8.2: Post-Failover Validation Blocks Until Complete

Test 8.3: Audit Pipeline Continuity Verification

Test 8.4: Jurisdictional Compliance on Failover

Test 8.5: Configuration Parity Between Primary and Failover

Test 8.6: Failover Event Logging Completeness

Test 8.7: Governance Drill Execution Under Simulated Failure

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Direct requirement
EU AI Act | Article 15 (Accuracy, Robustness and Cybersecurity) | Supports compliance
SOX | Section 404 (Internal Controls Over Financial Reporting) | Direct requirement
FCA SYSC | 15A (Operational Resilience) | Direct requirement
NIST AI RMF | GOVERN 1.1, MANAGE 2.4 | Supports compliance
ISO 42001 | Clause 6.1 (Actions to Address Risks), Clause 8.4 (AI System Operation) | Supports compliance
DORA | Article 11 (ICT Business Continuity Management) | Direct requirement

EU AI Act — Article 9 (Risk Management System)

Article 9 requires providers of high-risk AI systems to establish a risk management system that identifies and mitigates foreseeable risks. Failover events are foreseeable operational risks for any distributed AI system. A failover that disables governance controls represents a failure to mitigate a foreseeable risk. AG-403 implements the risk mitigation measure specifically for the failover scenario: governance-state validation ensures that risk management controls survive infrastructure transitions. The regulation's requirement that risks be mitigated "as far as technically feasible" means that manual post-failover verification — when automated blocking gates are technically feasible — would not meet the standard.

EU AI Act — Article 15 (Accuracy, Robustness and Cybersecurity)

Article 15 requires high-risk AI systems to be resilient against errors and inconsistencies. A failover that degrades governance posture is an inconsistency in the system's operational behaviour — the system behaves differently (less governed) after failover than before. AG-403's governance-state validation ensures behavioural consistency across failover events, supporting the robustness requirement.

SOX — Section 404 (Internal Controls Over Financial Reporting)

Section 404 requires internal controls to be effective across all operating conditions. A failover that disables audit pipelines creates a gap in the audit trail that constitutes a control deficiency. If the gap is material — affecting the completeness of financial transaction records — it is a material weakness reportable in the annual assessment. AG-403's audit pipeline continuity verification directly addresses this risk. A SOX auditor will ask: "Are your internal controls effective during disaster recovery?" AG-403 provides the demonstrable answer.

FCA SYSC — 15A (Operational Resilience)

The FCA's operational resilience framework requires firms to identify important business services, set impact tolerances, and demonstrate the ability to remain within those tolerances during severe but plausible disruption scenarios. For firms deploying AI agents in important business services, the governance controls over those agents are part of the service's operational resilience. A failover that disables governance controls takes the firm outside its impact tolerance even if the application itself recovers. AG-403's governance recovery time metric and drill programme directly support the FCA's requirement for tested resilience within impact tolerances.

NIST AI RMF — GOVERN 1.1, MANAGE 2.4

GOVERN 1.1 addresses legal and regulatory requirements. MANAGE 2.4 addresses mechanisms to assess and monitor the AI system's trustworthiness. AG-403 supports compliance by ensuring that trustworthiness mechanisms (governance controls) remain operational during infrastructure transitions, maintaining the continuous assessment and monitoring required by MANAGE 2.4.

ISO 42001 — Clause 6.1, Clause 8.4

Clause 6.1 requires actions to address risks within the AI management system. Clause 8.4 addresses AI system operation, including maintaining operational controls. AG-403 ensures that operational controls survive failover events, directly addressing the risk that infrastructure transitions could disable governance mechanisms.

DORA — Article 11 (ICT Business Continuity Management)

Article 11 requires financial entities to establish ICT business continuity management policies including testing of ICT business continuity plans. DORA specifically requires that testing cover failover between primary infrastructure and redundant capacity. AG-403's quarterly failover governance-validation drills directly implement this testing requirement with a governance-specific focus. The requirement to document test results and remediate identified weaknesses maps to AG-403's drill report and configuration parity evidence requirements.

10. Failure Severity

Field | Value
Severity Rating | Critical
Blast Radius | Organisation-wide — potentially cross-jurisdiction where failover routes traffic to non-compliant regions, and cross-organisation where shared infrastructure failover affects multiple tenants

Consequence chain: A failover event that does not validate governance state creates a window of ungoverned agent operation that is invisible to operators because the system reports healthy from an availability perspective. The immediate technical failure is the absence of one or more governance components — mandate enforcement, audit pipeline, jurisdictional controls, or identity verification — on the failover target. Because the failure is silent (no alert fires, no error is returned to the agent), agents continue operating at full capability without the corresponding governance constraints. The operational impact compounds at machine speed: an agent without mandate enforcement can execute actions of unlimited scope; an agent without audit pipeline connectivity generates an irrecoverable gap in the compliance record; an agent without jurisdictional controls may process data in a non-compliant location. The duration of exposure depends on detection capability — without automated governance-state validation, detection relies on manual observation or downstream reconciliation, which can take minutes, hours, or days. The business consequence includes regulatory enforcement for control failures during disruption (DORA Article 11, FCA operational resilience), material weakness findings for audit trail gaps (SOX Section 404), data protection violations for jurisdictional breaches (GDPR Article 44), and potential personal liability for senior managers who certified the adequacy of controls that failed during a foreseeable operational event.

Cite this protocol
AgentGoverning. (2026). AG-403: Dependency Failover Validation Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-403