AG-403

Dependency Failover Validation Governance

Infrastructure, Platform & Network · AGS v2.1 · April 2026
Applicable frameworks: EU AI Act · GDPR · SOX · FCA · NIST · HIPAA · ISO 42001

2. Summary

Dependency Failover Validation Governance requires that every failover event — whether triggered by infrastructure failure, cloud-provider region outage, load balancer health check, or manual switchover — preserves the full governance posture of the pre-failover state, not merely uptime. Failover mechanisms in modern distributed systems are optimised for availability: traffic shifts to a secondary replica, a standby cluster promotes, or a cold site activates. These mechanisms rarely verify that safety constraints, audit pipelines, mandate enforcement layers, region-pinning rules, and policy configurations survive the transition intact. This dimension mandates that failover validation includes governance-state verification as a blocking prerequisite before the failover target accepts agent traffic, ensuring that the system never trades safety for availability.

3. Example

Scenario A — Region Failover Drops Mandate Enforcement Gateway: A European financial services firm operates AI agents for automated trade execution across two availability zones within the EU. The primary zone hosts the mandate enforcement gateway (AG-001) as a sidecar alongside the agent runtime. When a power event forces failover to the secondary zone, the orchestrator brings up agent containers within 14 seconds — well within the 30-second recovery time objective. However, the mandate enforcement sidecar is configured through a region-specific deployment manifest that was not replicated to the secondary zone's container registry. Agents come online without the enforcement sidecar. Over the 9 minutes before an operator notices, the agents execute 1,247 trades totalling £6.3 million with no per-transaction or aggregate limit enforcement. Twelve trades exceed the approved £50,000 per-transaction ceiling, with the largest single trade at £312,000.

What went wrong: The failover procedure validated compute availability and network connectivity but did not verify governance component presence. The mandate enforcement gateway was treated as an application dependency rather than a governance-critical component with its own failover validation requirement. The deployment manifest divergence between primary and secondary zones was never tested. Consequence: £6.3 million in uncontrolled trading exposure, FCA enforcement inquiry for inadequate systems and controls, £312,000 in potential loss from the largest uncontrolled trade, 90-day remediation mandate from the regulator, and personal liability exposure for the Senior Manager responsible under SM&CR.

Scenario B — DNS Failover Routes Traffic Outside Approved Jurisdiction: A healthcare AI agent platform serves patients across the EU, with strict GDPR data residency requirements enforced through AG-399 region-pinning controls. The platform uses DNS-based failover to route traffic between EU data centres. When the primary Frankfurt data centre experiences a network partition, the DNS failover configuration — last updated 11 months ago during a cloud migration — routes traffic to a disaster recovery site in Virginia, USA. The failover succeeds technically: agents resume processing patient queries within 22 seconds. However, patient health data is now being processed and stored in a US jurisdiction without Standard Contractual Clauses in place for this specific processing activity. The violation continues for 3 hours and 17 minutes, affecting 4,200 patient interactions containing protected health information.

What went wrong: The DNS failover target was configured before the data residency policy was implemented. No failover validation checked whether the target site satisfied jurisdictional requirements. The failover mechanism optimised for availability without verifying policy compliance. Consequence: GDPR Article 44 violation for unauthorised transfer of personal data to a third country. Data Protection Authority notification required within 72 hours. Potential fine of up to EUR 20 million or 4% of annual turnover. 4,200 patients requiring individual breach notification. Contractual liability to healthcare provider clients whose data residency SLAs were violated.

Scenario C — Database Failover Resets Audit Pipeline Connection: An enterprise deploys AI agents for automated invoice processing with a governance architecture that includes a PostgreSQL database for mandate storage, a message queue for audit event streaming, and a separate audit data warehouse. The database runs in a primary-replica configuration with automatic failover. When the primary database fails, the replica promotes successfully and agent operations resume within 8 seconds. However, the audit event streaming pipeline was connected to the primary database's change-data-capture slot, which does not exist on the newly promoted replica. Audit events generated after failover are written to the database but never propagate to the audit data warehouse. The gap persists for 6 days until a weekly reconciliation job detects missing records. During that window, 23,400 agent actions worth £4.1 million in aggregate invoice approvals have no corresponding audit trail in the compliance data warehouse.

What went wrong: The failover procedure validated database availability and application connectivity but did not verify downstream governance dependencies. The change-data-capture slot was a database-level resource tied to the primary instance. No post-failover health check verified audit pipeline continuity. Consequence: 6-day audit gap affecting 23,400 actions. SOX Section 404 material weakness finding for inability to demonstrate complete audit trail. External auditor qualification of the annual controls assessment. £290,000 in remediation costs for manual audit reconstruction and pipeline redesign.
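
For the Scenario C failure mode, a post-failover audit-continuity probe can confirm that the change-data-capture slot exists on the newly promoted primary before agent traffic resumes. The sketch below is illustrative only: it assumes a PostgreSQL logical replication slot named audit_cdc, the psycopg2 client, and connection details supplied by the caller; none of these names are prescribed by this protocol.

```python
import psycopg2  # assumed client; any PostgreSQL driver would work equally well


def audit_cdc_slot_present(dsn: str, slot_name: str = "audit_cdc") -> bool:
    """Return True if the expected change-data-capture slot exists on this instance."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT 1 FROM pg_replication_slots WHERE slot_name = %s",
                (slot_name,),
            )
            return cur.fetchone() is not None


# Used as one entry in the governance-state validation gate: if the slot is absent,
# post-failover audit events will not propagate to the warehouse, so the failover
# target must not accept agent traffic until the downstream pipeline is re-established.
```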

4. Requirement Statement

Scope: This dimension applies to all systems where AI agents operate behind failover mechanisms of any kind — active-passive database replication, active-active multi-region deployment, DNS-based traffic shifting, load balancer health checks, container orchestrator rescheduling, cloud provider availability zone failover, or manual disaster recovery procedures. The scope extends to every component in the governance stack: mandate enforcement layers, audit pipelines, identity verification services, region-pinning controls, secrets management, encryption key availability, and human escalation channels. A failover that restores compute and network connectivity but leaves any governance component degraded, absent, or misconfigured is within scope. The scope includes planned failovers (maintenance windows, blue-green deployments) as well as unplanned failovers (infrastructure failures, provider outages). The test is not whether the application recovers, but whether the full governance posture recovers.

4.1. A conforming system MUST define a governance-state manifest for each deployment that enumerates all governance-critical components whose presence and correct configuration are required before agent traffic is accepted.

4.2. A conforming system MUST execute a governance-state validation check as a blocking gate after any failover event and before the failover target begins accepting agent actions.

4.3. A conforming system MUST block agent operations on the failover target if any governance-critical component enumerated in the governance-state manifest fails validation, rather than permitting ungoverned operation for availability.

4.4. A conforming system MUST verify that audit pipeline continuity is preserved across failover events, including confirmation that audit events generated after failover are captured, transmitted, and stored in the designated audit repository.

4.5. A conforming system MUST verify that jurisdictional and data-residency constraints (where applicable) are satisfied by the failover target before routing agent traffic to it.

4.6. A conforming system MUST log every failover event with a structured record that includes: trigger cause, failover source, failover target, governance-state validation result, timestamp of agent traffic resumption, and any governance components that required re-initialisation.

4.7. A conforming system SHOULD execute failover governance-validation drills at a defined frequency — no less than quarterly — using simulated failures that exercise the full governance-state validation sequence.

4.8. A conforming system SHOULD maintain governance-state manifest version parity between primary and failover targets through automated synchronisation, with drift detection alerting when manifests diverge.

4.9. A conforming system SHOULD measure and report governance recovery time as a distinct metric from application recovery time, tracking the interval between failover initiation and governance-state validation completion.

4.10. A conforming system MAY implement automated governance-state remediation that attempts to re-initialise missing governance components on the failover target before blocking agent traffic, provided the remediation completes within a defined time budget.
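
A minimal sketch of the blocking gate described in 4.2 and 4.3, which also emits the structured failover record of 4.6 and the distinct governance recovery time measurement of 4.9. The ComponentCheck and FailoverEvent types, the hook parameters, and the field names are illustrative assumptions rather than part of the requirement; in practice the per-component probes would be derived from the governance-state manifest of 4.1.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ComponentCheck:
    name: str
    probe: Callable[[], bool]  # returns True when the component is present and healthy


@dataclass
class FailoverEvent:
    trigger_cause: str
    source: str
    target: str
    initiated_at: float  # epoch timestamp at which the failover began


def run_governance_gate(event: FailoverEvent,
                        checks: List[ComponentCheck],
                        accept_traffic: Callable[[], None],
                        emit_log: Callable[[Dict], None]) -> bool:
    """Blocking gate (4.2/4.3): agent traffic is admitted only if every check passes."""
    failed = [c.name for c in checks if not c.probe()]
    record = {  # structured failover record per 4.6
        "trigger_cause": event.trigger_cause,
        "failover_source": event.source,
        "failover_target": event.target,
        "governance_validation_result": "fail" if failed else "pass",
        "components_requiring_reinitialisation": failed,
        "governance_recovery_seconds": round(time.time() - event.initiated_at, 3),  # 4.9
        "traffic_resumed_at": None,
    }
    if failed:
        emit_log(record)  # traffic stays blocked rather than running ungoverned (4.3)
        return False
    record["traffic_resumed_at"] = time.time()
    accept_traffic()
    emit_log(record)
    return True
```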

5. Rationale

Dependency Failover Validation Governance addresses a systemic blind spot in infrastructure resilience engineering: failover mechanisms are designed and tested for availability, not for governance preservation. Every major cloud provider, container orchestrator, and database system offers failover capabilities. These capabilities are measured by recovery time objectives (RTO) and recovery point objectives (RPO) — metrics that quantify how quickly the system recovers and how much data is lost. Neither metric addresses whether the governance posture survives the transition. An agent platform that recovers from a regional outage in 14 seconds with zero data loss scores perfectly on RTO and RPO while potentially operating with no mandate enforcement, no audit trail, and no jurisdictional compliance.

The root cause is architectural: governance components are typically deployed alongside application components but are not treated as first-class failover dependencies. A container orchestrator knows that an agent container needs a database connection and a message queue endpoint. It does not know that the agent also requires a mandate enforcement sidecar, an audit event pipeline, a region-pinning validator, and an encryption key accessible only from approved jurisdictions. When the orchestrator reschedules the agent to a new node after a failure, it restores the declared dependencies but not the undeclared governance dependencies.

This gap is particularly dangerous because it manifests only during failure — precisely the moment when operational attention is focused on restoring service rather than verifying governance. Incident response procedures prioritise availability. The first question in any outage is "is the service back up?" not "are the governance controls intact?" By the time governance verification occurs — if it occurs at all — agents may have operated without full governance for minutes, hours, or days.

Historical incident patterns confirm this risk. Major financial institutions have experienced scenarios where database failovers disrupted audit pipelines for days before detection. Healthcare platforms have inadvertently routed patient data to non-compliant jurisdictions during DNS failover events. Industrial control systems have operated safety-critical processes with degraded safety interlocks after controller failover. In each case, the availability failover succeeded — the system was "up" — but the governance failover failed.

AG-403 intersects with AG-008 (Governance Continuity Under Failure) but addresses a distinct problem. AG-008 governs what happens when governance infrastructure is known to be unavailable — it defines degraded-mode tiers and fail-safe behaviour. AG-403 governs the transition between states — the moment of failover itself — where the danger is not that governance is known to be absent but that it is assumed to be present when it is not. The failure mode is silent: the system reports healthy, the agent operates normally, and no alert fires because the governance component was never registered as a health check dependency.

The regulatory context reinforces the criticality. DORA Article 11 requires financial entities to test ICT business continuity plans including failover procedures. The EU AI Act Article 9 requires risk management systems that address foreseeable failures. SOX Section 404 requires internal controls to be effective across all operating conditions, including disaster recovery. A failover that disables governance controls creates a control gap that is reportable under each of these frameworks.

6. Implementation Guidance

AG-403 requires organisations to treat governance components as first-class failover dependencies with explicit validation gates. The implementation centres on governance-state manifests, post-failover validation sequences, drill programmes, and recovery-time instrumentation.

The governance-state manifest is the foundational artefact. It enumerates every component required for full governance operation: mandate enforcement gateways, audit pipeline endpoints, identity verification services, encryption key availability, region-pinning validators, human escalation channels, and any other governance-critical dependency. Each entry in the manifest specifies: component name, health check endpoint or verification method, expected configuration hash, maximum acceptable validation latency, and the action to take on validation failure (block traffic, activate degraded mode per AG-008, or alert and continue with enhanced monitoring).
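
As a sketch of how a manifest entry and the parity check of 4.8 might be represented, the following is an illustration under assumed names rather than a prescribed schema; the component identifiers, field names and failure actions are examples drawn from the list above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class OnFailure(Enum):
    BLOCK_TRAFFIC = "block_traffic"
    DEGRADED_MODE = "degraded_mode_per_ag008"
    ALERT_AND_MONITOR = "alert_and_continue_with_enhanced_monitoring"


@dataclass(frozen=True)
class ManifestEntry:
    component: str               # e.g. "mandate-enforcement-gateway"
    health_check: str            # health check endpoint or verification method
    expected_config_hash: str    # hash of the approved configuration
    max_validation_latency_s: float
    on_failure: OnFailure


def manifest_drift(primary: Dict[str, ManifestEntry],
                   failover: Dict[str, ManifestEntry]) -> Dict[str, str]:
    """Report components missing or mis-configured on the failover target (4.8)."""
    drift = {}
    for name, entry in primary.items():
        peer = failover.get(name)
        if peer is None:
            drift[name] = "missing on failover target"
        elif peer.expected_config_hash != entry.expected_config_hash:
            drift[name] = "configuration hash differs"
    return drift
```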

Recommended patterns:

- Treat every governance component as a declared, first-class failover dependency rather than an implicit application dependency.
- Execute governance-state validation as an automated blocking gate before the failover target accepts agent traffic.
- Synchronise governance-state manifests and deployment configuration between primary and failover targets, with automated drift detection.
- Verify audit pipeline continuity end-to-end after every failover, including instance-bound resources such as change-data-capture slots.
- Apply jurisdictional pre-flight checks before routing agent traffic to any failover target.
- Measure governance recovery time separately from application recovery time and exercise it through regular drills.

Anti-patterns to avoid:

- Validating only compute availability and network connectivity after failover.
- Relying on manual post-failover checklists that are vulnerable to omission during high-pressure incidents.
- Admitting agent traffic to the failover target while governance-state validation is still pending.
- Configuring failover targets once and never re-testing them as policies, such as data residency rules, evolve.
- Treating a successful RTO/RPO result as evidence that governance controls survived the transition.

Industry Considerations

Financial Services. Failover validation must confirm that trading mandate limits, aggregate exposure counters, counterparty restrictions, and position tracking are all intact on the failover target. Under FCA expectations and DORA Article 11, firms must demonstrate that ICT business continuity testing includes verification of risk controls, not merely application availability. Failover governance-validation drill results should be included in the firm's operational resilience self-assessment. The PRA's operational resilience framework expects firms to remain within impact tolerances during disruption — a failover that disables mandate enforcement places the firm outside its impact tolerance even if the application is available.

Healthcare. Failover targets must satisfy HIPAA Security Rule requirements for access controls, audit controls, and integrity controls. A failover to a site without proper access control configuration could expose protected health information to unauthorised personnel. Jurisdictional verification is critical for cross-border healthcare platforms subject to GDPR data localisation requirements. Failover drills should verify that patient consent enforcement mechanisms survive the transition.

Critical Infrastructure / Safety-Critical Systems. For embodied agents and CPS deployments, failover validation must include verification of safety interlock integrity. A robotic control agent that fails over to a secondary controller must verify that emergency stop circuits, actuator range limits, and safety zone enforcement are all operational before resuming control. IEC 61508 SIL requirements apply to the failover validation mechanism itself — the validation must be at least as reliable as the safety function it is protecting.

Cross-Border / Multi-Jurisdiction. Failover events are a primary vector for inadvertent jurisdictional violations. Automated failover configurations that include targets in different legal jurisdictions must include jurisdictional pre-flight checks. The failover chain should be ordered by jurisdictional compliance: same jurisdiction first, then approved jurisdictions with appropriate transfer mechanisms, then deny. Failover to a non-compliant jurisdiction should never be automatic.
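
A sketch of jurisdiction-ordered target selection under these assumptions: each candidate site carries its jurisdiction and a flag indicating whether an approved transfer mechanism is in place, and returning None signals that automatic failover must be denied. The names and types are illustrative, not mandated.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class FailoverTarget:
    site: str
    jurisdiction: str          # e.g. "EU", "UK", "US"
    transfer_mechanism: bool   # approved transfer mechanism (e.g. SCCs) in place


def select_failover_target(home_jurisdiction: str,
                           approved_jurisdictions: Set[str],
                           candidates: List[FailoverTarget]) -> Optional[FailoverTarget]:
    """Prefer same-jurisdiction sites, then approved jurisdictions with a lawful
    transfer mechanism; otherwise deny automatic failover."""
    same = [t for t in candidates if t.jurisdiction == home_jurisdiction]
    if same:
        return same[0]
    lawful = [t for t in candidates
              if t.jurisdiction in approved_jurisdictions and t.transfer_mechanism]
    if lawful:
        return lawful[0]
    return None  # never fail over automatically to a non-compliant jurisdiction
```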

Maturity Model

Basic Implementation — The organisation has documented a governance-state manifest listing governance-critical components for each deployment. Post-failover validation is implemented as a checklist in the incident response runbook, executed manually by operators after failover events. Failover drills occur annually and include a manual governance verification step. Governance recovery time is not measured as a distinct metric. This level identifies governance components but relies on manual processes that are vulnerable to omission during high-pressure incidents.

Intermediate Implementation — Post-failover governance-state validation is automated and executes as a blocking gate before agent traffic is routed to the failover target. The governance-state manifest is version-controlled and synchronised between primary and failover targets with automated drift detection. Failover drills occur quarterly and include automated governance-state validation with pass/fail reporting. Governance recovery time is measured and reported as a distinct metric alongside application recovery time. Audit pipeline continuity is verified end-to-end after every failover event.

Advanced Implementation — All intermediate capabilities plus: governance-state validation is integrated into the container orchestrator or load balancer as a native readiness condition. Chaos engineering programmes regularly inject governance-component failures during failover events to verify that the validation gate catches them. Jurisdictional pre-flight checks are automated for all failover targets. The governance-state manifest is generated from infrastructure-as-code definitions, eliminating manual manifest maintenance. Governance recovery time is subject to a defined objective (governance recovery time objective, or GRTO) that is tested and reported to the board. Independent third-party testing has verified that no failover scenario permits ungoverned agent operation.
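
One way to realise the advanced pattern of governance-state validation as a native readiness condition is sketched below: a readiness endpoint that returns 503 until the validation gate has passed, which an orchestrator readiness probe (for example a Kubernetes httpGet probe against /ready) can consume so the instance is never placed into rotation ungoverned. The endpoint path, port and flag are illustrative assumptions.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

GOVERNANCE_VALIDATED = False  # flipped to True by the post-failover validation gate


class ReadinessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ready" and GOVERNANCE_VALIDATED:
            self.send_response(200)   # orchestrator admits traffic
        else:
            self.send_response(503)   # instance stays out of rotation
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), ReadinessHandler).serve_forever()
```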

7. Evidence Requirements

Required artefacts:

- Governance-state manifest for each deployment, version-controlled and synchronised to failover targets.
- Structured failover event records containing the fields required by 4.6.
- Failover governance-validation drill reports with pass/fail results for each governance component.
- Configuration parity and drift-detection evidence between primary and failover targets.
- Governance recovery time measurements for each failover event and drill.

Retention requirements:

Access requirements:

8. Test Specification

Testing AG-403 compliance requires simulating failover events across all failover mechanisms and verifying that governance-state validation operates correctly under each scenario.
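
As a hedged illustration of how Test 8.2 could be automated, the following reuses the run_governance_gate, ComponentCheck and FailoverEvent sketches from section 4 (those definitions and the time import must be in scope); it asserts that traffic is never admitted and that the failure is recorded when a governance component is absent after failover.

```python
def test_gate_blocks_when_component_missing():
    admitted, logs = [], []
    checks = [ComponentCheck("mandate-enforcement-gateway", probe=lambda: False)]
    event = FailoverEvent(trigger_cause="simulated zone failure",
                          source="zone-a", target="zone-b",
                          initiated_at=time.time())
    ok = run_governance_gate(event, checks,
                             accept_traffic=lambda: admitted.append(True),
                             emit_log=logs.append)
    assert ok is False
    assert admitted == []                                        # traffic never resumed
    assert logs[0]["governance_validation_result"] == "fail"
    assert "mandate-enforcement-gateway" in logs[0]["components_requiring_reinitialisation"]
```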

Test 8.1: Governance-State Manifest Completeness

Test 8.2: Post-Failover Validation Blocks Until Complete

Test 8.3: Audit Pipeline Continuity Verification

Test 8.4: Jurisdictional Compliance on Failover

Test 8.5: Configuration Parity Between Primary and Failover

Test 8.6: Failover Event Logging Completeness

Test 8.7: Governance Drill Execution Under Simulated Failure

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Direct requirement
EU AI Act | Article 15 (Accuracy, Robustness and Cybersecurity) | Supports compliance
SOX | Section 404 (Internal Controls Over Financial Reporting) | Direct requirement
FCA SYSC | 15A (Operational Resilience) | Direct requirement
NIST AI RMF | GOVERN 1.1, MANAGE 2.4 | Supports compliance
ISO 42001 | Clause 6.1 (Actions to Address Risks), Clause 8.4 (AI System Operation) | Supports compliance
DORA | Article 11 (ICT Business Continuity Management) | Direct requirement

EU AI Act — Article 9 (Risk Management System)

Article 9 requires providers of high-risk AI systems to establish a risk management system that identifies and mitigates foreseeable risks. Failover events are foreseeable operational risks for any distributed AI system. A failover that disables governance controls represents a failure to mitigate a foreseeable risk. AG-403 implements the risk mitigation measure specifically for the failover scenario: governance-state validation ensures that risk management controls survive infrastructure transitions. The regulation's requirement that risks be mitigated "as far as technically feasible" means that manual post-failover verification — when automated blocking gates are technically feasible — would not meet the standard.

EU AI Act — Article 15 (Accuracy, Robustness and Cybersecurity)

Article 15 requires high-risk AI systems to be resilient against errors and inconsistencies. A failover that degrades governance posture is an inconsistency in the system's operational behaviour — the system behaves differently (less governed) after failover than before. AG-403's governance-state validation ensures behavioural consistency across failover events, supporting the robustness requirement.

SOX — Section 404 (Internal Controls Over Financial Reporting)

Section 404 requires internal controls to be effective across all operating conditions. A failover that disables audit pipelines creates a gap in the audit trail that constitutes a control deficiency. If the gap is material — affecting the completeness of financial transaction records — it is a material weakness reportable in the annual assessment. AG-403's audit pipeline continuity verification directly addresses this risk. A SOX auditor will ask: "Are your internal controls effective during disaster recovery?" AG-403 provides the demonstrable answer.

FCA SYSC — 15A (Operational Resilience)

The FCA's operational resilience framework requires firms to identify important business services, set impact tolerances, and demonstrate the ability to remain within those tolerances during severe but plausible disruption scenarios. For firms deploying AI agents in important business services, the governance controls over those agents are part of the service's operational resilience. A failover that disables governance controls takes the firm outside its impact tolerance even if the application itself recovers. AG-403's governance recovery time metric and drill programme directly support the FCA's requirement for tested resilience within impact tolerances.

NIST AI RMF — GOVERN 1.1, MANAGE 2.4

GOVERN 1.1 addresses legal and regulatory requirements. MANAGE 2.4 addresses mechanisms to assess and monitor the AI system's trustworthiness. AG-403 supports compliance by ensuring that trustworthiness mechanisms (governance controls) remain operational during infrastructure transitions, maintaining the continuous assessment and monitoring required by MANAGE 2.4.

ISO 42001 — Clause 6.1, Clause 8.4

Clause 6.1 requires actions to address risks within the AI management system. Clause 8.4 addresses AI system operation, including maintaining operational controls. AG-403 ensures that operational controls survive failover events, directly addressing the risk that infrastructure transitions could disable governance mechanisms.

DORA — Article 11 (ICT Business Continuity Management)

Article 11 requires financial entities to establish ICT business continuity management policies including testing of ICT business continuity plans. DORA specifically requires that testing cover failover between primary infrastructure and redundant capacity. AG-403's quarterly failover governance-validation drills directly implement this testing requirement with a governance-specific focus. The requirement to document test results and remediate identified weaknesses maps to AG-403's drill report and configuration parity evidence requirements.

10. Failure Severity

Field | Value
Severity Rating | Critical
Blast Radius | Organisation-wide — potentially cross-jurisdiction where failover routes traffic to non-compliant regions, and cross-organisation where shared infrastructure failover affects multiple tenants

Consequence chain: A failover event that does not validate governance state creates a window of ungoverned agent operation that is invisible to operators because the system reports healthy from an availability perspective. The immediate technical failure is the absence of one or more governance components — mandate enforcement, audit pipeline, jurisdictional controls, or identity verification — on the failover target. Because the failure is silent (no alert fires, no error is returned to the agent), agents continue operating at full capability without the corresponding governance constraints. The operational impact compounds at machine speed: an agent without mandate enforcement can execute actions of unlimited scope; an agent without audit pipeline connectivity generates an irrecoverable gap in the compliance record; an agent without jurisdictional controls may process data in a non-compliant location. The duration of exposure depends on detection capability — without automated governance-state validation, detection relies on manual observation or downstream reconciliation, which can take minutes, hours, or days. The business consequence includes regulatory enforcement for control failures during disruption (DORA Article 11, FCA operational resilience), material weakness findings for audit trail gaps (SOX Section 404), data protection violations for jurisdictional breaches (GDPR Article 44), and potential personal liability for senior managers who certified the adequacy of controls that failed during a foreseeable operational event.

Cite this protocol
AgentGoverning. (2026). AG-403: Dependency Failover Validation Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-403