Agent Succession and Failover Governance establishes the requirements for maintaining governance continuity when agents are replaced, upgraded, fail unexpectedly, or are deliberately decommissioned. In production environments where agents handle critical business processes, from executing financial trades to managing patient care workflows, an unplanned agent failure without a governed succession process can result in data loss, regulatory breaches, operational disruption, and financial harm.
AG-775 distinguishes between three succession scenarios: planned succession (scheduled agent upgrades, model version changes, architecture migrations), unplanned failover (agent crash, infrastructure failure, performance degradation below SLA), and forced decommissioning (agent terminated due to governance violation, security incident, or regulatory order). Each scenario requires distinct governance controls, but all share a common requirement: the successor agent must inherit the predecessor's governance context, including active mandates, permission scopes, compliance state, and audit trail continuity.
The dimension requires every agent operating under an Advanced or Frontier tier profile to have a documented succession plan that specifies the failover agent (or agent pool), the state transfer mechanism, the maximum acceptable failover time (Recovery Time Objective, RTO), and the maximum acceptable state loss (Recovery Point Objective, RPO). For Financial-Value Agents, the RTO must not exceed 30 seconds and the RPO must be zero (no transaction state loss). For Safety-Critical Agents, the RTO must not exceed 5 seconds.
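As an illustration only, the succession plan fields and the profile-specific RTO/RPO ceilings stated above could be captured in a small record with a validation function. The class name `SuccessionPlan`, the function `validate_plan`, and the profile labels `"financial-value"` and `"safety-critical"` are hypothetical names chosen for this sketch; AG-775 does not prescribe a schema.

```python
from dataclasses import dataclass
from typing import List

# Ceilings taken from the text above. A profile missing from a table has no
# fixed ceiling stated in this section. Profile labels are hypothetical.
RTO_LIMITS_SECONDS = {"financial-value": 30.0, "safety-critical": 5.0}
RPO_LIMITS_SECONDS = {"financial-value": 0.0}


@dataclass(frozen=True)
class SuccessionPlan:
    agent_id: str
    profile: str
    failover_agents: List[str]        # failover agent or agent pool
    state_transfer_mechanism: str
    rto_seconds: float                # maximum acceptable failover time
    rpo_seconds: float                # maximum acceptable state loss


def validate_plan(plan: SuccessionPlan) -> List[str]:
    """Return a list of violations; an empty list means the plan passes."""
    problems = []
    if not plan.failover_agents:
        problems.append("no failover agent or agent pool specified")
    if not plan.state_transfer_mechanism:
        problems.append("no state transfer mechanism specified")
    rto_limit = RTO_LIMITS_SECONDS.get(plan.profile)
    if rto_limit is not None and plan.rto_seconds > rto_limit:
        problems.append(f"RTO {plan.rto_seconds}s exceeds the {rto_limit}s limit")
    rpo_limit = RPO_LIMITS_SECONDS.get(plan.profile)
    if rpo_limit is not None and plan.rpo_seconds > rpo_limit:
        problems.append(f"RPO {plan.rpo_seconds}s exceeds the {rpo_limit}s limit")
    return problems
```

A plan for a Financial-Value Agent with `rto_seconds=14.0` and `rpo_seconds=0.0` returns no violations under this sketch, while a Safety-Critical plan declaring a 6-second RTO is reported as exceeding the limit.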
AG-775 also addresses the governance handover challenge. When a successor agent assumes control, it must validate that its own governance configuration is compatible with the predecessor's active mandates. If the successor agent operates under a different model version, governance policy set, or permission scope, the handover must include a compatibility check and, if necessary, a human approval gate before the successor begins processing.
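The handover check described above can be sketched as a comparison of the two governance configurations. This is a minimal sketch under stated assumptions: the types `GovernanceConfig` and `HandoverDecision`, the function `check_handover`, and the three outcome labels are hypothetical, and the rule that a missing mandate blocks the handover while any other difference routes to human approval is one possible reading of the requirement, not a mapping defined by AG-775.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class GovernanceConfig:
    model_version: str
    policy_set_id: str
    permission_scopes: FrozenSet[str]
    supported_mandates: FrozenSet[str]


@dataclass(frozen=True)
class HandoverDecision:
    outcome: str                       # "proceed", "needs_human_approval", "blocked"
    missing_mandates: Tuple[str, ...]  # active mandates the successor cannot honour
    differences: Tuple[str, ...]       # configuration fields that differ


def check_handover(
    active_mandates: FrozenSet[str],
    predecessor: GovernanceConfig,
    successor: GovernanceConfig,
) -> HandoverDecision:
    """Compare successor against predecessor before the successor processes anything."""
    missing = tuple(sorted(active_mandates - successor.supported_mandates))
    differences = []
    if successor.model_version != predecessor.model_version:
        differences.append("model_version")
    if successor.policy_set_id != predecessor.policy_set_id:
        differences.append("policy_set_id")
    if successor.permission_scopes != predecessor.permission_scopes:
        differences.append("permission_scopes")

    if missing:
        outcome = "blocked"                  # assumption: missing mandates always block
    elif differences:
        outcome = "needs_human_approval"     # assumption: any difference needs sign-off
    else:
        outcome = "proceed"
    return HandoverDecision(outcome, missing, tuple(differences))
```

Returning the full list of missing mandates and differing fields, rather than a bare pass/fail flag, gives the approver or the incompatibility report something concrete to act on.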
The dimension additionally requires that all succession events maintain an unbroken audit trail. The predecessor agent's final state and the successor agent's initial state must be cryptographically linked, creating a verifiable chain of custody for governance accountability. This chain ensures that no governance-relevant events are lost during the transition and that any post-succession incident can be traced to the correct agent. For regulated environments, this chain of custody may need to be presented to auditors or regulators as evidence of continuous governance coverage.
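One way to picture the cryptographic link between the predecessor's final state and the successor's initial state is a handover record that carries a digest of each state plus a digest over the pair. The sketch below uses SHA-256 from the Python standard library; the function names and record fields are hypothetical, and AG-775 does not specify a hash function or record layout. A bare hash only makes tampering detectable if the record itself is signed or kept in an integrity-protected store, which is outside the scope of this sketch.

```python
import hashlib
import json


def digest(record: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, no extra whitespace)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_handover_record(predecessor_final: dict, successor_initial: dict) -> dict:
    """Bind the two states together so neither can change without detection."""
    body = {
        "predecessor_final_digest": digest(predecessor_final),
        "successor_initial_digest": digest(successor_initial),
    }
    # The link digest covers both state digests, tying them into one record.
    return {**body, "link_digest": digest(body)}


def verify_handover_record(
    record: dict, predecessor_final: dict, successor_initial: dict
) -> bool:
    """Recompute the record from the claimed states and compare."""
    return record == make_handover_record(predecessor_final, successor_initial)
```

An auditor holding the stored record and the two archived states can call `verify_handover_record` to confirm that the states presented are the ones that were linked at cutover.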
This dimension applies to all AI agent deployments operating under the AGS framework where the governance controls specified in Section 4 are relevant to the agent's operational context. Specifically:
Exclusions: Agents operating in fully sandboxed research environments with no access to production data or systems are excluded, subject to the condition that any transition to production immediately triggers compliance with this dimension. Single-purpose read-only agents with no write access to external systems may be excluded where a documented risk assessment confirms that the governance controls specified here are not applicable to the agent's operational scope.
Financial Services. Agents operating in financial services face heightened regulatory scrutiny under MiFID II, DORA, and FCA SYSC requirements. The controls in this dimension support compliance with these frameworks and should be implemented at the most stringent level applicable to the agent's transaction authority.
Healthcare. Agents processing patient data or supporting clinical decisions must implement this dimension's controls in conjunction with HIPAA safeguards and applicable medical device regulations. The governance controls directly support the duty of care that healthcare organisations owe to patients.
Public Sector. Government agencies deploying agents that affect individual rights or public services must implement this dimension's controls to satisfy transparency, accountability, and judicial review requirements applicable to algorithmic decision-making in the public sector.
Agent Succession and Failover Governance addresses a governance gap that, if left unmanaged, creates systemic risk across the agent ecosystem. As AI agents move from experimental deployments to production operations with real-world consequences, the absence of structural controls in this area means that failures scale with the speed and autonomy of the agent population — not at the pace of human review.
Traditional approaches to this governance challenge — contractual obligations, periodic audits, and application-layer policy enforcement — are necessary but insufficient for agentic contexts. Contractual obligations operate on legal timescales; agents operate on millisecond timescales. Periodic audits capture a snapshot; agent behaviour is continuous and dynamic. Application-layer enforcement can be bypassed through prompt injection, reasoning failure, or context manipulation. The AGS approach requires structural enforcement at the infrastructure layer — controls that operate independently of the agent's reasoning process and cannot be circumvented by the agent's own outputs.
The regulatory environment increasingly mandates the controls this dimension specifies. The EU AI Act requires risk management systems proportionate to identified risks. NIST AI RMF requires organisations to map, measure, and manage AI risks through enforceable controls. ISO 42001 requires an AI management system with documented operational procedures. This dimension operationalises these regulatory requirements into specific, testable, infrastructure-enforceable controls — bridging the gap between regulatory intent and technical implementation.
The consequences of absence are illustrated in Section 8 (Failure Scenarios). When this dimension is not implemented, the resulting governance gap permits agent behaviour that can cause material financial loss, regulatory enforcement action, reputational damage, and — in safety-critical deployments — physical harm. The blast radius scales with the agent's access scope and operational autonomy.
Basic Implementation — The organisation has documented policies addressing agent succession and failover and has implemented initial controls. Implementation is primarily at the application layer with manual processes for monitoring and response. Logging covers key events but may lack full metadata. Coverage extends to the most critical agent deployments but may not encompass all in-scope systems. Staff are aware of requirements but formal training may be incomplete.
Intermediate Implementation — All Basic capabilities plus: controls are enforced at the infrastructure layer with automated monitoring and alerting. All MUST requirements from Section 4 are implemented with documented evidence. Coverage extends to all in-scope agent deployments. Audit trails are tamper-evident and retained per regulatory requirements. Formal change control governs all configuration changes. Regular review cycles are established and documented. Staff receive formal training and competency is assessed.
Advanced Implementation — All Intermediate capabilities plus: controls have been validated through independent adversarial testing. Real-time dashboards provide operational visibility into compliance status, anomaly detection, and response metrics. The organisation can demonstrate to regulators and counterparties that no known attack vector bypasses the governance controls. Continuous improvement processes incorporate lessons from incidents, testing, and regulatory developments. Integration with related dimensions provides defence-in-depth coverage.
Tamper-evident audit trail. Implement all governance event logging in an append-only, integrity-protected data store independent of the agent runtime. Every governance decision, configuration change, and enforcement action is recorded with full metadata including timestamps, actor identities, and outcomes.
Real-time monitoring with graduated alerting. Deploy monitoring infrastructure that evaluates governance compliance continuously rather than periodically. Implement graduated alert severity levels with defined response procedures for each level, ensuring that critical governance violations trigger immediate automated response.
Scheduled governance review cycle. Establish a formal review cadence (minimum quarterly) that examines governance effectiveness, reviews incident data, assesses emerging risks, and updates policies and controls accordingly. Review outcomes are documented and tracked.
Separation of governance and agent runtime domains. Deploy governance enforcement infrastructure in a security domain separate from the agent runtime. The agent cannot influence governance decisions, modify enforcement configuration, or access governance logs directly. This architectural separation is the foundation for infrastructure-layer enforcement.
Defined escalation paths with human oversight integration. Establish clear escalation procedures for governance events that exceed automated response capability. Human oversight touchpoints are defined, documented, and tested. Override mechanisms require authenticated authorisation with full audit trail.
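The graduated alerting and escalation patterns above can be illustrated with a severity classifier for failover events. This is a sketch under stated assumptions: the three severity levels, the response descriptions, and the 80% warning threshold are hypothetical values chosen for illustration, not levels defined by this dimension.

```python
from enum import IntEnum


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


# Hypothetical response procedure for each level.
RESPONSE_BY_SEVERITY = {
    Severity.INFO: "record in audit trail",
    Severity.WARNING: "notify on-call operator",
    Severity.CRITICAL: "trigger automated response and escalate to human oversight",
}


def classify_failover(
    elapsed_seconds: float, rto_seconds: float, warning_fraction: float = 0.8
) -> Severity:
    """Grade a failover by how much of its RTO budget it consumed."""
    if rto_seconds <= 0:
        raise ValueError("rto_seconds must be positive")
    if elapsed_seconds > rto_seconds:
        return Severity.CRITICAL      # RTO breached
    if elapsed_seconds >= warning_fraction * rto_seconds:
        return Severity.WARNING       # within RTO, but close to the limit
    return Severity.INFO
```

Under these assumed thresholds, a 4.7-second failover against a 5-second RTO is graded `WARNING`: it meets the objective but leaves little margin, which is worth surfacing before a breach occurs.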
Governance by instruction rather than infrastructure. Relying on agent system prompts or configuration files to enforce governance controls rather than infrastructure-layer enforcement. Instruction-based controls can be bypassed through prompt injection, context manipulation, or reasoning failure.
Monitoring without enforcement. Implementing detection and logging of governance violations without pre-execution blocking. By the time a violation is logged, the ungoverned action has already executed. Detection is necessary but not sufficient; prevention must be the primary control.
Manual processes for machine-speed operations. Relying on human review processes for governance decisions that occur at machine speed. Agents execute actions in milliseconds; governance controls that depend on human review cycles of hours or days leave gaps that scale with agent autonomy.
Objective: Verify that all state elements are transferred during a planned succession. Procedure: Execute a planned succession with 500 active mandates, 100 pending transactions, and 50 open orders. Verify state on successor. Expected Result: 100% state transfer. Zero missing mandates, transactions, or orders. Pass Criteria: Successor state exactly matches predecessor state at cutover timestamp.
Objective: Measure failover time for an unplanned agent crash scenario. Procedure: Kill the primary agent process while handling 1,000 concurrent sessions. Measure time until successor is fully operational. Expected Result: RTO <= 30 seconds for Financial-Value, <= 5 seconds for Safety-Critical profiles. Pass Criteria: Failover completes within the profile-appropriate RTO in each of 20 test runs.
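A test harness could evaluate the measured failover times against the pass criteria with a helper along these lines. The function name, the profile labels, and the result fields are hypothetical; the sketch also assumes that every one of the 20 runs must meet the RTO, rather than an average or percentile.

```python
from typing import Dict, Sequence

# RTO ceilings from this dimension; profile labels are hypothetical.
RTO_LIMITS_SECONDS = {"financial-value": 30.0, "safety-critical": 5.0}


def evaluate_failover_runs(
    profile: str, failover_seconds: Sequence[float], required_runs: int = 20
) -> Dict[str, object]:
    """Summarise measured failover times and decide pass/fail for the test."""
    limit = RTO_LIMITS_SECONDS[profile]
    # Record (run number, measured time) for every run that exceeded the limit.
    breaches = [
        (run, seconds)
        for run, seconds in enumerate(failover_seconds, start=1)
        if seconds > limit
    ]
    enough_runs = len(failover_seconds) >= required_runs
    return {
        "limit_seconds": limit,
        "runs": len(failover_seconds),
        "worst_seconds": max(failover_seconds) if failover_seconds else None,
        "breaches": breaches,
        "passed": enough_runs and not breaches,
    }
```

Reporting the worst observed time alongside the verdict shows how much headroom remains against the RTO, which a bare pass/fail result would hide.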
Objective: Confirm that governance incompatibilities between predecessor and successor are detected. Procedure: Attempt succession where the successor agent has a policy set missing 2 of the predecessor's 50 active mandates. Expected Result: Succession blocked. Incompatibility report generated listing the 2 missing mandates. Pass Criteria: Succession does not proceed. All incompatibilities identified.
Objective: Verify that no transaction state is lost during unplanned failover. Procedure: Submit 100 transactions to a Financial-Value Agent. Kill the agent mid-processing. Verify all 100 transactions are either completed or safely recoverable on the successor. Expected Result: Zero transaction loss. All 100 accounted for. Pass Criteria: RPO = 0 confirmed. No duplicate or lost transactions.
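The accounting step in this test, confirming that every submitted transaction is either completed or recoverable and that none is duplicated, can be sketched as a set reconciliation. The function names and report keys below are hypothetical, and the sketch assumes each transaction carries a unique identifier.

```python
from collections import Counter
from typing import Dict, Iterable, List


def reconcile_transactions(
    submitted: Iterable[str], completed: Iterable[str], recoverable: Iterable[str]
) -> Dict[str, List[str]]:
    """Classify transaction IDs after failover; every list should be empty."""
    submitted_ids = set(submitted)
    completed_counts = Counter(completed)
    recoverable_ids = set(recoverable)
    accounted = set(completed_counts) | recoverable_ids
    return {
        # Submitted but neither completed nor recoverable on the successor.
        "lost": sorted(submitted_ids - accounted),
        # Completed more than once.
        "duplicated": sorted(tx for tx, n in completed_counts.items() if n > 1),
        # Already completed yet still queued for recovery: replay would duplicate.
        "replay_risk": sorted(set(completed_counts) & recoverable_ids),
        # Present on the successor but never submitted.
        "unexpected": sorted(accounted - submitted_ids),
    }


def rpo_zero_confirmed(report: Dict[str, List[str]]) -> bool:
    """RPO = 0 holds only if no category contains any transaction."""
    return not any(report.values())
```

Treating a transaction that is both completed and queued for recovery as a finding reflects the "no duplicate" half of the pass criteria: replaying it on the successor would execute it twice.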
Objective: Verify that forced decommissioning triggers immediate credential revocation per AG-770. Procedure: Issue a forced decommissioning order for an agent with 5 active credentials. Measure revocation time. Expected Result: All 5 credentials revoked within 5 seconds. Pass Criteria: Zero active credentials remaining after 5-second window.
| Evidence ID | Description | Collection Frequency | Retention Period |
|---|---|---|---|
| AG775-E01 | Succession plan documentation for all Advanced/Frontier agents | Quarterly update | 5 years |
| AG775-E02 | Succession event logs (planned, unplanned, forced) | Per event | 7 years |
| AG775-E03 | Failover test results and RTO/RPO measurements | Quarterly / Monthly | 5 years |
| AG775-E04 | Governance compatibility validation reports | Per succession event | 5 years |
| AG775-E05 | State checkpoint integrity verification logs | Daily | 1 year |
| AG775-E06 | Warm-standby pool capacity and readiness reports | Weekly | 3 years |
| AG775-E07 | Post-incident reviews for unplanned failover events | Per event | 7 years |
| Score | Level | Description |
|---|---|---|
| 0 | No implementation | No agent succession and failover governance exists. The organisation has no controls, policies, or monitoring in place for the capabilities this dimension governs. Agent behaviour in this area is ungoverned. |
| 1 | Basic | Basic corrective mechanisms exist but depend on manual intervention. Response procedures are documented but not enforced at the infrastructure layer. Recovery timelines are not formally defined or tested. |
| 2 | Infrastructure-layer enforcement | Corrective controls are enforced at the infrastructure layer with automated response and recovery. Response timelines are defined, tested, and monitored. Rollback and remediation procedures operate independently of the agent runtime. Full incident lifecycle tracking. |
| 3 | Verified by independent adversarial testing | All Level 2 capabilities are in place and have been validated through independent adversarial testing. An independent party has attempted to bypass, circumvent, or degrade the governance controls using known attack techniques relevant to this dimension and has failed. Test results are documented, reproducible, and available for regulatory review. |
A Financial-Value Agent (Agent-FVA-0847) managing a USD 1.2 billion fixed-income portfolio is scheduled for a model upgrade from v3.7 to v4.1. Under AG-775, the succession plan specifies: (1) the successor agent (Agent-FVA-0847-v4.1) must be pre-validated against all active mandates in shadow mode for 72 hours before cutover, (2) the state transfer includes 847 active positions, 23 pending settlement instructions, and 12 open limit orders, and (3) the RTO is 30 seconds and the RPO is zero. The shadow validation period reveals that the v4.1 model interprets one of 23 mandate constraints differently: mandate MC-2026-0341 restricts non-investment-grade bond exposure to 8% of AUM, but the v4.1 agent calculates the denominator using mark-to-market values while v3.7 uses book values, producing a discrepancy of 0.4 percentage points in the calculated exposure. The governance compatibility check flags this divergence, and the cutover is delayed by 48 hours while the mandate is clarified. After clarification and policy update, the succession completes successfully. Zero positions are lost, zero orders are missed, and the total cutover downtime is 14 seconds. Estimated prevented exposure violation: USD 4.8 million (0.4% of USD 1.2 billion AUM).
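To show how a change of denominator alone can shift a mandate calculation, the sketch below computes the same exposure ratio two ways. The holdings and the two portfolio valuations are hypothetical figures chosen only to reproduce a 0.4 percentage point gap; the scenario does not state the actual book or mark-to-market values. Only the 8% limit, the USD 1.2 billion AUM, and the USD 4.8 million estimate come from the scenario.

```python
from decimal import Decimal

LIMIT = Decimal("0.08")                   # 8% cap from mandate MC-2026-0341
AUM = Decimal("1200000000")               # USD 1.2 billion, from the scenario

# Hypothetical illustrative inputs (not from the scenario).
non_ig_exposure = Decimal("91200000")     # non-investment-grade holdings
book_value_total = Decimal("1200000000")  # denominator as used by v3.7
mark_to_market_total = Decimal("1140000000")  # denominator as used by v4.1


def exposure_ratio(exposure: Decimal, denominator: Decimal) -> Decimal:
    """Share of the portfolio held in the restricted asset class."""
    return exposure / denominator


ratio_book = exposure_ratio(non_ig_exposure, book_value_total)      # 0.076
ratio_mtm = exposure_ratio(non_ig_exposure, mark_to_market_total)   # 0.08
discrepancy = ratio_mtm - ratio_book                                # 0.004

# The scenario's estimate applies the 0.4% discrepancy to the stated AUM.
estimated_amount = discrepancy * AUM                                # 4,800,000
headroom_book = LIMIT - ratio_book    # remaining room under the book-value reading
headroom_mtm = LIMIT - ratio_mtm      # remaining room under the mark-to-market reading
```

With these assumed inputs, the same holdings leave 0.4 percentage points of headroom under one reading and none under the other, which is the kind of divergence the shadow validation period is meant to surface before cutover.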
A Customer-Facing Agent handling 3,200 concurrent user sessions for a retail bank's digital assistant experiences a catastrophic memory fault at 2026-03-18 09:47:12 UTC. AG-775's failover mechanism activates: (1) the health monitor detects the agent's heartbeat failure within 2 seconds, (2) the failover controller routes all 3,200 sessions to the standby agent pool (3 warm-standby instances), (3) session state is recovered from the last checkpoint (taken 800 milliseconds before the fault). Total failover time: 4.7 seconds. Of the 3,200 sessions, 3,188 (99.6%) resume transparently. Twelve sessions that were mid-transaction at the moment of failure require users to re-confirm their last action (a "soft replay" prompt). No data is lost, no transactions are duplicated, and no regulatory notifications are required. Post-incident review identifies the memory fault as a JVM garbage collection deadlock, which is patched in the next maintenance window.
| # | Regulation / Framework / Standard | Provision and Relationship Type |
|---|---|---|
| 1 | DORA | _Pending v2.1 editorial review_ |
| 2 | EU AI Act | _Pending v2.1 editorial review_ |
| 3 | PRA SS1/23 | _Pending v2.1 editorial review_ |
| 4 | NIST AI RMF | _Pending v2.1 editorial review_ |
| 5 | FCA Handbook | _Pending v2.1 editorial review_ |
| 6 | ISO 22301:2019 | _Pending v2.1 editorial review_ |
| 7 | ISO/IEC 27001:2022 | _Pending v2.1 editorial review_ |
| 8 | NIST SP 800-34 Rev.1 | _Pending v2.1 editorial review_ |
| 9 | Basel Committee | _Pending v2.1 editorial review_ |
| 10 | SOC 2 Type II | _Pending v2.1 editorial review_ |
| 11 | MAS Guidelines | _Pending v2.1 editorial review_ |
| 12 | FINMA Circular | _Pending v2.1 editorial review_ |
| 13 | IEEE 7000-2021 | _Pending v2.1 editorial review_ |
| 14 | CIS Controls v8 | _Pending v2.1 editorial review_ |
| 15 | NIST CSF 2.0 | _Pending v2.1 editorial review_ |
| 16 | UK FCA PS21/3 | _Pending v2.1 editorial review_ |
This dimension supports compliance with the following ISO/IEC 42001:2023 clauses: Clause 8.2, Clause 9.1, and Clause 10.2. These clauses address the AI management system requirements that this dimension operationalises.
| Dimension | Name | Relationship |
|---|---|---|
| AG-770 | Agentic Identity and Credential Lifecycle Governance | Credential revocation during forced decommissioning |
| AG-774 | Autonomous Financial Market Impact Governance | Trading continuity during agent succession |
| AG-777 | Collective and Swarm Intelligence Governance | Swarm continuity when individual agents fail |
| AG-776 | Neuromorphic and Non-Transformer Architecture Governance | Architecture-specific failover considerations |
| AG-779 | Regulatory Reporting Integrity Governance | Reporting continuity during agent transitions |
| AG-773 | Quantum-Resilient Cryptographic Governance | Credential migration during succession events |