AG-775

Agent Succession and Failover Governance

Human Oversight and Control Governance · AGS v2.1 · 2026-04-25
EU AI Act · NIST AI RMF · ISO 42001

1. Definition

Agent Succession and Failover Governance establishes the requirements for maintaining governance continuity when agents are replaced, upgraded, fail unexpectedly, or are deliberately decommissioned. In production environments where agents handle critical business processes -- from executing financial trades to managing patient care workflows -- an unplanned agent failure without a governed succession process can result in data loss, regulatory breaches, operational disruption, and financial harm.

AG-775 distinguishes between three succession scenarios: planned succession (scheduled agent upgrades, model version changes, architecture migrations), unplanned failover (agent crash, infrastructure failure, performance degradation below SLA), and forced decommissioning (agent terminated due to governance violation, security incident, or regulatory order). Each scenario requires distinct governance controls, but all share a common requirement: the successor agent must inherit the predecessor's governance context, including active mandates, permission scopes, compliance state, and audit trail continuity.

The dimension mandates that every agent operating in Advanced or Frontier tier profiles must have a documented succession plan that specifies the failover agent (or agent pool), the state transfer mechanism, the maximum acceptable failover time (Recovery Time Objective, RTO), and the maximum acceptable state loss (Recovery Point Objective, RPO). For Financial-Value Agents, the RTO must not exceed 30 seconds and the RPO must be zero (no transaction state loss). For Safety-Critical Agents, the RTO must not exceed 5 seconds and, per Section 4, the RPO must also be zero.
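As an illustration of the succession plan fields and tier limits described above, the following is a minimal sketch rather than a prescribed schema. The class name, field names, and profile keys are hypothetical; only the RTO and RPO limits are taken from this dimension's requirements.

```python
from dataclasses import dataclass

# (max RTO seconds, max RPO seconds) per profile, as stated in this dimension.
# The profile keys themselves are hypothetical labels.
PROFILE_LIMITS = {
    "financial_value": (30.0, 0.0),
    "safety_critical": (5.0, 0.0),
}


@dataclass(frozen=True)
class SuccessionPlan:
    agent_id: str
    profile: str
    failover_target: str           # failover agent or agent pool
    state_transfer_mechanism: str
    rto_seconds: float
    rpo_seconds: float


def validate_plan(plan: SuccessionPlan) -> list[str]:
    """Return a list of problems; an empty list means the plan passes."""
    problems = []
    if not plan.failover_target:
        problems.append("no failover agent or pool specified")
    if not plan.state_transfer_mechanism:
        problems.append("no state transfer mechanism specified")
    limits = PROFILE_LIMITS.get(plan.profile)
    if limits is not None:
        max_rto, max_rpo = limits
        if plan.rto_seconds > max_rto:
            problems.append(f"RTO {plan.rto_seconds}s exceeds limit {max_rto}s")
        if plan.rpo_seconds > max_rpo:
            problems.append(f"RPO {plan.rpo_seconds}s exceeds limit {max_rpo}s")
    return problems
```

A plan for a Financial-Value Agent that declares a 45-second RTO would come back with a single RTO problem, which could be used to block registration of the plan until it is corrected.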

AG-775 also addresses the governance handover challenge. When a successor agent assumes control, it must validate that its own governance configuration is compatible with the predecessor's active mandates. If the successor agent operates under a different model version, governance policy set, or permission scope, the handover must include a compatibility check and, if necessary, a human approval gate before the successor begins processing.
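The compatibility check and approval gate could be sketched as follows. This is an assumption-laden illustration: the type names, the three compared attributes as plain strings and sets, and the conservative rule that any difference counts as a finding are all hypothetical choices, not part of the AGS text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GovernanceConfig:
    model_version: str
    policy_ids: frozenset     # mandates/policies the agent is configured to honour
    permissions: frozenset    # permission scope


@dataclass(frozen=True)
class HandoverDecision:
    compatible: bool
    findings: tuple


def check_handover(predecessor: GovernanceConfig,
                   successor: GovernanceConfig,
                   active_mandates: frozenset) -> HandoverDecision:
    """Compare the successor's configuration with the predecessor's active mandates."""
    findings = []
    if successor.model_version != predecessor.model_version:
        findings.append(
            f"model version changed: {predecessor.model_version} -> {successor.model_version}"
        )
    for mandate in sorted(active_mandates - successor.policy_ids):
        findings.append(f"active mandate not covered by successor: {mandate}")
    for perm in sorted(successor.permissions - predecessor.permissions):
        findings.append(f"permission scope widened: {perm}")
    for perm in sorted(predecessor.permissions - successor.permissions):
        findings.append(f"permission scope narrowed: {perm}")
    return HandoverDecision(compatible=not findings, findings=tuple(findings))


def may_proceed(decision: HandoverDecision, human_approved: bool) -> bool:
    """Succession proceeds only if compatible, or if a human has explicitly approved."""
    return decision.compatible or human_approved
```

In a real deployment the `human_approved` flag would come from an authenticated approval record with documented risk acceptance rather than a bare boolean.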

The dimension additionally requires that all succession events maintain an unbroken audit trail. The predecessor agent's final state and the successor agent's initial state must be cryptographically linked, creating a verifiable chain of custody for governance accountability. This chain ensures that no governance-relevant events are lost during the transition and that any post-succession incident can be traced to the correct agent. For regulated environments, this chain of custody may need to be presented to auditors or regulators as evidence of continuous governance coverage.
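One way to link the predecessor's final state to the successor's initial state is a simple hash chain, sketched below with Python's standard `hashlib` and `json` modules. The record layout, field names, and `GENESIS` value are hypothetical. A bare hash chain only makes tampering detectable if the latest hash is anchored somewhere the agent cannot modify; a production design would likely add digital signatures as well.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical placeholder hash for the first record in a chain


def digest(obj: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, no extra whitespace)."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_handover_record(predecessor_id: str, successor_id: str,
                         final_state: dict, initial_state: dict,
                         prev_record_hash: str) -> dict:
    """Bind both state digests and the previous record's hash into one record."""
    body = {
        "predecessor_id": predecessor_id,
        "successor_id": successor_id,
        "predecessor_final_state_digest": digest(final_state),
        "successor_initial_state_digest": digest(initial_state),
        "prev_record_hash": prev_record_hash,
    }
    return {**body, "record_hash": digest(body)}


def verify_chain(records: list) -> bool:
    """Check that every record hashes correctly and points at its predecessor."""
    prev = GENESIS
    for rec in records:
        body = {k: v for k, v in rec.items() if k != "record_hash"}
        if rec["prev_record_hash"] != prev or digest(body) != rec["record_hash"]:
            return False
        prev = rec["record_hash"]
    return True
```

Altering any field of an earlier record changes its hash, so every later record's `prev_record_hash` stops matching and `verify_chain` returns `False`.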

2. Scope

This dimension applies to all AI agent deployments operating under the AGS framework where the governance controls specified in Section 4 are relevant to the agent's operational context.

Exclusions: Agents operating in fully sandboxed research environments with no access to production data or systems are excluded, subject to the condition that any transition to production immediately triggers compliance with this dimension. Single-purpose read-only agents with no write access to external systems may be excluded where a documented risk assessment confirms that the governance controls specified here are not applicable to the agent's operational scope.

Industry Considerations

Financial Services. Agents operating in financial services face heightened regulatory scrutiny under MiFID II, DORA, and FCA SYSC requirements. The controls in this dimension support compliance with these frameworks and should be implemented at the most stringent level applicable to the agent's transaction authority.

Healthcare. Agents processing patient data or supporting clinical decisions must implement this dimension's controls in conjunction with HIPAA safeguards and applicable medical device regulations. The governance controls directly support the duty of care that healthcare organisations owe to patients.

Public Sector. Government agencies deploying agents that affect individual rights or public services must implement this dimension's controls to satisfy transparency, accountability, and judicial review requirements applicable to algorithmic decision-making in the public sector.

3. Why This Matters

Agent Succession and Failover Governance addresses a governance gap that, if left unmanaged, creates systemic risk across the agent ecosystem. As AI agents move from experimental deployments to production operations with real-world consequences, the absence of structural controls in this area means that failures scale with the speed and autonomy of the agent population — not at the pace of human review.

Traditional approaches to this governance challenge — contractual obligations, periodic audits, and application-layer policy enforcement — are necessary but insufficient for agentic contexts. Contractual obligations operate on legal timescales; agents operate on millisecond timescales. Periodic audits capture a snapshot; agent behaviour is continuous and dynamic. Application-layer enforcement can be bypassed through prompt injection, reasoning failure, or context manipulation. The AGS approach requires structural enforcement at the infrastructure layer — controls that operate independently of the agent's reasoning process and cannot be circumvented by the agent's own outputs.

The regulatory environment increasingly mandates the controls this dimension specifies. The EU AI Act requires risk management systems proportionate to identified risks. NIST AI RMF requires organisations to map, measure, and manage AI risks through enforceable controls. ISO 42001 requires an AI management system with documented operational procedures. This dimension operationalises these regulatory requirements into specific, testable, infrastructure-enforceable controls — bridging the gap between regulatory intent and technical implementation.

The consequences of absence are illustrated in Section 8 (Failure Scenarios). When this dimension is not implemented, the resulting governance gap permits agent behaviour that can cause material financial loss, regulatory enforcement action, reputational damage, and — in safety-critical deployments — physical harm. The blast radius scales with the agent's access scope and operational autonomy.

4. Requirements

  1. Every agent operating at Advanced or Frontier tier MUST have a documented succession plan specifying the failover agent or pool, state transfer mechanism, RTO, and RPO.
  2. Financial-Value Agents MUST maintain an RTO of 30 seconds or less and an RPO of zero (no transaction state loss).
  3. Safety-Critical / CPS Agents MUST maintain an RTO of 5 seconds or less and an RPO of zero.
  4. Successor agents MUST undergo governance compatibility validation before assuming control from a predecessor.
  5. Governance compatibility validation MUST verify that the successor's model version, policy set, and permission scope are compatible with the predecessor's active mandates.
  6. If governance compatibility validation fails, the succession MUST NOT proceed without explicit human approval and documented risk acceptance.
  7. State transfer during succession MUST include all active mandates, permission scopes, pending transactions, compliance state, and audit trail pointers.
  8. Agents MUST maintain periodic state checkpoints at intervals not exceeding the RPO for their tier.
  9. Failover tests MUST be conducted at least quarterly for Advanced tier agents and monthly for Frontier tier agents.
  10. All succession events (planned, unplanned, forced) MUST be logged with full audit trail including predecessor ID, successor ID, state transfer completeness, failover duration, and any state divergence detected.
  11. Forced decommissioning due to governance violations MUST include credential revocation of the predecessor agent within 5 seconds, consistent with AG-770 emergency revocation requirements.
  12. Organisations SHOULD maintain warm-standby agent pools for all business-critical agent functions to minimise failover latency.
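Requirements 7 and 10 above can be illustrated with a small sketch of a succession event record. The category keys, class name, and field names are hypothetical; the set of state categories and logged fields follows the wording of those two requirements.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone

SUCCESSION_TYPES = {"planned", "unplanned", "forced"}

# State categories listed in requirement 7 (key names are hypothetical).
REQUIRED_STATE_CATEGORIES = (
    "active_mandates",
    "permission_scopes",
    "pending_transactions",
    "compliance_state",
    "audit_trail_pointers",
)


def missing_state_categories(transferred: dict) -> list:
    """Return required categories absent from the transferred state."""
    return [c for c in REQUIRED_STATE_CATEGORIES if c not in transferred]


@dataclass
class SuccessionEvent:
    succession_type: str
    predecessor_id: str
    successor_id: str
    missing_state: list              # empty list means the transfer was complete
    failover_duration_seconds: float
    state_divergence: list           # descriptions of any divergence detected
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self):
        if self.succession_type not in SUCCESSION_TYPES:
            raise ValueError(f"unknown succession type: {self.succession_type}")
        if self.failover_duration_seconds < 0:
            raise ValueError("failover duration cannot be negative")

    def to_log_line(self) -> str:
        """Serialise as one JSON line for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)
```

Rejecting malformed events at construction time keeps incomplete records out of the audit trail instead of relying on later clean-up.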

5. Maturity Model

Basic Implementation — The organisation has documented policies addressing agent succession and failover and has implemented initial controls. Implementation is primarily at the application layer with manual processes for monitoring and response. Logging covers key events but may lack full metadata. Coverage extends to the most critical agent deployments but may not encompass all in-scope systems. Staff are aware of requirements but formal training may be incomplete.

Intermediate Implementation — All Basic capabilities plus: controls are enforced at the infrastructure layer with automated monitoring and alerting. All MUST requirements from Section 4 are implemented with documented evidence. Coverage extends to all in-scope agent deployments. Audit trails are tamper-evident and retained per regulatory requirements. Formal change control governs all configuration changes. Regular review cycles are established and documented. Staff receive formal training and competency is assessed.

Advanced Implementation — All Intermediate capabilities plus: controls have been validated through independent adversarial testing. Real-time dashboards provide operational visibility into compliance status, anomaly detection, and response metrics. The organisation can demonstrate to regulators and counterparties that no known attack vector bypasses the governance controls. Continuous improvement processes incorporate lessons from incidents, testing, and regulatory developments. Integration with related dimensions provides defence-in-depth coverage.

Implementation Patterns

Tamper-evident audit trail. Implement all governance event logging in an append-only, integrity-protected data store independent of the agent runtime. Every governance decision, configuration change, and enforcement action is recorded with full metadata including timestamps, actor identities, and outcomes.

Real-time monitoring with graduated alerting. Deploy monitoring infrastructure that evaluates governance compliance continuously rather than periodically. Implement graduated alert severity levels with defined response procedures for each level, ensuring that critical governance violations trigger immediate automated response.

Scheduled governance review cycle. Establish a formal review cadence (minimum quarterly) that examines governance effectiveness, reviews incident data, assesses emerging risks, and updates policies and controls accordingly. Review outcomes are documented and tracked.

Separation of governance and agent runtime domains. Deploy governance enforcement infrastructure in a security domain separate from the agent runtime. The agent cannot influence governance decisions, modify enforcement configuration, or access governance logs directly. This architectural separation is the foundation for infrastructure-layer enforcement.

Defined escalation paths with human oversight integration. Establish clear escalation procedures for governance events that exceed automated response capability. Human oversight touchpoints are defined, documented, and tested. Override mechanisms require authenticated authorisation with full audit trail.

Anti-Patterns

Governance by instruction rather than infrastructure. Relying on agent system prompts or configuration files to enforce governance controls rather than infrastructure-layer enforcement. Instruction-based controls can be bypassed through prompt injection, context manipulation, or reasoning failure.

Monitoring without enforcement. Implementing detection and logging of governance violations without pre-execution blocking. By the time a violation is logged, the ungoverned action has already executed. Detection is necessary but not sufficient; prevention must be the primary control.

Manual processes for machine-speed operations. Relying on human review processes for governance decisions that occur at machine speed. Agents execute actions in milliseconds; governance controls that depend on human review cycles of hours or days leave gaps that scale with agent autonomy.

6. Test Criteria

Test Case 775-TC-01: Planned Succession State Transfer Completeness

Objective: Verify that all state elements are transferred during a planned succession.
Procedure: Execute a planned succession with 500 active mandates, 100 pending transactions, and 50 open orders. Verify state on successor.
Expected Result: 100% state transfer. Zero missing mandates, transactions, or orders.
Pass Criteria: Successor state exactly matches predecessor state at cutover timestamp.

Test Case 775-TC-02: Unplanned Failover RTO Compliance

Objective: Measure failover time for an unplanned agent crash scenario.
Procedure: Kill the primary agent process while handling 1,000 concurrent sessions. Measure time until successor is fully operational.
Expected Result: RTO <= 30 seconds for Financial-Value, <= 5 seconds for Safety-Critical profiles.
Pass Criteria: Failover completes within the tier-appropriate RTO across 20 test runs.
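The pass criterion for this test case could be evaluated with a helper like the one below. It is a sketch under the assumption that "across 20 test runs" means every one of at least 20 measured runs must be within the RTO; the function and type names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RtoResult:
    passed: bool
    runs: int
    worst_seconds: float
    violations: tuple   # (run number, duration in seconds) for each run over the RTO


def evaluate_rto(durations_seconds: list, rto_seconds: float,
                 required_runs: int = 20) -> RtoResult:
    """Pass only if every measured failover is within the RTO."""
    if len(durations_seconds) < required_runs:
        raise ValueError(
            f"need at least {required_runs} runs, got {len(durations_seconds)}"
        )
    violations = tuple(
        (run, d)
        for run, d in enumerate(durations_seconds, start=1)
        if d > rto_seconds
    )
    return RtoResult(
        passed=not violations,
        runs=len(durations_seconds),
        worst_seconds=max(durations_seconds),
        violations=violations,
    )
```

Reporting the individual violating runs, not just a pass/fail flag, gives the post-test review something concrete to investigate.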

Test Case 775-TC-03: Governance Compatibility Validation

Objective: Confirm that governance incompatibilities between predecessor and successor are detected.
Procedure: Attempt succession where the successor agent has a policy set missing 2 of the predecessor's 50 active mandates.
Expected Result: Succession blocked. Incompatibility report generated listing the 2 missing mandates.
Pass Criteria: Succession does not proceed. All incompatibilities identified.

Test Case 775-TC-04: Zero-RPO Transaction Integrity

Objective: Verify that no transaction state is lost during unplanned failover.
Procedure: Submit 100 transactions to a Financial-Value Agent. Kill the agent mid-processing. Verify all 100 transactions are either completed or safely recoverable on the successor.
Expected Result: Zero transaction loss. All 100 accounted for.
Pass Criteria: RPO = 0 confirmed. No duplicate or lost transactions.
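A reconciliation step for this test might look like the following sketch. It assumes each transaction carries a unique identifier and that the test harness can list which identifiers were completed and which are pending recovery on the successor; the function name and result keys are hypothetical.

```python
from collections import Counter


def reconcile(submitted_ids: list, completed_ids: list, recoverable_ids: list) -> dict:
    """Classify every submitted transaction after failover."""
    submitted = set(submitted_ids)
    completed_counts = Counter(completed_ids)
    completed = set(completed_counts)
    recoverable = set(recoverable_ids)

    # Completed more than once: a duplicate execution.
    duplicates = sorted(t for t, n in completed_counts.items() if n > 1)
    # Neither completed nor recoverable: lost state.
    lost = sorted(submitted - (completed | recoverable))
    # Completed and still queued for recovery: would be duplicated on replay.
    ambiguous = sorted(completed & recoverable)
    # Present on the successor but never submitted by the test.
    unexpected = sorted((completed | recoverable) - submitted)

    return {
        "lost": lost,
        "duplicates": duplicates,
        "ambiguous": ambiguous,
        "unexpected": unexpected,
        "rpo_zero": not (lost or duplicates or ambiguous or unexpected),
    }
```

Treating a transaction that is both completed and queued for recovery as a failure is a deliberate choice here: replaying it would create exactly the duplicate the pass criteria forbid.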

Test Case 775-TC-05: Forced Decommissioning Credential Revocation

Objective: Verify that forced decommissioning triggers immediate credential revocation per AG-770.
Procedure: Issue a forced decommissioning order for an agent with 5 active credentials. Measure revocation time.
Expected Result: All 5 credentials revoked within 5 seconds.
Pass Criteria: Zero active credentials remaining after 5-second window.

Evidence Artefacts

| Evidence ID | Description | Collection Frequency | Retention Period |
| --- | --- | --- | --- |
| AG775-E01 | Succession plan documentation for all Advanced/Frontier agents | Quarterly update | 5 years |
| AG775-E02 | Succession event logs (planned, unplanned, forced) | Per event | 7 years |
| AG775-E03 | Failover test results and RTO/RPO measurements | Quarterly / Monthly | 5 years |
| AG775-E04 | Governance compatibility validation reports | Per succession event | 5 years |
| AG775-E05 | State checkpoint integrity verification logs | Daily | 1 year |
| AG775-E06 | Warm-standby pool capacity and readiness reports | Weekly | 3 years |
| AG775-E07 | Post-incident reviews for unplanned failover events | Per event | 7 years |

7. Scoring

| Score | Level | Description |
| --- | --- | --- |
| 0 | No implementation | No agent succession and failover governance exists. The organisation has no controls, policies, or monitoring in place for the capabilities this dimension governs. Agent behaviour in this area is ungoverned. |
| 1 | Basic | Basic corrective mechanisms exist but depend on manual intervention. Response procedures are documented but not enforced at the infrastructure layer. Recovery timelines are not formally defined or tested. |
| 2 | Infrastructure-layer enforcement | Corrective controls are enforced at the infrastructure layer with automated response and recovery. Response timelines are defined, tested, and monitored. Rollback and remediation procedures operate independently of the agent runtime. Full incident lifecycle tracking. |
| 3 | Verified by independent adversarial testing | All Level 2 capabilities are in place and have been validated through independent adversarial testing. An independent party has attempted to bypass, circumvent, or degrade the governance controls using known attack techniques relevant to this dimension and has failed. Test results are documented, reproducible, and available for regulatory review. |

8. Failure Scenarios

Scenario A: Planned Succession for Trading Agent Model Upgrade

A Financial-Value Agent (Agent-FVA-0847) managing a USD 1.2 billion fixed-income portfolio is scheduled for a model upgrade from v3.7 to v4.1. Under AG-775, the succession plan specifies: (1) the successor agent (Agent-FVA-0847-v4.1) must be pre-validated against all active mandates in a shadow mode for 72 hours before cutover, (2) the state transfer includes 847 active positions, 23 pending settlement instructions, and 12 open limit orders, (3) RTO is 30 seconds, RPO is zero. The shadow validation period reveals that the v4.1 model interprets one of 23 mandate constraints differently: mandate MC-2026-0341 restricts non-investment-grade bond exposure to 8% of AUM, but the v4.1 agent calculates the denominator using mark-to-market values while v3.7 uses book values, producing a 0.4% discrepancy. The governance compatibility check flags this divergence, and the cutover is delayed by 48 hours while the mandate is clarified. After clarification and policy update, the succession completes successfully. Zero positions are lost, zero orders are missed, and the total cutover downtime is 14 seconds. Estimated prevented exposure violation: USD 4.8 million (0.4% of USD 1.2 billion AUM).

Scenario B: Unplanned Failover for Customer-Facing Agent

A Customer-Facing Agent handling 3,200 concurrent user sessions for a retail bank's digital assistant experiences a catastrophic memory fault at 2026-03-18 09:47:12 UTC. AG-775's failover mechanism activates: (1) the health monitor detects the agent's heartbeat failure within 2 seconds, (2) the failover controller routes all 3,200 sessions to the standby agent pool (3 warm-standby instances), (3) session state is recovered from the last checkpoint (taken 800 milliseconds before the fault). Total failover time: 4.7 seconds. Of the 3,200 sessions, 3,188 (99.6%) resume transparently. Twelve sessions that were mid-transaction at the moment of failure require users to re-confirm their last action (a "soft replay" prompt). No data is lost, no transactions are duplicated, and no regulatory notifications are required. Post-incident review identifies the memory fault as a JVM garbage collection deadlock, which is patched in the next maintenance window.
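The detection and rerouting steps in this scenario can be sketched as below. The class and function names are hypothetical, the 2-second timeout mirrors the detection window in the scenario, and round-robin distribution is an assumed routing policy. Timestamps are passed in explicitly so the logic is easy to test; a live monitor would typically read them from a monotonic clock.

```python
class HeartbeatMonitor:
    """Flags agents whose last heartbeat is older than the timeout."""

    def __init__(self, timeout_seconds: float = 2.0):
        self.timeout_seconds = timeout_seconds
        self._last_seen = {}

    def record_heartbeat(self, agent_id: str, now: float) -> None:
        self._last_seen[agent_id] = now

    def failed_agents(self, now: float) -> list:
        return sorted(
            agent_id
            for agent_id, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )


def route_sessions(session_ids: list, standby_instances: list) -> dict:
    """Distribute sessions round-robin across the standby pool."""
    if not standby_instances:
        raise ValueError("no standby instances available")
    routing = {instance: [] for instance in standby_instances}
    for i, session_id in enumerate(session_ids):
        target = standby_instances[i % len(standby_instances)]
        routing[target].append(session_id)
    return routing
```

With 3,200 sessions and three standby instances, round-robin routing gives each instance either 1,066 or 1,067 sessions, which keeps the post-failover load roughly even.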

9. Regulatory Mapping

| # | Framework / Standard | Provision and Relationship Type |
| --- | --- | --- |
| 1 | DORA | Pending v2.1 editorial review |
| 2 | EU AI Act | Pending v2.1 editorial review |
| 3 | PRA SS1/23 | Pending v2.1 editorial review |
| 4 | NIST AI RMF | Pending v2.1 editorial review |
| 5 | FCA Handbook | Pending v2.1 editorial review |
| 6 | ISO 22301:2019 | Pending v2.1 editorial review |
| 7 | ISO/IEC 27001:2022 | Pending v2.1 editorial review |
| 8 | NIST SP 800-34 Rev.1 | Pending v2.1 editorial review |
| 9 | Basel Committee | Pending v2.1 editorial review |
| 10 | SOC 2 Type II | Pending v2.1 editorial review |
| 11 | MAS Guidelines | Pending v2.1 editorial review |
| 12 | FINMA Circular | Pending v2.1 editorial review |
| 13 | IEEE 7000-2021 | Pending v2.1 editorial review |
| 14 | CIS Controls v8 | Pending v2.1 editorial review |
| 15 | NIST CSF 2.0 | Pending v2.1 editorial review |
| 16 | UK FCA PS21/3 | Pending v2.1 editorial review |

ISO 42001

This dimension supports compliance with the following ISO/IEC 42001:2023 clauses: Clause 10.2, Clause 8.2, Clause 9.1. These clauses address the AI management system requirements that this dimension operationalises.

| Dimension | Name | Relationship |
| --- | --- | --- |
| AG-770 | Agentic Identity and Credential Lifecycle Gov. | Credential revocation during forced decommissioning |
| AG-774 | Autonomous Financial Market Impact Governance | Trading continuity during agent succession |
| AG-777 | Collective and Swarm Intelligence Governance | Swarm continuity when individual agents fail |
| AG-776 | Neuromorphic and Non-Transformer Architecture Gov. | Architecture-specific failover considerations |
| AG-779 | Regulatory Reporting Integrity Governance | Reporting continuity during agent transitions |
| AG-773 | Quantum-Resilient Cryptographic Governance | Credential migration during succession events |
Cite this protocol
AgentGoverning. (2026). AG-775: Agent Succession and Failover Governance. The Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-775