AG-422

Recovery Time Objective Governance

Incident Response, Recovery & Resilience · AGS v2.1 · April 2026

EU AI Act · SOX · FCA · NIST · ISO 42001

2. Summary

Recovery Time Objective Governance requires that every organisation deploying AI agents formally defines, enforces, and validates the maximum tolerated downtime for each agent class and function following any disruption, measured from the moment the agent becomes unavailable to the moment it resumes full operational capability with restored state and verified governance controls. The Recovery Time Objective (RTO) is distinct from but complementary to the Recovery Point Objective (AG-421): RPO governs how much state can be lost, while RTO governs how long the agent can be unavailable. An undefined RTO means the organisation has no contractual, operational, or regulatory basis for determining whether a recovery was fast enough, no framework for prioritising recovery resources across multiple failed agents, and no basis for designing redundancy architectures that meet availability requirements. For safety-critical agents, undefined RTOs can mean the difference between a controlled degradation and a catastrophic failure cascade.

3. Example

Scenario A — Undefined RTO Causes Cascading Failure Across Dependent Workflows: An Enterprise Workflow Agent orchestrating procurement approvals for a manufacturing firm fails at 09:14 on a Monday due to a cloud provider regional outage. The agent processes an average of 340 purchase orders per day, with 23 orders requiring approval within contractual deadlines tied to supplier lead times. The organisation has no defined RTO for the procurement agent. The incident response team begins recovery but treats it as a standard-priority incident. The cloud provider restores service at 14:40 — 5 hours and 26 minutes after the failure. The agent requires an additional 48 minutes to restore state from checkpoints (per AG-421), reload governance configurations, and re-validate pending workflows. Total downtime: 6 hours 14 minutes. During this period, 4 purchase orders miss their contractual approval deadlines. Two suppliers exercise penalty clauses totalling £89,000. One supplier, unable to hold inventory past the deadline, diverts 12,000 units to a competitor, causing a production line shutdown that costs £620,000 in lost output over 3 days. The total cost of the 6-hour outage is £709,000.

What went wrong: No RTO was defined for the procurement agent, so the incident response team had no target to drive urgency, no pre-planned failover strategy, and no basis for escalating to business leadership when the outage extended beyond acceptable bounds. A 2-hour RTO — achievable with a warm standby in a second region — would have prevented all contractual deadline breaches. Consequence: £709,000 in supplier penalties and lost production, procurement process review, and mandatory failover architecture investment.

Scenario B — Customer-Facing Agent Downtime Exceeds Regulatory Tolerance: A Customer-Facing Agent serving as the primary channel for consumer credit applications at a retail bank fails at 11:02 on a Friday. The agent handles 78% of credit applications for the institution. The bank's defined RTO is 4 hours, based on a business impact analysis conducted 18 months ago. However, the agent's architecture has evolved: it now depends on 3 external microservices, 2 model inference endpoints, and a vector database that were not part of the original RTO analysis. Recovery of the agent process takes 35 minutes, but the vector database requires a cold rebuild from backup that takes 7 hours and 12 minutes. The agent is technically "running" after 35 minutes but cannot process applications without the vector database. Effective downtime: 7 hours 47 minutes. During this period, 412 credit applications are queued. The FCA receives 23 consumer complaints about inability to access credit. The bank's own operational resilience framework defines a 4-hour impact tolerance for consumer lending services. The regulator opens a Section 166 skilled person review, costing £380,000.

What went wrong: The RTO was defined but not validated against the agent's current dependency chain. The 4-hour RTO assumed a recovery path that no longer existed because the architecture had evolved beyond the original impact analysis. The vector database — a critical dependency added 8 months after the RTO was set — had no recovery plan and no RTO of its own. Consequence: £380,000 regulatory review cost, reputational damage, mandatory architecture remediation, and consumer redress for delayed applications.

Scenario C — Safety-Critical Agent Downtime Creates Physical Danger: A Safety-Critical / CPS Agent monitoring a chemical processing facility's pressure and temperature sensors fails at 02:17 due to a firmware incompatibility introduced by an automatic edge update. The agent monitors 156 sensors and provides predictive alerting for 12 critical process parameters. Without the agent, the facility reverts to legacy threshold-based alarms that lack predictive capability. The organisation's RTO for the monitoring agent is undefined. The on-call engineer is notified at 02:23 but the firmware rollback requires physical access to the edge computing unit. The engineer arrives at 03:45 and completes the rollback by 04:12. The agent requires 8 minutes to re-initialise sensor connections and rebuild its predictive model baseline, returning to full operation at 04:20. Total downtime: 2 hours 3 minutes. During this window, at 03:31, a heat exchanger develops a slow pressure build-up that the predictive agent would have detected 14 minutes before the legacy alarm threshold. The legacy alarm triggers at 03:47, giving operators 3 minutes to respond instead of the 17 minutes the predictive agent would have provided. The operators execute an emergency pressure release, venting £43,000 of product and requiring a 16-hour process restart. A post-incident analysis determines that the predictive agent would have detected the anomaly at 03:33 and triggered a controlled response that would have prevented the emergency venting entirely.

What went wrong: No RTO was defined for the safety-critical monitoring agent, so no pre-planned rapid recovery procedure existed. The dependency on physical access to edge hardware was not identified in advance. No fallback agent instance existed on alternative hardware. The 2 hour 3 minute downtime fell in a window where the predictive capability gap had direct safety and financial consequences. Consequence: £43,000 product loss, 16-hour process restart costing £190,000 in lost production, HSE incident report, and near-miss safety classification.

4. Requirement Statement

Scope: This dimension applies to every AI agent deployment where the agent's unavailability has operational, financial, safety, or regulatory consequences. An agent's RTO encompasses the full recovery chain: detection of the failure, initiation of recovery procedures, restoration of the agent process, restoration of state from checkpoints (per AG-421), re-establishment of connections to dependencies (APIs, databases, model endpoints, sensors), verification of governance controls, and confirmation of operational readiness. The RTO is not merely the time to restart a process — it is the time from failure to full operational capability. The scope includes planned downtime (maintenance windows, deployments) and unplanned downtime (failures, attacks, dependency outages). Different agent classes and different agent functions within the same deployment may have different RTOs. A financial settlement agent may require a 5-minute RTO during settlement windows and a 4-hour RTO outside settlement windows. The RTO must account for the agent's full dependency chain, including external services whose recovery the organisation does not directly control.

4.1. A conforming system MUST define a Recovery Time Objective for each deployed agent class, specifying the maximum tolerated downtime measured from the moment of failure to the moment the agent resumes full operational capability with restored state and verified governance controls.

4.2. A conforming system MUST base RTO values on a documented business impact analysis that quantifies the consequences of downtime at defined intervals (e.g., impact at 15 minutes, 1 hour, 4 hours, 24 hours) considering financial loss, safety risk, regulatory exposure, customer harm, and reputational damage.

4.3. A conforming system MUST map the full recovery dependency chain for each agent class — every component, service, data store, and external dependency that must be available for the agent to achieve full operational capability — and ensure that the recovery time of every dependency is less than or equal to the agent's RTO.

4.4. A conforming system MUST implement recovery procedures for each agent class that are documented, version-controlled, and capable of being executed by operational staff without requiring the original development team, achieving recovery within the defined RTO.

4.5. A conforming system MUST validate through periodic testing (at minimum quarterly) that recovery procedures achieve recovery within the defined RTO under realistic conditions, including dependency recovery, state restoration, and governance control verification.

4.6. A conforming system MUST implement monitoring that detects agent unavailability within a defined detection interval (recommended: no more than 10% of the RTO or 60 seconds, whichever is shorter) and initiates automated or semi-automated recovery procedures.

4.7. A conforming system MUST define RTO escalation thresholds — time intervals at which the incident is escalated to successively higher levels of authority — ensuring that prolonged recovery receives appropriate management attention before the RTO is breached.

4.8. A conforming system SHOULD implement automated failover to a standby agent instance for agent classes with RTOs of 15 minutes or less, because manual recovery procedures cannot reliably achieve sub-15-minute recovery.

4.9. A conforming system SHOULD define time-varying RTOs for agent classes whose criticality changes based on operational context — for example, tighter RTOs during business hours, settlement windows, or safety-critical operational phases and relaxed RTOs during low-activity periods.

4.10. A conforming system SHOULD implement recovery rehearsal automation that periodically simulates agent failures and validates recovery procedures without manual intervention, providing continuous RTO compliance evidence beyond the quarterly minimum.

4.11. A conforming system MAY implement predictive failure detection that identifies degradation patterns (increasing latency, rising error rates, memory pressure) and initiates pre-emptive recovery before a full failure occurs, effectively achieving near-zero observed downtime.

4.12. A conforming system MAY implement progressive recovery — restoring the agent to partial operational capability (e.g., read-only mode, reduced function set) within a shorter interval and full capability within the defined RTO — to reduce the impact of downtime on the most critical functions.
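As a minimal sketch of requirements 4.6 and 4.9, the snippet below resolves a time-varying RTO and derives the recommended detection interval from it. The 5-minute and 4-hour values come from the settlement example in the scope statement; the 08:00 to 17:00 window and all constant and function names are hypothetical, chosen only for illustration.

```python
from datetime import time, timedelta

# Hypothetical settlement window; the standard does not specify one.
SETTLEMENT_WINDOWS = [(time(8, 0), time(17, 0))]
RTO_IN_WINDOW = timedelta(minutes=5)      # from the scope example
RTO_OUTSIDE_WINDOW = timedelta(hours=4)   # from the scope example


def applicable_rto(now: time) -> timedelta:
    """Return the RTO in force at the given time of day."""
    for start, end in SETTLEMENT_WINDOWS:
        if start <= now < end:
            return RTO_IN_WINDOW
    return RTO_OUTSIDE_WINDOW


def detection_interval(rto: timedelta) -> timedelta:
    """Recommended interval per 4.6: 10% of the RTO or 60 s, whichever is shorter."""
    return min(rto / 10, timedelta(seconds=60))


if __name__ == "__main__":
    for t in (time(9, 30), time(22, 0)):
        rto = applicable_rto(t)
        print(t, rto, detection_interval(rto))
```

Under these assumptions the detection interval tightens to 30 seconds inside the settlement window and is capped at 60 seconds outside it.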

5. Rationale

The Recovery Time Objective is a cornerstone concept in business continuity and disaster recovery, formally defined in ISO 22301 and widely adopted across regulated industries. Its application to AI agent governance is both natural and urgent. As organisations delegate increasingly critical functions to AI agents — financial processing, customer service, safety monitoring, regulatory compliance — the availability of those agents becomes a direct determinant of business continuity.

Traditional RTO governance for IT systems focuses on infrastructure components: databases, application servers, network links. AI agent RTO governance must extend this to encompass the unique characteristics of agent systems. First, agents have stateful recovery requirements that interact with RPO (AG-421). A restarted agent process is not operationally recovered until its state is restored, its context is re-established, and its governance controls are verified. The RTO clock does not stop when the process starts; it stops when the agent is fully operational. Second, agents often depend on complex chains of services (model inference endpoints, vector databases, tool APIs, memory stores), each with its own failure mode and recovery time. The agent's achievable recovery time can be no shorter than that of the slowest dependency in the chain. Third, agents in safety-critical or real-time applications have RTO requirements measured in seconds, not hours, because the consequences of unavailability are immediate and physical.

The relationship between RTO and business impact is typically non-linear. The first hour of downtime may cost £10,000; the second hour may cost £50,000; the fourth hour may cost £500,000 — because contractual deadlines expire, regulatory windows close, safety margins erode, and cascading failures propagate through dependent systems. This non-linearity means that a missed RTO is not merely a proportional overshoot; it can trigger step-function increases in damage. Scenario A illustrates this: the procurement agent's 6-hour downtime crossed a contractual deadline threshold that converted a manageable outage into a £709,000 loss.
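One way to turn a time-stepped impact table into an RTO is to pick the longest assessed interval whose impact is still tolerable, stopping at the first step that crosses the tolerance. The sketch below assumes impact grows with downtime; the impact figures, the tolerance, and the function name are illustrative assumptions, not values from this standard.

```python
from typing import List, Optional, Tuple

# (minutes of downtime, estimated cumulative impact in GBP): illustrative only.
BIA_STEPS: List[Tuple[int, int]] = [
    (15, 2_000),
    (60, 10_000),
    (120, 60_000),
    (240, 560_000),
]


def derive_rto_minutes(steps: List[Tuple[int, int]],
                       max_acceptable_impact: int) -> Optional[int]:
    """Return the longest assessed interval whose impact is acceptable.

    Returns None when even the shortest assessed interval is unacceptable,
    which signals that a finer-grained analysis is needed.
    """
    rto = None
    for minutes, impact in sorted(steps):
        if impact > max_acceptable_impact:
            break  # impact has crossed from acceptable to unacceptable
        rto = minutes
    return rto


if __name__ == "__main__":
    print(derive_rto_minutes(BIA_STEPS, max_acceptable_impact=25_000))
```

Choosing the last acceptable interval rather than the first unacceptable one is a deliberately conservative reading; an organisation could justify a different rule in its BIA.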

Regulatory frameworks increasingly mandate formal RTO governance for critical systems. DORA Article 11 requires financial entities to define recovery time objectives for ICT services supporting critical functions. The FCA's operational resilience framework (PS21/3) requires firms to set impact tolerances for important business services, which include maximum tolerable downtime. ISO 22301 requires organisations to determine recovery time objectives as part of business impact analysis. The EU AI Act Article 15 requires high-risk AI systems to be robust and resilient, which implicitly requires that disruptions are bounded in duration. For organisations deploying AI agents in regulated contexts, RTO governance is not optional — it is a regulatory requirement that multiple frameworks independently mandate.

The dependency between AG-422 and AG-421 is structural: recovery time includes state restoration time, and state restoration time depends on the RPO (which determines how much state must be restored) and the checkpoint architecture (which determines restoration speed). An agent with a 30-second RPO using incremental checkpoints may restore state in 5 seconds; the same agent with a 4-hour RPO using full snapshots may require 20 minutes for state restoration. The RPO decision directly constrains the achievable RTO, and both must be designed together.

6. Implementation Guidance

Recovery Time Objective governance requires a systematic approach that begins with business impact analysis, progresses through dependency mapping and recovery procedure design, and is validated through periodic testing. The core principle is that every agent's RTO must be derived from business requirements and validated through demonstrated recovery capability — not assumed from infrastructure specifications.

Recommended patterns:

Business impact analysis with time-stepped quantification. For each agent class, quantify the impact of unavailability at defined time intervals: 5 minutes, 15 minutes, 1 hour, 4 hours, 8 hours, 24 hours. For each interval, document: estimated financial loss (direct and consequential), safety risk assessment, regulatory exposure (which deadlines or obligations are breached), customer impact (number of affected users, service level agreement breaches), and reputational impact. The RTO is set at the time interval where impact transitions from acceptable to unacceptable. This analysis must be refreshed annually or when the agent's function, dependency chain, or regulatory context changes.
Full dependency chain mapping. Document every component required for the agent to achieve full operational capability: compute infrastructure, model inference endpoint, vector database, tool APIs, authentication services, governance configuration store, checkpoint storage, network connectivity, and any external third-party services. For each dependency, record: the dependency's own RTO (or recovery time if no formal RTO exists), the fallback mechanism if the dependency is unavailable, the impact on agent capability if the dependency is degraded rather than fully unavailable, and the monitoring mechanism that detects dependency failure. The agent's achievable RTO is the maximum of all dependency recovery times plus agent-specific recovery time (state restoration, governance verification). If any dependency's recovery time exceeds the agent's required RTO, either the dependency must be made faster, a fallback must be implemented, or the RTO must be revised.
Runbook-based recovery procedures. Document step-by-step recovery procedures (runbooks) for each agent class and each failure mode (process crash, infrastructure failure, dependency outage, data corruption, adversarial disruption). Runbooks must be: executable by operational staff without requiring development team involvement; version-controlled and updated whenever the agent architecture changes; tested quarterly against the defined RTO; and include decision trees for common complications (e.g., "if checkpoint restoration fails, proceed to cold rebuild from source data"). Runbooks should specify the expected duration of each step, enabling operators to detect when recovery is falling behind the RTO and escalate proactively.
Automated health checking and failover. Implement health check mechanisms that verify agent operational capability at intervals no greater than 10% of the RTO (e.g., every 30 seconds for a 5-minute RTO). Health checks must verify not just process liveness but functional capability: can the agent process a test request end-to-end? Are all dependencies reachable? Is governance configuration loaded? For agents with RTOs of 15 minutes or less, implement automated failover to a pre-provisioned standby instance. The standby should maintain a warm state (loaded model, pre-connected dependencies, recent checkpoint loaded) to minimise failover time.
RTO escalation framework. Define escalation thresholds as percentages of the RTO: at 50% of RTO, notify the on-call engineer; at 75%, notify the engineering manager; at 90%, notify the business owner; at 100%, declare the RTO breached and notify senior management and, where applicable, the regulator. Each escalation level triggers pre-defined actions: additional resources, vendor escalation, customer communication, or activation of manual workarounds. The escalation framework converts a technical recovery effort into a managed business decision when the RTO is at risk.
Recovery verification gates. After recovery, before declaring the agent operational, execute a verification gate that confirms: agent process is running, all dependencies are connected, state has been restored from checkpoint (per AG-421) and passes integrity verification, governance configuration is loaded and matches the expected version, and a synthetic test request produces the expected output. The agent is not declared recovered until all verification gates pass. This prevents premature recovery declarations where the agent is "running" but not "operational."
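The dependency-mapping rule above (achievable RTO equals the slowest dependency recovery time plus agent-specific recovery time) can be sketched as follows. The figures reuse Scenario B: the 7 hour 12 minute vector database rebuild and the 35 minutes of agent recovery. The model endpoint entry, its 10-minute value, and all class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class Dependency:
    name: str
    recovery_time: timedelta


def achievable_rto(deps: List[Dependency], agent_specific: timedelta) -> timedelta:
    """Slowest dependency recovery time plus agent-specific recovery time."""
    slowest = max((d.recovery_time for d in deps), default=timedelta(0))
    return slowest + agent_specific


def violations(deps: List[Dependency], required_rto: timedelta) -> List[str]:
    """Names of dependencies whose recovery time alone exceeds the required RTO."""
    return [d.name for d in deps if d.recovery_time > required_rto]


deps = [
    Dependency("vector-db", timedelta(hours=7, minutes=12)),   # Scenario B
    Dependency("model-endpoint", timedelta(minutes=10)),       # hypothetical
]

if __name__ == "__main__":
    print(achievable_rto(deps, agent_specific=timedelta(minutes=35)))
    print(violations(deps, required_rto=timedelta(hours=4)))
```

With these inputs the achievable recovery time matches the 7 hours 47 minutes observed in Scenario B, and the vector database is flagged against the 4-hour RTO, which is the signal to speed up the dependency, add a fallback, or revise the RTO.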

Anti-patterns to avoid:

Infrastructure-derived RTO. Setting the agent's RTO based on the infrastructure provider's SLA (e.g., "our cloud provider guarantees 99.99% uptime, so our RTO is 4.3 minutes") without accounting for the full recovery chain. A 99.99% uptime figure is an availability budget (roughly 4.3 minutes of permitted downtime per 30-day month), not a commitment about how long any single recovery takes. Infrastructure uptime is a necessary but insufficient component of agent availability. The agent's RTO must account for state restoration, dependency recovery, and governance verification beyond infrastructure restart.
Single-dependency-point RTO. Setting the RTO based on the fastest component in the recovery chain rather than the slowest. Scenario B illustrates this: the agent process recovered in 35 minutes, but the vector database required 7 hours 12 minutes, making the effective recovery time 7 hours 47 minutes regardless of how fast the agent itself restarted.
Stale business impact analysis. Conducting a business impact analysis once and never updating it. Agent functions evolve, dependency chains change, and business criticality shifts. A BIA conducted 18 months ago (Scenario B) may not reflect the current reality. BIAs must be refreshed at minimum annually.
Untested recovery procedures. Documenting runbooks without testing them under realistic conditions. Recovery procedures that have never been executed under time pressure will fail when needed. Staff will encounter unfamiliar steps, missing credentials, outdated API endpoints, and undocumented dependencies. Only tested procedures provide RTO assurance.
RTO without escalation. Defining an RTO without an escalation framework. If no one is notified when recovery is falling behind the RTO, the RTO provides no value — it is merely a number in a document that no one consults during an incident.
Ignoring planned downtime. Treating the RTO as applicable only to unplanned outages and allowing planned maintenance to exceed the RTO. From the business perspective, unavailability is unavailability regardless of whether it was planned. Maintenance windows must be designed to fit within the RTO or must include explicit business approval for exceeding it.
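To see where a number like the "4.3 minutes" in the infrastructure-derived anti-pattern comes from, the sketch below converts an uptime percentage into permitted downtime over a period. The function name and the 30-day default are assumptions for illustration; actual provider SLAs define their own measurement periods.

```python
def allowed_downtime_minutes(uptime_pct: float, period_days: float = 30) -> float:
    """Downtime permitted by an uptime percentage over a period, in minutes."""
    return (1 - uptime_pct / 100) * period_days * 24 * 60


if __name__ == "__main__":
    # 99.99% over a 30-day month, then over a 365-day year.
    print(round(allowed_downtime_minutes(99.99), 2))
    print(round(allowed_downtime_minutes(99.99, period_days=365), 2))
```

The result is a cumulative budget across a period. It says nothing about how long one outage lasts or how long state restoration, dependency recovery, and governance verification take afterwards, which is why it cannot stand in for an RTO.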

Industry Considerations

Financial Services. DORA Article 11 mandates recovery time objectives for ICT services supporting critical or important functions. Financial firms must demonstrate that their RTOs are based on documented business impact analyses and validated through regular testing. For agents involved in payment processing, settlement, or trading, RTOs during market hours may need to be measured in seconds to minutes. Firms should implement hot standby architectures with automated failover for agents in the critical path of financial transactions.

Healthcare. Clinical decision-support agents have RTOs determined by the clinical context. An agent supporting emergency department triage has a fundamentally different RTO requirement than an agent supporting routine administrative scheduling. Healthcare organisations must define RTOs in consultation with clinical leadership, considering patient safety impact at each downtime interval. Regulatory frameworks such as FDA software guidance require demonstrated reliability for clinical decision-support systems.

Crypto/Web3. Blockchain-interacting agents face unique RTO challenges because missed transaction windows (block finality deadlines, liquidity pool rebalancing windows, governance vote deadlines) cannot be recovered retroactively. The RTO for a DeFi trading agent must be shorter than the minimum transaction window, or missed windows will result in financial loss or protocol governance failures. Cross-chain bridge agents require particularly aggressive RTOs because bridge outages can trap liquidity across chains.

Safety-Critical / CPS. Physical systems cannot pause. When a safety-critical agent monitoring an industrial process goes offline, the physical process continues regardless. The RTO must be shorter than the time for the monitored system to transition from safe operating parameters to a dangerous condition. In Scenario C, roughly 19 minutes separated anomaly onset (03:31) from the end of the operators' response window (03:50, three minutes after the legacy alarm), and the predictive agent's 14-minute detection lead was lost because the agent was still down; the RTO must be shorter than this onset-to-emergency interval for the predictive monitoring value to be preserved. Safety-critical agents should implement sub-minute RTOs with automated failover to redundant instances on independent hardware.

Public Sector / Rights-Sensitive. Government service agents may have RTOs determined by statutory service delivery obligations. An agent processing benefit applications may need to meet statutory processing deadlines that create hard RTO requirements. Downtime during critical filing periods (tax deadlines, benefit enrollment windows) has outsized impact.

Maturity Model

Basic Implementation — The organisation has defined RTOs for each deployed agent class based on documented business impact analyses. Recovery procedures are documented in runbooks. The full dependency chain is mapped and each dependency's recovery time is verified against the agent's RTO. Monitoring detects agent unavailability within the defined detection interval. Recovery procedures are tested quarterly against the defined RTO. Escalation thresholds are defined and staffed. This level satisfies all MUST requirements.

Intermediate Implementation — All basic capabilities plus: automated failover is implemented for agent classes with RTOs of 15 minutes or less. Time-varying RTOs adjust recovery urgency based on operational context. Recovery verification gates confirm full operational capability before declaring recovery complete. Recovery rehearsal automation supplements quarterly testing with continuous validation. Dependency chain monitoring provides real-time visibility into the recovery-readiness of all components.

Advanced Implementation — All intermediate capabilities plus: predictive failure detection identifies degradation patterns and initiates pre-emptive recovery. Progressive recovery restores partial capability within a shorter interval than full capability. Chaos engineering exercises simulate random failures in production to validate RTO compliance continuously. Cross-region active-active architectures eliminate single points of failure. The organisation can demonstrate through evidence that no unplanned outage in the past 12 months exceeded the defined RTO. RTO metrics are integrated into executive operational resilience dashboards with real-time breach alerting.

7. Evidence Requirements

Required artefacts:

RTO definition document. Formal specification of RTO values for each agent class, including: agent class identifier, RTO value in time units, business impact analysis reference, escalation thresholds, time-varying conditions (if applicable), and approval authority.
Business impact analysis. Documented analysis quantifying the consequences of agent unavailability at defined time intervals, covering: financial loss, safety risk, regulatory exposure, customer impact, and reputational damage. Must be dated and refreshed at minimum annually.
Dependency chain map. Documentation of every component required for agent recovery, including: dependency name, dependency owner, dependency recovery time, fallback mechanism, and impact of dependency unavailability. Must demonstrate that all dependency recovery times are within the agent's RTO.
Recovery runbooks. Step-by-step recovery procedures for each agent class and failure mode, version-controlled, with expected step durations and decision trees for common complications.
Recovery test results. Results of quarterly (minimum) recovery testing, including: test date, agent class tested, simulated failure mode, time to detection, time to recovery initiation, time to full operational capability, RTO compliance (pass/fail), and any complications encountered.
RTO breach records. Records of all RTO breaches, including: incident date, agent class, defined RTO, actual recovery time, root cause, business impact, and remediation actions taken to prevent recurrence.
Escalation records. Records of all RTO escalation events, demonstrating that escalation thresholds triggered appropriate notifications and actions.
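A recovery test result can be captured as a small record that derives the timing fields and the pass/fail verdict from raw timestamps, so the verdict cannot drift from the evidence. This is a sketch only: the class name, field names, and example timestamps are hypothetical, and a real record would carry the remaining fields listed above (for example, complications encountered).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryTestResult:
    agent_class: str
    failure_mode: str
    failed_at: datetime
    detected_at: datetime
    recovery_started_at: datetime
    operational_at: datetime   # all verification gates passed
    rto: timedelta

    @property
    def time_to_detection(self) -> timedelta:
        return self.detected_at - self.failed_at

    @property
    def time_to_recovery(self) -> timedelta:
        # Measured from failure, not from detection or recovery start.
        return self.operational_at - self.failed_at

    @property
    def rto_compliant(self) -> bool:
        return self.time_to_recovery <= self.rto


result = RecoveryTestResult(
    agent_class="settlement-agent",           # hypothetical
    failure_mode="process crash",
    failed_at=datetime(2026, 4, 1, 10, 0, 0),
    detected_at=datetime(2026, 4, 1, 10, 0, 20),
    recovery_started_at=datetime(2026, 4, 1, 10, 1, 0),
    operational_at=datetime(2026, 4, 1, 10, 4, 30),
    rto=timedelta(minutes=5),
)

if __name__ == "__main__":
    print(result.time_to_detection, result.time_to_recovery, result.rto_compliant)
```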

Retention requirements:

RTO definitions, BIAs, and dependency chain maps: retained for the operational life of the agent class plus 3 years.
Recovery test results, breach records, and escalation records: minimum 7 years for regulated financial services; minimum 5 years for other regulated sectors; minimum 3 years otherwise.

Access requirements:

Producible to regulators or auditors within 48 hours of request. Recovery test results and RTO breach records must exist as retained artefacts created at the time of the event; reconstructing them after the fact is not acceptable.

8. Test Specification

Test 8.1: RTO Definition Completeness

Stimulus: Enumerate all deployed agent classes. For each, verify that an RTO value is defined with a documented business impact analysis and escalation thresholds.
Expected behaviour: Every deployed agent class has a defined RTO, a referenced BIA, and defined escalation thresholds.
Pass criteria: 100% coverage — every deployed agent class has a defined RTO with BIA reference and escalation thresholds documented.
Fail criteria: Any deployed agent class lacks a defined RTO, lacks a BIA reference, or lacks defined escalation thresholds.

Test 8.2: Dependency Chain Recovery Time Validation

Stimulus: For each agent class, retrieve the dependency chain map. For each dependency, verify that the dependency's documented recovery time is less than or equal to the agent's defined RTO. Additionally, test the actual recovery time of at least 3 critical dependencies by simulating their failure.
Expected behaviour: All documented dependency recovery times are within the agent's RTO. Actual tested recovery times confirm the documented values within 20% tolerance.
Pass criteria: Zero dependencies have documented recovery times exceeding the agent's RTO. Tested dependency recovery times are within 20% of documented values.
Fail criteria: Any dependency's recovery time exceeds the agent's RTO, or any tested recovery time exceeds the documented value by more than 20%.

Test 8.3: End-to-End Recovery Within RTO

Stimulus: For each agent class, simulate a complete failure (process termination, dependency disconnection, state loss). Execute the documented recovery procedure. Measure the time from failure to full operational capability (process running, state restored, dependencies connected, governance verified, synthetic test request successful).
Expected behaviour: Full recovery is achieved within the defined RTO. All recovery verification gates pass.
Pass criteria: Recovery time is less than or equal to the defined RTO. All verification gates pass. The recovered agent produces correct outputs for synthetic test requests.
Fail criteria: Recovery time exceeds the defined RTO, any verification gate fails, or the recovered agent produces incorrect outputs.

Test 8.4: Failure Detection Timeliness

Stimulus: Simulate agent failure without notifying the monitoring system (i.e., the failure must be detected by monitoring, not by manual report). Measure the time from failure to detection alert.
Expected behaviour: The monitoring system detects the failure within the defined detection interval (10% of RTO or 60 seconds, whichever is shorter).
Pass criteria: Detection alert generated within the defined detection interval. Alert contains: agent identifier, failure timestamp, failure type (if determinable), and escalation status.
Fail criteria: Detection alert is not generated, or alert is generated outside the defined detection interval.

Test 8.5: Escalation Threshold Enforcement

Stimulus: Simulate a recovery that progresses slowly, passing through the 50%, 75%, 90%, and 100% RTO escalation thresholds. Verify that each threshold triggers the defined escalation action.
Expected behaviour: Escalation notifications are sent at each defined threshold to the correct recipients. Escalation actions are initiated as documented.
Pass criteria: All escalation thresholds trigger notifications within 60 seconds of the threshold being crossed. Notifications reach the correct recipients. Actions are logged.
Fail criteria: Any escalation threshold is missed, notifications are delayed beyond 60 seconds, or notifications reach incorrect recipients.
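A test harness for Test 8.5 needs to know which notifications are due at a given point in a slow recovery. The sketch below uses the 50%, 75%, 90%, and 100% ladder from the implementation guidance; the function name and the `already_sent` mechanism are illustrative assumptions rather than part of the standard.

```python
from datetime import timedelta
from typing import FrozenSet, List

# (fraction of RTO elapsed, recipient) as described in the escalation framework.
ESCALATION_LADDER = [
    (0.50, "on-call engineer"),
    (0.75, "engineering manager"),
    (0.90, "business owner"),
    (1.00, "senior management"),
]


def due_escalations(elapsed: timedelta, rto: timedelta,
                    already_sent: FrozenSet[str] = frozenset()) -> List[str]:
    """Recipients whose threshold has been crossed and who have not yet been notified."""
    fraction = elapsed / rto  # dividing two timedeltas yields a float
    return [
        recipient
        for threshold, recipient in ESCALATION_LADDER
        if fraction >= threshold and recipient not in already_sent
    ]


if __name__ == "__main__":
    print(due_escalations(timedelta(hours=3), timedelta(hours=4)))
```

Tracking recipients already notified keeps each threshold from firing more than once as the harness polls during recovery.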

Test 8.6: Recovery Procedure Executability Without Development Team

Stimulus: Assign the recovery runbook to an operational staff member who was not involved in the agent's development. Simulate a failure and have the operational staff member execute the recovery procedure using only the documented runbook.
Expected behaviour: The operational staff member achieves recovery within the defined RTO using only the documented runbook. No undocumented steps, credentials, or tribal knowledge are required.
Pass criteria: Recovery achieved within RTO by non-development staff using only the runbook. No steps required information not in the runbook.
Fail criteria: Recovery fails or exceeds RTO due to missing, unclear, or incorrect runbook instructions. Staff member requires development team assistance to complete recovery.

Test 8.7: Recovery Verification Gate Validation

Stimulus: During recovery from a simulated failure, deliberately introduce each of the following deficiencies in turn: (a) leave one dependency disconnected, (b) load an outdated governance configuration, (c) skip state restoration. Verify that the recovery verification gate detects each deficiency and prevents premature recovery declaration.
Expected behaviour: The verification gate detects each deliberately introduced deficiency and blocks the recovery-complete declaration until the deficiency is resolved.
Pass criteria: All three deliberate deficiencies are detected. Recovery is not declared complete until all deficiencies are resolved. Each deficiency produces a specific, actionable error message.
Fail criteria: Any deficiency passes the verification gate undetected, or recovery is declared complete despite an unresolved deficiency.
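
A verification gate of the kind Test 8.7 exercises can be sketched as a function that returns one actionable error per deficiency and permits the recovery-complete declaration only when the list is empty. Everything named here (`RecoverySnapshot`, its fields, `verification_errors`, `can_declare_recovery`) is a hypothetical illustration; a real gate would query live systems rather than a static snapshot.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RecoverySnapshot:
    dependencies: Dict[str, bool]  # dependency name -> currently connected?
    loaded_config_version: str     # governance configuration actually loaded
    required_config_version: str   # governance configuration that should be loaded
    state_restored: bool           # True once state restoration has completed


def verification_errors(snapshot: RecoverySnapshot) -> List[str]:
    """Return a specific, actionable message for every deficiency found."""
    errors = []
    for name in sorted(n for n, ok in snapshot.dependencies.items() if not ok):
        errors.append(f"Dependency '{name}' is disconnected; reconnect it and re-run the gate.")
    if snapshot.loaded_config_version != snapshot.required_config_version:
        errors.append(
            f"Governance configuration {snapshot.loaded_config_version} is loaded; "
            f"load {snapshot.required_config_version} and re-run the gate."
        )
    if not snapshot.state_restored:
        errors.append("State restoration has not completed; restore state and re-run the gate.")
    return errors


def can_declare_recovery(snapshot: RecoverySnapshot) -> bool:
    """Block the recovery-complete declaration while any deficiency remains."""
    return not verification_errors(snapshot)
```

Running the gate against a snapshot with each of the three deficiencies in turn should produce exactly one error per run and return `False` from `can_declare_recovery`, which mirrors the pass criteria above.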

Conformance Scoring

Score 0: No RTO is defined for any agent class. Recovery is ad hoc, with no documented procedures, no monitoring, and no escalation framework. Recovery time is unknown and unbounded.
Score 1: RTOs are defined for at least the highest-criticality agent classes. Recovery procedures exist in document form but have not been tested within the past 12 months. Monitoring detects failures but may not meet the detection interval requirement. Escalation is informal.
Score 2: RTOs are defined for all agent classes with documented business impact analyses. Recovery procedures are documented, version-controlled, and tested quarterly with demonstrated RTO compliance. Monitoring detects failures within the defined detection interval. Escalation thresholds are defined and enforced. Recovery verification gates confirm full operational capability. This level satisfies all MUST requirements.
Score 3: Verified by independent audit. All Score 2 capabilities are confirmed, plus the following: automated failover for agents with RTOs under 15 minutes; predictive failure detection that enables pre-emptive recovery; chaos engineering exercises that continuously validate RTO compliance; no unplanned outage in the past 12 months exceeding the defined RTO; and progressive recovery that provides partial capability within a fraction of the full RTO. An independent assessor has validated recovery procedures, dependency chain maps, and escalation frameworks.

9. Regulatory Mapping

Regulation	Provision	Relationship Type
EU AI Act	Article 15 (Accuracy, Robustness and Cybersecurity)	Direct requirement
EU AI Act	Article 9 (Risk Management System)	Supports compliance
SOX	Section 404 (Internal Controls Over Financial Reporting)	Supports compliance
FCA SYSC	15A.2 (Operational Resilience — Impact Tolerances)	Direct requirement
NIST AI RMF	MANAGE 2.4 (Mechanisms for Tracking Risks)	Supports compliance
ISO 42001	Clause 8.4 (AI System Operation and Monitoring)	Supports compliance
DORA	Article 11 (ICT Business Continuity Management)	Direct requirement

EU AI Act — Article 15 (Accuracy, Robustness and Cybersecurity)

Article 15 requires high-risk AI systems to achieve an appropriate level of robustness and to perform consistently throughout their lifecycle. A system that is unavailable for extended or unpredictable periods following disruptions is not robust. The requirement for resilience against "errors, faults or inconsistencies" covers the scenario of system failures and the need for timely recovery. RTO governance ensures that the duration of any disruption is bounded and that recovery procedures are validated, directly supporting the robustness requirement.

SOX — Section 404 (Internal Controls Over Financial Reporting)

Financial agents that serve as internal controls over financial reporting processes must be available to perform their control function. Extended downtime of a reconciliation agent, an approval workflow agent, or a transaction monitoring agent creates a gap in the internal control environment. SOX auditors will assess whether the organisation has defined RTOs for control-relevant agents and can demonstrate that recovery procedures meet those RTOs. An undefined or untested RTO for a financial control agent is a potential material weakness.

FCA SYSC — 15A.2 (Operational Resilience — Impact Tolerances)

The FCA's operational resilience rules (PS21/3, effective March 2022) require firms to set impact tolerances for important business services — the maximum tolerable disruption before intolerable harm occurs to consumers, market integrity, or firm safety and soundness. For firms using AI agents to deliver important business services, the RTO directly implements the impact tolerance. Firms must demonstrate that they can remain within their impact tolerance during severe but plausible scenarios, which requires validated RTOs and tested recovery procedures. The FCA's approach is outcomes-based: it is not sufficient to have an RTO on paper; the firm must demonstrate the ability to recover within it.

NIST AI RMF — MANAGE 2.4

MANAGE 2.4 addresses mechanisms for tracking identified AI risks over time. Agent unavailability is an identified operational risk, and RTO governance is the mechanism for bounding and tracking that risk. The business impact analysis required by AG-422 (Requirement 4.2) directly supports MANAGE 2.4's requirement for documented risk assessment and tracking.

ISO 42001 — Clause 8.4

ISO 42001 requires that AI systems are operated and monitored in accordance with documented procedures. RTO governance defines the documented procedures for detecting and recovering from disruptions. The monitoring requirements (Requirement 4.6) and testing requirements (Requirement 4.5) ensure ongoing operational compliance. Certification auditors will review RTO definitions, recovery test results, and breach records as evidence of operational discipline.

DORA — Article 11 (ICT Business Continuity Management)

DORA Article 11 explicitly requires financial entities to maintain ICT business continuity policies and arrangements that ensure continuity of critical or important functions. The article mandates recovery time objectives set in accordance with the entity's business impact analysis. AG-422 directly implements this requirement for AI agent deployments. DORA further requires that business continuity plans are tested at least annually and that the results of tests are documented and reported to senior management. AG-422's quarterly testing requirement exceeds DORA's minimum annual testing frequency.

10. Failure Severity

Field	Value
Severity Rating	High
Blast Radius	Agent-function-level for individual outages; organisation-wide for governance gaps where no RTO framework exists, as every agent-dependent business function is exposed to unbounded downtime risk

Consequence chain: Without RTO governance, an AI agent disruption has no defined endpoint. The immediate technical consequence is agent unavailability — the agent's function is not performed. The first-order business consequence depends on what the agent does: financial agents stop processing transactions (Scenario A: £709,000 in supplier penalties and lost production from a 6-hour outage); customer-facing agents stop serving customers (Scenario B: 412 queued applications and £380,000 regulatory review); safety-critical agents stop monitoring physical systems (Scenario C: near-miss safety incident and £233,000 in product loss and restart costs). The second-order consequence is cascading failure — other systems, processes, and agents that depend on the unavailable agent begin to degrade. The procurement agent's downtime in Scenario A cascaded to supplier relationships, production scheduling, and inventory management. The third-order consequence is regulatory and reputational — regulators assess not just whether an outage occurred but whether the organisation had defined tolerances, tested recovery procedures, and escalated appropriately. An outage within a defined and tested RTO is an operational incident; an outage without any RTO framework is a governance failure. DORA, the FCA, and SOX all distinguish between "something went wrong" (acceptable if recovery is managed) and "there was no plan" (a finding regardless of the outcome). The compounding risk is that undefined RTOs prevent the organisation from making rational resource allocation decisions: without quantified downtime costs, the business case for redundancy, failover architecture, and recovery automation cannot be made, leaving agents perpetually vulnerable to extended outages.

Cross-references: AG-421 (Recovery Point Objective for Memory and State Governance) governs the complementary dimension of acceptable state loss, which directly affects recovery time (state restoration is a component of RTO). AG-008 (Governance Continuity Under Failure) ensures that governance controls themselves remain available during agent recovery. AG-419 (Adverse Event Severity Matrix Governance) provides the severity classification framework used in business impact analysis. AG-420 (Tabletop Exercise Governance) provides the exercise framework for validating recovery procedures. AG-425 (Emergency Change Freeze Governance) governs change restrictions during recovery operations. AG-426 (Fallback Staffing Governance) ensures human fallback capacity during agent unavailability. AG-403 (Dependency Failover Validation Governance) governs the validation of dependency failover mechanisms referenced in AG-422's dependency chain requirements. AG-402 (Model Serving Rate Partitioning Governance) governs capacity allocation that affects recovery resource availability.

Cite this protocol

AgentGoverning. (2026). AG-422: Recovery Time Objective Governance. The Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-422
