AG-420

Tabletop Exercise Governance

Incident Response, Recovery & Resilience · AGS v2.1 · April 2026
EU AI Act · SOX · FCA · NIST · ISO 42001

2. Summary

Tabletop Exercise Governance requires that organisations operating AI agents conduct structured, scenario-based exercises that simulate severe and unusual agent failure modes, testing the organisation's incident response capabilities, escalation paths, decision-making processes, and coordination mechanisms without requiring actual system disruption. Tabletop exercises are the primary mechanism for discovering gaps in incident response procedures before those gaps are exposed by real incidents — they reveal whether the people responsible for managing agent crises actually know what to do, can locate the tools and information they need, and can coordinate effectively under time pressure. Without governed tabletop exercises, organisations discover the weaknesses in their incident response during actual crises, when the cost of discovery includes real harm, regulatory scrutiny, and reputational damage.

3. Example

Scenario A — Untested Escalation Path Fails During Real Incident: A financial-value agent managing a £450 million fixed-income portfolio develops a systematic pricing error that overvalues holdings by 3.2%. The total misstatement is £14.4 million. The incident response plan specifies that Critical financial incidents require immediate notification of the Chief Risk Officer, the Head of Operations, and the external auditor. When the incident occurs at 22:15 on a Friday evening, the on-call analyst follows the escalation procedure — and discovers that the CRO's emergency contact number was changed 4 months ago and the plan was never updated, the Head of Operations is on sabbatical with no designated deputy listed in the plan, and the external auditor contact is a general mailbox that is not monitored outside business hours. The analyst spends 47 minutes locating the correct contacts. During that time, the agent continues to execute trades based on the mispriced portfolio, adding £2.1 million in additional exposure. The total loss, including the delayed response period, reaches £16.5 million.

What went wrong: The escalation path was documented but never tested. A single tabletop exercise simulating a Critical financial incident outside business hours would have revealed all three contact failures. The organisation had an incident response plan that looked complete on paper but failed on first real use. The £2.1 million additional exposure during the 47-minute escalation delay was entirely preventable. Consequence: £16.5 million total exposure, FCA supervisory finding for inadequate incident management under SYSC 6.1.1R, board-level governance review, and £320,000 in remediation costs including incident response plan overhaul.

Scenario B — Cross-Functional Coordination Failure Under Time Pressure: A customer-facing agent serving 340,000 retail banking customers begins providing incorrect interest rate information due to a rate-feed corruption. The error affects balance projections, early-repayment calculations, and savings rate quotations. The incident response plan assigns concurrent responsibilities: the technology team must isolate the agent, the compliance team must assess regulatory notification obligations, the customer communications team must prepare customer notifications, and the legal team must assess liability exposure. In practice, all four teams begin working in isolation. The technology team isolates the agent after 25 minutes but does not notify the compliance team, which continues its assessment based on the assumption the agent is still active. The customer communications team prepares a notification stating the error has been corrected, but the legal team — unaware of the communications draft — issues a legal hold that prevents any external communication. Three hours into the incident, the four teams hold their first coordination call and discover they have been working at cross-purposes. Customer notifications are delayed by 6 hours. During that delay, 4,200 customers make financial decisions based on incorrect information. Remediation costs reach £890,000.

What went wrong: The incident response plan assigned responsibilities but did not define the coordination mechanism — who convenes the cross-functional response, how teams share status updates, and how conflicting actions (legal hold vs. customer notification) are resolved. A tabletop exercise would have surfaced the coordination gap within the first 30 minutes of the simulated scenario, when participants would have realised that no one was designated to lead the cross-functional response. Consequence: 6-hour notification delay, 4,200 affected customers requiring individual remediation, £890,000 in remediation costs, FCA Consumer Duty finding for delayed customer communication, and reputational damage across financial media.

Scenario C — Novel Failure Mode Not Covered by Existing Playbooks: A safety-critical agent controlling environmental systems in a pharmaceutical cold-chain facility experiences a novel failure: the agent correctly identifies a temperature excursion but, due to a logic error, implements the inverse corrective action — increasing cooling when heating is needed and vice versa. The facility's incident playbooks cover "agent offline" (failover to manual control), "sensor failure" (cross-reference redundant sensors), and "communication loss" (activate backup communication channel). None of the playbooks covers "agent operating but providing inverted commands" — a scenario where the agent appears functional and actively working to resolve the issue, but its actions are making the situation worse. The operations team monitors the agent's activity log, sees it actively responding to the temperature excursion, and concludes the agent is handling the situation. The temperature deviation worsens for 2 hours before a technician physically inspects the facility and discovers the inversion. Product worth £3.7 million is destroyed by the temperature deviation.

What went wrong: The incident playbooks covered only anticipated failure modes (offline, sensor failure, communication loss) and did not cover the more dangerous scenario of an agent actively operating but producing harmful outputs. A tabletop exercise focused on "unusual and deceptive failure modes" — where participants are presented with an agent that appears to be functioning correctly but is causing harm — would have revealed this gap. The exercise would have forced participants to define detection mechanisms for active-but-harmful agent behaviour and response procedures for overriding an agent that appears operational. Consequence: £3.7 million in destroyed pharmaceutical product, regulatory investigation by the Medicines and Healthcare products Regulatory Agency, cold-chain certification suspended pending review, and supply disruption affecting 14 downstream distributors.

4. Requirement Statement

Scope: This dimension applies to every organisation operating AI agents in production environments where agent failures can cause harm across any of the five severity axes defined in AG-419 (safety, financial, rights, legal, reputational). The scope includes the design, scheduling, execution, and follow-up of tabletop exercises. Organisations operating only low-risk agents (General/Internal Copilot profile with no external-facing or decision-making capabilities) may conduct exercises annually rather than semi-annually but are not exempted from the requirement. The test is: if an agent failed catastrophically, would the organisation need to coordinate a multi-person, time-sensitive response? If yes, tabletop exercises are required to validate that the response capability actually works.

4.1. A conforming system MUST conduct tabletop exercises at least semi-annually for High-Risk/Critical tier agents and at least annually for all other production agents, simulating agent failure scenarios that test the organisation's incident response capabilities.

4.2. A conforming system MUST design exercise scenarios that span the full range of the AG-419 severity matrix, including at least one Critical-severity scenario per exercise that tests the organisation's maximum-escalation response pathway.

4.3. A conforming system MUST include at least one "novel failure mode" scenario per exercise — a failure type not covered by existing incident playbooks — to test the organisation's ability to respond to unanticipated agent behaviour.

4.4. A conforming system MUST require participation from all roles identified in the incident response plan, including technical responders, governance leads, legal counsel, communications staff, and executive decision-makers, with documented attendance records.

4.5. A conforming system MUST produce a structured exercise report within 14 calendar days of each exercise, documenting: scenario descriptions, participant actions, gaps identified, decisions made, time-to-response metrics, and a prioritised remediation plan for each identified gap. An illustrative record structure for these reports and their gap registers appears after requirement 4.12.

4.6. A conforming system MUST track remediation of identified gaps to closure, with each gap assigned an owner, a target remediation date, and a verification method, and with gap status reported to governance leadership at least monthly until all gaps are closed.

4.7. A conforming system MUST maintain an exercise scenario library that evolves based on real incidents (from the organisation and from industry), emerging threat intelligence, changes to the agent portfolio, and findings from previous exercises.

4.8. A conforming system SHOULD include external participants (regulators, key vendors, mutual-aid partners) in at least one exercise per year to test cross-organisational coordination mechanisms.

4.9. A conforming system SHOULD incorporate "injects" — mid-exercise scenario changes that escalate severity, introduce new information, or create conflicting priorities — to test adaptive decision-making under evolving conditions.

4.10. A conforming system SHOULD vary exercise timing to include at least one exercise conducted outside normal business hours (evenings, weekends, holidays) to test after-hours response capabilities.

4.11. A conforming system MAY conduct unannounced exercises where participants are not informed in advance, to test the organisation's readiness to respond without preparation time.

4.12. A conforming system MAY integrate tabletop exercises with technical simulation, where the tabletop scenario is accompanied by simulated telemetry, alerts, and dashboards that replicate the information environment of a real incident.
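The mandatory clauses above describe records as much as processes. As a purely illustrative reading of requirements 4.5 and 4.6, the sketch below models an exercise report and its gap register; the field names, types, and helper methods are assumptions drawn from the requirement text, not a normative schema.

```python
# Illustrative sketch only: possible record structures behind requirements 4.5
# and 4.6. Field names, types, and the 14-day check are assumptions, not a
# normative schema.
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List, Optional

@dataclass
class IdentifiedGap:
    description: str
    owner: str                        # 4.6: every gap has an assigned owner
    target_remediation_date: date     # 4.6: target remediation date
    verification_method: str          # 4.6: how closure will be verified
    closed_on: Optional[date] = None  # set once remediation is verified

@dataclass
class ExerciseReport:
    exercise_date: date
    report_date: date
    scenario_descriptions: List[str]          # 4.5: scenarios exercised
    participant_actions: List[str]            # 4.5: what participants actually did
    decisions_made: List[str]                 # 4.5: decisions taken during the exercise
    time_to_response_minutes: Dict[str, int]  # 4.5: time-to-response metrics
    gaps: List[IdentifiedGap] = field(default_factory=list)

    def is_timely(self) -> bool:
        """4.5: report produced within 14 calendar days of the exercise."""
        return self.report_date <= self.exercise_date + timedelta(days=14)

    def open_gaps(self) -> List[IdentifiedGap]:
        """4.6: gaps still open and therefore subject to monthly status reporting."""
        return [g for g in self.gaps if g.closed_on is None]
```

A governance team could run is_timely() and open_gaps() across its exercise archive to feed the monthly gap-status reporting that 4.6 requires.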

5. Rationale

Incident response plans are hypotheses about how an organisation will behave during a crisis. Until tested, they remain hypotheses — plausible, internally consistent, and potentially wrong. Tabletop exercises are the primary mechanism for converting incident response hypotheses into validated capabilities.

The gap between documented procedures and actual crisis performance is well-established in incident management research. Studies across multiple domains — aviation, healthcare, cybersecurity, financial services — consistently show that untested incident response plans fail at rates between 40% and 70% on first real use. The failure modes are predictable: contact information is outdated, roles and responsibilities are ambiguous, coordination mechanisms are undefined, decision authority is unclear, and novel failure modes are not covered. These are not exotic failures — they are the standard consequences of plans that have never been executed, even in simulation.

AI agent incidents introduce failure modes that are qualitatively different from traditional IT incidents. Traditional IT incidents typically involve systems that are clearly broken — they crash, they return errors, they become unavailable. The response paradigm is: detect the failure, identify the root cause, restore service. AI agent incidents can involve systems that appear to be functioning normally while causing harm — an agent that is available, responsive, and producing outputs, but whose outputs are subtly wrong (Scenario C), systematically biased (AG-419 Scenario C), or financially harmful (AG-419 Scenario B). These "active-but-harmful" failure modes require fundamentally different detection and response strategies that are unlikely to be developed under the time pressure of a real incident. Tabletop exercises provide the space to think through these scenarios, develop detection mechanisms, and define response procedures before the scenarios occur in production.

Tabletop exercises also serve a critical coordination function. Agent-related incidents typically require cross-functional response — technology teams to isolate or remediate the agent, governance teams to assess compliance implications, legal teams to evaluate liability, communications teams to manage stakeholder notification, and executive leadership to make escalation decisions. In the absence of practised coordination, these teams default to working in isolation (Scenario B), producing conflicting actions, duplicated effort, and delayed response. The tabletop format forces cross-functional interaction in a low-stakes environment, building the coordination muscle memory that will be needed during real incidents.

The "novel failure mode" requirement (4.3) deserves specific rationale. Incident playbooks, by definition, cover anticipated failure modes. But the history of AI system failures demonstrates that the most damaging incidents are often novel — failure modes that were not anticipated during planning. The 2010 Flash Crash involved a novel interaction between algorithmic trading systems that was not covered by any existing playbook. The 2018 Boeing 737 MAX accidents involved a novel failure mode (MCAS system activating on erroneous sensor data) that was not covered by pilot training. The requirement to include at least one novel failure mode per exercise is not about predicting the specific novel failure that will occur — it is about practising the organisation's ability to respond to the unexpected. Organisations that have practised responding to novel scenarios are measurably better at responding to the next novel scenario they encounter, even if the specifics are different.

Regulatory expectations for incident response testing are explicit and growing. DORA Article 11 requires financial entities to test their ICT business continuity plans at least annually, including scenario-based testing. The EU AI Act's quality management system requirements under Article 17 implicitly require testing of incident response procedures. The Bank of England's operational resilience framework requires firms to test their ability to remain within impact tolerances during severe but plausible scenarios. ISO 22301 (Business Continuity Management) requires exercising and testing of continuity procedures. AG-420 aligns with and extends these requirements to the specific context of AI agent failures.

6. Implementation Guidance

Tabletop exercises should be designed as structured, facilitated discussions — not free-form conversations and not scripted walkthroughs. The facilitator presents a scenario, participants discuss what they would do, and the facilitator introduces additional information ("injects") that evolves the scenario. The goal is not to test whether participants have memorised the incident response plan — it is to test whether the plan actually works when executed by real people under realistic conditions.
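To make the inject-driven structure concrete, the following is a minimal sketch of how a single scenario-library entry might be recorded, loosely based on Scenario C and the inject requirement in 4.9; the identifiers, timings, and wording are hypothetical.

```python
# Minimal illustrative sketch of one scenario-library entry. All values are
# hypothetical (loosely based on Scenario C); this is not a normative format.
scenario = {
    "id": "LIB-042",                   # hypothetical library identifier
    "title": "Active-but-harmful agent: inverted corrective actions",
    "severity_axis": "safety",         # AG-419 axis under test
    "target_severity": "Critical",     # exercises the maximum-escalation pathway (4.2)
    "initial_brief": (
        "The cold-chain agent reports it is responding to a temperature "
        "excursion, but telemetry shows the deviation is still worsening."
    ),
    "injects": [                       # mid-exercise scenario changes per 4.9
        {"at_minute": 15,
         "content": "Redundant sensors confirm the excursion; the agent's "
                    "activity log shows it issuing corrective commands."},
        {"at_minute": 30,
         "content": "An on-site technician reports that the agent's commands "
                    "appear inverted relative to the excursion."},
        {"at_minute": 45,
         "content": "A trade journalist asks whether any product has been lost."},
    ],
    "expected_decision_points": [
        "Who has authority to override an agent that appears operational?",
        "How is active-but-harmful behaviour detected and escalated?",
    ],
}
```

The facilitator works through the injects in order, recording participant decisions against the expected decision points; divergences between expected and actual decisions feed the exercise report required by 4.5.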

Recommended patterns:

Anti-patterns to avoid:

Industry Considerations

Financial Services. Financial regulators increasingly expect scenario-based testing of incident response capabilities. DORA Article 11 explicitly requires testing of ICT business continuity plans. The Bank of England's operational resilience framework requires firms to test their ability to remain within impact tolerances during severe but plausible scenarios. Financial-sector tabletop exercises should include scenarios involving market-hours timing pressure (e.g., a pricing agent failure during market open), cross-border coordination (e.g., an agent failure affecting multiple jurisdictions with different regulatory notification requirements), and cascading failures (e.g., an agent failure triggering a downstream settlement failure).

Healthcare and Safety-Critical. Tabletop exercises in safety-critical domains should include scenarios involving immediate physical risk, where the exercise tests the speed of agent shutdown and failover to manual control. Healthcare exercises should include scenarios involving clinical decision support agents providing incorrect recommendations, with the exercise testing whether clinicians can detect and override incorrect agent outputs under time pressure. Exercises should also test coordination with external parties: medical device regulators, patient safety organisations, and other healthcare providers.

Public Sector. Public-sector exercises should include scenarios involving rights impact — agents making incorrect benefit determinations, biased risk assessments, or erroneous enforcement decisions. These scenarios test the organisation's ability to detect differential impact on protected groups, coordinate with equality bodies, and manage public communication in politically sensitive contexts.

Crypto and Web3. Exercises should include scenarios involving the irreversibility of blockchain transactions, where an agent executes incorrect transactions that cannot be reversed. These scenarios test the organisation's ability to respond when remediation through reversal is impossible and alternative compensation or recovery mechanisms must be improvised.

Maturity Model

Basic Implementation — The organisation conducts tabletop exercises at the required frequency (semi-annually for High-Risk/Critical, annually for others). Scenarios span at least three of the five severity axes. At least one Critical-severity scenario is included per exercise. Participants include representatives from all roles in the incident response plan. Exercise reports are produced within 14 days. Gaps are documented with owners and target dates. Gap status is reported monthly. This level meets the minimum mandatory requirements and provides baseline assurance that the incident response plan has been tested.

Intermediate Implementation — All basic capabilities plus: scenarios include progressive inject sequences with at least 3 injects per scenario. At least one novel failure mode scenario is included per exercise. A scenario library of at least 20 scenarios is maintained and updated. After-hours exercises are conducted at least annually. Gap remediation is verified before closure (the fix is tested, not just implemented). Exercises incorporate the AG-419 severity matrix, requiring participants to classify the incident using the matrix as part of the exercise. External participants (key vendors, mutual-aid partners) are included in at least one exercise per year.

Advanced Implementation — All intermediate capabilities plus: exercises are integrated with technical simulation, providing realistic telemetry, alerts, and dashboards. Unannounced exercises test readiness without preparation time. Exercise scenarios are informed by threat intelligence, real-incident databases, and AI safety research. Cross-organisational exercises test coordination with regulators, industry peers, and supply-chain partners. Exercise effectiveness is measured through metrics (time-to-response improvement, gap recurrence rate, participant confidence scores). The organisation can demonstrate year-over-year improvement in exercise outcomes. Independent observers evaluate exercise design and execution quality.

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Exercise Frequency Compliance
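As an illustration only, a cadence check over historical exercise dates might look like the sketch below; the tier flag and the 183/366-day thresholds are assumptions derived from requirement 4.1, not normative test values.

```python
# Illustrative sketch of a cadence check for Test 8.1. Thresholds are
# assumptions derived from requirement 4.1, not normative values.
from datetime import date, timedelta

def frequency_compliant(exercise_dates: list[date],
                        high_risk_critical: bool,
                        as_of: date) -> bool:
    """Return True if no gap between consecutive exercises (or since the most
    recent exercise) exceeds the cadence implied by requirement 4.1."""
    if not exercise_dates:
        return False
    # Assumed thresholds: ~6 months for High-Risk/Critical agents, ~12 months
    # for all other production agents.
    max_gap = timedelta(days=183) if high_risk_critical else timedelta(days=366)
    ordered = sorted(exercise_dates) + [as_of]
    return all(later - earlier <= max_gap
               for earlier, later in zip(ordered, ordered[1:]))
```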

Test 8.2: Severity Matrix Coverage in Scenarios

Test 8.3: Novel Failure Mode Inclusion

Test 8.4: Cross-Functional Participation Completeness

Test 8.5: Exercise Report Timeliness and Completeness

Test 8.6: Gap Remediation Tracking and Closure

Test 8.7: Scenario Library Maintenance

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Supports compliance
EU AI Act | Article 17 (Quality Management System) | Supports compliance
SOX | Section 404 (Internal Controls Over Financial Reporting) | Supports compliance
FCA SYSC | 6.1.1R (Systems and Controls) | Direct requirement
NIST AI RMF | GOVERN 1.5, MANAGE 4.1 | Supports compliance
ISO 42001 | Clause 9.2 (Internal Audit), Clause 10.1 (Continual Improvement) | Supports compliance
DORA | Article 11 (ICT Business Continuity Policy), Article 26 (Testing of ICT Tools) | Direct requirement

EU AI Act — Article 9 (Risk Management System)

Article 9 requires providers of high-risk AI systems to establish, implement, document, and maintain a risk management system that includes the adoption of suitable risk management measures. Tabletop exercises are a risk management measure that validates the effectiveness of incident response procedures — the organisation's primary defence against the consequences of AI system failures. While Article 9 does not explicitly mandate tabletop exercises, it requires that risk management measures be tested and their effectiveness demonstrated. Tabletop exercises provide the evidence that incident response procedures (a core risk management measure) have been tested and found effective or have been improved based on test findings.

EU AI Act — Article 17 (Quality Management System)

Article 17 requires a quality management system that includes procedures for handling nonconformities, including corrective actions. Tabletop exercises test the procedures for handling nonconformities (incident response) under realistic conditions. The exercise reports and gap remediation records provide the quality management evidence that procedures have been tested, gaps have been identified, and corrective actions have been taken.

SOX — Section 404 (Internal Controls Over Financial Reporting)

For SOX-subject organisations, the incident response capability for financial-value agents is an internal control. SOX auditors assess not only whether controls exist but whether they are effective. A control that has never been tested cannot be assessed as effective. Tabletop exercises provide the testing evidence that incident response controls for financial-value agents have been validated. The exercise report and gap remediation records demonstrate the control testing that Section 404 requires.

FCA SYSC — 6.1.1R (Systems and Controls)

The FCA expects firms to maintain systems and controls that are adequate for the nature and scale of their business. For firms deploying AI agents in regulated activities, this includes incident management capabilities proportional to the risk. The FCA has explicitly emphasised the importance of testing contingency arrangements through scenario-based exercises. Supervisory findings have cited inadequate scenario testing as a contributing factor in incident response failures. AG-420 directly addresses the FCA's expectation that firms test their incident response capabilities through realistic scenarios rather than relying solely on documented procedures.

NIST AI RMF — GOVERN 1.5, MANAGE 4.1

GOVERN 1.5 addresses processes for escalation and response to AI-related risks. MANAGE 4.1 addresses incident response planning and execution. Both functions require not only documented procedures but validated capabilities. The NIST AI RMF's emphasis on organisational practices and processes for AI risk management implicitly requires testing of those practices. Tabletop exercises provide the testing mechanism that validates GOVERN 1.5 escalation processes and MANAGE 4.1 response procedures are functional.

ISO 42001 — Clause 9.2 (Internal Audit), Clause 10.1 (Continual Improvement)

ISO 42001 Clause 9.2 requires internal audits to determine whether the AI management system conforms to requirements and is effectively implemented. Tabletop exercises are a form of internal audit of the incident response component of the AI management system — they test whether documented procedures are effectively implemented by the people who will execute them. Clause 10.1 requires continual improvement based on audit findings, corrective actions, and performance evaluation. The exercise gap remediation cycle (identify gaps, remediate, verify, track) is a continual improvement mechanism directly aligned with Clause 10.1.

DORA — Article 11 (ICT Business Continuity Policy) and Article 26 (Testing of ICT Tools)

DORA Article 11 requires financial entities to put in place ICT business continuity policies and plans that are tested at least annually. Article 11(6) specifically requires scenario-based testing of ICT business continuity plans, including scenarios involving severe but plausible disruptions. Article 26 requires financial entities to establish programmes for testing ICT tools, systems, and processes. AG-420's tabletop exercise requirements directly implement DORA's scenario-based testing mandate. The semi-annual frequency for High-Risk/Critical agents exceeds DORA's annual minimum, reflecting the higher risk profile of AI agent failures. The requirement for novel failure mode scenarios aligns with DORA's emphasis on "severe but plausible" scenarios that go beyond routine disruptions.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | Organisation-wide: affects the quality and speed of response to every agent-related incident

Consequence chain: Without tabletop exercises, incident response procedures remain untested hypotheses. The organisation discovers gaps in its response capability during actual incidents — when the cost of discovery is measured in additional harm, delayed response, and regulatory scrutiny rather than in exercise time and remediation effort. The immediate failure mode is response delay: during a real incident, responders encounter problems that should have been identified and resolved during exercises — outdated contact information (Scenario A: 47-minute escalation delay adding £2.1 million in exposure), undefined coordination mechanisms (Scenario B: 6-hour notification delay affecting 4,200 customers), and unrecognised failure modes (Scenario C: 2-hour detection delay destroying £3.7 million in product). The downstream consequence is amplified harm: every minute of response delay during a real incident allows the harm to compound. Financial exposure grows. Additional individuals are affected. Safety hazards persist. The regulatory consequence is particularly severe because regulators view untested incident response as a governance failure independent of any specific incident. DORA Article 11(6) explicitly requires scenario-based testing; the FCA expects tested contingency arrangements; the EU AI Act's risk management requirements implicitly include procedure validation. An organisation that cannot demonstrate regular, structured exercise programmes faces enforcement action for inadequate governance even if no specific incident has occurred. The reputational consequence emerges when post-incident reviews reveal that the response failures were foreseeable and would have been identified by even a basic tabletop exercise — creating the narrative that the organisation knew (or should have known) its response capability was inadequate and chose not to test it.

Cross-references: AG-419 (Adverse Event Severity Matrix Governance) provides the severity framework used to design exercise scenarios and classify simulated incidents during exercises. AG-008 (Governance Continuity Under Failure) defines the continuity mechanisms that exercises test. AG-421 (Recovery Point Objective for Memory and State Governance) and AG-422 (Recovery Time Objective Governance) define recovery targets that exercises validate. AG-423 (Incident Learning Closure Governance) consumes exercise findings as inputs to the learning process. AG-426 (Fallback Staffing Governance) defines staffing arrangements that exercises test. AG-427 (Mutual Aid and Vendor Coordination Governance) defines cross-organisational coordination mechanisms that exercises validate. AG-403 (Dependency Failover Validation Governance) defines failover mechanisms that exercise scenarios may activate.

Cite this protocol
AgentGoverning. (2026). AG-420: Tabletop Exercise Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-420