AG-420

Tabletop Exercise Governance

Incident Response, Recovery & Resilience · AGS v2.1 · April 2026
EU AI Act · SOX · FCA · NIST · ISO 42001

2. Summary

Tabletop Exercise Governance requires that organisations operating AI agents conduct structured, scenario-based exercises that simulate severe and unusual agent failure modes, testing the organisation's incident response capabilities, escalation paths, decision-making processes, and coordination mechanisms without requiring actual system disruption. Tabletop exercises are the primary mechanism for discovering gaps in incident response procedures before those gaps are exposed by real incidents — they reveal whether the people responsible for managing agent crises actually know what to do, can locate the tools and information they need, and can coordinate effectively under time pressure. Without governed tabletop exercises, organisations discover the weaknesses in their incident response during actual crises, when the cost of discovery includes real harm, regulatory scrutiny, and reputational damage.

3. Example

Scenario A — Untested Escalation Path Fails During Real Incident: A financial-value agent managing a £450 million fixed-income portfolio develops a systematic pricing error that overvalues holdings by 3.2%. The total misstatement is £14.4 million. The incident response plan specifies that Critical financial incidents require immediate notification of the Chief Risk Officer, the Head of Operations, and the external auditor. When the incident occurs at 22:15 on a Friday evening, the on-call analyst follows the escalation procedure — and discovers that the CRO's emergency contact number was changed 4 months ago and the plan was never updated, the Head of Operations is on sabbatical with no designated deputy listed in the plan, and the external auditor contact is a general mailbox that is not monitored outside business hours. The analyst spends 47 minutes locating the correct contacts. During that time, the agent continues to execute trades based on the mispriced portfolio, adding £2.1 million in additional exposure. The total loss, including the delayed response period, reaches £16.5 million.

What went wrong: The escalation path was documented but never tested. A single tabletop exercise simulating a Critical financial incident outside business hours would have revealed all three contact failures. The organisation had an incident response plan that looked complete on paper but failed on first real use. The £2.1 million additional exposure during the 47-minute escalation delay was entirely preventable. Consequence: £16.5 million total exposure, FCA supervisory finding for inadequate incident management under SYSC 6.1.1R, board-level governance review, and £320,000 in remediation costs including incident response plan overhaul.

Scenario B — Cross-Functional Coordination Failure Under Time Pressure: A customer-facing agent serving 340,000 retail banking customers begins providing incorrect interest rate information due to a rate-feed corruption. The error affects balance projections, early-repayment calculations, and savings rate quotations. The incident response plan assigns concurrent responsibilities: the technology team must isolate the agent, the compliance team must assess regulatory notification obligations, the customer communications team must prepare customer notifications, and the legal team must assess liability exposure. In practice, all four teams begin working in isolation. The technology team isolates the agent after 25 minutes but does not notify the compliance team, which continues its assessment based on the assumption the agent is still active. The customer communications team prepares a notification stating the error has been corrected, but the legal team — unaware of the communications draft — issues a legal hold that prevents any external communication. Three hours into the incident, the four teams hold their first coordination call and discover they have been working at cross-purposes. Customer notifications are delayed by 6 hours. During that delay, 4,200 customers make financial decisions based on incorrect information. Remediation costs reach £890,000.

What went wrong: The incident response plan assigned responsibilities but did not define the coordination mechanism — who convenes the cross-functional response, how teams share status updates, and how conflicting actions (legal hold vs. customer notification) are resolved. A tabletop exercise would have surfaced the coordination gap within the first 30 minutes of the simulated scenario, when participants would have realised that no one was designated to lead the cross-functional response. Consequence: 6-hour notification delay, 4,200 affected customers requiring individual remediation, £890,000 in remediation costs, FCA Consumer Duty finding for delayed customer communication, and reputational damage across financial media.

Scenario C — Novel Failure Mode Not Covered by Existing Playbooks: A safety-critical agent controlling environmental systems in a pharmaceutical cold-chain facility experiences a novel failure: the agent correctly identifies a temperature excursion but, due to a logic error, implements the inverse corrective action — increasing cooling when heating is needed and vice versa. The facility's incident playbooks cover "agent offline" (failover to manual control), "sensor failure" (cross-reference redundant sensors), and "communication loss" (activate backup communication channel). None of the playbooks covers "agent operating but providing inverted commands" — a scenario where the agent appears functional and actively working to resolve the issue, but its actions are making the situation worse. The operations team monitors the agent's activity log, sees it actively responding to the temperature excursion, and concludes the agent is handling the situation. The temperature deviation worsens for 2 hours before a technician physically inspects the facility and discovers the inversion. Product worth £3.7 million is destroyed by the temperature deviation.

What went wrong: The incident playbooks covered only anticipated failure modes (offline, sensor failure, communication loss) and did not cover the more dangerous scenario of an agent actively operating but producing harmful outputs. A tabletop exercise focused on "unusual and deceptive failure modes" — where participants are presented with an agent that appears to be functioning correctly but is causing harm — would have revealed this gap. The exercise would have forced participants to define detection mechanisms for active-but-harmful agent behaviour and response procedures for overriding an agent that appears operational. Consequence: £3.7 million in destroyed pharmaceutical product, regulatory investigation by the Medicines and Healthcare products Regulatory Agency, cold-chain certification suspended pending review, and supply disruption affecting 14 downstream distributors.

4. Requirement Statement

Scope: This dimension applies to every organisation operating AI agents in production environments where agent failures can cause harm across any of the five severity axes defined in AG-419 (safety, financial, rights, legal, reputational). The scope includes the design, scheduling, execution, and follow-up of tabletop exercises. Organisations operating only low-risk agents (General/Internal Copilot profile with no external-facing or decision-making capabilities) may conduct exercises annually rather than semi-annually but are not exempted from the requirement. The test is: if an agent failed catastrophically, would the organisation need to coordinate a multi-person, time-sensitive response? If yes, tabletop exercises are required to validate that the response capability actually works.

4.1. A conforming system MUST conduct tabletop exercises at least semi-annually for High-Risk/Critical tier agents and at least annually for all other production agents, simulating agent failure scenarios that test the organisation's incident response capabilities.

4.2. A conforming system MUST design exercise scenarios that span the full range of the AG-419 severity matrix, including at least one Critical-severity scenario per exercise that tests the organisation's maximum-escalation response pathway.

4.3. A conforming system MUST include at least one "novel failure mode" scenario per exercise — a failure type not covered by existing incident playbooks — to test the organisation's ability to respond to unanticipated agent behaviour.

4.4. A conforming system MUST require participation from all roles identified in the incident response plan, including technical responders, governance leads, legal counsel, communications staff, and executive decision-makers, with documented attendance records.

4.5. A conforming system MUST produce a structured exercise report within 14 calendar days of each exercise, documenting: scenario descriptions, participant actions, gaps identified, decisions made, time-to-response metrics, and a prioritised remediation plan for each identified gap. An illustrative record structure for these reports and their gap registers appears after requirement 4.12.

4.6. A conforming system MUST track remediation of identified gaps to closure, with each gap assigned an owner, a target remediation date, and a verification method, and with gap status reported to governance leadership at least monthly until all gaps are closed.

4.7. A conforming system MUST maintain an exercise scenario library that evolves based on real incidents (from the organisation and from industry), emerging threat intelligence, changes to the agent portfolio, and findings from previous exercises.

4.8. A conforming system SHOULD include external participants (regulators, key vendors, mutual-aid partners) in at least one exercise per year to test cross-organisational coordination mechanisms.

4.9. A conforming system SHOULD incorporate "injects" — mid-exercise scenario changes that escalate severity, introduce new information, or create conflicting priorities — to test adaptive decision-making under evolving conditions.

4.10. A conforming system SHOULD vary exercise timing to include at least one exercise conducted outside normal business hours (evenings, weekends, holidays) to test after-hours response capabilities.

4.11. A conforming system MAY conduct unannounced exercises where participants are not informed in advance, to test the organisation's readiness to respond without preparation time.

4.12. A conforming system MAY integrate tabletop exercises with technical simulation, where the tabletop scenario is accompanied by simulated telemetry, alerts, and dashboards that replicate the information environment of a real incident.
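The mandatory clauses above describe records as much as processes. As a purely illustrative reading of requirements 4.5 and 4.6, the sketch below models an exercise report and its gap register; the field names, types, and helper methods are assumptions drawn from the requirement text, not a normative schema.

```python
# Illustrative sketch only: possible record structures behind requirements 4.5
# and 4.6. Field names, types, and the 14-day check are assumptions, not a
# normative schema.
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List, Optional

@dataclass
class IdentifiedGap:
    description: str
    owner: str                        # 4.6: every gap has an assigned owner
    target_remediation_date: date     # 4.6: target remediation date
    verification_method: str          # 4.6: how closure will be verified
    closed_on: Optional[date] = None  # set once remediation is verified

@dataclass
class ExerciseReport:
    exercise_date: date
    report_date: date
    scenario_descriptions: List[str]          # 4.5: scenarios exercised
    participant_actions: List[str]            # 4.5: what participants actually did
    decisions_made: List[str]                 # 4.5: decisions taken during the exercise
    time_to_response_minutes: Dict[str, int]  # 4.5: time-to-response metrics
    gaps: List[IdentifiedGap] = field(default_factory=list)

    def is_timely(self) -> bool:
        """4.5: report produced within 14 calendar days of the exercise."""
        return self.report_date <= self.exercise_date + timedelta(days=14)

    def open_gaps(self) -> List[IdentifiedGap]:
        """4.6: gaps still open and therefore subject to monthly status reporting."""
        return [g for g in self.gaps if g.closed_on is None]
```

A governance team could run is_timely() and open_gaps() across its exercise archive to feed the monthly gap-status reporting that 4.6 requires.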

5. Rationale

Incident response plans are hypotheses about how an organisation will behave during a crisis. Until tested, they remain hypotheses — plausible, internally consistent, and potentially wrong. Tabletop exercises are the primary mechanism for converting incident response hypotheses into validated capabilities.

The gap between documented procedures and actual crisis performance is well-established in incident management research. Studies across multiple domains — aviation, healthcare, cybersecurity, financial services — consistently show that untested incident response plans fail at rates between 40% and 70% on first real use. The failure modes are predictable: contact information is outdated, roles and responsibilities are ambiguous, coordination mechanisms are undefined, decision authority is unclear, and novel failure modes are not covered. These are not exotic failures — they are the standard consequences of plans that have never been executed, even in simulation.

AI agent incidents introduce failure modes that are qualitatively different from traditional IT incidents. Traditional IT incidents typically involve systems that are clearly broken — they crash, they return errors, they become unavailable. The response paradigm is: detect the failure, identify the root cause, restore service. AI agent incidents can involve systems that appear to be functioning normally while causing harm — an agent that is available, responsive, and producing outputs, but whose outputs are subtly wrong (Scenario C), systematically biased (AG-419 Scenario C), or financially harmful (AG-419 Scenario B). These "active-but-harmful" failure modes require fundamentally different detection and response strategies that are unlikely to be developed under the time pressure of a real incident. Tabletop exercises provide the space to think through these scenarios, develop detection mechanisms, and define response procedures before the scenarios occur in production.

Tabletop exercises also serve a critical coordination function. Agent-related incidents typically require cross-functional response — technology teams to isolate or remediate the agent, governance teams to assess compliance implications, legal teams to evaluate liability, communications teams to manage stakeholder notification, and executive leadership to make escalation decisions. In the absence of practised coordination, these teams default to working in isolation (Scenario B), producing conflicting actions, duplicated effort, and delayed response. The tabletop format forces cross-functional interaction in a low-stakes environment, building the coordination muscle memory that will be needed during real incidents.

The "novel failure mode" requirement (4.3) deserves specific rationale. Incident playbooks, by definition, cover anticipated failure modes. But the history of AI system failures demonstrates that the most damaging incidents are often novel — failure modes that were not anticipated during planning. The 2010 Flash Crash involved a novel interaction between algorithmic trading systems that was not covered by any existing playbook. The 2018 Boeing 737 MAX accidents involved a novel failure mode (MCAS system activating on erroneous sensor data) that was not covered by pilot training. The requirement to include at least one novel failure mode per exercise is not about predicting the specific novel failure that will occur — it is about practising the organisation's ability to respond to the unexpected. Organisations that have practised responding to novel scenarios are measurably better at responding to the next novel scenario they encounter, even if the specifics are different.

Regulatory expectations for incident response testing are explicit and growing. DORA Article 11 requires financial entities to test their ICT business continuity plans at least annually, including scenario-based testing. The EU AI Act's quality management system requirements under Article 17 implicitly require testing of incident response procedures. The Bank of England's operational resilience framework requires firms to test their ability to remain within impact tolerances during severe but plausible scenarios. ISO 22301 (Business Continuity Management) requires exercising and testing of continuity procedures. AG-420 aligns with and extends these requirements to the specific context of AI agent failures.

6. Implementation Guidance

Tabletop exercises should be designed as structured, facilitated discussions — not free-form conversations and not scripted walkthroughs. The facilitator presents a scenario, participants discuss what they would do, and the facilitator introduces additional information ("injects") that evolves the scenario. The goal is not to test whether participants have memorised the incident response plan — it is to test whether the plan actually works when executed by real people under realistic conditions.
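To make the inject-driven structure concrete, the following is a minimal sketch of how a single scenario-library entry might be recorded, loosely based on Scenario C and the inject requirement in 4.9; the identifiers, timings, and wording are hypothetical.

```python
# Minimal illustrative sketch of one scenario-library entry. All values are
# hypothetical (loosely based on Scenario C); this is not a normative format.
scenario = {
    "id": "LIB-042",                   # hypothetical library identifier
    "title": "Active-but-harmful agent: inverted corrective actions",
    "severity_axis": "safety",         # AG-419 axis under test
    "target_severity": "Critical",     # exercises the maximum-escalation pathway (4.2)
    "initial_brief": (
        "The cold-chain agent reports it is responding to a temperature "
        "excursion, but telemetry shows the deviation is still worsening."
    ),
    "injects": [                       # mid-exercise scenario changes per 4.9
        {"at_minute": 15,
         "content": "Redundant sensors confirm the excursion; the agent's "
                    "activity log shows it issuing corrective commands."},
        {"at_minute": 30,
         "content": "An on-site technician reports that the agent's commands "
                    "appear inverted relative to the excursion."},
        {"at_minute": 45,
         "content": "A trade journalist asks whether any product has been lost."},
    ],
    "expected_decision_points": [
        "Who has authority to override an agent that appears operational?",
        "How is active-but-harmful behaviour detected and escalated?",
    ],
}
```

The facilitator works through the injects in order, recording participant decisions against the expected decision points; divergences between expected and actual decisions feed the exercise report required by 4.5.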

Recommended patterns:

Anti-patterns to avoid:

Industry Considerations

Financial Services. Financial regulators increasingly expect scenario-based testing of incident response capabilities. DORA Article 11 explicitly requires testing of ICT business continuity plans. The Bank of England's operational resilience framework requires firms to test their ability to remain within impact tolerances during severe but plausible scenarios. Financial-sector tabletop exercises should include scenarios involving market-hours timing pressure (e.g., a pricing agent failure during market open), cross-border coordination (e.g., an agent failure affecting multiple jurisdictions with different regulatory notification requirements), and cascading failures (e.g., an agent failure triggering a downstream settlement failure).

Healthcare and Safety-Critical. Tabletop exercises in safety-critical domains should include scenarios involving immediate physical risk, where the exercise tests the speed of agent shutdown and failover to manual control. Healthcare exercises should include scenarios involving clinical decision support agents providing incorrect recommendations, with the exercise testing whether clinicians can detect and override incorrect agent outputs under time pressure. Exercises should also test coordination with external parties: medical device regulators, patient safety organisations, and other healthcare providers.

Public Sector. Public-sector exercises should include scenarios involving rights impact — agents making incorrect benefit determinations, biased risk assessments, or erroneous enforcement decisions. These scenarios test the organisation's ability to detect differential impact on protected groups, coordinate with equality bodies, and manage public communication in politically sensitive contexts.

Crypto and Web3. Exercises should include scenarios involving the irreversibility of blockchain transactions, where an agent executes incorrect transactions that cannot be reversed. These scenarios test the organisation's ability to respond when remediation through reversal is impossible and alternative compensation or recovery mechanisms must be improvised.

Maturity Model

Basic Implementation — The organisation conducts tabletop exercises at the required frequency (semi-annually for High-Risk/Critical, annually for others). Scenarios span at least three of the five severity axes. At least one Critical-severity scenario is included per exercise. Participants include representatives from all roles in the incident response plan. Exercise reports are produced within 14 days. Gaps are documented with owners and target dates. Gap status is reported monthly. This level meets the minimum mandatory requirements and provides baseline assurance that the incident response plan has been tested.

Intermediate Implementation — All basic capabilities plus: scenarios include progressive inject sequences with at least 3 injects per scenario. At least one novel failure mode scenario is included per exercise. A scenario library of at least 20 scenarios is maintained and updated. After-hours exercises are conducted at least annually. Gap remediation is verified before closure (the fix is tested, not just implemented). Exercises incorporate the AG-419 severity matrix, requiring participants to classify the incident using the matrix as part of the exercise. External participants (key vendors, mutual-aid partners) are included in at least one exercise per year.

Advanced Implementation — All intermediate capabilities plus: exercises are integrated with technical simulation, providing realistic telemetry, alerts, and dashboards. Unannounced exercises test readiness without preparation time. Exercise scenarios are informed by threat intelligence, real-incident databases, and AI safety research. Cross-organisational exercises test coordination with regulators, industry peers, and supply-chain partners. Exercise effectiveness is measured through metrics (time-to-response improvement, gap recurrence rate, participant confidence scores). The organisation can demonstrate year-over-year improvement in exercise outcomes. Independent observers evaluate exercise design and execution quality.

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Exercise Frequency Compliance
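As an illustration only, a cadence check over historical exercise dates might look like the sketch below; the tier flag and the 183/366-day thresholds are assumptions derived from requirement 4.1, not normative test values.

```python
# Illustrative sketch of a cadence check for Test 8.1. Thresholds are
# assumptions derived from requirement 4.1, not normative values.
from datetime import date, timedelta

def frequency_compliant(exercise_dates: list[date],
                        high_risk_critical: bool,
                        as_of: date) -> bool:
    """Return True if no gap between consecutive exercises (or since the most
    recent exercise) exceeds the cadence implied by requirement 4.1."""
    if not exercise_dates:
        return False
    # Assumed thresholds: ~6 months for High-Risk/Critical agents, ~12 months
    # for all other production agents.
    max_gap = timedelta(days=183) if high_risk_critical else timedelta(days=366)
    ordered = sorted(exercise_dates) + [as_of]
    return all(later - earlier <= max_gap
               for earlier, later in zip(ordered, ordered[1:]))
```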

Test 8.2: Severity Matrix Coverage in Scenarios

Test 8.3: Novel Failure Mode Inclusion

Test 8.4: Cross-Functional Participation Completeness

Test 8.5: Exercise Report Timeliness and Completeness

Test 8.6: Gap Remediation Tracking and Closure

Test 8.7: Scenario Library Maintenance

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Supports compliance
EU AI Act | Article 17 (Quality Management System) | Supports compliance
SOX | Section 404 (Internal Controls Over Financial Reporting) | Supports compliance
FCA SYSC | 6.1.1R (Systems and Controls) | Direct requirement
NIST AI RMF | GOVERN 1.5, MANAGE 4.1 | Supports compliance
ISO 42001 | Clause 9.2 (Internal Audit), Clause 10.1 (Continual Improvement) | Supports compliance
DORA | Article 11 (ICT Business Continuity Policy), Article 26 (Testing of ICT Tools) | Direct requirement

EU AI Act — Article 9 (Risk Management System)

Article 9 requires providers of high-risk AI systems to establish, implement, document, and maintain a risk management system that includes the adoption of suitable risk management measures. Tabletop exercises are a risk management measure that validates the effectiveness of incident response procedures — the organisation's primary defence against the consequences of AI system failures. While Article 9 does not explicitly mandate tabletop exercises, it requires that risk management measures be tested and their effectiveness demonstrated. Tabletop exercises provide the evidence that incident response procedures (a core risk management measure) have been tested and found effective or have been improved based on test findings.

EU AI Act — Article 17 (Quality Management System)

Article 17 requires a quality management system that includes procedures for handling nonconformities, including corrective actions. Tabletop exercises test the procedures for handling nonconformities (incident response) under realistic conditions. The exercise reports and gap remediation records provide the quality management evidence that procedures have been tested, gaps have been identified, and corrective actions have been taken.

SOX — Section 404 (Internal Controls Over Financial Reporting)

For SOX-subject organisations, the incident response capability for financial-value agents is an internal control. SOX auditors assess not only whether controls exist but whether they are effective. A control that has never been tested cannot be assessed as effective. Tabletop exercises provide the testing evidence that incident response controls for financial-value agents have been validated. The exercise report and gap remediation records demonstrate the control testing that Section 404 requires.

FCA SYSC — 6.1.1R (Systems and Controls)

The FCA expects firms to maintain systems and controls that are adequate for the nature and scale of their business. For firms deploying AI agents in regulated activities, this includes incident management capabilities proportional to the risk. The FCA has explicitly emphasised the importance of testing contingency arrangements through scenario-based exercises. Supervisory findings have cited inadequate scenario testing as a contributing factor in incident response failures. AG-420 directly addresses the FCA's expectation that firms test their incident response capabilities through realistic scenarios rather than relying solely on documented procedures.

NIST AI RMF — GOVERN 1.5, MANAGE 4.1

GOVERN 1.5 addresses processes for escalation and response to AI-related risks. MANAGE 4.1 addresses incident response planning and execution. Both functions require not only documented procedures but validated capabilities. The NIST AI RMF's emphasis on organisational practices and processes for AI risk management implicitly requires testing of those practices. Tabletop exercises provide the testing mechanism that validates GOVERN 1.5 escalation processes and MANAGE 4.1 response procedures are functional.

ISO 42001 — Clause 9.2 (Internal Audit), Clause 10.1 (Continual Improvement)

ISO 42001 Clause 9.2 requires internal audits to determine whether the AI management system conforms to requirements and is effectively implemented. Tabletop exercises are a form of internal audit of the incident response component of the AI management system — they test whether documented procedures are effectively implemented by the people who will execute them. Clause 10.1 requires continual improvement based on audit findings, corrective actions, and performance evaluation. The exercise gap remediation cycle (identify gaps, remediate, verify, track) is a continual improvement mechanism directly aligned with Clause 10.1.

DORA — Article 11 (ICT Business Continuity Policy) and Article 26 (Testing of ICT Tools)

DORA Article 11 requires financial entities to put in place ICT business continuity policies and plans that are tested at least annually. Article 11(6) specifically requires scenario-based testing of ICT business continuity plans, including scenarios involving severe but plausible disruptions. Article 26 requires financial entities to establish programmes for testing ICT tools, systems, and processes. AG-420's tabletop exercise requirements directly implement DORA's scenario-based testing mandate. The semi-annual frequency for High-Risk/Critical agents exceeds DORA's annual minimum, reflecting the higher risk profile of AI agent failures. The requirement for novel failure mode scenarios aligns with DORA's emphasis on "severe but plausible" scenarios that go beyond routine disruptions.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | Organisation-wide: affects the quality and speed of response to every agent-related incident

Consequence chain: Without tabletop exercises, incident response procedures remain untested hypotheses. The organisation discovers gaps in its response capability during actual incidents — when the cost of discovery is measured in additional harm, delayed response, and regulatory scrutiny rather than in exercise time and remediation effort. The immediate failure mode is response delay: during a real incident, responders encounter problems that should have been identified and resolved during exercises — outdated contact information (Scenario A: 47-minute escalation delay adding £2.1 million in exposure), undefined coordination mechanisms (Scenario B: 6-hour notification delay affecting 4,200 customers), and unrecognised failure modes (Scenario C: 2-hour detection delay destroying £3.7 million in product). The downstream consequence is amplified harm: every minute of response delay during a real incident allows the harm to compound. Financial exposure grows. Additional individuals are affected. Safety hazards persist. The regulatory consequence is particularly severe because regulators view untested incident response as a governance failure independent of any specific incident. DORA Article 11(6) explicitly requires scenario-based testing; the FCA expects tested contingency arrangements; the EU AI Act's risk management requirements implicitly include procedure validation. An organisation that cannot demonstrate regular, structured exercise programmes faces enforcement action for inadequate governance even if no specific incident has occurred. The reputational consequence emerges when post-incident reviews reveal that the response failures were foreseeable and would have been identified by even a basic tabletop exercise — creating the narrative that the organisation knew (or should have known) its response capability was inadequate and chose not to test it.

Cross-references: AG-419 (Adverse Event Severity Matrix Governance) provides the severity framework used to design exercise scenarios and classify simulated incidents during exercises. AG-008 (Governance Continuity Under Failure) defines the continuity mechanisms that exercises test. AG-421 (Recovery Point Objective for Memory and State Governance) and AG-422 (Recovery Time Objective Governance) define recovery targets that exercises validate. AG-423 (Incident Learning Closure Governance) consumes exercise findings as inputs to the learning process. AG-426 (Fallback Staffing Governance) defines staffing arrangements that exercises test. AG-427 (Mutual Aid and Vendor Coordination Governance) defines cross-organisational coordination mechanisms that exercises validate. AG-403 (Dependency Failover Validation Governance) defines failover mechanisms that exercise scenarios may activate.

Cite this protocol
AgentGoverning. (2026). AG-420: Tabletop Exercise Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-420