AG-067

Root Cause and Corrective Action Governance

Incident Response, Containment & Recovery · ~22 min read · AGS v2.1 · April 2026
Tags: EU AI Act · SOX · FCA · NIST · ISO 42001

2. Summary

Root Cause and Corrective Action Governance requires that every serious incident involving an AI agent is subjected to a structured root cause analysis process that identifies the actual cause of the failure — not merely the proximate trigger — and that corrective actions are defined, implemented, verified, and tracked to closure. The root cause analysis must go beyond the immediate technical failure to examine the systemic factors that allowed the failure to occur: inadequate controls, configuration errors, untested edge cases, gaps in monitoring, or organisational process failures. Corrective actions must be specific, measurable, time-bound, and verified through testing before the agent is returned to service (AG-068). The root cause and corrective action process must produce a formal record that is retained, reviewable by regulators, and feeds into the organisation's continuous improvement of AI governance. Without this dimension, incidents recur — the same root cause produces the same failure, the organisation applies the same superficial fix, and the cycle continues until a regulator or a catastrophic loss breaks it.

3. Example

Scenario A — Superficial Root Cause Leads to Recurring Incident: A customer-facing AI agent handling insurance claims incorrectly denies 234 valid claims over a 2-week period. The initial investigation identifies the proximate cause: a reference data update on day 1 of the period changed the format of policy type codes from 3-character to 5-character strings, and the agent's validation logic rejected the new format as invalid. The corrective action is to update the validation logic to accept 5-character codes. The fix is deployed, and the agent is returned to service. Three months later, a similar reference data update changes claim category codes, and the agent incorrectly denies another 189 valid claims. The root cause was never the specific format change — it was the absence of a contract between the reference data system and the agent that defines the expected data format and provides change notification. The superficial fix addressed one symptom; the systemic root cause was untouched.

What went wrong: The root cause analysis stopped at the proximate trigger (the format change) without examining the systemic factor (the absence of an interface contract with change notification). The corrective action fixed one instance of the problem without preventing future instances. No validation tested whether the corrective action addressed the root cause rather than just the symptom. Consequence: 189 additional incorrectly denied claims, reputational damage, regulatory scrutiny, and the eventual realisation that the original root cause analysis was inadequate — requiring a second, more thorough investigation at greater cost and with less available evidence (AG-066 retention notwithstanding).
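The systemic fix Scenario A needed, an interface contract that fails loudly on an unannounced format change, can be sketched in a few lines. The field names, formats, and version label below are hypothetical illustrations, not part of this protocol:

```python
import re

# Hypothetical interface contract for the reference data feed in Scenario A.
# The contract pins the agreed code formats; any violation halts ingestion
# and alerts the data owner, instead of silently producing denied claims.
CONTRACT = {
    "policy_type_code": re.compile(r"^[A-Z0-9]{5}$"),    # current agreed format
    "claim_category_code": re.compile(r"^[A-Z0-9]{4}$"),
}
CONTRACT_VERSION = "2026-04"  # bumped, with advance notice, on any format change


class ContractViolation(Exception):
    """Raised when incoming reference data breaks the agreed contract."""


def validate_record(record: dict) -> dict:
    """Validate one reference data record against the contract.

    A violation surfaces at the interface boundary, where it can be routed
    to the data owner, rather than downstream in the agent's claim logic.
    """
    for field, pattern in CONTRACT.items():
        value = record.get(field, "")
        if not pattern.fullmatch(value):
            raise ContractViolation(
                f"{field}={value!r} violates contract {CONTRACT_VERSION}"
            )
    return record
```

With this in place, a future format change fails at ingestion with a named contract version, which is a detectable, attributable event rather than a stream of incorrectly denied claims.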

Scenario B — Corrective Action Not Verified Before Return to Service: A financial-value AI agent executing foreign exchange trades is contained after executing 12 trades at prices that deviated from the mid-market rate by more than the 0.5% tolerance defined in its mandate. Root cause analysis determines that the agent's pricing model was using a stale exchange rate feed — the feed provider changed the API endpoint, and the agent's fallback logic was using a cached rate from 4 hours earlier. The corrective action is to update the API endpoint configuration and add a staleness check that rejects rates older than 60 seconds. The configuration change and staleness check are deployed, and the agent is returned to service based on a code review of the changes. No testing is performed. On return to service, the staleness check works correctly but has an unintended interaction with the agent's error handling: when a rate is rejected as stale, the error handler retries the request to the old (non-functional) endpoint rather than the new one, creating an infinite retry loop that consumes all available connections to the rate feed. The agent freezes, and 47 pending trades fail to execute within the required settlement window.

What went wrong: The corrective action was deployed without verification testing. The code review confirmed that the staleness check was correctly implemented but did not test the interaction between the staleness check and the existing error handling. The corrective action introduced a new failure mode that was not present in the original incident. Consequence: 47 failed settlements, counterparty claims, regulatory finding for inadequate change control, and a second incident investigation required for the failure introduced by the corrective action from the first incident.
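For contrast, a staleness check that fails safely: the sketch below bounds retries and always targets the currently configured endpoint, avoiding the infinite loop against the dead endpoint described above. The endpoint URL, limits, and names are illustrative assumptions:

```python
import time

RATE_ENDPOINT = "https://rates.example.com/v2/fx"  # hypothetical current endpoint
MAX_STALENESS_S = 60   # per the corrective action in Scenario B
MAX_RETRIES = 3        # bounded, so a dead feed surfaces as an error


class StaleRateError(Exception):
    pass


def get_rate(fetch, now=time.time):
    """Fetch an FX rate, rejecting stale quotes, with bounded retries.

    `fetch` is a caller-supplied function taking an endpoint URL and
    returning (rate, timestamp). Retries always use the SAME configured
    endpoint, and the loop is bounded, so failure surfaces for containment
    rather than freezing the agent in an infinite retry loop.
    """
    last_err = None
    for _ in range(MAX_RETRIES):
        try:
            rate, ts = fetch(RATE_ENDPOint) if False else fetch(RATE_ENDPOINT)
            age = now() - ts
            if age > MAX_STALENESS_S:
                raise StaleRateError(f"rate is {age:.0f}s old")
            return rate
        except StaleRateError as err:
            last_err = err  # retry against the current endpoint only
    raise last_err  # exhausted retries: fail loudly, do not trade
```

The point of the sketch is the verification target from 4.5: a test of this function must exercise the retry path and the interaction with error handling, not just the staleness comparison in isolation.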

Scenario C — Root Cause Analysis Omits Organisational Factors: A safety-critical AI agent monitoring air quality in an underground mine triggers a false evacuation after misinterpreting a sensor calibration test as an actual gas leak. The root cause analysis determines that the agent did not have access to the maintenance schedule and could not distinguish a planned calibration event from a genuine gas reading. The corrective action integrates the maintenance schedule with the agent's input data so it can suppress readings during calibration. However, the root cause analysis does not examine why the maintenance team did not notify the agent operations team of the calibration, why there was no standard operating procedure requiring such notification, or why the agent was designed to trigger evacuation on a single sensor reading without corroboration. Six months later, a different maintenance activity (equipment testing) causes another false evacuation because the corrective action addressed only the specific case of sensor calibration, not the general case of maintenance activities affecting agent inputs.

What went wrong: The root cause analysis was technically focused and did not examine organisational factors. The corrective action was narrowly scoped to the specific trigger (sensor calibration) rather than the general vulnerability (uncoordinated maintenance activities affecting agent inputs). No organisational process change was implemented to require maintenance teams to coordinate with agent operations. Consequence: recurring false evacuations, mine production disruption (each false evacuation costs approximately £180,000 in lost production), erosion of trust in the AI monitoring system, eventual reversion to fully manual monitoring.

4. Requirement Statement

Scope: This dimension applies to all serious incidents classified under AG-064 at any severity level. The scope includes incidents that were successfully contained (AG-065) and those where containment was partial or delayed. The scope extends to near-miss events — incidents where the conditions for a serious failure existed but the failure did not materialise due to coincidental factors rather than design controls. Near-misses are within scope because the root cause analysis may reveal systemic vulnerabilities that will produce an actual failure under different circumstances. The scope includes multi-agent incidents where the root cause spans multiple agents, external system failures that the agent governance should have detected or mitigated, and organisational process failures that enabled the technical failure.

4.1. A conforming system MUST initiate a structured root cause analysis for every incident classified as Severity 1 or Severity 2 under AG-064, beginning within 24 hours of incident containment and completing within 15 business days for Severity 1 and 30 business days for Severity 2.
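The 24-hour initiation and 15/30 business-day completion windows in 4.1 can be computed mechanically. A minimal Python sketch, assuming a Monday-to-Friday working week and ignoring public holidays (a production calendar would subtract them):

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Advance `days` business days (Mon-Fri), skipping weekends.

    Public holidays are deliberately omitted from this sketch.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def rca_deadline(containment_date: date, severity: int) -> date:
    """Completion deadline per 4.1: 15 business days for Severity 1,
    30 business days for Severity 2."""
    return add_business_days(containment_date, 15 if severity == 1 else 30)
```

For example, containment on Wednesday 1 April 2026 gives a Severity 1 deadline of 22 April 2026.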

4.2. A conforming system MUST ensure that root cause analysis examines at least three layers: the proximate technical cause (what directly triggered the failure), the contributing technical factors (what conditions allowed the proximate cause to produce the observed impact), and the systemic organisational factors (what governance, process, or oversight gaps allowed the contributing factors to exist).

4.3. A conforming system MUST produce a formal root cause analysis report that documents: the incident timeline, the evidence examined (referencing AG-066 records), the determined root cause at each layer, the contributing factors, the corrective actions defined, and the rationale linking each corrective action to a specific root cause finding.

4.4. A conforming system MUST define corrective actions that are specific (addressing a defined root cause finding), measurable (with defined success criteria), time-bound (with a defined implementation deadline), and assigned (to a named responsible individual or team).
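A corrective action record shaped by 4.4 might look like the following sketch. The field names and status values are illustrative, not prescribed by this protocol:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CorrectiveAction:
    """A corrective action record per 4.4: specific, measurable,
    time-bound, and assigned. All names here are illustrative."""
    finding_id: str        # specific: links to a root cause finding (4.3)
    description: str
    success_criteria: str  # measurable: how verification judges success
    deadline: date         # time-bound
    owner: str             # assigned: a named individual or team
    status: str = "open"   # open -> implemented -> verified -> closed

    def validate(self) -> None:
        """Reject records missing any mandatory attribute, so an action
        cannot enter tracking without its 4.4 properties."""
        for field_name in ("finding_id", "description",
                           "success_criteria", "owner"):
            if not getattr(self, field_name).strip():
                raise ValueError(f"corrective action missing {field_name}")
```

The status progression matters: 4.5 and 4.6 imply that "implemented" and "verified" are distinct states, and closure requires evidence of both.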

4.5. A conforming system MUST verify corrective actions through testing before the affected agent is returned to service per AG-068 — verification must demonstrate that the specific root cause no longer produces the observed failure and that the corrective action does not introduce new failure modes.

4.6. A conforming system MUST track corrective actions to closure, with evidence of implementation and verification recorded for each action, and with escalation to senior management when implementation deadlines are missed.

4.7. A conforming system SHOULD implement a corrective action effectiveness review at 30, 90, and 180 days after implementation to verify that the corrective action remains effective in production conditions and that the root cause has not recurred.
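The 30/90/180-day cadence in 4.7 is easy to derive and to audit for overdue reviews. A minimal sketch (function and parameter names are our own):

```python
from datetime import date, timedelta

REVIEW_OFFSETS_DAYS = (30, 90, 180)  # per 4.7


def effectiveness_review_dates(implemented_on: date) -> list[date]:
    """Calendar dates at which each effectiveness review falls due."""
    return [implemented_on + timedelta(days=d) for d in REVIEW_OFFSETS_DAYS]


def overdue_reviews(implemented_on: date, completed: set[date],
                    today: date) -> list[date]:
    """Due dates that have passed without a recorded completion,
    feeding the escalation path described in 4.6."""
    return [d for d in effectiveness_review_dates(implemented_on)
            if d <= today and d not in completed]
```

An action implemented on 1 January 2026 is reviewed on 31 January, 1 April, and 30 June 2026; any missed date is a concrete, escalatable fact rather than a judgement call.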

4.8. A conforming system SHOULD maintain a root cause taxonomy that categorises historical root causes to support trend analysis — identifying recurring root cause categories enables systemic improvements rather than incident-by-incident fixes.
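Trend analysis over a root cause taxonomy (4.8) can be as simple as counting category occurrences against a recurrence threshold. A sketch, with a hypothetical threshold and category labels:

```python
from collections import Counter

RECURRENCE_THRESHOLD = 3  # hypothetical: 3+ incidents in a category flags a trend


def recurring_categories(incident_root_causes: list[str]) -> list[str]:
    """Root cause categories at or above the recurrence threshold.

    Category labels come from the taxonomy in 4.8, e.g.
    'missing-interface-contract' or 'unverified-change' (illustrative).
    A flagged category signals a systemic weakness warranting a
    programme-level fix rather than another per-incident patch.
    """
    counts = Counter(incident_root_causes)
    return sorted(c for c, n in counts.items() if n >= RECURRENCE_THRESHOLD)
```

The design choice is the threshold trigger: once a category crosses it, the response escalates from incident-by-incident fixes to a systemic improvement programme, as described in the Advanced maturity level below.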

4.9. A conforming system SHOULD conduct root cause analysis for Severity 3 incidents where the incident reveals a novel failure mode, a gap in existing controls, or a pattern of recurring low-severity incidents that may indicate a systemic issue.

4.10. A conforming system MAY implement automated root cause hypothesis generation using AG-066 forensic evidence, producing preliminary root cause candidates that human investigators can evaluate and refine.

5. Rationale

Root Cause and Corrective Action Governance addresses the question that determines whether an organisation learns from its AI agent failures or is condemned to repeat them: "Do we actually understand why this happened, and have we fixed the real problem?"

The distinction between proximate cause and root cause is fundamental. The proximate cause of an incident is the immediate trigger — a corrupted data feed, a misconfigured threshold, a prompt injection. The root cause is the systemic condition that allowed the proximate cause to produce the observed impact — the absence of input validation, the lack of configuration change control, the failure to test adversarial scenarios. Fixing the proximate cause addresses one instance of the problem. Fixing the root cause prevents the class of problems.

AI agent incidents are particularly vulnerable to superficial root cause analysis because the proximate cause is often technically interesting and apparently sufficient. "The agent received a corrupted data feed and produced incorrect outputs" is a complete narrative — it explains what happened. But it does not explain why the agent did not detect the corruption, why there was no input validation, why the data feed had no integrity check, or why the monitoring system did not detect the output anomaly. Each of these "why" questions reveals a systemic factor that, if unaddressed, will produce a different incident with a different proximate cause but the same root cause.

The corrective action verification requirement reflects the reality that corrective actions can themselves introduce new failure modes. An AI agent is a complex system operating in a complex environment. Changes to one component may have unexpected interactions with other components. A staleness check that works correctly in isolation may interact with error handling in unexpected ways. A format validation that accepts the new data format may reject a future format that the old validation would have accepted. Corrective actions must be tested not only for their intended effect but for their unintended interactions, and this testing must occur before the agent is returned to service.

The requirement for organisational root cause analysis acknowledges that many AI agent incidents have organisational causes. The technical failure is the effect; the organisational process gap is the cause. An agent that receives corrupted data failed technically, but the root cause may be an organisational failure to establish data quality contracts between teams, to include the agent operations team in change management processes, or to test the agent's resilience to data quality degradation. Without examining the organisational layer, corrective actions remain purely technical and leave the systemic vulnerability intact.

6. Implementation Guidance

AG-067 establishes the root cause analysis and corrective action process as a mandatory governance function — not an optional best practice. The process should be defined in advance, with clear roles, responsibilities, timelines, and quality standards. Root cause analysis conducted under time pressure during an active incident is prone to the same superficial analysis that this dimension aims to prevent. The process should be initiated after containment (AG-065) has stabilised the situation, using evidence preserved by AG-066.

The root cause analysis methodology should be structured and repeatable. Recommended methodologies include the "5 Whys" technique (iteratively asking "why" to move from proximate cause to root cause), Ishikawa (fishbone) diagrams for multi-factor analysis, and fault tree analysis for complex system interactions. The methodology should require analysis at three layers: technical, process, and organisational. Each finding should be linked to specific evidence from the AG-066 forensic record.
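The 5 Whys chain described above can be captured as a structured record that refuses to stop at the proximate trigger. A sketch using Scenario A's findings as the example chain; the minimum-depth rule of three mirrors the layers in 4.2 and is our own illustration, not a prescribed check:

```python
def five_whys(proximate_cause: str, whys: list[str]) -> dict:
    """Record a 5 Whys chain from proximate cause to root cause.

    Each entry in `whys` answers 'why?' for the previous statement; the
    last entry is treated as the root cause. A minimum depth of three is
    enforced here so the analysis cannot legitimately stop at the
    proximate trigger (cf. the three layers in 4.2).
    """
    if len(whys) < 3:
        raise ValueError("analysis too shallow: fewer than three 'why' layers")
    return {"proximate_cause": proximate_cause,
            "chain": whys,
            "root_cause": whys[-1]}


# Scenario A, reconstructed as a chain:
analysis = five_whys(
    "validation rejected 5-character policy codes",
    ["the code format changed upstream without notice",
     "no interface contract defined the expected format",
     "no process required change notification between teams"],
)
```

Here the recorded root cause is the organisational gap (no change-notification process), which is exactly the finding the superficial analysis in Scenario A missed.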

Recommended patterns: assign Severity 1 investigations to a team independent of the one that built or operates the agent; require every corrective action to cite the specific root cause finding it addresses; test corrective actions for unintended interactions with existing behaviour, not only for their intended effect, before return to service; and categorise each confirmed root cause in the taxonomy at closure so that trend analysis stays current.

Anti-patterns to avoid: stopping the analysis at the proximate trigger (Scenario A); returning an agent to service on code review alone, without verification testing (Scenario B); scoping the corrective action to the specific trigger rather than the general vulnerability (Scenario C); and closing corrective actions on implementation rather than on verified effectiveness.

Industry Considerations

Financial Services. FCA expectations for incident management include root cause analysis that identifies systemic issues and corrective actions that are tracked to completion. The Senior Managers Regime requires that the responsible Senior Manager can demonstrate that root cause analysis was thorough, corrective actions were appropriate, and implementation was verified. For incidents involving market conduct (e.g., best execution failures, market manipulation), the root cause analysis may need to be disclosed to the FCA and may be subject to skilled person review under Section 166 of the Financial Services and Markets Act. Corrective actions must be reflected in the firm's risk register and control framework.

Healthcare. For incidents involving patient safety, root cause analysis must follow established patient safety investigation methodologies (e.g., the NHS Serious Incident Framework or equivalent). The analysis must consider clinical pathway impacts — an AI agent that provides incorrect clinical decision support may have affected patient treatment decisions downstream. Corrective actions must be reviewed by clinical governance before implementation to ensure they do not introduce clinical safety risks. The investigation record must be retained as part of the clinical governance record.

Critical Infrastructure. For incidents in critical infrastructure, root cause analysis must include physical process safety analysis alongside AI-specific analysis. A root cause in the agent's reasoning may have physical safety implications that require process safety engineering review. Corrective actions that modify the agent's behaviour in a physical control context must be validated through process safety analysis (e.g., HAZOP review) before implementation. IEC 61511 requirements for safety instrumented systems may apply to corrective actions that affect safety functions.

Maturity Model

Basic Implementation — The organisation conducts root cause analysis for Severity 1 incidents, documented in a free-form report. The analysis identifies the proximate technical cause. Corrective actions are defined and tracked in a ticketing system. Verification consists of code review and basic functional testing. No formal methodology is prescribed — the quality of analysis depends on the investigator's expertise. Corrective actions are closed when implemented. No effectiveness review is conducted. This level meets the minimum mandatory requirements but is vulnerable to superficial analysis, inconsistent quality, and recurring incidents from unaddressed systemic root causes.

Intermediate Implementation — Root cause analysis follows a structured methodology (5 Whys, Ishikawa, or fault tree) applied to all Severity 1 and 2 incidents. The analysis template requires three-layer examination (technical, process, organisational). Corrective actions are traceable to specific root cause findings. Verification testing is conducted in a pre-production environment before production deployment. Corrective action effectiveness reviews are conducted at 30 and 90 days. Root causes are categorised in a taxonomy for trend analysis. Severity 1 investigations are conducted by an independent team. The root cause taxonomy is reviewed quarterly to identify systemic trends.

Advanced Implementation — All intermediate capabilities plus: automated root cause hypothesis generation from AG-066 forensic evidence feeds preliminary candidates to human investigators, reducing time to root cause. Corrective action impact analysis is formally conducted and approved before implementation. Effectiveness reviews at 30, 90, and 180 days verify sustained effectiveness. Root cause trend analysis drives programme-level improvements — when a root cause category reaches a defined threshold, a systemic improvement programme is initiated. The organisation can demonstrate to regulators a declining trend in incident recurrence rates attributable to effective root cause analysis and corrective action. Cross-organisation root cause sharing (anonymised) contributes to industry-wide learning.

7. Evidence Requirements

Required artefacts: the root cause analysis report specified in 4.3 (incident timeline, evidence references, layered findings, corrective actions, and linking rationale); corrective action records per 4.4 with owner, success criteria, deadline, and status; verification test evidence per 4.5; closure and escalation records per 4.6; and effectiveness review records per 4.7.

Retention requirements: root cause analysis reports and corrective action records must be retained alongside the forensic evidence preserved under AG-066, for at least as long as that evidence, so that the analysis remains reviewable against the record it was based on.

Access requirements: records must be reviewable by regulators on request, accessible to senior management for escalation under 4.6, and available to investigators conducting subsequent analyses and trend reviews.

8. Test Specification

Testing AG-067 compliance requires verification that the root cause analysis process is structured, thorough, and effective, and that corrective actions are verified and tracked.

Test 8.1: Root Cause Analysis Depth
Sample closed Severity 1 and 2 investigations and verify that each report documents all three layers required by 4.2 (proximate technical cause, contributing technical factors, systemic organisational factors), with each finding linked to AG-066 evidence.

Test 8.2: Corrective Action Traceability
Verify that every corrective action cites a specific root cause finding and carries the owner, success criteria, and deadline required by 4.4.

Test 8.3: Corrective Action Verification Completeness
Verify that each return to service under AG-068 was preceded by testing demonstrating both that the root cause no longer produces the observed failure and that the corrective action introduced no new failure modes (4.5).

Test 8.4: Timeline Compliance
Verify that root cause analysis began within 24 hours of containment and completed within 15 business days for Severity 1 and 30 business days for Severity 2 (4.1).

Test 8.5: Corrective Action Tracking to Closure
Verify that implementation and verification evidence exists for each closed action and that missed deadlines were escalated to senior management (4.6).

Test 8.6: Effectiveness Review Execution
Verify that 30-, 90-, and 180-day effectiveness reviews were conducted and recorded for implemented corrective actions (4.7).

Test 8.7: Root Cause Recurrence Analysis
Verify that the root cause taxonomy is maintained (4.8) and that recurrence of a root cause category triggered a systemic review rather than another point fix.

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Direct requirement
EU AI Act | Article 72 (Post-Market Monitoring) | Supports compliance
DORA | Article 13 (Learning and Evolving) | Direct requirement
DORA | Article 19 (Reporting — Final Report) | Direct requirement
FCA SYSC | 6.1.1R (Systems and Controls) | Direct requirement
NIST AI RMF | MANAGE 4.1, MANAGE 4.2 | Supports compliance
ISO 42001 | Clause 10.2 (Nonconformity and Corrective Action) | Direct requirement
SOX | Section 404 (Internal Controls) | Supports compliance
NIS2 Directive | Article 23 (Final Report) | Supports compliance

EU AI Act — Article 9 (Risk Management System)

Article 9 requires a risk management system that includes "estimation and evaluation of the risks that may emerge when the high-risk AI system is used in accordance with its intended purpose and under conditions of reasonably foreseeable misuse" and "adoption of appropriate and targeted risk management measures." AG-067 implements the feedback loop within the risk management system: when an incident reveals a risk that was not previously managed, the root cause analysis identifies the gap, and the corrective action closes it. Without this feedback loop, the risk management system is static and does not improve based on operational experience. Article 9(4)(d) specifically requires that risk management measures "are implemented with a view to eliminating or reducing risks as far as possible through adequate design and development" — root cause analysis is the mechanism by which design and development are improved based on real-world failure data.

DORA — Article 13 (Learning and Evolving)

Article 13 requires financial entities to "incorporate lessons learnt from ICT-related incidents" into their ICT risk management framework. AG-067 directly implements this requirement by ensuring that every serious incident produces a formal root cause analysis with corrective actions that are tracked to closure. The requirement for root cause taxonomy and trend analysis supports the broader Article 13 obligation to identify patterns and systemic weaknesses, not just individual incident responses.

DORA — Article 19 (Reporting — Final Report)

Article 19 requires a final report on major ICT-related incidents that includes "the root cause analysis, regardless of whether mitigating actions have already been completed." AG-067 ensures that the root cause analysis is conducted with the rigour and documentation necessary to satisfy this reporting requirement. The root cause analysis report produced under AG-067 forms the basis of the DORA final report.

ISO 42001 — Clause 10.2 (Nonconformity and Corrective Action)

Clause 10.2 requires organisations to react to nonconformities, evaluate the need for action to eliminate causes, implement corrective actions, review their effectiveness, and make changes to the AI management system if necessary. AG-067 implements Clause 10.2 for AI agent incidents by providing the structured process for root cause determination, corrective action definition, implementation verification, and effectiveness review. The requirement for root cause taxonomy and trend analysis supports the Clause 10.2 obligation to "make changes to the AI management system if necessary" by identifying systemic issues that require management system changes.

FCA SYSC — 6.1.1R (Systems and Controls)

The FCA requires firms to establish and maintain adequate policies and procedures sufficient to ensure compliance. For AI agent deployments, this includes the ability to learn from incidents and improve controls. The FCA has stated that it expects firms to demonstrate a "continuous improvement" approach to AI governance — AG-067's root cause analysis and corrective action process is the mechanism that delivers this improvement. The Senior Managers Regime requires that responsible individuals can demonstrate that incidents were thoroughly investigated and that corrective actions were appropriate and effective.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | Organisation-wide — recurring incidents from unaddressed root causes affect all agent deployments sharing the same systemic weakness

Consequence chain: Without structured root cause analysis and corrective action governance, the organisation's incident response is purely reactive — each incident is treated as an isolated event, the proximate cause is fixed, and the systemic vulnerability remains. The immediate consequence is incident recurrence: the same class of failure produces different specific incidents, each requiring containment, investigation, and remediation. The operational impact is escalating: each recurring incident consumes investigation resources, disrupts operations, and erodes confidence in the AI agent deployment. The regulatory impact compounds: regulators view recurring incidents as evidence of inadequate governance. A single incident may be treated as an operational failure; recurring incidents with the same root cause class are treated as a governance failure — a materially more serious finding. Under DORA Article 13, failure to learn from incidents is an independent regulatory breach. Under the EU AI Act, failure to update the risk management system based on operational experience violates Article 9. The financial impact grows: each recurring incident carries its own direct costs (containment, investigation, remediation, customer impact) plus the cumulative cost of repeated disruption and the eventual cost of a comprehensive remediation programme that should have been initiated after the first incident. The business consequence includes regulatory enforcement for inadequate governance, escalating incident costs, loss of organisational confidence in AI agent capabilities, potential moratorium on new agent deployments pending governance improvements, and personal liability for senior managers who cannot demonstrate that the organisation learned from its failures.

Cite this protocol
AgentGoverning. (2026). AG-067: Root Cause and Corrective Action Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-067