AG-154

Correlated Control Failure Analysis

Control Efficacy, Redundancy & Meta-Governance · ~15 min read · AGS v2.1 · April 2026

2. Summary

Correlated Control Failure Analysis requires that organisations systematically identify, model, and mitigate shared failure modes across their AI agent governance controls. When multiple governance controls share dependencies — the same infrastructure, the same credentials, the same vendor, the same reasoning model, or the same data pipeline — a single point of failure can simultaneously disable all controls that share that dependency. This dimension ensures that the apparent redundancy of multiple governance controls translates into actual independence, and that correlated failure risks are identified before they materialise as simultaneous multi-control failures.

3. Example

Scenario A — Shared Infrastructure Disables All Monitoring: An organisation deploys five governance controls for its AI agent fleet: mandate enforcement (AG-001), behavioural drift detection (AG-022), deception detection (AG-039), action logging, and human escalation triggers. All five controls are deployed as microservices on the same Kubernetes cluster, sharing the same node pool. A resource exhaustion event caused by a denial-of-service attack on the agent API consumes all available cluster resources. All five governance controls become unresponsive simultaneously. The agents continue to operate because they are hosted on a separate cluster, but they now operate with zero governance oversight. In the 47 minutes before the infrastructure team restores governance services, one agent executes £340,000 in transactions that would have been blocked by mandate enforcement, and another exhibits behavioural drift that would have triggered escalation.

What went wrong: Five nominally independent governance controls shared a single infrastructure dependency. The apparent redundancy was illusory — a single failure event disabled all controls simultaneously. No correlated failure analysis had identified this shared dependency.

Scenario B — Shared Credential Rotation Disables All Controls: An organisation's governance controls authenticate to the agent event stream using service account credentials managed by a central identity provider. During a scheduled credential rotation, the identity provider experiences a 2-hour outage. All governance controls lose access to the event stream simultaneously. The controls are designed to fail-safe (AG-008 compliant), so agents are paused — but the organisation's entire AI agent fleet is offline for 2 hours during peak business operations, causing an estimated £1.2 million in lost revenue and 3,400 customer service failures. A correlated failure analysis would have identified the identity provider as a single point of failure across all controls and recommended credential caching or independent authentication for critical controls.

What went wrong: All governance controls shared a credential management dependency. The correlated failure was not identified during governance architecture design. The fail-safe behaviour (agent pause) was correct but the correlated failure made it unnecessarily broad — all agents paused rather than only those whose specific controls were affected.

Scenario C — Shared AI Model Creates Common-Mode Reasoning Failure: An organisation uses the same large language model as a component in three governance controls: content safety filtering, regulatory compliance checking, and anomaly detection. A model update introduces a reasoning regression that causes the model to underweight specific categories of safety-relevant content. All three controls degrade simultaneously because they share the same reasoning component. The content safety filter misses 28% of policy-violating outputs. The compliance checker fails to flag 19% of regulatory violations. The anomaly detector generates 40% more false negatives. Because all three controls use the same model, their failures are correlated — they all miss the same categories of violations, providing no compensating detection. A diverse model approach (different models for different controls) would have limited the failure to one control while the others compensated.

What went wrong: Three governance controls shared a common reasoning component (the same LLM). A model regression created a common-mode failure across all three. No diversity analysis had identified the shared model dependency as a correlated failure risk.

4. Requirement Statement

Scope: This dimension applies to all AI agent governance deployments where multiple governance controls are expected to provide layered or redundant assurance. Any organisation deploying two or more governance controls has a correlated failure risk — and the risk increases with the number of controls sharing common dependencies. Single-control deployments are technically excluded, though such deployments are inherently fragile and should evolve toward multi-control architectures. The scope extends to shared dependencies at all layers: infrastructure (compute, network, storage), platform (operating system, container runtime, orchestration), service (identity providers, logging services, configuration management), data (event streams, data pipelines, reference data), and reasoning (shared AI models, shared algorithmic components).

4.1. A conforming system MUST maintain a dependency map documenting the infrastructure, platform, service, data, and reasoning dependencies of each deployed governance control.
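The dependency map required by 4.1 can be represented as a simple per-control record grouped by the five layers named in the scope. A minimal sketch in Python — the control names and dependency identifiers are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class ControlDependencies:
    """One governance control's dependencies, grouped by the five
    layers named in the scope: infrastructure, platform, service,
    data, and reasoning."""
    control_id: str
    infrastructure: set = field(default_factory=set)
    platform: set = field(default_factory=set)
    service: set = field(default_factory=set)
    data: set = field(default_factory=set)
    reasoning: set = field(default_factory=set)

    def all_deps(self) -> set:
        """All dependencies of this control, across every layer."""
        return (self.infrastructure | self.platform | self.service
                | self.data | self.reasoning)

# Hypothetical entries echoing Scenario A: both controls share a
# cluster and an identity provider.
dependency_map = [
    ControlDependencies("mandate-enforcement",
                        infrastructure={"k8s-cluster-gov"},
                        service={"idp-central"}),
    ControlDependencies("drift-detection",
                        infrastructure={"k8s-cluster-gov"},
                        service={"idp-central"},
                        reasoning={"llm-x"}),
]
```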

4.2. A conforming system MUST identify all shared dependencies where two or more governance controls depend on the same component, and classify each shared dependency by the number of controls affected and the criticality of the controls involved.
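Identifying shared dependencies under 4.2 amounts to inverting the map from controls-to-dependencies into dependencies-to-controls and keeping entries touched by two or more controls. A sketch, with hypothetical control and dependency names:

```python
from collections import defaultdict

def shared_dependencies(dep_map: dict) -> dict:
    """Invert {control: set_of_deps} into {dep: set_of_controls},
    keeping only dependencies shared by two or more controls (4.2)."""
    by_dep = defaultdict(set)
    for control, deps in dep_map.items():
        for dep in deps:
            by_dep[dep].add(control)
    return {dep: ctrls for dep, ctrls in by_dep.items() if len(ctrls) >= 2}

# Hypothetical deployment: a shared cluster and identity provider,
# plus one model used by a single control.
dep_map = {
    "mandate-enforcement": {"k8s-gov", "idp-central"},
    "drift-detection": {"k8s-gov", "idp-central"},
    "deception-detection": {"k8s-gov", "llm-x"},
}
shared = shared_dependencies(dep_map)
# "k8s-gov" is shared by all three controls; "llm-x" is used by only
# one control, so it is not flagged.
```

The resulting register can then be classified by the number of controls affected and their criticality, per the requirement.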

4.3. A conforming system MUST assess the correlated failure risk for each shared dependency, documenting the impact of the dependency's failure on each governance control and the combined governance coverage loss.

4.4. A conforming system MUST implement mitigation for any shared dependency whose failure would simultaneously disable governance controls covering more than 50% of the governance function scope (e.g., more than 50% of deployed controls, or all controls covering a specific risk domain).
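The coverage test in 4.4 can be expressed as a simple check: what fraction of deployed controls would a given dependency's failure disable? A sketch under the simplifying assumption that every control carries equal weight (names hypothetical):

```python
def needs_mitigation(dep_map: dict, dependency: str,
                     threshold: float = 0.5) -> bool:
    """True if failure of `dependency` would disable controls covering
    more than `threshold` of deployed controls (requirement 4.4)."""
    affected = sum(1 for deps in dep_map.values() if dependency in deps)
    return affected / len(dep_map) > threshold

controls = {
    "mandate-enforcement": {"k8s-gov", "idp-central"},
    "drift-detection": {"k8s-gov", "idp-central"},
    "deception-detection": {"llm-x"},
}
# k8s-gov affects 2 of 3 controls (67% > 50%): mitigation required.
# llm-x affects 1 of 3 controls (33%): documented, but below threshold.
```

A fuller implementation would also apply the requirement's alternative trigger — all controls covering a specific risk domain — rather than raw control counts alone.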

4.5. A conforming system MUST update the dependency map and correlated failure assessment within 30 days of any change to governance control deployment, infrastructure, or dependencies.

4.6. A conforming system SHOULD implement diversity requirements for critical governance controls, ensuring that controls covering the same risk domain do not share infrastructure, platform, vendor, or reasoning model dependencies.
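The diversity requirement in 4.6 can be checked mechanically: for each risk domain, flag any dependency shared by two or more of the controls covering that domain. A sketch with hypothetical domain and control names:

```python
def diversity_violations(dep_map: dict, risk_domains: dict) -> dict:
    """For each risk domain, return dependencies shared by two or more
    of the controls covering that domain (requirement 4.6)."""
    violations = {}
    for domain, domain_controls in risk_domains.items():
        seen = {}
        for control in domain_controls:
            for dep in dep_map.get(control, set()):
                seen.setdefault(dep, set()).add(control)
        shared = {d: c for d, c in seen.items() if len(c) >= 2}
        if shared:
            violations[domain] = shared
    return violations

# Hypothetical: two content-safety controls both reasoning over the
# same model, echoing Scenario C's common-mode risk.
deps = {
    "content-filter": {"llm-x", "k8s-a"},
    "compliance-check": {"llm-x", "k8s-b"},
}
domains = {"content-safety": ["content-filter", "compliance-check"]}
```

Here the shared model is flagged, while the two separate clusters are not — the infrastructure layer already satisfies the diversity requirement.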

4.7. A conforming system SHOULD conduct correlated failure simulation exercises at least annually, testing the impact of simultaneous failure of controls sharing a common dependency.
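One starting point for the annual simulation exercise in 4.7 is a tabletop ranking: for every dependency, compute the fraction of controls its failure would disable, then rehearse the worst cases first. A sketch (names hypothetical):

```python
def worst_case_failures(dep_map: dict) -> list:
    """Rank each dependency by the fraction of controls its failure
    would simultaneously disable, worst first (requirement 4.7)."""
    all_deps = set().union(*dep_map.values())
    impact = {
        dep: sum(1 for deps in dep_map.values() if dep in deps) / len(dep_map)
        for dep in all_deps
    }
    return sorted(impact.items(), key=lambda item: item[1], reverse=True)

fleet = {
    "mandate-enforcement": {"k8s-gov", "idp-central"},
    "drift-detection": {"k8s-gov", "idp-central"},
    "deception-detection": {"k8s-gov", "llm-x"},
}
# k8s-gov tops the ranking: its failure disables all three controls,
# reproducing Scenario A.
```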

4.8. A conforming system SHOULD monitor shared dependencies in real time and alert when a dependency experiences degradation that could affect multiple controls.

4.9. A conforming system MAY implement automatic governance posture adjustment when a shared dependency degrades, tightening remaining controls or pausing agent operations based on the residual governance coverage level.
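The MAY-level posture adjustment in 4.9 can be sketched as a mapping from residual governance coverage to an operating posture. The thresholds below are illustrative, not prescribed by this dimension:

```python
def posture_for_coverage(residual_coverage: float) -> str:
    """Map the fraction of governance controls still operating to an
    agent-fleet posture (requirement 4.9). Thresholds are illustrative."""
    if residual_coverage >= 0.8:
        return "normal"       # minor degradation: operate as usual
    if residual_coverage >= 0.5:
        return "tightened"    # tighten thresholds on surviving controls
    return "paused"           # AG-008-style fail-safe: pause agents
```

A graduated response like this avoids the Scenario B outcome, where a correlated failure forced an unnecessarily broad pause of the entire fleet.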

5. Rationale

Redundancy is a fundamental principle of reliable system design. Governance frameworks rely on redundancy — multiple controls covering the same risk domain — to ensure that the failure of any single control does not leave a risk unmitigated. But redundancy only works if the redundant components fail independently. When multiple components share a common dependency, their failures are correlated, and the apparent redundancy is partially or wholly illusory.

This problem is well-understood in safety engineering (common-cause failure analysis is mandatory in nuclear safety and aerospace), financial risk management (correlation risk is a core concept in portfolio theory and was a primary driver of the 2008 financial crisis), and reliability engineering (common-mode failure analysis is standard in high-reliability systems). Yet in AI governance, correlated failure risk is routinely overlooked. Organisations deploy multiple governance controls and assume that "more controls means more assurance" without analysing whether those controls share dependencies that could cause them to fail together.

The relationship to AG-155 (Oversight Diversity and Heterogeneous Redundancy Governance) is direct: AG-154 identifies correlated failure risks; AG-155 prescribes the diversity measures needed to mitigate them. AG-154 is the diagnostic dimension (where are the correlated risks?); AG-155 is the prescriptive dimension (how do we eliminate them?). Both are necessary — diagnosis without prescription identifies risks but doesn't fix them; prescription without diagnosis may address the wrong risks.

The 50% threshold in requirement 4.4 represents a pragmatic balance. Requiring mitigation for every shared dependency would be impractical and disproportionate. Requiring mitigation only for dependencies whose failure would disable all controls would miss dangerous partial failures. The 50% threshold ensures that the most impactful correlated failure risks are mitigated while allowing organisations to accept minor correlation risks with appropriate documentation.

6. Implementation Guidance

Correlated control failure analysis begins with mapping dependencies and proceeds through identification, assessment, and mitigation of shared failure modes.

Recommended patterns:

- Deploy controls covering the same risk domain on separate infrastructure — separate clusters, node pools, or regions — so a single resource exhaustion event cannot disable them together.
- Use different reasoning models in different controls to avoid common-mode reasoning failures.
- Provide credential caching or independent authentication paths for critical controls so an identity provider outage does not disable them all.
- Generate the dependency map automatically from infrastructure-as-code configurations so it remains a live artefact.

Anti-patterns to avoid:

- Deploying all governance controls as microservices on the same cluster and node pool (Scenario A).
- Routing every control's authentication through a single identity provider with no fallback (Scenario B).
- Reusing the same LLM as a reasoning component across controls covering the same risk domain (Scenario C).
- Treating control count as a proxy for assurance without analysing shared dependencies.

Industry Considerations

Financial Services. Correlated failure risk in governance controls maps directly to operational resilience requirements under DORA and FCA/PRA operational resilience policy. Financial regulators expect firms to identify important business services and map their dependencies to identify concentration risks. For AI agent governance, governance controls are supporting functions for important business services, and their correlated failure risks must be identified and mitigated.

Healthcare. Clinical AI governance controls that share a common dependency create patient safety risks. If a clinical decision support agent's safety filter and dosage checker both depend on the same drug interaction database, a database corruption event disables both safety controls simultaneously. Healthcare regulators expect defence-in-depth with genuinely independent layers.

Critical Infrastructure. IEC 62443 and nuclear safety frameworks require common-cause failure analysis for safety-related systems. AI agents controlling critical infrastructure must apply equivalent analysis to their governance controls.

Maturity Model

Basic Implementation — A dependency map documents infrastructure, platform, service, data, and reasoning dependencies for each governance control. Shared dependencies are identified and classified. Correlated failure risk is assessed for shared dependencies affecting more than 50% of governance coverage. Mitigation is implemented for the highest-risk shared dependencies. The dependency map is updated within 30 days of changes. This level meets the minimum mandatory requirements.

Intermediate Implementation — All basic capabilities plus: diversity scoring is computed for each risk domain. Correlated failure simulation exercises are conducted annually. Real-time dependency health monitoring alerts when shared dependencies degrade. The dependency graph is maintained as a live artefact, updated automatically from infrastructure-as-code configurations.

Advanced Implementation — All intermediate capabilities plus: automatic governance posture adjustment responds to shared dependency degradation in real time. Multi-cloud or multi-region deployment eliminates infrastructure-level correlated failure risks for critical controls. Reasoning model diversity eliminates common-mode reasoning failures. Independent adversarial testing of correlated failure resilience has been conducted. The organisation can demonstrate to regulators that no single dependency failure can disable more than a defined percentage of governance coverage.

7. Evidence Requirements

Required artefacts: the dependency map (4.1); the shared dependency register, classified by controls affected and criticality (4.2); correlated failure risk assessments (4.3); mitigation records for dependencies exceeding the 50% coverage threshold (4.4); and reports from correlated failure simulation exercises (4.7).

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Dependency Map Completeness

Test 8.2: Shared Dependency Identification

Test 8.3: Correlated Failure Impact Simulation

Test 8.4: 50% Coverage Threshold Mitigation

Test 8.5: Dependency Map Currency

Test 8.6: Reasoning Model Correlation

Test 8.7: Automatic Posture Adjustment

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Supports compliance
EU AI Act | Article 15 (Accuracy, Robustness and Cybersecurity) | Supports compliance
DORA | Article 9 (ICT Risk Management Framework) | Direct requirement
DORA | Article 28 (Third-Party ICT Concentration Risk) | Direct requirement
FCA/PRA | Operational Resilience Policy (PS6/21, PS21/3) | Direct requirement
NIST AI RMF | GOVERN 1.1, MANAGE 2.2 | Supports compliance
ISO 42001 | Clause 6.1 (Actions to Address Risks) | Supports compliance
IEC 62443 | SR 7.1 (Denial of Service Protection) | Supports compliance

DORA — Article 28 (Third-Party ICT Concentration Risk)

Article 28 requires financial entities to identify and manage concentration risk arising from dependency on third-party ICT service providers. For AI agent governance, this includes concentration risk where multiple governance controls depend on the same third-party provider (cloud infrastructure, model provider, identity service). AG-154 directly implements the concentration risk identification and assessment required by Article 28.

FCA/PRA Operational Resilience Policy

PS6/21 and PS21/3 require firms to identify important business services and map the resources (people, processes, technology, facilities, information) required to deliver them. Firms must identify vulnerabilities arising from concentration of resources. For AI agent governance, governance controls are resources supporting important business services, and their shared dependencies create concentration vulnerabilities that must be identified and mitigated.

DORA — Article 9

Article 9 requires financial entities to have in place mechanisms to promptly detect anomalous activities. Correlated control failure — where multiple governance controls degrade simultaneously due to a shared dependency — is an anomalous activity that requires detection mechanisms. Real-time dependency health monitoring implements this detection requirement.

10. Failure Severity

Severity Rating: Critical
Blast Radius: Organisation-wide — a correlated failure can simultaneously disable all governance controls, leaving the entire agent fleet ungoverned

Consequence chain: Correlated control failure is a meta-failure that turns apparent defence-in-depth into a single point of failure. The immediate consequence is simultaneous degradation or loss of multiple governance controls. The operational consequence depends on the fail-safe posture: if controls fail-safe (AG-008 compliant), all dependent agents pause simultaneously, causing a total AI service outage; if controls fail-open, all dependent agents operate without governance oversight simultaneously, creating unbounded risk exposure. Both outcomes are severe — total outage has immediate business impact, and total governance loss has immediate risk exposure. The failure is particularly dangerous because it occurs precisely when the organisation most needs its governance controls: during adversarial attack, infrastructure degradation, or unusual operational conditions. The business consequences include: regulatory enforcement for inadequate operational resilience, financial losses from ungoverned agent operations, service disruption costs, and potential systemic risk if the correlated failure affects agents interacting with external markets, counterparties, or public services.

Cross-references: AG-008 (Governance Continuity Under Failure) — ensures individual controls survive component failures; AG-154 ensures that multiple controls do not share failure modes. AG-007 (Governance Configuration Control) — changes to governance configuration can introduce or remove shared dependencies, triggering dependency map updates. AG-027 (Governance Override Resistance) — a correlated failure that disables override resistance across multiple controls creates an override vulnerability. AG-056 (Independent Validation) — validates that claimed independence between controls is actual independence. AG-153 (Control Efficacy Measurement Governance) — live challenge results provide data for identifying correlated failure patterns. AG-155 (Oversight Diversity and Heterogeneous Redundancy Governance) — prescribes the diversity measures needed to mitigate the correlated failure risks identified by AG-154.

Cite this protocol
AgentGoverning. (2026). AG-154: Correlated Control Failure Analysis. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-154