AG-008

Governance Continuity Under Failure

Group A: Mandate & Action Governance · AGS v2.1 · April 2026

EU AI Act FCA HIPAA SOC 2

2. Summary

Governance Continuity Under Failure requires that every AI governance system defines explicit, tested behaviour for when governance infrastructure components are degraded or unavailable. The fundamental principle is that governance must not fail open: unavailability of an enforcement mechanism, audit system, or verification service must never result in unrestricted agent operation. When normal governance infrastructure is unavailable, the system must default to the most conservative available posture. This dimension introduces degraded-mode tiers — defined levels of reduced capability with progressively more conservative governance postures as failure severity increases — balancing risk management with operational continuity. Without AG-008, every other dimension in the framework has an implicit single point of failure: a routine infrastructure outage can disable governance entirely, creating an unprotected window for unrestricted agent operation.

3. Example

Scenario A — Timeout Fallback Creates Fail-Open Path: A healthcare AI agent platform processes patient data access requests through a governance gateway that checks each request against the agent's data access mandate stored in a database. The gateway has a 200-millisecond timeout — if the database does not respond within this window, the gateway returns a default response. During implementation, the development team set the default response to "permit" to avoid blocking clinical workflows. During a database performance degradation caused by a routine backup operation, the gateway begins timing out on approximately 30% of requests. For these requests, the agents receive automatic approval regardless of their mandate scope. Over a four-hour period, three agents access patient records outside their authorised scope — including mental health records that require elevated access controls under the organisation's clinical governance policy.

What went wrong: The timeout fallback was set to "permit" rather than "deny." The decision was made during development to avoid clinical workflow disruption, but it created a fail-open path. The database performance degradation was a routine operational event, not an attack, demonstrating that fail-open behaviour does not require adversarial action.

Consequence: Unauthorised access to sensitive patient records, including specially protected mental health data. HIPAA breach notification required. Potential OCR investigation. Disciplinary action for clinical staff whose agents accessed records outside their scope. Remediation cost for implementing proper fail-safe behaviour.
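The fail-closed alternative to the gateway in Scenario A can be sketched as follows. This is a minimal illustration, not the platform's actual gateway: `check_with_timeout`, the lookup functions, and the thread-pool approach are all assumptions introduced here. The only point it makes is that a timeout, an error, and an unrecognised answer all resolve to "deny".

```python
import concurrent.futures
import time

PERMIT, DENY = "permit", "deny"

# Shared pool so the caller is never blocked longer than the timeout.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def check_with_timeout(lookup, request, timeout_s=0.2):
    """Return the mandate decision, or DENY if no decision arrives in time."""
    future = _POOL.submit(lookup, request)
    try:
        decision = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return DENY  # timeout: fail closed (the lookup may still finish later)
    except Exception:
        return DENY  # lookup error: fail closed
    # Anything other than an explicit permit is treated as deny.
    return PERMIT if decision == PERMIT else DENY


def fast_lookup(request):
    """Hypothetical healthy mandate database."""
    return PERMIT


def slow_lookup(request):
    """Hypothetical database slowed by a backup; answers after the timeout."""
    time.sleep(0.3)
    return PERMIT


def broken_lookup(request):
    """Hypothetical database that refuses the connection."""
    raise ConnectionError("mandate database unreachable")
```

With this shape, the 30% of requests that timed out in the scenario would have been rejected rather than approved, at the cost of blocked clinical workflows that a degraded-mode tier would then need to handle.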

Scenario B — Audit System Failure Creates Unmonitored Window: A financial services firm operates AI agents for automated compliance screening of customer transactions. The agents are governed by mandates enforced at the infrastructure layer (AG-001 compliant), and all actions are logged to a centralised audit system. However, the audit system experiences a failure due to a certificate expiry that was not renewed. The enforcement system continues operating — actions are still checked against mandates — but actions are not being logged. The agents continue operating for 18 hours before the certificate expiry is detected and remediated. During this window, the organisation has no audit trail of agent actions. When a regulatory inquiry later asks for a complete record of agent actions during this period, the organisation cannot produce it.

What went wrong: The enforcement system and the audit system were treated as independent components. When the audit system failed, the enforcement system continued operating without recognising that a critical governance component was unavailable. No health check monitored the audit system's availability or triggered degraded-mode behaviour when it became unavailable.

Consequence: 18-hour gap in audit trail. Inability to respond to regulatory inquiry for that period. Regulatory finding for inadequate operational resilience.
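One way to couple enforcement to audit availability is to make a stored audit record a precondition of every permitted action. The sketch below is an assumption-laden illustration: `govern`, `AuditUnavailable`, and the sink callables are hypothetical names, and the choice to keep permitting in-mandate actions while a local buffer is available is one possible degraded-mode policy, not the only conforming one.

```python
class AuditUnavailable(Exception):
    """Raised by an audit sink that cannot durably store a record."""


def govern(action, mandate_allows, primary_audit, local_buffer):
    """Permit an action only if it is in mandate AND its record is stored."""
    decision = "permit" if mandate_allows(action) else "deny"
    record = {"action": action, "decision": decision, "degraded": False}
    try:
        primary_audit(record)
        return decision
    except AuditUnavailable:
        record["degraded"] = True  # flag the event for human review
    try:
        local_buffer(record)  # fall back to local buffering
    except AuditUnavailable:
        return "deny"  # no audit path at all: block the action
    return decision


def unavailable_sink(record):
    """Hypothetical sink simulating the expired-certificate failure."""
    raise AuditUnavailable("audit endpoint rejected the connection")
```

Under this arrangement the 18-hour window in Scenario B would have produced either locally buffered records marked as degraded or blocked actions, but not unlogged ones.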

Scenario C — Network Partition Creates Split-Brain Governance: An organisation operates AI agents across two data centres with a shared governance database. A network partition separates the two data centres for 23 minutes. Agents in data centre A continue operating with their local connection to the governance database. Agents in data centre B lose connectivity to the database and, under the degraded-mode policy, switch to cached mandates. However, the cached mandates in data centre B are from a prior version — a mandate update was applied 30 minutes before the partition but had not yet propagated to the cache. Agents in data centre B operate for 23 minutes under outdated mandates that permit higher transaction limits than the current approved configuration.

What went wrong: The cached mandate synchronisation was not immediate, creating a window where the cache held a prior version. The degraded-mode policy used the cached mandate without verifying its currency. No mechanism existed to detect that the cached mandate was stale relative to the active version.

Consequence: 23 minutes of agent operation under outdated, more permissive mandates. Potential regulatory finding if any actions during this window exceeded the limits defined in the current mandate version. Demonstration that degraded-mode caching must include version verification or use the most conservative interpretation when version currency cannot be confirmed.
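The "most conservative interpretation when version currency cannot be confirmed" can be sketched as a small rule. Everything named here is hypothetical: `CachedMandate`, the `confirmed_at` timestamp (the last time the cache entry was confirmed to match the active version), and the static `baseline_limit` are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CachedMandate:
    version: int
    transaction_limit: float
    confirmed_at: float  # last time this entry was confirmed current (seconds)


def effective_limit(cached: Optional[CachedMandate], now: float,
                    max_unconfirmed_s: float, baseline_limit: float) -> float:
    """Return the limit to enforce while the governance database is unreachable."""
    if cached is None:
        return baseline_limit  # nothing cached: static conservative baseline
    if now - cached.confirmed_at > max_unconfirmed_s:
        # Currency cannot be confirmed: take the stricter of the two limits.
        return min(cached.transaction_limit, baseline_limit)
    return cached.transaction_limit
```

The design choice is that an unconfirmed cache entry can only tighten the limit, never loosen it, so a missed mandate update cannot widen what agents may do during a partition.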

4. Requirement Statement

Scope: This dimension applies to all governance system components including mandate enforcement databases, audit logging services, identity verification systems, external verification dependencies, monitoring infrastructure, and communication systems used for human escalation. The scope extends to dependencies between components — a mandate enforcement service that depends on a database is affected by database failures even if the service itself is healthy. Any system where an AI agent operates under governance controls that depend on infrastructure availability is within scope. The scope includes partial failures (slow responses, intermittent errors, stale data) as well as complete failures (component unavailability).

4.1. A conforming system MUST define explicit degraded-mode behaviour for each governance infrastructure failure scenario, specifying the governance posture to adopt when each component is unavailable.

4.2. A conforming system MUST default degraded-mode operation to the most conservative available governance posture — governance MUST NOT fail open.

4.3. A conforming system MUST ensure that unavailability of any governance component does not result in unrestricted agent operation — agents MUST be blocked from executing actions rather than permitted to operate without governance.

4.4. A conforming system MUST log degraded-mode operation events and flag them for human review, using local buffering when the primary audit system is unavailable.

4.5. A conforming system SHOULD define degraded-mode tiers from partial degradation through complete infrastructure failure, with progressively more conservative governance postures at each tier.

4.6. A conforming system SHOULD test recovery procedures under simulated failure conditions, including replay of locally buffered audit records when connectivity is restored.

4.7. A conforming system SHOULD maintain a minimum-capability in-memory buffer during brief outages to provide continuity without compromising governance.

4.8. A conforming system MAY implement automatic escalation to human oversight when degraded mode is entered.

5. Rationale

Governance Continuity Under Failure addresses the question every governance framework must confront: what happens when the governance system itself breaks? Without AG-008, every other protocol in the framework has an implicit single point of failure. If the infrastructure that enforces boundaries, tracks actions, verifies identities, or maintains audit records becomes unavailable, agents either stop entirely — causing business disruption — or continue without governance, creating uncontrolled risk exposure.

The critical distinction is between normal operating controls and fail-safe behaviour. Every other protocol in this framework defines how governance operates under normal conditions. AG-008 defines what happens when those normal conditions no longer exist. Infrastructure failures are routine in any distributed system. Databases go offline, network partitions occur, services crash, and external dependencies become unavailable. The question is not whether governance infrastructure will fail, but how agents behave when it does.

Consider a major cloud provider experiencing a regional database outage lasting 47 minutes. An investment management firm operates AI agents for portfolio rebalancing on this infrastructure. The agents' mandate enforcement relies on a database that stores mandate limits and tracks aggregate exposure. When the database becomes unavailable, the enforcement gateway — which was designed to check every action against the database — begins returning errors. The development team, aware that gateway errors would halt all agent operations, had implemented a timeout fallback: if the database does not respond within 500 milliseconds, the gateway assumes the action is within limits and permits it. During the 47-minute outage, the portfolio rebalancing agents execute 3,200 trades without mandate enforcement. Several agents exceed their aggregate exposure limits. The aggregate exposure across all agents reaches GBP 34 million against a total approved exposure of GBP 8 million. When the database recovers, the enforcement system catches up and begins flagging violations — but all 3,200 trades have already been executed and settled.

This scenario illustrates the core problem. The governance system was robust under normal conditions but had an implicit fail-open path triggered by a routine infrastructure event. The 500-millisecond timeout fallback was a well-intentioned engineering decision that created a catastrophic governance gap. AG-008 also recognises that binary fail-closed behaviour is often impractical. The protocol introduces degraded-mode tiers: defined levels of reduced capability with progressively more conservative governance postures as failure severity increases. Each tier has explicit trigger conditions, permitted capabilities, and escalation procedures. This graduated approach balances risk management with operational continuity.

6. Implementation Guidance

AG-008 requires organisations to define explicit degraded-mode behaviour for governance infrastructure and to ensure that no failure scenario permits unrestricted agent operation. The implementation centres on degraded-mode tier design, health monitoring, caching strategies, and recovery procedures.

Define a minimum of six degraded-mode tiers: (1) full operation, (2) read-only audit, (3) in-memory audit only, (4) cached mandate only, (5) static deny-all, (6) complete isolation. Each tier should have explicit trigger conditions, a capability description, and escalation procedures. The default posture for any unrecognised failure state should be deny-all.
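The six tiers and the deny-all default can be expressed as a small lookup. The tier names come from the list above; the failure-state keys, their mapping to tiers, and the rule that tiers below static deny-all may still execute actions are assumptions made for this sketch, since the text does not define trigger conditions.

```python
from enum import IntEnum


class Tier(IntEnum):
    FULL_OPERATION = 1
    READ_ONLY_AUDIT = 2
    IN_MEMORY_AUDIT_ONLY = 3
    CACHED_MANDATE_ONLY = 4
    STATIC_DENY_ALL = 5
    COMPLETE_ISOLATION = 6


# Hypothetical trigger conditions; a real deployment would document its own.
TIER_FOR_STATE = {
    "all_healthy": Tier.FULL_OPERATION,
    "audit_store_read_only": Tier.READ_ONLY_AUDIT,
    "audit_store_down": Tier.IN_MEMORY_AUDIT_ONLY,
    "mandate_db_down": Tier.CACHED_MANDATE_ONLY,
    "cache_stale": Tier.STATIC_DENY_ALL,
    "control_plane_untrusted": Tier.COMPLETE_ISOLATION,
}


def tier_for(state: str) -> Tier:
    """Map a failure state to a tier; unrecognised states resolve to deny-all."""
    return TIER_FOR_STATE.get(state, Tier.STATIC_DENY_ALL)


def may_execute(tier: Tier) -> bool:
    """Assumed rule: only tiers below static deny-all may execute actions."""
    return tier < Tier.STATIC_DENY_ALL
```

Using an ordered enum makes "more conservative" a simple comparison, which keeps tier transitions and their tests easy to reason about.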

Recommended patterns:

Health-Monitored Tier Controller. A dedicated health service monitors all governance components for connectivity, functionality, and data currency. It maintains a state machine representing the current tier and transitions automatically when health changes. All agent-facing services enforce tier-appropriate behaviour. The health service must itself be monitored — a failure of the health service should trigger the most conservative tier, not permit agents to operate without tier management.
Layered Cache with Version Verification. Multiple cache layers provide continuity during outages of increasing duration: in-process cache (sub-second), local persistent cache (minutes), and static baseline configuration (last resort). Each layer includes a version identifier; if the cache is too stale, the system falls to a more conservative tier. This pattern addresses the split-brain governance scenario by ensuring cached mandates are verified for currency before use.
Independent Governance Watchdog. An independent watchdog on separate infrastructure periodically submits test actions that should be blocked. If a test action is permitted, the watchdog triggers emergency escalation — notifying humans and optionally issuing a platform-wide halt. This detects silent fail-open scenarios invisible to internal monitoring. The watchdog must operate on infrastructure independent of the primary governance system to avoid common-mode failures.
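Two of the patterns above lend themselves to a short sketch: a tier controller that treats a silent health service as a failure, and a watchdog probe that submits an action which must be blocked. The class and function names, the two-tier simplification, and the canary payload are all assumptions for illustration; a real controller would cover every defined tier.

```python
import time

FULL, DENY_ALL = "full_operation", "deny_all"


class TierController:
    """Tracks the current tier and treats a silent health service as a failure."""

    def __init__(self, max_heartbeat_age_s, clock=time.monotonic):
        self._max_age = max_heartbeat_age_s
        self._clock = clock
        self._last_beat = None
        self._reported = DENY_ALL  # conservative until the first report arrives

    def report(self, all_components_healthy):
        """Called by the health service on every monitoring cycle."""
        self._last_beat = self._clock()
        self._reported = FULL if all_components_healthy else DENY_ALL

    def tier(self):
        if self._last_beat is None:
            return DENY_ALL
        if self._clock() - self._last_beat > self._max_age:
            return DENY_ALL  # health service silent: most conservative tier
        return self._reported


def watchdog_probe(submit_action, escalate):
    """Submit a canary that must be blocked; escalate if it is permitted."""
    outcome = submit_action({"type": "canary", "expected": "deny"})
    if outcome != "deny":
        escalate("canary action was not blocked: possible fail-open path")
        return False
    return True
```

The injected `clock` exists only so the heartbeat expiry can be tested without waiting; the watchdog would run on separate infrastructure and call `submit_action` against the live gateway.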

Anti-patterns to avoid:

Treating timeout fallbacks as non-governance decisions. Development teams routinely implement timeout fallbacks for performance reasons without considering their governance implications. A 500-millisecond timeout that defaults to "permit" is a governance decision — it defines agent behaviour when governance infrastructure is slow. Every timeout and fallback in the governance path must be evaluated as a degraded-mode behaviour and must default to conservative posture.
Testing only complete failure scenarios. Many organisations test what happens when a component is completely unavailable but not what happens when it is partially available — responding slowly, returning errors intermittently, or returning stale data. Partial failures are more common than complete failures and often trigger different (and untested) code paths.
Assuming the logging system is always available. Degraded-mode behaviour must include a plan for logging when the primary logging system is unavailable. Local buffering with eventual replay is the minimum requirement. Without this, degraded-mode events are themselves unauditable — creating a governance gap within the governance gap.
Not testing governance infrastructure failures independently of business infrastructure failures. Many disaster recovery tests simulate business infrastructure failures but leave governance infrastructure intact. AG-008 requires testing what happens when governance infrastructure fails while business infrastructure remains operational — the scenario where agents can act but governance cannot govern.
Implementing fail-closed as a hard block without graduated tiers. A system that blocks all agent activity when any governance component fails meets the minimum AG-008 requirement (governance does not fail open) but creates excessive business disruption. Graduated tiers allow the system to maintain appropriate governance while preserving as much operational capability as the failure scenario permits.

Industry Considerations

Financial Services. Regulators expect operational resilience under severe but plausible disruption scenarios. The FCA's operational resilience framework (PS21/3) requires firms to identify important business services, set impact tolerances, and ensure they can remain within those tolerances during such scenarios. Firms should define the maximum tolerable duration of disrupted governance as part of those tolerances and use it to set caching durations and tier transition thresholds. MiFID II business continuity requirements (Article 18 of the Delegated Regulation) also apply.

Healthcare. Degraded-mode tiers must balance governance conservatism with clinical safety. Critical clinical functions may continue under cached mandates with enhanced human oversight, while non-urgent functions are suspended. HIPAA contingency plan requirements (45 CFR 164.308(a)(7)) require procedures for operating during emergencies, mapping directly to AG-008's degraded-mode tier definitions.

Critical Infrastructure. A governance failure that shuts down a safety-monitoring agent could itself create a hazard. Degraded-mode design must coordinate with safety instrumented systems (SIS) per IEC 61511 and ensure governance failure does not interfere with safety-critical operations. Safety limits must be enforced even in the most degraded governance tiers.

Maturity Model

Basic Implementation — The organisation has documented degraded-mode behaviour for its governance infrastructure. When a governance component fails, agents are blocked from executing actions rather than operating without governance. The fail-closed behaviour is implemented as a hard stop: if the enforcement check cannot be completed, the action is rejected. Degraded-mode events are logged (if the logging system is available) and the operations team is notified. At this level, the organisation meets the core mandatory requirement — governance does not fail open — but the implementation is binary: either full governance or no agent operation. This creates business continuity risk, as any governance infrastructure outage halts all agent activity.

Intermediate Implementation — Degraded-mode tiers are defined with progressively more conservative governance postures. A minimum of four tiers are implemented: full operation, reduced capability (cached mandates with conservative defaults for uncached parameters), deny-new-only (in-flight actions complete but no new actions are initiated), and complete block. Each tier has defined trigger conditions based on the health of specific governance components. The system automatically transitions between tiers based on real-time health monitoring. In-memory caching provides continuity during brief outages (seconds to low minutes). Degraded-mode events are logged to a local buffer when the primary audit system is unavailable, and replayed when connectivity is restored. Recovery procedures have been tested under simulated failure conditions.

Advanced Implementation — All intermediate capabilities plus: degraded-mode tiers include six or more levels with granular capability definitions. Failure injection testing (chaos engineering) is conducted regularly against governance infrastructure, including targeted attacks against specific components. The system can operate in degraded mode for extended periods (hours) with cached mandates and local audit buffering. Cross-region or cross-zone redundancy ensures that single-region failures do not trigger degraded mode. Independent audit has verified that no failure scenario results in fail-open behaviour. The governance system can demonstrate to regulators that it maintains effective controls even under the most severe infrastructure failure scenarios it has tested.

7. Evidence Requirements

Required artefacts:

Degraded-mode tier definitions. Documented tiers with trigger conditions, permitted capabilities, and escalation procedures for each tier. Format: structured configuration or policy document showing each tier, its trigger conditions, and the governance posture at that tier.
Failure scenario coverage documentation. Mapping of every governance infrastructure component to the degraded-mode behaviour when that component fails, including combination failures. This must demonstrate that no failure combination results in fail-open behaviour.
Test results under simulated failure conditions. Evidence of failure injection testing showing that each failure scenario triggered the appropriate degraded-mode tier and that no scenario resulted in fail-open behaviour. Includes component isolation testing, cascading failure testing, recovery testing, and adversarial failure testing.
Human notification procedures for degraded-mode entry. Evidence that entry into any degraded-mode tier triggers notification to appropriate personnel, including the notification mechanism and escalation path.

Retention requirements:

Degraded-mode tier definitions and failure test results: minimum 7 years for regulated financial services; minimum 5 years for other regulated sectors; minimum 3 years otherwise.

Access requirements:

Producible to regulators or auditors within 48 hours of request. Evidence must exist as retained artefacts; it must not be reconstructed after the fact.

8. Test Specification

Testing AG-008 compliance requires deliberate injection of failures into governance infrastructure while monitoring agent behaviour. This is chaos engineering applied to governance.

Test 8.1: Component Isolation — Fail-Closed Verification

Stimulus: Systematically disable each governance infrastructure component (mandate enforcement database, audit logging service, identity verification system, monitoring infrastructure) individually while agents are actively operating.
Expected behaviour: The system transitions to the appropriate degraded-mode tier with conservative posture. Agents are blocked from executing actions that cannot be governed. Logging and operator notification occur.
Pass criteria: No agent action executes without governance. Each component failure triggers the documented degraded-mode tier.
Fail criteria: Any component failure results in unrestricted agent operation or undefined behaviour.

Test 8.2: Cascading Failure — Multi-Component Degradation

Stimulus: Disable multiple governance components simultaneously, including the specific combination where audit and enforcement fail together.
Expected behaviour: Graceful degradation through defined tiers to the most conservative posture. No fail-open path exists regardless of the combination of failed components.
Pass criteria: Every tested combination of component failures results in a defined, conservative governance posture. The system does not exhibit undefined behaviour.
Fail criteria: Any combination of failures results in unrestricted agent operation, or the system enters an undefined state.
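The combination coverage in Test 8.2 can be checked exhaustively when the component list is small. The sketch below assumes a hypothetical `policy` function that maps a set of failed components to a posture name; the component names follow Test 8.1, while the posture names and both example policies are invented for illustration.

```python
from itertools import combinations

COMPONENTS = ("mandate_db", "audit_log", "identity", "monitoring")
CONSERVATIVE = {"local_audit_buffer", "cached_mandate", "deny_all"}


def example_policy(failed):
    """Hypothetical documented policy: posture for a set of failed components."""
    failed = frozenset(failed)
    if failed == {"audit_log"}:
        return "local_audit_buffer"
    if failed == {"mandate_db"}:
        return "cached_mandate"
    return "deny_all"  # every other combination, including unknown ones


def flawed_policy(failed):
    """Hypothetical bug: enforcement and audit failing together fails open."""
    if {"mandate_db", "audit_log"} <= set(failed):
        return "permit_all"
    return example_policy(failed)


def fail_open_combinations(policy, components=COMPONENTS, allowed=CONSERVATIVE):
    """Return every failure combination whose posture is not conservative."""
    offenders = []
    for size in range(1, len(components) + 1):
        for combo in combinations(components, size):
            try:
                posture = policy(combo)
            except Exception:
                posture = None  # undefined behaviour counts as a failure
            if posture not in allowed:
                offenders.append(combo)
    return offenders
```

Four components give 15 non-empty combinations, so full enumeration is cheap; a check like this covers only the documented policy and does not replace injecting the failures into running infrastructure.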

Test 8.3: Timeout and Fallback — Latency-Induced Fail-Open

Stimulus: Introduce latency (not failure) into governance infrastructure components. Progressively increase latency to trigger timeout thresholds.
Expected behaviour: Timeout behaviour defaults to conservative posture. No timeout fallback permits unrestricted operation. The system transitions to an appropriate degraded-mode tier when latency exceeds thresholds.
Pass criteria: No latency condition causes a fail-open path. All timeout fallbacks default to deny.
Fail criteria: Any timeout fallback permits an action that would have been blocked under normal operation.
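The pass and fail criteria for Test 8.3 reduce to one check per injected latency: once latency exceeds the timeout, the decision must be "deny". The sketch below simulates the gateway rather than injecting real latency, so `make_gateway` and its `fallback` parameter are stand-ins for the system under test, not part of any real API.

```python
def make_gateway(timeout_s, fallback):
    """Build a simulated gateway with the given timeout fallback decision."""
    def gateway(injected_latency_s):
        if injected_latency_s <= timeout_s:
            return "permit"  # backend answered in time for an in-scope request
        return fallback      # timeout path under test
    return gateway


def fail_open_latencies(gateway, latencies_s, timeout_s):
    """Return the injected latencies at which the gateway did not deny."""
    return [
        latency for latency in latencies_s
        if latency > timeout_s and gateway(latency) != "deny"
    ]
```

An empty result corresponds to the pass criteria; any returned latency identifies a timeout path that permits an action the gateway could not verify.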

Test 8.4: Recovery and Replay

Stimulus: Trigger degraded mode, allow agents to operate under degraded governance with local audit buffering, then restore the failed component.
Expected behaviour: The system recovers to full operation. Locally buffered audit records are replayed to the primary audit system. No audit records are lost.
Pass criteria: Full recovery occurs within documented timeframes. All locally buffered records are successfully replayed. The audit trail is complete across the degraded-mode period.
Fail criteria: Audit records from the degraded-mode period are lost, or recovery does not restore full governance operation.
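Local buffering with replay, and the completeness check that Test 8.4 relies on, can be sketched with sequence numbers. `BufferedAudit`, `FlakyStore`, and `is_complete` are hypothetical names; the sketch keeps the buffer in memory, whereas a real deployment would likely need a persistent local buffer to survive a process restart.

```python
from collections import deque


class FlakyStore:
    """Hypothetical primary audit store that can be switched off for testing."""

    def __init__(self):
        self.up = True
        self.records = []

    def write(self, record):
        if not self.up:
            raise ConnectionError("primary audit store unavailable")
        self.records.append(record)


class BufferedAudit:
    """Writes to the primary store, buffering locally while it is unavailable."""

    def __init__(self, primary_write):
        self._primary_write = primary_write
        self._buffer = deque()
        self._next_seq = 0

    def log(self, event):
        record = {"seq": self._next_seq, "event": event}
        self._next_seq += 1
        if self._buffer:
            self._buffer.append(record)  # keep ordering while a backlog exists
            return
        try:
            self._primary_write(record)
        except ConnectionError:
            self._buffer.append(record)

    def replay(self):
        """Replay buffered records in order; return how many were written."""
        replayed = 0
        while self._buffer:
            try:
                self._primary_write(self._buffer[0])
            except ConnectionError:
                break  # still down: keep the remaining records buffered
            self._buffer.popleft()
            replayed += 1
        return replayed


def is_complete(stored_records):
    """True if sequence numbers run 0..n-1 with no gaps or duplicates."""
    seqs = sorted(r["seq"] for r in stored_records)
    return seqs == list(range(len(seqs)))
```

A record is removed from the buffer only after the primary write succeeds, so an interrupted replay can be retried without losing records.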

Test 8.5: Adversarial Failure — Targeted Infrastructure Attack

Stimulus: Simulate targeted attacks against governance infrastructure including DDoS against enforcement services, database corruption, network partitions, and certificate expiry.
Expected behaviour: No attack results in unrestricted agent operation. The system detects the attack condition and transitions to an appropriate degraded-mode tier.
Pass criteria: All simulated attacks result in fail-safe behaviour. No attack creates a window of unrestricted operation.
Fail criteria: Any simulated attack results in agents operating without governance controls.

Conformance Scoring

Score 0: No degraded-mode behaviour defined — governance infrastructure failure results in undefined agent behaviour or unrestricted operation.
Score 1: Degraded mode defined but fails open in some scenarios — governance fails closed for some component failures but not all, or timeout fallbacks create fail-open paths.
Score 2: Full fail-safe degraded mode across all infrastructure failure scenarios — no failure combination results in unrestricted agent operation, and degraded-mode tiers provide graduated response.
Score 3: Verified by independent failure injection testing — an independent party has conducted chaos engineering against governance infrastructure and confirmed that all failure scenarios result in fail-safe behaviour.

9. Regulatory Mapping

Regulation	Provision	Relationship Type
SOC 2 Type II	Availability Criteria	Direct requirement
EU AI Act	Article 9 (Risk Management System)	Direct requirement
FCA	SYSC 13 (Operational Resilience)	Direct requirement
DORA	Article 9 (ICT Risk Management Framework)	Supports compliance
HIPAA	45 CFR 164.308(a)(7) (Contingency Plan)	Supports compliance
IEC 62443	Security Levels (Critical Infrastructure)	Supports compliance

SOC 2 Type II — Availability Criteria

SOC 2 Type II examines whether the system is available for operation and use as committed or agreed. For AI agent governance, this includes the availability of governance infrastructure itself. An auditor assessing SOC 2 availability will examine whether the governance system has defined behaviour under failure conditions, whether that behaviour has been tested, and whether the system can demonstrate that availability incidents did not result in governance gaps. The availability criterion also requires that the system has mechanisms to detect and respond to availability incidents — which maps to AG-008's requirement for health monitoring and automated tier transitions.

SOC 2 Type II is an examination over a period (typically 6-12 months), not a point-in-time assessment. This means the auditor will examine all availability incidents during the period and assess whether the degraded-mode behaviour functioned as designed during each incident.

EU AI Act — Article 9 (Risk Management System)

Article 9 requires that the risk management system for high-risk AI systems include measures to address risks identified during risk assessment. Infrastructure failure is a foreseeable risk for any AI system, and the failure of governance infrastructure specifically creates a risk of uncontrolled AI operation. The EU AI Act requires that risk mitigation measures be implemented "as far as technically feasible." Fail-safe governance behaviour is technically feasible, so its absence would not meet the Article 9 standard.

Article 9(5) specifically requires that the risk management measures referred to in Article 9(2)(d) are such that any residual risk is judged acceptable. An AI agent governance system that fails open under infrastructure stress has a residual risk (unrestricted agent operation during outages) that would likely not be judged acceptable for high-risk applications.

FCA SYSC 13 — Operational Resilience

SYSC 13 requires firms to establish systems and controls to manage operational risk, including the risk of system failures. The FCA's operational resilience framework (PS21/3) requires firms to identify important business services, set impact tolerances, and ensure they can remain within those tolerances during severe but plausible disruption scenarios. AI agent governance is part of the control infrastructure for any important business service that uses AI agents. If governance fails during a disruption, the firm cannot demonstrate that the business service remained within its impact tolerance.

The FCA expects firms to conduct scenario testing that includes the failure of control systems, not just the failure of the business service itself. This directly maps to AG-008's requirement for failure injection testing.

DORA — Article 9 (ICT Risk Management Framework)

Article 9 of DORA requires financial entities to establish and maintain an ICT risk management framework that ensures resilience of ICT systems, including the control systems that govern AI agent operations. Governance infrastructure continuity under failure is a core element of ICT resilience for AI-driven financial operations.

HIPAA — 45 CFR 164.308(a)(7) (Contingency Plan)

The HIPAA Security Rule requires covered entities to establish and implement policies and procedures for responding to an emergency or other occurrence that damages systems containing electronic protected health information. For healthcare organisations deploying AI agents, this includes the governance systems that control agent access to patient data. AG-008's degraded-mode tiers map directly to the contingency plan requirement.

IEC 62443 — Security Levels (Critical Infrastructure)

For critical infrastructure deployments, IEC 62443 security levels should inform the enforcement architecture for degraded-mode behaviour. A governance failure that shuts down a safety-monitoring agent could itself create a hazard. Degraded-mode design must coordinate with safety instrumented systems and ensure governance failure does not interfere with safety-critical operations.

10. Failure Severity

Field	Value
Severity Rating	Critical
Blast Radius	Organisation-wide — a governance infrastructure failure affects all agents simultaneously, potentially across all business functions

Consequence chain: Without governance continuity controls, a targeted infrastructure attack — or even routine system failure — can disable governance entirely, creating an unprotected window for unrestricted agent operation. The severity of an AG-008 failure is directly proportional to the duration of the governance gap and the capability of the agents operating during that gap. An agent that can execute financial transactions without governance for even a few minutes can create exposure equivalent to months of human-speed activity. An agent with data access operating without audit for hours can exfiltrate datasets that would take a human insider weeks to collect.

AG-008 failures are particularly dangerous because they are self-concealing. If the audit system fails alongside the enforcement system, there may be no record that governance was absent. The failure mode is not just the absence of governance during the outage — it is the inability to demonstrate what happened during the outage, compounding the regulatory and legal exposure. Defence in depth requires that AG-008 be implemented on independent infrastructure so that a single root cause cannot disable both governance controls and the continuity mechanisms. A regulator who discovers that a routine infrastructure outage disabled governance entirely will infer that the organisation's governance framework is architecturally unsound — not merely that it experienced a technical failure.

Cross-references: AG-008 intersects with AG-001 (Operational Boundary Enforcement) for defining the boundaries that must remain enforced under failure; AG-006 (Tamper-Evident Record Integrity) for ensuring audit records continue to be captured during degraded operation; AG-007 (Governance Configuration Control) for governing which configuration version is used when the live source is unavailable; AG-012 (Agent Identity Assurance) for addressing how identity verification operates when identity infrastructure is degraded; and AG-019 (Human Escalation & Override Triggers) for escalation mechanisms during degraded-mode entry.

Cite this protocol

AgentGoverning. (2026). AG-008: Governance Continuity Under Failure. The Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-008
