AG-008

Governance Continuity Under Failure

Group A: Mandate & Action Governance · AGS v2.1 · April 2026
EU AI Act · FCA · HIPAA · SOC 2

2. Summary

Governance Continuity Under Failure requires that every AI governance system defines explicit, tested behaviour for when governance infrastructure components are degraded or unavailable. The fundamental principle is that governance must not fail open: unavailability of an enforcement mechanism, audit system, or verification service must never result in unrestricted agent operation. When normal governance infrastructure is unavailable, the system must default to the most conservative available posture. This dimension introduces degraded-mode tiers — defined levels of reduced capability with progressively more conservative governance postures as failure severity increases — balancing risk management with operational continuity. Without AG-008, every other dimension in the framework has an implicit single point of failure: a routine infrastructure outage can disable governance entirely, creating an unprotected window for unrestricted agent operation.

3. Example

Scenario A — Timeout Fallback Creates Fail-Open Path: A healthcare AI agent platform processes patient data access requests through a governance gateway that checks each request against the agent's data access mandate stored in a database. The gateway has a 200-millisecond timeout — if the database does not respond within this window, the gateway returns a default response. During implementation, the development team set the default response to "permit" to avoid blocking clinical workflows. During a database performance degradation caused by a routine backup operation, the gateway begins timing out on approximately 30% of requests. For these requests, the agents receive automatic approval regardless of their mandate scope. Over a four-hour period, three agents access patient records outside their authorised scope — including mental health records that require elevated access controls under the organisation's clinical governance policy.

What went wrong: The timeout fallback was set to "permit" rather than "deny." The decision was made during development to avoid clinical workflow disruption, but it created a fail-open path. The database performance degradation was a routine operational event, not an attack, demonstrating that fail-open behaviour does not require adversarial action. Consequence: Unauthorised access to sensitive patient records, including specially protected mental health data. HIPAA breach notification required. Potential OCR investigation. Disciplinary action for clinical staff whose agents accessed records outside their scope. Remediation cost for implementing proper fail-safe behaviour.
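A minimal sketch of the fail-closed alternative to the timeout fallback in Scenario A. The function name `check_with_timeout`, the `lookup` callable (a stand-in for the mandate database query), and the string constants are illustrative assumptions, not part of AG-008. The 200-millisecond default mirrors the scenario.

```python
import concurrent.futures

PERMIT = "permit"
DENY = "deny"


def check_with_timeout(lookup, request, timeout_s=0.2):
    """Permit only when the mandate lookup answers in time and returns True.

    `lookup` is a hypothetical callable standing in for the mandate
    database query; any timeout, error, or non-True answer yields DENY.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(lookup, request)
    try:
        allowed = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return DENY  # database too slow: fail closed, never default to permit
    except Exception:
        return DENY  # lookup raised: also fail closed
    finally:
        pool.shutdown(wait=False)  # do not block the caller on a stuck lookup
    return PERMIT if allowed is True else DENY
```

The only path to `PERMIT` is an explicit, timely `True` from the lookup. Every other outcome, including a malformed response, resolves to `DENY`.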

Scenario B — Audit System Failure Creates Unmonitored Window: A financial services firm operates AI agents for automated compliance screening of customer transactions. The agents are governed by mandates enforced at the infrastructure layer (AG-001 compliant), and all actions are logged to a centralised audit system. However, the audit system experiences a failure due to a certificate expiry that was not renewed. The enforcement system continues operating — actions are still checked against mandates — but actions are not being logged. The agents continue operating for 18 hours before the certificate expiry is detected and remediated. During this window, the organisation has no audit trail of agent actions. When a regulatory inquiry later asks for a complete record of agent actions during this period, the organisation cannot produce it.

What went wrong: The enforcement system and the audit system were treated as independent components. When the audit system failed, the enforcement system continued operating without recognising that a critical governance component was unavailable. No health check monitored the audit system's availability or triggered degraded-mode behaviour when it became unavailable. Consequence: 18-hour gap in audit trail. Inability to respond to regulatory inquiry for that period. Regulatory finding for inadequate operational resilience.
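One way to couple enforcement to audit availability, as the analysis of Scenario B suggests, is a freshness check on the last confirmed audit write. The sketch below is an assumption-laden illustration: `AuditHealthMonitor`, `decide`, and the `max_silence_s` threshold are hypothetical names, and the degraded posture is simplified to a plain deny (a tiered system might instead switch to local buffering).

```python
import time


class AuditHealthMonitor:
    """Tracks the last confirmed audit write (hypothetical helper)."""

    def __init__(self, max_silence_s, clock=time.monotonic):
        self.max_silence_s = max_silence_s
        self.clock = clock  # injectable so tests can control time
        self.last_ok = None  # no confirmed write yet

    def record_success(self):
        """Call after the audit system acknowledges a write."""
        self.last_ok = self.clock()

    def is_healthy(self):
        if self.last_ok is None:
            return False  # never confirmed: treat as unavailable
        return (self.clock() - self.last_ok) <= self.max_silence_s


def decide(mandate_allows, monitor):
    """Enforcement decision that also depends on audit availability."""
    if not monitor.is_healthy():
        return "deny"  # audit silent too long: do not act unrecorded
    return "permit" if mandate_allows else "deny"
```

With a check like this, an expired certificate on the audit path would surface as denied actions within `max_silence_s` seconds rather than as an 18-hour silent gap.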

Scenario C — Network Partition Creates Split-Brain Governance: An organisation operates AI agents across two data centres with a shared governance database. A network partition separates the two data centres for 23 minutes. Agents in data centre A continue operating with their local connection to the governance database. Agents in data centre B lose connectivity to the database and, under the degraded-mode policy, switch to cached mandates. However, the cached mandates in data centre B are from a prior version — a mandate update was applied 30 minutes before the partition but had not yet propagated to the cache. Agents in data centre B operate for 23 minutes under outdated mandates that permit higher transaction limits than the current approved configuration.

What went wrong: The cached mandate synchronisation was not immediate, creating a window where the cache held a prior version. The degraded-mode policy used the cached mandate without verifying its currency. No mechanism existed to detect that the cached mandate was stale relative to the active version. Consequence: 23 minutes of agent operation under outdated, more permissive mandates. Potential regulatory finding if any actions during this window exceeded the limits defined in the current mandate version. Demonstration that degraded-mode caching must include version verification or use the most conservative interpretation when version currency cannot be confirmed.
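The version check that Scenario C lacked can be sketched as follows. This assumes the cache learns the active version number through a lightweight announcement that may arrive before the full mandate does; that mechanism, the `Mandate` type, `effective_limit`, and `floor_limit` (a pre-agreed conservative limit) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mandate:
    version: int
    transaction_limit: int


def effective_limit(cached: Mandate,
                    last_known_active_version: Optional[int],
                    floor_limit: int) -> int:
    """Return the limit to enforce while the live mandate is unreachable.

    The cached limit is trusted only when its version is confirmed current.
    Otherwise the more conservative of the cached limit and the floor applies.
    """
    confirmed = (last_known_active_version is not None
                 and cached.version == last_known_active_version)
    if confirmed:
        return cached.transaction_limit
    return min(cached.transaction_limit, floor_limit)
```

For example, a cached version 6 mandate with a limit of 50,000 is capped at the floor once version 7 has been announced, or when no version information is available at all.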

4. Requirement Statement

Scope: This dimension applies to all governance system components including mandate enforcement databases, audit logging services, identity verification systems, external verification dependencies, monitoring infrastructure, and communication systems used for human escalation. The scope extends to dependencies between components — a mandate enforcement service that depends on a database is affected by database failures even if the service itself is healthy. Any system where an AI agent operates under governance controls that depend on infrastructure availability is within scope. The scope includes partial failures (slow responses, intermittent errors, stale data) as well as complete failures (component unavailability).

4.1. A conforming system MUST define explicit degraded-mode behaviour for each governance infrastructure failure scenario, specifying the governance posture to adopt when each component is unavailable.

4.2. A conforming system MUST default degraded-mode operation to the most conservative available governance posture — governance MUST NOT fail open.

4.3. A conforming system MUST ensure that unavailability of any governance component does not result in unrestricted agent operation — agents MUST be blocked from executing actions rather than permitted to operate without governance.

4.4. A conforming system MUST log degraded-mode operation events and flag them for human review, using local buffering when the primary audit system is unavailable.

4.5. A conforming system SHOULD define degraded-mode tiers from partial degradation through complete infrastructure failure, with progressively more conservative governance postures at each tier.

4.6. A conforming system SHOULD test recovery procedures under simulated failure conditions, including replay of locally buffered audit records when connectivity is restored.

4.7. A conforming system SHOULD maintain a minimum-capability in-memory buffer so that brief outages can be bridged without compromising governance.

4.8. A conforming system MAY implement automatic escalation to human oversight when degraded mode is entered.
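Requirements 4.4 and 4.6 together imply a local buffer that captures audit events during an outage and replays them in order afterwards. The following is a minimal sketch under stated assumptions: `BufferedAuditLog` is a hypothetical class, `send` is a hypothetical callable that raises `ConnectionError` when the primary audit system is unreachable, and the `degraded_mode` flag is one possible way to mark events for human review.

```python
from collections import deque


class BufferedAuditLog:
    """Audit writer with a local in-memory buffer (illustrative only)."""

    def __init__(self, send):
        self.send = send  # hypothetical: raises ConnectionError when down
        self.pending = deque()

    def log(self, event):
        """Return True if delivered now, False if buffered locally."""
        if not self.pending:
            try:
                self.send(event)
                return True
            except ConnectionError:
                pass  # fall through to local buffering
        # Buffer behind earlier events to preserve order, and flag the
        # event so it surfaces for human review after replay.
        self.pending.append({**event, "degraded_mode": True})
        return False

    def replay(self):
        """Send buffered events oldest-first; stop at the first failure."""
        sent = 0
        while self.pending:
            try:
                self.send(self.pending[0])
            except ConnectionError:
                break
            self.pending.popleft()
            sent += 1
        return sent
```

This sketch keeps the buffer unbounded and in memory for brevity. A production buffer would likely need a size bound, durable storage, and a defined posture (such as deny-all) once the bound is reached.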

5. Rationale

Governance Continuity Under Failure addresses the question every governance framework must confront: what happens when the governance system itself breaks? Without AG-008, every other protocol in the framework has an implicit single point of failure. If the infrastructure that enforces boundaries, tracks actions, verifies identities, or maintains audit records becomes unavailable, agents either stop entirely — causing business disruption — or continue without governance, creating uncontrolled risk exposure.

The critical distinction is between normal operating controls and fail-safe behaviour. Every other protocol in this framework defines how governance operates under normal conditions. AG-008 defines what happens when those normal conditions no longer exist. Infrastructure failures are routine in any distributed system. Databases go offline, network partitions occur, services crash, and external dependencies become unavailable. The question is not whether governance infrastructure will fail, but how agents behave when it does.

Consider a major cloud provider experiencing a regional database outage lasting 47 minutes. An investment management firm operates AI agents for portfolio rebalancing on this infrastructure. The agents' mandate enforcement relies on a database that stores mandate limits and tracks aggregate exposure. When the database becomes unavailable, the enforcement gateway — which was designed to check every action against the database — begins returning errors. The development team, aware that gateway errors would halt all agent operations, had implemented a timeout fallback: if the database does not respond within 500 milliseconds, the gateway assumes the action is within limits and permits it. During the 47-minute outage, the portfolio rebalancing agents execute 3,200 trades without mandate enforcement. Several agents exceed their aggregate exposure limits. The aggregate exposure across all agents reaches GBP 34 million against a total approved exposure of GBP 8 million. When the database recovers, the enforcement system catches up and begins flagging violations — but all 3,200 trades have already been executed and settled.

This scenario illustrates the core problem. The governance system was robust under normal conditions but had an implicit fail-open path triggered by a routine infrastructure event. The 500-millisecond timeout fallback was a well-intentioned engineering decision that created a catastrophic governance gap.

AG-008 also recognises that binary fail-closed behaviour is often impractical. The protocol introduces degraded-mode tiers: defined levels of reduced capability with progressively more conservative governance postures as failure severity increases. Each tier has explicit trigger conditions, permitted capabilities, and escalation procedures. This graduated approach balances risk management with operational continuity.

6. Implementation Guidance

AG-008 requires organisations to define explicit degraded-mode behaviour for governance infrastructure and to ensure that no failure scenario permits unrestricted agent operation. The implementation centres on degraded-mode tier design, health monitoring, caching strategies, and recovery procedures.

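Define degraded-mode tiers that run from full operation to complete isolation. A six-tier reference set is: (1) full operation, (2) read-only audit, (3) in-memory audit only, (4) cached mandate only, (5) static deny-all, (6) complete isolation. The Maturity Model below treats four tiers as the intermediate baseline and six or more as advanced. Each tier should have explicit trigger conditions, a description of permitted capabilities, and escalation procedures. The default posture for any unrecognised failure state should be deny-all.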
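The six tiers and the deny-all default can be expressed as a lookup from component health to tier. The tier names below come from the text above; the health labels (`"up"`, `"read_only"`, `"down"`), the `TIER_TABLE` mapping, and `select_tier` are hypothetical, since AG-008 does not prescribe trigger conditions.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Higher value means a more conservative governance posture."""
    FULL_OPERATION = 1
    READ_ONLY_AUDIT = 2
    IN_MEMORY_AUDIT_ONLY = 3
    CACHED_MANDATE_ONLY = 4
    STATIC_DENY_ALL = 5
    COMPLETE_ISOLATION = 6  # entered only by explicit operator action here


# Hypothetical trigger table: (mandate database health, audit health) -> tier.
TIER_TABLE = {
    ("up", "up"): Tier.FULL_OPERATION,
    ("up", "read_only"): Tier.READ_ONLY_AUDIT,
    ("up", "down"): Tier.IN_MEMORY_AUDIT_ONLY,
    ("down", "up"): Tier.CACHED_MANDATE_ONLY,
    ("down", "down"): Tier.STATIC_DENY_ALL,
}


def select_tier(mandate_db, audit):
    """Map observed health to a tier; unrecognised states get deny-all."""
    return TIER_TABLE.get((mandate_db, audit), Tier.STATIC_DENY_ALL)
```

Any combination missing from the table, such as `("down", "read_only")` or an unexpected health label, resolves to `STATIC_DENY_ALL` rather than to a permissive tier.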

Recommended patterns:

Anti-patterns to avoid:

Industry Considerations

Financial Services. Regulators expect operational resilience under severe but plausible disruption scenarios. The FCA's operational resilience framework (PS21/3) requires firms to identify important business services, set impact tolerances, and ensure they can remain within those tolerances during such scenarios. Impact tolerances define the maximum tolerable duration of disruption, including disrupted governance, and should inform caching durations and tier transition thresholds. MiFID II business continuity requirements (Article 18 of the Delegated Regulation) also apply.

Healthcare. Degraded-mode tiers must balance governance conservatism with clinical safety. Critical clinical functions may continue under cached mandates with enhanced human oversight, while non-urgent functions are suspended. HIPAA contingency plan requirements (45 CFR 164.308(a)(7)) require procedures for operating during emergencies, mapping directly to AG-008's degraded-mode tier definitions.

Critical Infrastructure. A governance failure that shuts down a safety-monitoring agent could itself create a hazard. Degraded-mode design must coordinate with safety instrumented systems (SIS) per IEC 61511 and ensure governance failure does not interfere with safety-critical operations. Safety limits must be enforced even in the most degraded governance tiers.

Maturity Model

Basic Implementation — The organisation has documented degraded-mode behaviour for its governance infrastructure. When a governance component fails, agents are blocked from executing actions rather than operating without governance. The fail-closed behaviour is implemented as a hard stop: if the enforcement check cannot be completed, the action is rejected. Degraded-mode events are logged (if the logging system is available) and the operations team is notified. At this level, the organisation meets the core mandatory requirement — governance does not fail open — but the implementation is binary: either full governance or no agent operation. This creates business continuity risk, as any governance infrastructure outage halts all agent activity.

Intermediate Implementation — Degraded-mode tiers are defined with progressively more conservative governance postures. A minimum of four tiers are implemented: full operation, reduced capability (cached mandates with conservative defaults for uncached parameters), deny-new-only (in-flight actions complete but no new actions are initiated), and complete block. Each tier has defined trigger conditions based on the health of specific governance components. The system automatically transitions between tiers based on real-time health monitoring. In-memory caching provides continuity during brief outages (seconds to low minutes). Degraded-mode events are logged to a local buffer when the primary audit system is unavailable, and replayed when connectivity is restored. Recovery procedures have been tested under simulated failure conditions.

Advanced Implementation — All intermediate capabilities plus: degraded-mode tiers include six or more levels with granular capability definitions. Failure injection testing (chaos engineering) is conducted regularly against governance infrastructure, including targeted attacks against specific components. The system can operate in degraded mode for extended periods (hours) with cached mandates and local audit buffering. Cross-region or cross-zone redundancy ensures that single-region failures do not trigger degraded mode. Independent audit has verified that no failure scenario results in fail-open behaviour. The governance system can demonstrate to regulators that it maintains effective controls even under the most severe infrastructure failure scenarios it has tested.
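The four tiers named at the intermediate level imply two small rules: which actions a tier admits, and which limit applies when a parameter is missing from the cache. A minimal sketch follows, in which `admit`, `cached_limit`, and the fallback of zero for parameters with no conservative default are illustrative assumptions.

```python
from enum import Enum


class Tier(Enum):
    FULL_OPERATION = "full operation"
    REDUCED_CAPABILITY = "reduced capability"
    DENY_NEW_ONLY = "deny-new-only"
    COMPLETE_BLOCK = "complete block"


def admit(tier, in_flight):
    """Decide whether an action may proceed to the mandate check."""
    if tier in (Tier.FULL_OPERATION, Tier.REDUCED_CAPABILITY):
        return True  # still subject to the live or cached mandate check
    if tier is Tier.DENY_NEW_ONLY:
        return in_flight  # in-flight actions complete; new ones are refused
    return False  # COMPLETE_BLOCK and anything unrecognised


def cached_limit(parameter, cache, conservative_defaults):
    """Reduced capability: cached value, else the conservative default."""
    if parameter in cache:
        return cache[parameter]
    # Assumption: a parameter with no configured default gets a limit of 0.
    return conservative_defaults.get(parameter, 0)
```

An unrecognised tier value falls through to `False`, which keeps the admission rule consistent with the requirement that governance must not fail open.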

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Testing AG-008 compliance requires deliberate injection of failures into governance infrastructure while monitoring agent behaviour. This is chaos engineering applied to governance.
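As a rough illustration of failure injection at the unit level, the sketch below feeds a gateway several faulty mandate lookups and reports any failure mode under which it still permits an action. All names (`FAILURE_MODES`, `faulty_lookup`, `fail_open_modes`, and both example gateways) are hypothetical, and real chaos testing would target live infrastructure rather than in-process stubs.

```python
FAILURE_MODES = ("error", "timeout", "null", "garbage")


def faulty_lookup(mode):
    """Build a stand-in mandate lookup that fails in the given way."""
    def lookup(request):
        if mode == "error":
            raise ConnectionError("injected: database unreachable")
        if mode == "timeout":
            raise TimeoutError("injected: database too slow")
        if mode == "null":
            return None
        return "maybe"  # "garbage": malformed response
    return lookup


def fail_open_modes(gateway):
    """Return the injected failure modes under which `gateway` permits."""
    leaks = []
    for mode in FAILURE_MODES:
        try:
            decision = gateway(faulty_lookup(mode), {"agent": "a1"})
        except Exception:
            decision = "crash"  # treated as blocking here; verify separately
        if decision == "permit":
            leaks.append(mode)
    return leaks


def strict_gateway(lookup, request):
    """Fail-closed example: only an explicit True permits."""
    try:
        return "permit" if lookup(request) is True else "deny"
    except Exception:
        return "deny"


def lenient_gateway(lookup, request):
    """Fail-open example: the anti-pattern from Scenario A."""
    try:
        return "deny" if lookup(request) is False else "permit"
    except Exception:
        return "permit"
```

Running `fail_open_modes(strict_gateway)` yields an empty list, while the lenient gateway leaks under every injected mode. A conforming system would be expected to produce the empty result.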

Test 8.1: Component Isolation — Fail-Closed Verification

Test 8.2: Cascading Failure — Multi-Component Degradation

Test 8.3: Timeout and Fallback — Latency-Induced Fail-Open

Test 8.4: Recovery and Replay

Test 8.5: Adversarial Failure — Targeted Infrastructure Attack

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
SOC 2 Type II | Availability Criteria | Direct requirement
EU AI Act | Article 9 (Risk Management System) | Direct requirement
FCA SYSC | 13 (Operational Resilience) | Direct requirement
DORA | Article 9 (ICT Risk Management Framework) | Supports compliance
HIPAA | 45 CFR 164.308(a)(7) (Contingency Plan) | Supports compliance
IEC 62443 | Security Levels (Critical Infrastructure) | Supports compliance

SOC 2 Type II — Availability Criteria

SOC 2 Type II examines whether the system is available for operation and use as committed or agreed. For AI agent governance, this includes the availability of governance infrastructure itself. An auditor assessing SOC 2 availability will examine whether the governance system has defined behaviour under failure conditions, whether that behaviour has been tested, and whether the system can demonstrate that availability incidents did not result in governance gaps. The availability criterion also requires that the system has mechanisms to detect and respond to availability incidents — which maps to AG-008's requirement for health monitoring and automated tier transitions.

SOC 2 Type II is an examination over a period (typically 6-12 months), not a point-in-time assessment. This means the auditor will examine all availability incidents during the period and assess whether the degraded-mode behaviour functioned as designed during each incident.

EU AI Act — Article 9 (Risk Management System)

Article 9 requires that the risk management system for high-risk AI systems include measures to address risks identified during risk assessment. Infrastructure failure is a foreseeable risk for any AI system, and the failure of governance infrastructure specifically creates a risk of uncontrolled AI operation. The EU AI Act requires that risk mitigation measures be implemented "as far as technically feasible." Fail-safe governance behaviour is technically feasible, so its absence would not meet the Article 9 standard.

Article 9(5) specifically requires that risk management measures are such that any residual risk is judged acceptable. An AI agent governance system that fails open under infrastructure stress has a residual risk (unrestricted agent operation during outages) that would likely not be judged acceptable for high-risk applications.

FCA SYSC 13 — Operational Resilience

SYSC 13 requires firms to establish systems and controls to manage operational risk, including the risk of system failures. The FCA's operational resilience framework (PS21/3) requires firms to identify important business services, set impact tolerances, and ensure they can remain within those tolerances during severe but plausible disruption scenarios. AI agent governance is part of the control infrastructure for any important business service that uses AI agents. If governance fails during a disruption, the firm cannot demonstrate that the business service remained within its impact tolerance.

The FCA expects firms to conduct scenario testing that includes the failure of control systems, not just the failure of the business service itself. This directly maps to AG-008's requirement for failure injection testing.

DORA — Article 9 (ICT Risk Management Framework)

Article 9 of DORA requires financial entities to establish and maintain an ICT risk management framework that ensures resilience of ICT systems, including the control systems that govern AI agent operations. Governance infrastructure continuity under failure is a core element of ICT resilience for AI-driven financial operations.

HIPAA — 45 CFR 164.308(a)(7) (Contingency Plan)

The HIPAA Security Rule requires covered entities to establish and implement policies and procedures for responding to an emergency or other occurrence that damages systems containing electronic protected health information. For healthcare organisations deploying AI agents, this includes the governance systems that control agent access to patient data. AG-008's degraded-mode tiers map directly to the contingency plan requirement.

IEC 62443 — Security Levels (Critical Infrastructure)

For critical infrastructure deployments, IEC 62443 security levels should inform the enforcement architecture for degraded-mode behaviour. A governance failure that shuts down a safety-monitoring agent could itself create a hazard. Degraded-mode design must coordinate with safety instrumented systems and ensure governance failure does not interfere with safety-critical operations.

10. Failure Severity

Field | Value
Severity Rating | Critical
Blast Radius | Organisation-wide: a governance infrastructure failure affects all agents simultaneously, potentially across all business functions

Consequence chain: Without governance continuity controls, a targeted infrastructure attack — or even routine system failure — can disable governance entirely, creating an unprotected window for unrestricted agent operation. The severity of an AG-008 failure is directly proportional to the duration of the governance gap and the capability of the agents operating during that gap. An agent that can execute financial transactions without governance for even a few minutes can create exposure equivalent to months of human-speed activity. An agent with data access operating without audit for hours can exfiltrate datasets that would take a human insider weeks to collect.

AG-008 failures are particularly dangerous because they are self-concealing. If the audit system fails alongside the enforcement system, there may be no record that governance was absent. The failure mode is not just the absence of governance during the outage — it is the inability to demonstrate what happened during the outage, compounding the regulatory and legal exposure. Defence in depth requires that AG-008 be implemented on independent infrastructure so that a single root cause cannot disable both governance controls and the continuity mechanisms. A regulator who discovers that a routine infrastructure outage disabled governance entirely will infer that the organisation's governance framework is architecturally unsound — not merely that it experienced a technical failure.

Cross-references: AG-008 intersects with AG-001 (Operational Boundary Enforcement) for defining the boundaries that must remain enforced under failure; AG-006 (Tamper-Evident Record Integrity) for ensuring audit records continue to be captured during degraded operation; AG-007 (Governance Configuration Control) for governing which configuration version is used when the live source is unavailable; AG-012 (Agent Identity Assurance) for addressing how identity verification operates when identity infrastructure is degraded; and AG-019 (Human Escalation & Override Triggers) for escalation mechanisms during degraded-mode entry.

Cite this protocol
AgentGoverning. (2026). AG-008: Governance Continuity Under Failure. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-008