AG-166: Distributed Workflow Atomicity and Compensating-Action Governance

2. Summary

Distributed Workflow Atomicity and Compensating-Action Governance requires that every multi-step, multi-service workflow executed by an AI agent either completes in its entirety or is fully compensated — leaving no partial, orphaned, or inconsistent state across participating systems. When an AI agent orchestrates a workflow spanning multiple services (e.g., reserving inventory, charging a payment, sending a confirmation, updating a ledger), any failure at any step must trigger a deterministic compensating-action sequence that reverses or neutralises the effects of all previously completed steps. The compensating-action plan must be defined before the workflow begins, recorded immutably, and executable without the originating agent's cooperation. This dimension addresses the fundamental challenge that distributed systems cannot rely on traditional ACID transactions across service boundaries, and that AI agents operating at machine speed can create inconsistent distributed state far faster than human operators can detect and repair it.

3. Example

Scenario A — Partial Order Fulfilment Creates Phantom Charges: An enterprise procurement agent orchestrates a four-step purchase workflow: (1) reserve 500 units of inventory in the warehouse management system, (2) create a purchase order in the ERP, (3) initiate a GBP 47,500 payment to the supplier, and (4) send a shipment confirmation to the logistics partner. Step 1 succeeds — 500 units are reserved and become unavailable to other buyers. Step 2 succeeds — the purchase order is created. Step 3 fails — the payment gateway returns a timeout after 30 seconds. The agent retries step 3, which also fails. The agent logs the failure and stops.

The result: 500 units of inventory are reserved but will never be shipped. A purchase order exists with no corresponding payment. No shipment confirmation was sent, but the logistics partner's capacity planning system was already notified in a pre-step webhook. The inventory remains locked for 72 hours (the default reservation timeout), during which other legitimate orders cannot access those units. The organisation discovers the issue when a customer complaint arrives about unavailable stock.

What went wrong: No compensating-action plan existed. When step 3 failed, the agent had no instructions to release the inventory reservation (compensate step 1) or cancel the purchase order (compensate step 2). The pre-step webhook to logistics had no corresponding cancellation mechanism. Each participating system was internally consistent, but the overall distributed state was inconsistent. Consequence: GBP 47,500 in blocked working capital (inventory reserved but unsaleable), 72-hour stock availability gap affecting downstream orders estimated at GBP 120,000 in lost sales, manual intervention required across three systems to reconcile.

Scenario B — Compensating Actions Execute Out of Order: A financial services agent processes a client onboarding workflow: (1) create client record in the CRM, (2) open a trading account with a GBP 250,000 initial allocation, (3) configure compliance screening rules, (4) activate market data subscriptions. Step 4 fails because the market data vendor's API is down. The agent triggers compensating actions but executes them in forward order: it first attempts to delete the client record (compensate step 1), then close the trading account (compensate step 2). However, the trading account has a foreign-key dependency on the client record. Deleting the client record first causes the account closure to fail with a referential integrity violation. The trading account remains open with a GBP 250,000 allocation but no client record linking it to a responsible party, no compliance screening, and no market data — an orphaned account with allocated capital and no oversight.

What went wrong: Compensating actions must execute in reverse order of the original workflow steps to respect inter-step dependencies. The compensating-action plan was not ordered, and no dependency analysis was performed. Consequence: GBP 250,000 in unattributed capital allocation, regulatory finding for an account without KYC/AML screening, manual reconciliation requiring involvement from compliance, operations, and technology teams.

Scenario C — Agent Failure During Compensation: A cross-border payment agent executes a multi-currency transfer: (1) debit GBP 100,000 from the sender's account, (2) convert GBP to EUR at the current rate (EUR 116,400), (3) credit EUR 116,400 to the recipient's account at a partner bank. Step 3 fails — the partner bank rejects the credit due to sanctions screening. The agent begins compensating: it initiates a reverse currency conversion and re-credit to the sender. Midway through the compensation, the agent process crashes. The GBP 100,000 has been debited, EUR 116,400 exists in a holding account post-conversion, the reverse conversion has not occurred, and the re-credit has not occurred. No other system knows the compensation was in progress.

What went wrong: The compensating-action sequence was managed solely within the agent's runtime state. When the agent crashed, the compensation context was lost. No external orchestrator or durable state machine tracked the compensation progress. Consequence: GBP 100,000 in limbo, customer unable to access funds, FX exposure on the EUR holding position accumulating at approximately GBP 200 per hour of rate movement, 4-hour manual resolution requiring treasury, operations, and partner bank coordination.

4. Requirement Statement

Scope: This dimension applies to all AI agents that orchestrate workflows spanning two or more distinct services, systems, data stores, or external APIs where a failure at any step can leave the overall system in an inconsistent state. A "distinct service" is any component with independent state — a separate database, a separate API, a separate ledger, or a separate organisation's system. Single-service operations that rely on the service's own transactional guarantees (e.g., a single database transaction) are excluded, provided the agent does not combine that operation with actions on other services in the same logical workflow. The scope explicitly includes cross-organisational workflows where the agent interacts with external counterparties, partner APIs, or shared infrastructure — these are the highest-risk scenarios because the organisation does not control the external service's compensation capabilities.

4.1. A conforming system MUST define a compensating-action plan for every multi-step distributed workflow before the first step executes, specifying the reverse or neutralising action for each step.

4.2. A conforming system MUST record the compensating-action plan in a durable store independent of the agent's runtime state, such that compensation can proceed even if the originating agent fails.

4.3. A conforming system MUST execute compensating actions in reverse dependency order when a workflow step fails, respecting inter-step data dependencies and referential integrity constraints.

4.4. A conforming system MUST track the execution status of each workflow step and each compensating action in a durable, queryable state machine with at least three states per step: pending, completed, and compensated.

4.5. A conforming system MUST ensure that compensating actions are idempotent — repeated execution of the same compensating action produces the same result, preventing duplication errors during retry scenarios.

4.6. A conforming system MUST escalate to human review when a compensating action fails, rather than silently abandoning the compensation or retrying indefinitely.

4.7. A conforming system MUST log every workflow step execution and every compensating-action execution with timestamps, step identifiers, and outcome codes in a tamper-evident record per AG-006.

4.8. A conforming system SHOULD implement a timeout-based watchdog that detects stalled workflows (neither completing nor compensating) and triggers compensation automatically after a configurable threshold.

4.9. A conforming system SHOULD support semantic compensation where exact reversal is impossible — for example, sending a corrective notification when an erroneous notification cannot be unsent, or issuing a credit when a charge cannot be reversed.

4.10. A conforming system MAY implement workflow checkpointing that allows partial retry from the last successful step rather than full compensation and restart.
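
Requirements 4.1, 4.2 and 4.4 can be illustrated with a minimal sketch. It assumes a SQLite database as the durable store and uses the Scenario A workflow; the table names, the function names (init_store, persist_plan, may_start) and the step identifiers are hypothetical and not part of this protocol.

```python
import json
import sqlite3
from enum import Enum


class StepState(str, Enum):
    """The three minimum per-step states named in 4.4."""
    PENDING = "pending"
    COMPLETED = "completed"
    COMPENSATED = "compensated"


def init_store(conn):
    """Create the (hypothetical) plan and step-status tables if missing."""
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS plans ("
            "workflow_id TEXT PRIMARY KEY, plan TEXT NOT NULL)"
        )
        conn.execute(
            "CREATE TABLE IF NOT EXISTS step_status ("
            "workflow_id TEXT NOT NULL, step_id TEXT NOT NULL, "
            "state TEXT NOT NULL, PRIMARY KEY (workflow_id, step_id))"
        )


def persist_plan(conn, workflow_id, plan):
    """Write the plan and one 'pending' row per step in a single transaction."""
    with conn:  # commits on success, rolls back if any insert fails
        conn.execute(
            "INSERT INTO plans (workflow_id, plan) VALUES (?, ?)",
            (workflow_id, json.dumps(plan)),
        )
        conn.executemany(
            "INSERT INTO step_status (workflow_id, step_id, state) VALUES (?, ?, ?)",
            [(workflow_id, s["step_id"], StepState.PENDING.value) for s in plan],
        )


def may_start(conn, workflow_id):
    """Gate for step 1: True only once a plan has been persisted."""
    row = conn.execute(
        "SELECT 1 FROM plans WHERE workflow_id = ?", (workflow_id,)
    ).fetchone()
    return row is not None


# In-memory database for illustration only; a real store must outlive the agent.
conn = sqlite3.connect(":memory:")
init_store(conn)
plan = [
    {"step_id": "reserve_inventory", "compensation": "release_inventory"},
    {"step_id": "create_purchase_order", "compensation": "cancel_purchase_order"},
    {"step_id": "initiate_payment", "compensation": "refund_payment"},
    {"step_id": "send_shipment_confirmation", "compensation": "send_shipment_cancellation"},
]
assert not may_start(conn, "wf-001")  # no plan yet, so step 1 must not run
persist_plan(conn, "wf-001", plan)
assert may_start(conn, "wf-001")
```

Writing the plan and the initial step rows in one transaction means an observer never sees a plan without its status rows, which is one way to keep the state machine queryable from the first moment.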

5. Rationale

Distributed Workflow Atomicity and Compensating-Action Governance addresses a fundamental challenge that intensifies when AI agents orchestrate business processes: the impossibility of traditional ACID transactions across service boundaries combined with the speed at which agents create distributed state. Human operators orchestrating multi-step processes naturally pause, verify intermediate results, and manually intervene when something goes wrong. AI agents operating at machine speed can create inconsistent distributed state across dozens of systems in seconds, leaving a tangled web of partial completions that requires hours or days of manual reconciliation.

The core problem is that modern architectures are distributed by design. An agent placing an order may touch an inventory service, a payment gateway, an ERP system, a logistics API, and a notification service — each with independent state, independent failure modes, and independent transaction boundaries. A traditional database transaction cannot span these boundaries. The saga pattern from distributed systems engineering provides the architectural answer: each step has a defined compensating action, and the overall workflow either completes fully or is fully compensated. AG-166 mandates this pattern for AI agent workflows.

The compensating-action plan must exist before the workflow begins because the agent may not be available to design compensation after a failure. If the agent crashes mid-workflow, the system must be able to complete the compensation without the agent's participation. This is why the plan must be recorded durably and independently — it is the distributed equivalent of a transaction rollback plan.

Ordering matters critically. Compensating actions must execute in reverse dependency order because later steps often depend on earlier steps. Deleting a parent record before closing a child record violates referential integrity. Re-crediting a sender before the currency conversion has been reversed leaves funds in the wrong denomination. The compensating-action plan must encode these dependencies explicitly.
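
One way to derive a reverse dependency order is a topological sort of the step graph, reversed and filtered to the steps that actually completed. The sketch below uses Python's standard graphlib module (Python 3.9+) on the Scenario B onboarding workflow. The step names and the dependency edges for steps 3 and 4 are assumptions made for illustration, as is the function name compensation_order.

```python
from graphlib import TopologicalSorter


def compensation_order(dependencies, completed):
    """Return completed steps ordered so that dependents come before their parents.

    dependencies maps each step to the set of steps it depends on.
    """
    forward = list(TopologicalSorter(dependencies).static_order())
    return [step for step in reversed(forward) if step in completed]


# Assumed dependency graph for the Scenario B onboarding workflow.
dependencies = {
    "create_client_record": set(),
    "open_trading_account": {"create_client_record"},
    "configure_screening": {"open_trading_account"},
    "activate_market_data": {"open_trading_account"},
}

# Step 4 failed, so only the first three steps need compensating.
completed = {"create_client_record", "open_trading_account", "configure_screening"}
order = compensation_order(dependencies, completed)
print(order)
```

With this graph the trading account is always closed before the client record is deleted, which avoids the referential integrity violation described in Scenario B.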

The financial stakes are significant. In financial services, a single orphaned workflow can leave millions of pounds in limbo. In healthcare, a partial workflow could result in a medication being prescribed but the contraindication check not completing. In logistics, a partial shipment workflow could result in goods being dispatched without customs documentation. AG-166 ensures that the governance framework treats distributed consistency as a first-class concern, not an afterthought.

6. Implementation Guidance

The saga pattern is the established architectural approach for distributed workflow atomicity. There are two primary variants: choreography-based sagas (where each service publishes events and other services react) and orchestration-based sagas (where a central coordinator directs the workflow). For AI agent governance, the orchestration-based approach is strongly preferred because it provides a single point of visibility into workflow state and compensating-action progress.

Recommended patterns:

Durable state machine orchestrator. Implement a workflow orchestrator that persists the state of each workflow step to a durable store (e.g., a database table or message queue with delivery guarantees). The orchestrator holds the compensating-action plan and advances the workflow step by step. If a step fails, the orchestrator walks the plan backwards, executing compensating actions. The orchestrator is independent of the agent runtime — if the agent crashes, the orchestrator detects the stall via timeout and begins compensation autonomously. Example: a Temporal.io workflow, an AWS Step Functions state machine, or a custom orchestrator backed by PostgreSQL with row-level locking.
Compensation ledger pattern. Maintain a compensation ledger that records, for each workflow instance, every completed step and its corresponding compensating action with execution status. The ledger serves as both the plan and the audit trail. A background reconciliation process periodically scans the ledger for workflows that are neither fully completed nor fully compensated and triggers resolution — either completing stalled compensations or escalating to human review.
Semantic compensation registry. For actions that cannot be exactly reversed (e.g., a sent email, a published notification, a submitted regulatory filing), maintain a registry mapping each action type to its semantic compensating action. For example: "email sent" maps to "send correction email"; "regulatory filing submitted" maps to "submit amendment filing"; "notification published" maps to "publish retraction notification". The registry ensures that compensating actions are pre-defined and available without the agent needing to reason about compensation in real time.
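
The orchestrator and ledger patterns above can be sketched as follows. This is a simplified illustration, not a production design: the ledger is a plain dict standing in for a durable store, the steps are assumed to be listed in dependency order, and the names Step, run_saga and compensate_completed are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    step_id: str
    action: Callable[[], None]
    compensate: Callable[[], None]


def compensate_completed(steps: List[Step], ledger: Dict[str, str]) -> None:
    """Walk the plan backwards, compensating only steps recorded as completed.

    Because it reads the ledger rather than local variables, a restarted
    orchestrator can call it again to resume an interrupted compensation.
    """
    for step in reversed(steps):
        if ledger[step.step_id] == "completed":
            step.compensate()
            ledger[step.step_id] = "compensated"


def run_saga(steps: List[Step], ledger: Dict[str, str]) -> bool:
    """Run steps in order; on the first failure, compensate and return False."""
    for step in steps:
        ledger[step.step_id] = "pending"  # the plan is recorded before step 1 runs
    for step in steps:
        try:
            step.action()
        except Exception:
            compensate_completed(steps, ledger)
            return False
        ledger[step.step_id] = "completed"
    return True


# Scenario A, with the payment step failing.
events: List[str] = []


def failing_payment() -> None:
    raise TimeoutError("payment gateway timeout")


steps = [
    Step("reserve_inventory", lambda: events.append("reserve"), lambda: events.append("release")),
    Step("create_purchase_order", lambda: events.append("create_po"), lambda: events.append("cancel_po")),
    Step("initiate_payment", failing_payment, lambda: events.append("refund")),
    Step("send_confirmation", lambda: events.append("confirm"), lambda: events.append("retract")),
]
ledger: Dict[str, str] = {}
succeeded = run_saga(steps, ledger)
print(succeeded, events, ledger)
```

In this sketch a step that fails simply stays "pending", since only the three minimum states are modelled; an implementation may prefer an explicit failed state.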

Anti-patterns to avoid:

Agent-memory-only compensation. Storing the compensating-action plan only in the agent's context window or runtime memory. If the agent crashes, the plan is lost and no compensation can occur. This is the single most common failure mode in agent-orchestrated distributed workflows.
Forward-order compensation. Executing compensating actions in the same order as the original workflow steps rather than reverse order. This reliably produces referential integrity failures, orphaned resources, and inconsistent state that is harder to resolve than the original failure.
Fire-and-forget compensation. Triggering compensating actions without tracking their execution status. If a compensating action itself fails, the system has no mechanism to detect or resolve the failure. The result is a partially compensated workflow — arguably worse than an uncompensated one because it is harder to diagnose.
Infinite retry without escalation. Retrying a failed compensating action indefinitely without escalation. Some failures are permanent — the external service may have permanently rejected the reversal, the reversal window may have closed, or the compensating action may itself have a precondition that is no longer met. Infinite retry delays human intervention and can consume system resources.
Implicit compensation assumptions. Assuming that a service's timeout or expiry mechanism will serve as compensation (e.g., "the reservation will expire in 72 hours"). Expiry-based cleanup is not compensation — it introduces an uncontrolled window during which the inconsistent state persists and can cause downstream failures.
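
As a contrast to the infinite-retry anti-pattern, a bounded retry with escalation might look like the sketch below. The exception class PermanentFailure, the function name and the escalate callback are hypothetical; backoff between attempts is omitted for brevity and would normally be added.

```python
class PermanentFailure(Exception):
    """Raised by a compensating action when retrying cannot help (hypothetical)."""


def compensate_with_escalation(compensate, escalate, max_attempts=3):
    """Try a compensating action a bounded number of times, then escalate.

    Returns True if compensation succeeded, False if it was escalated.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            compensate()
            return True
        except PermanentFailure as exc:
            last_error = exc
            break  # no point retrying a permanent rejection
        except Exception as exc:
            last_error = exc  # transient failure: try again until the budget is spent
    escalate(last_error)
    return False


# A compensating action that always times out.
calls = {"n": 0}


def flaky_release():
    calls["n"] += 1
    raise RuntimeError("warehouse API timeout")


escalations = []
ok = compensate_with_escalation(flaky_release, escalations.append, max_attempts=3)
print(ok, calls["n"], len(escalations))
```

The escalate callback would, in practice, carry the workflow ID and step states needed for manual resolution rather than just the last error.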

Industry Considerations

Financial Services. Payment workflows are the highest-risk scenario. A failed multi-leg payment can leave funds debited but not credited, creating reconciliation breaks that regulators treat as control failures. Compensating actions for payment workflows must account for settlement finality — some payments cannot be reversed after settlement, requiring a credit-based compensation rather than a reversal. FCA expectations under SYSC 6.1.1R and DORA Article 9 both require demonstrable controls for transaction integrity across distributed systems.

Healthcare. Clinical workflows spanning electronic health records, pharmacy systems, lab ordering systems, and billing systems require compensation that accounts for patient safety. A compensating action that cancels a medication order must also trigger notification to the prescribing clinician. HIPAA requires audit trails for all actions and compensating actions touching patient data.

Critical Infrastructure. Compensating actions in safety-critical systems (e.g., power grid, water treatment) must be validated for physical safety before execution. Reversing a valve opening is not simply "close the valve" — it requires consideration of current flow rates, pressure differentials, and downstream dependencies. Compensating-action plans in CPS environments must undergo safety review per IEC 61508 or equivalent.

Maturity Model

Basic Implementation — Compensating-action plans are defined for critical workflows in documentation. The agent runtime includes a try-catch mechanism that attempts compensating actions on failure. The compensating-action plan is held in the agent's memory. Step status is logged but not tracked in a durable state machine. Failed compensations are logged and manually triaged. Coverage: at least 80% of multi-step workflows have documented compensating actions.

Intermediate Implementation — A durable state machine orchestrator manages all multi-step workflows. Compensating-action plans are persisted before the first step executes. Step and compensation status is tracked in a queryable store with at least three states per step. Compensating actions execute in reverse dependency order automatically. Failed compensations escalate to human review. A watchdog detects stalled workflows and triggers compensation after a configurable timeout. Coverage: 100% of multi-step workflows are orchestrated.

Advanced Implementation — All intermediate capabilities plus: semantic compensation is supported for non-reversible actions with a pre-defined compensation registry. Compensating actions have been verified through chaos engineering — deliberate injection of failures at every workflow step to validate that compensation completes correctly. Cross-organisational workflows include contractual SLAs for compensation endpoints. The system can demonstrate to auditors that no workflow can leave persistent inconsistent state across participating systems. Automated reconciliation runs continuously, detecting and resolving any drift between expected and actual distributed state.

7. Evidence Requirements

Required artefacts:

Compensating-action plan repository. The collection of compensating-action plans for all registered multi-step workflows, showing the reverse action for each step, dependency ordering, and timeout thresholds. Format: structured data (JSON, YAML, or database export).
Workflow state machine logs. Timestamped records of every workflow instance showing step-by-step execution status, compensating-action invocations, and final outcome (completed or fully compensated). Retained for at least the periods given under Retention requirements.
Failed compensation escalation records. Records of every instance where a compensating action failed and was escalated to human review, including resolution actions taken and time to resolution.
Reconciliation reports. Periodic reconciliation reports demonstrating that no persistent inconsistent state exists across participating systems.

Retention requirements:

Workflow state logs and compensating-action plans: minimum 7 years for regulated financial services; minimum 5 years for other regulated sectors; minimum 3 years otherwise.

Access requirements:

Producible to regulators or auditors within 48 hours of request. Evidence must exist as retained artefacts rather than being reconstructed after the fact.

8. Test Specification

Test 8.1: Compensating-Action Plan Existence

Stimulus: Initiate a multi-step workflow and inspect the durable store before the first step executes.
Expected behaviour: A compensating-action plan exists in the durable store specifying the reverse action for each step, ordered by reverse dependency.
Pass criteria: The plan is persisted before the first step begins. The plan includes a compensating action for every step.
Fail criteria: The first step executes before the compensating-action plan is persisted, or any step lacks a defined compensating action.

Test 8.2: Mid-Workflow Failure Triggers Compensation

Stimulus: Execute a 5-step workflow. Inject a failure at step 3 (e.g., return HTTP 500 from the target service).
Expected behaviour: Steps 1 and 2 are compensated in reverse order (step 2 first, then step 1). Step 3 is not compensated (it did not complete). Steps 4 and 5 are not attempted.
Pass criteria: After compensation, all participating systems are in a state consistent with the workflow never having started. The state machine shows steps 1 and 2 as "compensated."
Fail criteria: Any completed step remains uncompensated, compensating actions execute in forward order, or the system attempts steps 4 or 5 after step 3 fails.

Test 8.3: Agent Crash During Workflow

Stimulus: Execute a 4-step workflow. After step 2 completes, terminate the agent process.
Expected behaviour: The watchdog detects the stalled workflow after the configured timeout. Compensation begins autonomously without the agent. Steps 1 and 2 are compensated in reverse order.
Pass criteria: Compensation completes without agent participation. The state machine accurately reflects the compensated state.
Fail criteria: The workflow remains in a stalled state indefinitely, or compensation requires agent restart.
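
The stall detection exercised by this test could be sketched as a periodic scan over workflow records. The record shape (status plus last-progress timestamp), the status labels and the function name find_stalled are assumptions for illustration; the 10-minute threshold is an arbitrary example of the configurable value.

```python
from datetime import datetime, timedelta, timezone

# Statuses in which a workflow needs no further action (assumed labels).
TERMINAL = {"completed", "compensated"}


def find_stalled(workflows, now, threshold):
    """Return IDs of non-terminal workflows with no progress for longer than threshold.

    workflows maps workflow_id -> (status, last_progress_at).
    """
    return sorted(
        wf_id
        for wf_id, (status, last_progress) in workflows.items()
        if status not in TERMINAL and now - last_progress > threshold
    )


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
workflows = {
    "wf-1": ("in_progress", now - timedelta(minutes=30)),   # agent crashed mid-workflow
    "wf-2": ("in_progress", now - timedelta(minutes=2)),    # still making progress
    "wf-3": ("completed", now - timedelta(hours=5)),        # finished long ago
    "wf-4": ("compensating", now - timedelta(minutes=45)),  # compensation itself stalled
}
stalled = find_stalled(workflows, now, threshold=timedelta(minutes=10))
print(stalled)
```

Including stalled compensations in the scan matters, because a workflow stuck mid-compensation (as in Test 8.4) is no less inconsistent than one stuck mid-execution.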

Test 8.4: Agent Crash During Compensation

Stimulus: Execute a workflow, inject a failure at a step after step 3 (for example, step 4) to trigger compensation, then terminate the agent process during compensation (after compensating step 3 but before compensating step 2).
Expected behaviour: The orchestrator resumes compensation from the last incomplete compensating action. Steps 2 and 1 are compensated. No compensating action is executed twice (idempotency).
Pass criteria: Compensation resumes and completes. No duplicated compensating actions. Final state is fully compensated.
Fail criteria: Compensation stalls permanently, compensating actions are duplicated, or the system loses track of compensation progress.

Test 8.5: Compensating-Action Idempotency

Stimulus: Execute a compensating action for the same workflow step three times in succession.
Expected behaviour: The first execution performs the compensation. The second and third executions detect that compensation is already complete and return success without side effects.
Pass criteria: Exactly one compensating effect occurs regardless of invocation count. No errors on repeated invocation.
Fail criteria: Repeated invocation causes duplicate effects (e.g., double refund) or raises errors.
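
A common way to meet this test is to key each compensating action on a unique compensation identifier and record completion before reporting success. The sketch below is a minimal in-memory illustration; the class RefundService, its method names and the identifier format are hypothetical, and a real service would keep the record in durable storage.

```python
class RefundService:
    """Toy refund endpoint that applies each compensation at most once."""

    def __init__(self):
        self.applied = {}        # compensation_id -> amount already refunded
        self.total_refunded = 0

    def refund(self, compensation_id, amount):
        """Apply the refund once; later calls with the same ID are no-ops."""
        if compensation_id in self.applied:
            return "already_compensated"  # success, with no further side effect
        self.applied[compensation_id] = amount
        self.total_refunded += amount
        return "compensated"


service = RefundService()
results = [service.refund("wf-001:initiate_payment", 47_500) for _ in range(3)]
print(results, service.total_refunded)
```

Deriving the identifier from the workflow ID and step ID means a retrying orchestrator naturally reuses the same key without extra coordination.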

Test 8.6: Failed Compensation Escalates to Human Review

Stimulus: Execute a workflow, inject a failure at step 3, then configure the step 2 compensating action to fail permanently (e.g., return HTTP 410 Gone).
Expected behaviour: The system attempts the compensating action, detects the permanent failure, and escalates to human review per AG-019. The escalation includes the workflow ID, the failed compensating action, the current state of all steps, and the remaining compensating actions that have not been attempted.
Pass criteria: Human escalation is triggered within 5 minutes of the permanent failure. The escalation contains sufficient context for manual resolution.
Fail criteria: The system retries indefinitely, silently abandons the compensation, or escalates without sufficient context.

Test 8.7: Reverse Dependency Ordering Validation

Stimulus: Define a workflow where step 2 creates a child record dependent on step 1's parent record. Inject a failure at step 3 to trigger compensation.
Expected behaviour: Step 2 is compensated first (delete child), then step 1 (delete parent). Referential integrity is maintained throughout.
Pass criteria: No referential integrity violations during compensation. Compensating actions execute in reverse dependency order.
Fail criteria: Referential integrity violation occurs, or compensating actions execute in forward order.

Conformance Scoring

Score 0: No compensating-action governance exists — multi-step workflows have no defined compensation mechanism, and partial failures leave inconsistent distributed state.
Score 1: Compensating actions are defined but managed within the agent's runtime memory — agent failure during a workflow results in uncompensated partial state.
Score 2: Compensating-action plans are persisted in a durable store independent of the agent runtime, compensating actions execute in reverse dependency order, and failed compensations escalate to human review.
Score 3: All Score 2 criteria are met and verified through chaos engineering: deliberate failure injection at every workflow step confirms that compensation completes correctly under all failure scenarios, including agent crash, network partition, and external service unavailability.

9. Regulatory Mapping

Regulation	Provision	Relationship Type
EU AI Act	Article 9 (Risk Management System)	Supports compliance
EU AI Act	Article 15 (Accuracy, Robustness and Cybersecurity)	Direct requirement
SOX	Section 404 (Internal Controls Over Financial Reporting)	Direct requirement
FCA SYSC	6.1.1R (Systems and Controls)	Direct requirement
NIST AI RMF	MANAGE 2.2, MANAGE 2.4	Supports compliance
ISO 42001	Clause 6.1 (Actions to Address Risks), Clause 8.2 (AI Risk Assessment)	Supports compliance
DORA	Article 9 (ICT Risk Management Framework)	Direct requirement

EU AI Act — Article 15 (Accuracy, Robustness and Cybersecurity)

Article 15 requires that high-risk AI systems achieve an appropriate level of accuracy, robustness, and cybersecurity. Distributed workflow atomicity directly supports robustness — an agent that leaves inconsistent distributed state on failure is not robust. The requirement for compensating-action plans that survive agent failure addresses the resilience expectation. The requirement for reverse-order compensation addresses the accuracy of recovery operations.

SOX — Section 404 (Internal Controls Over Financial Reporting)

For AI agents executing financial workflows, SOX Section 404 requires internal controls that prevent unauthorised or erroneous transactions from affecting financial reporting. A partial workflow that debits an account but fails to complete the corresponding credit creates a reconciliation break that directly affects financial reporting accuracy. AG-166 provides the control framework to ensure that financial workflows either complete fully or are fully compensated, maintaining ledger integrity.

FCA SYSC — 6.1.1R (Systems and Controls)

SYSC 6.1.1R requires firms to maintain adequate systems and controls. For distributed agent workflows, this includes controls ensuring transaction integrity across service boundaries. A firm that cannot demonstrate that partial workflow failures are automatically detected and compensated has inadequate systems and controls for AI-driven operations.

DORA — Article 9 (ICT Risk Management Framework)

DORA requires financial entities to maintain ICT resilience, including the ability to recover from operational disruptions. Distributed workflow compensation is a core resilience capability — it ensures that a failure in any participating system does not leave the overall business process in an unrecoverable state.

10. Failure Severity

Field	Value
Severity Rating	High
Blast Radius	Cross-system — potentially cross-organisation where workflows involve external counterparties

Consequence chain: Without distributed workflow atomicity governance, a single step failure in a multi-step agent workflow creates inconsistent state across every participating system. The immediate technical failure is orphaned resources — reserved inventory that will never be used, debited funds that were never credited, created records with no corresponding downstream records. The operational impact scales with workflow frequency and the number of participating systems: an agent executing 200 workflows per hour with a 2% failure rate generates 4 inconsistent states per hour, each requiring manual reconciliation across multiple systems. At GBP 25,000 average workflow value, this represents GBP 100,000 per hour in potentially stranded or misallocated funds. The business consequence includes regulatory findings for inadequate transaction integrity controls, customer impact from locked resources and delayed processing, reconciliation costs estimated at 2-4 hours of specialist time per incident, and reputational damage from visible service failures. In cross-organisational workflows, the blast radius extends to counterparty systems that the organisation does not control, creating legal and contractual exposure.
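
The figures in the consequence chain can be reproduced with a short calculation. The inputs are the illustrative values quoted above; the variable names are arbitrary, and the specialist-hours line is a derived figure obtained by multiplying the quoted 2-4 hours per incident by the incident rate.

```python
workflows_per_hour = 200
failure_rate_percent = 2
average_value_gbp = 25_000
reconciliation_hours_per_incident = (2, 4)  # low and high estimate

# Integer arithmetic avoids floating-point rounding in the percentages.
failures_per_hour = workflows_per_hour * failure_rate_percent // 100
stranded_gbp_per_hour = failures_per_hour * average_value_gbp
specialist_hours_per_hour = tuple(
    failures_per_hour * hours for hours in reconciliation_hours_per_incident
)

print(failures_per_hour)          # 4
print(stranded_gbp_per_hour)      # 100000
print(specialist_hours_per_hour)  # (8, 16)
```

On these assumptions each hour of operation generates 8 to 16 hours of specialist reconciliation work, so manual repair alone cannot keep pace without several specialists working in parallel.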

Cross-references: AG-011 (Action Reversibility and Settlement Integrity) for reversibility classification of individual steps; AG-006 (Tamper-Evident Record Integrity) for immutable logging of workflow and compensation events; AG-019 (Human Escalation & Override Triggers) for escalation when compensation fails; AG-049 (Governance Decision Explainability) for explaining compensation decisions; AG-164 (Idempotency and Exactly-Once Execution Governance) for ensuring compensating actions are idempotent; AG-165 (Concurrency Control and Distributed Lock Governance) for preventing concurrent compensation conflicts.

Cite this protocol

AgentGoverning. (2026). AG-166: Distributed Workflow Atomicity and Compensating-Action Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-166
