AG-381

Retry Budget by Error Class Governance

Runtime Execution, Workflow & State · AGS v2.1 · April 2026
EU AI Act · SOX · FCA · NIST · ISO 42001

2. Summary

Retry Budget by Error Class Governance requires that every AI agent system classifies execution failures into distinct error classes — transient infrastructure errors, semantic reasoning errors, and policy or governance violations — and enforces separate, finite retry budgets for each class at the infrastructure layer. Undifferentiated retry logic treats a temporary network timeout identically to a permanent policy denial, causing agents to waste resources hammering immovable barriers, amplifying downstream load, and accumulating cost or risk with each attempt. This dimension ensures that retry behaviour is structurally bounded per error class, that policy failures are never retried without human review, and that the total retry envelope across all classes is capped to prevent resource exhaustion and cost overruns.

3. Example

Scenario A — Undifferentiated Retry Floods Payment Gateway: A financial-value agent processes settlement payments through an external payment gateway. The gateway returns HTTP 503 (Service Unavailable) due to scheduled maintenance. The agent's retry logic treats all failures identically with an exponential backoff capped at 10 retries. The agent has 2,400 pending settlements queued, each of which independently initiates its own retry chain. Within 8 minutes, the agent generates 24,000 retry attempts against the gateway. When the gateway recovers, it processes a burst of duplicate settlement instructions because 340 of the original requests had actually been received and queued on the gateway side before the 503 responses were issued. The organisation discovers £2.3 million in duplicate settlements the following business day during reconciliation.

What went wrong: The retry logic did not distinguish between a clean rejection (request never received) and an ambiguous failure (request may have been received but acknowledgement failed). All 2,400 workflows retried independently, with no aggregate retry cap across the agent's entire pending queue. The per-workflow retry budget of 10 was individually reasonable but collectively catastrophic: 2,400 workflows times 10 retries generated a 24,000-request storm. No idempotency check existed to detect that some original requests had already been accepted. Consequence: £2.3 million in duplicate settlements requiring manual reversal; penalty fees of £47,000 imposed by the payment gateway provider for excessive API calls during the recovery window; and an FCA supervisory inquiry into the adequacy of systems and controls.
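The missing idempotency check in Scenario A can be illustrated with a minimal sketch. It assumes the receiving side deduplicates on a client-supplied key that stays the same across retries of one settlement. `IdempotentGateway`, `submit`, and `idempotency_key` are hypothetical names for illustration and do not describe any real gateway's API.

```python
import hashlib


def idempotency_key(settlement_id: str) -> str:
    """Derive a stable key from the settlement, so every retry reuses it."""
    return hashlib.sha256(f"settlement:{settlement_id}".encode()).hexdigest()


class IdempotentGateway:
    """Hypothetical stand-in for a gateway that deduplicates by key."""

    def __init__(self) -> None:
        self._accepted: dict[str, str] = {}  # key -> receipt from first acceptance

    def submit(self, key: str, amount_pence: int) -> tuple[str, bool]:
        """Return (receipt, newly_executed). A repeated key never executes twice."""
        if key in self._accepted:
            return self._accepted[key], False
        receipt = f"rcpt-{len(self._accepted) + 1}"
        self._accepted[key] = receipt
        return receipt, True


gateway = IdempotentGateway()
key = idempotency_key("S-0001")
first = gateway.submit(key, 125_000)  # original request, accepted before the 503
retry = gateway.submit(key, 125_000)  # retry after the ambiguous failure
```

Under this assumption the retry returns the original receipt and executes nothing, so an ambiguous failure can be retried without creating a second settlement.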

Scenario B — Policy Denial Retried Into Compliance Violation: An enterprise workflow agent processes employee expense reimbursements. An employee submits a £12,000 expense claim for a client dinner that exceeds the £5,000 policy limit. The governance layer correctly rejects the claim with error code POLICY_LIMIT_EXCEEDED. The agent's retry logic, designed for transient infrastructure failures, interprets the rejection as a temporary error and retries the submission 5 times over 15 minutes. On the third retry, a concurrent policy configuration deployment temporarily widens the limit to £15,000 as part of a staged rollout for a different business unit. The agent's retry succeeds, and the £12,000 claim is approved and paid. The configuration is corrected 4 minutes later, but the payment has already been initiated.

What went wrong: The agent retried a policy denial as though it were a transient failure. Policy denials are by definition not transient — they reflect a governance decision, not an infrastructure condition. Retrying a policy denial is functionally equivalent to repeatedly attempting to bypass a governance control. The timing coincidence with the configuration deployment created a window where the retry succeeded against the wrong policy version. Consequence: £12,000 in policy-violating reimbursement paid, SOX audit finding for inadequate segregation between retry logic and policy enforcement, remediation cost of £180,000 for implementing error-class-aware retry across all agent workflows.

Scenario C — Semantic Error Retry Loop Exhausts Token Budget: A customer-facing agent assists users with insurance claim assessments. The agent calls an underwriting model that returns a semantic error: "Unable to assess risk — property type 'floating home' not in risk model taxonomy." The agent's undifferentiated retry logic retries the identical request 8 times, consuming 45,000 tokens per attempt for context assembly and response parsing. Total token consumption for the failed operation reaches 360,000 tokens at a cost of $14.40 per claim. The agent processes 1,200 similar claims for floating homes over a weekend (a regional flooding event drives the volume), each burning 360,000 tokens on futile retries. The weekend's wasted token spend reaches $17,280, and no claims are actually processed — every one ends in the same semantic failure after exhausting retries.

What went wrong: The semantic error — an input outside the model's domain — was structurally irrecoverable through retry. Submitting the identical request repeatedly could never produce a different result because the failure was in the request's content, not in infrastructure availability. The retry logic should have classified this as a semantic error after the first failure, ceased retrying, and routed the claim to a human underwriter. Consequence: $17,280 in wasted compute spend, 1,200 customer claims unprocessed for 72+ hours during a flood event, regulatory complaint from the insurance ombudsman for unreasonable processing delays, reputational damage from customer social media posts about abandoned claims.
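One way to stop identical-payload retries after a semantic failure is to fingerprint the request and refuse any re-submission whose fingerprint has already failed. The sketch below is illustrative only: `SemanticRetryGate`, its methods, and the sample claim fields are assumptions, not part of the protocol.

```python
import hashlib
import json


def fingerprint(payload: dict) -> str:
    """Canonical hash of a request payload (key order does not matter)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


class SemanticRetryGate:
    """Blocks re-submission of a payload that already failed semantically."""

    def __init__(self) -> None:
        self._failed: set[str] = set()

    def record_semantic_failure(self, payload: dict) -> None:
        self._failed.add(fingerprint(payload))

    def may_submit(self, payload: dict) -> bool:
        return fingerprint(payload) not in self._failed


gate = SemanticRetryGate()
claim = {"claim_id": "C-17", "property_type": "floating home"}
gate.record_semantic_failure(claim)

# Same content in a different key order is still the same request: blocked.
same_again = gate.may_submit({"property_type": "floating home", "claim_id": "C-17"})
# A changed payload passes the gate (hypothetical replacement value).
modified = gate.may_submit({"claim_id": "C-17", "property_type": "houseboat"})
```

The gate only detects that the payload changed. Whether the change actually addresses the semantic failure reason would need a separate check, and a blocked claim would still need routing to a human underwriter.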

4. Requirement Statement

Scope: This dimension applies to all AI agent systems that implement retry logic for failed operations, including but not limited to: external API calls, internal service invocations, database operations, model inference calls, tool executions, and inter-agent communications. Any agent that can re-attempt a failed operation — whether through explicit retry logic, workflow re-execution, or queue-based redelivery — is within scope. The scope extends to implicit retries: an agent that re-submits a request by constructing a new workflow step that performs the same operation as a failed prior step is retrying, even if the retry logic is not explicit. The scope also covers retry amplification across agent populations: when multiple agents share a downstream dependency, the aggregate retry load across all agents must be governed, not only per-agent retry budgets.

4.1. A conforming system MUST classify every execution failure into one of at least three error classes — transient (infrastructure-level, likely to resolve without intervention), semantic (request-level, structurally irrecoverable without modification), and policy (governance-level, denied by policy or mandate) — before any retry decision is made.

4.2. A conforming system MUST enforce a finite, configurable retry budget for each error class independently, where the budget for policy errors is zero unless explicitly overridden by an authorised human reviewer.

4.3. A conforming system MUST route policy-class errors to human review or a dead-letter queue (per AG-382) rather than retrying them, ensuring that governance denials are never circumvented through repeated submission.

4.4. A conforming system MUST enforce an aggregate retry ceiling across all error classes and all concurrent workflows for a given agent, preventing collective retry amplification from exceeding infrastructure or cost thresholds.

4.5. A conforming system MUST halt retries for semantic errors after the first failure when the request payload is unchanged, unless the retry includes a modified payload that addresses the semantic failure reason.

4.6. A conforming system MUST log every retry attempt with the error class, attempt number, original failure reason, and time elapsed since the initial failure, in a tamper-evident record consistent with AG-006.

4.7. A conforming system SHOULD implement backoff strategies for transient-class retries that include jitter to prevent synchronised retry storms across concurrent agent instances.

4.8. A conforming system SHOULD expose retry budget consumption metrics in real time to monitoring infrastructure, enabling operators to detect retry budget exhaustion before it causes workflow starvation.

4.9. A conforming system SHOULD implement circuit-breaker patterns that suspend retries for a dependency when transient failure rates exceed a configurable threshold, preventing continued load on a degraded service.

4.10. A conforming system MAY implement adaptive retry budgets that reduce permitted retries when aggregate system load exceeds defined thresholds, shedding retry load to preserve capacity for primary execution.
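Requirement 4.6 asks for a tamper-evident retry record. A hash chain is one possible way to approach this, shown here as a sketch under stated assumptions: the `RetryLog` class and its field names are hypothetical, and AG-006 may prescribe a different mechanism.

```python
import hashlib
import json
from dataclasses import dataclass, field


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


@dataclass
class RetryLog:
    """Append-only log in which each entry commits to the previous one."""

    entries: list = field(default_factory=list)

    def append(self, error_class: str, attempt: int, reason: str, elapsed_s: float) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        body = {
            "error_class": error_class,
            "attempt": attempt,
            "reason": reason,          # original failure reason
            "elapsed_s": elapsed_s,    # time since the initial failure
            "prev_hash": prev_hash,
        }
        body["hash"] = _digest(body)   # digest is computed before "hash" is added
        self.entries.append(body)
        return body

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev_hash = "0" * 64
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or _digest(body) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True


log = RetryLog()
log.append("transient", 1, "HTTP 503", 0.0)
log.append("transient", 2, "HTTP 503", 2.5)
intact = log.verify()
log.entries[0]["reason"] = "HTTP 200"  # simulated tampering
still_valid = log.verify()
```

A chain like this only detects modification if the latest hash is also anchored somewhere the writer cannot alter; that anchoring is outside the scope of the sketch.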

5. Rationale

Retry logic is among the most consequential and least governed aspects of autonomous agent systems. When an agent's action fails, the decision of whether and how to retry determines whether the agent wastes resources on irrecoverable failures, amplifies load on degraded infrastructure, or circumvents governance controls through persistence. In traditional software systems, retry logic is a reliability engineering concern. In autonomous agent systems, it becomes a governance concern because the agent's retry decisions have direct financial, operational, and compliance consequences.

The critical insight motivating AG-381 is that not all failures are created equal. A transient network timeout will likely resolve if retried after a brief delay. A semantic error — such as an input outside a model's domain — will never resolve through retry of an identical request. A policy denial — such as a mandate limit violation per AG-001 — should never be retried by the agent because doing so constitutes an attempt to circumvent a governance control. Treating these three classes identically, as most naive retry implementations do, creates three distinct failure modes: resource waste (retrying semantic errors), infrastructure amplification (retrying transient errors without backoff or aggregate caps), and governance circumvention (retrying policy denials).
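The three-way split can be sketched as a static mapping from failure signals to error classes. The status-code table below is a hypothetical example, not a mapping defined by AG-381. Treating unknown failures as non-retriable is likewise a conservative assumption made for this illustration.

```python
from enum import Enum
from typing import Optional


class ErrorClass(Enum):
    TRANSIENT = "transient"
    SEMANTIC = "semantic"
    POLICY = "policy"


# Hypothetical static mapping; a real deployment would maintain its own table.
STATUS_TO_CLASS = {
    408: ErrorClass.TRANSIENT,
    429: ErrorClass.TRANSIENT,
    502: ErrorClass.TRANSIENT,
    503: ErrorClass.TRANSIENT,
    504: ErrorClass.TRANSIENT,
    400: ErrorClass.SEMANTIC,
    422: ErrorClass.SEMANTIC,
    403: ErrorClass.POLICY,
}
POLICY_ERROR_CODES = {"POLICY_LIMIT_EXCEEDED"}


def classify(status: int, error_code: Optional[str] = None) -> ErrorClass:
    """Explicit governance codes win; unknown failures default to non-retriable."""
    if error_code in POLICY_ERROR_CODES:
        return ErrorClass.POLICY
    return STATUS_TO_CLASS.get(status, ErrorClass.SEMANTIC)


gateway_down = classify(503)
expense_denied = classify(422, "POLICY_LIMIT_EXCEEDED")
unknown = classify(418)
```

Checking the governance error code before the status code means a policy denial cannot be mistaken for a transient failure merely because of the transport-level status it arrives with.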

The governance risk of undifferentiated retry is particularly acute in agent systems because agents operate at machine speed and may manage thousands of concurrent workflows. A single agent with 2,000 pending workflows, each allowed 10 retries, can generate 20,000 retry attempts in minutes, a volume that can overwhelm downstream services, exhaust API rate limits, and accumulate significant cost. When multiple agents share a downstream dependency, the worst-case load on that dependency is this per-agent figure multiplied by the number of agents.
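Part of the problem with thousands of concurrent retry chains is synchronisation: workflows that fail together tend to retry together. Requirement 4.7 recommends jitter for this reason. A minimal sketch of the variant often called full jitter follows; the function name and default parameters are illustrative assumptions.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base_s: float = 0.5, cap_s: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full jitter: pick uniformly between 0 and the capped exponential ceiling."""
    source = rng if rng is not None else random
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return source.uniform(0.0, ceiling)


# Ten concurrent workflows on the same attempt number no longer fire at the
# same instant: each draws its own delay between 0 and 4 seconds.
delays = [backoff_delay(3) for _ in range(10)]
```

Jitter spreads retries over time but does not reduce their total number, so it complements an aggregate retry ceiling and does not replace one.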

Regulatory frameworks increasingly recognise operational resilience as a governance requirement. DORA explicitly requires financial entities to manage ICT-related incidents, including cascading failures caused by inappropriate retry behaviour. The EU AI Act's risk management requirements under Article 9 extend to the operational behaviour of AI systems, including how they respond to failures. SOX Section 404 internal control requirements demand that automated financial processes cannot circumvent controls through repeated attempts. AG-381 translates these regulatory expectations into concrete, testable retry governance requirements.

The relationship between retry governance and cost governance (AG-375) is direct: every retry consumes resources, including compute cycles, API calls, tokens, bandwidth, and time. An agent that retries a $0.05 API call 10 times across 2,000 workflows has consumed $1,000 in retry costs alone. When the retried operation involves large language model inference at $0.04 per 1,000 tokens, the multiplication is steeper: eight attempts at 45,000 tokens each, as in Scenario C, cost $14.40 for a single claim. Retry budgets are therefore both a reliability mechanism and a cost control mechanism.
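The cost figures used in this section can be reproduced with a short worked calculation. The helper names are illustrative; the inputs are the figures quoted in the text and in Scenario C, and the calculation assumes every workflow exhausts its full retry budget.

```python
from decimal import Decimal


def retry_cost(workflows: int, retries: int, unit_cost: Decimal) -> Decimal:
    """Worst-case spend on retries alone: every workflow uses its full budget."""
    return workflows * retries * unit_cost


def token_cost(tokens: int, price_per_1k: Decimal) -> Decimal:
    """Cost of a given token volume at a per-1,000-token price."""
    return Decimal(tokens) / 1000 * price_per_1k


# 2,000 workflows x 10 retries x $0.05 per call
api_retry_spend = retry_cost(2_000, 10, Decimal("0.05"))   # 1000 dollars

# Scenario C: 8 attempts x 45,000 tokens at $0.04 per 1,000 tokens
per_claim = token_cost(8 * 45_000, Decimal("0.04"))        # 14.40 dollars
weekend_total = per_claim * 1_200                          # 17280 dollars
```

`Decimal` is used so that currency amounts are not subject to binary floating-point rounding.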

6. Implementation Guidance

AG-381 establishes error classification and differentiated retry budgets as the central mechanism for governing agent retry behaviour. The error classification must occur at the infrastructure layer, not within the agent's reasoning process, to prevent the agent from misclassifying errors to obtain more retries. The retry budget enforcement must be structural — a counter maintained outside the agent's control — not advisory guidance that the agent's logic may override.
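A structural budget of this kind can be sketched as a ledger that the agent may query but not write to. The version below is a simplified, in-process illustration: `RetryBudgetLedger` and its parameters are hypothetical, and the lock stands in for the atomic counter that a real deployment would keep in a store outside the agent runtime.

```python
import threading


class RetryBudgetLedger:
    """Counter held outside the agent; the agent can only ask, never write."""

    def __init__(self, per_class: dict, aggregate_ceiling: int) -> None:
        self._per_class = dict(per_class)   # per-workflow budget for each error class
        self._aggregate_ceiling = aggregate_ceiling
        self._used: dict = {}               # (workflow_id, error_class) -> retries granted
        self._aggregate_used = 0
        self._lock = threading.Lock()

    def try_consume(self, workflow_id: str, error_class: str) -> bool:
        """Atomically grant one retry, or refuse if any budget is exhausted."""
        with self._lock:
            key = (workflow_id, error_class)
            used = self._used.get(key, 0)
            if used >= self._per_class.get(error_class, 0):
                return False                # per-class budget exhausted (or zero)
            if self._aggregate_used >= self._aggregate_ceiling:
                return False                # ceiling across all workflows reached
            self._used[key] = used + 1
            self._aggregate_used += 1
            return True


# Policy budget is zero; semantic budget is zero for an unchanged payload.
ledger = RetryBudgetLedger({"transient": 3, "semantic": 0, "policy": 0},
                           aggregate_ceiling=4)
wf1 = [ledger.try_consume("wf-1", "transient") for _ in range(4)]
policy = ledger.try_consume("wf-1", "policy")
wf2 = [ledger.try_consume("wf-2", "transient") for _ in range(2)]
```

Here `wf-1` is refused on its fourth request by its own per-class budget, while `wf-2` is refused on its second request by the aggregate ceiling even though its own budget is not yet spent. That second refusal is the protection Scenario A lacked.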

Recommended patterns:

Anti-patterns to avoid:

Industry Considerations

Financial Services. Payment processing retries carry direct financial risk — duplicate payment execution, counterparty exposure accumulation, and settlement system overload. Retry budgets for financial operations should be conservative (typically 1-3 retries for transient errors, zero for policy errors) and must include idempotency enforcement. The FCA expects firms to demonstrate that automated retry logic cannot circumvent transaction controls or create undetected duplicate exposure. PSD2 strong customer authentication requirements mean that retrying an authentication-dependent operation after timeout may violate the authentication session's validity window.

Healthcare and Life Sciences. Clinical decision support retries carry patient safety implications. Retrying a drug interaction check that failed due to a semantic error (e.g., unrecognised drug identifier) must not result in the interaction check being bypassed. FDA 21 CFR Part 11 requirements for electronic records mean that every retry attempt must be logged with the same rigour as the original attempt.

Critical Infrastructure and Robotics. Physical actuator commands must never be retried without confirming the state of the physical system. Retrying a "move arm to position X" command when the original command may have partially executed can cause dangerous double-movement. IEC 61508 functional safety requirements impose strict constraints on retry behaviour for safety-instrumented functions.

Crypto and Web3. Blockchain transaction retries carry unique risks because nonce management determines transaction ordering. Retrying a transaction with the same nonce replaces the original; retrying with a new nonce creates a parallel transaction. Gas cost accumulates on every retry attempt regardless of success. Smart contract interactions that modify on-chain state must be idempotent or non-retriable.
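The nonce behaviour described for blockchain retries can be illustrated with a toy model of a pending-transaction pool. This is a deliberate simplification and not any client's real API: `PendingPool` is hypothetical, and real networks and clients typically impose further conditions on replacement, such as a higher fee, which the model ignores.

```python
class PendingPool:
    """Toy model of a pending-transaction pool keyed by (sender, nonce)."""

    def __init__(self) -> None:
        self._pending: dict = {}

    def submit(self, sender: str, nonce: int, payload: str) -> None:
        # The same (sender, nonce) overwrites: at most one of them can be included.
        self._pending[(sender, nonce)] = payload

    def pending_for(self, sender: str) -> list:
        return [p for (s, _), p in sorted(self._pending.items()) if s == sender]


pool = PendingPool()
pool.submit("0xabc", 7, "transfer 1 ETH")
pool.submit("0xabc", 7, "transfer 1 ETH")   # retry with the same nonce: replaces
same_nonce = pool.pending_for("0xabc")      # one pending transfer

pool.submit("0xabc", 8, "transfer 1 ETH")   # retry with a new nonce: parallel
new_nonce = pool.pending_for("0xabc")       # two pending transfers
```

In this model, a retry that reuses the nonce cannot double-spend, whereas a retry with a fresh nonce behaves like the duplicate settlement in Scenario A.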

Maturity Model

Basic Implementation — The organisation has defined at least three error classes (transient, semantic, policy) and implemented per-class retry limits in the agent application layer. Policy errors are configured with zero retries and route to a review queue. Retry counts are logged. Aggregate retry caps are not yet implemented — each workflow manages its own budget independently. Error classification is based on HTTP status codes and a static mapping table. This level prevents the worst failure modes (retrying policy denials, unbounded retries on semantic errors) but does not address collective amplification or adaptive behaviour.

Intermediate Implementation — Error classification is implemented at the infrastructure layer, external to the agent runtime. Retry budgets are maintained as atomic counters in a data store the agent cannot modify. Aggregate retry ceilings are enforced across all concurrent workflows per agent. Circuit-breaker patterns suspend retries when a dependency's transient failure rate exceeds thresholds. Retry budget consumption is exposed to monitoring infrastructure in real time. Idempotency keys are attached to all retriable operations. Dead-letter routing per AG-382 is integrated for all budget-exhausted workflows.

Advanced Implementation — All intermediate capabilities plus: adaptive retry budgets that respond to system load and dependency health signals. Error classification uses machine learning models trained on historical error patterns to detect novel error classes and reclassify ambiguous failures. Retry behaviour is subject to independent adversarial testing, including scenarios where an agent attempts to circumvent retry budgets through request reformulation, error misclassification, or workflow duplication. Cross-agent retry coordination prevents aggregate retry storms across the entire agent population sharing a dependency. Retry cost attribution is integrated with AG-375 billing governance, and retry budgets automatically tighten when spend approaches configured caps.
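The circuit-breaker pattern named in requirement 4.9 and at the intermediate maturity level can be sketched as follows. This is a simplified illustration under stated assumptions: the class, its parameters, and the fixed-size sliding window are hypothetical choices, and the half-open probing state found in many production breakers is omitted.

```python
import time
from collections import deque


class CircuitBreaker:
    """Opens when the recent transient-failure rate crosses a threshold."""

    def __init__(self, window: int = 20, threshold: float = 0.5,
                 cooldown_s: float = 30.0, clock=time.monotonic) -> None:
        self._results = deque(maxlen=window)  # True = transient failure
        self._threshold = threshold
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._opened_at = None

    def record(self, transient_failure: bool) -> None:
        self._results.append(transient_failure)
        full = len(self._results) == self._results.maxlen
        if full and sum(self._results) / len(self._results) >= self._threshold:
            self._opened_at = self._clock()

    def allow_retry(self) -> bool:
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self._cooldown_s:
            self._opened_at = None   # cooldown over: close and start a fresh window
            self._results.clear()
            return True
        return False


# A fake clock makes the behaviour deterministic for the example.
now = [0.0]
breaker = CircuitBreaker(window=4, threshold=0.5, cooldown_s=10.0,
                         clock=lambda: now[0])
for failed in (True, False, True, True):
    breaker.record(failed)
while_open = breaker.allow_retry()       # retries suspended for the dependency
now[0] = 10.0
after_cooldown = breaker.allow_retry()   # cooldown elapsed, retries resume
```

Suspending retries per dependency, and not per workflow, is what lets the breaker relieve a degraded service regardless of how many workflows are queued behind it.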

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Testing AG-381 compliance requires validating that error classification is accurate, retry budgets are enforced per class, and collective retry amplification is controlled. A comprehensive test programme should include the following tests.

Test 8.1: Error Classification Accuracy

Test 8.2: Policy Error Zero-Retry Enforcement

Test 8.3: Semantic Error Identical-Payload Halt

Test 8.4: Per-Class Budget Exhaustion

Test 8.5: Aggregate Retry Ceiling Enforcement

Test 8.6: Retry Logging Completeness

Test 8.7: Circuit-Breaker Activation Under Sustained Transient Failure

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Supports compliance
EU AI Act | Article 15 (Accuracy, Robustness, Cybersecurity) | Direct requirement
SOX | Section 404 (Internal Controls Over Financial Reporting) | Direct requirement
FCA SYSC | 6.1.1R (Systems and Controls) | Direct requirement
NIST AI RMF | MANAGE 2.2, MANAGE 2.4 | Supports compliance
ISO 42001 | Clause 6.1 (Actions to Address Risks), Clause 8.4 (AI System Operation) | Supports compliance
DORA | Article 9 (ICT Risk Management Framework), Article 10 (ICT-related Incident Management) | Direct requirement

EU AI Act — Article 9 (Risk Management System)

Article 9 requires providers to establish risk management systems that identify and mitigate risks throughout the AI system lifecycle. Uncontrolled retry behaviour represents an operational risk that can cascade into financial, safety, and compliance risks. Retry budget governance implements a risk mitigation measure for the specific risk of failure-mode amplification — where an AI system's response to failure creates greater harm than the original failure itself. The requirement that risk management measures be "tested with a view to identifying the most appropriate risk management measures" maps directly to the adversarial testing requirement at Score 3, which verifies that retry budget controls cannot be circumvented.

EU AI Act — Article 15 (Accuracy, Robustness, Cybersecurity)

Article 15 requires high-risk AI systems to achieve appropriate levels of robustness against errors and faults. Retry governance directly implements robustness requirements by ensuring that the system's response to errors is controlled, bounded, and proportionate to the error type. An AI system that retries policy denials or amplifies load on degraded infrastructure fails the robustness requirement. The cybersecurity provisions of Article 15 are also relevant: uncontrolled retry behaviour can be exploited as an amplification vector in denial-of-service attacks against downstream systems.

SOX — Section 404 (Internal Controls Over Financial Reporting)

For AI agents executing financial operations, retry logic that can circumvent policy controls represents a material weakness in internal controls. If an agent can retry a denied financial transaction until a timing coincidence permits its execution, the control is inadequate. SOX auditors must be able to verify that policy denials are terminal — that the agent cannot overcome a governance rejection through persistence. The retry log serves as evidence that denied transactions were not retried and that approved retries (for transient errors) operated within defined budgets. A finding that policy denials were retried without human authorisation would be reportable as a control deficiency.

FCA SYSC — 6.1.1R (Systems and Controls)

SYSC 6.1.1R requires firms to maintain adequate systems and controls. For firms deploying AI agents in financial operations, this includes controls over retry behaviour that could create duplicate transactions, circumvent approval limits, or amplify operational incidents. The FCA's operational resilience framework (PS21/3) specifically addresses the management of ICT-related disruptions, including the risk that automated recovery mechanisms — such as retries — can amplify disruption rather than resolve it. Retry budget governance demonstrates that the firm's agent systems respond to failure in a controlled manner consistent with the firm's operational resilience strategy.

NIST AI RMF — MANAGE 2.2, MANAGE 2.4

MANAGE 2.2 addresses mechanisms for tracking and responding to known AI risks. MANAGE 2.4 addresses risk treatment through organisational controls. Retry budget governance implements risk treatment for the operational risk of uncontrolled failure recovery. The differentiated retry budget approach — different controls for different risk types — aligns with NIST's emphasis on risk-proportionate controls rather than one-size-fits-all measures.

ISO 42001 — Clause 6.1, Clause 8.4

Clause 6.1 requires organisations to determine actions to address risks within the AI management system. Clause 8.4 addresses the operation of AI systems, including operational controls that ensure systems behave as intended under failure conditions. Retry budget governance satisfies the operational control requirement by ensuring that failure recovery behaviour is defined, bounded, and governed rather than left to ad hoc agent logic.

DORA — Article 9, Article 10

Article 9 requires financial entities to establish ICT risk management frameworks that address all relevant ICT-related risks. Uncontrolled retry behaviour is an ICT-related risk that can cause cascading failures, duplicate transactions, and resource exhaustion. Article 10 specifically addresses ICT-related incident management, including the requirement to classify incidents and apply proportionate response measures. AG-381's error classification requirement directly implements this: different error classes receive different retry treatment, ensuring that the response to each failure type is proportionate to the risk it represents. DORA's emphasis on preventing ICT incidents from cascading across the financial system makes aggregate retry ceiling enforcement particularly relevant — an agent population's retry storm can propagate beyond the originating firm's infrastructure.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | Service-wide: extends to shared downstream dependencies and potentially to external counterparties through duplicate submissions or retry-induced load

Consequence chain: Without differentiated retry budgets, a single category of failure triggers a cascade of consequences that scales with the number of concurrent workflows and the speed of the agent runtime. The immediate technical failure is retry amplification — individually reasonable retry decisions that collectively overwhelm downstream services, exhaust API rate limits, or consume compute budgets. When the retried operation involves financial transactions, each retry carries the risk of duplicate execution, creating exposure that accumulates silently until reconciliation detects the discrepancies — typically hours or days later. When the retried operation targets a governance control (policy denial), each retry is a structural attempt to circumvent that control, and any success — through timing coincidence, configuration race condition, or enforcement gap — constitutes a governance breach. The operational consequence is threefold: cost multiplication (every retry consumes resources without producing value), infrastructure degradation (retry load compounds service instability), and governance erosion (policy denials that succeed on retry undermine the entire control framework). The business consequence includes duplicate governed exposure requiring manual reversal, regulatory findings for inadequate systems and controls, service-level agreement breaches caused by retry-induced latency, and potential personal liability for senior managers who failed to ensure that automated retry behaviour was governed with the same rigour as primary execution.

Cite this protocol
AgentGoverning. (2026). AG-381: Retry Budget by Error Class Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-381