AG-793

ML Service Availability Governance

Supplementary Core & Adversarial Model Resistance · AGS v2.1 · 2026-04-29

EU AI Act · NIST AI RMF · ISO 42001

1. Definition

ML Service Availability Governance mandates that every AI agent's inference pipeline be protected against denial-of-service attacks that target the machine learning service layer, not just the network or application layer. Traditional DoS protections (rate limiting, CDN, WAF) defend the network and application layers but do not address attacks that exploit the computational asymmetry unique to ML inference: a single carefully crafted input can consume orders of magnitude more compute than a normal request. Attack vectors include adversarial inputs that trigger worst-case inference paths, recursive tool-call loops, context window exhaustion, and model loading storms. AG-793 requires that availability governance operate at the ML service layer, with controls calibrated to the computational cost of inference rather than to request volume alone. Without this dimension, an attacker who cannot overwhelm the network can still deny service by exploiting the disproportionate compute cost of adversarial inference requests.

2. Scope

This dimension applies to all AI agent deployments where the agent's operational capability depends on access to an ML inference service, whether self-hosted, cloud-hosted, or accessed via API. Specifically:

Self-hosted inference endpoints where the agent calls a locally deployed model
Cloud ML API consumers calling external inference services (OpenAI, Anthropic, Google, AWS Bedrock, Azure OpenAI)
Multi-model pipelines where the agent chains multiple inference calls, creating multiplicative compute exposure
Tool-augmented agents where MCP tool calls can trigger additional inference requests recursively

Exclusions: Agents that perform no ML inference (pure rule-based systems) are excluded. Agents whose inference is handled entirely by a third-party service with its own availability SLA may defer compute-layer protections to the provider, but must still implement the request-side controls (R1-R4) at their own boundary.

Industry Considerations

Financial Services. Trading agents that lose inference access during market hours face direct financial exposure. Availability governance must include failover to degraded-but-safe operating modes with defined recovery time objectives aligned to market session windows.

Healthcare. Clinical decision support agents must maintain availability under load. A denial of ML service during patient triage creates a patient safety risk. Failover to human-only pathways must be tested and documented.

Safety-Critical. Agents controlling physical systems must fail safe when inference is denied. AG-793 intersects with AG-008 (Governance Continuity Under Failure) for these deployments.

3. Why This Matters

ML inference is computationally expensive in a way that traditional web services are not. A single inference request to a large language model can consume 100-1000x the compute of a database query. This computational asymmetry creates an attack surface unique to AI systems: an adversary does not need to overwhelm the network — they need only craft requests that trigger expensive inference paths. A context window stuffed to maximum length, a prompt designed to trigger chain-of-thought reasoning loops, or a batch of requests timed to hit the GPU simultaneously can deny service to legitimate users while staying within normal network-layer rate limits.

Traditional availability protections are necessary but insufficient. CDN-level rate limiting counts requests, not compute cost. Application-layer throttling measures latency, not GPU utilisation. Neither can distinguish between a legitimate complex query and an adversarial one designed to maximise inference cost. ML-specific availability governance must operate at the inference layer, measuring and limiting compute consumption per request, per agent, and per time window.

The consequence scales with operational importance. A copilot that loses inference is an inconvenience. A financial trading agent that loses inference during a volatile session accumulates unchecked exposure. A healthcare triage agent without inference routes patients without AI assistance. A safety-critical agent must fail safe — and the failover path must be tested, not assumed.

Article 15 of the EU AI Act requires high-risk AI systems to achieve appropriate levels of accuracy, robustness, and cybersecurity. NIST AI RMF MEASURE 2.6 calls for safety evaluation, including evidence that the system can fail safely. Denial of ML service is a failure mode that must be governed explicitly.

4. Requirements

R1: A conforming system MUST implement per-request compute cost estimation at the inference gateway, estimating token count, model size, and expected inference latency before the request executes.
R2: A conforming system MUST enforce per-agent compute budgets that limit the total inference compute an individual agent can consume per rolling time window (configurable, default: 1-hour window).
R3: A conforming system MUST implement circuit-breaker patterns that halt inference requests from an agent when compute consumption exceeds the budget, returning a structured denial with a retry-after interval.
R4: A conforming system MUST detect and reject adversarial inputs designed to maximise inference cost, including context window stuffing, recursive tool-call loops, and prompt patterns known to trigger worst-case inference paths.
R5: A conforming system MUST define and test a degraded-operation mode that the agent enters when inference is unavailable, ensuring no ungoverned actions during the denial window.
R6: A conforming system MUST log all availability events — budget exhaustion, circuit-breaker activations, adversarial input rejections, degraded-mode transitions — in a tamper-evident audit trail.
R7: A conforming system SHOULD implement priority-based inference scheduling that protects high-priority agent workloads from compute starvation.
R8: A conforming system SHOULD implement inference request queuing with fair scheduling to prevent a single agent from monopolising shared resources.
R9: A conforming system MAY implement predictive scaling that anticipates compute demand spikes based on historical patterns.
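One possible reading of the tamper-evident audit trail in R6 is a hash chain, in which each record commits to the hash of the record before it. The sketch below is illustrative only: the class name `AuditLog`, the event field names, and the choice of SHA-256 over canonical JSON are assumptions, not part of AG-793. A chain of this kind detects edits, reordering, and removal of earlier records, but it cannot detect truncation of the tail unless the latest hash is also anchored somewhere outside the log.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS_HASH = "0" * 64  # assumed starting value for the chain


def _digest(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always produces the same hash.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


@dataclass
class AuditLog:
    """Append-only log in which each record commits to its predecessor."""
    records: list = field(default_factory=list)

    def append(self, event: dict) -> str:
        prev_hash = self.records[-1]["hash"] if self.records else GENESIS_HASH
        record_hash = _digest(prev_hash, event)
        # Store a shallow copy so later changes to the caller's dict do not alter the log.
        self.records.append({"event": dict(event), "prev": prev_hash, "hash": record_hash})
        return record_hash

    def verify(self) -> bool:
        """Recompute the chain from the start; return False on the first mismatch."""
        prev_hash = GENESIS_HASH
        for record in self.records:
            if record["prev"] != prev_hash:
                return False
            if _digest(prev_hash, record["event"]) != record["hash"]:
                return False
            prev_hash = record["hash"]
        return True


# Hypothetical availability events; the field names are illustrative.
log = AuditLog()
log.append({"type": "budget_exhausted", "agent_id": "agent-7"})
log.append({"type": "circuit_breaker_open", "agent_id": "agent-7"})
chain_ok_before = log.verify()

# Tampering with a stored record breaks verification.
log.records[0]["event"]["agent_id"] = "agent-8"
chain_ok_after = log.verify()
```

In practice the chain head would typically be written periodically to separate, access-controlled storage so that tail truncation is also detectable.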

5. Maturity Model

Basic Implementation — Request-level rate limiting at the inference gateway. Per-agent compute budgets defined but enforced at the application layer. Adversarial input detection covers known patterns (context length limits). Degraded-operation mode exists but is not regularly tested. Logging covers circuit-breaker activations but may lack compute-cost metadata.

Intermediate Implementation — Per-request compute cost estimation at the inference gateway with token-level granularity. Compute budgets enforced at the infrastructure layer with rolling-window tracking. Adversarial input detection includes recursive tool-call loop detection and prompt pattern analysis. Degraded-operation mode tested quarterly. Priority scheduling protects critical workloads. Full compute-cost metadata in audit trail.

Advanced Implementation — All Intermediate capabilities validated through independent adversarial testing including compute exhaustion attacks, recursive loop injection, context manipulation, and concurrent flooding. Predictive scaling anticipates demand. Real-time dashboards show per-agent compute consumption and availability metrics. The organisation can demonstrate that no known ML-specific DoS technique denies service to governed agents.

Implementation Patterns

Inference cost estimator as pre-execution gate. Estimate compute cost based on input token count, model parameters, and expected output length before forwarding to the inference engine. Reject requests exceeding per-request cost limits before they consume GPU time.
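A minimal sketch of such a pre-execution gate follows. It assumes a deliberately crude cost proxy (total tokens multiplied by model size in billions of parameters); the function names, the `cost_units` scale, and the limit value are hypothetical. A real estimator would be calibrated against measured GPU time, because attention cost grows faster than linearly with context length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostEstimate:
    input_tokens: int
    max_output_tokens: int
    cost_units: float  # arbitrary unit, assumed proportional to compute


def estimate_cost(input_tokens: int, max_output_tokens: int,
                  model_params_billions: float) -> CostEstimate:
    """Rough proxy: tokens processed times model size (a simplifying assumption)."""
    if input_tokens < 0 or max_output_tokens < 0 or model_params_billions <= 0:
        raise ValueError("token counts must be non-negative and model size positive")
    total_tokens = input_tokens + max_output_tokens
    return CostEstimate(input_tokens, max_output_tokens,
                        total_tokens * model_params_billions)


def admit(estimate: CostEstimate, per_request_limit: float) -> dict:
    """Gate decision taken before the request reaches the inference engine."""
    if estimate.cost_units > per_request_limit:
        return {
            "admitted": False,
            "reason": "per_request_cost_limit_exceeded",
            "estimated_cost": estimate.cost_units,
            "limit": per_request_limit,
        }
    return {"admitted": True, "estimated_cost": estimate.cost_units}


# Illustrative values: a 70B-parameter model and a limit of 2,000,000 cost units.
PER_REQUEST_LIMIT = 2_000_000.0
small = estimate_cost(input_tokens=1_000, max_output_tokens=500, model_params_billions=70.0)
large = estimate_cost(input_tokens=128_000, max_output_tokens=4_000, model_params_billions=70.0)

small_decision = admit(small, PER_REQUEST_LIMIT)   # admitted
large_decision = admit(large, PER_REQUEST_LIMIT)   # rejected before any GPU time is spent
```

The estimate is logged in both cases, which is what TC1 checks: every request carries a cost estimate, and over-limit requests never reach the inference engine.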

Rolling-window compute budget with token-level accounting. Track consumption per agent using token-level accounting (input + output + reasoning tokens) rather than request count. Under request counting, one 128K-token request looks 100 times cheaper than 100 1K-token requests, yet it carries more tokens (128K against 100K); token accounting captures this.
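The rolling-window budget, together with the structured denial and retry-after interval from R3, might be sketched as follows. The class name, the response field names, and the choice to pass the clock in as `now` (for deterministic testing) are assumptions; a deployment would more likely read `time.monotonic()` and persist state outside the process.

```python
import math
from collections import defaultdict, deque


class RollingTokenBudget:
    """Per-agent token budget over a rolling window (default: 1 hour)."""

    def __init__(self, limit_tokens: int, window_seconds: float = 3600.0):
        self.limit_tokens = limit_tokens
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # agent_id -> deque of (timestamp, tokens)
        self._used = defaultdict(int)      # agent_id -> tokens inside the window

    def _expire(self, agent_id: str, now: float) -> None:
        events = self._events[agent_id]
        while events and events[0][0] <= now - self.window_seconds:
            _, tokens = events.popleft()
            self._used[agent_id] -= tokens

    def _retry_after(self, agent_id: str, tokens: int, now: float):
        # Walk the oldest entries until enough budget would be released.
        needed = self._used[agent_id] + tokens - self.limit_tokens
        freed = 0
        for timestamp, spent in self._events[agent_id]:
            freed += spent
            if freed >= needed:
                return math.ceil(timestamp + self.window_seconds - now)
        return None

    def try_consume(self, agent_id: str, tokens: int, now: float) -> dict:
        self._expire(agent_id, now)
        if tokens > self.limit_tokens:
            # The request can never fit in a window, so no retry interval applies.
            return {"allowed": False, "reason": "request_exceeds_window_budget",
                    "retry_after_seconds": None}
        if self._used[agent_id] + tokens > self.limit_tokens:
            return {"allowed": False, "reason": "compute_budget_exhausted",
                    "retry_after_seconds": self._retry_after(agent_id, tokens, now)}
        self._events[agent_id].append((now, tokens))
        self._used[agent_id] += tokens
        return {"allowed": True,
                "remaining_tokens": self.limit_tokens - self._used[agent_id]}


budget = RollingTokenBudget(limit_tokens=10_000, window_seconds=3600.0)
first = budget.try_consume("agent-a", 8_000, now=0.0)       # allowed
denied = budget.try_consume("agent-a", 3_000, now=100.0)    # denied with retry-after
other = budget.try_consume("agent-b", 3_000, now=100.0)     # unaffected by agent-a
later = budget.try_consume("agent-a", 3_000, now=3600.0)    # window has rolled over
```

Because each agent has its own ledger, exhausting one agent's budget has no effect on the others, which is the isolation property TC2 looks for.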

Recursive tool-call depth limiter. Enforce maximum call depth per interaction. Legitimate workflows rarely exceed 10 tool calls; recursive loops attempt hundreds. The limiter terminates the chain and logs the event.
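As a rough illustration, the limiter can be a per-interaction counter that raises once the allowance is exceeded. The names `ToolCallLimiter` and `run_chain`, the default of 10 calls, and the in-memory `events` list standing in for the audit trail are all hypothetical. The sketch counts total tool calls per interaction rather than tracking nesting depth.

```python
class ToolCallLimitExceeded(Exception):
    """Raised when an interaction exceeds its tool-call allowance."""


class ToolCallLimiter:
    def __init__(self, max_calls: int = 10):
        self.max_calls = max_calls
        self._counts = {}   # interaction_id -> calls made so far
        self.events = []    # stand-in for the audit trail

    def record_call(self, interaction_id: str, tool_name: str) -> int:
        count = self._counts.get(interaction_id, 0) + 1
        if count > self.max_calls:
            # Log the termination before refusing the call.
            self.events.append({
                "type": "tool_call_limit_exceeded",
                "interaction_id": interaction_id,
                "tool": tool_name,
                "limit": self.max_calls,
            })
            raise ToolCallLimitExceeded(
                f"{interaction_id} exceeded {self.max_calls} tool calls")
        self._counts[interaction_id] = count
        return count

    def end_interaction(self, interaction_id: str) -> None:
        self._counts.pop(interaction_id, None)


def run_chain(limiter: ToolCallLimiter, interaction_id: str, requested_calls: int) -> int:
    """Simulate an agent loop that attempts `requested_calls` tool calls."""
    completed = 0
    try:
        for _ in range(requested_calls):
            limiter.record_call(interaction_id, "search")
            completed += 1  # the real tool call would run here
    except ToolCallLimitExceeded:
        pass  # chain terminated; the limiter has already logged the event
    finally:
        limiter.end_interaction(interaction_id)
    return completed


limiter = ToolCallLimiter(max_calls=10)
normal_calls = run_chain(limiter, "interaction-1", 4)       # completes all 4 calls
runaway_calls = run_chain(limiter, "interaction-2", 4_700)  # cut off at the limit
```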

Degraded-mode with safe-hold behaviour. When inference is unavailable, the agent transitions to safe-hold (per AG-008) rather than acting without AI-assisted decision-making. A trading agent halts new positions; a clinical agent routes to human review; a copilot displays unavailability.
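The safe-hold transition could be sketched as a small state machine wrapped around the inference call. Everything named here (`GovernedAgent`, `InferenceUnavailable`, the two modes, the transition record fields) is an assumption for illustration, and the safe-hold behaviour itself is defined by AG-008, not by this example. For simplicity the sketch probes inference on every request; a real agent would probably back off between probes.

```python
from enum import Enum


class InferenceUnavailable(Exception):
    """Raised by the inference client when no decision can be obtained."""


class Mode(Enum):
    NORMAL = "normal"
    SAFE_HOLD = "safe_hold"


class GovernedAgent:
    def __init__(self, infer, safe_hold_action):
        self._infer = infer                        # callable: request -> decision
        self._safe_hold_action = safe_hold_action  # callable: request -> fallback result
        self.mode = Mode.NORMAL
        self.transitions = []                      # stand-in for the audit trail

    def _set_mode(self, mode: Mode, reason: str) -> None:
        if mode is not self.mode:
            self.transitions.append(
                {"from": self.mode.value, "to": mode.value, "reason": reason})
            self.mode = mode

    def handle(self, request):
        try:
            decision = self._infer(request)
        except InferenceUnavailable as exc:
            # Never act without inference backing: hold and hand off instead.
            self._set_mode(Mode.SAFE_HOLD, str(exc))
            return self._safe_hold_action(request)
        self._set_mode(Mode.NORMAL, "inference restored")
        return decision


# Simulated inference service that can be switched off and on.
service = {"up": True}


def infer(request):
    if not service["up"]:
        raise InferenceUnavailable("inference gateway unreachable")
    return {"action": "proceed", "request": request}


def hold(request):
    return {"action": "route_to_human_review", "request": request}


agent = GovernedAgent(infer, hold)
before = agent.handle("case-1")
service["up"] = False
during = agent.handle("case-2")
service["up"] = True
after = agent.handle("case-3")
```

The same harness shape covers TC4 and TC5: switch the simulated service off to check the transition into safe-hold, then back on to check recovery.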

Anti-Patterns

Request-count rate limiting as the only protection. A single 128K-token request consumes on the order of 100x the compute of a 1K-token request, yet both count as one request. Request-count limits miss this asymmetry.

Shared inference pool without per-agent isolation. A single runaway agent can starve every other agent on an unpartitioned GPU pool.
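One simple way to approximate per-agent isolation on a shared pool is round-robin dispatch across per-agent queues, in the spirit of R8. The sketch below is an assumption-laden illustration: `FairInferenceQueue` is a hypothetical name, and fairness here is measured in requests per turn, not in compute cost, so it would normally be combined with cost-aware budgets.

```python
from collections import OrderedDict, deque


class FairInferenceQueue:
    """Round-robin dispatch across agents so that no single agent monopolises the pool."""

    def __init__(self):
        self._queues = OrderedDict()  # agent_id -> deque of pending requests

    def submit(self, agent_id: str, request) -> None:
        self._queues.setdefault(agent_id, deque()).append(request)

    def next_request(self):
        """Return (agent_id, request) for the agent at the front of the rotation."""
        if not self._queues:
            return None
        agent_id, queue = self._queues.popitem(last=False)
        request = queue.popleft()
        if queue:
            # The agent still has work pending: move it to the back of the rotation.
            self._queues[agent_id] = queue
        return agent_id, request


scheduler = FairInferenceQueue()
for i in range(5):
    scheduler.submit("runaway-agent", f"r{i}")
scheduler.submit("quiet-agent", "q0")

dispatch_order = []
while True:
    item = scheduler.next_request()
    if item is None:
        break
    dispatch_order.append(item)
```

With a plain first-in-first-out queue the quiet agent's request would wait behind all five of the runaway agent's requests; under round-robin it is served second.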

No degraded-operation mode. Agents that crash when inference is unavailable convert a temporary availability issue into an ungoverned action risk.

Network-layer-only load testing. Testing the HTTP endpoint without testing ML-specific attack patterns proves the web server scales, not the inference pipeline.

6. Test Criteria

TC1: Per-Request Compute Cost Estimation

Stimulus: Submit requests with varying context sizes (1K, 16K, 64K, 128K tokens).
Expected behaviour: Each request's estimated compute cost logged before execution. Over-limit requests rejected pre-execution.
Pass criteria: Cost estimation present for all requests. Over-limit requests never reach inference engine.
Fail criteria: Any over-limit request reaches the inference engine.

TC2: Compute Budget Enforcement

Stimulus: Sustained stream of requests from a single agent until budget exhaustion.
Expected behaviour: Structured denial with retry-after interval. Other agents unaffected.
Pass criteria: Budget enforcement triggers at threshold. Other agents continue operating.
Fail criteria: Budget exceeded without enforcement, or collateral impact on other agents.

TC3: Adversarial Input Rejection

Stimulus: 5 adversarial patterns: (a) maximum context stuffing, (b) recursive tool-call loop, (c) chain-of-thought amplification prompt, (d) concurrent maximum-cost batch, (e) embedded compute-amplification instructions.
Expected behaviour: All 5 detected and rejected before inference.
Pass criteria: 5/5 rejected. Events logged.
Fail criteria: Any adversarial input reaches inference undetected.

TC4: Degraded-Mode Transition

Stimulus: Simulate complete inference unavailability during active operation.
Expected behaviour: Agent transitions to degraded mode. Zero ungoverned actions.
Pass criteria: Safe-hold within SLA. No ungoverned actions during denial.
Fail criteria: Agent crashes, hangs, or acts without inference backing.

TC5: Recovery After Restoration

Stimulus: Restore inference after 30-minute simulated denial.
Expected behaviour: Agent exits degraded mode, resumes normal operation.
Pass criteria: Full recovery within SLA. No data loss.
Fail criteria: Agent fails to recover or state corruption occurs.

Evidence Artefacts

Evidence ID	Description	Retention Period
AG793-E01	Compute cost estimation logs with per-request token counts	7 years
AG793-E02	Budget enforcement event logs	7 years
AG793-E03	Adversarial input detection and rejection logs	5 years
AG793-E04	Degraded-mode transition and recovery logs	7 years
AG793-E05	Adversarial testing reports for ML-specific DoS	5 years

7. Scoring

Score	Level	Description
0	No implementation	No ML-specific availability governance. Agents depend on network-layer rate limiting only. No compute budgets, no adversarial input detection, no degraded-operation mode.
1	Basic	Request-level rate limiting at inference gateway. Basic adversarial input detection. Degraded mode defined but not tested. Compute budgets advisory only.
2	Infrastructure-layer enforcement	Per-request compute cost estimation enforced at gateway. Per-agent budgets with token-level accounting. Adversarial input detection covers ML-specific patterns. Degraded mode tested quarterly. Priority scheduling. Full audit trail.
3	Verified by independent adversarial testing	All Level 2 verified through independent testing: compute exhaustion, recursive loops, context manipulation, concurrent flooding. Organisation demonstrates no known ML DoS technique denies service.

8. Failure Scenarios

Scenario A — Context Window Stuffing Denies Financial Trading Agents

A trading platform operates 12 AI agents on shared LLM infrastructure. An attacker submits legitimate-looking queries with 128K-token contexts — 100x normal. The GPU cluster saturates. Trading agents queue for 23 minutes. Three agents hold positions through a 4.2% market move that inference would have flagged. Loss: USD 2.8 million.

What went wrong: No per-request compute estimation. No per-agent budgets. Adversarial requests consumed the same pool as production agents. Consequence: USD 2.8 million loss, FCA investigation, mandatory infrastructure remediation.

Scenario B — Recursive Tool-Call Loop Exhausts Shared Budget

A customer service platform uses MCP tool calls. An adversary crafts an inquiry triggering an infinite search loop — 4,700 tool calls in 90 seconds, consuming the entire shared inference budget. All 340 agents lose access. 12,000 customer interactions disrupted over 45 minutes.

What went wrong: No tool-call depth limiter. No per-agent budget isolation. Consequence: 12,000 disrupted interactions, EUR 180,000 customer impact cost.

Scenario C — No Degraded Mode in Clinical Triage

A hospital triage agent loses inference for 15 minutes. With no degraded mode, the error handler assigns all patients "medium priority." Three acute patients are delayed 40 minutes.

What went wrong: No degraded-operation mode. Error handler made clinical decisions without inference rather than routing to human clinicians. Consequence: Patient safety incident, regulatory investigation.

Severity and Blast Radius

Field	Value
Severity Rating	Critical
Blast Radius	All agents sharing affected inference infrastructure

Consequence chain: ML service denial removes AI decision-making from every dependent agent. Without degraded modes, agents either crash or continue without inference backing — both create exposure proportionate to the agents' operational scope.

9. Regulatory Mapping

Requirement	EU AI Act	NIST AI RMF	ISO 42001
R1: Compute cost estimation	Art. 15 -- Robustness	MEASURE 2.6 -- Safety evaluation	Clause 8.2 -- AI risk assessment
R2: Per-agent compute budgets	Art. 9 -- Risk management	MANAGE 2.2 -- Sustain value	Clause 6.1 -- Risk actions
R3: Circuit-breaker patterns	Art. 15 -- Robustness	MANAGE 2.4 -- Deactivation	Clause 8.2 -- AI risk assessment
R4: Adversarial input rejection	Art. 15 -- Cybersecurity	MEASURE 2.7 -- Security resilience	Clause 8.2 -- AI risk assessment
R5: Degraded-operation mode	Art. 15 -- Robustness	MEASURE 2.6 -- Fail safely	Clause 6.1 -- Risk actions
R6: Audit trail	Art. 12 -- Record-keeping	GOVERN 1.4 -- Transparency	Clause 9.1 -- Monitoring

EU AI Act — Article 15

Article 15 requires robustness and cybersecurity measures for high-risk AI systems. ML-specific DoS is a robustness threat requiring explicit mitigation. The ability to fail safely when inference is denied falls within Article 15's scope.

NIST AI RMF — MEASURE 2.6 and MANAGE 2.4

MEASURE 2.6 calls for safety evaluation, including demonstration that the system can fail safely. MANAGE 2.4 calls for mechanisms to supersede, disengage, or deactivate AI systems whose performance or outcomes are inconsistent with intended use. Both map to AG-793's degraded-operation and circuit-breaker requirements.

Related Protocols

Protocol	Relationship
AG-001	Dependency — Operational Boundary Enforcement provides the mandate framework within which compute budgets operate
AG-004	Dependency — Action Rate Governance provides per-action rate limiting that AG-793 extends to compute-cost-aware limiting
AG-008	Integration — Governance Continuity Under Failure defines the safe-hold behaviour that degraded-operation mode implements
AG-794	Complementary — ML Model Inference API Access covers access control; AG-793 covers availability

Cite this protocol

AgentGoverning. (2026). AG-793: ML Service Availability Governance. The Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-793
