ML Service Availability Governance mandates that every AI agent's inference pipeline is protected against denial-of-service attacks targeting the machine learning service layer, not just the network or application layer. Traditional DoS protections (rate limiting, CDN, WAF) defend the network and application layers but do not address attacks that exploit the computational asymmetry unique to ML inference: a single carefully crafted input can consume orders of magnitude more compute than a normal request. Attack vectors include adversarial inputs that trigger worst-case inference paths, recursive tool-call loops, context window exhaustion, and model loading storms. AG-793 requires that availability governance operates at the ML service layer with controls calibrated to the computational cost of inference, not just request volume. Without this dimension, an attacker who cannot overwhelm the network can still deny service by exploiting the disproportionate compute cost of adversarial inference requests.
This dimension applies to all AI agent deployments where the agent's operational capability depends on access to an ML inference service, whether self-hosted, cloud-hosted, or accessed via API. Specifically:
Exclusions: Agents that perform no ML inference (pure rule-based systems) are excluded. Agents where inference is handled entirely by a third-party service with its own availability SLA may defer compute-layer protections to the provider, but must still implement request-side controls (R1-R4) at their own boundary.
Financial Services. Trading agents that lose inference access during market hours face direct financial exposure. Availability governance must include failover to degraded-but-safe operating modes with defined recovery time objectives aligned to market session windows.
Healthcare. Clinical decision support agents must maintain availability under load. A denial of ML service during patient triage creates a patient safety risk. Failover to human-only pathways must be tested and documented.
Safety-Critical. Agents controlling physical systems must fail safe when inference is denied. AG-793 intersects with AG-008 (Governance Continuity Under Failure) for these deployments.
ML inference is computationally expensive in a way that traditional web services are not. A single inference request to a large language model can consume 100-1000x the compute of a database query. This computational asymmetry creates an attack surface unique to AI systems: an adversary does not need to overwhelm the network — they need only craft requests that trigger expensive inference paths. A context window stuffed to maximum length, a prompt designed to trigger chain-of-thought reasoning loops, or a batch of requests timed to hit the GPU simultaneously can deny service to legitimate users while staying within normal network-layer rate limits.
Traditional availability protections are necessary but insufficient. CDN-level rate limiting counts requests, not compute cost. Application-layer throttling measures latency, not GPU utilisation. Neither can distinguish between a legitimate complex query and an adversarial one designed to maximise inference cost. ML-specific availability governance must operate at the inference layer, measuring and limiting compute consumption per request, per agent, and per time window.
The consequence scales with operational importance. A copilot that loses inference is an inconvenience. A financial trading agent that loses inference during a volatile session accumulates unchecked exposure. A healthcare triage agent without inference routes patients without AI assistance. A safety-critical agent must fail safe — and the failover path must be tested, not assumed.
The EU AI Act Article 15 requires robustness and cybersecurity measures for high-risk AI systems. NIST AI RMF MEASURE 2.6 requires safety evaluation including the ability to fail safely. Denial of ML service is a failure mode that must be governed explicitly.
Basic Implementation — Request-level rate limiting at the inference gateway. Per-agent compute budgets defined but enforced at the application layer. Adversarial input detection covers known patterns (context length limits). Degraded-operation mode exists but is not regularly tested. Logging covers circuit-breaker activations but may lack compute-cost metadata.
Intermediate Implementation — Per-request compute cost estimation at the inference gateway with token-level granularity. Compute budgets enforced at the infrastructure layer with rolling-window tracking. Adversarial input detection includes recursive tool-call loop detection and prompt pattern analysis. Degraded-operation mode tested quarterly. Priority scheduling protects critical workloads. Full compute-cost metadata in audit trail.
Advanced Implementation — All Intermediate capabilities validated through independent adversarial testing including compute exhaustion attacks, recursive loop injection, context manipulation, and concurrent flooding. Predictive scaling anticipates demand. Real-time dashboards show per-agent compute consumption and availability metrics. The organisation can demonstrate that no known ML-specific DoS technique denies service to governed agents.
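The priority scheduling listed under the Intermediate level can be illustrated with a small queue that always dispatches the most critical pending request first. This is a minimal sketch, not a prescribed AG-793 mechanism: the class name `PriorityInferenceQueue`, the convention that a lower number means higher priority, and the example agent names are all assumptions made for illustration.

```python
import heapq
import itertools


class PriorityInferenceQueue:
    """Dispatch queue where lower priority numbers are served first (assumed convention)."""

    def __init__(self):
        self._heap = []
        # Monotonic counter keeps FIFO order among requests of equal priority.
        self._counter = itertools.count()

    def submit(self, priority: int, agent_id: str, request: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), agent_id, request))

    def next_request(self):
        """Return (agent_id, request) for the most critical pending item, or None."""
        if not self._heap:
            return None
        _, _, agent_id, request = heapq.heappop(self._heap)
        return agent_id, request

    def __len__(self) -> int:
        return len(self._heap)


queue = PriorityInferenceQueue()
queue.submit(5, "copilot", "summarise ticket")
queue.submit(0, "trading-agent", "evaluate position risk")
queue.submit(5, "copilot", "draft reply")
```

A strict priority queue like this can starve low-priority agents under sustained load, so a real deployment would likely combine it with the per-agent budgets described in this section.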
Inference cost estimator as pre-execution gate. Estimate compute cost based on input token count, model parameters, and expected output length before forwarding to the inference engine. Reject requests exceeding per-request cost limits before they consume GPU time.
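The pre-execution gate described above might look like the following sketch. The cost formula, the weights, the quadratic context term, and the names `CostPolicy`, `estimate_cost`, and `admit` are illustrative assumptions; AG-793 does not define a cost model, and a real estimator would be calibrated against measured GPU time for the deployed model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostPolicy:
    # All values are illustrative assumptions, not figures from AG-793.
    model_scale: float = 1.0        # multiplier standing in for model parameter count
    input_weight: float = 1.0       # relative cost of one input token
    output_weight: float = 3.0      # output tokens assumed costlier than input tokens
    quadratic_coeff: float = 1e-5   # crude term for cost growth with context length
    max_cost_per_request: float = 50_000.0


def estimate_cost(input_tokens: int, max_output_tokens: int, policy: CostPolicy) -> float:
    """Return an abstract compute-unit estimate for a single request."""
    linear = input_tokens * policy.input_weight + max_output_tokens * policy.output_weight
    context_growth = policy.quadratic_coeff * input_tokens ** 2
    return policy.model_scale * (linear + context_growth)


def admit(input_tokens: int, max_output_tokens: int, policy: CostPolicy) -> tuple[bool, float]:
    """Gate decision taken before the request reaches the inference engine."""
    cost = estimate_cost(input_tokens, max_output_tokens, policy)
    return cost <= policy.max_cost_per_request, cost
```

Under these assumed weights, a 1,000-token prompt with a 500-token output cap is admitted, while a 128,000-token prompt is rejected at the gateway without consuming GPU time.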
Rolling-window compute budget with token-level accounting. Track consumption per agent using token-level accounting (input + output + reasoning tokens) rather than request count. Request counting treats one 128K-token request as 1% of the load of 100 1K-token requests, although the single request carries more tokens (128K versus 100K); token accounting captures the difference.
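A rolling-window token budget of this kind could be sketched as follows. The class name `RollingTokenBudget`, the injectable `clock` parameter, and the limits used in the example are assumptions for illustration only.

```python
import time
from collections import defaultdict, deque


class RollingTokenBudget:
    """Per-agent token budget enforced over a sliding time window."""

    def __init__(self, limit_tokens: int, window_seconds: float, clock=time.monotonic):
        self.limit_tokens = limit_tokens
        self.window_seconds = window_seconds
        self._clock = clock                   # injectable so the window can be tested
        self._events = defaultdict(deque)     # agent_id -> deque of (timestamp, tokens)
        self._totals = defaultdict(int)       # agent_id -> tokens currently in the window

    def _expire(self, agent_id: str, now: float) -> None:
        # Drop usage records that have aged out of the window.
        events = self._events[agent_id]
        while events and now - events[0][0] >= self.window_seconds:
            _, tokens = events.popleft()
            self._totals[agent_id] -= tokens

    def try_consume(self, agent_id: str, input_tokens: int, output_tokens: int,
                    reasoning_tokens: int = 0) -> bool:
        """Record usage and return True only if the agent stays within budget."""
        now = self._clock()
        self._expire(agent_id, now)
        tokens = input_tokens + output_tokens + reasoning_tokens
        if self._totals[agent_id] + tokens > self.limit_tokens:
            return False
        self._events[agent_id].append((now, tokens))
        self._totals[agent_id] += tokens
        return True

    def used(self, agent_id: str) -> int:
        self._expire(agent_id, self._clock())
        return self._totals[agent_id]
```

Output and reasoning token counts are only known after generation, so a gateway built on this idea would probably reserve the output cap up front and reconcile against actual usage once the response completes.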
Recursive tool-call depth limiter. Enforce a maximum call depth and a maximum total number of tool calls per interaction. Legitimate workflows rarely exceed 10 tool calls; recursive loops attempt hundreds. When a limit is reached, the limiter terminates the chain and logs the event.
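One way to express such a limiter is a per-interaction guard wrapped around every tool invocation. The sketch below is an assumption-laden illustration: `ToolCallGuard`, `ToolCallLimitExceeded`, the logger name, and the default limits are hypothetical, and it tracks both nesting depth and total calls so that flat loops are caught as well as recursive ones.

```python
import logging
from contextlib import contextmanager

logger = logging.getLogger("ag793.tool_calls")  # hypothetical logger name


class ToolCallLimitExceeded(RuntimeError):
    """Raised when an interaction exceeds its tool-call depth or count limit."""


class ToolCallGuard:
    """Tracks nesting depth and total tool calls for one interaction."""

    def __init__(self, interaction_id: str, max_depth: int = 10, max_calls: int = 50):
        self.interaction_id = interaction_id
        self.max_depth = max_depth
        self.max_calls = max_calls
        self.depth = 0
        self.calls = 0

    @contextmanager
    def call(self, tool_name: str):
        # Refuse the call before it runs if either limit is already reached.
        if self.depth >= self.max_depth or self.calls >= self.max_calls:
            logger.warning(
                "tool-call limit reached: interaction=%s tool=%s depth=%d calls=%d",
                self.interaction_id, tool_name, self.depth, self.calls,
            )
            raise ToolCallLimitExceeded(tool_name)
        self.depth += 1
        self.calls += 1
        try:
            yield
        finally:
            self.depth -= 1  # depth unwinds on return; total calls never decrease
```

An agent runtime would wrap each tool invocation in `with guard.call(name):` and treat `ToolCallLimitExceeded` as the signal to terminate the chain.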
Degraded mode with safe-hold behaviour. When inference is unavailable, the agent transitions to safe-hold (per AG-008) rather than acting without AI-assisted decision-making. A trading agent halts new positions; a clinical agent routes to human review; a copilot displays unavailability.
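The transition into safe-hold can be sketched as a small controller that never lets a request proceed without inference. The names `SafeHoldController`, `InferenceUnavailable`, `Mode`, and the failure threshold are assumptions for illustration; the actual safe-hold semantics are defined by AG-008, not by this sketch.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    SAFE_HOLD = "safe_hold"


class InferenceUnavailable(Exception):
    """Raised by the inference client when the ML service cannot respond."""


class SafeHoldController:
    """Routes requests to a safe-hold action whenever inference is denied."""

    def __init__(self, infer, safe_hold_action, failure_threshold: int = 3):
        self._infer = infer                        # callable: request -> decision
        self._safe_hold_action = safe_hold_action  # callable: request -> safe outcome
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.mode = Mode.NORMAL

    def decide(self, request):
        # In safe-hold, inference is not attempted at all.
        if self.mode is Mode.SAFE_HOLD:
            return self._safe_hold_action(request)
        try:
            result = self._infer(request)
        except InferenceUnavailable:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.mode = Mode.SAFE_HOLD
            # Even a single failure yields the safe outcome, never a guess.
            return self._safe_hold_action(request)
        self.consecutive_failures = 0
        return result

    def restore(self) -> None:
        """Explicit recovery step, intended to run after inference health is verified."""
        self.consecutive_failures = 0
        self.mode = Mode.NORMAL
```

For a clinical agent the `safe_hold_action` would route the case to human review; for a trading agent it would refuse to open new positions. Recovery is kept as an explicit call so that the return to normal operation is a logged, deliberate event.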
Request-count rate limiting as the only protection. A single 128K-token request consumes well over 100x the compute of a 1K-token request (128x the input tokens alone), yet it counts as one request. Request-count limits miss this asymmetry.
Shared inference pool without per-agent isolation. A single runaway agent can starve every other agent on an unpartitioned GPU pool.
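The isolation missing in this anti-pattern can be approximated by capping how many requests each agent may have in flight on the shared pool. The sketch below is one possible approach under stated assumptions: `PerAgentConcurrencyLimiter` and the per-agent limits are hypothetical, and a production deployment might instead partition GPUs or queues at the infrastructure layer.

```python
import threading


class PerAgentConcurrencyLimiter:
    """Caps in-flight inference requests per agent on a shared pool."""

    def __init__(self, limits: dict[str, int], default_limit: int = 1):
        self._limits = dict(limits)
        self._default_limit = default_limit
        self._semaphores: dict[str, threading.BoundedSemaphore] = {}
        self._lock = threading.Lock()

    def _semaphore(self, agent_id: str) -> threading.BoundedSemaphore:
        # Create each agent's semaphore lazily, guarded against races.
        with self._lock:
            if agent_id not in self._semaphores:
                limit = self._limits.get(agent_id, self._default_limit)
                self._semaphores[agent_id] = threading.BoundedSemaphore(limit)
            return self._semaphores[agent_id]

    def try_acquire(self, agent_id: str) -> bool:
        """Return True if the agent may start another request right now."""
        return self._semaphore(agent_id).acquire(blocking=False)

    def release(self, agent_id: str) -> None:
        """Call when the agent's request has finished."""
        self._semaphore(agent_id).release()
```

With this in place, a runaway agent exhausts only its own slots; other agents' slots on the pool remain available.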
No degraded-operation mode. Agents that crash when inference is unavailable convert a temporary availability issue into an ungoverned action risk.
Network-layer-only load testing. Testing the HTTP endpoint without testing ML-specific attack patterns proves the web server scales, not the inference pipeline.
TC1: Per-Request Compute Cost Estimation
TC2: Compute Budget Enforcement
TC3: Adversarial Input Rejection
TC4: Degraded-Mode Transition
TC5: Recovery After Restoration
| Evidence ID | Description | Retention Period |
|---|---|---|
| AG793-E01 | Compute cost estimation logs with per-request token counts | 7 years |
| AG793-E02 | Budget enforcement event logs | 7 years |
| AG793-E03 | Adversarial input detection and rejection logs | 5 years |
| AG793-E04 | Degraded-mode transition and recovery logs | 7 years |
| AG793-E05 | Adversarial testing reports for ML-specific DoS | 5 years |
| Score | Level | Description |
|---|---|---|
| 0 | No implementation | No ML-specific availability governance. Agents depend on network-layer rate limiting only. No compute budgets, no adversarial input detection, no degraded-operation mode. |
| 1 | Basic | Request-level rate limiting at inference gateway. Basic adversarial input detection. Degraded mode defined but not tested. Compute budgets advisory only. |
| 2 | Intermediate | Infrastructure-layer enforcement. Per-request compute cost estimation enforced at gateway. Per-agent budgets with token-level accounting. Adversarial input detection covers ML-specific patterns. Degraded mode tested quarterly. Priority scheduling. Full audit trail. |
| 3 | Advanced | Verified by independent adversarial testing. All Level 2 capabilities verified through independent testing: compute exhaustion, recursive loops, context manipulation, concurrent flooding. Organisation demonstrates no known ML DoS technique denies service. |
Scenario A — Context Window Stuffing Denies Financial Trading Agents
A trading platform operates 12 AI agents on shared LLM infrastructure. An attacker submits legitimate-looking queries with 128K-token contexts — 100x normal. The GPU cluster saturates. Trading agents queue for 23 minutes. Three agents hold positions through a 4.2% market move that inference would have flagged. Loss: USD 2.8 million.
What went wrong: No per-request compute estimation. No per-agent budgets. Adversarial requests consumed the same pool as production agents. Consequence: USD 2.8 million loss, FCA investigation, mandatory infrastructure remediation.
Scenario B — Recursive Tool-Call Loop Exhausts Shared Budget
A customer service platform uses MCP tool calls. An adversary crafts an inquiry triggering an infinite search loop — 4,700 tool calls in 90 seconds, consuming the entire shared inference budget. All 340 agents lose access. 12,000 customer interactions disrupted over 45 minutes.
What went wrong: No tool-call depth limiter. No per-agent budget isolation. Consequence: 12,000 disrupted interactions, EUR 180,000 customer impact cost.
Scenario C — No Degraded Mode in Clinical Triage
A hospital triage agent loses inference for 15 minutes. With no degraded mode, the error handler assigns all patients "medium priority." Three acute patients are delayed 40 minutes.
What went wrong: No degraded-operation mode. Error handler made clinical decisions without inference rather than routing to human clinicians. Consequence: Patient safety incident, regulatory investigation.
| Field | Value |
|---|---|
| Severity Rating | Critical |
| Blast Radius | All agents sharing affected inference infrastructure |
Consequence chain: ML service denial removes AI decision-making from every dependent agent. Without degraded modes, agents either crash or continue without inference backing — both create exposure proportionate to the agents' operational scope.
| Requirement | EU AI Act | NIST AI RMF | ISO 42001 |
|---|---|---|---|
| R1: Compute cost estimation | Art. 15 -- Robustness | MEASURE 2.6 -- Safety evaluation | Clause 8.2 -- AI risk assessment |
| R2: Per-agent compute budgets | Art. 9 -- Risk management | MANAGE 2.2 -- Sustain value | Clause 6.1 -- Risk actions |
| R3: Circuit-breaker patterns | Art. 15 -- Robustness | MANAGE 2.4 -- Deactivation | Clause 8.2 -- AI risk assessment |
| R4: Adversarial input rejection | Art. 15 -- Cybersecurity | MEASURE 2.7 -- Security resilience | Clause 8.2 -- AI risk assessment |
| R5: Degraded-operation mode | Art. 15 -- Robustness | MEASURE 2.6 -- Fail safely | Clause 6.1 -- Risk actions |
| R6: Audit trail | Art. 12 -- Record-keeping | GOVERN 1.4 -- Transparency | Clause 9.1 -- Monitoring |
Article 15 requires robustness and cybersecurity measures for high-risk AI systems. ML-specific DoS is a robustness threat requiring explicit mitigation. The ability to fail safely when inference is denied falls within Article 15's scope.
MEASURE 2.6 requires safety evaluation including demonstration that the system can fail safely. MANAGE 2.4 requires mechanisms to supersede, disengage, or deactivate AI systems whose performance or outcomes are inconsistent with intended use. Both map to AG-793's degraded-operation and circuit-breaker requirements.
| Protocol | Relationship |
|---|---|
| AG-001 | Dependency — Operational Boundary Enforcement provides the mandate framework within which compute budgets operate |
| AG-004 | Dependency — Action Rate Governance provides per-action rate limiting that AG-793 extends to compute-cost-aware limiting |
| AG-008 | Integration — Governance Continuity Under Failure defines the safe-hold behaviour that degraded-operation mode implements |
| AG-794 | Complementary — ML Model Inference API Access covers access control; AG-793 covers availability |