AG-402

Model Serving Rate Partitioning Governance

Infrastructure, Platform & Network · AGS v2.1 · April 2026
Regulatory tags: EU AI Act · SOX · FCA · NIST · ISO 42001

2. Summary

Model Serving Rate Partitioning Governance requires that organisations operating AI agent infrastructure partition inference serving capacity into isolated rate buckets so that one workload, tenant, or agent class cannot exhaust shared capacity and starve another critical workload of inference throughput. Without explicit partitioning, a single runaway workload — whether caused by a misconfigured retry loop, a traffic spike from a customer-facing agent, or an adversarial denial-of-service pattern — can consume the entirety of available inference capacity, rendering safety-critical agents, financial-value agents, and emergency override channels inoperable. This dimension mandates that serving capacity is structurally divided with guaranteed minimum allocations, priority-based preemption rules, and continuous monitoring to ensure that critical workloads always retain access to the inference resources they require.

3. Example

Scenario A — Runaway Retry Loop Starves Safety-Critical Agent: An organisation operates six AI agents on a shared inference cluster with a total throughput capacity of 4,800 requests per second. An enterprise workflow agent responsible for document summarisation encounters a transient upstream API failure. Its retry logic, configured with no exponential backoff and no per-agent rate ceiling, begins resubmitting failed requests at an exponential rate. Within 90 seconds the workflow agent is consuming 4,650 requests per second — 96.9% of cluster capacity. A safety-critical CPS agent responsible for monitoring industrial equipment telemetry submits an anomaly-detection inference request triggered by a temperature exceedance event. The request is queued behind 12,400 pending summarisation retries and is not served for 47 seconds. The acceptable latency for the CPS agent is 800 milliseconds. During the 47-second delay, the industrial equipment exceeds its thermal operating envelope, causing a forced shutdown that destroys £340,000 of in-process materials and triggers an HSE investigation.

What went wrong: No partition existed between the workflow agent's inference quota and the safety-critical agent's inference quota. The retry storm consumed shared capacity without limit. The safety-critical agent had no guaranteed minimum allocation and no priority preemption over lower-criticality workloads. Consequence: £340,000 material loss, HSE regulatory finding, 3-week production stoppage, and remediation costs of £185,000 for infrastructure re-architecture.
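Client-side retry discipline would have blunted Scenario A even before structural partitioning was in place. The sketch below shows capped exponential backoff with jitter; the function and exception names are illustrative, and per this dimension such client discipline is defence in depth, not a substitute for server-side partitions.

```python
import random
import time


class TransientError(Exception):
    """Raised by a call when the upstream failure is retryable."""


def retry_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a failing call with capped exponential backoff and jitter.

    Bounding the attempt count and spacing attempts out prevents a
    transient upstream failure from becoming an unbounded retry storm.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter de-synchronises clients
```

With these defaults a failing call generates at most five requests spread over several seconds, rather than the unbounded resubmission rate that consumed 96.9% of cluster capacity in Scenario A.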

Scenario B — Customer-Facing Traffic Spike Blocks Financial Trading Agent: A retail bank operates a customer-facing chatbot agent and an internal financial-value agent that executes algorithmic trading decisions on the same inference endpoint. On a Monday morning following a widely publicised interest rate announcement, customer traffic to the chatbot increases 11x from a baseline of 800 requests per second to 8,800 requests per second. The inference endpoint has a rated capacity of 9,600 requests per second. The chatbot consumes 8,800 requests per second, leaving only 800 requests per second for all other workloads. The financial-value agent requires 1,200 requests per second to maintain its trading strategy within risk parameters. Throughput-starved, the trading agent misses 3 execution windows over 14 minutes. The missed executions result in a net position drift of £2.3 million. The bank's risk desk manually unwinds the position at a realised loss of £417,000 and the FCA opens an inquiry into whether the firm's systems and controls were adequate.

What went wrong: The chatbot and the trading agent shared an undifferentiated inference endpoint. No rate partition guaranteed the trading agent a minimum allocation of 1,200 requests per second. The chatbot's traffic surge was legitimate but consumed capacity needed by a higher-criticality workload. The firm had no priority preemption mechanism to protect financial-value inference. Consequence: £417,000 trading loss, FCA inquiry, mandatory independent systems review costing £290,000.

Scenario C — Crypto Arbitrage Bot Monopolises Shared GPU Cluster: A crypto exchange operates a shared inference cluster serving three agents: a customer-facing support agent (Tier 2), a compliance surveillance agent (Tier 1), and a crypto arbitrage execution agent (Tier 1). The arbitrage agent's request volume is proportional to market volatility. During a flash crash event, volatility spikes 40x, and the arbitrage agent submits 15,000 requests per second against a cluster capacity of 6,000 requests per second. The cluster queue saturates. The compliance surveillance agent, which monitors for wash trading and market manipulation during exactly these high-volatility periods, is unable to process inference requests for 8 minutes. During those 8 minutes, three wash-trading patterns that the surveillance agent would have flagged go undetected. A post-incident review by the financial regulator reveals that the exchange's surveillance system was non-functional during the period of highest risk. The regulator imposes a £1.8 million fine for surveillance system inadequacy and requires the exchange to engage an independent auditor at a cost of £220,000.

What went wrong: The arbitrage agent and the compliance surveillance agent shared capacity with no partitioning. The compliance agent — whose function is most critical during high-volatility events — was starved precisely when it was needed most. No guaranteed minimum allocation existed for the compliance workload, and no preemption mechanism prioritised surveillance over trading execution. Consequence: £1.8 million regulatory fine, £220,000 audit costs, mandatory remediation, reputational damage.

4. Requirement Statement

Scope: This dimension applies to any organisation where two or more AI agent workloads share inference serving infrastructure — whether a shared GPU cluster, a multi-tenant inference endpoint, a load-balanced pool of model servers, or a cloud-hosted inference API with a shared quota. The scope includes all layers of the inference serving stack where contention can occur: request admission, queue management, compute scheduling, memory allocation, and network bandwidth. The dimension applies regardless of whether the agents are operated by a single team or by multiple teams within the same organisation. It also applies to organisations consuming third-party inference services where multiple internal workloads share a single API key, subscription tier, or endpoint. The critical test is: can one agent workload's demand reduce the inference throughput available to another agent workload? If yes, this dimension applies in full.

4.1. A conforming system MUST define named rate partitions for each distinct criticality class of agent workload, with each partition assigned a guaranteed minimum inference throughput allocation expressed in quantifiable units (requests per second, tokens per second, or equivalent capacity metric).
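Clause 4.1 lends itself to a small, machine-checkable configuration. A sketch follows; the partition names, priorities, and throughput figures are purely illustrative, not prescribed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatePartition:
    name: str
    priority: int        # lower number = higher criticality
    guaranteed_rps: int  # minimum throughput this partition always receives (4.1)
    max_rps: int         # hard ceiling under contention (4.2)


# Hypothetical layout for a 6,000 rps cluster; numbers are illustrative only.
PARTITIONS = [
    RatePartition("human-override", priority=0, guaranteed_rps=200, max_rps=400),
    RatePartition("safety-critical", priority=1, guaranteed_rps=1500, max_rps=3000),
    RatePartition("financial-value", priority=2, guaranteed_rps=1200, max_rps=2400),
    RatePartition("general-workflow", priority=3, guaranteed_rps=500, max_rps=6000),
]

CLUSTER_CAPACITY_RPS = 6000

# Sanity check: the guaranteed minimums must fit inside total capacity,
# otherwise the guarantees in 4.1 are not simultaneously satisfiable.
assert sum(p.guaranteed_rps for p in PARTITIONS) <= CLUSTER_CAPACITY_RPS
```

Keeping the layout in reviewable configuration rather than ad-hoc gateway settings also produces the documentation artefact that auditors will ask for.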

4.2. A conforming system MUST enforce partition boundaries such that no workload can consume more than its allocated maximum throughput under contention conditions, even if the workload submits requests exceeding its allocation.
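One common mechanism for the ceiling in 4.2 is a token bucket at the admission layer, refilled at the partition's maximum rate. A minimal sketch, with an injectable clock so enforcement can be tested deterministically:

```python
import time


class TokenBucket:
    """Admission-control bucket: refills at `rate` tokens/s up to `burst`.

    A request is admitted only if a token is available, so a workload
    submitting more than its ceiling has its excess requests denied
    regardless of how aggressively it retries.
    """

    def __init__(self, rate: float, burst: float, now=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.now = now
        self.last = now()

    def try_admit(self, cost: float = 1.0) -> bool:
        t = self.now()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The gateway would hold one bucket per partition; a denial here is an enforcement event that must be logged under 4.7.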

4.3. A conforming system MUST guarantee that higher-criticality partitions receive their minimum allocation before lower-criticality partitions receive any allocation beyond their own minimum, implementing priority-based preemption when total demand exceeds total capacity.
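The ordering rule in 4.3 reduces to a two-pass allocation: fund every partition's guaranteed minimum from highest to lowest priority, then hand out any surplus in the same order. A sketch under the assumption that demand and guarantees share the same rps units:

```python
def allocate(capacity, partitions):
    """partitions: iterable of (name, priority, guaranteed, demand) tuples,
    lower priority number = higher criticality. Returns {name: allocation}.
    Minimums are funded in priority order; only then is surplus granted,
    again in priority order (per 4.3)."""
    ordered = sorted(partitions, key=lambda p: p[1])
    alloc = {}
    remaining = capacity
    # Pass 1: guaranteed minimums, capped by actual demand.
    for name, _, guaranteed, demand in ordered:
        grant = min(guaranteed, demand, remaining)
        alloc[name] = grant
        remaining -= grant
    # Pass 2: surplus capacity to unmet demand, highest priority first.
    for name, _, guaranteed, demand in ordered:
        extra = min(demand - alloc[name], remaining)
        alloc[name] += extra
        remaining -= extra
    return alloc
```

Replaying Scenario B through this allocator (capacity 9,600 rps, chatbot demanding 8,800, trading agent guaranteed and demanding 1,200) yields the intended outcome: the trading agent keeps its full 1,200 rps and the chatbot is trimmed to 8,400 rps.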

4.4. A conforming system MUST reserve a dedicated partition for human override and emergency intervention channels (per AG-008) that cannot be consumed by any automated agent workload under any conditions, including total capacity saturation.

4.5. A conforming system MUST monitor per-partition utilisation, queue depth, and latency in real time, generating alerts when any partition's utilisation exceeds 80% of its allocation or when any request in a critical partition exceeds its defined latency threshold.
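The 80% threshold and latency checks in 4.5 are straightforward predicates over per-partition metrics. A sketch, with hypothetical metric field names (a real deployment would wire this into its monitoring stack rather than poll dictionaries):

```python
def partition_alerts(samples, utilisation_threshold=0.8):
    """samples: list of dicts with name, observed_rps, allocated_rps,
    worst_latency_ms, latency_slo_ms. Returns alert strings per 4.5."""
    alerts = []
    for s in samples:
        utilisation = s["observed_rps"] / s["allocated_rps"]
        if utilisation > utilisation_threshold:
            alerts.append(f"{s['name']}: utilisation {utilisation:.0%}")
        if s["worst_latency_ms"] > s["latency_slo_ms"]:
            alerts.append(f"{s['name']}: latency SLO breached")
    return alerts
```

Alerting at 80% rather than at saturation is the point: it buys operators headroom to intervene before the starvation patterns in Scenarios A-C take hold.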

4.6. A conforming system MUST reject or shed load from lower-criticality partitions before any request in a higher-criticality partition experiences degradation beyond its defined service-level objective, implementing explicit load-shedding rules with documented priority ordering.
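The documented priority ordering in 4.6 can be as simple as a stable sort over pending requests: keep up to capacity, shed the rest, and report what was shed so it can be logged under 4.7. A toy sketch:

```python
def shed_load(requests, capacity):
    """requests: list of (request_id, priority) pairs in arrival order,
    lower priority number = more critical. Keeps at most `capacity`
    requests, shedding the lowest-criticality ones first; ties are
    broken by arrival order because Python's sort is stable.
    Returns (kept, shed); shed entries must be logged per 4.7."""
    ordered = sorted(requests, key=lambda r: r[1])
    return ordered[:capacity], ordered[capacity:]
```

Real schedulers shed continuously rather than in batches, but the invariant is the same: no higher-criticality request is dropped while a lower-criticality request is still being served.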

4.7. A conforming system MUST log all partition enforcement events — admission denials, preemptions, load-shedding decisions, and capacity reallocation actions — with timestamps, affected workloads, and the enforcement rule that triggered the action.

4.8. A conforming system SHOULD implement dynamic partition resizing that adjusts guaranteed minimums based on time-of-day demand patterns, active agent count, and observed utilisation, while maintaining an absolute floor below which no critical partition's allocation may fall.

4.9. A conforming system SHOULD implement burst allowances that permit workloads to temporarily exceed their guaranteed allocation when other partitions have unused capacity, without compromising the guaranteed minimums of any partition.

4.10. A conforming system SHOULD conduct quarterly capacity planning reviews that project per-partition demand growth and trigger infrastructure scaling before any partition's headroom falls below a defined threshold (recommended: 25% of allocated capacity).

4.11. A conforming system MAY implement cross-region or cross-cluster partition failover, where a partition that exhausts local capacity can overflow to a secondary cluster while maintaining its priority class and latency constraints.

5. Rationale

Shared inference infrastructure is the economic norm for AI agent deployments. Dedicated inference clusters for every individual agent would be prohibitively expensive — a single high-end GPU node serving a large language model costs between £15,000 and £45,000 per month. Organisations therefore share inference capacity across multiple agents, accepting the engineering complexity of multi-tenant serving in exchange for cost efficiency. This sharing creates a fundamental risk: contention. When total demand exceeds total capacity, some workloads must wait or be denied service. Without explicit governance of which workloads receive priority, the outcome is determined by whichever workload happens to arrive first or submit the most requests — a race condition that systematically favours high-volume workloads over high-criticality workloads.

The risk is not theoretical. Capacity contention underlies a large share of multi-tenant cloud service outages. The pattern is consistent: a noisy neighbour — a single workload with unexpectedly high or unbounded demand — consumes shared resources, degrading or denying service to co-located workloads. In agent governance terms, the noisy-neighbour problem is particularly dangerous because the workloads most likely to generate traffic spikes (customer-facing agents during demand surges, trading agents during market volatility, retry storms from transient failures) are often not the workloads with the highest governance criticality (safety-critical agents, compliance surveillance agents, human override channels).

Regulatory frameworks reinforce this requirement from multiple angles. The EU AI Act Article 15 requires that high-risk AI systems maintain accuracy and robustness under foreseeable operating conditions — a condition that cannot be met if the system's inference infrastructure is vulnerable to starvation by co-located workloads. The FCA's SYSC rules require firms to maintain effective systems and controls; an inference endpoint that serves a trading agent reliably under light load but fails under contention is not an effective system. DORA (Digital Operational Resilience Act) Article 11 requires ICT capacity management that prevents service degradation from resource contention — a direct mandate for rate partitioning. SOX Section 404 requires that internal controls over financial reporting remain effective; if an AI agent generating financial reports is starved of inference capacity during period-end processing, the control has failed.

The governance requirement is structural, not behavioural. It is not sufficient to rely on well-behaved workloads or manual intervention during incidents. The organisation must implement infrastructure-level enforcement that guarantees critical workloads receive their required inference throughput regardless of what other workloads do. This is the operational equivalent of quality-of-service partitioning in network infrastructure — a mature engineering discipline now being applied to AI inference serving.

6. Implementation Guidance

Model Serving Rate Partitioning Governance requires infrastructure-level mechanisms that structurally guarantee inference throughput for critical workloads. The core principle is that criticality determines allocation priority, not arrival order or request volume.

Recommended patterns:

- Enforce partition ceilings at the inference gateway (request admission), not only in client code, so that a misbehaving workload cannot bypass the control.
- Fund guaranteed minimums in strict priority order, then distribute surplus capacity work-conservingly through burst allowances (4.9).
- Reserve a standing partition for human override and emergency channels (4.4) and exercise it regularly.
- Alert on per-partition utilisation, queue depth, and latency before saturation (4.5), not after an incident.
- Size partitions for correlated peak demand; the scenarios above show that demand spikes and criticality spikes tend to coincide.

Anti-patterns to avoid:

- A single undifferentiated endpoint, API key, or quota shared by workloads of different criticality (the root cause in all three scenarios).
- First-come-first-served queuing under contention, which systematically favours high-volume workloads over high-criticality ones.
- Relying on well-behaved clients (backoff, self-throttling) as the only control; client discipline is defence in depth, not a structural guarantee.
- Treating manual intervention during incidents as the protection mechanism for critical workloads.

Industry Considerations

Financial Services. Trading agents and compliance surveillance agents must be in the highest non-override partition tier. Firms should model worst-case concurrent demand during market stress events (when both trading volume and surveillance demand spike simultaneously) and size partitions to accommodate that concurrency. The FCA expects firms to demonstrate that their systems can handle peak load without degradation of critical functions.

Healthcare and Safety-Critical. Patient monitoring agents, clinical decision support agents, and CPS control agents must have guaranteed partitions with latency ceilings (e.g., 200 milliseconds for clinical alerts). These partitions should be sized for peak concurrent patient load, not average load. Failover to a secondary inference cluster should be automatic if the primary cluster's safety-critical partition exceeds its latency ceiling.

Crypto and Web3. Market surveillance and compliance agents must maintain guaranteed capacity independent of trading bot activity. Flash crash events simultaneously increase trading bot demand and compliance surveillance demand — the partition architecture must accommodate both peaks concurrently.

Cross-Border Deployments. Organisations serving agents across jurisdictions must consider per-region partition requirements. A European compliance agent subject to EU AI Act obligations requires its partition guarantee to be met within European infrastructure, independent of demand from agents serving other regions.

Maturity Model

Basic Implementation — The organisation has defined named rate partitions for at least two criticality tiers (critical and non-critical). Guaranteed minimum allocations are documented and enforced at the inference gateway. A dedicated partition for human override channels exists and is tested monthly. Per-partition utilisation is monitored and alerts fire when utilisation exceeds 80%. Load shedding rejects lower-priority requests before higher-priority requests are degraded. All enforcement events are logged.

Intermediate Implementation — All basic capabilities plus: three or more partition tiers reflecting the full criticality spectrum. Burst sharing allows underutilised partition capacity to serve other tiers without compromising guarantees. Priority-aware queuing at the backend scheduler ensures GPU compute respects partition priorities. Circuit-breaker integration with AG-004 throttles workloads at the agent level before partition ceilings are reached. Quarterly capacity planning reviews project demand growth per partition.

Advanced Implementation — All intermediate capabilities plus: dynamic partition resizing adjusts allocations based on real-time demand patterns and time-of-day profiles. Cross-cluster partition failover ensures that a critical partition can overflow to secondary infrastructure. Partition enforcement is independently tested under adversarial conditions (simulated noisy-neighbour attacks) at least annually. Per-partition latency percentiles (p50, p95, p99) are tracked continuously. The organisation can demonstrate through load testing that all critical partitions maintain their service-level objectives under simultaneous peak demand from every partition.
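Tracking the p50/p95/p99 latency figures mentioned above requires only a sorted sample window. A naive nearest-rank sketch follows; production systems typically use streaming estimators (for example t-digest) instead of retaining raw windows.

```python
def percentile(samples, q):
    """Nearest-rank percentile of a latency sample window, q in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # ceil(len * q / 100) without importing math, clamped to at least rank 1.
    rank = max(1, -(-len(ordered) * q // 100))
    return ordered[int(rank) - 1]
```

Per-partition p99 against the partition's latency ceiling is the figure that demonstrates, under load testing, that service-level objectives hold at simultaneous peak demand.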

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Guaranteed Minimum Allocation Under Contention

Test 8.2: Human Override Partition Isolation

Test 8.3: Priority Preemption Under Escalating Load

Test 8.4: Real-Time Monitoring and Alerting

Test 8.5: Enforcement Event Logging Completeness

Test 8.6: Noisy-Neighbour Resilience

Test 8.7: Load-Shedding Priority Ordering
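The tests above share a common shape: drive a background workload past saturation, probe a protected partition, and assert that its guarantee holds. The toy model below contrasts an undifferentiated endpoint with a guaranteed-minimum endpoint under the Scenario C numbers; it is an illustration of the pass/fail criterion for Test 8.6, not a real harness, and all names are hypothetical. It assumes guarantees fit within capacity (the sanity condition implied by 4.1).

```python
def serve_unpartitioned(capacity, demands):
    """Proportional share: each workload gets capacity scaled by its demand,
    which is what an undifferentiated endpoint degrades to when saturated."""
    total = sum(demands.values())
    scale = min(1.0, capacity / total)
    return {name: demand * scale for name, demand in demands.items()}


def serve_partitioned(capacity, demands, guarantees):
    """Guaranteed-minimum serving: fund each guarantee first, then split
    the remaining capacity proportionally among unmet demand."""
    grants = {n: min(demands[n], guarantees.get(n, 0)) for n in demands}
    remaining = capacity - sum(grants.values())
    unmet = {n: demands[n] - grants[n] for n in demands}
    total_unmet = sum(unmet.values())
    for n in demands:
        if total_unmet > 0 and remaining > 0:
            grants[n] += unmet[n] / total_unmet * remaining
    return grants
```

With a 6,000 rps cluster, arbitrage demand of 15,000 rps, and a surveillance guarantee of 600 rps, the unpartitioned model starves surveillance to roughly 231 rps while the partitioned model holds its full 600 rps; a conforming system passes the second check, a non-conforming one fails it.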

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Supports compliance
EU AI Act | Article 15 (Accuracy, Robustness and Cybersecurity) | Direct requirement
SOX | Section 404 (Internal Controls Over Financial Reporting) | Supports compliance
FCA SYSC | 6.1.1R (Systems and Controls) | Direct requirement
NIST AI RMF | MANAGE 2.2 (Mechanisms for Tracking Risks), MANAGE 4.1 (Risk Treatments Monitored) | Supports compliance
ISO 42001 | Clause 6.1 (Actions to Address Risks and Opportunities) | Supports compliance
DORA | Article 11 (ICT Capacity and Performance Management) | Direct requirement

EU AI Act — Article 15 (Accuracy, Robustness and Cybersecurity)

Article 15 requires that high-risk AI systems are resilient against errors, faults, and inconsistencies that may occur within the system or the environment in which it operates. An inference serving architecture where a safety-critical agent can be starved of compute capacity by a co-located low-criticality workload is not resilient — it contains a foreseeable failure mode that degrades the high-risk system's accuracy and availability under normal operational conditions. Rate partitioning is a direct implementation of the resilience requirement: it ensures that the high-risk system's inference throughput is structurally guaranteed regardless of the behaviour of co-located workloads. Organisations must demonstrate that their high-risk AI systems maintain operational capability under capacity contention scenarios.

FCA SYSC — 6.1.1R (Systems and Controls)

The FCA requires firms to establish and maintain systems and controls appropriate to their business. For firms deploying AI agents in trading, surveillance, or customer-facing functions, this includes ensuring that critical AI systems are not vulnerable to capacity starvation from co-located workloads. A trading agent that operates reliably under normal conditions but becomes inoperable during market stress — precisely when it is needed most — does not satisfy the FCA's requirement for adequate systems and controls. Rate partitioning ensures that financial-value agents maintain guaranteed throughput during peak demand.

SOX — Section 404 (Internal Controls Over Financial Reporting)

AI agents involved in financial reporting, reconciliation, or transaction processing are internal controls. If these agents can be starved of inference capacity during period-end processing (when demand from other agents also peaks), the control has failed. SOX auditors will assess whether the control operates effectively under all foreseeable conditions, including capacity contention. Guaranteed rate partitions for financial reporting agents provide the structural assurance required.

DORA — Article 11 (ICT Capacity and Performance Management)

DORA Article 11 explicitly requires financial entities to ensure adequate ICT capacity to meet current and anticipated demand, and to manage performance to prevent service degradation. Rate partitioning is the AI inference equivalent of network quality-of-service controls that DORA contemplates. The requirement to monitor capacity utilisation, project demand growth, and scale proactively aligns directly with DORA's capacity management obligations.

10. Failure Severity

Field | Value
Severity Rating | Critical
Blast Radius | Infrastructure-wide — affects all agents sharing the starved inference endpoint, with disproportionate impact on the highest-criticality workloads that are denied service

Consequence chain: A workload exceeds its expected inference demand — whether through a retry storm (Scenario A), a legitimate traffic spike (Scenario B), or a volatility-driven demand surge (Scenario C). Without rate partitioning, the excess demand consumes shared inference capacity. The immediate technical failure is throughput starvation: critical workloads cannot obtain inference results within their required latency. The operational impact cascades by criticality: safety-critical agents cannot process real-time telemetry or anomaly detection (Scenario A: £340,000 material loss); financial-value agents miss execution windows or generate stale decisions (Scenario B: £417,000 trading loss); compliance surveillance agents become non-functional during high-risk periods (Scenario C: £1.8 million regulatory fine). The business consequence includes direct financial loss from missed actions, regulatory enforcement for inadequate systems and controls, mandatory independent reviews, forced remediation, and reputational damage. The failure is structurally correlated with worst-case conditions: traffic spikes, market stress, and equipment anomalies all simultaneously increase demand and increase the criticality of timely inference — meaning that the infrastructure fails precisely when reliability matters most. Without partitioning, the organisation has no structural guarantee that any workload will be served, reducing the reliability of every co-located agent to the behaviour of the least-well-governed workload on the cluster.

Cross-references: AG-004 (Action Rate Governance), AG-008 (Human Override Governance), AG-375 (Network Segmentation Governance), AG-383 (Runtime Scheduler Fairness Governance), AG-399 (Inference Pipeline Capacity Governance), AG-401 (GPU Tenancy Isolation Governance), AG-403 (Inference Endpoint High-Availability Governance).

Cite this protocol
AgentGoverning. (2026). AG-402: Model Serving Rate Partitioning Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-402