AG-402

Model Serving Rate Partitioning Governance

Infrastructure, Platform & Network · AGS v2.1 · April 2026
Regulatory tags: EU AI Act · SOX · FCA · NIST · ISO 42001

2. Summary

Model Serving Rate Partitioning Governance requires that organisations operating AI agent infrastructure partition inference serving capacity into isolated rate buckets so that one workload, tenant, or agent class cannot exhaust shared capacity and starve another critical workload of inference throughput. Without explicit partitioning, a single runaway workload — whether caused by a misconfigured retry loop, a traffic spike from a customer-facing agent, or an adversarial denial-of-service pattern — can consume the entirety of available inference capacity, rendering safety-critical agents, financial-value agents, and emergency override channels inoperable. This dimension mandates that serving capacity is structurally divided with guaranteed minimum allocations, priority-based preemption rules, and continuous monitoring to ensure that critical workloads always retain access to the inference resources they require.

3. Example

Scenario A — Runaway Retry Loop Starves Safety-Critical Agent: An organisation operates six AI agents on a shared inference cluster with a total throughput capacity of 4,800 requests per second. An enterprise workflow agent responsible for document summarisation encounters a transient upstream API failure. Its retry logic, configured with no exponential backoff and no per-agent rate ceiling, begins resubmitting failed requests at an exponential rate. Within 90 seconds the workflow agent is consuming 4,650 requests per second — 96.9% of cluster capacity. A safety-critical CPS agent responsible for monitoring industrial equipment telemetry submits an anomaly-detection inference request triggered by a temperature exceedance event. The request is queued behind 12,400 pending summarisation retries and is not served for 47 seconds. The acceptable latency for the CPS agent is 800 milliseconds. During the 47-second delay, the industrial equipment exceeds its thermal operating envelope, causing a forced shutdown that destroys £340,000 of in-process materials and triggers an HSE investigation.

What went wrong: No partition existed between the workflow agent's inference quota and the safety-critical agent's inference quota. The retry storm consumed shared capacity without limit. The safety-critical agent had no guaranteed minimum allocation and no priority preemption over lower-criticality workloads. Consequence: £340,000 material loss, HSE regulatory finding, 3-week production stoppage, and remediation costs of £185,000 for infrastructure re-architecture.
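Client-side retry discipline would have blunted Scenario A even before structural partitioning was in place. The sketch below shows capped exponential backoff with jitter; the function and exception names are illustrative, and per this dimension such client discipline is defence in depth, not a substitute for server-side partitions.

```python
import random
import time


class TransientError(Exception):
    """Raised by a call when the upstream failure is retryable."""


def retry_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a failing call with capped exponential backoff and jitter.

    Bounding the attempt count and spacing attempts out prevents a
    transient upstream failure from becoming an unbounded retry storm.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter de-synchronises clients
```

With these defaults a failing call generates at most five requests spread over several seconds, rather than the unbounded resubmission rate that consumed 96.9% of cluster capacity in Scenario A.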

Scenario B — Customer-Facing Traffic Spike Blocks Financial Trading Agent: A retail bank operates a customer-facing chatbot agent and an internal financial-value agent that executes algorithmic trading decisions on the same inference endpoint. On a Monday morning following a widely publicised interest rate announcement, customer traffic to the chatbot increases 11x from a baseline of 800 requests per second to 8,800 requests per second. The inference endpoint has a rated capacity of 9,600 requests per second. The chatbot consumes 8,800 requests per second, leaving only 800 requests per second for all other workloads. The financial-value agent requires 1,200 requests per second to maintain its trading strategy within risk parameters. Throughput-starved, the trading agent misses 3 execution windows over 14 minutes. The missed executions result in a net position drift of £2.3 million. The bank's risk desk manually unwinds the position at a realised loss of £417,000 and the FCA opens an inquiry into whether the firm's systems and controls were adequate.

What went wrong: The chatbot and the trading agent shared an undifferentiated inference endpoint. No rate partition guaranteed the trading agent a minimum allocation of 1,200 requests per second. The chatbot's traffic surge was legitimate but consumed capacity needed by a higher-criticality workload. The firm had no priority preemption mechanism to protect financial-value inference. Consequence: £417,000 trading loss, FCA inquiry, mandatory independent systems review costing £290,000.

Scenario C — Crypto Arbitrage Bot Monopolises Shared GPU Cluster: A crypto exchange operates a shared inference cluster serving three agents: a customer-facing support agent (Tier 2), a compliance surveillance agent (Tier 1), and a crypto arbitrage execution agent (Tier 1). The arbitrage agent's request volume is proportional to market volatility. During a flash crash event, volatility spikes 40x, and the arbitrage agent submits 15,000 requests per second against a cluster capacity of 6,000 requests per second. The cluster queue saturates. The compliance surveillance agent, which monitors for wash trading and market manipulation during exactly these high-volatility periods, is unable to process inference requests for 8 minutes. During those 8 minutes, three wash-trading patterns that the surveillance agent would have flagged go undetected. A post-incident review by the financial regulator reveals that the exchange's surveillance system was non-functional during the period of highest risk. The regulator imposes a £1.8 million fine for surveillance system inadequacy and requires the exchange to engage an independent auditor at a cost of £220,000.

What went wrong: The arbitrage agent and the compliance surveillance agent shared capacity with no partitioning. The compliance agent — whose function is most critical during high-volatility events — was starved precisely when it was needed most. No guaranteed minimum allocation existed for the compliance workload, and no preemption mechanism prioritised surveillance over trading execution. Consequence: £1.8 million regulatory fine, £220,000 audit costs, mandatory remediation, reputational damage.

4. Requirement Statement

Scope: This dimension applies to any organisation where two or more AI agent workloads share inference serving infrastructure — whether a shared GPU cluster, a multi-tenant inference endpoint, a load-balanced pool of model servers, or a cloud-hosted inference API with a shared quota. The scope includes all layers of the inference serving stack where contention can occur: request admission, queue management, compute scheduling, memory allocation, and network bandwidth. The dimension applies regardless of whether the agents are operated by a single team or by multiple teams within the same organisation. It also applies to organisations consuming third-party inference services where multiple internal workloads share a single API key, subscription tier, or endpoint. The critical test is: can one agent workload's demand reduce the inference throughput available to another agent workload? If yes, this dimension applies in full.

4.1. A conforming system MUST define named rate partitions for each distinct criticality class of agent workload, with each partition assigned a guaranteed minimum inference throughput allocation expressed in quantifiable units (requests per second, tokens per second, or equivalent capacity metric).
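Clause 4.1 lends itself to a small, machine-checkable configuration. A sketch follows; the partition names, priorities, and throughput figures are purely illustrative, not prescribed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatePartition:
    name: str
    priority: int        # lower number = higher criticality
    guaranteed_rps: int  # minimum throughput this partition always receives (4.1)
    max_rps: int         # hard ceiling under contention (4.2)


# Hypothetical layout for a 6,000 rps cluster; numbers are illustrative only.
PARTITIONS = [
    RatePartition("human-override", priority=0, guaranteed_rps=200, max_rps=400),
    RatePartition("safety-critical", priority=1, guaranteed_rps=1500, max_rps=3000),
    RatePartition("financial-value", priority=2, guaranteed_rps=1200, max_rps=2400),
    RatePartition("general-workflow", priority=3, guaranteed_rps=500, max_rps=6000),
]

CLUSTER_CAPACITY_RPS = 6000

# Sanity check: the guaranteed minimums must fit inside total capacity,
# otherwise the guarantees in 4.1 are not simultaneously satisfiable.
assert sum(p.guaranteed_rps for p in PARTITIONS) <= CLUSTER_CAPACITY_RPS
```

Keeping the layout in reviewable configuration rather than ad-hoc gateway settings also produces the documentation artefact that auditors will ask for.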

4.2. A conforming system MUST enforce partition boundaries such that no workload can consume more than its allocated maximum throughput under contention conditions, even if the workload submits requests exceeding its allocation.
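One common mechanism for the ceiling in 4.2 is a token bucket at the admission layer, refilled at the partition's maximum rate. A minimal sketch, with an injectable clock so enforcement can be tested deterministically:

```python
import time


class TokenBucket:
    """Admission-control bucket: refills at `rate` tokens/s up to `burst`.

    A request is admitted only if a token is available, so a workload
    submitting more than its ceiling has its excess requests denied
    regardless of how aggressively it retries.
    """

    def __init__(self, rate: float, burst: float, now=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.now = now
        self.last = now()

    def try_admit(self, cost: float = 1.0) -> bool:
        t = self.now()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The gateway would hold one bucket per partition; a denial here is an enforcement event that must be logged under 4.7.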

4.3. A conforming system MUST guarantee that higher-criticality partitions receive their minimum allocation before lower-criticality partitions receive any allocation beyond their own minimum, implementing priority-based preemption when total demand exceeds total capacity.
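The ordering rule in 4.3 reduces to a two-pass allocation: fund every partition's guaranteed minimum from highest to lowest priority, then hand out any surplus in the same order. A sketch under the assumption that demand and guarantees share the same rps units:

```python
def allocate(capacity, partitions):
    """partitions: iterable of (name, priority, guaranteed, demand) tuples,
    lower priority number = higher criticality. Returns {name: allocation}.
    Minimums are funded in priority order; only then is surplus granted,
    again in priority order (per 4.3)."""
    ordered = sorted(partitions, key=lambda p: p[1])
    alloc = {}
    remaining = capacity
    # Pass 1: guaranteed minimums, capped by actual demand.
    for name, _, guaranteed, demand in ordered:
        grant = min(guaranteed, demand, remaining)
        alloc[name] = grant
        remaining -= grant
    # Pass 2: surplus capacity to unmet demand, highest priority first.
    for name, _, guaranteed, demand in ordered:
        extra = min(demand - alloc[name], remaining)
        alloc[name] += extra
        remaining -= extra
    return alloc
```

Replaying Scenario B through this allocator (capacity 9,600 rps, chatbot demanding 8,800, trading agent guaranteed and demanding 1,200) yields the intended outcome: the trading agent keeps its full 1,200 rps and the chatbot is trimmed to 8,400 rps.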

4.4. A conforming system MUST reserve a dedicated partition for human override and emergency intervention channels (per AG-008) that cannot be consumed by any automated agent workload under any conditions, including total capacity saturation.

4.5. A conforming system MUST monitor per-partition utilisation, queue depth, and latency in real time, generating alerts when any partition's utilisation exceeds 80% of its allocation or when any request in a critical partition exceeds its defined latency threshold.
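The 80% threshold and latency checks in 4.5 are straightforward predicates over per-partition metrics. A sketch, with hypothetical metric field names (a real deployment would wire this into its monitoring stack rather than poll dictionaries):

```python
def partition_alerts(samples, utilisation_threshold=0.8):
    """samples: list of dicts with name, observed_rps, allocated_rps,
    worst_latency_ms, latency_slo_ms. Returns alert strings per 4.5."""
    alerts = []
    for s in samples:
        utilisation = s["observed_rps"] / s["allocated_rps"]
        if utilisation > utilisation_threshold:
            alerts.append(f"{s['name']}: utilisation {utilisation:.0%}")
        if s["worst_latency_ms"] > s["latency_slo_ms"]:
            alerts.append(f"{s['name']}: latency SLO breached")
    return alerts
```

Alerting at 80% rather than at saturation is the point: it buys operators headroom to intervene before the starvation patterns in Scenarios A-C take hold.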

4.6. A conforming system MUST reject or shed load from lower-criticality partitions before any request in a higher-criticality partition experiences degradation beyond its defined service-level objective, implementing explicit load-shedding rules with documented priority ordering.
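The documented priority ordering in 4.6 can be as simple as a stable sort over pending requests: keep up to capacity, shed the rest, and report what was shed so it can be logged under 4.7. A toy sketch:

```python
def shed_load(requests, capacity):
    """requests: list of (request_id, priority) pairs in arrival order,
    lower priority number = more critical. Keeps at most `capacity`
    requests, shedding the lowest-criticality ones first; ties are
    broken by arrival order because Python's sort is stable.
    Returns (kept, shed); shed entries must be logged per 4.7."""
    ordered = sorted(requests, key=lambda r: r[1])
    return ordered[:capacity], ordered[capacity:]
```

Real schedulers shed continuously rather than in batches, but the invariant is the same: no higher-criticality request is dropped while a lower-criticality request is still being served.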

4.7. A conforming system MUST log all partition enforcement events — admission denials, preemptions, load-shedding decisions, and capacity reallocation actions — with timestamps, affected workloads, and the enforcement rule that triggered the action.

4.8. A conforming system SHOULD implement dynamic partition resizing that adjusts guaranteed minimums based on time-of-day demand patterns, active agent count, and observed utilisation, while maintaining an absolute floor below which no critical partition's allocation may fall.

4.9. A conforming system SHOULD implement burst allowances that permit workloads to temporarily exceed their guaranteed allocation when other partitions have unused capacity, without compromising the guaranteed minimums of any partition.

4.10. A conforming system SHOULD conduct quarterly capacity planning reviews that project per-partition demand growth and trigger infrastructure scaling before any partition's headroom falls below a defined threshold (recommended: 25% of allocated capacity).

4.11. A conforming system MAY implement cross-region or cross-cluster partition failover, where a partition that exhausts local capacity can overflow to a secondary cluster while maintaining its priority class and latency constraints.

5. Rationale

Shared inference infrastructure is the economic norm for AI agent deployments. Dedicated inference clusters for every individual agent would be prohibitively expensive — a single high-end GPU node serving a large language model costs between £15,000 and £45,000 per month. Organisations therefore share inference capacity across multiple agents, accepting the engineering complexity of multi-tenant serving in exchange for cost efficiency. This sharing creates a fundamental risk: contention. When total demand exceeds total capacity, some workloads must wait or be denied service. Without explicit governance of which workloads receive priority, the outcome is determined by whichever workload happens to arrive first or submit the most requests — a race condition that systematically favours high-volume workloads over high-criticality workloads.

The risk is not theoretical. Capacity contention underlies a large share of multi-tenant cloud service outages. The pattern is consistent: a noisy neighbour — a single workload with unexpectedly high or unbounded demand — consumes shared resources, degrading or denying service to co-located workloads. In agent governance terms, the noisy-neighbour problem is particularly dangerous because the workloads most likely to generate traffic spikes (customer-facing agents during demand surges, trading agents during market volatility, retry storms from transient failures) are often not the workloads with the highest governance criticality (safety-critical agents, compliance surveillance agents, human override channels).

Regulatory frameworks reinforce this requirement from multiple angles. The EU AI Act Article 15 requires that high-risk AI systems maintain accuracy and robustness under foreseeable operating conditions — a condition that cannot be met if the system's inference infrastructure is vulnerable to starvation by co-located workloads. The FCA's SYSC rules require firms to maintain effective systems and controls; an inference endpoint that serves a trading agent reliably under light load but fails under contention is not an effective system. DORA (Digital Operational Resilience Act) Article 11 requires ICT capacity management that prevents service degradation from resource contention — a direct mandate for rate partitioning. SOX Section 404 requires that internal controls over financial reporting remain effective; if an AI agent generating financial reports is starved of inference capacity during period-end processing, the control has failed.

The governance requirement is structural, not behavioural. It is not sufficient to rely on well-behaved workloads or manual intervention during incidents. The organisation must implement infrastructure-level enforcement that guarantees critical workloads receive their required inference throughput regardless of what other workloads do. This is the operational equivalent of quality-of-service partitioning in network infrastructure — a mature engineering discipline now being applied to AI inference serving.

6. Implementation Guidance

Model Serving Rate Partitioning Governance requires infrastructure-level mechanisms that structurally guarantee inference throughput for critical workloads. The core principle is that criticality determines allocation priority, not arrival order or request volume.

Recommended patterns:

- Enforce partition ceilings at the inference gateway (request admission), not only in client code, so that a misbehaving workload cannot bypass the control.
- Fund guaranteed minimums in strict priority order, then distribute surplus capacity work-conservingly through burst allowances (4.9).
- Reserve a standing partition for human override and emergency channels (4.4) and exercise it regularly.
- Alert on per-partition utilisation, queue depth, and latency before saturation (4.5), not after an incident.
- Size partitions for correlated peak demand; the scenarios above show that demand spikes and criticality spikes tend to coincide.

Anti-patterns to avoid:

- A single undifferentiated endpoint, API key, or quota shared by workloads of different criticality (the root cause in all three scenarios).
- First-come-first-served queuing under contention, which systematically favours high-volume workloads over high-criticality ones.
- Relying on well-behaved clients (backoff, self-throttling) as the only control; client discipline is defence in depth, not a structural guarantee.
- Treating manual intervention during incidents as the protection mechanism for critical workloads.

Industry Considerations

Financial Services. Trading agents and compliance surveillance agents must be in the highest non-override partition tier. Firms should model worst-case concurrent demand during market stress events (when both trading volume and surveillance demand spike simultaneously) and size partitions to accommodate that concurrency. The FCA expects firms to demonstrate that their systems can handle peak load without degradation of critical functions.

Healthcare and Safety-Critical. Patient monitoring agents, clinical decision support agents, and CPS control agents must have guaranteed partitions with latency ceilings (e.g., 200 milliseconds for clinical alerts). These partitions should be sized for peak concurrent patient load, not average load. Failover to a secondary inference cluster should be automatic if the primary cluster's safety-critical partition exceeds its latency ceiling.

Crypto and Web3. Market surveillance and compliance agents must maintain guaranteed capacity independent of trading bot activity. Flash crash events simultaneously increase trading bot demand and compliance surveillance demand — the partition architecture must accommodate both peaks concurrently.

Cross-Border Deployments. Organisations serving agents across jurisdictions must consider per-region partition requirements. A European compliance agent subject to EU AI Act obligations requires its partition guarantee to be met within European infrastructure, independent of demand from agents serving other regions.

Maturity Model

Basic Implementation — The organisation has defined named rate partitions for at least two criticality tiers (critical and non-critical). Guaranteed minimum allocations are documented and enforced at the inference gateway. A dedicated partition for human override channels exists and is tested monthly. Per-partition utilisation is monitored and alerts fire when utilisation exceeds 80%. Load shedding rejects lower-priority requests before higher-priority requests are degraded. All enforcement events are logged.

Intermediate Implementation — All basic capabilities plus: three or more partition tiers reflecting the full criticality spectrum. Burst sharing allows underutilised partition capacity to serve other tiers without compromising guarantees. Priority-aware queuing at the backend scheduler ensures GPU compute respects partition priorities. Circuit-breaker integration with AG-004 throttles workloads at the agent level before partition ceilings are reached. Quarterly capacity planning reviews project demand growth per partition.

Advanced Implementation — All intermediate capabilities plus: dynamic partition resizing adjusts allocations based on real-time demand patterns and time-of-day profiles. Cross-cluster partition failover ensures that a critical partition can overflow to secondary infrastructure. Partition enforcement is independently tested under adversarial conditions (simulated noisy-neighbour attacks) at least annually. Per-partition latency percentiles (p50, p95, p99) are tracked continuously. The organisation can demonstrate through load testing that all critical partitions maintain their service-level objectives under simultaneous peak demand from every partition.
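Tracking the p50/p95/p99 latency figures mentioned above requires only a sorted sample window. A naive nearest-rank sketch follows; production systems typically use streaming estimators (for example t-digest) instead of retaining raw windows.

```python
def percentile(samples, q):
    """Nearest-rank percentile of a latency sample window, q in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # ceil(len * q / 100) without importing math, clamped to at least rank 1.
    rank = max(1, -(-len(ordered) * q // 100))
    return ordered[int(rank) - 1]
```

Per-partition p99 against the partition's latency ceiling is the figure that demonstrates, under load testing, that service-level objectives hold at simultaneous peak demand.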

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Guaranteed Minimum Allocation Under Contention

Test 8.2: Human Override Partition Isolation

Test 8.3: Priority Preemption Under Escalating Load

Test 8.4: Real-Time Monitoring and Alerting

Test 8.5: Enforcement Event Logging Completeness

Test 8.6: Noisy-Neighbour Resilience

Test 8.7: Load-Shedding Priority Ordering
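The tests above share a common shape: drive a background workload past saturation, probe a protected partition, and assert that its guarantee holds. The toy model below contrasts an undifferentiated endpoint with a guaranteed-minimum endpoint under the Scenario C numbers; it is an illustration of the pass/fail criterion for Test 8.6, not a real harness, and all names are hypothetical. It assumes guarantees fit within capacity (the sanity condition implied by 4.1).

```python
def serve_unpartitioned(capacity, demands):
    """Proportional share: each workload gets capacity scaled by its demand,
    which is what an undifferentiated endpoint degrades to when saturated."""
    total = sum(demands.values())
    scale = min(1.0, capacity / total)
    return {name: demand * scale for name, demand in demands.items()}


def serve_partitioned(capacity, demands, guarantees):
    """Guaranteed-minimum serving: fund each guarantee first, then split
    the remaining capacity proportionally among unmet demand."""
    grants = {n: min(demands[n], guarantees.get(n, 0)) for n in demands}
    remaining = capacity - sum(grants.values())
    unmet = {n: demands[n] - grants[n] for n in demands}
    total_unmet = sum(unmet.values())
    for n in demands:
        if total_unmet > 0 and remaining > 0:
            grants[n] += unmet[n] / total_unmet * remaining
    return grants
```

With a 6,000 rps cluster, arbitrage demand of 15,000 rps, and a surveillance guarantee of 600 rps, the unpartitioned model starves surveillance to roughly 231 rps while the partitioned model holds its full 600 rps; a conforming system passes the second check, a non-conforming one fails it.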

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Supports compliance
EU AI Act | Article 15 (Accuracy, Robustness and Cybersecurity) | Direct requirement
SOX | Section 404 (Internal Controls Over Financial Reporting) | Supports compliance
FCA SYSC | 6.1.1R (Systems and Controls) | Direct requirement
NIST AI RMF | MANAGE 2.2 (Mechanisms for Tracking Risks), MANAGE 4.1 (Risk Treatments Monitored) | Supports compliance
ISO 42001 | Clause 6.1 (Actions to Address Risks and Opportunities) | Supports compliance
DORA | Article 11 (ICT Capacity and Performance Management) | Direct requirement

EU AI Act — Article 15 (Accuracy, Robustness and Cybersecurity)

Article 15 requires that high-risk AI systems are resilient against errors, faults, and inconsistencies that may occur within the system or the environment in which it operates. An inference serving architecture where a safety-critical agent can be starved of compute capacity by a co-located low-criticality workload is not resilient — it contains a foreseeable failure mode that degrades the high-risk system's accuracy and availability under normal operational conditions. Rate partitioning is a direct implementation of the resilience requirement: it ensures that the high-risk system's inference throughput is structurally guaranteed regardless of the behaviour of co-located workloads. Organisations must demonstrate that their high-risk AI systems maintain operational capability under capacity contention scenarios.

FCA SYSC — 6.1.1R (Systems and Controls)

The FCA requires firms to establish and maintain systems and controls appropriate to their business. For firms deploying AI agents in trading, surveillance, or customer-facing functions, this includes ensuring that critical AI systems are not vulnerable to capacity starvation from co-located workloads. A trading agent that operates reliably under normal conditions but becomes inoperable during market stress — precisely when it is needed most — does not satisfy the FCA's requirement for adequate systems and controls. Rate partitioning ensures that financial-value agents maintain guaranteed throughput during peak demand.

SOX — Section 404 (Internal Controls Over Financial Reporting)

AI agents involved in financial reporting, reconciliation, or transaction processing are internal controls. If these agents can be starved of inference capacity during period-end processing (when demand from other agents also peaks), the control has failed. SOX auditors will assess whether the control operates effectively under all foreseeable conditions, including capacity contention. Guaranteed rate partitions for financial reporting agents provide the structural assurance required.

DORA — Article 11 (ICT Capacity and Performance Management)

DORA Article 11 explicitly requires financial entities to ensure adequate ICT capacity to meet current and anticipated demand, and to manage performance to prevent service degradation. Rate partitioning is the AI inference equivalent of network quality-of-service controls that DORA contemplates. The requirement to monitor capacity utilisation, project demand growth, and scale proactively aligns directly with DORA's capacity management obligations.

10. Failure Severity

Field | Value
Severity Rating | Critical
Blast Radius | Infrastructure-wide — affects all agents sharing the starved inference endpoint, with disproportionate impact on the highest-criticality workloads that are denied service

Consequence chain: A workload exceeds its expected inference demand — whether through a retry storm (Scenario A), a legitimate traffic spike (Scenario B), or a volatility-driven demand surge (Scenario C). Without rate partitioning, the excess demand consumes shared inference capacity. The immediate technical failure is throughput starvation: critical workloads cannot obtain inference results within their required latency. The operational impact cascades by criticality: safety-critical agents cannot process real-time telemetry or anomaly detection (Scenario A: £340,000 material loss); financial-value agents miss execution windows or generate stale decisions (Scenario B: £417,000 trading loss); compliance surveillance agents become non-functional during high-risk periods (Scenario C: £1.8 million regulatory fine). The business consequence includes direct financial loss from missed actions, regulatory enforcement for inadequate systems and controls, mandatory independent reviews, forced remediation, and reputational damage. The failure is structurally correlated with worst-case conditions: traffic spikes, market stress, and equipment anomalies all simultaneously increase demand and increase the criticality of timely inference — meaning that the infrastructure fails precisely when reliability matters most. Without partitioning, the organisation has no structural guarantee that any workload will be served, reducing the reliability of every co-located agent to the behaviour of the least-well-governed workload on the cluster.

Cross-references: AG-004 (Action Rate Governance), AG-008 (Human Override Governance), AG-375 (Network Segmentation Governance), AG-383 (Runtime Scheduler Fairness Governance), AG-399 (Inference Pipeline Capacity Governance), AG-401 (GPU Tenancy Isolation Governance), AG-403 (Inference Endpoint High-Availability Governance).

Cite this protocol
AgentGoverning. (2026). AG-402: Model Serving Rate Partitioning Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-402