Capability-Threshold Gating and Responsible Scaling

Meta-Governance & Assurance · AGS v2.1 · 2026-06-06

EU AI Act NIST AI RMF ISO 42001

AGS Frontier Safety | Meta-Governance & Assurance | Version 2.2

1. Definition

Capability-Threshold Gating governs a pre-committed policy that ties an agent's permitted deployment, autonomy, and access to *evaluated capability thresholds*: as an agent demonstrates more dangerous capability, progressively stronger safeguards become mandatory, and deployment is blocked until they are in place.

This is the meta-governance control that consumes the outputs of the frontier-safety evaluations (AG-797 scheming, AG-798 sabotage, AG-799 corrigibility, AG-802 dangerous-capability) and converts them into binding deployment decisions through a documented, tiered scaling policy — the standard's analogue of frontier-lab "responsible scaling" / "preparedness" frameworks and the EU AI Act systemic-risk tiering.

2. Scope

In scope: the capability-tier classification scheme; the mapping from evaluated capability to required safeguards; the pre-commitment to block deployment when thresholds are crossed without the corresponding controls; re-evaluation triggers.

Out of scope: the individual evaluations that measure capability (AG-797/798/799/802/806-elicitation), and runtime enforcement of granted limits. This dimension governs *the gating policy and its application*.

3. Why This Matters

Capability grows with model upgrades, fine-tuning, tool access, and scaffolding — often between formal reviews. Without a pre-committed mapping from capability to safeguards, organisations make ad-hoc, deadline-pressured deployment decisions and discover too late that an agent crossed a dangerous threshold. A capability-threshold policy makes the safety bar rise automatically with capability and removes discretion to ship past it, which is exactly the structure regulators expect for systemic-risk models.

4. Requirements

R1: Organisations deploying frontier/high-autonomy agents MUST define and document a capability-tier scheme with explicit, evaluable thresholds (e.g. autonomy, offensive-cyber, CBRN-uplift, self-proliferation, deception).
R2: For each tier the organisation MUST pre-specify the required safeguards (evaluation rigour, oversight, containment, access limits) that gate deployment at that tier.
R3: Deployment or autonomy escalation MUST be blocked until the safeguards required for the agent's evaluated tier are implemented and verified.
R4: Evaluated capability MUST be re-assessed on defined triggers: base-model change, fine-tune, new tool/connector grant, scaffolding change, or a defined maximum elapsed interval, whichever comes first.
R5: Capability estimates used for gating MUST be sandbag-resistant lower bounds (per AG-800) and MUST incorporate the scheming/sabotage/corrigibility findings.
R6: The gating decision and its evidence (which tier, which thresholds, which safeguards, who approved) MUST be recorded and independently reviewed.
R7: The policy MUST define a conservative default (treat as higher-risk) when capability is uncertain or evaluation is incomplete.
R8: Crossing a threshold without the required safeguards MUST trigger a defined response (pause, roll back, or restrict autonomy), not a discretionary waiver.
R9: For systemic-risk models, the scheme and threshold determinations SHOULD be disclosed to relevant authorities (aligned to EU AI Act Art. 51/55).
R10: The capability-tier scheme MUST be reviewed and updated as the threat landscape and model capabilities evolve.
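R2 and R3 can be illustrated with a minimal sketch. The tier names, safeguard labels, and the `gate_deployment` function are hypothetical and chosen only for illustration; the protocol does not prescribe a particular scheme or data model.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical capability tiers; a higher value means more dangerous capability."""
    T0 = 0
    T1 = 1
    T2 = 2
    T3 = 3


# Assumed tier-to-safeguard mapping (R2). Safeguards are cumulative in this sketch.
REQUIRED_SAFEGUARDS = {
    Tier.T0: frozenset(),
    Tier.T1: frozenset({"baseline_evals"}),
    Tier.T2: frozenset({"baseline_evals", "human_oversight", "access_limits"}),
    Tier.T3: frozenset({"baseline_evals", "human_oversight", "access_limits",
                        "containment", "independent_red_team"}),
}


@dataclass(frozen=True)
class GateDecision:
    tier: Tier
    allowed: bool
    missing: frozenset  # safeguards required for the tier but not yet verified


def gate_deployment(tier: Tier, verified_safeguards: set) -> GateDecision:
    """Block deployment unless every safeguard for the evaluated tier is verified (R3)."""
    missing = REQUIRED_SAFEGUARDS[tier] - frozenset(verified_safeguards)
    return GateDecision(tier=tier, allowed=not missing, missing=missing)


decision = gate_deployment(Tier.T2, {"baseline_evals", "human_oversight"})
print(decision.allowed, sorted(decision.missing))  # False ['access_limits']
```

Returning the missing safeguards alongside the decision gives the reviewer in R6 the evidence for why a deployment was blocked, rather than a bare yes/no.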

5. Maturity Model

Basic: A documented capability-tier scheme exists and is consulted before deployment.
Intermediate: Thresholds gate deployment with pre-specified safeguards, re-evaluation triggers are enforced, gating decisions are recorded and independently reviewed, and a conservative default applies under uncertainty.
Advanced: Gating integrates sandbag-resistant evaluations and scheming/sabotage/corrigibility findings, automated triggers detect capability change, threshold crossings force non-discretionary responses, and systemic-risk determinations are disclosed to authorities.

6. Test Criteria

Test 6.1: Threshold-Gated Deployment

Stimulus: Attempt to deploy/escalate an agent whose evaluated tier requires safeguards not yet implemented.
Expected: Deployment is blocked until the required safeguards are in place and verified.
Fail: Deployment proceeds despite missing tier-required safeguards.

Test 6.2: Re-Evaluation Trigger

Stimulus: Grant the agent a new high-impact tool/connector.
Expected: A capability re-evaluation is triggered and gating re-applied before the new capability is used in production.
Fail: New capability is used without re-evaluation.
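One way to detect the R4 triggers exercised by this test is to compare the agent's current configuration with a snapshot taken at its last evaluation. This is a sketch under assumptions: the `AgentSnapshot` fields, the trigger names, and the 90-day interval are all illustrative, and the protocol does not fix a re-evaluation interval.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta


@dataclass(frozen=True)
class AgentSnapshot:
    """Configuration captured at the last capability evaluation (illustrative fields)."""
    base_model: str
    fine_tune_id: str
    tools: frozenset
    scaffold_version: str
    evaluated_at: datetime


MAX_EVAL_AGE = timedelta(days=90)  # assumed value, not taken from the protocol


def fired_triggers(last: AgentSnapshot, current: AgentSnapshot, now: datetime) -> list:
    """Return the names of every re-evaluation trigger that has fired since `last`."""
    triggers = []
    if current.base_model != last.base_model:
        triggers.append("base_model_change")
    if current.fine_tune_id != last.fine_tune_id:
        triggers.append("fine_tune")
    if current.tools - last.tools:  # any tool not present at evaluation time
        triggers.append("new_tool_grant")
    if current.scaffold_version != last.scaffold_version:
        triggers.append("scaffolding_change")
    if now - last.evaluated_at > MAX_EVAL_AGE:
        triggers.append("elapsed_time")
    return triggers


evaluated = AgentSnapshot("model-a", "ft-1", frozenset({"search"}), "s1",
                          datetime(2026, 1, 1))
proposed = replace(evaluated, tools=frozenset({"search", "payments"}))
print(fired_triggers(evaluated, proposed, now=datetime(2026, 2, 1)))  # ['new_tool_grant']
```

A non-empty result would hold the change out of production until re-evaluation and gating have been re-applied.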

Test 6.3: Conservative Default

Stimulus: Present an agent with incomplete capability evaluation.
Expected: The policy treats it at the higher-risk tier until evaluated.
Fail: The agent is treated as low-risk by default.
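The conservative default in R7 can be sketched as a tier classifier that treats any missing evaluation as the highest tier. The capability domains, score scale, and cut-off values below are assumptions made for illustration; a real scheme would define its own evaluable thresholds under R1.

```python
from typing import Mapping, Optional

# Hypothetical thresholds: per domain, ascending (score cut-off, tier) pairs.
THRESHOLDS = {
    "autonomy": [(0.3, 1), (0.6, 2), (0.8, 3)],
    "offensive_cyber": [(0.2, 1), (0.5, 2), (0.7, 3)],
    "deception": [(0.4, 1), (0.7, 2), (0.9, 3)],
}
HIGHEST_TIER = 3


def tier_for_domain(domain: str, score: Optional[float]) -> int:
    """Map one domain score to a tier; a missing score is treated as highest risk."""
    if score is None:
        return HIGHEST_TIER
    tier = 0
    for cutoff, candidate in THRESHOLDS[domain]:
        if score >= cutoff:
            tier = candidate
    return tier


def evaluated_tier(scores: Mapping[str, Optional[float]]) -> int:
    """The overall tier is the worst (highest) tier across all gated domains."""
    return max(tier_for_domain(d, scores.get(d)) for d in THRESHOLDS)


print(evaluated_tier({"autonomy": 0.35, "offensive_cyber": 0.1, "deception": 0.2}))  # 1
print(evaluated_tier({"autonomy": 0.35, "offensive_cyber": 0.1}))  # 3 (deception not evaluated)
```

Iterating over the scheme's domains rather than the supplied scores is the design point: an evaluation that was never run cannot silently lower the tier.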

7. Scoring

Score	Criteria
0	No capability-tier scheme; deployment decisions are ad hoc
1	A capability-tier scheme is documented but does not bindingly gate deployment
2	Thresholds gate deployment with pre-specified safeguards, enforced re-evaluation triggers, and conservative defaults
3	As level 2 plus sandbag-resistant inputs, automated change detection, non-discretionary threshold responses, and authority disclosure
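The scoring table can be read as a cumulative check, sketched below. The `Evidence` field names are hypothetical labels for the criteria in the table; how an assessor establishes each one is outside this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Assessor findings; each field is a hypothetical label for one table criterion."""
    scheme_documented: bool = False
    thresholds_gate_deployment: bool = False
    reevaluation_enforced: bool = False
    conservative_default: bool = False
    sandbag_resistant_inputs: bool = False
    automated_change_detection: bool = False
    non_discretionary_responses: bool = False
    authority_disclosure: bool = False


def score(e: Evidence) -> int:
    """Return the 0-3 score; each level requires all criteria of the levels below it."""
    if not e.scheme_documented:
        return 0
    level2 = (e.thresholds_gate_deployment and e.reevaluation_enforced
              and e.conservative_default)
    if not level2:
        return 1
    level3 = (e.sandbag_resistant_inputs and e.automated_change_detection
              and e.non_discretionary_responses and e.authority_disclosure)
    return 3 if level3 else 2


print(score(Evidence(scheme_documented=True)))  # 1
```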

8. Failure Scenarios

Scenario A — Silent Capability Jump: A base-model upgrade materially increases offensive-cyber capability, but no re-evaluation trigger fires, so the agent keeps its prior autonomy grant without the safeguards its new, higher tier requires.

Scenario B — Deadline Waiver: An evaluation places an agent above the autonomy threshold, but a manager grants a one-off waiver to meet a launch date. Because the response to a threshold crossing was left to discretion, the agent shipped with the safeguard gap.

Scenario C — Optimistic Default: An agent with incomplete evaluation is deployed as "probably fine". It later exhibits a gated capability. A conservative default would have held it at the higher tier pending evaluation.
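Scenario B is the case R8 is meant to rule out. A minimal sketch of a non-discretionary response follows; the rule for choosing between pause, roll back, and restrict autonomy is an assumption made here, since the protocol only requires that some defined response exists. The function and parameter names are hypothetical.

```python
from enum import Enum


class Response(Enum):
    NONE = "none"
    PAUSE = "pause"
    ROLL_BACK = "roll_back"
    RESTRICT_AUTONOMY = "restrict_autonomy"


def threshold_response(missing_safeguards: set,
                       deployed: bool,
                       has_gated_prior_release: bool,
                       waiver_requested: bool = False) -> Response:
    """Choose the defined response to a threshold crossing (R8).

    `waiver_requested` is accepted only so a caller can log it; it is
    deliberately never consulted, so no waiver can change the outcome.
    """
    if not missing_safeguards:
        return Response.NONE
    if not deployed:
        return Response.PAUSE  # hold the launch until safeguards are verified
    if has_gated_prior_release:
        return Response.ROLL_BACK  # return to the last release that passed gating
    return Response.RESTRICT_AUTONOMY


result = threshold_response({"containment"}, deployed=True,
                            has_gated_prior_release=False, waiver_requested=True)
print(result.name)  # RESTRICT_AUTONOMY
```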

9. Regulatory Mapping

Requirement	EU AI Act	NIST AI RMF	ISO 42001
R1: Capability-tier scheme with thresholds	Art. 51 — Systemic-risk classification	GOVERN 1.3 — Risk-based activity levels	Clause 6.1 — Actions to address risk
R2: Tier-mapped required safeguards	Art. 55 — Risk mitigation	GOVERN 1.3 — Risk-based activity levels	Clause 6.1 — Actions to address risk
R3: Deployment blocked without safeguards	Art. 55 — Risk mitigation	MANAGE 1.3 — High-priority response	Clause 8.1 — Operational control
R4: Re-evaluation triggers	Art. 55 — Ongoing evaluation	MANAGE 4.1 — Post-deployment monitoring	Clause 9.1 — Monitoring and measurement
R5: Sandbag-resistant capability inputs	Art. 55 — Model evaluation	MEASURE 2.13 — TEVV effectiveness	Clause 8.3 — AI risk treatment
R6: Recorded, independently-reviewed decision	Art. 51 — Determination/notification	GOVERN 2.1 — Accountability	Clause 9.3 — Management review
R8: Non-discretionary threshold response	Art. 55 — Risk mitigation	MANAGE 1.3 — High-priority response	Clause 8.1 — Operational control
R9: Disclosure for systemic-risk models	Art. 51/52 — Notification	GOVERN 4.3 — Information sharing	—

EU AI Act — Article 51 and Article 55

Articles 51–52 classify GPAI models with systemic risk (including by capability/compute thresholds) and require notification; Article 55 imposes heightened obligations once classified. AG-801 is the internal control that implements capability tiering and binds it to safeguards and deployment decisions.

NIST AI RMF — GOVERN 1.3, MANAGE 1.3

GOVERN 1.3 sets risk-management activity levels by risk tolerance — a capability-tiered scaling policy is the concrete expression. MANAGE 1.3 requires planned responses to high-priority risk, which threshold crossings invoke.

ISO 42001 — Clause 6.1, Clause 8.1

Clause 6.1 (actions to address risks and opportunities) and Clause 8.1 (operational control) require risk-proportionate, controlled deployment — exactly what capability-threshold gating provides.

10. Related Protocols

AG-797 / AG-798 / AG-799 (scheming / sabotage / corrigibility evaluations) — inputs to the gating decision
AG-800 (Evaluation-Gaming and Sandbagging Detection) — ensures capability inputs are lower bounds
AG-748 (Dangerous Knowledge Uplift Prevention) — a capability whose threshold gates deployment
AG-792 (Assurance Framework Compliance) — assurance tier evidences gating conformance

Cite this protocol

AgentGoverning. (2026). AG-801: Capability-Threshold Gating and Responsible Scaling. The Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-801
