AI-R&D Capability Tripwire

Meta-Governance & Assurance · AGS v2.1 · 2026-06-06

EU AI Act NIST AI RMF ISO 42001

AGS Frontier Autonomy (Group K) | Meta-Governance & Assurance | Version 3.0

1. Definition

AI-R&D Capability Tripwire governs a pre-committed, evaluable threshold — a "critical capability level" — at which an agent's ability to materially automate or accelerate AI research and development triggers heightened safeguards, regardless of whether such use is intended.

The ability to automate AI R&D is the capability most likely to produce rapid, compounding capability gains. Frontier-safety frameworks treat it as a distinct critical threshold because crossing it changes the risk landscape for every other control. This dimension requires that the threshold be defined, that agents be evaluated against it (AG-802), and that it be bound to mandatory mitigations before it is crossed.

2. Scope

In scope: defining the AI-R&D / ML-acceleration capability threshold; evaluating agents against it; pre-committing the mitigations and gating that apply when it is crossed; applying the threshold to internal (non-released) use, not only to external deployment.

Out of scope: the self-modification controls (AG-822) and the capability-gain rate limiting (AG-823) that apply *after* the threshold becomes relevant. This dimension governs *the tripwire and its consequences*.

3. Why This Matters

An agent that can do the work of an AI researcher can be directed, by its developer or by itself, at improving AI systems, including its own successors, which compresses development timelines and lets capability outpace governance. The danger arises even in purely internal use, where most safety regimes have historically applied the least scrutiny. A pre-committed tripwire ensures that the heightened safeguards appropriate to this capability are in place *before* it is reached, not negotiated under pressure afterwards.

4. Requirements

R1: The organisation MUST define an evaluable AI-R&D / ML-acceleration capability threshold (e.g. "can autonomously perform the core work of an entry-level AI researcher") in its capability-tier scheme (AG-801).
R2: Frontier agents MUST be evaluated against this threshold using dangerous-capability elicitation (AG-802), with sandbag-resistant lower bounds (AG-800).
R3: The mitigations required once the threshold is crossed MUST be pre-specified (e.g. enhanced containment, self-modification controls, human sign-off, rate limiting) and MUST be in place before continued operation at that level.
R4: The threshold and its mitigations MUST apply to *internal* deployment and research use, not only externally released systems.
R5: Crossing the threshold without the required mitigations MUST trigger a non-discretionary response (pause, restrict, or roll back), not a waiver.
R6: The organisation MUST conduct a catastrophic-risk assessment for internal use of any model at or above the threshold.
R7: Threshold determinations and the supporting evaluations MUST be recorded, independently reviewed, and disclosed to relevant authorities for systemic-risk models.
R8: The threshold definition MUST be reviewed and updated as understanding of AI-R&D risk evolves.
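
The gating logic implied by R2 to R5 can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the 0-to-1 capability scale, the `Evaluation` record, the mitigation identifiers (taken from the examples in R3), and the `tripwire_response` function are hypothetical and are not defined by this protocol.

```python
from dataclasses import dataclass
from enum import Enum


class Response(Enum):
    CONTINUE = "continue"
    RESTRICT = "restrict"  # stands in for pause, restrict, or roll back (R5)


# Hypothetical mitigation identifiers, based on the examples listed in R3.
REQUIRED_MITIGATIONS = frozenset({
    "enhanced_containment",
    "self_modification_controls",
    "human_sign_off",
    "rate_limiting",
})


@dataclass(frozen=True)
class Evaluation:
    """Outcome of a capability elicitation run (R2), on an assumed 0-1 scale."""
    lower_bound: float   # sandbag-resistant lower bound on AI-R&D capability
    internal_only: bool  # True for unreleased, internal deployments


def tripwire_response(ev: Evaluation, threshold: float,
                      active_mitigations: frozenset) -> Response:
    # R4: ev.internal_only is deliberately not consulted, so internal use
    # receives no exemption from the tripwire.
    if ev.lower_bound < threshold:
        return Response.CONTINUE
    missing = REQUIRED_MITIGATIONS - active_mitigations
    # R5: the function takes no waiver argument, so a crossing with missing
    # mitigations always yields the restrictive response.
    return Response.RESTRICT if missing else Response.CONTINUE
```

Under these assumptions, an internal agent with a lower bound of 0.8 against a threshold of 0.7 and only human sign-off in place yields `Response.RESTRICT`. The design choice worth noting is the absence of any override parameter: a waiver cannot be expressed through this interface at all.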

5. Maturity Model

Basic: An AI-R&D capability threshold is defined and agents are evaluated against it.
Intermediate: Pre-specified mitigations gate operation at/above the threshold, internal use is in scope, and threshold crossings force a non-discretionary response.
Advanced: Sandbag-resistant evaluation, internal-use catastrophic-risk assessment, independent review, authority disclosure, and periodic threshold review.

6. Test Criteria

Test 6.1: Threshold Defined & Evaluated

Stimulus: Request the AI-R&D capability threshold and the agent's evaluation against it.
Expected: An evaluable threshold exists, and the agent has been assessed against it using a lower-bound capability estimate.
Fail: No threshold, or no evaluation against it.

Test 6.2: Internal-Use Coverage

Stimulus: Identify an internal research deployment of a frontier model.
Expected: The tripwire and its mitigations apply to that internal use.
Fail: Internal use is exempt from the tripwire.
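
One way an assessor might automate the internal-use check is sketched below. It assumes the organisation keeps a deployment inventory with the fields shown; the `Deployment` record and the `exempt_internal_deployments` helper are hypothetical names, not part of the protocol.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Deployment:
    name: str
    internal: bool          # True for non-released, internal research use
    tripwire_applied: bool  # True if the tripwire and its mitigations apply


def exempt_internal_deployments(inventory: Iterable[Deployment]) -> List[str]:
    """Return the names of internal deployments the tripwire does not cover."""
    return [d.name for d in inventory if d.internal and not d.tripwire_applied]
```

Under this sketch, the test fails if the returned list is non-empty for any frontier model in the inventory.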

Test 6.3: Non-Discretionary Crossing Response

Stimulus: Establish that an agent crosses the threshold without required mitigations.
Expected: A pause/restrict/rollback response is enforced, not waived.
Fail: Operation continues above threshold without mitigations.

7. Scoring

Score	Criteria
0	No AI-R&D capability threshold; acceleration risk unmanaged
1	Threshold defined but not bound to mitigations or internal use
2	Pre-specified mitigations gate operation, internal use in scope, non-discretionary crossing response
3	Sandbag-resistant evaluation, internal catastrophic-risk assessment, independent review, authority disclosure
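
The scoring table can be read as a cumulative rubric, sketched below. This reading is an assumption: the sketch treats every criterion listed at a level as required for that level, and requires level 2 before level 3. The `Controls` field names are hypothetical labels for the criteria in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Controls:
    threshold_defined: bool = False
    # Score 2 criteria
    mitigations_gate_operation: bool = False
    internal_use_in_scope: bool = False
    non_discretionary_response: bool = False
    # Score 3 criteria
    sandbag_resistant_evaluation: bool = False
    internal_catastrophic_risk_assessment: bool = False
    independent_review: bool = False
    authority_disclosure: bool = False


def score(c: Controls) -> int:
    """Map observed controls to a 0-3 score, assuming cumulative levels."""
    if not c.threshold_defined:
        return 0
    level_2 = (c.mitigations_gate_operation
               and c.internal_use_in_scope
               and c.non_discretionary_response)
    if not level_2:
        return 1
    level_3 = (c.sandbag_resistant_evaluation
               and c.internal_catastrophic_risk_assessment
               and c.independent_review
               and c.authority_disclosure)
    return 3 if level_3 else 2
```

For example, an organisation with a defined threshold and gated mitigations, but with internal use out of scope, scores 1 under this reading.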

8. Failure Scenarios

Scenario A — Internal Acceleration: A developer points a highly capable internal agent at improving its own training pipeline. Because the tripwire excluded internal use, no heightened safeguards applied, and capability advanced without containment.

Scenario B — Threshold Crossed Quietly: A model upgrade crosses the AI-R&D threshold, but with no evaluation against it the organisation doesn't notice, and the mitigations that should have been mandatory are absent.

Scenario C — Deadline Waiver: Evaluation shows the threshold crossed, but a waiver is granted to keep a research programme moving; the acceleration risk ships ungoverned.

9. Regulatory Mapping

Requirement	EU AI Act	NIST AI RMF	ISO 42001
R1: Defined AI-R&D threshold	Art. 51 — Systemic-risk classification	GOVERN 1.3 — Risk-based activity	Clause 6.1 — Actions to address risk
R2: Evaluate against threshold	Art. 55 — Model evaluation	MAP 5.1 — Impact identification	Clause 8.3 — Verification
R3: Pre-specified mitigations gate	Art. 55 — Risk mitigation	MANAGE 1.3 — High-priority response	Clause 8.1 — Operational control
R4: Internal-use coverage	Art. 51 — Model scope	GOVERN 1.6 — Inventory	A.6 — AI system lifecycle
R5: Non-discretionary crossing response	Art. 55 — Risk mitigation	MANAGE 1.3 — High-priority response	Clause 8.1 — Operational control
R6: Internal catastrophic-risk assessment	Art. 9 — Risk management	MAP 5.1 — Impact magnitude	Clause 6.1 — Actions to address risk
R7: Recorded, reviewed, disclosed	Art. 55 — Reporting	GOVERN 2.1 — Accountability	Clause 9.3 — Management review

EU AI Act — Article 51 and Article 55

Articles 51–55 govern systemic-risk models, classified partly by capability. AI-R&D acceleration is a quintessential systemic risk; AG-821 makes it an explicit, pre-committed threshold with mandatory mitigations.

NIST AI RMF — GOVERN 1.3, MAP 5.1

GOVERN 1.3 (risk-based activity levels) and MAP 5.1 (impact likelihood/magnitude) require tying safeguards to the most consequential capability — here, automating AI R&D.

ISO 42001 — Clause 6.1, Clause 8.1

Clause 6.1 (actions to address risks) and Clause 8.1 (operational control) require risk-proportionate, controlled handling of acceleration-capable systems, including internal use.

10. Related Protocols

AG-801 (Capability-Threshold Gating) — the gating scheme this threshold extends
AG-802 (Dangerous-Capability Elicitation Evaluation) — measures AI-R&D capability
AG-822 (Self-Modification and Weight-Edit Authorisation) — a mitigation triggered above threshold
AG-823 (Capability-Gain Rate Limiting and Improvement Audit) — bounds the acceleration
AG-824 (Human Sign-off on Autonomous AI Research) — human control of AI-run R&D

Cite this protocol

AgentGoverning. (2026). AG-821: AI-R&D Capability Tripwire. The Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-821
