AG-821

AI-R&D Capability Tripwire

Meta-Governance & Assurance · AGS v2.1 · 2026-06-06
EU AI Act · NIST AI RMF · ISO 42001

AGS Frontier Autonomy (Group K) | Meta-Governance & Assurance | Version 3.0

1. Definition

AI-R&D Capability Tripwire governs a pre-committed, evaluable threshold — a "critical capability level" — at which an agent's ability to materially automate or accelerate AI research and development triggers heightened safeguards, regardless of whether such use is intended.

The ability to automate AI R&D is the capability most likely to produce rapid, compounding capability gains. Frontier-safety frameworks treat it as a distinct critical threshold because crossing it changes the risk landscape for every other control. This dimension requires that the threshold be defined, that systems be evaluated against it (AG-802), and that it be bound to mandatory mitigations before it is crossed.
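The define, evaluate, and bind steps can be sketched as a small gate. This is a minimal illustration, not part of the protocol: the `Tripwire` schema, the `gate` function, the 0-to-1 score scale, the threshold value, and the mitigation names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tripwire:
    """A pre-committed threshold bound to mandatory mitigations (hypothetical schema)."""
    name: str
    threshold: float                 # assumed: evaluation score on a 0..1 scale
    required_mitigations: frozenset  # all must be active once the threshold is met


def gate(tripwire: Tripwire, eval_score: float, active_mitigations: set) -> bool:
    """Return True if operation may proceed, False if it must be blocked."""
    if eval_score < tripwire.threshold:
        return True  # below the critical capability level
    # At or above the threshold: every pre-specified mitigation must be in place.
    return tripwire.required_mitigations <= set(active_mitigations)


# Illustrative values only; a real threshold would come from the evaluation regime.
AI_RND = Tripwire(
    name="ai-rnd-acceleration",
    threshold=0.6,
    required_mitigations=frozenset({"weights-access-control", "internal-use-review"}),
)

print(gate(AI_RND, 0.4, set()))                       # True
print(gate(AI_RND, 0.7, {"weights-access-control"}))  # False
print(gate(AI_RND, 0.7, {"weights-access-control", "internal-use-review"}))  # True
```

The gate takes no argument for intent or deployment target, which mirrors the definition: the safeguards depend on measured capability alone.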

2. Scope

In scope: defining the AI-R&D / ML-acceleration capability threshold; evaluating against it; pre-committing mitigations and gating operation when it is crossed; applying all of these to internal (non-released) use, not only external deployment.

Out of scope: the self-modification controls (AG-822) and the capability-gain rate limiting (AG-823) that apply *after* the threshold is relevant. This dimension governs *the tripwire and its consequences*.

3. Why This Matters

An agent that can do the work of an AI researcher can be turned, by its developer or by itself, on improving AI systems, including its successors, compressing development timelines and outpacing governance. The danger arises even in purely internal use, where most safety regimes have historically applied the least scrutiny. A pre-committed tripwire ensures the heightened safeguards appropriate to this capability are in place *before* it is reached, not negotiated under pressure afterwards.

4. Requirements

5. Maturity Model

6. Test Criteria

Test 6.1: Threshold Defined & Evaluated

Test 6.2: Internal-Use Coverage

Test 6.3: Non-Discretionary Crossing Response

7. Scoring

Score | Criteria
0 | No AI-R&D capability threshold; acceleration risk unmanaged
1 | Threshold defined but not bound to mitigations or internal use
2 | Pre-specified mitigations gate operation, internal use in scope, non-discretionary crossing response
3 | Sandbag-resistant evaluation, internal catastrophic-risk assessment, independent review, authority disclosure
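One way to apply the scoring levels is as a cumulative checklist. The sketch below assumes each score requires every criterion of the lower scores, which the rubric does not state explicitly; the control names and the `score` function are hypothetical labels for the criteria listed above.

```python
# Hypothetical control names, one per criterion in the scoring rubric.
LEVEL_2 = (
    "mitigations_gate_operation",
    "internal_use_in_scope",
    "non_discretionary_response",
)
LEVEL_3 = (
    "sandbag_resistant_evaluation",
    "internal_catastrophic_risk_assessment",
    "independent_review",
    "authority_disclosure",
)


def score(controls: dict) -> int:
    """Map a dict of control name -> bool to a 0..3 score (assumed cumulative)."""
    if not controls.get("threshold_defined", False):
        return 0
    if not all(controls.get(name, False) for name in LEVEL_2):
        return 1
    if not all(controls.get(name, False) for name in LEVEL_3):
        return 2
    return 3


partial = {"threshold_defined": True, "internal_use_in_scope": True}
print(score(partial))  # 1: a level-2 control is missing
```

Treating the levels as cumulative means a single missing level-2 control caps the score at 1, even if some level-3 controls are already in place.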

8. Failure Scenarios

Scenario A — Internal Acceleration: A developer points a highly capable internal agent at improving its own training pipeline. Because the tripwire excluded internal use, no heightened safeguards applied, and capability advanced without containment.

Scenario B — Threshold Crossed Quietly: A model upgrade crosses the AI-R&D threshold, but with no evaluation against it the organisation doesn't notice, and the mitigations that should have been mandatory are absent.

Scenario C — Deadline Waiver: Evaluation shows the threshold crossed, but a waiver is granted to keep a research programme moving; the acceleration risk ships ungoverned.
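The three scenarios can be read as checks over a deployment record. The following is a hedged sketch: the `DeploymentRecord` fields and the `findings` function are hypothetical, and a real audit would draw these facts from evaluation and approval logs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeploymentRecord:
    """Facts about one model or agent deployment (hypothetical fields)."""
    internal_only: bool
    tripwire_covers_internal: bool
    evaluated_against_threshold: bool
    threshold_crossed: bool  # meaningful only if an evaluation was run
    waiver_granted: bool


def findings(record: DeploymentRecord) -> List[str]:
    """Return the failure scenarios (A, B, C) that the record matches."""
    out = []
    if record.internal_only and not record.tripwire_covers_internal:
        out.append("A: internal use falls outside the tripwire")
    if not record.evaluated_against_threshold:
        out.append("B: no evaluation against the threshold")
    if (record.evaluated_against_threshold and record.threshold_crossed
            and record.waiver_granted):
        out.append("C: crossing waived instead of mitigated")
    return out


record = DeploymentRecord(
    internal_only=True,
    tripwire_covers_internal=False,
    evaluated_against_threshold=True,
    threshold_crossed=True,
    waiver_granted=True,
)
print(findings(record))
```

An empty list means none of the three scenarios applies; it does not by itself show that the required mitigations are in place.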

9. Regulatory Mapping

Requirement | EU AI Act | NIST AI RMF | ISO 42001
R1: Defined AI-R&D threshold | Art. 51 — Systemic-risk classification | GOVERN 1.3 — Risk-based activity | Clause 6.1 — Actions to address risk
R2: Evaluate against threshold | Art. 55 — Model evaluation | MAP 5.1 — Impact identification | Clause 8.3 — Verification
R3: Pre-specified mitigations gate | Art. 55 — Risk mitigation | MANAGE 1.3 — High-priority response | Clause 8.1 — Operational control
R4: Internal-use coverage | Art. 51 — Model scope | GOVERN 1.6 — Inventory | A.6 — AI system lifecycle
R5: Non-discretionary crossing response | Art. 55 — Risk mitigation | MANAGE 1.3 — High-priority response | Clause 8.1 — Operational control
R6: Internal catastrophic-risk assessment | Art. 9 — Risk management | MAP 5.1 — Impact magnitude | Clause 6.1 — Actions to address risk
R7: Recorded, reviewed, disclosed | Art. 55 — Reporting | GOVERN 2.1 — Accountability | Clause 9.3 — Management review

EU AI Act — Article 51 and Article 55

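Articles 51 to 55 cover general-purpose AI models. Article 51 classifies a model as posing systemic risk partly on the basis of high-impact capabilities, and Article 55 sets the resulting obligations, including model evaluation and risk mitigation. AI-R&D acceleration is a quintessential systemic risk; AG-821 makes it an explicit, pre-committed threshold with mandatory mitigations.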

NIST AI RMF — GOVERN 1.3, MAP 5.1

GOVERN 1.3 (risk-based activity levels) and MAP 5.1 (impact likelihood/magnitude) call for safeguards proportionate to risk; AG-821 applies that to the most consequential capability, automating AI R&D.

ISO 42001 — Clause 6.1, Clause 8.1

Clause 6.1 (actions to address risks) and Clause 8.1 (operational control) require risk-proportionate, controlled handling of acceleration-capable systems, including internal use.

Cite this protocol
AgentGoverning. (2026). AG-821: AI-R&D Capability Tripwire. The Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-821