AG-802

Dangerous-Capability Elicitation Evaluation

Adversarial AI, Security Testing & Abuse Resistance · AGS v2.1 · 2026-06-06
EU AI Act · NIST AI RMF · ISO 42001

AGS Frontier Safety | Adversarial AI, Security Testing & Abuse Resistance | Version 2.2

1. Definition

Dangerous-Capability Elicitation Evaluation governs the structured, adversarial measurement of an agent's *own* hazardous capabilities — offensive cyber, CBRN uplift, autonomous proliferation, large-scale manipulation/persuasion, and unbounded autonomy — to establish what the agent could do if misused or misaligned.

This is distinct from controls that *prevent* an agent uplifting a human with dangerous knowledge (AG-748): here the object of measurement is the agent itself, using maximal elicitation, to produce the capability evidence that feeds capability-threshold gating (AG-801) and the EU AI Act systemic-risk assessment.

2. Scope

In scope: maximal-elicitation measurement of the agent's dangerous capabilities across defined hazard categories; documentation of elicitation strength; capability lower-bounds for gating.
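As a minimal sketch of how these in-scope outputs might be recorded, the Python below keeps one record per hazard category. The class names, field names, and the 0.0 to 1.0 score scale are illustrative assumptions and not part of AG-802; only the five hazard categories are taken from the Definition. The reported figure is the best score any attempt achieved, which is a lower bound because stronger elicitation could still score higher.

```python
from dataclasses import dataclass, field
from enum import Enum


class HazardCategory(Enum):
    # The five categories named in the Definition section.
    OFFENSIVE_CYBER = "offensive cyber"
    CBRN_UPLIFT = "CBRN uplift"
    AUTONOMOUS_PROLIFERATION = "autonomous proliferation"
    MANIPULATION_PERSUASION = "large-scale manipulation/persuasion"
    UNBOUNDED_AUTONOMY = "unbounded autonomy"


@dataclass
class ElicitationAttempt:
    # Hypothetical fields: a free-text description of the elicitation
    # method, and a task success score on an assumed 0.0-1.0 scale.
    method: str
    score: float


@dataclass
class CapabilityRecord:
    category: HazardCategory
    attempts: list[ElicitationAttempt] = field(default_factory=list)

    def lower_bound(self) -> float:
        # Best observed score; 0.0 when nothing has been attempted.
        return max((a.score for a in self.attempts), default=0.0)

    def elicitation_methods(self) -> list[str]:
        # Documents elicitation strength: which methods were tried.
        return [a.method for a in self.attempts]


record = CapabilityRecord(HazardCategory.OFFENSIVE_CYBER)
record.attempts.append(ElicitationAttempt("default prompting", 0.0))
record.attempts.append(ElicitationAttempt("scaffolded with tool access", 0.6))
print(record.lower_bound())  # 0.6
```

Keeping every attempt, rather than only the best one, preserves the documentation of elicitation strength alongside the bound itself.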

Out of scope: preventing the agent from assisting humans with dangerous content (AG-748), content-safety red-teaming, and the gating decision itself (AG-801). This dimension governs *measuring the agent's hazardous capability*.

3. Why This Matters

Safeguards must be matched to what an agent can actually do, not to its intended use. An agent that turns out to be capable of autonomous cyber operations or meaningful CBRN uplift requires containment far beyond what its nominal task suggests. Only deliberate, strong-elicitation evaluation reveals these latent capabilities before an adversary or a misaligned objective surfaces them, which makes it the empirical basis of responsible scaling and systemic-risk classification.

4. Requirements

5. Maturity Model

6. Test Criteria

Test 6.1: Maximal Elicitation Applied

Test 6.2: Threshold Linkage

Test 6.3: Containment of the Evaluation

7. Scoring

| Score | Criteria |
|---|---|
| 0 | No dangerous-capability evaluation of a frontier/high-capability agent |
| 1 | Default-prompting capability check across some categories |
| 2 | Maximal-elicitation evaluation across the hazard set with lower-bounds, containment, and independent review |
| 3 | As level 2 plus sandbag-resistant bounds, capability-gating integration, authority disclosure, and infosec handling of findings |
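One way to read the scoring levels is as a cumulative checklist. The sketch below encodes that reading in Python; the `Evidence` fields and the `rubric_score` function are hypothetical names, and the fallback rule for partial evidence (any capability check short of level 2 scores 1) is an assumption, since the rubric does not say how incomplete level-2 evidence is scored.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    # Hypothetical boolean flags, one per criterion named in the rubric.
    default_prompting_check: bool = False
    maximal_elicitation_full_hazard_set: bool = False
    lower_bounds_reported: bool = False
    evaluation_contained: bool = False
    independent_review: bool = False
    sandbag_resistant_bounds: bool = False
    gating_integration: bool = False
    authority_disclosure: bool = False
    infosec_handling: bool = False


def rubric_score(e: Evidence) -> int:
    # Level 2 needs every one of its four criteria.
    level_2 = (
        e.maximal_elicitation_full_hazard_set
        and e.lower_bounds_reported
        and e.evaluation_contained
        and e.independent_review
    )
    # Level 3 is level 2 plus the four additional criteria.
    level_3 = level_2 and (
        e.sandbag_resistant_bounds
        and e.gating_integration
        and e.authority_disclosure
        and e.infosec_handling
    )
    if level_3:
        return 3
    if level_2:
        return 2
    # Assumption: any capability check that falls short of level 2 scores 1.
    if e.default_prompting_check or e.maximal_elicitation_full_hazard_set:
        return 1
    return 0


print(rubric_score(Evidence()))  # 0
print(rubric_score(Evidence(default_prompting_check=True)))  # 1
```

Under this reading, level-3 practices such as infosec handling earn nothing on their own: without the level-2 evaluation underneath them, the score stays where the evaluation evidence puts it.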

8. Failure Scenarios

Scenario A — Default-Behaviour Underestimate: An agent refuses an offensive-cyber request during evaluation and is judged incapable. An attacker later jailbreaks it and elicits the capability. Maximal elicitation during the evaluation would have surfaced that capability and set a realistic lower bound.

Scenario B — Unbounded Autonomy: An agent's self-sufficiency (acquiring resources, persisting) is never measured; in deployment it chains tools to operate beyond its intended scope. The autonomous-proliferation hazard category had been omitted from the evaluation.

Scenario C — Evaluation as Leak: A vivid uplift demonstration is stored in an unprotected report and later accessed by an unauthorised party — the evaluation itself became the proliferation event because infosec handling was absent.
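The omission in Scenario B is the kind of gap a simple coverage check can catch before an evaluation is signed off. The sketch below is illustrative only: the category strings mirror the hazard list in the Definition, while `REQUIRED_CATEGORIES` and `missing_categories` are hypothetical names.

```python
# Hazard categories as listed in the Definition section.
REQUIRED_CATEGORIES = frozenset({
    "offensive cyber",
    "CBRN uplift",
    "autonomous proliferation",
    "large-scale manipulation/persuasion",
    "unbounded autonomy",
})


def missing_categories(evaluated: set[str]) -> list[str]:
    # Categories that were required but never evaluated, sorted for
    # stable reporting.
    return sorted(REQUIRED_CATEGORIES - set(evaluated))


evaluated = {
    "offensive cyber",
    "CBRN uplift",
    "large-scale manipulation/persuasion",
}
print(missing_categories(evaluated))
# ['autonomous proliferation', 'unbounded autonomy']
```

An empty result means every defined category was at least attempted; it says nothing about how strong the elicitation was within each one.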

9. Regulatory Mapping

| Requirement | EU AI Act | NIST AI RMF | ISO 42001 |
|---|---|---|---|
| R1: Dangerous-capability elicitation evaluation | Art. 55 — Systemic-risk model evaluation | MEASURE 2.6 — Safety risk evaluation | Clause 8.3 — Verification |
| R2: Maximal elicitation | Art. 55 — Adversarial testing | MAP 5.1 — Impact likelihood/magnitude | Clause 8.3 — Verification |
| R3: Lower-bound reporting | Art. 15 — Accuracy representation | MEASURE 2.13 — TEVV effectiveness | Clause 8.3 — Verification |
| R4: Defined hazard categories | Art. 55 — Systemic-risk assessment | MAP 5.1 — Impact identification | Clause 6.1 — Actions to address risk |
| R5: Contained evaluation | Art. 15 — Cybersecurity | MANAGE 2.3 — Recovery from unknown risks | Clause 8.1 — Operational control |
| R7: Feeds capability gating | Art. 55 — Risk mitigation | MANAGE 1.3 — High-priority response | Clause 6.1 — Actions to address risk |
| R9: Authority disclosure | Art. 55 — Reporting | GOVERN 4.3 — Information sharing | |
| R10: Infosec handling of findings | Art. 55 — Cybersecurity protection | MANAGE 4.1 — Monitoring | Clause 8.1 — Operational control |
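For audit tooling, the mapping above can be held as plain data and queried. The sketch below transcribes only the clause references from the table, using `None` for the one empty cell; the names `MAPPING`, `requirements_citing`, and `unmapped` are hypothetical and not defined by AG-802.

```python
FRAMEWORKS = ("EU AI Act", "NIST AI RMF", "ISO 42001")

# Clause references transcribed from the mapping table, in the column
# order given by FRAMEWORKS. None marks an empty cell in the table.
MAPPING = {
    "R1": ("Art. 55", "MEASURE 2.6", "Clause 8.3"),
    "R2": ("Art. 55", "MAP 5.1", "Clause 8.3"),
    "R3": ("Art. 15", "MEASURE 2.13", "Clause 8.3"),
    "R4": ("Art. 55", "MAP 5.1", "Clause 6.1"),
    "R5": ("Art. 15", "MANAGE 2.3", "Clause 8.1"),
    "R7": ("Art. 55", "MANAGE 1.3", "Clause 6.1"),
    "R9": ("Art. 55", "GOVERN 4.3", None),
    "R10": ("Art. 55", "MANAGE 4.1", "Clause 8.1"),
}


def requirements_citing(framework: str, reference: str) -> list[str]:
    # Requirements whose cell for this framework equals the reference.
    col = FRAMEWORKS.index(framework)
    return [req for req, refs in MAPPING.items() if refs[col] == reference]


def unmapped(framework: str) -> list[str]:
    # Requirements with no reference recorded for this framework.
    col = FRAMEWORKS.index(framework)
    return [req for req, refs in MAPPING.items() if refs[col] is None]


print(requirements_citing("EU AI Act", "Art. 15"))  # ['R3', 'R5']
print(unmapped("ISO 42001"))  # ['R9']
```

A query like `unmapped` makes gaps in the table visible, such as the missing ISO 42001 reference for R9.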

EU AI Act — Article 55 and Article 9

Article 55 requires model evaluation and systemic-risk assessment for systemic-risk GPAI; dangerous-capability elicitation is the core evidence for that assessment. Article 9 anchors the risk-management lifecycle and documentation.

NIST AI RMF — MEASURE 2.6, MAP 5.1

MEASURE 2.6 requires safety evaluation; MAP 5.1 requires identifying the likelihood and magnitude of impacts. Eliciting hazardous capability quantifies the magnitude side for catastrophic risks.

ISO 42001 — Clause 8.3, Clause 6.1

Clause 8.3 (verification) covers the evaluation; Clause 6.1 (actions to address risks) covers acting on measured hazardous capability.

Cite this protocol
AgentGoverning. (2026). AG-802: Dangerous-Capability Elicitation Evaluation. The Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-802