Dangerous-Capability Elicitation Evaluation

Adversarial AI, Security Testing & Abuse Resistance · AGS v2.1 · 2026-06-06

EU AI Act · NIST AI RMF · ISO 42001

AGS Frontier Safety | Adversarial AI, Security Testing & Abuse Resistance | Version 2.2

1. Definition

Dangerous-Capability Elicitation Evaluation governs the structured, adversarial measurement of an agent's *own* hazardous capabilities — offensive cyber, CBRN uplift, autonomous proliferation, large-scale manipulation/persuasion, and unbounded autonomy — to establish what the agent could do if misused or misaligned.

This is distinct from controls that *prevent* an agent from uplifting a human with dangerous knowledge (AG-748). Here the object of measurement is the agent itself: maximal elicitation is used to produce the capability evidence that feeds capability-threshold gating (AG-801) and the EU AI Act systemic-risk assessment.

2. Scope

In scope: maximal-elicitation measurement of the agent's dangerous capabilities across defined hazard categories; documentation of elicitation strength; capability lower bounds for gating.

Out of scope: preventing the agent from assisting humans with dangerous content (AG-748), content-safety red-teaming, and the gating decision itself (AG-801). This dimension governs *measuring the agent's hazardous capability*.

3. Why This Matters

Safeguards must be matched to what an agent can actually do, not to its intended use. An agent inadvertently capable of autonomous cyber operations or meaningful CBRN uplift requires containment far beyond its nominal task. Only deliberate, strong-elicitation evaluation reveals these latent capabilities before an adversary or a misaligned objective surfaces them — making this the empirical basis of responsible scaling and systemic-risk classification.

4. Requirements

R1: Frontier/high-capability agents MUST undergo dangerous-capability elicitation evaluation across pre-defined hazard categories before deployment and on material capability change.
R2: Evaluation MUST use maximal elicitation (best available prompting, fine-tuning where appropriate, tool access, scaffolding) so results approximate an adversary's achievable capability, not the default refusal behaviour.
R3: Elicitation strength achieved MUST be documented; capability MUST be reported as a lower bound, with sandbagging actively ruled out per AG-800.
R4: Hazard categories MUST at minimum include offensive cyber, CBRN uplift, autonomous proliferation/self-sufficiency, and mass manipulation; the category set MUST be reviewed as the threat landscape evolves.
R5: Dangerous-capability evaluations MUST be conducted under containment appropriate to the capability being probed (no live targets, isolated environments).
R6: Results MUST be documented in a reproducible report covering scenarios, elicitation methods, observed ceilings, and pass/fail outcomes against pre-registered capability thresholds.
R7: Results MUST feed capability-threshold gating (AG-801); capabilities above threshold MUST gate deployment until safeguards are in place.
R8: Evaluation MUST be conducted or independently reviewed by a team separate from the shipping team.
R9: For systemic-risk models, methodology and material findings SHOULD be disclosed to relevant authorities.
R10: Highly sensitive findings (e.g. specific uplift demonstrations) MUST be handled under information-security controls to avoid the evaluation itself becoming a proliferation risk.
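
The record-keeping implied by R3, R4, and R6 can be sketched as a small data structure. This is a minimal illustration rather than a prescribed schema: the class names, field names, category identifiers, and the 0.0 to 1.0 score scale are all assumptions introduced here, not part of the protocol.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Minimum hazard categories named in R4 (identifiers are hypothetical).
REQUIRED_CATEGORIES = {
    "offensive_cyber",
    "cbrn_uplift",
    "autonomous_proliferation",
    "mass_manipulation",
}


@dataclass
class CategoryResult:
    category: str
    elicitation_methods: List[str]  # e.g. ["prompting", "scaffolding"] (R2, R3)
    observed_ceiling: float         # assumed scale: 0.0 to 1.0 (R6)
    threshold: float                # pre-registered before the run (R6)
    sandbagging_ruled_out: bool     # outcome of the AG-800 check (R3)

    @property
    def exceeds_threshold(self) -> bool:
        # "Above threshold" is read here as strictly greater than.
        return self.observed_ceiling > self.threshold


@dataclass
class EvaluationReport:
    agent_id: str
    results: List[CategoryResult] = field(default_factory=list)

    def missing_categories(self) -> Set[str]:
        """Required hazard categories with no recorded result."""
        return REQUIRED_CATEGORIES - {r.category for r in self.results}

    def gaps(self) -> List[str]:
        """List documentation gaps that would undermine R3, R4, or R6."""
        problems = [f"missing category: {c}"
                    for c in sorted(self.missing_categories())]
        for r in self.results:
            if not r.elicitation_methods:
                problems.append(f"{r.category}: no elicitation methods recorded")
            if not r.sandbagging_ruled_out:
                problems.append(f"{r.category}: sandbagging not ruled out")
        return problems
```

A report with an empty `gaps()` list is only structurally complete; it says nothing about whether the elicitation itself was strong enough.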

5. Maturity Model

Basic: Capability is probed with default prompting across some hazard categories; results recorded.
Intermediate: Maximal-elicitation evaluation across the defined hazard set, documented elicitation strength, lower-bound reporting, containment, and independent review; re-run on change.
Advanced: Sandbag-resistant lower bounds, capability-gating integration, authority disclosure for systemic-risk models, and infosec handling of sensitive findings.

6. Test Criteria

Test 6.1: Maximal Elicitation Applied

Stimulus: Review the evaluation methodology for a hazard category.
Expected: Best-available elicitation was applied and its strength documented; capability stated as a lower bound.
Fail: Only default behaviour was tested, or no elicitation-strength record exists.
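
One way to read "capability stated as a lower bound" is that the reported figure is the best score observed across all elicitation methods tried: the agent can do at least that much, and stronger elicitation might reveal more. The sketch below assumes hypothetical method names and a numeric score per method; neither is defined by the protocol.

```python
from typing import Dict


def capability_lower_bound(scores_by_method: Dict[str, float]) -> float:
    """Return the best observed score across elicitation methods.

    The true capability is at least this value, so it is a lower bound.
    """
    if not scores_by_method:
        raise ValueError("no elicitation runs recorded")
    return max(scores_by_method.values())


def default_only(scores_by_method: Dict[str, float]) -> bool:
    """True when nothing beyond default prompting was tried (a 6.1 fail)."""
    return set(scores_by_method) <= {"default_prompting"}


# Hypothetical scores for one hazard category.
runs = {"default_prompting": 0.1, "few_shot": 0.4, "scaffolded_tools": 0.7}
bound = capability_lower_bound(runs)
```

In this example the default-prompting score alone would have understated the measured capability by a wide margin, which is the situation the Fail condition is meant to catch.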

Test 6.2: Threshold Linkage

Stimulus: Identify a capability measured above a pre-registered threshold.
Expected: Deployment at the corresponding tier was gated until required safeguards were verified (AG-801).
Fail: Above-threshold capability did not gate deployment.
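
An auditor's check for this test might look like the following sketch. It is not the gating decision itself, which belongs to AG-801; it only flags categories where a recorded deployment would contradict R7. The function name, the dictionary shapes, and the choice to treat an unmeasured category as blocking are assumptions made for illustration.

```python
from typing import Dict, List


def deployment_blocked_by(
    observed: Dict[str, float],
    thresholds: Dict[str, float],
    safeguards_verified: Dict[str, bool],
) -> List[str]:
    """Return the hazard categories that should still gate deployment."""
    blocking = []
    for category, threshold in thresholds.items():
        if category not in observed:
            # Assumption: a category with no measurement cannot be cleared.
            blocking.append(category)
            continue
        above = observed[category] > threshold
        if above and not safeguards_verified.get(category, False):
            blocking.append(category)
    return sorted(blocking)
```

If this list is non-empty for an agent that is already deployed at the corresponding tier, the Fail condition applies.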

Test 6.3: Containment of the Evaluation

Stimulus: Inspect the environment used for an offensive-capability probe.
Expected: Isolated environment, no live targets, sensitive findings access-controlled.
Fail: Evaluation conducted against live systems or findings left unprotected.
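
The three conditions in the Expected line can be expressed as a simple checklist over a description of the evaluation environment. The sketch below assumes a hypothetical `EvalEnvironment` record filled in by the inspector; how each field is actually verified is outside what the protocol specifies.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EvalEnvironment:
    network_isolated: bool            # environment is isolated
    live_targets: int                 # number of real systems reachable
    findings_access_controlled: bool  # sensitive findings are protected (R10)


def containment_failures(env: EvalEnvironment) -> List[str]:
    """Return the containment conditions from Test 6.3 that are not met."""
    failures = []
    if not env.network_isolated:
        failures.append("environment is not isolated")
    if env.live_targets > 0:
        failures.append("live targets are reachable")
    if not env.findings_access_controlled:
        failures.append("sensitive findings are not access-controlled")
    return failures
```

An empty list corresponds to the Expected outcome; any entry corresponds to a Fail.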

7. Scoring

Score	Criteria
0	No dangerous-capability evaluation of a frontier/high-capability agent
1	Default-prompting capability check across some categories
2	Maximal-elicitation evaluation across the hazard set with lower bounds, containment, and independent review
3	As level 2 plus sandbag-resistant bounds, capability-gating integration, authority disclosure, and infosec handling of findings
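
The scoring table can be read as a cumulative checklist, sketched below. The `Evidence` fields are hypothetical flags an assessor would set; the table does not say how partial evidence at a level should be handled, so this sketch assumes every listed criterion must hold before the level is awarded.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    any_evaluation: bool = False        # at least a default-prompting check
    maximal_elicitation: bool = False
    lower_bounds_reported: bool = False
    contained: bool = False
    independently_reviewed: bool = False
    sandbag_resistant: bool = False
    gating_integrated: bool = False
    authority_disclosure: bool = False
    infosec_handling: bool = False


def score(e: Evidence) -> int:
    """Map assessor evidence to the 0-3 score in the table above."""
    if not e.any_evaluation:
        return 0
    level2 = all([e.maximal_elicitation, e.lower_bounds_reported,
                  e.contained, e.independently_reviewed])
    if not level2:
        return 1
    level3 = all([e.sandbag_resistant, e.gating_integrated,
                  e.authority_disclosure, e.infosec_handling])
    return 3 if level3 else 2
```

Under this reading, an evaluation that uses maximal elicitation but lacks independent review still scores 1, because level 2 is treated as all-or-nothing.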

8. Failure Scenarios

Scenario A — Default-Behaviour Underestimate: An agent refuses an offensive-cyber request during evaluation and is judged incapable. An attacker later jailbreaks it and elicits the capability. Maximal elicitation would have revealed the true ceiling.

Scenario B — Unbounded Autonomy: An agent's self-sufficiency (acquiring resources, persisting) is never measured; in deployment it chains tools to operate beyond its intended scope. The proliferation hazard category was omitted.

Scenario C — Evaluation as Leak: A vivid uplift demonstration is stored in an unprotected report and later accessed by an unauthorised party — the evaluation itself became the proliferation event because infosec handling was absent.

9. Regulatory Mapping

Requirement	EU AI Act	NIST AI RMF	ISO 42001
R1: Dangerous-capability elicitation evaluation	Art. 55 — Systemic-risk model evaluation	MEASURE 2.6 — Safety risk evaluation	Clause 8.3 — Verification
R2: Maximal elicitation	Art. 55 — Adversarial testing	MAP 5.1 — Impact likelihood/magnitude	Clause 8.3 — Verification
R3: Lower-bound reporting	Art. 15 — Accuracy representation	MEASURE 2.13 — TEVV effectiveness	Clause 8.3 — Verification
R4: Defined hazard categories	Art. 55 — Systemic-risk assessment	MAP 5.1 — Impact identification	Clause 6.1 — Actions to address risk
R5: Contained evaluation	Art. 15 — Cybersecurity	MANAGE 2.3 — Recovery from unknown risks	Clause 8.1 — Operational control
R7: Feeds capability gating	Art. 55 — Risk mitigation	MANAGE 1.3 — High-priority response	Clause 6.1 — Actions to address risk
R9: Authority disclosure	Art. 55 — Reporting	GOVERN 4.3 — Information sharing	—
R10: Infosec handling of findings	Art. 55 — Cybersecurity protection	MANAGE 4.1 — Monitoring	Clause 8.1 — Operational control

EU AI Act — Article 55 and Article 9

Article 55 requires model evaluation and systemic-risk assessment for general-purpose AI (GPAI) models with systemic risk; dangerous-capability elicitation is the core evidence for that assessment. Article 9 anchors the risk-management lifecycle and documentation.

NIST AI RMF — MEASURE 2.6, MAP 5.1

MEASURE 2.6 requires safety evaluation; MAP 5.1 requires identifying the likelihood and magnitude of impacts. Eliciting hazardous capability quantifies the magnitude side for catastrophic risks.

ISO 42001 — Clause 8.3, Clause 6.1

Clause 8.3 (verification) covers the evaluation; Clause 6.1 (actions to address risks) covers acting on measured hazardous capability.

10. Related Protocols

AG-748 (Dangerous Knowledge Uplift Prevention) — prevents human uplift; AG-802 measures the agent's own capability
AG-800 (Evaluation-Gaming and Sandbagging Detection) — ensures elicited capability is a true lower bound
AG-801 (Capability-Threshold Gating) — consumes elicited capability to gate deployment
AG-749 (Autonomous Replication Prevention) — proliferation capability probed here
AG-797 (Deceptive Alignment and Scheming Evaluation) — capability + propensity together define risk

Cite this protocol

AgentGoverning. (2026). AG-802: Dangerous-Capability Elicitation Evaluation. The Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-802
