AGS Frontier Safety | Adversarial AI, Security Testing & Abuse Resistance | Version 2.2
Dangerous-Capability Elicitation Evaluation governs the structured, adversarial measurement of an agent's *own* hazardous capabilities — offensive cyber, CBRN uplift, autonomous proliferation, large-scale manipulation/persuasion, and unbounded autonomy — to establish what the agent could do if misused or misaligned.
This is distinct from controls that *prevent* an agent from uplifting a human with dangerous knowledge (AG-748). Here the object of measurement is the agent itself: maximal elicitation produces the capability evidence that feeds capability-threshold gating (AG-801) and the EU AI Act systemic-risk assessment.
In scope: maximal-elicitation measurement of the agent's dangerous capabilities across defined hazard categories; documentation of elicitation strength; capability lower-bounds for gating.
Out of scope: preventing the agent from assisting humans with dangerous content (AG-748), content-safety red-teaming, and the gating decision itself (AG-801). This dimension governs only the *measurement of the agent's hazardous capability*.
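The in-scope items above (elicitation strength documented alongside a capability lower bound per hazard category) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the `ElicitationRun` record, the method labels, and the use of a task success rate as the capability measure do not come from this standard. The idea shown is that the reported lower bound for a category is the strongest result any elicitation method achieved, and that the method is kept with it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElicitationRun:
    """One evaluation run (hypothetical record layout, not defined by the standard)."""
    hazard: str          # hazard category, e.g. "offensive_cyber"
    method: str          # elicitation method used, e.g. "default_prompt"
    success_rate: float  # fraction of hazard tasks completed, 0.0 to 1.0


def capability_lower_bounds(runs):
    """Return, per hazard category, the run with the highest observed success rate.

    The best observed result is only a lower bound on capability: a stronger
    elicitation method might still raise it.
    """
    bounds = {}
    for run in runs:
        best = bounds.get(run.hazard)
        if best is None or run.success_rate > best.success_rate:
            bounds[run.hazard] = run
    return bounds


# Illustrative numbers only; they are not measurements.
runs = [
    ElicitationRun("offensive_cyber", "default_prompt", 0.0),
    ElicitationRun("offensive_cyber", "jailbreak_plus_tools", 0.4),
    ElicitationRun("cbrn_uplift", "default_prompt", 0.1),
]
bounds = capability_lower_bounds(runs)
for hazard, run in bounds.items():
    print(f"{hazard}: >= {run.success_rate} (elicited via {run.method})")
```

Keeping the method name next to the bound is what lets a reviewer judge how strong the elicitation was, and whether a default-prompting result has been mistaken for a capability estimate.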
Safeguards must be matched to what an agent can actually do, not to its intended use. An agent inadvertently capable of autonomous cyber operations or meaningful CBRN uplift requires containment far beyond its nominal task. Only deliberate, strong-elicitation evaluation reveals these latent capabilities before an adversary or a misaligned objective surfaces them — making this the empirical basis of responsible scaling and systemic-risk classification.
Test 6.1: Maximal Elicitation Applied
Test 6.2: Threshold Linkage
Test 6.3: Containment of the Evaluation
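As an illustration of what a threshold-linkage check in the spirit of Test 6.2 might verify, the sketch below compares each hazard category's measured lower bound against a gating threshold and flags categories that were never measured. The source gives only the test title, so the pass/fail logic, the status labels, and the numeric thresholds are all assumptions. The hazard names follow the categories listed in the opening definition.

```python
# Hazard categories taken from the dimension's definition; the identifiers are illustrative.
HAZARD_SET = (
    "offensive_cyber",
    "cbrn_uplift",
    "autonomous_proliferation",
    "manipulation",
    "unbounded_autonomy",
)


def link_to_thresholds(lower_bounds, thresholds):
    """Map every hazard category to a linkage status (hypothetical labels).

    lower_bounds: dict of hazard -> measured capability lower bound
    thresholds:   dict of hazard -> gating threshold (values are assumptions)
    """
    findings = {}
    for hazard in HAZARD_SET:
        if hazard not in lower_bounds:
            findings[hazard] = "not_measured"       # omitted category is itself a finding
        elif hazard not in thresholds:
            findings[hazard] = "no_threshold"       # evidence exists but nothing consumes it
        elif lower_bounds[hazard] >= thresholds[hazard]:
            findings[hazard] = "threshold_reached"  # hand off to the gating decision
        else:
            findings[hazard] = "below_threshold"
    return findings


# Illustrative values only.
measured = {"offensive_cyber": 0.4, "cbrn_uplift": 0.1}
limits = {"offensive_cyber": 0.25, "cbrn_uplift": 0.25, "autonomous_proliferation": 0.25}
for hazard, status in link_to_thresholds(measured, limits).items():
    print(hazard, status)
```

The sketch reports status only. The gating decision itself stays out of scope for this dimension and belongs to AG-801.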
| Score | Criteria |
|---|---|
| 0 | No dangerous-capability evaluation of a frontier/high-capability agent |
| 1 | Default-prompting capability check across some categories |
| 2 | Maximal-elicitation evaluation across the hazard set with lower-bounds, containment, and independent review |
| 3 | As level 2 plus sandbag-resistant bounds, capability-gating integration, authority disclosure, and infosec handling of findings |
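
The scoring table above can be read as a cumulative checklist. A minimal sketch of that reading follows; the evidence labels are hypothetical names for the criteria in the table, and treating level 1 as "some evaluation exists but level 2 is incomplete" is an interpretation rather than something the table states.

```python
# Evidence labels are hypothetical shorthand for the criteria in the scoring table.
LEVEL_2 = frozenset({
    "maximal_elicitation",
    "full_hazard_set",
    "lower_bounds",
    "containment",
    "independent_review",
})
LEVEL_3_EXTRA = frozenset({
    "sandbag_resistant_bounds",
    "gating_integration",
    "authority_disclosure",
    "infosec_handling",
})


def maturity_score(evidence):
    """Return the 0-3 score implied by the set of evidence labels present."""
    evidence = frozenset(evidence)
    if LEVEL_2 <= evidence:
        # Level 3 requires every level-2 criterion plus all four additions.
        return 3 if LEVEL_3_EXTRA <= evidence else 2
    if "default_prompt_check" in evidence or "maximal_elicitation" in evidence:
        # Some evaluation exists, but level 2 is incomplete (assumed reading of level 1).
        return 1
    return 0


print(maturity_score({"default_prompt_check"}))   # 1
print(maturity_score(LEVEL_2))                    # 2
print(maturity_score(LEVEL_2 | LEVEL_3_EXTRA))    # 3
```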
Scenario A — Default-Behaviour Underestimate: An agent refuses an offensive-cyber request during evaluation and is judged incapable. An attacker later jailbreaks it and elicits the capability. Maximal elicitation would have revealed the true ceiling.
Scenario B — Unbounded Autonomy: An agent's self-sufficiency (acquiring resources, persisting) is never measured because the autonomous-proliferation hazard category was omitted from the evaluation set. In deployment it chains tools to operate beyond its intended scope.
Scenario C — Evaluation as Leak: A vivid uplift demonstration is stored in an unprotected report and later accessed by an unauthorised party — the evaluation itself became the proliferation event because infosec handling was absent.
| Requirement | EU AI Act | NIST AI RMF | ISO 42001 |
|---|---|---|---|
| R1: Dangerous-capability elicitation evaluation | Art. 55 — Systemic-risk model evaluation | MEASURE 2.6 — Safety risk evaluation | Clause 8.3 — Verification |
| R2: Maximal elicitation | Art. 55 — Adversarial testing | MAP 5.1 — Impact likelihood/magnitude | Clause 8.3 — Verification |
| R3: Lower-bound reporting | Art. 15 — Accuracy representation | MEASURE 2.13 — TEVV effectiveness | Clause 8.3 — Verification |
| R4: Defined hazard categories | Art. 55 — Systemic-risk assessment | MAP 5.1 — Impact identification | Clause 6.1 — Actions to address risk |
| R5: Contained evaluation | Art. 15 — Cybersecurity | MANAGE 2.3 — Recovery from unknown risks | Clause 8.1 — Operational control |
| R7: Feeds capability gating | Art. 55 — Risk mitigation | MANAGE 1.3 — High-priority response | Clause 6.1 — Actions to address risk |
| R9: Authority disclosure | Art. 55 — Reporting | GOVERN 4.3 — Information sharing | — |
| R10: Infosec handling of findings | Art. 55 — Cybersecurity protection | MANAGE 4.1 — Monitoring | Clause 8.1 — Operational control |
EU AI Act: Article 55 requires model evaluation and systemic-risk assessment for general-purpose AI (GPAI) models with systemic risk; dangerous-capability elicitation is the core evidence for that assessment. Article 9 anchors the risk-management lifecycle and documentation.
NIST AI RMF: MEASURE 2.6 requires safety evaluation; MAP 5.1 requires identifying the likelihood and magnitude of impacts. Eliciting hazardous capability supplies the magnitude estimate for catastrophic risks.
ISO 42001: Clause 8.3 (verification) covers the evaluation; Clause 6.1 (actions to address risks) covers acting on measured hazardous capability.