AGS Frontier Autonomy (Group K) | Human Factors, Oversight & Trust Calibration | Version 3.0
Scalable-Oversight Protocol governs the methods by which a less-capable overseer (human, or a weaker trusted model) can reliably supervise, evaluate, and correct a *more-capable* system whose outputs the overseer cannot directly verify — using documented techniques such as weak-to-strong training, AI-assisted debate, recursive reward modelling, or task decomposition.
As agents approach and exceed human capability on consequential tasks, naive human review stops being a reliable safety signal: the human cannot tell a correct answer from a convincing wrong one. This dimension requires that oversight of high-capability systems rests on a stated, evaluated protocol designed to remain reliable across the capability gap, rather than on unaided human judgement.
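To illustrate how task decomposition can let a weaker overseer supervise an output it cannot verify as a whole, the sketch below approves only when every sub-claim has been checked individually. It is a minimal illustration under stated assumptions, not a method prescribed by this dimension: the function `review_by_decomposition`, the three verdict strings, and the convention that the `check` callback returns `None` when the overseer cannot verify a sub-claim are all hypothetical.

```python
from typing import Callable, List, Optional


def review_by_decomposition(
    subclaims: List[str],
    check: Callable[[str], Optional[bool]],
) -> str:
    """Review a hard output as a list of smaller claims (hypothetical helper).

    `check` stands in for the overseer: it returns True if it verified the
    sub-claim, False if it refuted it, and None if it could not verify it.
    """
    if not subclaims:
        return "withheld"  # nothing checkable was presented
    results = [check(claim) for claim in subclaims]
    if any(r is False for r in results):
        return "rejected"  # at least one sub-claim was refuted
    if any(r is None for r in results):
        return "withheld"  # at least one sub-claim is beyond the overseer
    return "approved"      # every sub-claim was positively verified
```

The design choice worth noting is that an unverifiable sub-claim never counts as a pass: the overall verdict falls back to "withheld" rather than relying on surface plausibility.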
In scope: the oversight protocol for systems whose capability exceeds direct human verification on the task; weak-to-strong / debate / recursive-reward / decomposition methods; logged judge transcripts; evidence the protocol remains reliable at the relevant gap.
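One way to make the "logged judge transcripts" item concrete is an append-only record per oversight interaction. The sketch below is illustrative only and is not a schema defined by this dimension: the `JudgeTranscript` class, its field names, the verdict strings, and the JSON Lines layout are all assumptions introduced here.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass(frozen=True)
class JudgeTranscript:
    """One logged oversight interaction (hypothetical schema)."""
    task_id: str
    method: str       # e.g. "debate" or "decomposition"
    judge_id: str     # the weaker, trusted overseer
    turns: List[str] = field(default_factory=list)  # material shown to the judge
    verdict: str = "withheld"  # "approved", "rejected", or "withheld"


def to_log_line(record: JudgeTranscript) -> str:
    """Serialise a record as a single JSON Lines entry with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


def from_log_line(line: str) -> JudgeTranscript:
    """Rebuild a record from one logged line."""
    return JudgeTranscript(**json.loads(line))
```

Defaulting `verdict` to "withheld" means a record that is written before the judge decides cannot be mistaken for an approval.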
Out of scope: ordinary human-in-the-loop review where the human can verify outputs (covered by standard oversight dimensions), and the deception evaluation itself (AG-820). This dimension governs *the oversight method for the superhuman-gap case*.
Every downstream control assumes someone can tell whether the agent did the right thing. When the agent is more capable than its overseer, that assumption silently fails — the overseer rubber-stamps outputs it cannot evaluate, and a subtly wrong or misaligned output passes review. Scalable oversight is the only way to keep a meaningful corrective signal on superhuman systems, and is the central technical problem of governing near-AGI agents.
Test 6.1: Protocol Present for the Gap
Test 6.2: Reliability Evidence
Test 6.3: Withhold on Unverifiable
| Score | Criteria |
|---|---|
| 0 | High-capability outputs are approved by unaided human review that cannot verify them |
| 1 | An oversight method beyond unaided review is used but its reliability is unstated |
| 2 | Documented scalable-oversight protocol with soundness conditions, logged interactions, reliability evidence |
| 3 | Aggregated overseers, gap-quantified re-checked reliability, independent review, unverifiable-approval prohibition |
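The rubric above can be read as a small decision procedure. The following sketch assumes the levels are cumulative (a score of 3 requires everything listed for 2), which the table implies but does not state outright; the `Evidence` class and its field names are hypothetical labels for the criteria, not terms defined by this dimension.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Assessor's findings, one flag per rubric criterion (hypothetical names)."""
    method_beyond_unaided_review: bool = False
    soundness_conditions_documented: bool = False
    interactions_logged: bool = False
    reliability_evidence: bool = False
    overseers_aggregated: bool = False
    reliability_rechecked_at_gap: bool = False
    independent_review: bool = False
    unverifiable_approval_prohibited: bool = False


def maturity_score(e: Evidence) -> int:
    """Map findings to the 0-3 score, assuming cumulative levels."""
    if not e.method_beyond_unaided_review:
        return 0
    level_2 = (
        e.soundness_conditions_documented
        and e.interactions_logged
        and e.reliability_evidence
    )
    if not level_2:
        return 1
    level_3 = (
        e.overseers_aggregated
        and e.reliability_rechecked_at_gap
        and e.independent_review
        and e.unverifiable_approval_prohibited
    )
    return 3 if level_3 else 2
```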
Scenario A — Rubber-Stamped Superhuman Output: A research agent produces a complex analysis no reviewer can fully check; the reviewer approves it on surface plausibility. A subtle, consequential error ships. A debate/decomposition protocol would have surfaced the flaw to a weaker judge.
Scenario B — Untrusted Critic: An organisation uses the agent itself to critique its own outputs for oversight. Because the critic is no more trusted than the agent whose decision it reviews, the "oversight" inherits the agent's blind spots and any misalignment.
Scenario C — Unjustified Reliability: A team assumes its weak-to-strong setup works without evaluating it at the actual capability gap; at deployment the gap is larger than tested and oversight quietly fails.
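Scenarios B and C describe failures that a simple pre-deployment check could flag. The sketch below is an assumption-laden illustration rather than a prescribed control: `OversightSetup`, `setup_problems`, and the idea of expressing the capability gap as a single number (for example a benchmark-score difference) are hypothetical, and how the gap is actually measured is left open here.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class OversightSetup:
    """Facts about a proposed oversight arrangement (hypothetical fields)."""
    agent_id: str
    critic_id: str
    trusted_ids: FrozenSet[str]
    max_evaluated_gap: float  # largest gap at which reliability was measured
    deployment_gap: float     # estimated gap at deployment, same assumed units


def setup_problems(s: OversightSetup) -> List[str]:
    """Return a list of findings; an empty list means no issue was detected."""
    problems: List[str] = []
    if s.critic_id == s.agent_id:
        problems.append("critic is the overseen agent (Scenario B)")
    elif s.critic_id not in s.trusted_ids:
        problems.append("critic is not on the trusted list (Scenario B)")
    if s.deployment_gap > s.max_evaluated_gap:
        problems.append("deployment gap exceeds evaluated gap (Scenario C)")
    return problems
```

An empty result only means these two checks passed; it is not evidence that the protocol is reliable.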
| Requirement | EU AI Act | NIST AI RMF | ISO 42001 |
|---|---|---|---|
| R1: Scalable-oversight protocol for the gap | Art. 14 — Human oversight | MAP 3.5 — Human oversight | Clause 9.1 — Monitoring and measurement |
| R2: Soundness conditions documented | Art. 55 — Model evaluation | MEASURE 2.13 — TEVV effectiveness | Clause 8.3 — Verification |
| R3: Trusted assisting models/judge specified | Art. 14 — Effective oversight | MAP 3.5 — Human oversight | Clause 8.3 — Verification |
| R4: Logged oversight interactions | Art. 12 — Record-keeping | MEASURE 2.4 — Production monitoring | Clause 9.1 — Monitoring and measurement |
| R5: Reliability evidence at the gap | Art. 55 — Adversarial testing | MEASURE 1.3 — Independent assessors | Clause 8.3 — Verification |
| R6: Aggregate independent overseers | Art. 14 — Effective oversight | MEASURE 1.3 — Independent assessors | Clause 8.3 — Verification |
| R7: Withhold on unverifiable | Art. 14 — Human oversight (stop) | MANAGE 1.3 — High-priority response | Clause 8.1 — Operational control |
| R8: Independent review, re-evaluate on change | Art. 55 — Systemic-risk governance | MEASURE 1.3 — Independent assessors | Clause 9.1 — Monitoring and measurement |
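Requirements R6 (aggregate independent overseers) and R7 (withhold on unverifiable) can be combined in one decision rule. The sketch below is one possible reading, not the rule mandated by this dimension: the vote labels, the `aggregate_verdict` function, and the two-thirds quorum are assumptions chosen for illustration.

```python
from collections import Counter
from fractions import Fraction
from typing import Sequence

APPROVE, REJECT, CANNOT_VERIFY = "approve", "reject", "cannot_verify"


def aggregate_verdict(
    votes: Sequence[str],
    quorum: Fraction = Fraction(2, 3),  # assumed threshold, not from the source
) -> str:
    """Combine independent overseer votes into one verdict.

    Returns "approved" or "rejected" only when that outcome reaches the
    quorum among ALL overseers; otherwise the output is "withheld".
    """
    unknown = set(votes) - {APPROVE, REJECT, CANNOT_VERIFY}
    if unknown:
        raise ValueError(f"unrecognised votes: {sorted(unknown)}")
    if not votes:
        return "withheld"  # no overseer reviewed the output
    counts = Counter(votes)
    needed = quorum * len(votes)  # exact arithmetic via Fraction
    if counts[APPROVE] >= needed:
        return "approved"
    if counts[REJECT] >= needed:
        return "rejected"
    return "withheld"
```

Because "cannot_verify" votes count towards neither outcome, a panel that mostly cannot check the output yields "withheld" rather than a default approval.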
Article 14 of the EU AI Act requires *effective* human oversight, meaning oversight that an overseer can actually exercise. For systems beyond direct human verification, effectiveness requires a scalable-oversight protocol. Article 55 anchors the evaluation of that protocol for systemic-risk models.
Under the NIST AI RMF, MAP 3.5 (human-oversight processes defined and assessed) and MEASURE 1.3 (independent assessment) require that oversight is real and evaluated, which is what a scalable-oversight protocol is meant to provide across the capability gap.
Under ISO 42001, Clause 9.1 (monitoring and measurement) and Clause 8.3 (verification) require that oversight and verification remain valid; AG-818 specifies how they stay valid when the system exceeds direct human checking.