AG-818

Scalable-Oversight Protocol

Human Factors, Oversight & Trust Calibration · AGS v2.1 · 2026-06-06
EU AI Act NIST AI RMF ISO 42001

AGS Frontier Autonomy (Group K) | Human Factors, Oversight & Trust Calibration | Version 3.0

1. Definition

Scalable-Oversight Protocol governs the methods by which a less-capable overseer (human, or a weaker trusted model) can reliably supervise, evaluate, and correct a *more-capable* system whose outputs the overseer cannot directly verify — using documented techniques such as weak-to-strong training, AI-assisted debate, recursive reward modelling, or task decomposition.

As agents approach and exceed human capability on consequential tasks, naive human review stops being a reliable safety signal: the human cannot tell a correct answer from a convincing wrong one. This dimension requires that oversight of high-capability systems rests on a stated, evaluated protocol designed to remain reliable across the capability gap, rather than on unaided human judgement.
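To make the idea concrete, here is a minimal sketch of oversight by task decomposition, one of the techniques named above. It assumes the output has already been split into sub-claims and that a weak overseer supplies a `verify` function; `SubClaim`, `checkable`, and `oversee_by_decomposition` are hypothetical names used only for illustration, not part of AG-818.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SubClaim:
    text: str
    # Assumption: the protocol records whether the weak overseer can
    # check this step directly, without relying on the stronger system.
    checkable: bool


def oversee_by_decomposition(
    sub_claims: List[SubClaim],
    verify: Callable[[str], bool],
) -> str:
    """Return 'approve', 'reject', or 'withhold' for the whole output."""
    unverifiable = False
    for claim in sub_claims:
        if not claim.checkable:
            # Remember that at least one step is beyond the overseer.
            unverifiable = True
        elif not verify(claim.text):
            # A step the overseer can check has failed: reject outright.
            return "reject"
    # Every checkable step passed; approve only if nothing was left unchecked.
    return "withhold" if unverifiable else "approve"


# Illustrative use: the overseer only knows a small set of facts.
known_true = {"2 + 2 = 4", "4 * 3 = 12"}
claims = [SubClaim("2 + 2 = 4", True), SubClaim("4 * 3 = 12", True)]
decision = oversee_by_decomposition(claims, lambda text: text in known_true)
```

The design choice worth noting is that an uncheckable step never resolves to approval: the composite decision is only as strong as the overseer's ability to verify each piece.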

2. Scope

In scope: the oversight protocol for systems whose capability exceeds direct human verification on the task; weak-to-strong / debate / recursive-reward / decomposition methods; logged judge transcripts; evidence the protocol remains reliable at the relevant gap.
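One way to picture the "logged judge transcripts" item is as a structured record written for each oversight decision. The sketch below is an assumption about what such a record might contain; the field names (`output_id`, `evaluated_gap`, and so on) and the JSON-lines format are hypothetical and are not prescribed by this dimension.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TranscriptTurn:
    speaker: str  # e.g. "debater_a", "debater_b", "judge" (illustrative roles)
    content: str


@dataclass
class OversightRecord:
    output_id: str      # identifier of the output under review
    method: str         # e.g. "debate", "decomposition", "weak-to-strong"
    overseer_id: str    # the human or trusted model acting as judge
    verdict: str        # "approve", "reject", or "withhold"
    evaluated_gap: str  # description of the gap at which reliability was evaluated
    turns: List[TranscriptTurn] = field(default_factory=list)


def to_log_line(record: OversightRecord) -> str:
    """Serialise one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


record = OversightRecord(
    output_id="out-001",
    method="debate",
    overseer_id="judge-h1",
    verdict="withhold",
    evaluated_gap="hypothetical benchmark tier 2",
    turns=[TranscriptTurn("judge", "Neither debater resolved step 3.")],
)
line = to_log_line(record)
```

Keeping the full transcript alongside the verdict lets a later reviewer reconstruct why the judge decided as it did, rather than only what it decided.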

Out of scope: ordinary human-in-the-loop review where the human can verify outputs (covered by standard oversight dimensions), and the deception evaluation itself (AG-820). This dimension governs *the oversight method for the superhuman-gap case*.

3. Why This Matters

Every downstream control assumes someone can tell whether the agent did the right thing. When the agent is more capable than its overseer, that assumption silently fails — the overseer rubber-stamps outputs it cannot evaluate, and a subtly wrong or misaligned output passes review. Scalable oversight is the only way to keep a meaningful corrective signal on superhuman systems, and is the central technical problem of governing near-AGI agents.

4. Requirements

5. Maturity Model

6. Test Criteria

Test 6.1: Protocol Present for the Gap

Test 6.2: Reliability Evidence

Test 6.3: Withhold on Unverifiable

7. Scoring

| Score | Criteria |
|---|---|
| 0 | High-capability outputs are approved by unaided human review that cannot verify them |
| 1 | An oversight method beyond unaided review is used but its reliability is unstated |
| 2 | Documented scalable-oversight protocol with soundness conditions, logged interactions, reliability evidence |
| 3 | Aggregated overseers, gap-quantified re-checked reliability, independent review, unverifiable-approval prohibition |
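The rubric above can be read as a cumulative checklist. The following sketch encodes it that way, assuming each level requires everything in the level below; that cumulative reading, the `OversightEvidence` type, and its field names are assumptions made for illustration rather than part of the scoring definition.

```python
from dataclasses import dataclass


@dataclass
class OversightEvidence:
    # Score 1: some method beyond unaided human review is in use.
    method_beyond_unaided_review: bool = False
    # Score 2: documented protocol, logged interactions, reliability evidence.
    documented_protocol: bool = False
    interactions_logged: bool = False
    reliability_evidence: bool = False
    # Score 3: aggregation, gap-quantified re-checks, independent review,
    # and a prohibition on approving unverifiable outputs.
    aggregated_overseers: bool = False
    gap_quantified_rechecked: bool = False
    independent_review: bool = False
    unverifiable_approval_prohibited: bool = False


def score(e: OversightEvidence) -> int:
    """Map evidence to a 0-3 score, treating levels as cumulative (assumption)."""
    if not e.method_beyond_unaided_review:
        return 0
    level_2 = (
        e.documented_protocol
        and e.interactions_logged
        and e.reliability_evidence
    )
    if not level_2:
        return 1
    level_3 = (
        e.aggregated_overseers
        and e.gap_quantified_rechecked
        and e.independent_review
        and e.unverifiable_approval_prohibited
    )
    return 3 if level_3 else 2
```

Under this reading, an organisation with aggregated overseers but no logged interactions would still score 1, because the level-2 criteria are not all met.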

8. Failure Scenarios

Scenario A — Rubber-Stamped Superhuman Output: A research agent produces a complex analysis no reviewer can fully check; the reviewer approves it on surface plausibility. A subtle, consequential error ships. A debate/decomposition protocol would have surfaced the flaw to a weaker judge.

Scenario B — Untrusted Critic: An organisation uses the agent itself to critique its own outputs for oversight. Because the critic is no more trusted than the decision, the "oversight" inherits the agent's blind spots and any misalignment.

Scenario C — Unjustified Reliability: A team assumes its weak-to-strong setup works without evaluating it at the actual capability gap; at deployment the gap is larger than tested and oversight quietly fails.
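The three scenarios suggest three pre-approval checks. The sketch below shows one possible gate, assuming the capability gap can be summarised as a single number and that the organisation keeps a list of trusted overseer identities; both are simplifying assumptions, and `ApprovalRequest` and `blocking_reasons` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class ApprovalRequest:
    agent_id: str                   # system that produced the output
    critic_id: str                  # human or model assisting the review
    reviewer_can_verify: bool       # can the reviewer check the output directly?
    protocol_verdict_present: bool  # did a scalable-oversight protocol run?
    evaluated_gap: float            # largest gap at which reliability was tested
    deployment_gap: float           # gap expected for this output


def blocking_reasons(req: ApprovalRequest, trusted_ids: Set[str]) -> List[str]:
    """Return the reasons approval must be withheld; empty means none found."""
    reasons: List[str] = []
    # Scenario A: approval would rest on surface plausibility alone.
    if not req.reviewer_can_verify and not req.protocol_verdict_present:
        reasons.append("unverifiable output with no protocol verdict")
    # Scenario B: the critic is the agent itself or is not on the trusted list.
    if req.critic_id == req.agent_id or req.critic_id not in trusted_ids:
        reasons.append("critic is not an independent trusted overseer")
    # Scenario C: reliability was never evaluated at a gap this large.
    if req.deployment_gap > req.evaluated_gap:
        reasons.append("deployment gap exceeds evaluated gap")
    return reasons
```

An empty list only means none of these three failure modes was detected; it is not evidence that the output is correct.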

9. Regulatory Mapping

| Requirement | EU AI Act | NIST AI RMF | ISO 42001 |
|---|---|---|---|
| R1: Scalable-oversight protocol for the gap | Art. 14: Human oversight | MAP 3.5: Human oversight | Clause 9.1: Monitoring and measurement |
| R2: Soundness conditions documented | Art. 55: Model evaluation | MEASURE 2.13: TEVV effectiveness | Clause 8.3: Verification |
| R3: Trusted assisting models/judge specified | Art. 14: Effective oversight | MAP 3.5: Human oversight | Clause 8.3: Verification |
| R4: Logged oversight interactions | Art. 12: Record-keeping | MEASURE 2.4: Production monitoring | Clause 9.1: Monitoring and measurement |
| R5: Reliability evidence at the gap | Art. 55: Adversarial testing | MEASURE 1.3: Independent assessors | Clause 8.3: Verification |
| R6: Aggregate independent overseers | Art. 14: Effective oversight | MEASURE 1.3: Independent assessors | Clause 8.3: Verification |
| R7: Withhold on unverifiable | Art. 14: Human oversight (stop) | MANAGE 1.3: High-priority response | Clause 8.1: Operational control |
| R8: Independent review, re-evaluate on change | Art. 55: Systemic-risk governance | MEASURE 1.3: Independent assessors | Clause 9.1: Monitoring and measurement |
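As an illustration of how R6 (aggregate independent overseers) and R7 (withhold on unverifiable) might combine in practice, the sketch below aggregates verdicts from several overseers and withholds when agreement is weak. The quorum size, the two-thirds threshold, and the function name `aggregate_verdicts` are assumptions chosen for the example, not values taken from this protocol.

```python
from collections import Counter
from typing import Dict


def aggregate_verdicts(
    verdicts: Dict[str, str],       # overseer id -> "approve" | "reject" | "withhold"
    min_overseers: int = 3,         # assumed quorum
    min_agreement: float = 2 / 3,   # assumed agreement threshold
) -> str:
    """Combine independent verdicts; fall back to 'withhold' when unsure."""
    # Too few overseers: aggregation adds no assurance, so withhold.
    if len(verdicts) < min_overseers:
        return "withhold"
    counts = Counter(verdicts.values())
    top_verdict, top_count = counts.most_common(1)[0]
    # Weak agreement, or a majority that could not decide: withhold.
    if top_verdict == "withhold" or top_count / len(verdicts) < min_agreement:
        return "withhold"
    return top_verdict
```

Keying the input by overseer identity prevents one overseer from being counted twice, but it does not establish that the overseers are independent of each other; that still has to be shown separately.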

EU AI Act — Article 14 and Article 55

Article 14 requires *effective* human oversight — oversight that an overseer can actually exercise. For systems beyond direct human verification, effectiveness requires a scalable-oversight protocol. Article 55 anchors the evaluation of that protocol for systemic-risk models.

NIST AI RMF — MAP 3.5, MEASURE 1.3

MAP 3.5 (human-oversight processes defined and assessed) and MEASURE 1.3 (independent assessment) require that oversight is real and evaluated — exactly what scalable oversight provides across the capability gap.

ISO 42001 — Clause 9.1, Clause 8.3

Clause 9.1 (monitoring and measurement) and Clause 8.3 (verification) require valid oversight and verification; AG-818 specifies how these stay valid when the system exceeds direct human checking.

Cite this protocol
AgentGoverning. (2026). AG-818: Scalable-Oversight Protocol. The Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-818