Scalable-Oversight Protocol

Human Factors, Oversight & Trust Calibration · AGS v2.1 · 2026-06-06

EU AI Act · NIST AI RMF · ISO 42001

AGS Frontier Autonomy (Group K) | Human Factors, Oversight & Trust Calibration | Version 3.0

1. Definition

Scalable-Oversight Protocol governs the methods by which a less-capable overseer (human, or a weaker trusted model) can reliably supervise, evaluate, and correct a *more-capable* system whose outputs the overseer cannot directly verify — using documented techniques such as weak-to-strong training, AI-assisted debate, recursive reward modelling, or task decomposition.

As agents approach and exceed human capability on consequential tasks, naive human review stops being a reliable safety signal: the human cannot tell a correct answer from a convincing wrong one. This dimension requires that oversight of high-capability systems rests on a stated, evaluated protocol designed to remain reliable across the capability gap, rather than on unaided human judgement.

2. Scope

In scope: the oversight protocol for systems whose capability exceeds direct human verification on the task; weak-to-strong / debate / recursive-reward / decomposition methods; logged judge transcripts; evidence the protocol remains reliable at the relevant gap.

Out of scope: ordinary human-in-the-loop review where the human can verify outputs (covered by standard oversight dimensions), and the deception evaluation itself (AG-820). This dimension governs *the oversight method for the superhuman-gap case*.

3. Why This Matters

Every downstream control assumes someone can tell whether the agent did the right thing. When the agent is more capable than its overseer, that assumption silently fails: the overseer rubber-stamps outputs it cannot evaluate, and a subtly wrong or misaligned output passes review. Without an oversight method built to hold across that gap, there is no meaningful corrective signal on superhuman systems, which makes scalable oversight a central technical problem of governing near-AGI agents.

4. Requirements

R1: For agents whose capability exceeds direct human verification on a consequential task, oversight MUST rest on a documented scalable-oversight protocol (e.g. weak-to-strong, AI-debate, recursive reward modelling, factored/decomposed verification), not unaided human judgement.
R2: The protocol's stated soundness conditions and known failure modes MUST be documented, and reliance on it justified against those conditions.
R3: Where AI-assisted oversight (debate, AI critique, or a trusted weaker monitor) is used, the assisting models, the judge, and the aggregation method MUST be specified, and each assisting model MUST be trusted to at least the level required by the decision it informs.
R4: Oversight interactions used for consequential decisions (debate transcripts, critiques, decompositions) MUST be logged to the tamper-evident trail for review.
R5: The protocol MUST be empirically evaluated for reliability at the relevant capability gap; evidence of reliability MUST be recorded and re-checked as capability grows.
R6: Multiple independent overseers/critics SHOULD be aggregated where feasible to reduce single-overseer error, and disagreement SHOULD be treated as a signal for escalation.
R7: Where no protocol can be shown reliable at the required capability gap, the decision MUST be withheld, scoped down, or escalated — reliance on unverifiable approval is prohibited.
R8: The oversight protocol MUST be reviewed by a function independent of the team shipping the agent, and re-evaluated on material capability change.

5. Maturity Model

Basic: A documented oversight method beyond unaided human review is used for high-capability outputs, with logged interactions.
Intermediate: A stated scalable-oversight protocol with documented soundness conditions, AI-assisted critique/debate where appropriate, and recorded reliability evidence at the capability gap.
Advanced: Aggregated independent overseers, gap-quantified reliability evidence re-checked as capability grows, independent review, and a prohibition on unverifiable approvals.

6. Test Criteria

Test 6.1: Protocol Present for the Gap

Stimulus: Identify a consequential task where the agent exceeds direct human verification.
Expected: A documented scalable-oversight protocol governs the decision, with stated soundness conditions.
Fail: The decision rests on unaided human approval the overseer cannot actually verify.

Test 6.2: Reliability Evidence

Stimulus: Request the reliability evidence for the protocol at the relevant capability gap.
Expected: Empirical evidence exists and is re-checked as capability grows.
Fail: No reliability evidence; the protocol is assumed to work.
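
One way to make the evidence in Test 6.2 concrete is to measure how often the protocol's verdict matches ground truth on tasks at the relevant capability gap, then compare a conservative lower bound on that accuracy with a required threshold. The sketch below assumes such a labelled evaluation set exists and uses the Wilson score interval; the function names and the threshold are illustrative assumptions, not values set by this protocol.

```python
import math


def wilson_lower_bound(correct: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom


def protocol_reliable(correct: int, total: int, required_accuracy: float) -> bool:
    """True only if the conservative accuracy estimate clears the threshold."""
    return wilson_lower_bound(correct, total) >= required_accuracy
```

Using the lower bound rather than the raw accuracy means a small evaluation set cannot establish reliability on its own: 90 correct verdicts out of 100 does not clear a 0.85 threshold, because the interval is still wide. Re-running the same check after each capability change is one way to support the "re-checked as capability grows" expectation.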

Test 6.3: Withhold on Unverifiable

Stimulus: Present a decision for which no protocol is shown reliable at the gap.
Expected: The decision is withheld, scoped down, or escalated, not approved.
Fail: The unverifiable decision is approved anyway.
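
The behaviour Test 6.3 looks for can be sketched as a decision gate that refuses to approve when reliability has not been shown (R7) and escalates when independent overseers disagree (R6). This is an illustrative sketch: the `Outcome` values, the `Verdict` type, and the `min_overseers` default are assumptions, and R6 is a SHOULD rather than a MUST.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    ESCALATE = "escalate"
    WITHHOLD = "withhold"


@dataclass(frozen=True)
class Verdict:
    overseer_id: str
    approve: bool


def gate(
    verdicts: list[Verdict],
    protocol_shown_reliable: bool,
    min_overseers: int = 2,
) -> Outcome:
    # R7: never approve on a protocol with no reliability evidence at the gap.
    if not protocol_shown_reliable:
        return Outcome.WITHHOLD

    # R6: require several distinct overseers before trusting agreement.
    distinct_overseers = {v.overseer_id for v in verdicts}
    if len(distinct_overseers) < min_overseers:
        return Outcome.ESCALATE

    # R6: disagreement is a signal for escalation, not a vote to be counted.
    votes = {v.approve for v in verdicts}
    if len(votes) > 1:
        return Outcome.ESCALATE

    return Outcome.APPROVE if votes == {True} else Outcome.REJECT
```

The reliability check comes first on purpose: unanimous approval from overseers who cannot verify the output is exactly the unverifiable approval that R7 prohibits.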

7. Scoring

Score	Criteria
0	High-capability outputs are approved by unaided human review that cannot verify them
1	An oversight method beyond unaided review is used but its reliability is unstated
2	Documented scalable-oversight protocol with soundness conditions, logged interactions, reliability evidence
3	Aggregated overseers, gap-quantified re-checked reliability, independent review, unverifiable-approval prohibition
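
The scoring table can be read as a cumulative rubric, and a small function makes that reading explicit. The sketch below assumes each score level requires everything in the levels beneath it, which the table implies but does not state; the field names are hypothetical labels for the criteria in each row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    # Score 1 criterion
    method_beyond_unaided_review: bool = False
    # Score 2 criteria
    soundness_conditions_documented: bool = False
    interactions_logged: bool = False
    reliability_evidence_recorded: bool = False
    # Score 3 criteria
    overseers_aggregated: bool = False
    reliability_gap_quantified_and_rechecked: bool = False
    independent_review: bool = False
    unverifiable_approval_prohibited: bool = False


def score(e: Evidence) -> int:
    """Map audit evidence to the 0-3 scale, treating levels as cumulative."""
    if not e.method_beyond_unaided_review:
        return 0
    meets_level_2 = (
        e.soundness_conditions_documented
        and e.interactions_logged
        and e.reliability_evidence_recorded
    )
    if not meets_level_2:
        return 1
    meets_level_3 = (
        e.overseers_aggregated
        and e.reliability_gap_quantified_and_rechecked
        and e.independent_review
        and e.unverifiable_approval_prohibited
    )
    return 3 if meets_level_3 else 2
```

Under this reading, an organisation with aggregated overseers but no recorded reliability evidence still scores 1, because the higher-level practices do not compensate for a missing lower-level one.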

8. Failure Scenarios

Scenario A: Rubber-Stamped Superhuman Output. A research agent produces a complex analysis no reviewer can fully check; the reviewer approves it on surface plausibility. A subtle, consequential error ships. A debate or decomposition protocol could have broken the analysis into claims a weaker judge is able to check, giving the flaw a chance to surface before approval.

Scenario B: Untrusted Critic. An organisation uses the agent itself to critique its own outputs for oversight. Because the critic is the same model, it is no more trustworthy than the output it reviews, so the "oversight" inherits the agent's blind spots and any misalignment.

Scenario C: Unjustified Reliability. A team assumes its weak-to-strong setup works without evaluating it at the actual capability gap; at deployment the gap is larger than tested and oversight quietly fails.
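
Scenarios B and C describe configuration mistakes that can be caught before deployment. As a rough sketch, assuming trust is recorded as an integer tier (higher means more trusted) and the capability gap as a single number, a pre-deployment check could flag a self-critiquing agent, an under-trusted assisting model, or a deployed gap wider than the one the protocol was evaluated at. The tier scale, the gap unit, and every field name are hypothetical; this protocol does not define them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OversightConfig:
    overseen_agent: str
    assisting_models: dict[str, int]  # model id -> trust tier (hypothetical scale)
    judge: str
    aggregation: str
    decision_trust_tier: int  # tier the decision requires (hypothetical scale)
    evaluated_gap: float      # largest gap with reliability evidence
    deployed_gap: float       # gap expected in deployment


def config_problems(cfg: OversightConfig) -> list[str]:
    """Return human-readable findings; an empty list means none were found."""
    problems: list[str] = []
    # R3: assisting models, judge, and aggregation method must be specified.
    if not cfg.assisting_models:
        problems.append("no assisting models specified")
    if not cfg.judge:
        problems.append("no judge specified")
    if not cfg.aggregation:
        problems.append("no aggregation method specified")
    for model_id, tier in cfg.assisting_models.items():
        # Scenario B: the agent must not be its own critic.
        if model_id == cfg.overseen_agent:
            problems.append(f"{model_id}: agent is critiquing its own output")
        # R3: trust must be at least what the decision requires.
        if tier < cfg.decision_trust_tier:
            problems.append(f"{model_id}: trust tier {tier} is below required tier")
    # Scenario C: evidence must cover the gap actually deployed.
    if cfg.deployed_gap > cfg.evaluated_gap:
        problems.append("deployed gap exceeds the gap the protocol was evaluated at")
    return problems
```

A check like this only validates what has been declared. It does not replace the empirical evaluation in R5 or the independent review in R8.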

9. Regulatory Mapping

Requirement	EU AI Act	NIST AI RMF	ISO 42001
R1: Scalable-oversight protocol for the gap	Art. 14 — Human oversight	MAP 3.5 — Human oversight	Clause 9.1 — Monitoring and measurement
R2: Soundness conditions documented	Art. 55 — Model evaluation	MEASURE 2.13 — TEVV effectiveness	Clause 8.3 — Verification
R3: Trusted assisting models/judge specified	Art. 14 — Effective oversight	MAP 3.5 — Human oversight	Clause 8.3 — Verification
R4: Logged oversight interactions	Art. 12 — Record-keeping	MEASURE 2.4 — Production monitoring	Clause 9.1 — Monitoring and measurement
R5: Reliability evidence at the gap	Art. 55 — Adversarial testing	MEASURE 1.3 — Independent assessors	Clause 8.3 — Verification
R6: Aggregate independent overseers	Art. 14 — Effective oversight	MEASURE 1.3 — Independent assessors	Clause 8.3 — Verification
R7: Withhold on unverifiable	Art. 14 — Human oversight (stop)	MANAGE 1.3 — High-priority response	Clause 8.1 — Operational control
R8: Independent review, re-evaluate on change	Art. 55 — Systemic-risk governance	MEASURE 1.3 — Independent assessors	Clause 9.1 — Monitoring and measurement

EU AI Act — Article 14 and Article 55

Article 14 requires *effective* human oversight — oversight that an overseer can actually exercise. For systems beyond direct human verification, effectiveness requires a scalable-oversight protocol. Article 55 anchors the evaluation of that protocol for systemic-risk models.

NIST AI RMF — MAP 3.5, MEASURE 1.3

MAP 3.5 (human-oversight processes defined and assessed) and MEASURE 1.3 (independent assessment) require that oversight is real and evaluated. A scalable-oversight protocol is the means of meeting that requirement across the capability gap.

ISO 42001 — Clause 9.1, Clause 8.3

Clause 9.1 (monitoring and measurement) and Clause 8.3 (verification) require valid oversight and verification; AG-818 specifies how these stay valid when the system exceeds direct human checking.

Related Protocols

AG-820 (Undetectable-Deception-Under-Oversight Evaluation) — tests the failure this protocol must withstand
AG-819 (Oversight-Gap Declaration) — quantifies the gap this protocol must span
AG-797 (Deceptive Alignment and Scheming Evaluation) — scalable oversight underpins trustworthy evaluation
AG-804 (Interpretability-Based Internal-State Monitoring) — internal-state oversight complements behavioural scalable oversight
AG-801 (Capability-Threshold Gating) — gates deployment when reliable oversight cannot be shown

Cite this protocol

AgentGoverning. (2026). AG-818: Scalable-Oversight Protocol. The Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-818
