Oversight Ergonomic Design Governance requires that the interfaces, workflows, information architecture, and temporal design of human oversight processes are engineered to support accurate, timely, and sustainable human review of AI agent decisions. Effective oversight depends not only on the reviewer's authority and independence (AG-439) but on whether the review environment presents information in a way that enables genuine comprehension, highlights material risks, minimises cognitive overload, and supports the reviewer's ability to detect errors at the rate required by the operational context. This dimension mandates that oversight interfaces are designed, tested, and iteratively improved based on empirical human factors evidence, treating the reviewer as a critical system component whose performance is bounded by cognitive limits, attentional capacity, and the quality of the information environment.
Scenario A — Information Overload Produces Rubber-Stamping: A trade surveillance agent in an investment bank flags 342 potential market abuse alerts per day for human review. Each alert presents 14 data fields, a 200-word agent rationale, links to 3-7 supporting documents, and a recommended action. Genuine review takes roughly 6 minutes per alert, so the full queue represents more than 34 hours of review work; clearing it within an 8-hour shift leaves under 85 seconds per alert, with no breaks. In practice, the reviewer spends the first two hours of the shift genuinely scrutinising alerts. By hour four, she is scanning the agent's recommended action and approving without reading the rationale or checking supporting documents. Her approval rate climbs from 78% in the first hour to 97% by the afternoon. On day 147 of this pattern, she approves the agent's recommended closure of a genuine market manipulation alert involving a £4.8 million suspicious trading pattern. The manipulation continues for three weeks before detection through an unrelated channel. Regulatory investigation reveals the reviewer was processing 57 alerts per hour in the afternoon — a rate that makes genuine review physically impossible.
What went wrong: The interface presented each alert with the same visual weight, same information density, and same interaction requirements regardless of risk severity. No triage mechanism distinguished high-risk alerts from routine alerts. The queue volume made genuine review of every alert impossible within the time available. The interface design forced a choice between genuine review of a subset and cursory processing of all alerts — and operational expectations demanded all alerts be "reviewed." Consequence: £4.8 million in undetected market manipulation, FCA enforcement investigation, £2.1 million fine for inadequate surveillance arrangements, and personal accountability proceedings against the reviewer's senior manager.
Scenario B — Critical Information Buried in Secondary Screens: A clinical decision support agent recommends medication dosages for hospital patients. The review interface displays the recommended drug and dosage prominently on the primary screen. The patient's renal function — a critical factor in dosage adjustment — is available on a secondary tab labelled "Lab Results" that requires two clicks to access. The agent's internal logic accounts for renal function in its recommendation, but the reviewer cannot verify this without navigating to the secondary screen. In 94% of cases, the agent's recommendation is correct and the reviewer approves from the primary screen without checking renal function. On one occasion, the agent's data feed receives a stale creatinine value (48 hours old) while the patient's renal function has deteriorated significantly. The agent recommends a standard dosage that is now dangerously high for the patient's current renal status. The reviewer approves from the primary screen. The patient receives the overdose, requiring emergency intervention and a 3-day ICU stay. The hospital's investigation estimates the additional treatment cost at £47,000, and the reviewer reports she was unaware that the renal function display was stale because the staleness indicator was a grey timestamp in 9-point font on the secondary screen.
What went wrong: The interface architecture separated the agent's recommendation from the critical context needed to verify it. The most important verification data (renal function and its currency) was hidden behind navigation steps that the reviewer would only take if she suspected a problem — but the interface gave her no reason to suspect a problem. The staleness indicator was present but ergonomically invisible. The interface was designed for the common case (correct recommendation, current data) rather than the failure case (incorrect recommendation, stale data). Consequence: Patient harm, £47,000 additional treatment cost, mandatory adverse event report, malpractice exposure, and systemic loss of clinician trust in the decision support system.
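The staleness failure in Scenario B is mechanically simple to prevent once freshness thresholds are defined. A minimal Python sketch of the kind of check requirement 4.6 later mandates; the field names and threshold values here are illustrative, not taken from this standard or any real system:

```python
from datetime import datetime, timedelta, timezone

# Illustrative per-field freshness limits (hypothetical values).
STALENESS_THRESHOLDS = {
    "creatinine": timedelta(hours=24),
    "weight": timedelta(days=30),
}

def flag_stale_inputs(inputs, now=None):
    """Return {field: age} for every input older than its threshold.

    `inputs` maps a field name to the timestamp at which that value was
    last refreshed. Fields without a defined threshold are ignored.
    """
    now = now or datetime.now(timezone.utc)
    stale = {}
    for field, refreshed_at in inputs.items():
        limit = STALENESS_THRESHOLDS.get(field)
        if limit is not None and now - refreshed_at > limit:
            stale[field] = now - refreshed_at
    return stale
```

In Scenario B terms, a 48-hour-old creatinine value would be returned by this check and could then be rendered as a prominent primary-screen alert rather than a grey 9-point timestamp on a secondary tab.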
Scenario C — Time Pressure Defeats Oversight in High-Frequency Operations: An autonomous trading agent executes high-frequency currency conversions and requires human approval for transactions exceeding £500,000. The approval interface displays a transaction summary and requires the reviewer to click "Approve" or "Reject" within 90 seconds before the price quote expires. The interface shows: currency pair, amount, exchange rate, and the agent's risk score. It does not show the reviewer's cumulative approved exposure for the day, the concentration of exposure in the specific currency pair, or the deviation of the proposed rate from the market mid-point. At 14:37, a rapid currency movement triggers 23 approval requests in 12 minutes. The reviewer approves 21 of them, accumulating £14.2 million in exposure to a single currency pair within 12 minutes. The currency reverses sharply at 14:52, resulting in an £890,000 loss. Post-incident analysis reveals that by the 15th approval, the reviewer's cumulative exposure exceeded the desk's soft limit, but the interface never displayed cumulative exposure — each transaction appeared isolated.
What went wrong: The interface presented each transaction as an isolated decision without contextual information about cumulative risk. The 90-second time constraint prevented deliberative review. The information architecture prioritised individual transaction details over portfolio-level risk awareness. The reviewer could not see the pattern forming because the interface did not show it. Consequence: £890,000 trading loss, breach of desk risk limits, regulatory scrutiny for inadequate risk management in automated trading, and remediation of the approval interface at a cost of £260,000.
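The missing portfolio-level context in Scenario C requires only a small amount of state on the interface side. A hedged sketch in Python; the class, field names, and soft-limit semantics are hypothetical, not drawn from any real trading system:

```python
from collections import defaultdict

class ExposureContext:
    """Tracks cumulative approved exposure per currency pair so each new
    approval request can be displayed alongside portfolio-level context
    instead of appearing as an isolated decision."""

    def __init__(self, soft_limit: float):
        self.soft_limit = soft_limit
        self.by_pair = defaultdict(float)

    def record_approval(self, pair: str, amount: float) -> None:
        self.by_pair[pair] += amount

    def context_for(self, pair: str, amount: float) -> dict:
        """Context to render next to the transaction before the reviewer acts."""
        projected = self.by_pair[pair] + amount
        return {
            "cumulative_if_approved": projected,
            "soft_limit": self.soft_limit,
            "breaches_soft_limit": projected > self.soft_limit,
        }
```

Rendering `breaches_soft_limit` prominently on the approval screen is what would have let Scenario C's reviewer see the pattern forming by the 15th request.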
Scope: This dimension applies to every deployment where a human reviewer interacts with an interface to evaluate, approve, override, or escalate AI agent decisions. The scope encompasses all design elements that affect the reviewer's ability to perform accurate oversight: information architecture (what information is shown, how it is organised, and what is hidden), visual design (typography, colour, spatial layout, emphasis hierarchies), temporal design (time available for review, queue management, alert frequency), interaction design (number of steps to complete a review action, friction for approval vs. override), and contextual design (whether cumulative state, historical patterns, and comparative baselines are available). The scope includes initial interface design, iterative redesign based on usage data, and ongoing validation that the interface supports the review accuracy rates required by the operational context. The scope extends to mobile, embedded, and voice-based oversight interfaces — not only desktop applications.
4.1. A conforming system MUST present the information necessary for review in a single, consolidated view that does not require the reviewer to navigate to secondary screens, open additional applications, or perform manual data retrieval to access the data fields essential for evaluating the agent's decision.
4.2. A conforming system MUST implement a visual and informational hierarchy that distinguishes high-risk review items from routine review items, using persistent visual differentiation (colour, iconography, spatial position, or typographic emphasis) that does not require the reviewer to read detailed content to identify severity.
4.3. A conforming system MUST display contextual information that enables pattern-level review — including cumulative exposure, historical decision trends, deviation from baselines, and related prior decisions — not only the attributes of the individual decision under review.
4.4. A conforming system MUST ensure that the time allocated for review is sufficient for a competent reviewer to read, comprehend, and evaluate the presented information, based on empirical measurement of review task completion times rather than operational convenience.
4.5. A conforming system MUST implement asymmetric interaction friction in favour of scrutiny: the action to approve or accept an agent recommendation MUST NOT require fewer interaction steps than the action to override, reject, or escalate.
4.6. A conforming system MUST display the currency and provenance of all data inputs used in the agent's decision, including timestamps indicating when each input was last refreshed, and flag inputs that are stale beyond defined thresholds.
4.7. A conforming system MUST conduct usability testing of the oversight interface with representative reviewers before production deployment and at defined intervals after deployment (recommended: annually), measuring task completion accuracy, time-on-task, error rates, and reviewer-reported cognitive load.
4.8. A conforming system SHOULD implement adaptive queue management that prioritises high-risk items, distributes review load to maintain sustainable review rates, and alerts governance authorities when queue volume exceeds the capacity for genuine review.
4.9. A conforming system SHOULD provide comparative baselines showing the agent's recommendation alongside a reference value (e.g., historical average, peer benchmark, policy threshold) to support the reviewer's independent judgement rather than anchoring solely on the agent's output.
4.10. A conforming system SHOULD implement attention-management features — including visual alerts for items that have been open for an unusually short time (suggesting cursory review), mandatory minimum dwell times for high-risk items, and periodic comprehension checks — to counter habituation effects.
4.11. A conforming system MAY implement reviewer-configurable interface layouts that allow individual reviewers to adjust information density, display order, and emphasis to match their cognitive preferences, subject to minimum information requirements defined by the governance function.
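Requirements 4.2 and 4.8 together amount to risk-stratified triage with an explicit capacity check. One possible sketch, assuming each alert carries a numeric `risk_score` field (an illustrative attribute, not defined by this standard):

```python
def build_review_queue(alerts, genuine_review_capacity):
    """Order alerts highest-risk first and report queue overflow.

    `genuine_review_capacity` is the number of alerts a reviewer can
    genuinely review in the shift, empirically measured per 4.4.
    A positive overflow should trigger a governance alert rather than
    a silent speed-up of individual reviews.
    """
    ranked = sorted(alerts, key=lambda a: a["risk_score"], reverse=True)
    overflow = max(0, len(ranked) - genuine_review_capacity)
    return ranked, overflow
```

The design choice here is deliberate: overflow is surfaced to the governance function (per 4.8) instead of being absorbed by the reviewer, which is precisely the failure mode of Scenario A.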
Human oversight of AI agent decisions is a cognitive task performed under operational constraints. The quality of that task — the accuracy with which the reviewer detects errors, identifies risks, and exercises appropriate judgement — is determined not only by the reviewer's competence and independence but by the design of the information environment in which the task is performed. This is a foundational principle of human factors engineering: human performance is a function of the human-system interface, not solely of the human.
Three bodies of evidence motivate this dimension. First, the automation monitoring literature demonstrates that humans are poor at sustained vigilance tasks — detecting rare events in a stream of routine events. Detection accuracy degrades as a function of time on task (the "vigilance decrement"), event base rate (rare events are missed more frequently), and information volume (signal detection degrades when the signal-to-noise ratio is low). A trade surveillance reviewer processing 342 alerts per day, of which fewer than 5% are genuine, is performing a vigilance task with a low base rate and high information volume — conditions that reliably produce miss rates exceeding 30% after two hours of continuous monitoring. Interface design can mitigate this degradation through visual salience of high-risk items, adaptive scheduling, and workload management. Interface design that ignores these factors guarantees degraded performance.
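The capacity mismatch underlying Scenario A follows directly from the figures in the text:

```python
# Queue-capacity arithmetic for Scenario A (figures from the scenario).
ALERTS_PER_DAY = 342
GENUINE_REVIEW_MINUTES = 6          # measured time for genuine review
SHIFT_MINUTES = 8 * 60              # one 8-hour shift

required_hours = ALERTS_PER_DAY * GENUINE_REVIEW_MINUTES / 60
seconds_per_alert = SHIFT_MINUTES * 60 / ALERTS_PER_DAY

# Genuine review needs ~34.2 hours; the shift allows ~84 seconds per
# alert, so rubber-stamping is the only way to clear the queue.
```

Any interface design that does not resolve this arithmetic (through triage, load distribution, or staffing) has chosen cursory review by default.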
Second, the decision support literature demonstrates that information architecture profoundly affects decision quality. The "hidden profile" effect — where critical information is available but not salient — produces systematically worse decisions than conditions where critical information is prominently displayed. Scenario B illustrates this directly: the renal function data existed in the system but was architecturally hidden behind navigation steps. Research consistently shows that information separated by even one click or screen transition is consulted at dramatically lower rates. For oversight interfaces, every navigation step between the reviewer and critical verification data is a failure probability multiplier.
Third, the anchoring literature demonstrates that presenting a recommendation prominently — as oversight interfaces do by design — anchors the reviewer's judgement to the recommendation. The reviewer does not start from an independent assessment; she starts from the agent's recommendation and must invest cognitive effort to overcome the anchor. Interface design can either reinforce this anchoring (by presenting only the recommendation and requiring effort to access alternatives) or mitigate it (by presenting comparative baselines, distributional context, and deviation indicators alongside the recommendation). Oversight interfaces that present only the agent's recommendation and ask "approve or reject?" maximise the anchoring effect and minimise the probability of genuine independent review.
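A de-anchoring display of the kind described above needs only a reference value and a tolerance. A minimal sketch with hypothetical names; in practice the baseline source and tolerance would be defined by the governance function (per 4.9):

```python
def deviation_context(recommendation, baseline, tolerance_pct):
    """Return display context pairing the agent's recommendation with a
    reference value, flagging material deviation so the reviewer starts
    from a comparison rather than from the agent's anchor."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    deviation_pct = 100.0 * (recommendation - baseline) / baseline
    return {
        "baseline": baseline,
        "deviation_pct": round(deviation_pct, 1),
        "material": abs(deviation_pct) > tolerance_pct,
    }
```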
The regulatory context reinforces these requirements. The EU AI Act Article 14(4)(a) requires that human oversight measures enable the natural person to "fully understand the capacities and limitations of the high-risk AI system." An oversight interface that buries critical information on secondary screens or presents 342 undifferentiated alerts per day does not enable "full understanding" — it enables cursory processing. DORA Article 9(4)(b) requires that ICT risk management ensures "information and communication technology systems are designed and used in a manner that enables timely identification and management of ICT-related risks." An interface that does not display cumulative exposure or deviation from risk limits does not enable timely identification. The FCA's principles on algorithmic trading specifically require that human oversight mechanisms are "effective" — a standard that interface design can either meet or defeat.
The economic case is equally compelling. Interface redesign is inexpensive relative to the cost of oversight failure. The £260,000 remediation cost in Scenario C is a fraction of the £890,000 trading loss it would have prevented. The ergonomic interface that surfaces the renal function staleness indicator prominently prevents the £47,000 adverse event in Scenario B. Organisations that underinvest in oversight ergonomics systematically overpay for oversight failures.
Oversight ergonomic design governance requires applying human factors engineering principles to the design, testing, and iterative improvement of review interfaces and workflows. The core principle is that the interface must be designed for the failure case — the case where the agent is wrong — not the common case where the agent is right.
Recommended patterns:
Anti-patterns to avoid:
Financial Services. Trade surveillance, credit decisioning, and transaction monitoring interfaces are among the most demanding oversight environments. They combine high volume, time pressure, and severe consequences for missed detections. Financial firms should benchmark their review interfaces against the FCA's expectations for effective algorithmic trading oversight and the PRA's expectations for model risk management. The volume of surveillance alerts is a known problem — the FCA has noted that firms generating too many false positives effectively prevent genuine review by overwhelming the reviewer.
Healthcare. Clinical decision support interfaces operate in an environment of high cognitive load (clinicians are simultaneously managing multiple patients), time pressure (clinical decisions often cannot wait), and extreme consequence for errors (patient harm). The interface design must account for the clinician's divided attention and provide conspicuous alerts for the failure cases — stale data, contraindicated combinations, out-of-range values — that the clinician must not miss.
Manufacturing and Safety-Critical Systems. Quality inspection interfaces for AI-controlled processes must present the inspection criteria, the agent's measurements, and the pass/fail determination in a format that supports rapid verification. For high-consequence components (aerospace, automotive, medical devices), the interface should enforce a minimum review protocol — specific measurements that must be visually confirmed before approval — to prevent approval-by-click in high-volume environments.
Public Sector. Benefit adjudication and regulatory enforcement interfaces must support reviewers who may have limited technical expertise. The interface should present the agent's reasoning in plain language per AG-049, highlight the factors that most influenced the recommendation, and make it easy for the reviewer to identify cases that merit further scrutiny. Accessibility standards (WCAG 2.1 AA minimum) are mandatory for public sector interfaces.
Basic Implementation — The oversight interface presents all critical information on a single consolidated screen. Visual hierarchy distinguishes high-risk from routine items. Data timestamps are displayed. Approval and override require the same number of interaction steps. Usability testing has been conducted with representative reviewers before deployment. Review time adequacy has been assessed and documented. This level meets the minimum mandatory requirements.
Intermediate Implementation — All basic capabilities plus: cumulative state display provides pattern-level context. Adaptive queue management prioritises high-risk items and monitors queue capacity. Comparative baselines are presented alongside agent recommendations. Attention management features detect cursory review patterns. Review accuracy is measured empirically and correlated with interface design elements. The interface is iteratively improved based on review accuracy data and reviewer feedback.
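The attention-management capability described above (detecting cursory review patterns) can be approximated from two signals most audit logs already capture: per-item dwell time and the approve/override outcome. A sketch with illustrative thresholds; real values would be calibrated empirically per 4.7:

```python
def cursory_review_flags(events, min_dwell=30, streak_limit=20):
    """Flag suspected cursory review from an audit trail.

    `events` is a list of (dwell_seconds, approved) tuples in review
    order. Returns indices of implausibly short reviews and a flag for
    long unbroken approval streaks, a rough habituation signal.
    """
    short = [i for i, (dwell, _) in enumerate(events) if dwell < min_dwell]
    streak = longest = 0
    for _, approved in events:
        streak = streak + 1 if approved else 0
        longest = max(longest, streak)
    return {"short_dwell_indices": short, "habituation": longest >= streak_limit}
```

Scenario A's afternoon pattern (57 alerts per hour, approval rate climbing to 97%) would trip both flags.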
Advanced Implementation — All intermediate capabilities plus: the interface is designed with professional human factors input. Reviewer-configurable layouts accommodate individual cognitive preferences within governance-defined minimums. Real-time monitoring tracks review accuracy by interface configuration, reviewer, and time-on-task. The organisation can demonstrate through controlled experiments that its interface design produces measurably higher review accuracy than alternative designs. Interface changes are A/B tested with review accuracy as the primary metric. Fatigue detection per AG-445 is integrated into the interface, triggering workload redistribution when reviewer performance degrades.
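A/B testing interface variants with review accuracy as the primary metric reduces, in the simplest case, to comparing two error-detection proportions. A standard two-proportion z-statistic is sketched below; a real programme would also pre-register sample sizes and correct for multiple comparisons:

```python
from math import sqrt

def two_proportion_z(detections_a, n_a, detections_b, n_b):
    """z-statistic comparing error-detection rates between two interface
    variants; a large |z| suggests a genuine accuracy difference."""
    p_a, p_b = detections_a / n_a, detections_b / n_b
    pooled = (detections_a + detections_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se
```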
Required artefacts:
Retention requirements:
Access requirements:
Test 8.1: Single-Screen Consolidated Information Verification
Test 8.2: Risk-Stratified Visual Triage Effectiveness
Test 8.3: Contextual Information Display Verification
Test 8.4: Asymmetric Friction Verification
Test 8.5: Data Currency Alerting Verification
Test 8.6: Review Time Adequacy Validation
Test 8.7: Usability Testing Completion and Recency
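A check of the kind Test 8.6 names (review time adequacy, per 4.4) can be automated against measured completion times. A sketch assuming the 90th percentile as the adequacy bar; the percentile is an illustrative choice, not one this standard prescribes:

```python
def allocated_time_is_adequate(observed_seconds, allocated_seconds,
                               percentile=0.9):
    """True if the allocated review time covers at least the chosen
    percentile of empirically measured completion times."""
    if not observed_seconds:
        raise ValueError("need at least one observed completion time")
    ranked = sorted(observed_seconds)
    idx = min(len(ranked) - 1, int(percentile * len(ranked)))
    return allocated_seconds >= ranked[idx]
```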
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 14 (Human Oversight) | Direct requirement |
| EU AI Act | Article 9 (Risk Management System) | Supports compliance |
| SOX | Section 404 (Internal Controls Over Financial Reporting) | Supports compliance |
| FCA SYSC | 6.1.1R (Systems and Controls), MAR 5A | Direct requirement |
| NIST AI RMF | GOVERN 1.4, MANAGE 4.2 | Supports compliance |
| ISO 42001 | Clause 7.2 (Competence), Annex A.8 | Supports compliance |
| DORA | Article 9(4)(b) (ICT Risk Management Framework) | Supports compliance |
Article 14 requires that high-risk AI systems are designed to be "effectively overseen by natural persons." Paragraph 4(a) requires that oversight measures enable the human to "fully understand the capacities and limitations of the high-risk AI system." The word "fully" imposes a design obligation: the interface must present information in a way that supports full understanding, not merely provide access to information somewhere in the system. An oversight interface that buries critical information on secondary screens, presents 342 undifferentiated alerts per day, or allows 3-minute reviews of decisions that require 8 minutes does not enable "effective" oversight — it enables nominal oversight that satisfies the letter of "human in the loop" while defeating its purpose. AG-440 operationalises Article 14 at the interface design level, ensuring that the human oversight architecture includes an information environment that makes oversight cognitively achievable.
The FCA expects that surveillance and control systems are effective, not merely present. MAR 5A (algorithmic trading requirements) specifically requires that firms have "effective systems and risk controls" for algorithmic trading, including human oversight mechanisms. The FCA has repeatedly noted in Dear CEO letters and thematic reviews that surveillance alert volumes that exceed human capacity to review are evidence of ineffective arrangements. AG-440 directly addresses this by requiring that review interfaces support sustainable review rates, that queue volumes are managed relative to review capacity, and that visual triage enables efficient allocation of reviewer attention to the highest-risk items.
For financial reporting processes that include AI agent decisions subject to human review, the review interface is a component of the internal control. If the interface design prevents effective review — by hiding critical information, creating asymmetric friction that favours approval, or imposing time constraints that preclude genuine scrutiny — the internal control is defective. SOX auditors will assess not only whether a review process exists but whether the review process is effective, and interface ergonomics are a material factor in that assessment.
DORA requires that ICT systems are "designed and used in a manner that enables timely identification and management of ICT-related risks." For human oversight of AI agents, "timely identification" requires that the interface presents risk-relevant information with sufficient prominence and context for the reviewer to identify risks within the available review time. An interface that requires navigation to secondary screens to access risk data, or that presents risks with the same visual weight as routine items, does not enable timely identification.
ISO 42001 Clause 7.2 requires that persons performing work affecting AI management system performance are competent. Competence is a function of both the person and the environment. A competent reviewer working with a poorly designed interface will produce lower-quality reviews than the same reviewer working with a well-designed interface. Annex A.8 addresses human oversight controls. AG-440 ensures that the human oversight controls specified in Annex A.8 are supported by an interface environment that enables the human to exercise those controls effectively.
GOVERN 1.4 addresses organisational structures for AI risk management, and MANAGE 4.2 addresses monitoring AI system performance. Both implicitly require that the interfaces through which humans monitor and manage AI systems are designed to support effective human performance. AG-440 makes this implicit requirement explicit, providing testable criteria for whether the monitoring interface supports the human performance required by the risk management framework.
| Field | Value |
|---|---|
| Severity Rating | High |
| Blast Radius | Organisation-wide — affects every decision reviewed through the poorly designed interface, with disproportionate impact on high-risk decisions that require the most careful review |
Consequence chain: A poorly designed oversight interface degrades review quality across all decisions processed through it. The degradation follows a predictable pattern. First, information architecture failures cause reviewers to miss critical context — stale data indicators are not seen, cumulative risk is not visible, pattern-level concerns are obscured by individual-decision presentation. Second, volume-induced vigilance decrement causes miss rates to increase over the course of a review session — the reviewer catches errors in the first hour but rubber-stamps by the fourth hour. Third, asymmetric friction biases the reviewer toward approval — overriding is harder than approving, so marginal cases are approved. The combined effect is a systematic erosion of review accuracy that is invisible to standard monitoring because it manifests as "efficient" processing — high throughput, low override rates, fast queue clearance. The erosion continues until a material error passes through the review layer and manifests as a downstream consequence: an undetected market manipulation (Scenario A: £4.8 million), a clinical adverse event (Scenario B: £47,000 plus patient harm), or an accumulated risk breach (Scenario C: £890,000). At that point, investigation reveals that the review interface was not designed to support effective oversight — and that the failure was structural and predictable, not a one-time human error. The regulatory consequence is severe because the failure reflects a systemic deficiency in the design of the oversight mechanism, not an isolated lapse. Regulators will find that the organisation invested in an AI agent but underinvested in the human interface that was supposed to govern it — a finding that undermines the credibility of the entire governance framework.
Cross-references: AG-019 (Human Escalation & Override Triggers) defines when human review is required; AG-440 ensures the review interface supports effective performance of that review. AG-049 (Explainability Governance) requires that agent reasoning is explainable; AG-440 ensures the explanation is presented in an ergonomically effective format. AG-439 (Reviewer Independence Governance) protects the reviewer's organisational independence; AG-440 protects the reviewer's cognitive independence by ensuring the information environment supports accurate review. AG-442 (Confidence Calibration Interface Governance) governs how confidence scores are displayed; AG-440 governs the broader interface context in which confidence is presented. AG-445 (Fatigue Monitoring Governance) detects cognitive degradation; AG-440 designs the interface to minimise the conditions that produce it. AG-414 (Alert Deduplication Governance) reduces alert volume; AG-440 designs the interface to manage whatever volume remains.