Fatigue Monitoring Governance requires that organisations operating AI agents under human oversight implement continuous mechanisms to detect, measure, and respond to reviewer fatigue or cognitive overload that degrades oversight quality. Fatigue-impaired reviewers approve decisions they would otherwise challenge, miss anomalies they would otherwise catch, and rubber-stamp escalations that warrant genuine deliberation — converting formal human-in-the-loop governance into compliance theatre that satisfies process requirements while stripping them of their protective value. This dimension mandates quantitative fatigue indicators, threshold-based alerts, and mandatory intervention protocols that preserve the substantive quality of human oversight throughout extended operational periods.
Scenario A — Overnight Shift Approval Degradation: A financial services firm operates a 24-hour AI-assisted trading desk where human reviewers approve algorithmically generated trade recommendations. Between 02:00 and 06:00, a single reviewer is responsible for approving trades from three agent systems. Audit analysis reveals that the overnight reviewer's average review time per trade drops from 47 seconds during the first two hours of the shift to 8 seconds during hours 10 through 12. During the 02:00–06:00 window, the reviewer approves 99.4% of presented trades compared with a 91.2% daytime approval rate. One trade approved at 04:17 with a 3-second review time results in a £2.3 million position in an illiquid instrument that violates the firm's concentration policy. The position is unwound at a loss of £410,000. Regulatory investigation finds that the reviewer was cognitively impaired by fatigue but no monitoring system detected or responded to the degradation.
What went wrong: The organisation required human oversight but did not monitor whether that oversight was substantive. The reviewer's approval rate and review time shifted dramatically during fatigue-impaired hours, but no system tracked these proxy indicators. The formal human-in-the-loop requirement was satisfied — a human did click "approve" — but the oversight was functionally absent. The £410,000 loss and regulatory finding resulted directly from unmonitored fatigue degradation.
Scenario B — Alert Volume Saturation in Safety-Critical Operations: A chemical plant deploys an AI-driven process control agent that generates safety alerts requiring human acknowledgement. During normal operations, the system generates 15–25 alerts per 8-hour shift. Following a software update to a sensor array, the alert rate increases to 340 alerts per shift due to recalibrated thresholds. The human operator, responsible for acknowledging each alert and determining whether physical intervention is required, initially reviews each alert carefully. After four hours of sustained high-volume alert processing, the operator begins batch-acknowledging alerts without reading the detail pane. At hour six, the system generates a genuine high-severity alert indicating a pressure anomaly in a reactor vessel. The operator acknowledges it in 1.2 seconds, without reading it. The anomaly escalates over the following 90 minutes, resulting in an emergency shutdown, £1.7 million in lost production, and a near-miss safety incident investigated by the Health and Safety Executive.
What went wrong: The operator experienced alert fatigue — a well-documented phenomenon where high alert volumes cause reviewers to treat all alerts as low-priority noise. No system monitored the operator's acknowledgement patterns to detect the shift from deliberate review to batch acknowledgement. The organisation had a human-in-the-loop requirement but no mechanism to verify that the human was substantively in the loop. The 340-alert-per-shift volume exceeded any reasonable human processing capacity, but no volume-based fatigue threshold existed to trigger intervention.
Scenario C — Cumulative Micro-Decision Fatigue in Benefits Processing: A public sector agency uses an AI agent to process disability benefit applications, with human reviewers making final eligibility determinations. Each reviewer processes approximately 120 cases per day. Analysis of 18 months of decisions reveals a statistically significant pattern: reviewers approve 74% of cases reviewed in the first two hours of the day and 89% of cases reviewed in the final two hours. The approval rate divergence is not explained by case complexity distribution, which is randomised. An applicant whose case is reviewed at 16:30 is 15 percentage points more likely to be approved than one whose identical case is reviewed at 09:30. Over 18 months, an estimated 2,400 determinations are affected by fatigue-driven decision drift, with approximately 640 applicants receiving incorrect outcomes (some approved who should have been denied, some denied who should have been approved). The agency faces a judicial review challenge arguing that the decision-making process is structurally biased by time-of-day effects attributable to cognitive fatigue.
What went wrong: The organisation processed high volumes of consequential decisions without monitoring for decision quality degradation over time. The 120-cases-per-day workload exceeded sustainable cognitive capacity for careful deliberation, but no threshold or monitoring existed. The time-of-day approval rate divergence was a classic fatigue indicator that would have been detectable within 60 days of operation but was not discovered for 18 months because no fatigue monitoring was in place. The judicial review challenge questions the fundamental fairness of a decision process where outcomes are predicted by review time rather than case merit.
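The Scenario C pattern is straightforward to detect once it is looked for. The sketch below is a hypothetical Python illustration using volumes consistent with the scenario's 74% and 89% approval rates; it applies a two-proportion z-test to early-shift versus late-shift decisions. The function name and figures are assumptions, not part of this dimension's requirements.

```python
import math

def two_proportion_z(approved_early: int, total_early: int,
                     approved_late: int, total_late: int) -> float:
    """Z-statistic for the difference between two approval rates."""
    pooled = (approved_early + approved_late) / (total_early + total_late)
    se = math.sqrt(pooled * (1 - pooled)
                   * (1 / total_early + 1 / total_late))
    return (approved_late / total_late - approved_early / total_early) / se

# 74% early-shift vs 89% late-shift approvals over 1,000 cases each
# (figures illustrative of Scenario C) give z of roughly 8.6, far
# beyond the 1.96 threshold for significance at the 5% level.
print(two_proportion_z(740, 1000, 890, 1000))
```

A check of this kind run weekly over rolling decision data would have surfaced the drift within weeks rather than 18 months.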
Scope: This dimension applies to any AI agent deployment where human reviewers, operators, or overseers are required to perform cognitive tasks — approvals, reviews, acknowledgements, escalation decisions, quality checks, or safety assessments — as part of the agent's governance or operational loop. The scope includes both synchronous oversight (real-time approval before agent action) and asynchronous oversight (post-hoc review of agent actions). The critical test is: does the governance model depend on a human performing a cognitive task with adequate attention and judgement? If yes, this dimension applies. The scope excludes purely automated oversight mechanisms (e.g., rule-based filters) that do not depend on human cognitive performance. The scope includes all human participants in the oversight chain regardless of organisational role — reviewers, approvers, operators, monitors, escalation handlers, and quality assurance personnel.
4.1. A conforming system MUST implement quantitative fatigue indicators that measure proxy signals for reviewer cognitive degradation, including at minimum: (a) average decision time per review over rolling windows, (b) approval/rejection rate deviation from established baselines, and (c) consecutive hours of active review without substantive break.
4.2. A conforming system MUST define fatigue thresholds for each quantitative indicator that, when breached, trigger mandatory intervention actions. Thresholds MUST be calibrated against empirical baselines for each reviewer role and MUST be documented with the rationale for each threshold value.
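As a minimal sketch of how requirements 4.1 and 4.2 might be realised, assuming Python; the threshold values below are placeholders that a conforming deployment would replace with empirically calibrated, documented figures.

```python
from collections import deque
from dataclasses import dataclass
from statistics import mean

@dataclass
class FatigueThresholds:
    """Per-role thresholds (4.2). Values are placeholders; a conforming
    deployment calibrates each against empirical baselines and documents
    the rationale for every figure."""
    min_mean_review_seconds: float = 15.0  # rolling mean below this suggests skimming
    max_rate_deviation: float = 0.05       # |current approval rate - baseline|
    max_consecutive_hours: float = 4.0     # active review without a substantive break

class IndicatorTracker:
    """Rolling-window proxy indicators for one reviewer (4.1)."""

    def __init__(self, baseline_approval_rate: float, window: int = 50) -> None:
        self.baseline = baseline_approval_rate
        self.review_seconds: deque[float] = deque(maxlen=window)
        self.approvals: deque[int] = deque(maxlen=window)

    def record(self, seconds: float, approved: bool) -> None:
        self.review_seconds.append(seconds)
        self.approvals.append(1 if approved else 0)

    def breaches(self, consecutive_hours: float,
                 t: FatigueThresholds) -> list[str]:
        """Return the names of all indicators currently past threshold."""
        out: list[str] = []
        if self.review_seconds and mean(self.review_seconds) < t.min_mean_review_seconds:
            out.append("decision_time")
        if self.approvals and abs(mean(self.approvals) - self.baseline) > t.max_rate_deviation:
            out.append("approval_rate_deviation")
        if consecutive_hours > t.max_consecutive_hours:
            out.append("consecutive_duration")
        return out
```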
4.3. A conforming system MUST implement mandatory intervention actions when fatigue thresholds are breached, including at minimum: (a) alerting the fatigued reviewer's supervisor, (b) suspending the reviewer's approval authority until the fatigue condition is resolved, and (c) queuing pending decisions for a non-fatigued reviewer or deferring them until the reviewer has recovered.
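A sketch of the intervention dispatch under requirement 4.3, with stub classes standing in for the alerting, authority, and workflow systems an organisation would actually integrate; every name here is an assumption.

```python
class SupervisorNotifier:
    """Stub; a real deployment would page or email the supervisor."""
    def __init__(self) -> None:
        self.alerts: list[tuple[str, list[str]]] = []
    def alert(self, reviewer_id: str, breached: list[str]) -> None:
        self.alerts.append((reviewer_id, breached))

class AuthorityRegistry:
    """Stub for the system of record on approval authority."""
    def __init__(self) -> None:
        self.suspended: set[str] = set()
    def suspend(self, reviewer_id: str) -> None:
        self.suspended.add(reviewer_id)

class DecisionQueue:
    """Stub workflow queue; reassigns or defers pending decisions."""
    def __init__(self) -> None:
        self.requeued: list[str] = []
    def reassign_or_defer(self, reviewer_id: str) -> None:
        self.requeued.append(reviewer_id)

def execute_interventions(reviewer_id: str, breached: list[str],
                          notifier: SupervisorNotifier,
                          registry: AuthorityRegistry,
                          queue: DecisionQueue) -> None:
    """4.3 mandatory actions on a threshold breach: (a) alert the
    supervisor, (b) suspend approval authority, (c) route pending
    decisions to a non-fatigued reviewer or defer them."""
    notifier.alert(reviewer_id, breached)
    registry.suspend(reviewer_id)
    queue.reassign_or_defer(reviewer_id)
```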
4.4. A conforming system MUST monitor alert and decision volume per reviewer per shift and trigger volume-based interventions when the volume exceeds the defined sustainable processing capacity for the reviewer role. Sustainable capacity thresholds MUST be established through empirical measurement or evidence-based standards, not arbitrary assignment.
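Sustainable capacity can be derived from measured review behaviour rather than assigned arbitrarily. A minimal sketch, in which the 0.75 attention duty cycle is an illustrative allowance for breaks and task switching, not an evidence-based constant:

```python
def sustainable_capacity(shift_hours: float,
                         seconds_per_careful_review: float,
                         attention_duty_cycle: float = 0.75) -> int:
    """4.4 sketch: derive per-shift capacity from measured review
    behaviour. Both inputs should come from empirical observation of
    rested reviewers performing genuinely careful reviews."""
    productive_seconds = shift_hours * 3600 * attention_duty_cycle
    return int(productive_seconds // seconds_per_careful_review)

# Illustrative: safety alerts measured at 180 s of careful review each
# yield a capacity of 120 per 8-hour shift, far below the 340 alerts
# per shift that saturated the operator in Scenario B.
print(sustainable_capacity(8, 180))  # -> 120
```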
4.5. A conforming system MUST retroactively flag decisions made during periods when fatigue indicators exceeded thresholds, enabling targeted re-review of potentially compromised decisions.
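Retroactive flagging reduces to an interval query over the decision log. A sketch, assuming simple (id, timestamp) and (start, end) shapes that real systems would replace with their own records:

```python
def flag_fatigue_period_decisions(decisions, breach_windows):
    """4.5 sketch: flag decisions timestamped inside any window during
    which a fatigue indicator exceeded its threshold. `decisions` is a
    list of (decision_id, timestamp) pairs and `breach_windows` a list
    of (start, end) pairs; both shapes are illustrative assumptions."""
    return [decision_id
            for decision_id, ts in decisions
            if any(start <= ts <= end for start, end in breach_windows)]
    # The returned IDs feed the targeted re-review queue.
```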
4.6. A conforming system MUST produce fatigue monitoring reports at least monthly, disaggregated by reviewer, shift pattern, and decision type, showing threshold breaches, intervention actions taken, and decisions flagged for re-review.
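A sketch of the monthly disaggregation, assuming pandas is available and a per-decision event table whose column names are illustrative:

```python
import pandas as pd

def monthly_fatigue_report(events: pd.DataFrame) -> pd.DataFrame:
    """4.6 sketch: disaggregate by reviewer, shift pattern, and decision
    type. `events` is assumed to hold one row per decision with boolean
    `threshold_breached`, `intervened`, and `flagged` columns."""
    return (events
            .groupby(["reviewer", "shift_pattern", "decision_type"])
            .agg(decisions=("decision_id", "count"),
                 threshold_breaches=("threshold_breached", "sum"),
                 interventions=("intervened", "sum"),
                 flagged_for_rereview=("flagged", "sum"))
            .reset_index())
```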
4.7. A conforming system SHOULD implement graduated fatigue response levels — advisory (notify reviewer of degradation indicators), warning (notify supervisor, increase sampling of reviewer decisions), and critical (suspend reviewer authority, redirect decisions).
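One way to encode the graduated levels, assuming an escalation rule based on how many indicators are simultaneously breached; the one/two/three-indicator mapping is an illustrative assumption that real deployments would calibrate per role.

```python
from enum import Enum

class FatigueLevel(Enum):
    ADVISORY = 1   # notify the reviewer of degradation indicators
    WARNING = 2    # notify the supervisor, increase decision sampling
    CRITICAL = 3   # suspend approval authority, redirect decisions

def response_level(breached_indicators: list[str]) -> FatigueLevel | None:
    """Map simultaneously breached indicators to a graduated response;
    returns None when no indicator is past threshold."""
    if not breached_indicators:
        return None
    if len(breached_indicators) >= 3:
        return FatigueLevel.CRITICAL
    if len(breached_indicators) == 2:
        return FatigueLevel.WARNING
    return FatigueLevel.ADVISORY
```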
4.8. A conforming system SHOULD integrate fatigue monitoring data with shift scheduling systems to enable proactive schedule adjustments that prevent foreseeable fatigue conditions (e.g., reducing assignment volume in known high-fatigue periods, ensuring adequate break scheduling).
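A sketch of a proactive schedule adjustment, with hypothetical high-fatigue hours and a reduction factor that an organisation would replace with its own fatigue monitoring data:

```python
def adjusted_assignment_volume(base_volume: int, hour_of_shift: int,
                               high_fatigue_hours: frozenset[int] = frozenset({10, 11, 12}),
                               reduction: float = 0.5) -> int:
    """4.8 sketch: reduce assignment volume during known high-fatigue
    periods. The hours and 0.5 reduction factor are illustrative
    assumptions; Scenario A's hours 10-12 degradation motivates them."""
    if hour_of_shift in high_fatigue_hours:
        return int(base_volume * reduction)
    return base_volume
```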
4.9. A conforming system SHOULD implement challenge injection — periodically inserting known-answer test cases into the reviewer's decision queue to provide direct measurement of oversight accuracy under current conditions.
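A sketch of challenge injection and the accuracy measure it enables; the 2% injection rate and the data shapes are assumptions.

```python
import random

def inject_challenges(queue: list, challenge_pool: list,
                      rate: float = 0.02, rng=random) -> list:
    """4.9 sketch: insert known-answer test cases into the live decision
    queue at a small rate, at random positions."""
    n = min(max(1, int(rate * len(queue))), len(challenge_pool))
    out = list(queue)
    for challenge in rng.sample(challenge_pool, k=n):
        out.insert(rng.randrange(len(out) + 1), challenge)
    return out

def oversight_accuracy(challenge_results) -> float:
    """Fraction of injected known-answer cases decided correctly: a
    direct measure of oversight accuracy under current conditions.
    `challenge_results` pairs the reviewer's decision with the known
    answer for each injected case."""
    correct = sum(1 for decided, known in challenge_results if decided == known)
    return correct / len(challenge_results) if challenge_results else float("nan")
```

Accuracy on injected cases complements the proxy indicators of 4.1: it measures what the reviewer actually catches rather than inferring engagement from timing and rate patterns.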
4.10. A conforming system MAY implement physiological fatigue indicators where the operational context permits (e.g., interaction latency patterns, mouse movement characteristics, session engagement metrics), provided that such monitoring complies with applicable privacy and employment regulations and that reviewers are informed of the monitoring.
Human oversight of AI agent operations is one of the most widely mandated governance mechanisms in current and emerging regulation. The EU AI Act requires human oversight for high-risk AI systems. Financial regulators require human approval for consequential automated decisions. Safety-critical domains require human operators who can intervene when automated systems behave unexpectedly. But the value of human oversight depends entirely on the cognitive quality of the human performing it. A fatigued, overloaded, or attention-depleted reviewer provides the illusion of oversight without its substance — the process requirement is satisfied but the protective function is absent.
Fatigue is not an edge case or an exceptional failure mode. It is a predictable, measurable, and well-studied phenomenon that affects every human performing sustained cognitive work. Decades of research in aviation, medicine, nuclear operations, and transportation have established that cognitive performance degrades significantly after extended periods of sustained attention. The specific degradation patterns are well-characterised: decision quality declines, response times lengthen and then paradoxically shorten (as reviewers begin to skip deliberation), approval rates drift toward defaults, and anomaly detection sensitivity decreases. These patterns are not character flaws — they are neurological constraints of human cognitive architecture.
The governance risk is acute because AI agent deployments often require sustained oversight at volumes and durations that exceed historical precedent. A human reviewer who previously processed 30 manual applications per day may be asked to oversee 300 AI-generated determinations per day. The cognitive demand has increased tenfold, but the governance model still assumes the same quality of human judgement. The mismatch between oversight volume and human cognitive capacity is a structural risk that must be managed through monitoring and intervention, not merely through hiring more reviewers or exhorting existing reviewers to maintain attention.
Alert fatigue is a particularly dangerous variant. When the volume of items requiring human attention exceeds the human's sustainable processing capacity, the human adapts by reducing the attention allocated to each item. This is not laziness — it is a rational cognitive strategy for an impossible workload. The result is that all items receive less attention, including the genuinely critical items that the oversight process was designed to catch. In safety-critical domains, alert fatigue has been identified as a contributing factor in major incidents across healthcare, aviation, and industrial process control.
The regulatory environment increasingly recognises that nominal human oversight is insufficient. The EU AI Act's requirement for "effective" human oversight implies that the oversight must be substantive, not merely procedural. An organisation that requires human approval but does not monitor whether that approval reflects genuine cognitive engagement is at risk of a finding that its oversight was not effective. This dimension provides the detection mechanisms that transform nominal oversight into verified oversight.
Fatigue Monitoring Governance requires a detection infrastructure that continuously assesses reviewer cognitive state through behavioural proxy indicators and triggers interventions before fatigue degrades oversight quality below acceptable thresholds. The core principle is that human oversight quality is not static — it varies with time, workload, and individual capacity — and must be monitored as a dynamic variable, not assumed as a constant.
Recommended patterns:
- Calibrate per-reviewer and per-role baselines before enforcing thresholds, and document the rationale for every threshold value.
- Implement graduated responses (advisory, warning, critical) so that early indicators trigger lightweight interventions before suspension becomes necessary.
- Match assignment volume to empirically measured sustainable capacity, and integrate fatigue data with shift scheduling to prevent foreseeable overload.
- Use challenge injection to measure oversight accuracy directly rather than inferring it solely from proxy indicators.
- Flag and re-review decisions made during threshold-breach periods as a standard process, not an exceptional one.
Anti-patterns to avoid:
- Treating the presence of a human approver as evidence of effective oversight without verifying that approvals reflect genuine cognitive engagement.
- Setting fatigue and capacity thresholds by arbitrary assignment rather than empirical measurement.
- Responding to oversight degradation by exhorting reviewers to maintain attention instead of executing mandatory interventions.
- Allowing alert or decision volumes to exceed sustainable processing capacity on the assumption that reviewers will absorb the load.
- Staffing overnight and weekend shifts more lightly while applying the same oversight expectations, without compensating monitoring.
Financial Services. Financial regulators expect that human oversight of automated trading and advisory decisions is substantive, not nominal. The FCA's Senior Managers and Certification Regime creates personal accountability for individuals overseeing automated systems. A senior manager who oversees a trading desk where fatigue-impaired reviewers approve concentration policy violations faces personal regulatory liability. Firms should implement fatigue monitoring with particular attention to overnight and weekend shifts where staffing is typically lighter and oversight fatigue risk is highest.
Healthcare. Clinical decision support systems increasingly require human clinician approval or review. Clinician fatigue is a longstanding patient safety concern with an extensive evidence base. Healthcare deployments should integrate AI oversight fatigue monitoring with existing clinician fatigue management frameworks, including maximum shift duration limits, mandatory rest periods, and cognitive workload assessment tools.
Safety-Critical and Industrial. Process control environments have decades of human factors research on operator fatigue. Standards such as ANSI/ISA-18.2 for alarm management already address alert rationalisation and operator cognitive load. AI agent deployments in these environments should align fatigue monitoring with existing alarm management standards and human factors engineering practices.
Public Sector. Benefits determination, immigration processing, and other high-volume public sector decision-making contexts involve consequential decisions affecting individuals' rights. Decision quality degradation due to fatigue creates fairness risks — applicants reviewed during high-fatigue periods receive systematically different outcomes than those reviewed during low-fatigue periods. Public sector deployments should monitor for time-of-day and end-of-shift decision drift as a fairness indicator.
Basic Implementation — The organisation monitors at least three quantitative fatigue indicators per reviewer: decision time, approval rate deviation, and consecutive review duration. Fatigue thresholds are defined for each indicator. Threshold breaches trigger supervisor notification and decision flagging. Monthly fatigue reports are produced. Sustainable volume thresholds are defined for each reviewer role. This level meets the minimum mandatory requirements.
Intermediate Implementation — All basic capabilities plus: graduated response protocols are implemented with advisory, warning, and critical levels. Baselines are calibrated per individual reviewer. Fatigue monitoring data is integrated with shift scheduling. Challenge injection provides direct accuracy measurement. Retroactive re-review of fatigue-period decisions is implemented as a standard process. Volume-capacity matching proactively prevents foreseeable overload conditions.
Advanced Implementation — All intermediate capabilities plus: real-time fatigue dashboards provide organisational visibility across all reviewer populations. Predictive models identify fatigue risk before threshold breaches occur, enabling pre-emptive intervention. Fatigue monitoring data feeds into continuous improvement of workload design, shift patterns, and staffing levels. Independent validation confirms that fatigue monitoring effectively prevents oversight quality degradation. Cross-shift and cross-team fatigue pattern analysis identifies systemic workload design issues.
Required artefacts:
- Documented fatigue thresholds with the rationale for each value (4.2).
- Per-role sustainable capacity thresholds and their empirical basis (4.4).
- Records of threshold breaches and the intervention actions taken (4.3).
- Registers of decisions retroactively flagged for re-review (4.5).
- Monthly fatigue monitoring reports disaggregated by reviewer, shift pattern, and decision type (4.6).
Retention requirements:
Access requirements:
Test 8.1: Fatigue Indicator Detection Accuracy
Test 8.2: Mandatory Intervention Execution on Threshold Breach
Test 8.3: Volume-Based Fatigue Threshold Enforcement
Test 8.4: Retroactive Decision Flagging
Test 8.5: Monthly Fatigue Report Generation and Completeness
Test 8.6: Challenge Injection Accuracy Measurement
Test 8.7: Consecutive Duration Threshold Enforcement
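As a hedged illustration of how these tests might be automated, Test 8.2 could be expressed against the stub interfaces sketched under requirement 4.3 above; all names remain assumptions.

```python
def test_intervention_execution_on_threshold_breach():
    """Test 8.2 sketch: a single threshold breach must trigger all
    three mandatory actions of 4.3 (alert, suspend, requeue). Builds
    on the stub classes from the intervention sketch."""
    notifier, registry, queue = (SupervisorNotifier(),
                                 AuthorityRegistry(),
                                 DecisionQueue())
    execute_interventions("reviewer-7", ["decision_time"],
                          notifier, registry, queue)
    assert notifier.alerts == [("reviewer-7", ["decision_time"])]  # (a)
    assert "reviewer-7" in registry.suspended                      # (b)
    assert queue.requeued == ["reviewer-7"]                        # (c)
```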
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 14 (Human Oversight) | Direct requirement |
| EU AI Act | Article 9 (Risk Management System) | Supports compliance |
| SOX | Section 404 (Internal Controls Over Financial Reporting) | Supports compliance |
| FCA SYSC | 6.1.1R (Systems and Controls) | Direct requirement |
| NIST AI RMF | GOVERN 1.4, MAP 3.5 | Supports compliance |
| ISO 42001 | Clause 6.1 (Actions to Address Risks) | Supports compliance |
| DORA | Article 9 (ICT Risk Management Framework) | Supports compliance |
Article 14 requires that high-risk AI systems are designed and developed so that they can be effectively overseen by natural persons during their period of use. The word "effectively" is critical — it implies that oversight must be substantive, not merely procedural. An oversight process where fatigue-impaired reviewers routinely rubber-stamp agent decisions is not effective oversight under any reasonable interpretation of Article 14. Fatigue monitoring is the mechanism by which organisations verify that human oversight remains effective throughout extended operational periods. Without fatigue monitoring, an organisation cannot demonstrate that its oversight satisfies the effectiveness requirement, because it has no data on whether reviewer cognitive quality is maintained.
The FCA requires firms to maintain adequate systems and controls for the management of their affairs. Where a firm's controls rely on human oversight of automated decision-making systems, the firm must ensure those controls remain effective. A human reviewer who is cognitively impaired by fatigue is not an effective control, regardless of whether the reviewer is formally present and pressing "approve." The FCA's Senior Managers and Certification Regime further requires that senior managers take reasonable steps to ensure the effectiveness of the controls in their area of responsibility. A senior manager who knows that their team conducts overnight oversight shifts without fatigue monitoring has failed to take reasonable steps to ensure control effectiveness.
For organisations where AI agents participate in financial reporting processes (e.g., automated journal entries, transaction classification, or financial data aggregation), human reviewers form part of the internal control framework. SOX requires that internal controls are effective — not merely present. A reviewer who approves financial transactions while impaired by fatigue is a control failure that could constitute a material weakness if the aggregate value of inadequately reviewed transactions is significant. Fatigue monitoring provides evidence that human controls within the financial reporting chain maintained their effectiveness.
GOVERN 1.4 addresses ongoing monitoring of AI systems, which includes monitoring the effectiveness of human oversight mechanisms. MAP 3.5 addresses the ability of human operators to exercise effective oversight, including consideration of cognitive load and operational fatigue. Fatigue monitoring directly supports both provisions by providing empirical data on whether human oversight remains effective under operational conditions.
DORA requires financial entities to implement an ICT risk management framework that includes mechanisms for detecting anomalous activities. Human oversight degradation due to fatigue is an anomalous condition in the oversight process that creates operational risk. Fatigue monitoring is a detection mechanism for this class of operational risk, supporting the organisation's ICT risk management framework.
ISO 42001 requires organisations to determine actions to address risks and opportunities related to AI system management. Reviewer fatigue is a well-documented risk to AI oversight quality. Fatigue monitoring represents the organisation's action to address this risk through detection and intervention, supporting conformance with the risk treatment requirements of Clause 6.1.
| Field | Value |
|---|---|
| Severity Rating | High |
| Blast Radius | All decisions processed during fatigue-impaired oversight periods — potentially spanning entire shifts, affecting hundreds of decisions per incident, with disproportionate impact on complex or high-value decisions that require the most attentive oversight |
Consequence chain: Reviewer fatigue goes undetected, causing a progressive decline in oversight quality across the affected shift or session. The immediate failure is that decisions that should receive careful deliberation — anomalous transactions, edge-case applications, safety-relevant alerts — are approved or acknowledged with the same cursory attention as routine items. The operational impact compounds silently: each rubber-stamped decision is individually minor but collectively they represent a period of uncontrolled agent operation. The business consequences materialise when one of the fatigue-impaired decisions involves a consequential error — a policy-violating trade (Scenario A: £410,000 loss), a missed safety alert (Scenario B: £1.7 million production loss plus safety investigation), or a pattern of unfair determinations (Scenario C: 2,400 affected decisions, judicial review). The regulatory consequence is severe because the failure directly undermines the most widely mandated governance mechanism — human oversight. A regulator finding that an organisation required human oversight but did not monitor whether that oversight was effective will treat this as a systemic control failure, not an isolated incident. The reputational consequence extends beyond the immediate incident because the failure reveals that the organisation's governance model was structurally vulnerable — it depended on human cognition but made no effort to verify that the cognition was adequate.
Cross-references: AG-440 (Oversight Ergonomic Design Governance), AG-022 (Behavioural Drift Detection), AG-439 (Reviewer Independence Governance), AG-441 (Shift Handover Quality Governance), AG-446 (Training Recertification Cadence Governance), AG-448 (Escalation Timeliness Governance), AG-426 (Fallback Staffing Governance), AG-383 (Runtime Scheduler Fairness Governance).