AG-104

Trust Calibration Governance

Human Factors & Sociotechnical Control · AGS v2.1 · April 2026

2. Summary

Trust Calibration Governance requires that organisations implement explicit controls to ensure human operators maintain appropriately calibrated trust in AI agent outputs and decisions — neither over-trusting (automation complacency) nor under-trusting (automation disuse). The dimension mandates measurable mechanisms that align operator confidence with actual agent reliability, including dynamic trust indicators, performance transparency dashboards, and structured recalibration interventions when trust-reliability divergence is detected. Without trust calibration controls, human oversight becomes either a rubber-stamp exercise or an obstruction that defeats the purpose of agent deployment.

3. Example

Scenario A — Automation Complacency in Financial Trade Review: A financial services firm deploys an AI agent to generate trade recommendations with a human reviewer approving each trade. During the first three months, the agent's recommendations are correct 98.7% of the time. The human reviewer, observing consistent accuracy, develops a pattern of approving recommendations within 2 seconds of display — insufficient time to meaningfully evaluate the trade rationale. In month four, the agent begins generating subtly flawed recommendations due to a data pipeline change that introduces stale pricing data. The human reviewer continues approving at the same 2-second pace, rubber-stamping 47 trades over three days that collectively result in £2.3 million in losses. Post-incident analysis reveals the reviewer's approval time had declined from an initial average of 45 seconds to 1.8 seconds, but no system tracked or flagged this decline.

What went wrong: No trust calibration mechanism existed. The system did not track operator engagement metrics (review time, query rate, override frequency) as proxies for trust level. No recalibration intervention was triggered when the reviewer's behaviour indicated automation complacency. The human oversight control existed on paper but had become functionally absent. Consequence: £2.3 million in trading losses, FCA investigation into adequacy of human oversight controls, personal liability risk for the Senior Manager responsible under SM&CR.

Scenario B — Automation Disuse in Clinical Decision Support: A hospital deploys an AI agent to assist radiologists with preliminary scan analysis. After a widely publicised incident at another hospital where an AI system missed a tumour, the radiology department's trust in the tool collapses. Radiologists begin ignoring the agent's outputs entirely, performing full independent reads on every scan. The agent correctly identifies 12 critical findings over a two-week period that the radiologists, now operating under time pressure from performing double work, miss on their independent reads. Three patients experience delayed diagnoses.

What went wrong: No mechanism existed to communicate the agent's actual per-category accuracy to the radiologists. The trust failure was driven by an anecdotal external event, not by observed local performance data. No structured recalibration process re-established appropriate trust by presenting the agent's validated accuracy for the specific scan types in use. Consequence: Three delayed diagnoses, potential malpractice claims, regulatory scrutiny from CQC, and the effective waste of the AI investment.

Scenario C — Trust Asymmetry Across Operator Shifts: A logistics company deploys an AI agent for route optimisation. Day-shift operators, who were involved in the agent's training and validation, trust it appropriately and override only when they have specific local knowledge. Night-shift operators, who received a 30-minute briefing, either follow the agent's recommendations blindly or override them based on gut instinct. The day shift achieves a 14% efficiency improvement; the night shift shows a 3% degradation. Management cannot explain the discrepancy because no per-operator trust calibration metrics exist.

What went wrong: Trust calibration was not systematically managed across the operator population. Training was inconsistent. No per-operator engagement metrics were tracked. No mechanism identified the divergent trust profiles between shifts. Consequence: Inconsistent operational performance, inability to demonstrate uniform human oversight quality, and an audit finding for inadequate operator training.

4. Requirement Statement

Scope: This dimension applies to all AI agent deployments where human operators are expected to review, approve, override, or otherwise exercise judgement over agent outputs or actions. It applies regardless of whether the human role is formally designated as an approver, reviewer, monitor, or supervisor. The scope includes direct human-agent interaction (a human reviewing an agent's recommendation) and indirect interaction (a human monitoring a dashboard of agent activity). It excludes fully autonomous operations where no human oversight is expected or required — though organisations should note that removing human oversight without trust calibration evidence may itself be a governance deficiency under AG-019. The scope extends to all operator roles across all shifts, locations, and experience levels; trust calibration is not a one-time training event but an ongoing operational control.

4.1. A conforming system MUST track at least three operator engagement metrics as proxies for trust calibration: mean review time per decision, override rate per operator, and query or clarification request rate.

4.2. A conforming system MUST define threshold bands for each trust proxy metric that distinguish appropriately calibrated trust from over-trust (complacency) and under-trust (disuse), with thresholds validated against the agent's measured reliability for the specific task category.

4.3. A conforming system MUST trigger a recalibration intervention when any operator's trust proxy metrics fall outside the defined threshold bands for more than 48 consecutive hours or 50 consecutive decisions, whichever comes first.
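The protocol does not prescribe how 4.1 through 4.3 are implemented. As one hedged illustration, the Python sketch below tracks the three mandated proxies per decision and fires a recalibration trigger on the 48-hour / 50-decision rule; the TrustMonitor and Band names, the ten-decision rolling sample, and the specific band tests are assumptions made for the sketch, not part of the requirement.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import List, Optional

class Band(Enum):
    CALIBRATED = "calibrated"
    OVER_TRUST = "over_trust"    # complacency: reviews too fast, no queries
    UNDER_TRUST = "under_trust"  # disuse: override rate above its band

@dataclass
class Decision:
    timestamp: datetime
    review_seconds: float  # proxy 1: review time per decision (4.1)
    overridden: bool       # proxy 2: feeds the override rate (4.1)
    queried: bool          # proxy 3: feeds the query/clarification rate (4.1)

@dataclass
class TrustMonitor:
    """Per-operator monitor: threshold bands per 4.2, breach trigger per 4.3."""
    min_review_seconds: float  # reviews faster than this suggest over-trust
    max_override_rate: float   # overrides above this suggest under-trust
    min_query_rate: float      # near-zero queries reinforce a complacency signal
    out_of_band: List[Decision] = field(default_factory=list)
    breach_started: Optional[datetime] = None

    def _classify(self, sample: List[Decision]) -> Band:
        mean_review = sum(d.review_seconds for d in sample) / len(sample)
        override_rate = sum(d.overridden for d in sample) / len(sample)
        query_rate = sum(d.queried for d in sample) / len(sample)
        if mean_review < self.min_review_seconds and query_rate < self.min_query_rate:
            return Band.OVER_TRUST
        if override_rate > self.max_override_rate:
            return Band.UNDER_TRUST
        return Band.CALIBRATED

    def record(self, decision: Decision) -> Optional[Band]:
        """Returns the breached band when a recalibration intervention is due."""
        self.out_of_band.append(decision)
        band = self._classify(self.out_of_band[-10:])  # rolling sample (assumption)
        if band is Band.CALIBRATED:
            self.out_of_band.clear()
            self.breach_started = None
            return None
        if self.breach_started is None:
            self.breach_started = decision.timestamp
        if (decision.timestamp - self.breach_started >= timedelta(hours=48)
                or len(self.out_of_band) >= 50):  # whichever comes first (4.3)
            self.out_of_band.clear()
            self.breach_started = None
            return band  # caller logs the breach (4.5), delivers the 4.7 intervention
        return None
```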

4.4. A conforming system MUST present operators with ongoing, contextual trust indicators that communicate the agent's current reliability for the specific decision type — not a single global accuracy figure, but per-category performance data updated at least weekly.
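A minimal way to produce the per-category data 4.4 calls for is to aggregate validated outcomes over a short lookback window on a weekly refresh. The function below is a sketch; the validated_outcomes feed and the four-week window are assumptions, not mandated.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, Iterable, Tuple

def weekly_reliability(validated_outcomes: Iterable[Tuple[str, date, bool]],
                       as_of: date, lookback_weeks: int = 4) -> Dict[str, float]:
    """Per-category accuracy for display beside each decision (4.4), so an
    operator sees e.g. 'chest CT: 0.962' rather than one global figure."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    cutoff = as_of - timedelta(weeks=lookback_weeks)
    for category, when, was_correct in validated_outcomes:
        if when >= cutoff:
            total[category] += 1
            correct[category] += int(was_correct)
    return {category: correct[category] / total[category] for category in total}
```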

4.5. A conforming system MUST log all trust calibration metrics, threshold breaches, and recalibration interventions with timestamps and operator identifiers.

4.6. A conforming system SHOULD implement dynamic trust thresholds that adjust as the agent's measured reliability changes — tightening acceptable review times when agent reliability decreases, and relaxing them when reliability is validated at higher levels.
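One reading of 4.6 is that the over-trust threshold (the minimum acceptable review time) scales with the agent's error rate relative to its validated reference reliability. The linear rule and clamp below are assumptions for illustration, not part of the requirement.

```python
def dynamic_min_review_seconds(baseline_seconds: float, measured_reliability: float,
                               reference_reliability: float = 0.99) -> float:
    """Dynamic threshold per 4.6: when measured reliability drops below the
    validated reference, the minimum acceptable review time tightens (rises);
    when reliability is validated higher, it relaxes. Illustrative scaling."""
    error_ratio = (1.0 - measured_reliability) / (1.0 - reference_reliability)
    return baseline_seconds * min(max(error_ratio, 0.5), 4.0)  # clamp the adjustment
```

Under this sketch, with a 30-second baseline, reliability falling from 99% to 97% triples the minimum review time to 90 seconds, while reliability validated at 99.5% relaxes it to 15 seconds.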

4.7. A conforming system SHOULD deliver recalibration interventions through structured methods: presenting recent agent errors to complacent operators, and presenting validated accuracy data to distrustful operators.
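Reusing the hypothetical Band enum from the 4.3 sketch, the two intervention types in 4.7 reduce to a small dispatch; the payload shape and the five-example cap are assumptions.

```python
from typing import Dict, List

def deliver_intervention(band: Band, operator_id: str, recent_errors: List[dict],
                         reliability_by_category: Dict[str, float]) -> dict:
    """Structured recalibration per 4.7: complacent operators see recent,
    concrete agent errors; distrustful operators see validated accuracy."""
    if band is Band.OVER_TRUST:
        kind, content = "recent_error_review", recent_errors[:5]
    else:
        kind, content = "validated_accuracy_briefing", reliability_by_category
    return {"operator": operator_id, "intervention": kind, "content": content}
```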

4.8. A conforming system SHOULD track trust calibration metrics per operator, per task category, and per shift to detect population-level calibration asymmetries.
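As one population-level check of the kind 4.8 requires, comparing each shift's mean review time against the whole population would have surfaced the Scenario C divergence; a minimal sketch:

```python
from statistics import mean
from typing import Dict, List

def shift_review_time_ratios(review_times_by_shift: Dict[str, List[float]]) -> Dict[str, float]:
    """Mean review time per shift relative to the population mean (4.8).
    A shift sitting well below 1.0 is a complacency asymmetry signal;
    well above 1.0 may indicate disuse-driven double work."""
    population_mean = mean(t for times in review_times_by_shift.values() for t in times)
    return {shift: mean(times) / population_mean
            for shift, times in review_times_by_shift.items()}
```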

4.9. A conforming system MAY implement challenge tasks — occasional synthetic decisions where the agent's recommendation is deliberately incorrect — to verify that operators are genuinely evaluating outputs rather than rubber-stamping.
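Challenge tasks (4.9) can be implemented by occasionally substituting a known-flawed recommendation and tracking whether the operator rejects it. The injector below is a sketch using the 1-per-100 rate from the Advanced maturity tier; all names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Optional, Tuple

class ChallengeInjector:
    """Verifies operators genuinely evaluate outputs rather than rubber-stamp
    (4.9) by occasionally showing a deliberately incorrect recommendation."""

    def __init__(self, rate: float = 0.01, seed: Optional[int] = None):
        self.rate = rate                           # 1 per 100 decisions (Advanced tier)
        self.rng = random.Random(seed)
        self.stats = defaultdict(lambda: [0, 0])   # operator -> [caught, shown]

    def maybe_inject(self, recommendation: object, known_bad: object) -> Tuple[object, bool]:
        """Returns (recommendation_to_show, is_challenge)."""
        if self.rng.random() < self.rate:
            return known_bad, True
        return recommendation, False

    def record(self, operator_id: str, was_challenge: bool, rejected: bool) -> None:
        if was_challenge:
            self.stats[operator_id][1] += 1
            self.stats[operator_id][0] += int(rejected)  # engaged operators reject it

    def detection_rate(self, operator_id: str) -> Optional[float]:
        caught, shown = self.stats[operator_id]
        return caught / shown if shown else None
```

Consistent with the Critical Infrastructure guidance below, a missed challenge must never be allowed to execute as a real action, so any injection point belongs upstream of the execution path.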

5. Rationale

Trust Calibration Governance addresses the fundamental vulnerability in any human-in-the-loop architecture: the assumption that human oversight is meaningful simply because a human is present. Decades of human factors research in aviation, nuclear power, and process control demonstrate that human trust in automated systems follows predictable but ungoverned trajectories — typically rising to complacency after a period of high automation reliability, or collapsing to disuse after a salient failure event. Neither trajectory produces effective oversight.

The concept originates from Lee and See's foundational trust calibration framework (2004), which established that trust in automation must be calibrated to match actual system capability. Parasuraman and Riley's research on automation misuse and disuse (1997) demonstrated that uncalibrated trust produces worse outcomes than no automation at all — operators either ignore valid automation outputs or fail to detect automation failures. These findings have been consistently replicated across domains for three decades.

In the AI agent context, the trust calibration problem is amplified by three factors. First, AI agents exhibit variable reliability across task categories — an agent that is 99% accurate on routine decisions may be only 60% accurate on edge cases, but operators who experience the 99% develop trust that generalises inappropriately to the 60%. Second, AI agent reliability can shift rapidly due to data drift, model updates, or environmental changes, creating a moving target for human calibration. Third, AI agents can be persuasive in their explanations, creating an illusion of competence that further biases operators toward over-trust.

AG-104 intersects directly with AG-019 (Human Escalation & Override Triggers) because escalation mechanisms are effective only when operators trust them to work and trust their own judgement to invoke them. It intersects with AG-038 (Human Control Responsiveness) because response time requirements are meaningful only when operators are engaged rather than complacent. And it intersects with AG-049 (Governance Decision Explainability) because explanation quality directly influences trust calibration — poor explanations can undermine warranted trust, while convincing explanations of wrong answers can amplify unwarranted trust.

6. Implementation Guidance

Trust calibration is an ongoing operational control, not a one-time configuration. The implementation must continuously measure the alignment between operator trust (as expressed through behaviour) and agent reliability (as measured through outcomes), then intervene when the two diverge.

Recommended patterns:

- Baseline each operator's review time, override rate, and query rate during initial deployment, before trust has drifted in either direction.
- Surface per-category reliability in the decision interface itself, not in a separate report operators must seek out.
- Automate recalibration: curated recent-error reviews for operators trending toward complacency, validated accuracy briefings for operators trending toward disuse.
- Adjust threshold bands as measured agent reliability changes, rather than leaving them fixed at deployment values.
- Use challenge tasks, at a frequency proportionate to decision value, to verify operators are genuinely evaluating outputs.

Anti-patterns to avoid:

- Treating rapid approval as operator efficiency; a 2-second approval pattern is a complacency signal, not a productivity win (Scenario A).
- Publishing a single global accuracy figure that hides weak per-category performance (the 99%-routine versus 60%-edge-case problem described in the Rationale).
- Treating trust calibration as a one-time training event rather than an ongoing operational control.
- Briefing different operator populations inconsistently, which creates the shift-level calibration asymmetry seen in Scenario C.

Industry Considerations

Financial Services. Trust calibration directly supports SM&CR obligations by demonstrating that human oversight is substantive rather than nominal. The FCA's expectations for algorithmic trading oversight (MiFID II RTS 6) require that human monitors are capable of intervening effectively — which requires calibrated trust. Trading firms should align trust calibration thresholds with existing human performance monitoring for manual traders. Challenge task frequency should be higher for high-value decision categories: at least 1 per 50 decisions for trade approvals exceeding £100,000.

Healthcare. Clinical decision support systems are subject to MHRA regulation (where the AI qualifies as a medical device) and CQC oversight of care quality. Trust calibration is a patient safety control: both over-trust (missing an AI error that harms a patient) and under-trust (ignoring a valid AI finding that delays diagnosis) produce adverse patient outcomes. Per-category reliability data should align with clinical sensitivity — distinguishing accuracy for common conditions from accuracy for rare conditions where AI training data is sparse.

Critical Infrastructure. In safety-critical environments (aviation, nuclear, process control), trust calibration has an established regulatory basis through human factors requirements in IEC 61511, DO-178C, and nuclear regulatory frameworks. Organisations deploying AI agents in these contexts should map AG-104 requirements to existing human factors obligations. Challenge task design must account for the risk that a synthetic error could trigger real safety consequences if the operator fails to catch it.

Maturity Model

Basic Implementation — The organisation tracks at least three trust proxy metrics (review time, override rate, query rate) per operator. Threshold bands are defined based on initial calibration baselines. Alerts are generated when operators breach thresholds. Recalibration consists of a notification to the operator's supervisor. Per-category reliability data is available but not actively presented to operators. This level meets the minimum mandatory requirements but relies on manual supervision for recalibration and does not systematically verify that operators are genuinely evaluating outputs.

Intermediate Implementation — All basic capabilities plus: per-category reliability dashboards are presented to operators in the decision interface. Structured recalibration interventions are automated — complacent operators receive curated error examples; distrustful operators receive validated accuracy presentations. Trust metrics are tracked per operator, per category, and per shift, with population-level analysis identifying systemic calibration issues. Thresholds adjust dynamically as agent reliability changes. Recalibration effectiveness is measured by tracking whether metrics return to calibrated bands within 5 working days of intervention.

Advanced Implementation — All intermediate capabilities plus: challenge tasks are injected at a rate of at least 1 per 100 decisions, with per-operator detection rates tracked and reported. Predictive models identify operators trending toward miscalibration before thresholds are breached. Trust calibration data feeds into agent deployment decisions — new agent capabilities are not activated until operator trust calibration for the new category has been validated. Independent auditors review trust calibration effectiveness annually. The organisation can demonstrate to regulators that human oversight is substantively engaged, not nominal, with quantitative evidence at the per-operator level.

7. Evidence Requirements

Required artefacts:

- Trust proxy metric logs (review time, override rate, query rate) with timestamps and operator identifiers (4.5).
- Documented threshold bands for each metric and task category, together with the agent reliability data used to validate them (4.2).
- Records of every threshold breach and the recalibration intervention delivered in response (4.3, 4.7).
- The per-category reliability data presented to operators, with its weekly update history (4.4).

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Engagement Metric Capture Accuracy

Test 8.2: Threshold Breach Detection

Test 8.3: Recalibration Intervention Delivery

Test 8.4: Per-Category Reliability Display Accuracy

Test 8.5: Challenge Task Indistinguishability

Test 8.6: Trust Metric Logging Completeness

Test 8.7: Population-Level Asymmetry Detection

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 14 (Human Oversight) | Direct requirement
EU AI Act | Article 9 (Risk Management System) | Supports compliance
FCA | SYSC 6.1.1R (Systems and Controls) | Supports compliance
MiFID II RTS 6 | Article 18 (Human review of algorithmic trading) | Direct requirement
NIST AI RMF | GOVERN 1.4, MEASURE 2.6 | Supports compliance
ISO 42001 | Clause 8.4 (AI System Operation) | Supports compliance
MHRA Software as a Medical Device | Intended Purpose & Human Factors | Supports compliance

EU AI Act — Article 14 (Human Oversight)

Article 14(4)(a) requires that human oversight measures enable the individual exercising oversight to "correctly interpret the high-risk AI system's output." Trust calibration is the operational mechanism that ensures this requirement is met in practice. Without calibrated trust, human overseers either do not interpret outputs at all (complacency) or reject correct outputs (disuse). Article 14(4)(b) requires that overseers can "decide not to use the high-risk AI system or to disregard, override or reverse the output." Trust calibration ensures that override decisions are based on informed judgement rather than miscalibrated trust. The EU AI Act's human oversight requirements are meaningful only if the humans exercising oversight are genuinely engaged — AG-104 provides the control that ensures genuine engagement.

MiFID II RTS 6 — Article 18

Article 18 requires investment firms using algorithmic trading to ensure adequate human review. For AI agents executing or recommending trades, trust calibration provides evidence that human reviewers are substantively engaged rather than rubber-stamping. Regulatory supervisors examining a trading loss will ask not only whether a human reviewed the trade but whether the review was meaningful — AG-104 provides the quantitative evidence to answer that question.

NIST AI RMF — GOVERN 1.4, MEASURE 2.6

GOVERN 1.4 addresses organisational practices for AI risk governance including human-AI interaction. MEASURE 2.6 addresses the assessment of human-AI teaming effectiveness. AG-104 provides the measurement infrastructure and intervention mechanisms that operationalise these functions, ensuring that human-AI interaction is monitored and managed rather than assumed.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | Organisation-wide — affects all human-overseen agent operations and the credibility of human oversight claims to regulators

Consequence chain: Without trust calibration controls, human oversight degrades silently. Over-trust produces rubber-stamping that renders human-in-the-loop controls functionally absent — the organisation believes it has human oversight but in practice does not. Under-trust produces rejection of valid agent outputs, negating the operational benefit of agent deployment and potentially producing worse outcomes than either pure human or pure agent operation. Both failure modes are invisible without measurement: an operator who approves every recommendation in 2 seconds appears to be "working efficiently" unless engagement metrics are tracked. The regulatory consequence is severe: organisations claiming human oversight compliance (EU AI Act Article 14, FCA SM&CR) without trust calibration evidence face the risk that a post-incident investigation reveals oversight was nominal. The financial consequence depends on the domain but scales with the volume of decisions processed under miscalibrated oversight. In the financial services example above, a single complacent reviewer produced £2.3 million in losses over three days; at scale, the exposure is proportionally larger.

Cross-references: AG-019 (Human Escalation & Override Triggers) establishes when escalation must occur; AG-104 ensures operators are calibrated to invoke escalation appropriately. AG-038 (Human Control Responsiveness) sets response time requirements; AG-104 ensures operators are engaged enough to meet them. AG-049 (Governance Decision Explainability) provides the explanation quality that supports informed trust; AG-104 measures whether that trust is actually calibrated. AG-105 (Oversight Workload and Alarm Fatigue Governance) addresses the workload conditions that degrade trust calibration. AG-106 (Human Skill Atrophy Monitoring Governance) addresses the skill degradation that miscalibrated trust accelerates. AG-107 (Override Usability and Actionability Governance) ensures that when calibrated operators decide to override, the mechanism is usable. AG-108 (Operator Role Segregation Governance) ensures that trust calibration is measured per role with appropriate thresholds.

Cite this protocol
AgentGoverning. (2026). AG-104: Trust Calibration Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-104