AG-442

Confidence Calibration Interface Governance

Human Factors, Oversight & Trust Calibration · AGS v2.1 · April 2026

EU AI Act · SOX · FCA · NIST · ISO 42001

2. Summary

Confidence Calibration Interface Governance requires that AI agent systems expose their uncertainty, confidence levels, and epistemic limitations through interface elements that humans interpret correctly — meaning the human reviewer's perceived confidence in the agent's output aligns with the agent's actual reliability for that specific output. Miscalibrated confidence presentation is a critical human-factors failure: if an agent displays 92% confidence and the human interprets this as "almost certainly correct" but the agent's actual accuracy at stated 92% confidence is only 71%, the human will under-scrutinise outputs, skip verification steps, and defer to the agent when they should override. This dimension mandates that confidence indicators are empirically calibrated, presented in formats validated for correct human interpretation, and accompanied by contextual information that enables reviewers to make appropriate trust decisions.

3. Example

Scenario A — Percentage Confidence Creates False Precision in Loan Underwriting: A lending agent evaluates mortgage applications and presents its recommendation with a numerical confidence score. For Application 4,271, the agent displays: "Recommendation: APPROVE — Confidence: 94.3%." The underwriter sees 94.3% and interprets this as a near-certain correct recommendation requiring minimal scrutiny. She approves the application in 45 seconds rather than her typical 4-minute review. Post-hoc analysis reveals that the agent's confidence scores are poorly calibrated: when the agent states 94% confidence, its actual accuracy is 78%. Furthermore, the one-decimal-place display (94.3% rather than "high confidence") creates a false impression of precision. The approved application defaults after 11 months — the applicant had an undisclosed second mortgage that the agent's model could not detect from the available data, a limitation that was within the agent's uncertainty estimate but not communicated to the underwriter. The default results in a £287,000 loss.

What went wrong: The confidence display was numerically precise but empirically uncalibrated. The one-decimal-place format implied a level of measurement precision that did not exist. The underwriter had no way to know that "94.3% confidence" meant "78% actual accuracy." No calibration data, no confidence interval, and no indication of what the agent could not assess (the undisclosed second mortgage) were presented. The interface design optimised for speed (a single number) rather than correct interpretation. Consequence: £287,000 default loss, portfolio-level under-scrutiny affecting 340 applications approved during the period of miscalibrated confidence display, £1.4 million in additional provisioning required.

Scenario B — Traffic-Light Confidence Collapses Meaningful Uncertainty Gradations in Medical Triage: A clinical triage agent classifies emergency department presentations into urgency categories. The interface uses a traffic-light system: green (confident in classification), amber (moderate confidence), red (low confidence — refer to senior clinician). For Patient 892, the agent classifies the presentation as "Category 3 — Urgent, not life-threatening" with an internal confidence of 0.61. The traffic-light threshold maps 0.50–0.80 to amber. The triage nurse sees an amber indicator and interprets it as "somewhat uncertain, but within normal range — proceed with standard Category 3 pathway." However, the agent's internal confidence distribution is bimodal: 0.61 probability of Category 3, but 0.34 probability of Category 1 (immediate, life-threatening). The traffic-light system collapses this critical bimodal distribution into a single amber indicator. The patient is triaged as Category 3. The actual condition is a Category 1 aortic dissection. Treatment delay: 47 minutes. Patient outcome: emergency surgery, extended ICU stay, partial permanent disability.

What went wrong: The traffic-light confidence display discarded critical information about the shape of the agent's uncertainty. A 0.61 confidence with 0.34 of the remaining probability concentrated on life-threatening Category 1 is categorically different from a 0.61 confidence with the remaining 0.39 spread evenly across benign categories. The traffic-light system conveyed neither the magnitude of the worst-case classification nor its probability. The interface design prioritised simplicity over the information density required for safe clinical decisions. Consequence: patient harm from delayed treatment, £2.3 million litigation settlement, clinical decision-support agent suspended for 6 months pending interface redesign.

Scenario C — Confidence Without Context Enables Automation Bias in Sanctions Screening: A financial crime compliance agent screens international wire transfers against sanctions lists. The agent displays: "Match confidence: 87% — Potential match with designated entity SDN-4472." The compliance analyst has processed 180 alerts today. She has learned from experience that "87% confidence" alerts are usually true matches. She files a Suspicious Activity Report (SAR) and blocks the transfer. However, the 87% confidence is based on name similarity alone. The agent has no visibility into contextual factors: the sending entity is a hospital in a non-sanctioned country, the transfer amount is consistent with medical equipment procurement, and the name similarity is due to common Arabic naming patterns, not an actual sanctions match. The agent's confidence is high because its scope is narrow (name matching), but the analyst interprets the confidence as covering the full match assessment (name, context, intent). The blocked transfer delays £430,000 in medical equipment delivery to a paediatric hospital for 6 weeks. The SAR is ultimately determined to be a false positive.

What went wrong: The confidence score lacked scope labelling — it conveyed confidence in name similarity but the analyst interpreted it as confidence in overall sanctions match likelihood. No indication of what the confidence score covered and what it did not cover was presented. The analyst's calibration was based on historical experience where high-confidence alerts were usually true matches, but the agent's model had been recently updated with a more aggressive name-matching algorithm that increased confidence scores without improving contextual accuracy. No recalibration notice was provided to analysts after the model update. Consequence: 6-week delay in medical equipment delivery, reputational damage, £43,000 in expedited reshipment costs, regulatory criticism for inadequate false-positive management.

4. Requirement Statement

Scope: This dimension applies to any AI agent deployment where the agent produces outputs accompanied by confidence indicators, uncertainty estimates, probability scores, or any other representation of the agent's certainty or reliability, and where a human reviewer uses those indicators to inform oversight, approval, escalation, or override decisions. The scope includes explicit confidence scores (numerical percentages, probability values), categorical confidence indicators (traffic lights, high/medium/low labels), implicit confidence signals (ranking order, recommendation strength language), and the absence of confidence information where it should be present. Any deployment where the human reviewer's decision is influenced by the agent's expressed or implied confidence level is in scope. The scope also extends to model updates that change the calibration properties of existing confidence indicators, requiring recalibration validation and reviewer notification. The test is: does the human reviewer adjust their scrutiny, verification effort, or override likelihood based on the agent's expressed confidence? If yes, this dimension applies in full.

4.1. A conforming system MUST empirically calibrate confidence indicators so that stated confidence levels correspond to observed accuracy rates within defined tolerance bands (recommended: stated confidence of X% must correspond to actual accuracy of X% plus or minus 5 percentage points, measured over a rolling window of at least 1,000 decisions).
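
One way to automate the 4.1 tolerance check is sketched below. The window size, tolerance band, and bin count follow the recommended values in 4.1; the class and field names are illustrative, and the minimum-bin-count guard is an added assumption to avoid flagging sparsely populated bins.

```python
from collections import deque

class CalibrationToleranceMonitor:
    """Rolling check that stated confidence tracks observed accuracy
    within +/- 5 percentage points over the last 1,000 decisions (4.1)."""

    def __init__(self, window: int = 1000, tolerance: float = 0.05, n_bins: int = 10):
        self.decisions = deque(maxlen=window)   # (stated_confidence, was_correct)
        self.tolerance = tolerance
        self.n_bins = n_bins

    def record(self, stated_confidence: float, was_correct: bool) -> None:
        self.decisions.append((stated_confidence, was_correct))

    def violations(self) -> list[dict]:
        """Return confidence bins where |stated - observed| exceeds tolerance."""
        bins = [[] for _ in range(self.n_bins)]
        for conf, correct in self.decisions:
            bins[min(int(conf * self.n_bins), self.n_bins - 1)].append((conf, correct))
        out = []
        for idx, entries in enumerate(bins):
            if len(entries) < 30:   # assumed guard: skip bins too sparse to judge
                continue
            mean_stated = sum(c for c, _ in entries) / len(entries)
            observed = sum(1 for _, ok in entries if ok) / len(entries)
            if abs(mean_stated - observed) > self.tolerance:
                out.append({"bin": idx, "stated": round(mean_stated, 3),
                            "observed": round(observed, 3), "n": len(entries)})
        return out
```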

4.2. A conforming system MUST present confidence information in formats that have been validated — through user testing with representative reviewers — for correct human interpretation, meaning the reviewer's perceived reliability aligns with the agent's actual reliability within defined tolerance.

4.3. A conforming system MUST accompany confidence indicators with scope labelling that explicitly states what the confidence score covers (e.g., "name similarity only") and what it does not cover (e.g., "does not assess contextual plausibility, geographic risk, or transaction pattern"), preventing reviewers from interpreting narrow confidence as comprehensive assessment confidence.
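
Scope labelling is easiest to enforce when it is a structural property of the confidence payload rather than optional free text. A minimal sketch, assuming illustrative identifiers (none of these names are defined by the protocol):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScopedConfidence:
    """A confidence indicator that cannot be constructed without scope labels (4.3)."""
    value: float                 # calibrated confidence, 0.0-1.0
    covers: tuple[str, ...]      # e.g. ("name similarity",)
    excludes: tuple[str, ...]    # e.g. ("contextual plausibility", "geographic risk")

    def __post_init__(self):
        if not self.covers or not self.excludes:
            raise ValueError("confidence must declare covered and excluded scope")

    def render(self) -> str:
        return (f"Confidence {self.value:.0%} (covers: {', '.join(self.covers)}; "
                f"does not assess: {', '.join(self.excludes)})")

# Scenario C rendered with scope labels:
print(ScopedConfidence(0.87, ("name similarity",),
                       ("contextual plausibility", "transaction pattern")).render())
```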

4.4. A conforming system MUST display the distribution of uncertainty when the agent's confidence is multimodal or when alternative classifications carry significant consequences — not merely the top-ranked classification's confidence, but at minimum the probability and consequence severity of the second-most-likely classification when its probability exceeds a defined threshold (recommended: 10% probability) and its consequence severity exceeds the top classification's severity.
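
A sketch of the 4.4 display rule, assuming an integer severity scale where higher means worse (the scale and the function name are illustrative):

```python
def alternatives_to_display(distribution: dict[str, float],
                            severity: dict[str, int],
                            prob_threshold: float = 0.10) -> list[str]:
    """Return the classifications the interface must surface under 4.4:
    the top-ranked classification, plus any alternative whose probability
    exceeds the threshold and whose consequence severity exceeds the
    top classification's severity."""
    ranked = sorted(distribution.items(), key=lambda kv: kv[1], reverse=True)
    top_label = ranked[0][0]
    shown = [top_label]
    for label, prob in ranked[1:]:
        if prob >= prob_threshold and severity[label] > severity[top_label]:
            shown.append(label)
    return shown

# Scenario B's bimodal triage distribution:
dist = {"Category 3": 0.61, "Category 1": 0.34, "Category 4": 0.05}
sev = {"Category 1": 5, "Category 3": 3, "Category 4": 2}
print(alternatives_to_display(dist, sev))   # ['Category 3', 'Category 1']
```

Under this rule, the amber light in Scenario B would have been accompanied by the 0.34 probability of Category 1, which is exactly the information the traffic-light display discarded.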

4.5. A conforming system MUST recalibrate confidence indicators after any model update, retraining, fine-tuning, or data pipeline change that could affect the relationship between stated confidence and actual accuracy, and MUST notify reviewers when recalibration results in materially different confidence distributions (recommended: notification when average confidence for any decision category shifts by more than 5 percentage points).
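
The notification trigger in 4.5 can be implemented as a simple before/after comparison of per-category confidence distributions. A minimal sketch; the input structure and function name are assumptions:

```python
def confidence_shift_alerts(before: dict[str, list[float]],
                            after: dict[str, list[float]],
                            threshold_pp: float = 5.0) -> list[str]:
    """Flag decision categories whose mean stated confidence moved by more
    than `threshold_pp` percentage points across a model update (4.5)."""
    alerts = []
    for category in sorted(before.keys() & after.keys()):
        mean_before = 100 * sum(before[category]) / len(before[category])
        mean_after = 100 * sum(after[category]) / len(after[category])
        shift = mean_after - mean_before
        if abs(shift) > threshold_pp:
            alerts.append(f"{category}: mean confidence shifted {shift:+.1f}pp; "
                          "notify reviewers and rerun calibration validation")
    return alerts
```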

4.6. A conforming system MUST publish calibration performance metrics — reliability diagrams, expected calibration error (ECE), or equivalent measures — to reviewers and governance stakeholders on a defined cadence (recommended: monthly), enabling independent assessment of whether the agent's confidence indicators remain trustworthy.
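
Expected calibration error, named in 4.6, is the sample-weighted mean gap between stated confidence and observed accuracy across equal-width confidence bins. A minimal self-contained implementation:

```python
def expected_calibration_error(confidences: list[float],
                               correct: list[bool],
                               n_bins: int = 10) -> float:
    """Binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    ece = 0.0
    for entries in bins:
        if not entries:
            continue
        mean_conf = sum(c for c, _ in entries) / len(entries)
        accuracy = sum(1 for _, ok in entries if ok) / len(entries)
        ece += (len(entries) / n) * abs(accuracy - mean_conf)
    return ece
```

A perfectly calibrated system has an ECE of zero; Scenario A's agent (94% stated, 78% observed) would contribute a gap of roughly 0.16 in its top bin.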

4.7. A conforming system SHOULD avoid false precision in confidence display — presenting confidence at a precision level that exceeds the measurement's actual discriminative ability (e.g., displaying 94.3% when the calibration data cannot distinguish between 90% and 95%).
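
One way to implement 4.7 is to round the displayed confidence to the calibration bin width, so the interface can never claim more resolution than the calibration data provides. The 5-point bin width below is an assumption, chosen to match the tolerance in 4.1:

```python
def display_confidence(calibrated: float, bin_width_pp: float = 5.0) -> str:
    """Render confidence as a band no narrower than the calibration bin (4.7)."""
    lo = (calibrated * 100) // bin_width_pp * bin_width_pp
    hi = min(lo + bin_width_pp, 100.0)
    return f"{lo:.0f}-{hi:.0f}%"

print(display_confidence(0.943))   # "90-95%", not "94.3%"
```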

4.8. A conforming system SHOULD implement adaptive confidence thresholds that adjust based on consequence severity — lower confidence thresholds for triggering enhanced scrutiny when the consequences of error are more severe (e.g., a 75% confidence threshold triggers enhanced review for a £500 decision, but a 95% confidence threshold is required for the same review reduction on a £500,000 decision).
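
The severity-adaptive threshold in 4.8 can be expressed as an interpolation between anchor points. The sketch below log-interpolates between the two example values given in 4.8; the interpolation scheme itself is an illustrative assumption, not a protocol requirement:

```python
import math

def review_reduction_threshold(exposure_gbp: float) -> float:
    """Confidence the agent must state before human review may be reduced,
    scaling with monetary exposure (4.8)."""
    lo_exp, lo_thr = 500.0, 0.75          # £500 decision: 75% threshold
    hi_exp, hi_thr = 500_000.0, 0.95      # £500,000 decision: 95% threshold
    if exposure_gbp <= lo_exp:
        return lo_thr
    if exposure_gbp >= hi_exp:
        return hi_thr
    frac = (math.log(exposure_gbp) - math.log(lo_exp)) / (math.log(hi_exp) - math.log(lo_exp))
    return lo_thr + frac * (hi_thr - lo_thr)

print(f"{review_reduction_threshold(50_000):.0%}")   # roughly 88% for a £50,000 decision
```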

4.9. A conforming system SHOULD provide reviewers with confidence trend information showing whether the agent's confidence for similar decision types has been stable, increasing, or decreasing over recent periods, enabling detection of systematic confidence inflation or deflation.

4.10. A conforming system MAY implement personalised calibration feedback — showing each individual reviewer how their decisions correlate with the agent's confidence levels, identifying calibration biases (e.g., "You override the agent 3% of the time when confidence exceeds 90%, but the agent's actual error rate at >90% confidence is 11%, suggesting under-scrutiny at high confidence levels").
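
A sketch of the per-reviewer feedback computation in 4.10. Each record is assumed to carry the agent's stated confidence, whether the reviewer overrode it, and the post-hoc ground truth; all field names are illustrative:

```python
def reviewer_calibration_feedback(records: list[dict],
                                  confidence_floor: float = 0.90) -> str:
    """Compare a reviewer's override rate at high confidence with the
    agent's actual error rate at the same confidence (4.10)."""
    high = [r for r in records if r["confidence"] >= confidence_floor]
    if not high:
        return "no high-confidence decisions in window"
    override_rate = sum(r["overridden"] for r in high) / len(high)
    error_rate = sum(not r["agent_correct"] for r in high) / len(high)
    verdict = ("possible under-scrutiny at high confidence"
               if override_rate < error_rate
               else "override rate consistent with observed error rate")
    return (f"override rate {override_rate:.0%} vs agent error rate {error_rate:.0%} "
            f"at confidence >= {confidence_floor:.0%}: {verdict}")
```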

5. Rationale

Confidence calibration sits at the intersection of two well-documented failure modes: machine learning miscalibration and human automation bias. Modern AI systems are frequently overconfident — neural networks in particular tend to produce confidence scores that are systematically higher than their actual accuracy. When these overconfident scores are presented to human reviewers through interfaces that amplify the perceived precision (one-decimal-place percentages, authoritative colour coding), the result is a compounding error: the machine is more confident than it should be, and the human is more trusting than they should be.

Research on automation bias demonstrates that humans adjust their verification effort based on the system's expressed confidence. When a system says "95% confident," the human reviewer unconsciously reduces scrutiny — fewer checks, faster decisions, less likelihood of override. This is rational behaviour if the confidence is calibrated: if 95% confidence means 95% accuracy, reduced scrutiny is appropriate. But if 95% confidence means 78% accuracy, the human's reduced scrutiny creates a systematic gap in which the 22% of outputs that are erroneous pass through oversight largely undetected. The compounding effect is significant: a miscalibrated confidence display does not merely fail to inform — it actively misinforms, causing the human to behave less safely than they would with no confidence information at all.

The interface design challenge is substantial. Numerical percentages create false precision (Scenario A). Categorical systems (traffic lights) discard information density that may be critical for safe decisions (Scenario B). Confidence scores without scope labelling cause misinterpretation of narrow assessments as comprehensive ones (Scenario C). No single presentation format is universally correct — the appropriate format depends on the decision context, the consequence severity, the reviewer's expertise, and the nature of the agent's uncertainty. This is why the dimension requires user-validated formats rather than prescribing a specific format.

The regulatory pressure on confidence communication is increasing. The EU AI Act Article 13 requires transparency including "the level of accuracy" of AI systems, which necessarily includes communicating that accuracy level to human overseers in a comprehensible way. Article 14 requires effective human oversight, which is impossible if the human's trust calibration is systematically distorted by miscalibrated confidence indicators. FCA expectations on model risk management include validation that model outputs, including confidence indicators, are reliable and interpretable by their intended users. ISO 42001 addresses the trustworthiness of AI systems, which includes the trustworthiness of the system's self-assessment of its own outputs.

The temporal dimension is also critical. Confidence calibration is not a one-time validation — it degrades over time as data distributions shift, as models are updated, and as the operational environment changes. An agent that was well-calibrated at deployment may become miscalibrated within months as its operating context evolves. Requirement 4.5 addresses this by mandating recalibration after model changes, and Requirement 4.6 addresses it by mandating ongoing calibration performance publication. Without these temporal safeguards, calibration validated at deployment provides false assurance that degrades to active harm as the calibration drifts.

The consequence of getting confidence calibration wrong is not merely that individual decisions suffer — it is that the entire human-oversight architecture is undermined. If reviewers cannot trust the agent's confidence indicators, they face an impossible choice: either scrutinise every decision equally (eliminating any efficiency benefit of the agent) or develop their own implicit calibration through experience (which is slow, inconsistent, and subject to recency bias). Neither outcome is acceptable. Governed confidence calibration is a precondition for the human-AI collaborative oversight model that modern agent deployments depend upon.

6. Implementation Guidance

Confidence Calibration Interface Governance requires a pipeline that connects model-level uncertainty quantification to interface-level confidence presentation through an empirical calibration layer. The model produces raw confidence scores; the calibration layer maps those scores to empirically validated accuracy rates; the interface layer presents the calibrated confidence in formats validated for correct human interpretation. Each layer can fail independently, and failures compound across layers.
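
A minimal end-to-end sketch of that three-layer separation. All names are illustrative; the lookup table stands in for a calibration map fitted offline (e.g., by isotonic regression or Platt scaling on held-out data), and the banded, scope-labelled rendering follows 4.3 and 4.7:

```python
def calibrate(raw: float, calibration_map: dict[float, float]) -> float:
    """Layer 2: map a raw (typically overconfident) model score to the
    empirically observed accuracy for that score range."""
    edge = max(e for e in sorted(calibration_map) if e <= raw)
    return calibration_map[edge]

def present(calibrated: float, covers: list[str], excludes: list[str]) -> str:
    """Layer 3: banded, scope-labelled display."""
    lo = int(calibrated * 100) // 5 * 5
    return (f"Confidence {lo}-{lo + 5}% (covers: {', '.join(covers)}; "
            f"does not assess: {', '.join(excludes)})")

# Fitted offline from held-out outcomes; values are illustrative and mirror
# Scenario A's shape: raw scores above 0.9 correspond to ~78% accuracy.
calibration_map = {0.0: 0.05, 0.5: 0.45, 0.8: 0.66, 0.9: 0.78}

raw = 0.943                                   # layer 1: raw model score
print(present(calibrate(raw, calibration_map),
              ["income verification", "credit history"],
              ["undisclosed liabilities"]))   # "Confidence 75-80% (...)"
```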

Recommended patterns:

- Maintain an explicit calibration layer between raw model scores and the interface, fitted on held-out data and refitted after every model change (4.1, 4.5).
- Display confidence as bands matched to the resolution the calibration data supports, not point estimates (4.7).
- Attach scope labels to every confidence indicator stating what it covers and what it excludes (4.3).
- Surface high-severity alternative classifications whenever their probability is material (4.4).
- Publish reliability diagrams and ECE to reviewers and governance stakeholders on a fixed cadence (4.6).
- Scale scrutiny thresholds with consequence severity (4.8), and give reviewers trend and personalised calibration feedback (4.9, 4.10).

Anti-patterns to avoid:

- Point-estimate percentages whose precision exceeds the calibration data's discriminative ability (Scenario A).
- Categorical indicators such as traffic lights that collapse multimodal uncertainty into a single signal (Scenario B).
- Unscoped confidence scores that invite reviewers to read a narrow assessment as a comprehensive one (Scenario C).
- Model updates that shift confidence distributions without recalibration or reviewer notification.
- One-time calibration at deployment with no drift monitoring thereafter.

Industry Considerations

Financial Services. Credit scoring, fraud detection, and sanctions screening all produce confidence scores that drive human decisions with significant financial consequences. Financial services regulators (FCA, PRA, OCC, Fed) have model risk management expectations (e.g., SR 11-7 in the US, SS1/23 in the UK) that explicitly require model outputs to be reliable and interpretable. Confidence calibration is a direct model-validation requirement. For sanctions screening specifically, the false-positive management challenge (Scenario C) means that scope-labelled confidence is essential to prevent reviewers from blocking legitimate transactions based on narrow name-matching confidence interpreted as comprehensive sanctions-match confidence.

Healthcare. Clinical confidence presentation has life-or-death consequences. The traffic-light problem (Scenario B) is particularly acute: clinical triage requires understanding the severity of the worst plausible classification, not just the most likely classification. Healthcare implementations must prioritise multimodal uncertainty display that shows reviewers when a low-probability classification has high-severity consequences. Medical device regulations (MDR in the EU, FDA in the US) require clinical decision-support outputs to be validated for correct interpretation by intended users.

Public Sector / Benefits Processing. Government decision agents must present confidence in ways that prevent both over-reliance (rubber-stamping high-confidence recommendations without review) and under-reliance (manual review of every decision regardless of confidence, making the agent economically pointless). The equity dimension is critical: if confidence calibration varies across demographic subgroups (e.g., the agent is well-calibrated for majority populations but overconfident for minority populations), the confidence display will systematically mislead reviewers for minority applicants.
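
Differential calibration of this kind is detectable with a per-subgroup version of the 4.1 check. A sketch, assuming each decision record carries a subgroup tag, the stated confidence, and the post-hoc outcome (field names, the gap threshold, and the minimum sample size are illustrative):

```python
def subgroup_calibration_gaps(records: list[dict], max_gap_pp: float = 5.0) -> list[str]:
    """Flag subgroups where mean stated confidence and observed accuracy
    diverge by more than `max_gap_pp` percentage points."""
    by_group: dict[str, list[dict]] = {}
    for r in records:
        by_group.setdefault(r["group"], []).append(r)
    flags = []
    for group, rs in sorted(by_group.items()):
        if len(rs) < 100:    # assumed guard: require a minimum sample per group
            continue
        mean_conf = 100 * sum(r["confidence"] for r in rs) / len(rs)
        accuracy = 100 * sum(r["correct"] for r in rs) / len(rs)
        if abs(mean_conf - accuracy) > max_gap_pp:
            flags.append(f"{group}: stated {mean_conf:.1f}% vs observed "
                         f"{accuracy:.1f}% accuracy; differential calibration")
    return flags
```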

Safety-Critical / Embodied Agents. Robotic and CPS agents must present confidence in real-time, often to operators under time pressure. The display format must be interpretable within the operator's decision timeframe (potentially seconds). Confidence bands and multimodal displays must be designed for rapid comprehension, not detailed analysis.

Maturity Model

Basic Implementation — The organisation has empirically calibrated confidence indicators using held-out validation data. Calibration performance is measured using reliability diagrams and ECE. Confidence is displayed with appropriate precision (no false precision). Scope labelling identifies what the confidence score covers. Recalibration occurs after model updates. Calibration metrics are published to governance stakeholders at least quarterly.

Intermediate Implementation — All basic capabilities plus: confidence is displayed as bands or ranges rather than point estimates. Multimodal uncertainty is visible when alternative classifications carry significant consequences. User testing with representative reviewers validates that the confidence display produces correct interpretation. Calibration drift monitoring generates automated alerts when ECE exceeds thresholds. Adaptive confidence thresholds adjust scrutiny requirements based on consequence severity. Recalibration includes subgroup analysis to detect differential calibration across demographic groups.

Advanced Implementation — All intermediate capabilities plus: personalised calibration feedback shows each reviewer how their decisions correlate with the agent's confidence. Confidence trend information enables reviewers to detect systematic shifts. Real-time calibration monitoring tracks ECE across all decision categories in production. Independent calibration audit validates that confidence indicators are trustworthy. The organisation can demonstrate through testing that reviewer trust calibration (the relationship between the agent's stated confidence and the reviewer's actual scrutiny level) is within defined tolerance bands.

7. Evidence Requirements

Required artefacts:

- Calibration validation reports for the current model version, including reliability diagrams and ECE measurements.
- User-testing results demonstrating correct reviewer interpretation of the confidence display formats in use.
- Scope-labelling specifications for each confidence indicator.
- Recalibration records and reviewer notification logs for every model update.
- Calibration drift monitoring logs and associated alert records.

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Empirical Calibration Accuracy

Test 8.2: Human Interpretation Accuracy

Test 8.3: Multimodal Uncertainty Display

Test 8.4: Scope Labelling Completeness

Test 8.5: Post-Model-Update Recalibration Validation

Test 8.6: False Precision Prevention

Test 8.7: Calibration Drift Detection and Alerting

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 13 (Transparency) | Direct requirement
EU AI Act | Article 14 (Human Oversight) | Direct requirement
SOX | Section 404 (Internal Controls Over Financial Reporting) | Supports compliance
FCA SYSC | 6.1.1R (Systems and Controls) | Supports compliance
NIST AI RMF | MEASURE 2.1, MANAGE 1.3 | Supports compliance
ISO 42001 | Clause 9.1 (Monitoring, Measurement, Analysis) | Direct requirement
DORA | Article 12 (ICT-related Incident Reporting) | Supports compliance

EU AI Act — Article 13 (Transparency)

Article 13 requires that high-risk AI systems are designed and developed such that their operation is sufficiently transparent to enable users to interpret the system's output and use it appropriately. Confidence indicators are a primary mechanism through which users interpret output reliability. If confidence indicators are miscalibrated, the transparency requirement is not met — the user receives information about the system's output, but that information leads to incorrect interpretation. AG-442 directly implements the transparency requirement by ensuring that confidence indicators are calibrated, scoped, and presented in formats that produce correct human interpretation.

EU AI Act — Article 14 (Human Oversight)

Article 14 requires that human oversight measures enable individuals to correctly understand the relevant capacities and limitations of the AI system. Confidence indicators communicate the system's limitations for each specific output. If the confidence indicator overstates reliability (miscalibrated overconfidence), the human oversight person does not correctly understand the limitation, and Article 14's requirement is violated. AG-442 ensures that the confidence presentation enables the correct understanding of capacities and limitations that Article 14 demands.

SOX — Section 404 (Internal Controls Over Financial Reporting)

When AI agents produce financial outputs (valuations, risk assessments, transaction recommendations) with confidence indicators, those indicators are part of the internal control environment. A miscalibrated confidence indicator that causes reviewers to under-scrutinise financial outputs represents a control weakness. SOX auditors will assess whether confidence indicators used in the financial reporting chain are empirically validated and whether reviewers are appropriately calibrated in their use of those indicators.

FCA SYSC — 6.1.1R (Systems and Controls)

The FCA expects firms to maintain systems and controls that are effective and appropriate. Model risk management expectations (SS1/23) explicitly require that model outputs are validated and that users of model outputs understand their reliability and limitations. Confidence indicators are model outputs; their calibration is a model validation requirement; their correct interpretation by reviewers is a user-suitability requirement. AG-442 provides the specific controls for meeting these expectations.

NIST AI RMF — MEASURE 2.1, MANAGE 1.3

MEASURE 2.1 addresses the evaluation of AI system outputs including assessments of accuracy, fairness, and reliability. Confidence calibration is a direct measure of output reliability. MANAGE 1.3 addresses the allocation of resources for managing identified risks, which includes the resource allocation for calibration validation, user testing, and ongoing monitoring that AG-442 requires. Together, these functions establish the measurement and management infrastructure for confidence calibration.

ISO 42001 — Clause 9.1 (Monitoring, Measurement, Analysis)

ISO 42001 requires organisations to determine what needs to be monitored and measured, including the performance and effectiveness of the AI management system. Confidence calibration is a measurable performance characteristic of the AI system's human interface. ECE, reliability diagrams, and user interpretation accuracy are specific metrics that demonstrate the system's performance in communicating its own uncertainty. AG-442 provides the monitoring and measurement framework that ISO 42001 Clause 9.1 requires for confidence communication.

DORA — Article 12 (ICT-related Incident Reporting)

DORA requires reporting of significant ICT-related incidents. A calibration failure that causes systematic reviewer misjudgement — for example, a model update that shifts confidence scores without recalibration, leading to a period of reviewer over-reliance — may constitute a significant ICT incident if it results in material financial impact. AG-442's recalibration and drift-monitoring requirements reduce the likelihood and duration of such incidents, and the calibration monitoring logs support the incident reporting and root-cause analysis that DORA Article 12 requires.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | Organisation-wide for all decisions influenced by the miscalibrated confidence indicator — the blast radius scales with the number of decisions processed under the miscalibrated regime and the consequence severity of those decisions

Consequence chain: The agent's confidence indicator diverges from its actual accuracy — either through initial miscalibration, calibration drift after model updates, or scope misinterpretation due to absent labelling. Human reviewers adjust their scrutiny based on the miscalibrated indicator: overconfident indicators reduce scrutiny below safe levels, underconfident indicators create unnecessary review burden and alert fatigue. The immediate impact is systematic misallocation of human oversight effort — too little scrutiny on outputs the agent is wrong about, too much scrutiny on outputs the agent is right about. The operational consequence depends on the direction of miscalibration: overconfidence produces undetected errors (Scenario A: £287,000 default from under-scrutinised loan approval; Scenario B: 47-minute treatment delay from collapsed uncertainty distribution), while scope misinterpretation produces misallocated intervention (Scenario C: £430,000 medical equipment delay from false positive filed based on narrow confidence interpreted as comprehensive assessment). The systemic risk is that miscalibrated confidence undermines the entire human-oversight model: if reviewers discover that confidence indicators are unreliable, they either ignore them entirely (losing the efficiency benefit of the agent) or develop individual heuristic calibrations that are inconsistent across reviewers and shift over time. The regulatory consequence is a finding that human oversight — required by EU AI Act Article 14, expected by FCA — is not effective because it is based on misinformation about the agent's reliability.

Cross-references: AG-049 (Explainability Governance), AG-440 (Oversight Ergonomic Design Governance), AG-443 (Reviewer Dissent Capture Governance), AG-444 (Override Rationale Capture Governance), AG-448 (Escalation Timeliness Governance), AG-036 (Reasoning Integrity Governance), AG-022 (Behavioural Drift Detection), AG-415 (Decision Journal Completeness Governance).

Cite this protocol
AgentGoverning. (2026). AG-442: Confidence Calibration Interface Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-442