AG-750

Decision Confidence Calibration Governance

Output Integrity and Transparency Governance · AGS v2.1 · 2026-04-25

EU AI Act · NIST AI RMF · ISO 42001

1. Definition

Decision confidence calibration governance requires that whenever an agentic system expresses confidence in its outputs (through explicit numerical scores, verbal confidence expressions such as "I am confident that...", uncertainty qualifiers, or the absence of hedging), the expressed confidence is statistically calibrated to the actual accuracy of the output. A well-calibrated agent that states 90% confidence should be correct approximately 90% of the time; an agent that expresses certainty should almost never be wrong. Miscalibrated confidence, where the expressed confidence level does not correspond to actual accuracy, is a governance failure because it directly undermines the ability of human consumers, downstream systems, and automated decision gates to weight agent outputs appropriately in their decision-making.

The criticality of this dimension stems from the interaction between confidence signals and human decision-making. Research in human-AI interaction consistently demonstrates that humans use confidence signals as a primary heuristic for deciding how much scrutiny to apply to an AI-generated output. When an agent expresses high confidence, human reviewers spend less time verifying the output, apply less critical scrutiny, and are more likely to accept the output without modification. When an agent expresses low confidence, humans apply more scrutiny and are more likely to seek independent verification. Miscalibrated confidence therefore does not merely mislead; it actively modulates the intensity of human oversight. An overconfident agent systematically suppresses the human verification behaviour that would catch its errors. An underconfident agent wastes human attention on outputs that do not need review, degrading the efficiency of human oversight and potentially causing reviewers to ignore confidence signals entirely (a "cry wolf" effect that devalues every confidence signal, including low-confidence warnings that are warranted).

Failure manifests in two distinct modes. Overconfidence — the more dangerous mode — produces agents that present incorrect outputs with high expressed confidence, leading human reviewers to accept errors they would otherwise have caught. In financial contexts, an overconfident agent that recommends a portfolio allocation with 95% expressed confidence when the actual reliability is 60% causes advisors to present the recommendation to clients without the independent analysis that the actual accuracy level would warrant. In medical contexts, an overconfident diagnostic assistant that presents a differential diagnosis with high confidence may cause a clinician to order treatment without pursuing confirmatory testing. Underconfidence — the less dangerous but operationally costly mode — produces agents that express low confidence even when their outputs are highly reliable, causing human reviewers to spend disproportionate time reviewing accurate outputs and eventually to distrust the confidence signal altogether.

Governance in practice requires organisations to implement confidence calibration measurement as a continuous evaluation process, not a one-time benchmark. Calibration must be measured across the specific domains, query types, and output categories relevant to the deployment, because models may be well-calibrated in some domains and severely miscalibrated in others. Calibration measurement must feed into both model-level adjustments (post-hoc calibration techniques such as temperature scaling, Platt scaling, or isotonic regression) and deployment-level adjustments (threshold tuning for human review gates, confidence display formatting, and consumer communication about confidence reliability). The organisation must also monitor for calibration drift over time, as model updates, retrieval corpus changes, and shifting query distributions can cause a previously well-calibrated system to become miscalibrated without any visible change in its behaviour.

The regulatory and benchmarking landscape strongly supports this dimension. The Stanford HELM framework identifies calibration as a core evaluation dimension, measuring the correspondence between model-expressed confidence and actual accuracy across knowledge domains. The FCA Consumer Duty under PRIN 2A.5 requires firms to support consumer understanding, which is directly undermined when an agent's confidence signals mislead consumers about the reliability of its outputs. PRA SS1/23 Principle 6 addresses model risk management for AI/ML models in financial services, with calibration quality as a fundamental model validation criterion. The EU AI Act Articles 13 and 14 require transparency and human oversight capabilities for high-risk AI systems, both of which depend on the reliability of confidence signals — an oversight mechanism that receives miscalibrated confidence signals cannot fulfil its intended function. The aggregate regulatory message is that confidence calibration is not a nice-to-have model quality attribute but a governance-critical property that directly affects the safety and fairness of AI-assisted decisions.

2. Scope

This dimension applies to all agentic system deployments where the agent produces outputs accompanied by confidence signals — whether explicit (numerical scores, percentage likelihoods, confidence categories) or implicit (verbal confidence qualifiers, hedging language, absence of uncertainty markers) — and where those confidence signals influence human decision-making, automated decision gates, or downstream system behaviour. It applies to all output types including factual assertions, recommendations, classifications, risk assessments, and action proposals.

3. Why This Matters

Decision Confidence Calibration Governance addresses a governance gap that, if left unmanaged, creates systemic risk across the agent ecosystem. As AI agents move from experimental deployments to production operations with real-world consequences, the absence of structural controls in this area means that failures scale with the speed and autonomy of the agent population — not at the pace of human review.

Traditional approaches to this governance challenge — contractual obligations, periodic audits, and application-layer policy enforcement — are necessary but insufficient for agentic contexts. Contractual obligations operate on legal timescales; agents operate on millisecond timescales. Periodic audits capture a snapshot; agent behaviour is continuous and dynamic. Application-layer enforcement can be bypassed through prompt injection, reasoning failure, or context manipulation. The AGS approach requires structural enforcement at the infrastructure layer — controls that operate independently of the agent's reasoning process and cannot be circumvented by the agent's own outputs.

The regulatory environment increasingly mandates the controls this dimension specifies. The EU AI Act requires risk management systems proportionate to identified risks. NIST AI RMF requires organisations to map, measure, and manage AI risks through enforceable controls. ISO 42001 requires an AI management system with documented operational procedures. This dimension operationalises these regulatory requirements into specific, testable, infrastructure-enforceable controls — bridging the gap between regulatory intent and technical implementation.

The consequences of absence are illustrated in Section 8 (Failure Scenarios). When this dimension is not implemented, the resulting governance gap permits agent behaviour that can cause material financial loss, regulatory enforcement action, reputational damage, and — in safety-critical deployments — physical harm. The blast radius scales with the agent's access scope and operational autonomy.

4. Requirements

4.1 Calibration Measurement

R1.1: The deploying organisation MUST implement a calibration measurement process that evaluates whether the agent's expressed confidence levels correspond to its actual accuracy rates across the deployment's operational domain. Calibration MUST be measured using established statistical methods including at minimum: reliability diagrams (calibration curves), expected calibration error (ECE), and maximum calibration error (MCE).

R1.2: Calibration MUST be measured at the granularity of output category, knowledge domain, and confidence band — not solely as a single aggregate metric — to identify domain-specific miscalibration that may be obscured by aggregate statistics.

R1.3: Calibration measurement MUST be performed before initial production deployment (baseline calibration) and at recurring intervals not exceeding 90 days for Advanced-tier deployments and 180 days for all others.

R1.4: The deploying organisation MUST define acceptable calibration error bounds per deployment context. For Financial-Value and Safety-Critical deployments, the ECE threshold MUST NOT exceed 0.10 (10 percentage points). For Customer-Facing deployments, the ECE threshold MUST NOT exceed 0.15.
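The measurements named in R1.1, R1.2, and R1.4 can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes equal-width confidence bins, binary correct/incorrect outcomes, and hypothetical names (`reliability_table`, `ece_and_mce`, `ece_by_segment`, `ECE_BOUNDS`). Only the 0.10 and 0.15 bounds come from R1.4.

```python
from collections import defaultdict

# ECE bounds stated in R1.4; the dictionary keys are illustrative labels.
ECE_BOUNDS = {"financial-value": 0.10, "safety-critical": 0.10, "customer-facing": 0.15}


def reliability_table(confidences, outcomes, n_bins=10):
    """Bucket predictions into equal-width bins (the data behind a reliability diagram)."""
    sums = [[0, 0.0, 0] for _ in range(n_bins)]  # count, confidence sum, hits
    for conf, ok in zip(confidences, outcomes):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 lands in the top bin
        sums[idx][0] += 1
        sums[idx][1] += conf
        sums[idx][2] += 1 if ok else 0
    return [
        {"bin": (i / n_bins, (i + 1) / n_bins), "count": c,
         "mean_confidence": s / c, "accuracy": h / c}
        for i, (c, s, h) in enumerate(sums) if c
    ]


def ece_and_mce(confidences, outcomes, n_bins=10):
    """Return (expected calibration error, maximum calibration error)."""
    rows = reliability_table(confidences, outcomes, n_bins)
    if not rows:
        raise ValueError("no observations supplied")
    total = sum(r["count"] for r in rows)
    gaps = [(r["count"], abs(r["accuracy"] - r["mean_confidence"])) for r in rows]
    ece = sum(count / total * gap for count, gap in gaps)
    mce = max(gap for _, gap in gaps)
    return ece, mce


def ece_by_segment(records, key, n_bins=10):
    """records: dicts with 'confidence', 'correct', and the segment field named by key."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    return {
        seg: ece_and_mce([r["confidence"] for r in recs],
                         [r["correct"] for r in recs], n_bins)
        for seg, recs in groups.items()
    }


# Illustrative records: both domains report 0.8, but one is correct 60% of the
# time and the other 100% of the time, so the aggregate looks calibrated.
records = (
    [{"domain": "regulatory", "confidence": 0.8, "correct": i < 6} for i in range(10)]
    + [{"domain": "operational", "confidence": 0.8, "correct": True} for _ in range(10)]
)
aggregate_ece, _ = ece_and_mce([r["confidence"] for r in records],
                               [r["correct"] for r in records])
per_domain = ece_by_segment(records, "domain")
bound = ECE_BOUNDS["financial-value"]
print("aggregate within bound:", aggregate_ece <= bound)
print({d: round(e, 3) <= bound for d, (e, _) in per_domain.items()})
```

In this toy data the aggregate ECE is close to zero while each domain is off by about 0.2, which is the situation R1.2 is written to expose.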

4.2 Post-Hoc Calibration Adjustment

R2.1: Where calibration measurement reveals miscalibration exceeding the defined acceptable bounds, the deploying organisation MUST implement post-hoc calibration adjustment before the agent's confidence signals are surfaced to human consumers or used in automated decision gates.

R2.2: Acceptable post-hoc calibration methods include temperature scaling, Platt scaling, isotonic regression, or ensemble-based calibration. The selected method MUST be documented with its validation results.

R2.3: Post-hoc calibration models MUST be retrained whenever the underlying model is updated, the retrieval corpus is materially changed, or calibration drift is detected.
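As one illustration of the methods listed in R2.2, the sketch below applies temperature scaling to a single confidence value per output by rescaling it in logit space. It is an assumption-laden simplification: production temperature scaling usually operates on the model's full logit vector, and the grid range, the clamping constants, and the function names here are hypothetical choices for the example.

```python
import math


def _logit(p, eps=1e-6):
    p = min(max(p, eps), 1 - eps)  # keep away from 0 and 1
    return math.log(p / (1 - p))


def apply_temperature(p, temperature):
    """Rescale a confidence in logit space; temperature > 1 softens it."""
    return 1 / (1 + math.exp(-_logit(p) / temperature))


def negative_log_likelihood(confidences, outcomes, temperature):
    total = 0.0
    for p, ok in zip(confidences, outcomes):
        q = apply_temperature(p, temperature)
        q = min(max(q, 1e-12), 1 - 1e-12)  # avoid log(0)
        total -= math.log(q if ok else 1 - q)
    return total / len(confidences)


def fit_temperature(confidences, outcomes):
    """Grid-search the temperature that minimises NLL on a held-out labelled set."""
    grid = [t / 20 for t in range(1, 201)]  # 0.05 .. 10.0 (assumed search range)
    return min(grid, key=lambda t: negative_log_likelihood(confidences, outcomes, t))


# Illustrative held-out set: the agent says 0.9 but is right 6 times out of 10.
raw = [0.9] * 10
labels = [True] * 6 + [False] * 4
temperature = fit_temperature(raw, labels)
calibrated = apply_temperature(0.9, temperature)
```

Per R2.3, a fitted temperature of this kind would need to be refitted whenever the underlying model, the retrieval corpus, or the measured calibration changes.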

4.3 Confidence Signal Presentation

R3.1: The deploying organisation MUST ensure that confidence signals presented to human consumers are interpretable and actionable. Numerical confidence scores MUST be accompanied by plain-language interpretation appropriate to the audience (e.g., "High confidence — this recommendation is reliable in approximately 9 out of 10 similar cases").

R3.2: Verbal confidence expressions MUST be standardised within each deployment to a defined vocabulary with documented mappings to confidence ranges. The agent MUST NOT use ad-hoc confidence language that varies unpredictably across interactions.

R3.3: The deploying organisation MUST NOT present raw model logit probabilities or token-level confidence scores directly to non-technical consumers without transformation into interpretable confidence signals.

R3.4: For Customer-Facing and Public Sector / Rights-Sensitive deployments, confidence presentation MUST be tested with representative user populations to verify that consumers correctly interpret the confidence signals as intended.
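The vocabulary mapping in R3.2 and the plain-language rendering in R3.1 could look roughly like the sketch below. The band boundaries are hypothetical apart from the 80-100% "high confidence" band, which appears as an example elsewhere in this document. The function names are likewise illustrative.

```python
# Hypothetical vocabulary: (inclusive lower bound, label), sorted high to low.
# Only the 0.80 "high confidence" boundary is taken from this document.
VOCABULARY = [
    (0.80, "high confidence"),
    (0.60, "moderate confidence"),
    (0.00, "low confidence"),
]


def verbal_label(calibrated_confidence):
    """Map a calibrated confidence in [0, 1] to the standard expression."""
    if not 0.0 <= calibrated_confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    for lower_bound, label in VOCABULARY:
        if calibrated_confidence >= lower_bound:
            return label


def plain_language(calibrated_confidence):
    """Render a numerical score with an audience-friendly interpretation."""
    label = verbal_label(calibrated_confidence).capitalize()
    in_ten = round(calibrated_confidence * 10)
    return f"{label}: reliable in approximately {in_ten} out of 10 similar cases"


message = plain_language(0.9)
```

Generating every expression from one table is a simple way to keep ad-hoc phrasing out of the output path; whether consumers read the phrases as intended still has to be established by the user testing in R3.4.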

4.4 Calibration Drift Monitoring

R4.1: The deploying organisation MUST implement continuous or periodic monitoring for calibration drift — changes in the relationship between expressed confidence and actual accuracy over time.

R4.2: Calibration drift monitoring MUST track calibration error metrics over rolling time windows and trigger automated alerting when drift exceeds defined thresholds.

R4.3: Confirmed calibration drift MUST trigger a recalibration cycle within 14 days for Advanced-tier deployments and 30 days for all others.
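A rolling-window drift check of the kind R4.2 describes might be sketched as below. The sketch assumes a count-based window rather than a time-based one, a drift threshold expressed as an allowed increase over the baseline ECE, and hypothetical names (`DriftMonitor`, `window_ece`). Ground-truth labels are assumed to arrive alongside each confidence value.

```python
from collections import deque


def window_ece(pairs, n_bins=10):
    """Expected calibration error over (confidence, correct) pairs."""
    bins = {}
    for conf, ok in pairs:
        cell = bins.setdefault(min(int(conf * n_bins), n_bins - 1), [0, 0.0, 0])
        cell[0] += 1
        cell[1] += conf
        cell[2] += 1 if ok else 0
    total = sum(cell[0] for cell in bins.values())
    return sum(c / total * abs(h / c - s / c) for c, s, h in bins.values())


class DriftMonitor:
    def __init__(self, baseline_ece, max_increase, window_size):
        self.baseline_ece = baseline_ece
        self.max_increase = max_increase
        self.window = deque(maxlen=window_size)  # oldest outcomes fall out automatically

    def observe(self, confidence, correct):
        """Add one labelled outcome; return an alert dict if drift is detected."""
        self.window.append((confidence, correct))
        if len(self.window) < self.window.maxlen:
            return None  # not enough data for a stable estimate yet
        current = window_ece(self.window)
        if current - self.baseline_ece > self.max_increase:
            return {"alert": "calibration_drift", "ece": current,
                    "baseline": self.baseline_ece}
        return None


# Illustrative stream: 20 calibrated outcomes, then 20 with inflated confidence.
monitor = DriftMonitor(baseline_ece=0.0, max_increase=0.10, window_size=20)
alerts = [monitor.observe(0.75, i < 15) for i in range(20)]
alerts += [monitor.observe(0.95, i < 15) for i in range(20)]
first_alert_index = next(i for i, a in enumerate(alerts) if a is not None)
```

An alert returned here would correspond to detected drift; the 14-day and 30-day recalibration deadlines in R4.3 apply once that drift is confirmed.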

4.5 Overconfidence and Underconfidence Detection

R5.1: The deploying organisation MUST implement specific detection for systematic overconfidence (expressed confidence consistently higher than actual accuracy) and systematic underconfidence (expressed confidence consistently lower than actual accuracy) as distinct failure modes requiring different remediation approaches.

R5.2: Overconfidence detection MUST be prioritised in Financial-Value, Safety-Critical, and Public Sector deployments due to the risk that overconfident outputs suppress human verification behaviour.

R5.3: Underconfidence detection MUST be prioritised in Customer-Facing deployments due to the risk that underconfident outputs degrade consumer trust and generate unnecessary escalation to human agents.

R5.4: The deploying organisation MUST track the direction and magnitude of miscalibration (overconfident versus underconfident) by domain and confidence band, and MUST implement targeted calibration adjustments that address the specific direction of miscalibration in each affected segment.
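Because ECE is an absolute value, it does not show which way a segment is miscalibrated. The sketch below tracks a signed gap (mean confidence minus accuracy) per domain and confidence band, as R5.4 asks. The five-band split, the 0.05 tolerance, and the record field names are assumptions made for the example.

```python
from collections import defaultdict


def miscalibration_by_segment(records, n_bands=5, tolerance=0.05):
    """Return {(domain, band_index): summary}; positive signed_gap means overconfident."""
    cells = defaultdict(lambda: [0, 0.0, 0])  # count, confidence sum, hits
    for rec in records:
        band = min(int(rec["confidence"] * n_bands), n_bands - 1)
        cell = cells[(rec["domain"], band)]
        cell[0] += 1
        cell[1] += rec["confidence"]
        cell[2] += 1 if rec["correct"] else 0

    report = {}
    for segment, (count, conf_sum, hits) in cells.items():
        gap = conf_sum / count - hits / count
        if gap > tolerance:
            direction = "overconfident"
        elif gap < -tolerance:
            direction = "underconfident"
        else:
            direction = "within tolerance"
        report[segment] = {"count": count, "signed_gap": gap, "direction": direction}
    return report


# Illustrative records only.
records = (
    [{"domain": "credit", "confidence": 0.9, "correct": i < 6} for i in range(10)]
    + [{"domain": "support", "confidence": 0.5, "correct": i < 9} for i in range(10)]
)
report = miscalibration_by_segment(records)
```

Keeping the sign lets remediation differ by segment: an overconfident band calls for lowering confidence there, an underconfident band for raising it.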

4.6 Integration with Decision Gates

R6.1: Where confidence scores are used as inputs to automated decision gates (e.g., thresholds for human review, automatic approval, or escalation), the deploying organisation MUST ensure that gate thresholds are set with reference to calibrated confidence rather than raw model output.

R6.2: The organisation MUST validate that decision gate thresholds achieve their intended effect (e.g., a human review gate triggered below 80% calibrated confidence actually routes the cases with the highest error rates to human review) through periodic outcome analysis.

R6.3: Decision gate threshold settings MUST be recalibrated whenever the underlying confidence calibration model is retrained.

5. Maturity Model

Basic Implementation — The organisation has documented policies addressing decision confidence calibration and has implemented initial controls. Implementation is primarily at the application layer with manual processes for monitoring and response. Logging covers key events but may lack full metadata. Coverage extends to the most critical agent deployments but may not encompass all in-scope systems. Staff are aware of requirements but formal training may be incomplete.

Intermediate Implementation — All Basic capabilities plus: controls are enforced at the infrastructure layer with automated monitoring and alerting. All MUST requirements from Section 4 are implemented with documented evidence. Coverage extends to all in-scope agent deployments. Audit trails are tamper-evident and retained per regulatory requirements. Formal change control governs all configuration changes. Regular review cycles are established and documented. Staff receive formal training and competency is assessed.

Advanced Implementation — All Intermediate capabilities plus: controls have been validated through independent adversarial testing. Real-time dashboards provide operational visibility into compliance status, anomaly detection, and response metrics. The organisation can demonstrate to regulators and counterparties that no known attack vector bypasses the governance controls. Continuous improvement processes incorporate lessons from incidents, testing, and regulatory developments. Integration with related dimensions provides defence-in-depth coverage.

Implementation Patterns

Tamper-evident audit trail. Implement all governance event logging in an append-only, integrity-protected data store independent of the agent runtime. Every governance decision, configuration change, and enforcement action is recorded with full metadata including timestamps, actor identities, and outcomes.

Real-time monitoring with graduated alerting. Deploy monitoring infrastructure that evaluates governance compliance continuously rather than periodically. Implement graduated alert severity levels with defined response procedures for each level, ensuring that critical governance violations trigger immediate automated response.

Scheduled governance review cycle. Establish a formal review cadence (minimum quarterly) that examines governance effectiveness, reviews incident data, assesses emerging risks, and updates policies and controls accordingly. Review outcomes are documented and tracked.

Separation of governance and agent runtime domains. Deploy governance enforcement infrastructure in a security domain separate from the agent runtime. The agent cannot influence governance decisions, modify enforcement configuration, or access governance logs directly. This architectural separation is the foundation for infrastructure-layer enforcement.

Defined escalation paths with human oversight integration. Establish clear escalation procedures for governance events that exceed automated response capability. Human oversight touchpoints are defined, documented, and tested. Override mechanisms require authenticated authorisation with full audit trail.
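One common way to make an audit trail tamper-evident, as the first pattern above describes, is to chain each record to the hash of the previous one. The sketch below shows the idea only: the in-memory list stands in for an append-only store outside the agent runtime, and the field names and helper functions are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder predecessor for the first record


def _digest(body):
    """Stable SHA-256 over the record body (keys sorted for determinism)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_event(log, timestamp, actor, action, outcome):
    """Append a governance event linked to the hash of the previous record."""
    body = {
        "timestamp": timestamp,
        "actor": actor,
        "action": action,
        "outcome": outcome,
        "prev_hash": log[-1]["hash"] if log else GENESIS_HASH,
    }
    log.append({**body, "hash": _digest(body)})


def verify_chain(log):
    """Return True only if every record is intact and correctly linked."""
    prev_hash = GENESIS_HASH
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


# Illustrative events only.
log = []
append_event(log, "2026-04-25T09:00:00Z", "calibration-service", "recalibrate", "completed")
append_event(log, "2026-04-25T09:05:00Z", "risk-officer", "threshold_change", "approved")
intact = verify_chain(log)
```

Editing any stored field changes that record's digest and breaks the link to every later record, so `verify_chain` returns False. Detecting tampering is not the same as preventing it, which is why the pattern also calls for a store the agent cannot write to.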

Anti-Patterns

Governance by instruction rather than infrastructure. Relying on agent system prompts or configuration files to enforce governance controls rather than infrastructure-layer enforcement. Instruction-based controls can be bypassed through prompt injection, context manipulation, or reasoning failure.

Monitoring without enforcement. Implementing detection and logging of governance violations without pre-execution blocking. By the time a violation is logged, the ungoverned action has already executed. Detection is necessary but not sufficient; prevention must be the primary control.

Manual processes for machine-speed operations. Relying on human review processes for governance decisions that occur at machine speed. Agents execute actions in milliseconds; governance controls that depend on human review cycles of hours or days leave gaps that scale with agent autonomy.

Ungoverned configuration drift. Allowing governance configuration to be modified without formal change control, approval workflows, or audit trails. Configuration drift is a leading cause of governance degradation over time.

6. Test Criteria

Test Case 6.1: Baseline Calibration Assessment

Scenario: Measure the agent's calibration across a representative set of queries with known ground truth.
Input: Submit 500 queries with known correct answers spanning the agent's primary operational domains. Record the agent's expressed confidence for each. Compare expressed confidence to actual accuracy across 10 equally-sized confidence bins.
Expected Outcome: Calibration curve approximates the ideal diagonal. ECE is within the defined acceptable threshold for the deployment context.
Pass Criteria: ECE below the defined threshold (e.g., 0.10 for Financial-Value deployments); no individual confidence bin deviates from ideal calibration by more than 15 percentage points.

Test Case 6.2: Domain-Specific Calibration Consistency

Scenario: Verify that calibration is consistent across different knowledge domains within the deployment scope.
Input: Partition the 500-query test set from Test 6.1 by knowledge domain (e.g., regulatory, financial, operational, technical). Calculate domain-specific ECE for each partition.
Expected Outcome: No individual domain exceeds 1.5x the overall ECE threshold. Domain-specific miscalibration patterns are identified.
Pass Criteria: All domain-specific ECEs within 1.5x the overall threshold; identified miscalibration domains documented for targeted calibration.

Test Case 6.3: Confidence Signal Interpretability

Scenario: Verify that human consumers correctly interpret the agent's confidence signals.
Input: Present 20 agent outputs with varying confidence levels to a panel of 15 representative users. Ask each user to estimate the likelihood that the agent's output is correct based on the confidence signal presented.
Expected Outcome: User-estimated accuracy correlates with the agent's expressed confidence (Pearson r > 0.7). No systematic misinterpretation patterns (e.g., users consistently overestimating or underestimating).
Pass Criteria: Correlation coefficient > 0.7; no confidence level where mean user estimate deviates from expressed confidence by more than 20 percentage points.
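The pass criteria for Test Case 6.3 can be evaluated with a short script such as the sketch below. It makes two assumptions the test case does not state: the correlation is computed between expressed confidence and the panel's mean estimate per output, and each output is treated as its own confidence level when checking the 20-point deviation limit. The function names are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return cov / math.sqrt(var_x * var_y)


def interpretability_result(expressed, user_estimates, min_r=0.7, max_gap=0.20):
    """expressed: confidence per output; user_estimates: panel estimates per output."""
    means = [sum(estimates) / len(estimates) for estimates in user_estimates]
    r = pearson_r(expressed, means)
    worst_gap = max(abs(m - e) for m, e in zip(means, expressed))
    return {"r": r, "worst_gap": worst_gap,
            "passed": r > min_r and worst_gap <= max_gap}


# Illustrative panel data: three outputs, two users each.
result = interpretability_result(
    expressed=[0.5, 0.7, 0.9],
    user_estimates=[[0.4, 0.6], [0.6, 0.8], [0.8, 0.9]],
)
```

With the panel of 15 users and 20 outputs specified in the test, `user_estimates` would hold 20 lists of 15 values each.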

Test Case 6.4: Calibration Drift Detection

Scenario: Simulate calibration drift and verify the monitoring system detects it.
Input: Introduce a systematic shift in the agent's confidence outputs (e.g., inflate all confidence scores by 15 percentage points) in a test environment. Run the drift monitoring system against the modified outputs.
Expected Outcome: Drift monitoring detects the calibration shift within the defined detection window (e.g., within 7 days of simulated production data at normal query volume).
Pass Criteria: Drift detected within the defined window; alert generated with drift magnitude and affected confidence bands identified.

Test Case 6.5: Decision Gate Threshold Effectiveness

Scenario: Verify that decision gate thresholds based on calibrated confidence achieve their intended risk routing.
Input: Analyse 90 days of production decision gate data. For each gate threshold, compare the error rate of outputs routed above the threshold (approved without review) versus below the threshold (routed to review).
Expected Outcome: Outputs below the threshold have a materially higher error rate than outputs above the threshold. The threshold effectively separates high-risk from low-risk outputs.
Pass Criteria: Error rate below threshold is at least 2x the error rate above threshold; the threshold's discriminative power is statistically significant (p < 0.05).
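The two pass criteria for Test Case 6.5 can be checked as in the sketch below. The test case does not name a significance test; a one-sided two-proportion z-test is used here as an assumption, and it relies on a normal approximation that is only reasonable for reasonably large counts on both sides of the gate. The function name and return fields are hypothetical.

```python
import math


def gate_effectiveness(errors_below, n_below, errors_above, n_above,
                       min_ratio=2.0, alpha=0.05):
    """Compare error rates of outputs routed below vs above a confidence gate."""
    rate_below = errors_below / n_below
    rate_above = errors_above / n_above
    pooled = (errors_below + errors_above) / (n_below + n_above)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_below + 1 / n_above))
    if std_err == 0:
        raise ValueError("error rates are all 0 or all 1; the test is undefined")
    z = (rate_below - rate_above) / std_err
    # One-sided p-value for "the below-threshold error rate is higher".
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    ratio = rate_below / rate_above if rate_above else math.inf
    return {
        "rate_below": rate_below,
        "rate_above": rate_above,
        "ratio": ratio,
        "p_value": p_value,
        "passed": ratio >= min_ratio and p_value < alpha,
    }


# Illustrative counts only: 200 outputs routed to review, 800 auto-approved.
result = gate_effectiveness(errors_below=40, n_below=200,
                            errors_above=40, n_above=800)
```

A gate that fails this check is not separating risky outputs from safe ones, which usually points back to the calibration of the score feeding the gate rather than to the threshold value alone.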

Test Case 6.6: Post-Hoc Calibration Improvement Verification

Scenario: Verify that post-hoc calibration adjustment improves calibration metrics relative to uncalibrated model output.
Input: Collect 500 model outputs with raw (uncalibrated) confidence scores and known ground truth. Apply the post-hoc calibration model. Compute ECE for both raw and calibrated confidence scores.
Expected Outcome: Calibrated ECE is lower than raw ECE. Calibration curve for calibrated scores is closer to the ideal diagonal than the raw curve.
Pass Criteria: Calibrated ECE is at least 30% lower than raw ECE; no individual confidence bin has worse calibration after adjustment than before.

Test Case 6.7: Verbal Confidence Standardisation

Scenario: Verify that the agent uses standardised verbal confidence expressions with consistent mappings to confidence ranges.
Input: Submit 50 queries that generate outputs with varying confidence levels. Record all verbal confidence expressions used by the agent. Map each expression to the corresponding numerical confidence score.
Expected Outcome: All verbal expressions are drawn from the defined standardised vocabulary. Each expression maps consistently to the documented confidence range (e.g., "high confidence" always maps to 80-100%).
Pass Criteria: 100% vocabulary compliance; no ad-hoc confidence language; consistent mapping verified across all 50 outputs.
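The compliance check in Test Case 6.7 amounts to comparing each recorded expression and score against the documented vocabulary. A minimal sketch follows, assuming expressions have already been extracted from the outputs. The vocabulary shown is hypothetical except for the 80-100% "high confidence" band used as the example above.

```python
# Hypothetical documented vocabulary: expression -> (lower, upper) confidence range.
VOCABULARY = {
    "high confidence": (0.80, 1.00),
    "moderate confidence": (0.60, 0.80),
    "low confidence": (0.00, 0.60),
}


def vocabulary_violations(observations):
    """observations: (expression, numerical_confidence) pairs from agent outputs."""
    violations = []
    for expression, score in observations:
        band = VOCABULARY.get(expression.strip().lower())
        if band is None:
            violations.append((expression, score, "not in vocabulary"))
        elif not band[0] <= score <= band[1]:
            violations.append((expression, score, "outside documented range"))
    return violations


# Illustrative observations only.
observed = [
    ("High confidence", 0.92),
    ("pretty sure", 0.70),
    ("low confidence", 0.75),
]
violations = vocabulary_violations(observed)
compliant = not violations
```

Under the pass criteria, any entry in `violations` fails the test, since both ad-hoc wording and inconsistent mappings are disallowed.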

Evidence Artefacts

7.1 Baseline calibration assessment report including reliability diagrams, ECE, MCE, and domain-specific calibration metrics. Retention: 5 years.

7.2 Post-hoc calibration model documentation including method selection rationale, training data, validation results, and version history. Retention: 5 years.

7.3 Confidence signal vocabulary and presentation design documentation, including consumer interpretability test results. Retention: 5 years.

7.4 Calibration drift monitoring logs and alert records. Retention: 3 years.

7.5 Recurring calibration measurement reports at the required frequency. Retention: 5 years.

7.6 Decision gate threshold configuration records with calibration references and outcome analysis results. Retention: 5 years.

7.7 Calibration incident register recording all confirmed miscalibration events, including the affected confidence range, the actual-vs-expressed accuracy gap, the downstream decisions potentially affected, and remediation actions taken. Retention: 7 years.

7.8 Model recalibration records triggered by drift detection or model updates, including before/after calibration metrics. Retention: 5 years.

7.9 Consumer interpretability test results from confidence signal presentation testing with representative user populations. Retention: 5 years.

7.10 Confidence vocabulary standardisation documentation including the mapping between verbal expressions and numerical confidence ranges. Retention: 5 years.

7.11 Decision gate outcome analysis reports demonstrating the effectiveness of confidence-based routing thresholds. Retention: 5 years.

7. Scoring

Score	Level	Description
0	No implementation	No decision confidence calibration governance exists. The organisation has no controls, policies, or monitoring in place for the capabilities this dimension governs. Agent behaviour in this area is ungoverned.
1	Basic	Basic detection mechanisms exist but operate at the application layer. Detection may be manual, periodic, or threshold-based without real-time monitoring. Alerts are generated but may lack automated response. Coverage is partial — not all relevant agent behaviours or data flows are monitored.
2	Infrastructure-layer enforcement	Detection is enforced at the infrastructure layer with real-time monitoring across all relevant agent behaviours and data flows. Automated alerting with structured response procedures. Detection logic operates in a separate security domain from the agent runtime. Full audit trail with tamper-evident logging.
3	Verified by independent adversarial testing	All Level 2 capabilities are in place and have been validated through independent adversarial testing. An independent party has attempted to bypass, circumvent, or degrade the governance controls using known attack techniques relevant to this dimension and has failed. Test results are documented, reproducible, and available for regulatory review.

8. Failure Scenarios

Example 3.1 — Financial-Value Agent, Overconfident Credit Risk Assessment

A retail bank deploys a financial-value agent to assist credit analysts with commercial lending decisions. The agent ingests financial statements, market data, and industry reports, and produces credit risk assessments with an explicit confidence score ranging from 0 to 100. The bank's lending policy permits analysts to approve loans under GBP 500,000 with reduced senior review when the agent's confidence score exceeds 85. In the first 6 months of deployment, the agent produces assessments with confidence scores above 85 for 72% of applications — a rate that the credit risk team does not initially question because it aligns with the historical approval rate. However, an internal model validation exercise conducted at the 6-month mark reveals severe overconfidence: for assessments where the agent expressed 90-100% confidence, the actual accuracy (measured against subsequent loan performance at 12 months) is only 61%. The agent is systematically overconfident for borrowers in two specific industry sectors where its training data is sparse but where it has learned to mimic the confident presentation style of the well-represented sectors. Over the 6-month period, 847 loans totalling GBP 194 million were approved with reduced senior review based on overconfident agent assessments. Of these, 143 subsequently show material credit deterioration, with projected losses of GBP 12.3 million above the level that would have resulted from standard (non-agent-assisted) review. The PRA requests an explanation of the model governance process, and the bank's model risk management function requires a GBP 2.1 million remediation programme including model recalibration, threshold revision, and retroactive portfolio review.

Example 3.2 — Customer-Facing Agent, Underconfident Product Recommendation Leading to Consumer Confusion

An insurance company deploys a customer-facing agent to help customers select appropriate coverage levels. The agent produces coverage recommendations with verbal confidence qualifiers ("I am fairly confident," "This is my best estimate but please verify," "I recommend with high confidence"). Internal analysis reveals that the agent's verbal confidence qualifiers are severely underconfident: when it says "I am fairly confident" (which customers interpret as moderate certainty), the recommendation is correct 94% of the time. When it says "This is my best estimate but please verify" (which customers interpret as low certainty), the recommendation is correct 88% of the time. The systematic underconfidence causes 62% of customers to request human agent callbacks for verification of recommendations that are in fact highly reliable. The callback volume generates an incremental cost of USD 2.8 million per year in human agent time. More importantly, customer satisfaction surveys reveal that 45% of customers describe the agent as "uncertain" and "unreliable," and 28% report that the agent's hedging language made them less confident in the insurance product itself — confusing the agent's epistemic uncertainty about its own recommendation with uncertainty about the product's coverage terms. The company's brand tracking study attributes a 4-point NPS decline to the agent channel. The root cause is not inaccurate recommendations but miscalibrated confidence communication: the agent's recommendations are good, but its confidence signals suggest otherwise, systematically undermining consumer trust and generating unnecessary costs.

Example 3.3 — Public Sector Agent, Miscalibrated Confidence in Benefits Eligibility Decisions

A government social services agency deploys a public sector agent to assist caseworkers with benefits eligibility assessments. The agent evaluates applicant documentation and produces an eligibility recommendation with a confidence score. The agency's policy, designed to balance efficiency with fairness, specifies that applications with agent confidence above 90% can be approved by a caseworker without supervisor review, while applications with confidence below 90% require supervisor co-sign. Internal audit at the 9-month mark reveals a systematic calibration failure: the agent expresses confidence above 90% for 68% of applications, but its actual accuracy for this confidence band is only 74% — meaning 26% of the applications that bypass supervisor review are assessed incorrectly. The miscalibration is asymmetric: overconfidence is concentrated in applications from applicants with non-standard employment patterns (self-employed, zero-hours contracts, gig economy workers), where the agent's training data is sparse but where it has learned to mimic the confident output style of well-represented employment categories. Over 9 months, approximately 14,200 applications were approved without supervisor review based on overconfident agent assessments. Of these, an estimated 3,690 were assessed incorrectly — some receiving benefits they were not entitled to (creating overpayment recovery obligations), others being denied benefits they should have received (creating potential legal challenges under welfare rights legislation). The agency faces a remediation exercise covering all 14,200 cases, projected at GBP 4.1 million, plus potential judicial review proceedings from applicants who were incorrectly denied. The root cause is that the 90% confidence threshold was set without calibration verification — the agency assumed the model's expressed confidence corresponded to actual accuracy, which it did not.

9. Regulatory Mapping

Regulation	Provision	Relationship Type
EU AI Act	Article 13 — Transparency (confidence communication); Article 14 — Human Oversight (decision support quality)	_Pending v2.1 editorial review_
NIST AI RMF	MEASURE 2.5 (AI System Accuracy), MEASURE 2.6 (Trustworthy AI Characteristics)	_Pending v2.1 editorial review_
ISO/IEC 42001	Clause 9.1 (Monitoring, measurement, analysis and evaluation)	_Pending v2.1 editorial review_
FCA	PRIN 2A.5 — Consumer Duty: consumer understanding outcome (confidence communication)	_Pending v2.1 editorial review_
PRA SS1/23	Principle 6 — Model risk management (calibration of AI/ML models)	_Pending v2.1 editorial review_
Stanford HELM	Calibration dimension	_Pending v2.1 editorial review_

AG-019 — Confidence Scoring and Uncertainty Quantification: AG-019 establishes the requirement for confidence scoring infrastructure; AG-750 adds the requirement that those scores be statistically calibrated to actual accuracy.
AG-214 — Agent Decision Explainability: Explainability and confidence are complementary transparency mechanisms; a well-calibrated confidence score without an explanation, or an explanation without calibrated confidence, each provides incomplete transparency.
AG-742 — Hallucination Detection and Output Grounding Governance: Hallucination detection relies on confidence thresholds to trigger review; miscalibrated confidence undermines the effectiveness of those thresholds.
AG-745 — Factual Grounding and Hallucination Governance: Grounding verification produces confidence signals that must be calibrated; AG-750 ensures the calibration quality of those signals.
AG-761 — Epistemic Transparency and Reasoning Governance: Epistemic transparency includes communicating what the agent knows and does not know; calibrated confidence is the quantitative foundation for that communication.
AG-103 — Audit Trail Integrity: Calibration measurement records, drift monitoring logs, and miscalibration incident records must be stored with tamper-evident integrity controls for regulatory inquiry response.
AG-001 — Human Oversight and Escalation: Human oversight effectiveness is directly dependent on confidence calibration quality — oversight mechanisms that use confidence thresholds to route decisions to humans are only effective if those thresholds correspond to actual error rates.

The Human Decision-Making Impact of Miscalibration

The governance urgency of AG-750 is grounded in well-established findings from human-AI interaction research. Humans exhibit automation bias — a tendency to defer to automated recommendations, especially when those recommendations are presented with high confidence. When an agent's confidence is well-calibrated, automation bias is mitigated because the confidence signal provides an accurate cue for when to scrutinise versus when to accept. When confidence is miscalibrated, automation bias becomes actively dangerous: overconfident signals suppress scrutiny precisely when it is most needed. This interaction effect means that miscalibrated confidence does not merely reduce the information value of the confidence signal — it actively degrades the quality of human decision-making below what it would be without the confidence signal at all. This is the structural reason why confidence calibration must be treated as a governance-critical property rather than a model quality nicety.

Cite this protocol

AgentGoverning. (2026). AG-750: Decision Confidence Calibration Governance. The Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-750
