Retrieved Evidence Confidence Governance requires that every piece of evidence retrieved from a knowledge base, vector store, or memory system carries a structured confidence score reflecting its quality, relevance, and reliability. The agent must expose these confidence scores in its reasoning and outputs, enabling users, downstream systems, and audit processes to assess the evidentiary basis for the agent's decisions. Without this control, agents present retrieved information as uniformly authoritative regardless of whether the retrieval match was strong or weak, the source verified or speculative, or the evidence current or stale. This dimension ensures that confidence is a first-class attribute of retrieved evidence.
Scenario A -- Low-Confidence Retrieval Presented as Definitive: A legal research agent retrieves a case citation in response to a query about employment discrimination. The vector similarity score between the query and the retrieved passage is 0.61 (on a 0-1 scale where 0.85+ is considered high confidence). The retrieved case is tangentially related -- it concerns discrimination but in a different jurisdiction and a different protected characteristic. The agent presents the citation without any confidence qualifier: "The relevant precedent is Smith v. Jones [2019], which established that..." The solicitor relies on this citation in a brief. Opposing counsel identifies the citation as inapplicable. The solicitor's credibility is damaged, and the client incurs £12,000 in additional legal costs to remedy the brief.
What went wrong: The retrieval system returned a low-confidence match. The agent presented the low-confidence retrieval as definitive without exposing the confidence score or qualifying the relevance. Consequence: £12,000 in wasted legal costs, solicitor reputation damage, client trust erosion.
Scenario B -- Multiple Retrieval Sources with Different Quality Levels: A customer-facing agent answers product safety questions. It retrieves information from three sources: the manufacturer's official safety data sheet (high confidence, verified source), a community forum post (low confidence, unverified), and an outdated product manual from 2019 (medium confidence, verified but stale). The agent synthesises all three sources equally and presents a composite answer. The community forum post contains an error about safe operating temperatures. The customer operates the product outside safe parameters based on the agent's advice, resulting in product damage and a minor burn injury.
What went wrong: Retrieved evidence from sources with different confidence levels was synthesised without weighting. The unverified forum post was treated as equal to the official safety data sheet. No confidence scoring differentiated the sources. Consequence: Product damage, minor personal injury, product liability claim, regulatory scrutiny.
Scenario C -- Embedding Similarity Mismatch Undetected: An enterprise knowledge agent retrieves a document about "Python migration" in response to a query about migrating a Python software application. The retrieved document is actually about the migration patterns of Burmese pythons (the snake) from a wildlife biology corpus that was inadvertently included in the knowledge base. The embedding similarity score is 0.73, which is below the system's intended confidence threshold of 0.80 but above the default retrieval cutoff of 0.60. The agent incorporates the wildlife content into its response about software migration.
What went wrong: The retrieval system returned a below-threshold match that should have been filtered or flagged. No confidence scoring mechanism prevented the low-quality retrieval from reaching the agent's reasoning. Consequence: Incorrect advice, user confusion, erosion of trust in the knowledge system.
Scope: This dimension applies to every AI agent that retrieves evidence from external knowledge bases, vector stores, document collections, or memory systems to inform its reasoning or outputs. This includes retrieval-augmented generation (RAG) systems, knowledge graph queries, database lookups, and any mechanism where the agent fetches information from a persistent store to ground its responses. The scope excludes agents that operate solely on their pre-trained knowledge without external retrieval. The test is: does the agent fetch information from an external source during response generation? If yes, confidence governance applies to every fetched item.
4.1. A conforming system MUST compute and assign a structured confidence score to every piece of retrieved evidence before it enters the agent's reasoning context.
4.2. A conforming system MUST incorporate at least three factors into the confidence score: retrieval similarity (how closely the evidence matches the query), source reliability (the verified quality of the source), and temporal freshness (how recently the evidence was created or verified).
4.3. A conforming system MUST enforce a minimum confidence threshold below which retrieved evidence is excluded from the agent's context, with the threshold configurable per use case.
4.4. A conforming system MUST expose confidence scores in the agent's outputs when evidence is cited, using a human-readable format (e.g., "high confidence," "medium confidence," "low confidence") alongside the numeric score.
4.5. A conforming system MUST log all retrieval events including: query, retrieved evidence identifiers, confidence scores, whether the evidence was included or excluded, and the threshold applied.
4.6. A conforming system SHOULD implement source-weighted retrieval where evidence from verified, authoritative sources receives a confidence boost and evidence from unverified or user-generated sources receives a confidence penalty.
4.7. A conforming system SHOULD apply graduated confidence thresholds based on the consequence severity of the decision: higher thresholds for safety-critical, financial, or legal decisions; lower thresholds for informational queries.
4.8. A conforming system SHOULD aggregate confidence across multiple retrieved evidence items, distinguishing between corroborated findings (multiple high-confidence sources agree) and isolated findings (single source, lower aggregate confidence).
4.9. A conforming system MAY implement confidence calibration by periodically comparing predicted confidence scores against actual accuracy (as determined by human review), and adjusting scoring parameters to improve calibration.
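As an illustration of requirements 4.1 and 4.4, a structured evidence record with human-readable labels might be sketched as follows. This is a minimal sketch, not a prescribed interface: the field names and the label band boundaries (0.80 and 0.60) are assumptions that each deployment would set for itself.

```python
from dataclasses import dataclass

def confidence_label(score: float) -> str:
    """Map a [0, 1] composite score to the human-readable labels of 4.4.
    Band boundaries here are illustrative assumptions."""
    if score >= 0.80:
        return "high confidence"
    if score >= 0.60:
        return "medium confidence"
    return "low confidence"

@dataclass(frozen=True)
class ScoredEvidence:
    """Structured confidence metadata attached to retrieved evidence (4.1),
    carrying the three mandatory sub-scores of 4.2 plus the composite."""
    evidence_id: str
    similarity: float          # retrieval similarity sub-score
    source_reliability: float  # verified quality of the source
    freshness: float           # temporal freshness sub-score
    composite: float           # weighted combination, normalised to [0, 1]

    @property
    def label(self) -> str:
        return confidence_label(self.composite)
```

Attaching both the numeric composite and the derived label to each item keeps requirement 4.4 satisfiable at output time without recomputing scores.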
Retrieval-augmented generation has become the dominant architecture for grounding AI agent responses in organisational knowledge. The agent queries a knowledge base, retrieves relevant passages, and uses them to generate a response. This architecture significantly reduces hallucination compared to pure generative approaches. However, it introduces a new category of risk: the quality of retrieval directly determines the quality of the response, and retrieval quality varies enormously.
Vector similarity scores -- the primary mechanism for retrieval relevance ranking -- are a useful signal but not a complete measure of evidence quality. A passage can have a high similarity score but be from an unreliable source, be outdated, or match on surface-level terminology while being substantively irrelevant (Scenario C). Conversely, a passage from the most authoritative possible source may have a moderate similarity score due to vocabulary differences between the query and the passage.
Confidence scoring addresses this by combining multiple quality signals into a composite measure that more accurately reflects the evidence's fitness for the agent's reasoning. The three mandatory factors -- retrieval similarity, source reliability, and temporal freshness -- capture the most significant quality dimensions. Retrieval similarity measures relevance. Source reliability measures trustworthiness. Temporal freshness measures currency. Together, they provide a substantially more informative quality signal than similarity alone.
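The recommended scoring pattern later in this section combines the three factors as a weighted sum with exponential freshness decay. A minimal sketch of that computation, using the financial-advisory weights and the stated decay constant as defaults:

```python
import math

def composite_confidence(similarity: float, source_reliability: float,
                         age_days: float,
                         w_sim: float = 0.4, w_src: float = 0.35,
                         w_fresh: float = 0.25) -> float:
    """Composite confidence per the recommended pattern:
    C = w_sim * S_similarity + w_src * S_source + w_fresh * S_freshness,
    with S_freshness = exp(-0.005 * age_days) (half-life ~139 days).
    Default weights are the financial-advisory example; weights must sum to 1.
    """
    assert abs(w_sim + w_src + w_fresh - 1.0) < 1e-9, "weights must sum to 1"
    freshness = math.exp(-0.005 * age_days)
    return w_sim * similarity + w_src * source_reliability + w_fresh * freshness
```

For the safety-critical profile, the same function would be called with w_sim=0.3, w_src=0.4, w_fresh=0.3; the decay constant of 0.005 per day yields a freshness half-life of ln(2)/0.005, approximately 139 days.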
The minimum confidence threshold prevents low-quality retrievals from entering the agent's context, which is critical because once low-quality evidence enters the context, the agent's reasoning treats it with the same weight as high-quality evidence. The agent cannot reliably self-assess the quality of its retrieved evidence because it lacks the external perspective to evaluate source reliability or temporal currency. The confidence scoring mechanism provides this external evaluation.
Exposing confidence scores in outputs is essential for user trust calibration. Users interact with AI agents across a spectrum of trust: some users accept agent outputs uncritically, while others appropriately scrutinise them. Confidence scores enable appropriate trust calibration by signalling when the agent's evidentiary basis is strong versus when it is weak. A response qualified with "based on high-confidence evidence from the official safety data sheet" invites different user behaviour than "based on limited evidence from a 2019 source."
Confidence scoring operates as a post-retrieval, pre-context layer. After the retrieval system returns candidate evidence, the confidence scorer evaluates each item, assigns a composite score, filters below-threshold items, and passes qualifying items to the agent's context with their scores attached.
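A minimal sketch of this post-retrieval gate, assuming a list-of-dicts evidence format and an injectable scoring function (both illustrative choices, not prescribed interfaces):

```python
def gate_evidence(candidates, score_fn, threshold, audit_log):
    """Post-retrieval, pre-context confidence gate (4.1, 4.3, 4.5).

    Scores each candidate, logs the inclusion decision, and passes only
    above-threshold items to the agent's context with scores attached.
    """
    admitted = []
    for item in candidates:
        score = score_fn(item)
        included = score >= threshold
        audit_log.append({              # retrieval event log per 4.5
            "evidence_id": item["id"],
            "confidence": round(score, 3),
            "included": included,
            "threshold": threshold,
        })
        if included:
            admitted.append({**item, "confidence": round(score, 3)})
    return admitted
```

Because every candidate is logged whether or not it is admitted, the audit trail required by 4.5 records exclusions as well as inclusions, which is what makes later calibration against human review possible.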
Recommended Patterns:
C = w_sim * S_similarity + w_src * S_source + w_fresh * S_freshness, where each sub-score is normalised to [0, 1] and the weights sum to 1. Example weighting for a financial advisory use case: w_sim = 0.4, w_src = 0.35, w_fresh = 0.25. For a safety-critical use case: w_sim = 0.3, w_src = 0.4, w_fresh = 0.3. Source reliability scores are pre-assigned per source in a source registry: official documentation = 1.0, verified internal knowledge base = 0.9, internal wiki = 0.7, customer-generated content = 0.4, unverified external = 0.3. Freshness score decays with age: S_freshness = e^(-0.005 * age_days), giving a half-life of approximately 139 days.
Anti-Patterns to Avoid:
Financial Services. MiFID II requires that investment recommendations be based on reliable information. Confidence scoring provides the evidentiary standard for demonstrating that retrieval-based recommendations relied on high-quality sources. The financial threshold (C >= 0.75) should be validated against regulatory expectations. Retrieval logs with confidence scores provide audit evidence for suitability assessments.
Healthcare. Clinical decision support systems must distinguish between evidence grades. Confidence scoring aligns with evidence-based medicine grading (e.g., Grade A from randomised controlled trials versus Grade D from expert opinion). The safety-critical threshold (C >= 0.80) should be calibrated against clinical evidence standards.
Legal. Legal research agents must qualify citation confidence. A case that is semantically similar but from a different jurisdiction is lower confidence than an on-point case from the applicable jurisdiction. Source reliability scoring should account for jurisdiction, court level, and whether the case has been overruled.
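The graduated thresholds referenced in these domain notes can be held in a simple per-category registry. The sketch below uses the financial (C >= 0.75) and safety-critical (C >= 0.80) thresholds stated above; the legal and informational values, and the fail-closed fallback, are assumptions for illustration.

```python
# Illustrative graduated thresholds (4.7). Safety-critical and financial
# values mirror the domain notes above; the others are assumptions to be
# validated per deployment.
CONFIDENCE_THRESHOLDS = {
    "safety_critical": 0.80,
    "financial": 0.75,
    "legal": 0.75,
    "informational": 0.60,
}

def threshold_for(category: str) -> float:
    """Return the threshold for a decision category, falling back to the
    strictest configured threshold for unknown categories (fail closed)."""
    return CONFIDENCE_THRESHOLDS.get(category, max(CONFIDENCE_THRESHOLDS.values()))
```

Failing closed on unknown categories is a design choice: an unclassified decision is treated as if it were safety-critical rather than informational.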
Basic Implementation -- A composite confidence score is computed using retrieval similarity, source reliability, and freshness. A minimum confidence threshold filters low-quality retrievals. Confidence scores are logged. The agent receives confidence metadata with retrieved evidence. Scores are not exposed to end users. This meets the minimum mandatory requirements, but the absence of user-facing confidence scores limits users' ability to calibrate trust.
Intermediate Implementation -- All basic capabilities plus: confidence scores are exposed in agent outputs using human-readable labels. Graduated thresholds are configured by decision category. Corroboration aggregation boosts confidence for multi-source findings. Source reliability scores are maintained in a managed source registry with periodic review. Confidence scoring parameters are documented and versioned.
Advanced Implementation -- All intermediate capabilities plus: confidence calibration is performed quarterly by comparing predicted scores against actual accuracy. Calibration achieves a Brier score below 0.15. Dynamic threshold adjustment responds to risk signals (e.g., tighter thresholds during detected knowledge base degradation). The confidence scoring pipeline has been independently audited for accuracy and completeness. The organisation can demonstrate to regulators that agent decisions are grounded in evidence that meets defined quality standards.
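The calibration comparison in the advanced tier can be measured with a Brier score: the mean squared error between predicted confidence and the reviewed outcome. This sketch assumes binary human-review outcomes (1.0 if review confirmed the evidence was apt, 0.0 otherwise), which is one reasonable operationalisation rather than the only one.

```python
def brier_score(predicted, outcomes):
    """Brier score over paired (predicted confidence, reviewed outcome)
    samples. Lower is better-calibrated; the advanced tier targets < 0.15.
    Outcomes are assumed binary: 1.0 = evidence confirmed apt, 0.0 = not.
    """
    assert predicted and len(predicted) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(predicted, outcomes)) / len(predicted)
```

A quarterly calibration run would sample retrieval logs, obtain human-review outcomes for the sampled items, compute this score, and adjust the scoring weights if it exceeds the target.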
Required artefacts:
Retention requirements:
Access requirements:
Test 8.1: Minimum Threshold Enforcement
Test 8.2: Multi-Factor Score Computation
Test 8.3: Confidence Exposure in Agent Output
Test 8.4: Graduated Threshold Application
Test 8.5: Corroboration Aggregation
Test 8.6: Low-Similarity High-Source-Reliability Handling
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 13 (Transparency) | Direct requirement |
| EU AI Act | Article 15 (Accuracy, Robustness, Cybersecurity) | Direct requirement |
| EU AI Act | Article 9 (Risk Management System) | Supports compliance |
| MiFID II | Article 25 (Suitability) | Supports compliance |
| NIST AI RMF | MEASURE 2.5, MEASURE 2.6 | Supports compliance |
| ISO 42001 | Clause 8.2 (AI Risk Assessment), Clause 9.1 (Monitoring, Measurement, Analysis) | Supports compliance |
Article 13 requires that high-risk AI systems be designed to enable users to interpret the system's output and use it appropriately. Confidence scoring directly supports this by providing users with quality signals about the evidentiary basis of the agent's responses. Without confidence exposure, users cannot distinguish between responses grounded in strong evidence and those grounded in weak evidence, undermining the transparency requirement.
Article 15 requires appropriate levels of accuracy and robustness. Confidence scoring and threshold filtering directly support accuracy by preventing low-quality evidence from degrading response quality. The graduated threshold mechanism ensures that accuracy standards are proportionate to the consequence severity of the decision.
Suitability assessments must be based on reliable information. Confidence scoring provides a structured mechanism for ensuring that RAG-based advisory agents use evidence that meets defined quality standards. Retrieval logs with confidence scores provide auditable evidence of the information basis for each recommendation.
Article 9 requires a risk management system that identifies and mitigates foreseeable risks. Low-confidence retrieval entering agent reasoning is a foreseeable risk to output quality; confidence scoring and threshold filtering are the corresponding mitigation, and retrieval logs provide evidence that the mitigation operates.
MEASURE 2.5 addresses AI system output quality. MEASURE 2.6 addresses performance measurement. Confidence scoring provides a measurable quality signal for retrieved evidence, and calibration provides a mechanism for improving measurement accuracy over time.
Clause 8.2 requires AI risk assessment. Clause 9.1 requires monitoring and measurement. Confidence scoring is both a risk assessment mechanism (evaluating evidence quality) and a monitoring metric (tracking confidence distributions over time).
| Field | Value |
|---|---|
| Severity Rating | High |
| Blast Radius | Per-response -- every agent response grounded in retrieved evidence is affected |
Consequence chain: Without confidence scoring, the agent treats all retrieved evidence as equally authoritative. Low-quality retrievals (low similarity, unreliable sources, stale content) enter the agent's reasoning context and influence outputs with the same weight as high-quality evidence. The immediate failure is degraded response quality: users receive answers grounded in weak evidence without any signal that the evidence is weak. In legal research (Scenario A), this costs £12,000 per incident in remediation. In product safety (Scenario B), it can cause physical harm and product liability claims. In knowledge-intensive domains (Scenario C), it produces absurd outputs that erode trust. The blast radius is per-response: every response that incorporates retrieved evidence is at risk. At scale -- an agent processing 1,000 queries per day with a 15% low-quality retrieval rate -- this means approximately 150 responses per day are grounded in evidence that would have been filtered or flagged by confidence scoring.
Cross-references: AG-040 (Persistent Memory Governance) provides the memory management framework from which evidence is retrieved. AG-082 (Data Minimisation Enforcement) reduces noise in the knowledge base, improving baseline retrieval quality. AG-122 (Knowledge Integrity Verification) ensures the integrity of knowledge sources that feed confidence scoring. AG-132 (Memory Scope Boundary Enforcement) constrains retrieval scope. AG-179 (Memory Audit Trail Governance) captures retrieval event audit trails. AG-332 (Memory Conflict Resolution Governance) resolves conflicts between retrieved evidence items. AG-334 (Retrieval Scope Minimisation Governance) minimises retrieval breadth, which concentrates retrieval on higher-relevance items. AG-335 (Citation Completeness Governance) requires complete citations that benefit from confidence annotations. AG-336 (Knowledge Freshness Attestation Governance) provides the freshness data that feeds the temporal freshness sub-score.