Retrieved Evidence Confidence Governance requires that every piece of evidence retrieved from a knowledge base, vector store, or memory system carries a structured confidence score reflecting its quality, relevance, and reliability. The agent must expose these confidence scores in its reasoning and outputs, enabling users, downstream systems, and audit processes to assess the evidentiary basis for the agent's decisions. Without this control, agents present retrieved information as uniformly authoritative regardless of whether the retrieval match was strong or weak, the source verified or speculative, or the evidence current or stale. This dimension ensures that confidence is a first-class attribute of retrieved evidence.
Scenario A -- Low-Confidence Retrieval Presented as Definitive: A legal research agent retrieves a case citation in response to a query about employment discrimination. The vector similarity score between the query and the retrieved passage is 0.61 (on a 0-1 scale where 0.85+ is considered high confidence). The retrieved case is tangentially related -- it concerns discrimination but in a different jurisdiction and a different protected characteristic. The agent presents the citation without any confidence qualifier: "The relevant precedent is Smith v. Jones [2019], which established that..." The solicitor relies on this citation in a brief. Opposing counsel identifies the citation as inapplicable. The solicitor's credibility is damaged, and the client incurs £12,000 in additional legal costs to remedy the brief.
What went wrong: The retrieval system returned a low-confidence match. The agent presented the low-confidence retrieval as definitive without exposing the confidence score or qualifying the relevance. Consequence: £12,000 in wasted legal costs, solicitor reputation damage, client trust erosion.
Scenario B -- Multiple Retrieval Sources with Different Quality Levels: A customer-facing agent answers product safety questions. It retrieves information from three sources: the manufacturer's official safety data sheet (high confidence, verified source), a community forum post (low confidence, unverified), and an outdated product manual from 2019 (medium confidence, verified but stale). The agent synthesises all three sources equally and presents a composite answer. The community forum post contains an error about safe operating temperatures. The customer operates the product outside safe parameters based on the agent's advice, resulting in product damage and a minor burn injury.
What went wrong: Retrieved evidence from sources with different confidence levels was synthesised without weighting. The unverified forum post was treated as equal to the official safety data sheet. No confidence scoring differentiated the sources. Consequence: Product damage, minor personal injury, product liability claim, regulatory scrutiny.
Scenario C -- Embedding Similarity Mismatch Undetected: An enterprise knowledge agent retrieves a document about "Python migration" in response to a query about migrating a Python software application. The retrieved document is actually about the migration patterns of Burmese pythons (the snake) from a wildlife biology corpus that was inadvertently included in the knowledge base. The embedding similarity score is 0.73, which is below the system's intended confidence threshold of 0.80 but above the default retrieval cutoff of 0.60. The agent incorporates the wildlife content into its response about software migration.
What went wrong: The retrieval system returned a below-threshold match that should have been filtered or flagged. No confidence scoring mechanism prevented the low-quality retrieval from reaching the agent's reasoning. Consequence: Incorrect advice, user confusion, erosion of trust in the knowledge system.
Scope: This dimension applies to every AI agent that retrieves evidence from external knowledge bases, vector stores, document collections, or memory systems to inform its reasoning or outputs. This includes retrieval-augmented generation (RAG) systems, knowledge graph queries, database lookups, and any mechanism where the agent fetches information from a persistent store to ground its responses. The scope excludes agents that operate solely on their pre-trained knowledge without external retrieval. The test is: does the agent fetch information from an external source during response generation? If yes, confidence governance applies to every fetched item.
4.1. A conforming system MUST compute and assign a structured confidence score to every piece of retrieved evidence before it enters the agent's reasoning context.
4.2. A conforming system MUST incorporate at least three factors into the confidence score: retrieval similarity (how closely the evidence matches the query), source reliability (the verified quality of the source), and temporal freshness (how recently the evidence was created or verified).
4.3. A conforming system MUST enforce a minimum confidence threshold below which retrieved evidence is excluded from the agent's context, with the threshold configurable per use case.
4.4. A conforming system MUST expose confidence scores in the agent's outputs when evidence is cited, using a human-readable format (e.g., "high confidence," "medium confidence," "low confidence") alongside the numeric score.
4.5. A conforming system MUST log all retrieval events including: query, retrieved evidence identifiers, confidence scores, whether the evidence was included or excluded, and the threshold applied.
4.6. A conforming system SHOULD implement source-weighted retrieval where evidence from verified, authoritative sources receives a confidence boost and evidence from unverified or user-generated sources receives a confidence penalty.
4.7. A conforming system SHOULD apply graduated confidence thresholds based on the consequence severity of the decision: higher thresholds for safety-critical, financial, or legal decisions; lower thresholds for informational queries.
4.8. A conforming system SHOULD aggregate confidence across multiple retrieved evidence items, distinguishing between corroborated findings (multiple high-confidence sources agree) and isolated findings (single source, lower aggregate confidence).
4.9. A conforming system MAY implement confidence calibration by periodically comparing predicted confidence scores against actual accuracy (as determined by human review), and adjusting scoring parameters to improve calibration.
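As an illustration of requirements 4.1 and 4.4, a structured evidence record with human-readable labels might be sketched as follows. This is a minimal sketch, not a prescribed interface: the field names and the label band boundaries (0.80 and 0.60) are assumptions that each deployment would set for itself.

```python
from dataclasses import dataclass

def confidence_label(score: float) -> str:
    """Map a [0, 1] composite score to the human-readable labels of 4.4.
    Band boundaries here are illustrative assumptions."""
    if score >= 0.80:
        return "high confidence"
    if score >= 0.60:
        return "medium confidence"
    return "low confidence"

@dataclass(frozen=True)
class ScoredEvidence:
    """Structured confidence metadata attached to retrieved evidence (4.1),
    carrying the three mandatory sub-scores of 4.2 plus the composite."""
    evidence_id: str
    similarity: float          # retrieval similarity sub-score
    source_reliability: float  # verified quality of the source
    freshness: float           # temporal freshness sub-score
    composite: float           # weighted combination, normalised to [0, 1]

    @property
    def label(self) -> str:
        return confidence_label(self.composite)
```

Attaching both the numeric composite and the derived label to each item keeps requirement 4.4 satisfiable at output time without recomputing scores.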
Retrieval-augmented generation has become the dominant architecture for grounding AI agent responses in organisational knowledge. The agent queries a knowledge base, retrieves relevant passages, and uses them to generate a response. This architecture significantly reduces hallucination compared to pure generative approaches. However, it introduces a new category of risk: the quality of retrieval directly determines the quality of the response, and retrieval quality varies enormously.
Vector similarity scores -- the primary mechanism for retrieval relevance ranking -- are a useful signal but not a complete measure of evidence quality. A passage can have a high similarity score but be from an unreliable source, be outdated, or match on surface-level terminology while being substantively irrelevant (Scenario C). Conversely, a passage from the most authoritative possible source may have a moderate similarity score due to vocabulary differences between the query and the passage.
Confidence scoring addresses this by combining multiple quality signals into a composite measure that more accurately reflects the evidence's fitness for the agent's reasoning. The three mandatory factors -- retrieval similarity, source reliability, and temporal freshness -- capture the most significant quality dimensions. Retrieval similarity measures relevance. Source reliability measures trustworthiness. Temporal freshness measures currency. Together, they provide a substantially more informative quality signal than similarity alone.
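The recommended scoring pattern later in this section combines the three factors as a weighted sum with exponential freshness decay. A minimal sketch of that computation, using the financial-advisory weights and the stated decay constant as defaults:

```python
import math

def composite_confidence(similarity: float, source_reliability: float,
                         age_days: float,
                         w_sim: float = 0.4, w_src: float = 0.35,
                         w_fresh: float = 0.25) -> float:
    """Composite confidence per the recommended pattern:
    C = w_sim * S_similarity + w_src * S_source + w_fresh * S_freshness,
    with S_freshness = exp(-0.005 * age_days) (half-life ~139 days).
    Default weights are the financial-advisory example; weights must sum to 1.
    """
    assert abs(w_sim + w_src + w_fresh - 1.0) < 1e-9, "weights must sum to 1"
    freshness = math.exp(-0.005 * age_days)
    return w_sim * similarity + w_src * source_reliability + w_fresh * freshness
```

For the safety-critical profile, the same function would be called with w_sim=0.3, w_src=0.4, w_fresh=0.3; the decay constant of 0.005 per day yields a freshness half-life of ln(2)/0.005, approximately 139 days.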
The minimum confidence threshold prevents low-quality retrievals from entering the agent's context, which is critical because once low-quality evidence enters the context, the agent's reasoning treats it with the same weight as high-quality evidence. The agent cannot reliably self-assess the quality of its retrieved evidence because it lacks the external perspective to evaluate source reliability or temporal currency. The confidence scoring mechanism provides this external evaluation.
Exposing confidence scores in outputs is essential for user trust calibration. Users interact with AI agents across a spectrum of trust: some users accept agent outputs uncritically, while others appropriately scrutinise them. Confidence scores enable appropriate trust calibration by signalling when the agent's evidentiary basis is strong versus when it is weak. A response qualified with "based on high-confidence evidence from the official safety data sheet" invites different user behaviour than "based on limited evidence from a 2019 source."
Confidence scoring operates as a post-retrieval, pre-context layer. After the retrieval system returns candidate evidence, the confidence scorer evaluates each item, assigns a composite score, filters below-threshold items, and passes qualifying items to the agent's context with their scores attached.
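A minimal sketch of this post-retrieval gate, assuming a list-of-dicts evidence format and an injectable scoring function (both illustrative choices, not prescribed interfaces):

```python
def gate_evidence(candidates, score_fn, threshold, audit_log):
    """Post-retrieval, pre-context confidence gate (4.1, 4.3, 4.5).

    Scores each candidate, logs the inclusion decision, and passes only
    above-threshold items to the agent's context with scores attached.
    """
    admitted = []
    for item in candidates:
        score = score_fn(item)
        included = score >= threshold
        audit_log.append({              # retrieval event log per 4.5
            "evidence_id": item["id"],
            "confidence": round(score, 3),
            "included": included,
            "threshold": threshold,
        })
        if included:
            admitted.append({**item, "confidence": round(score, 3)})
    return admitted
```

Because every candidate is logged whether or not it is admitted, the audit trail required by 4.5 records exclusions as well as inclusions, which is what makes later calibration against human review possible.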
Recommended Patterns:
C = w_sim * S_similarity + w_src * S_source + w_fresh * S_freshness, where each sub-score is normalised to [0, 1] and the weights sum to 1. Example weighting for a financial advisory use case: w_sim = 0.4, w_src = 0.35, w_fresh = 0.25. For a safety-critical use case: w_sim = 0.3, w_src = 0.4, w_fresh = 0.3. Source reliability scores are pre-assigned per source in a source registry: official documentation = 1.0, verified internal knowledge base = 0.9, internal wiki = 0.7, customer-generated content = 0.4, unverified external = 0.3. Freshness score decays with age: S_freshness = e^(-0.005 * age_days), giving a half-life of approximately 139 days.
Anti-Patterns to Avoid:
Financial Services. MiFID II requires that investment recommendations be based on reliable information. Confidence scoring provides the evidentiary standard for demonstrating that retrieval-based recommendations relied on high-quality sources. The financial threshold (C >= 0.75) should be validated against regulatory expectations. Retrieval logs with confidence scores provide audit evidence for suitability assessments.
Healthcare. Clinical decision support systems must distinguish between evidence grades. Confidence scoring aligns with evidence-based medicine grading (e.g., Grade A from randomised controlled trials versus Grade D from expert opinion). The safety-critical threshold (C >= 0.80) should be calibrated against clinical evidence standards.
Legal. Legal research agents must qualify citation confidence. A case that is semantically similar but from a different jurisdiction is lower confidence than an on-point case from the applicable jurisdiction. Source reliability scoring should account for jurisdiction, court level, and whether the case has been overruled.
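The graduated thresholds referenced in these domain notes can be held in a simple per-category registry. The sketch below uses the financial (C >= 0.75) and safety-critical (C >= 0.80) thresholds stated above; the legal and informational values, and the fail-closed fallback, are assumptions for illustration.

```python
# Illustrative graduated thresholds (4.7). Safety-critical and financial
# values mirror the domain notes above; the others are assumptions to be
# validated per deployment.
CONFIDENCE_THRESHOLDS = {
    "safety_critical": 0.80,
    "financial": 0.75,
    "legal": 0.75,
    "informational": 0.60,
}

def threshold_for(category: str) -> float:
    """Return the threshold for a decision category, falling back to the
    strictest configured threshold for unknown categories (fail closed)."""
    return CONFIDENCE_THRESHOLDS.get(category, max(CONFIDENCE_THRESHOLDS.values()))
```

Failing closed on unknown categories is a design choice: an unclassified decision is treated as if it were safety-critical rather than informational.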
Basic Implementation -- A composite confidence score is computed using retrieval similarity, source reliability, and freshness. A minimum confidence threshold filters low-quality retrievals. Confidence scores are logged. The agent receives confidence metadata with retrieved evidence. Scores are not exposed to end users. This meets the minimum mandatory requirements, but the absence of user-facing confidence scores limits users' ability to calibrate trust.
Intermediate Implementation -- All basic capabilities plus: confidence scores are exposed in agent outputs using human-readable labels. Graduated thresholds are configured by decision category. Corroboration aggregation boosts confidence for multi-source findings. Source reliability scores are maintained in a managed source registry with periodic review. Confidence scoring parameters are documented and versioned.
Advanced Implementation -- All intermediate capabilities plus: confidence calibration is performed quarterly by comparing predicted scores against actual accuracy. Calibration achieves a Brier score below 0.15. Dynamic threshold adjustment responds to risk signals (e.g., tighter thresholds during detected knowledge base degradation). The confidence scoring pipeline has been independently audited for accuracy and completeness. The organisation can demonstrate to regulators that agent decisions are grounded in evidence that meets defined quality standards.
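The calibration comparison in the advanced tier can be measured with a Brier score: the mean squared error between predicted confidence and the reviewed outcome. This sketch assumes binary human-review outcomes (1.0 if review confirmed the evidence was apt, 0.0 otherwise), which is one reasonable operationalisation rather than the only one.

```python
def brier_score(predicted, outcomes):
    """Brier score over paired (predicted confidence, reviewed outcome)
    samples. Lower is better-calibrated; the advanced tier targets < 0.15.
    Outcomes are assumed binary: 1.0 = evidence confirmed apt, 0.0 = not.
    """
    assert predicted and len(predicted) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(predicted, outcomes)) / len(predicted)
```

A quarterly calibration run would sample retrieval logs, obtain human-review outcomes for the sampled items, compute this score, and adjust the scoring weights if it exceeds the target.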
Required artefacts:
Retention requirements:
Access requirements:
Test 8.1: Minimum Threshold Enforcement
Test 8.2: Multi-Factor Score Computation
Test 8.3: Confidence Exposure in Agent Output
Test 8.4: Graduated Threshold Application
Test 8.5: Corroboration Aggregation
Test 8.6: Low-Similarity High-Source-Reliability Handling
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 13 (Transparency) | Direct requirement |
| EU AI Act | Article 15 (Accuracy, Robustness, Cybersecurity) | Direct requirement |
| EU AI Act | Article 9 (Risk Management System) | Supports compliance |
| MiFID II | Article 25 (Suitability) | Supports compliance |
| NIST AI RMF | MEASURE 2.5, MEASURE 2.6 | Supports compliance |
| ISO 42001 | Clause 8.2 (AI Risk Assessment), Clause 9.1 (Monitoring, Measurement, Analysis) | Supports compliance |
Article 13 requires that high-risk AI systems be designed to enable users to interpret the system's output and use it appropriately. Confidence scoring directly supports this by providing users with quality signals about the evidentiary basis of the agent's responses. Without confidence exposure, users cannot distinguish between responses grounded in strong evidence and those grounded in weak evidence, undermining the transparency requirement.
Article 15 requires appropriate levels of accuracy and robustness. Confidence scoring and threshold filtering directly support accuracy by preventing low-quality evidence from degrading response quality. The graduated threshold mechanism ensures that accuracy standards are proportionate to the consequence severity of the decision.
Suitability assessments must be based on reliable information. Confidence scoring provides a structured mechanism for ensuring that RAG-based advisory agents use evidence that meets defined quality standards. Retrieval logs with confidence scores provide auditable evidence of the information basis for each recommendation.
Article 9 requires a risk management system that identifies and mitigates foreseeable risks. Low-confidence retrieval entering agent reasoning is a foreseeable risk to output quality; confidence scoring and threshold filtering are the corresponding mitigation, and retrieval logs provide evidence that the mitigation operates.
MEASURE 2.5 addresses AI system output quality. MEASURE 2.6 addresses performance measurement. Confidence scoring provides a measurable quality signal for retrieved evidence, and calibration provides a mechanism for improving measurement accuracy over time.
Clause 8.2 requires AI risk assessment. Clause 9.1 requires monitoring and measurement. Confidence scoring is both a risk assessment mechanism (evaluating evidence quality) and a monitoring metric (tracking confidence distributions over time).
| Field | Value |
|---|---|
| Severity Rating | High |
| Blast Radius | Per-response -- every agent response grounded in retrieved evidence is affected |
Consequence chain: Without confidence scoring, the agent treats all retrieved evidence as equally authoritative. Low-quality retrievals (low similarity, unreliable sources, stale content) enter the agent's reasoning context and influence outputs with the same weight as high-quality evidence. The immediate failure is degraded response quality: users receive answers grounded in weak evidence without any signal that the evidence is weak. In legal research (Scenario A), this costs £12,000 per incident in remediation. In product safety (Scenario B), it can cause physical harm and product liability claims. In knowledge-intensive domains (Scenario C), it produces absurd outputs that erode trust. The blast radius is per-response: every response that incorporates retrieved evidence is at risk. At scale -- an agent processing 1,000 queries per day with a 15% low-quality retrieval rate -- this means approximately 150 responses per day are grounded in evidence that would have been filtered or flagged by confidence scoring.
Cross-references: AG-040 (Persistent Memory Governance) provides the memory management framework from which evidence is retrieved. AG-082 (Data Minimisation Enforcement) reduces noise in the knowledge base, improving baseline retrieval quality. AG-122 (Knowledge Integrity Verification) ensures the integrity of knowledge sources that feed confidence scoring. AG-132 (Memory Scope Boundary Enforcement) constrains retrieval scope. AG-179 (Memory Audit Trail Governance) captures retrieval event audit trails. AG-332 (Memory Conflict Resolution Governance) resolves conflicts between retrieved evidence items. AG-334 (Retrieval Scope Minimisation Governance) minimises retrieval breadth, which concentrates retrieval on higher-relevance items. AG-335 (Citation Completeness Governance) requires complete citations that benefit from confidence annotations. AG-336 (Knowledge Freshness Attestation Governance) provides the freshness data that feeds the temporal freshness sub-score.