AG-152

Evaluation Integrity and Benchmark Leakage Governance

Truth, Reward & Evaluation Integrity · AGS v2.1 · April 2026
EU AI Act · SOX · FCA · NIST · ISO 42001

2. Summary

Evaluation Integrity and Benchmark Leakage Governance requires that the evaluation processes used to assess AI agent capability, safety, and governance compliance are themselves protected from contamination, gaming, and information leakage. When evaluation datasets leak into training data, when agents are optimised against specific benchmark questions rather than the underlying capabilities those benchmarks measure, or when evaluation processes are compromised by conflicts of interest, the resulting assessments are meaningless — they report capability that does not exist and safety that has not been verified. This dimension ensures that evaluation results reflect genuine agent properties rather than test-taking optimisation.

3. Example

Scenario A — Benchmark Leakage Through Training Data Contamination: An organisation evaluates its AI agent's regulatory compliance capability using a benchmark set of 500 regulatory scenarios. During a model fine-tuning cycle, a data engineer inadvertently includes the benchmark scenarios in the fine-tuning dataset. The agent's benchmark score rises from 71% to 96%. The organisation promotes the agent to production based on the improved score. In production, the agent's actual regulatory compliance accuracy is approximately 68% — lower than the pre-contamination benchmark indicated — because the fine-tuning on benchmark data displaced generalised regulatory knowledge. A regulatory audit 6 months later discovers that 23% of the agent's compliance assessments were incorrect, resulting in £890,000 in remediation costs and a regulatory warning.

What went wrong: The evaluation benchmark leaked into training data. The improved benchmark score reflected memorisation, not capability. No controls existed to detect benchmark contamination. No held-out evaluation set was maintained separately from the training pipeline.

Scenario B — Evaluation Gaming Through Benchmark-Specific Optimisation: A vendor developing an AI agent for safety-critical applications discovers that the industry's standard safety benchmark uses a specific format for presenting hazard scenarios. The vendor fine-tunes its agent specifically on the benchmark format — not on safety capability in general, but on recognising and correctly answering the specific question structures used in the benchmark. The agent achieves a 94% safety benchmark score, the highest in the industry. When deployed in production, the agent encounters hazard scenarios in formats it was not specifically trained on and fails to identify 31% of genuine safety hazards. A near-miss incident occurs when the agent fails to flag a chemical exposure risk that was presented in a conversational format rather than the structured benchmark format.

What went wrong: The evaluation measured format recognition, not safety capability. The vendor optimised for the test rather than the underlying skill. The benchmark's question format was predictable enough to enable targeted optimisation. No evaluation diversity controls existed to detect format-specific overfitting.

Scenario C — Compromised Evaluation Through Assessor Conflict of Interest: An organisation contracts with a third-party evaluator to assess its AI agent's governance compliance. The evaluator also provides consulting services to the same organisation on governance implementation. The evaluator has a financial incentive to report favourable results — a negative evaluation would undermine the value of the consulting engagement. The evaluation report identifies only minor findings and certifies the agent as compliant. An independent regulator-commissioned audit 8 months later identifies 14 material non-conformances, 5 of which were observable at the time of the original evaluation. The organisation faces enforcement action, and the evaluator faces professional liability claims.

What went wrong: The evaluator lacked independence from the evaluated organisation. The conflict of interest biased the evaluation toward favourable results. No structural controls existed to ensure evaluator independence.

4. Requirement Statement

Scope: This dimension applies to all AI agent systems that undergo evaluation, benchmarking, or compliance assessment — whether internal or external, pre-deployment or ongoing. This includes capability benchmarks, safety evaluations, governance compliance assessments, red-team exercises, and any process whose results influence deployment decisions, regulatory certifications, or stakeholder confidence in agent properties. Systems that are never formally evaluated are technically excluded, though such systems are inherently non-conformant with any governance framework that requires evaluation.

4.1. A conforming system MUST maintain evaluation datasets in a storage environment that is cryptographically separated from training data pipelines, with access controls that prevent any evaluation data from entering training, fine-tuning, or retrieval augmentation processes.

4.2. A conforming system MUST implement contamination detection checks before each evaluation cycle, verifying that evaluation items have not appeared in training data, fine-tuning data, or retrieval indices.
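The following is a minimal sketch of the pre-evaluation contamination check described in 4.2, assuming evaluation items and training documents are available as plain strings. The n-gram size and overlap threshold are illustrative choices, not prescribed values.

```python
# Minimal contamination check: exact-match and word n-gram overlap between
# evaluation items and a training corpus. Thresholds are illustrative.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word n-grams of a normalised text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_report(eval_items: list[str],
                         training_docs: list[str],
                         n: int = 8,
                         overlap_threshold: float = 0.3) -> list[dict]:
    """Flag evaluation items that appear verbatim in the training data or that
    share a suspicious fraction of n-grams with any training document."""
    training_exact = {doc.strip().lower() for doc in training_docs}
    training_ngrams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        training_ngrams |= ngrams(doc, n)

    findings = []
    for idx, item in enumerate(eval_items):
        item_ngrams = ngrams(item, n)
        overlap = (len(item_ngrams & training_ngrams) / len(item_ngrams)
                   if item_ngrams else 0.0)
        exact = item.strip().lower() in training_exact
        if exact or overlap >= overlap_threshold:
            findings.append({"item_index": idx,
                             "exact_match": exact,
                             "ngram_overlap": round(overlap, 3)})
    return findings
```

Any item returned by the check would block the evaluation cycle until the contamination is investigated and resolved.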

4.3. A conforming system MUST rotate at least 20% of evaluation items per evaluation cycle to detect overfitting to static benchmark content.

4.4. A conforming system MUST require that evaluators of governance compliance have no financial, contractual, or reporting relationship with the organisation being evaluated that could create a conflict of interest, or that any such relationship is disclosed and mitigated through structural controls.

4.5. A conforming system MUST maintain evaluation audit trails recording the evaluation dataset version, evaluation methodology, evaluator identity, evaluation date, and raw results — sufficient to reproduce the evaluation.
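One way to satisfy 4.5 is an append-only log of structured evaluation records. The sketch below assumes a Python tooling environment; the field names, the example evaluator name, and the hashing choice are illustrative rather than prescribed.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import date

@dataclass(frozen=True)
class EvaluationRecord:
    """One audit-trail entry; fields mirror clause 4.5."""
    dataset_version: str     # content hash or release tag of the evaluation set
    methodology: str         # reference to the documented evaluation procedure
    evaluator: str           # organisation or individual performing the evaluation
    evaluation_date: str     # ISO 8601 date
    raw_results_digest: str  # hash of the raw, item-level results file

def record_evaluation(dataset_version: str, methodology: str, evaluator: str,
                      raw_results: bytes) -> EvaluationRecord:
    """Build a tamper-evident record sufficient to reproduce the evaluation."""
    return EvaluationRecord(
        dataset_version=dataset_version,
        methodology=methodology,
        evaluator=evaluator,
        evaluation_date=date.today().isoformat(),
        raw_results_digest=hashlib.sha256(raw_results).hexdigest(),
    )

# Example: serialise the record for an append-only audit log.
rec = record_evaluation("eval-set-2026-04@a1b2c3", "AG-152 test spec 8.x",
                        "Independent Assurance Ltd (hypothetical)",
                        b"{...raw item-level results...}")
print(json.dumps(asdict(rec), indent=2))
```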

4.6. A conforming system SHOULD implement evaluation format diversity, presenting equivalent test scenarios in at least 3 different formats to detect format-specific overfitting.

4.7. A conforming system SHOULD maintain a held-out "canary" set of evaluation items that are never shared with any team involved in model development or fine-tuning, used to independently verify that benchmark improvements reflect genuine capability gains.
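A held-out canary set is useful precisely because its score should move in step with genuine capability. The following sketch, with illustrative numbers drawn from Scenario A, shows one way to flag a benchmark gain that the canary set does not reproduce; the tolerance value is an assumption.

```python
def canary_consistency_check(benchmark_before: float, benchmark_after: float,
                             canary_before: float, canary_after: float,
                             tolerance: float = 0.05) -> dict:
    """Compare the score gain on the shared benchmark with the gain on the
    held-out canary set. A large benchmark gain that the canary set does not
    reproduce is a signal of leakage or benchmark-specific optimisation."""
    benchmark_gain = benchmark_after - benchmark_before
    canary_gain = canary_after - canary_before
    suspicious = (benchmark_gain - canary_gain) > tolerance
    return {"benchmark_gain": round(benchmark_gain, 3),
            "canary_gain": round(canary_gain, 3),
            "suspicious": suspicious}

# Scenario A in section 3 would trip this check: the benchmark jumps from
# 0.71 to 0.96 while a genuine held-out set shows little or no gain.
print(canary_consistency_check(0.71, 0.96, 0.70, 0.69))
```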

4.8. A conforming system SHOULD conduct blind evaluations where the agent is not informed (through metadata, headers, or behavioural signals) that it is being evaluated, to prevent evaluation-specific behaviour.

4.9. A conforming system MAY implement dynamic evaluation generation, creating novel evaluation scenarios programmatically rather than relying solely on static benchmark sets.
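Where dynamic generation is adopted, one lightweight approach is template instantiation. The sketch below is illustrative only; the templates, slot values, and any validation of generated items would need to come from the relevant domain.

```python
import random

# Templates and slot values below are purely illustrative; a real generator
# would draw on domain-specific scenario banks and validate generated items.
TEMPLATES = [
    "A customer in {jurisdiction} requests {action}. Which disclosure rules apply?",
    "During {process}, the agent detects {hazard}. What is the required escalation path?",
]
SLOTS = {
    "jurisdiction": ["the UK", "Germany", "Singapore"],
    "action": ["early account closure", "a data-erasure request"],
    "process": ["customer onboarding", "a routine maintenance window"],
    "hazard": ["a possible chemical exposure", "an unverified identity document"],
}

def generate_items(count: int, seed: int = 0) -> list[str]:
    """Instantiate templates with randomly chosen slot values so that each
    evaluation cycle sees scenarios that cannot have been memorised verbatim."""
    rng = random.Random(seed)
    items = []
    for _ in range(count):
        template = rng.choice(TEMPLATES)
        values = {name: rng.choice(options) for name, options in SLOTS.items()}
        items.append(template.format(**values))  # unused slots are simply ignored
    return items

print(generate_items(3, seed=42))
```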

5. Rationale

The value of any evaluation is entirely dependent on the evaluation's integrity. An evaluation that has been contaminated by data leakage, gamed through benchmark-specific optimisation, or compromised by conflicts of interest is worse than no evaluation at all — because it generates false confidence in properties that have not been verified. Organisations, regulators, and the public make decisions based on evaluation results; false evaluation results lead to false decisions.

Benchmark leakage is a well-documented phenomenon in the machine learning community. Studies have demonstrated that performance gains on standard benchmarks can be entirely attributable to training data contamination rather than genuine capability improvement. In the governance context, this is particularly dangerous because evaluation results often form the basis for regulatory certifications, deployment decisions, and safety assertions. An agent certified as safe based on contaminated benchmarks is an agent whose safety has not been verified.

The relationship to AG-078 (Benchmark Coverage) is foundational: AG-078 ensures that benchmarks cover the relevant capability and safety dimensions; AG-152 ensures that the benchmark results are themselves trustworthy. Without AG-078, the evaluation may miss important dimensions. Without AG-152, the evaluation results for covered dimensions may be fabricated.

The evaluator independence requirement reflects a principle established across professional practice: auditors must be independent of the entities they audit (ISA 200, SOX). The same principle applies to AI governance evaluation — an evaluator with a financial interest in favourable results cannot be trusted to provide an objective assessment.

6. Implementation Guidance

Evaluation integrity requires controls at three levels: data-level controls (preventing benchmark leakage), process-level controls (ensuring evaluation methodology resists gaming), and structural controls (ensuring evaluator independence and audit trail integrity).
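At the data level, one possible control is to register a content fingerprint of every evaluation item and have the training ingestion pipeline reject any matching document. The sketch below assumes exact matching on normalised text only; paraphrased leakage still requires the n-gram and embedding checks described elsewhere in this protocol.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Content fingerprint of a whitespace-normalised, lower-cased item."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

class EvaluationBlocklist:
    """Registry of evaluation-item fingerprints consulted by the training
    ingestion pipeline. Only fingerprints leave the evaluation environment,
    so the items themselves are never exposed to the training side."""

    def __init__(self, eval_items: list[str]):
        self._blocked = {fingerprint(item) for item in eval_items}

    def filter_training_batch(self, docs: list[str]) -> tuple[list[str], int]:
        """Drop any training document whose fingerprint matches an evaluation item;
        return the retained documents and the number dropped, for audit logging."""
        kept = [doc for doc in docs if fingerprint(doc) not in self._blocked]
        return kept, len(docs) - len(kept)
```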

Recommended patterns:

- Keep evaluation datasets in a dedicated, access-controlled repository that is cryptographically separated from training and fine-tuning pipelines.
- Run contamination detection (exact-match, n-gram, and embedding-similarity checks) before every evaluation cycle and after every training or fine-tuning run.
- Rotate a portion of evaluation items each cycle and maintain a held-out canary set that is never shared with teams involved in model development.
- Present equivalent scenarios in multiple formats so that format-specific overfitting is visible in the results.
- Commission governance compliance evaluations only from evaluators whose independence has been verified and documented.

Anti-patterns to avoid (a process-level check for the second item appears after this list):

- Allowing benchmark items into fine-tuning datasets or retrieval indices, even temporarily or for debugging purposes.
- Relying on a single static benchmark whose question format is predictable enough to optimise against.
- Accepting large benchmark score improvements without an independent held-out or canary check.
- Commissioning compliance evaluations from parties that also consult for, report to, or otherwise depend financially on the evaluated organisation.
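As a process-level control, comparing scores across presentation formats can surface the kind of overfitting seen in Scenario B. The sketch below assumes the same scenario bank has been rendered in several formats; the spread threshold is an illustrative choice.

```python
from statistics import mean

def format_overfitting_check(scores_by_format: dict[str, float],
                             max_spread: float = 0.10) -> dict:
    """Given accuracy on the same scenario bank rendered in different formats
    (e.g. structured, conversational, tabular), flag a spread large enough to
    suggest the agent has learned the benchmark's presentation rather than the
    underlying capability."""
    values = list(scores_by_format.values())
    spread = max(values) - min(values)
    return {"mean_accuracy": round(mean(values), 3),
            "spread": round(spread, 3),
            "format_overfitting_suspected": spread > max_spread}

# Scenario B in section 3: strong on the structured benchmark format, weak elsewhere.
print(format_overfitting_check({"structured": 0.94,
                                "conversational": 0.69,
                                "tabular": 0.72}))
```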

Industry Considerations

Financial Services. Model validation frameworks (SR 11-7, SS1/23) require independent model validation including verification of test methodologies. For AI agents in financial services, evaluation integrity maps directly to model validation integrity. The PRA expects firms to ensure that model validation is performed by parties independent of model development, with documented methodology and reproducible results.

Healthcare. Clinical AI evaluations must comply with evaluation standards including STARD (Standards for Reporting of Diagnostic Accuracy Studies) and CONSORT (Consolidated Standards of Reporting Trials). Benchmark leakage in clinical AI could lead to deployment of systems whose diagnostic accuracy has not been genuinely verified, creating patient safety risks.

Public Sector. AI systems used in public sector decision-making (benefits determination, enforcement prioritisation, risk assessment) are subject to scrutiny regarding bias and fairness. Evaluation integrity ensures that fairness evaluations reflect genuine properties of the system, not artefacts of contaminated benchmarks.

Maturity Model

Basic Implementation — Evaluation datasets are stored separately from training data with access controls. Contamination detection (at minimum, exact-match checking) is performed before each evaluation cycle. At least 20% of evaluation items are rotated per cycle. Evaluation audit trails are maintained. Evaluator independence requirements are documented and enforced for external evaluations. This level meets the minimum mandatory requirements but may not detect sophisticated leakage or format-specific overfitting.

Intermediate Implementation — All basic capabilities plus: multi-level contamination detection (exact, n-gram, embedding similarity). Evaluation format diversity with at least 3 presentation formats per scenario type. Canary set monitoring after each training cycle. Blind evaluation procedures prevent the agent from detecting evaluation conditions. Dynamic evaluation generation supplements static benchmarks for at least one capability domain.
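The embedding-similarity tier of multi-level contamination detection targets paraphrased leakage that exact-match and n-gram checks miss. The following is a minimal sketch that assumes the caller supplies a sentence-embedding function; the 0.95 similarity threshold is an assumption to be calibrated per deployment.

```python
import math
from typing import Callable, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors, computed in pure Python."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_contamination(eval_items: list[str], training_docs: list[str],
                           embed: Callable[[str], Sequence[float]],
                           threshold: float = 0.95) -> list[dict]:
    """Flag evaluation items that are near-duplicates of training documents in
    embedding space. `embed` is supplied by the caller and may be any
    sentence-embedding model available in the deployment environment."""
    doc_vectors = [embed(doc) for doc in training_docs]
    findings = []
    for idx, item in enumerate(eval_items):
        item_vector = embed(item)
        for vec in doc_vectors:
            score = cosine(item_vector, vec)
            if score >= threshold:
                findings.append({"item_index": idx, "similarity": round(score, 3)})
                break  # one near-duplicate is enough to flag the item
    return findings
```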

Advanced Implementation — All intermediate capabilities plus: dynamic evaluation generation covers all critical capability domains. Independent adversarial testing of evaluation integrity has been conducted. Cryptographic separation of evaluation data with DLP controls. Evaluator independence is verified through structural controls (organisational separation, rotation of evaluators, cross-checking between evaluators). The organisation can demonstrate to regulators that no known contamination or gaming vector can compromise evaluation results.

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Contamination Detection Effectiveness

Test 8.2: Access Control Enforcement on Evaluation Data

Test 8.3: Evaluation Rotation Verification

Test 8.4: Format Diversity Verification

Test 8.5: Canary Set Sensitivity

Test 8.6: Blind Evaluation Integrity

Test 8.7: Evaluator Independence Verification

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Supports compliance
EU AI Act | Article 10 (Data and Data Governance) | Direct requirement
EU AI Act | Article 15 (Accuracy, Robustness and Cybersecurity) | Direct requirement
NIST AI RMF | MEASURE 1.1, MEASURE 2.5 | Direct requirement
ISO 42001 | Clause 9.1 (Monitoring, Measurement, Analysis and Evaluation) | Direct requirement
PRA SS1/23 | Model Validation Requirements | Supports compliance
SOX | Section 404 (Internal Controls) | Supports compliance

EU AI Act — Article 15 (Accuracy, Robustness and Cybersecurity)

Article 15 requires that accuracy levels are declared and verifiable. Verifiability requires evaluation integrity — if the evaluation process is contaminated, the declared accuracy levels are not verifiable. AG-152 implements the controls necessary to ensure that accuracy declarations are based on genuine evaluation results, not contaminated benchmarks.

EU AI Act — Article 10 (Data and Data Governance)

Article 10 requires that training, validation, and testing datasets be "relevant, sufficiently representative, and to the extent possible, free of errors and complete." The separation requirement between training and evaluation datasets is implicit in Article 10's distinction between these dataset categories — they must serve different functions, which requires that they remain independent.

NIST AI RMF — MEASURE 1.1, MEASURE 2.5

MEASURE 1.1 addresses the selection of appropriate methods and metrics for measuring AI risks; MEASURE 2.5 addresses demonstrating that the AI system is valid and reliable. Both require that evaluation methods produce valid results, which in turn requires that evaluation integrity is maintained. AG-152 provides the operational controls to ensure evaluation method validity.

PRA SS1/23 — Model Validation

SS1/23 requires independent model validation with documented methodology. For AI agents in financial services, this includes verification that evaluation benchmarks have not been contaminated and that evaluation results are reproducible. The PRA expects firms to demonstrate that model validation is genuinely independent and that validation results reflect actual model properties.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | Organisation-wide — extends to regulators, customers, and any stakeholder relying on evaluation-based assertions about agent capability or safety

Consequence chain: Evaluation integrity failure produces false confidence in agent properties. The organisation believes the agent is safe, capable, and compliant based on evaluation results that do not reflect reality. Deployment decisions, regulatory certifications, and stakeholder communications are all based on the false evaluation. When the discrepancy between evaluated properties and actual properties becomes apparent — typically through operational failure, customer harm, or regulatory audit — the consequences include: immediate operational disruption as the agent is taken offline for re-evaluation; regulatory enforcement for deploying an agent based on inadequate evaluation; reputational damage from the disclosure that reported evaluation results were unreliable; potential liability for decisions made in reliance on false evaluation results; and a cascade of re-evaluation costs across all agents evaluated using the same compromised methodology. The insidious nature of evaluation integrity failure is that it is self-concealing — the very mechanism that should detect agent problems (evaluation) is the mechanism that is compromised.

Cross-references:

- AG-078 (Benchmark Coverage) — ensures evaluations cover the relevant capability dimensions; AG-152 ensures the results of those evaluations are trustworthy.
- AG-149 (Input Artefact Authenticity Verification) — provides the foundational verification for evaluation dataset artefacts.
- AG-036 (Reasoning Process Integrity) — ensures the agent's reasoning during evaluation reflects its genuine capabilities.
- AG-039 (Active Deception and Concealment Detection) — detects when an agent intentionally behaves differently during evaluation than in production.
- AG-057 (Dataset Suitability and Bias Control) — addresses representativeness and bias in evaluation datasets.
- AG-150 (Feedback and Learning Poisoning Resistance Governance) — prevents evaluation data from being indirectly contaminated through feedback loops.
- AG-151 (Outcome Metric Integrity and Reward-Tampering Resistance) — complementary control ensuring the metrics computed from evaluation results are themselves trustworthy.

Cite this protocol
AgentGoverning. (2026). AG-152: Evaluation Integrity and Benchmark Leakage Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-152