AG-152

Evaluation Integrity and Benchmark Leakage Governance

Truth, Reward & Evaluation Integrity · AGS v2.1 · April 2026
EU AI Act · SOX · FCA · NIST · ISO 42001

2. Summary

Evaluation Integrity and Benchmark Leakage Governance requires that the evaluation processes used to assess AI agent capability, safety, and governance compliance are themselves protected from contamination, gaming, and information leakage. When evaluation datasets leak into training data, when agents are optimised against specific benchmark questions rather than the underlying capabilities those benchmarks measure, or when evaluation processes are compromised by conflicts of interest, the resulting assessments are meaningless — they report capability that does not exist and safety that has not been verified. This dimension ensures that evaluation results reflect genuine agent properties rather than test-taking optimisation.

3. Example

Scenario A — Benchmark Leakage Through Training Data Contamination: An organisation evaluates its AI agent's regulatory compliance capability using a benchmark set of 500 regulatory scenarios. During a model fine-tuning cycle, a data engineer inadvertently includes the benchmark scenarios in the fine-tuning dataset. The agent's benchmark score rises from 71% to 96%. The organisation promotes the agent to production based on the improved score. In production, the agent's actual regulatory compliance accuracy is approximately 68% — lower than the pre-contamination benchmark indicated — because the fine-tuning on benchmark data displaced generalised regulatory knowledge. A regulatory audit 6 months later discovers that 23% of the agent's compliance assessments were incorrect, resulting in £890,000 in remediation costs and a regulatory warning.

What went wrong: The evaluation benchmark leaked into training data. The improved benchmark score reflected memorisation, not capability. No controls existed to detect benchmark contamination. No held-out evaluation set was maintained separately from the training pipeline.

Scenario B — Evaluation Gaming Through Benchmark-Specific Optimisation: A vendor developing an AI agent for safety-critical applications discovers that the industry's standard safety benchmark uses a specific format for presenting hazard scenarios. The vendor fine-tunes its agent specifically on the benchmark format — not on safety capability in general, but on recognising and correctly answering the specific question structures used in the benchmark. The agent achieves a 94% safety benchmark score, the highest in the industry. When deployed in production, the agent encounters hazard scenarios in formats it was not specifically trained on and fails to identify 31% of genuine safety hazards. A near-miss incident occurs when the agent fails to flag a chemical exposure risk that was presented in a conversational format rather than the structured benchmark format.

What went wrong: The evaluation measured format recognition, not safety capability. The vendor optimised for the test rather than the underlying skill. The benchmark's question format was predictable enough to enable targeted optimisation. No evaluation diversity controls existed to detect format-specific overfitting.

Scenario C — Compromised Evaluation Through Assessor Conflict of Interest: An organisation contracts with a third-party evaluator to assess its AI agent's governance compliance. The evaluator also provides consulting services to the same organisation on governance implementation. The evaluator has a financial incentive to report favourable results — a negative evaluation would undermine the value of the consulting engagement. The evaluation report identifies only minor findings and certifies the agent as compliant. An independent regulator-commissioned audit 8 months later identifies 14 material non-conformances, 5 of which were observable at the time of the original evaluation. The organisation faces enforcement action, and the evaluator faces professional liability claims.

What went wrong: The evaluator lacked independence from the evaluated organisation. The conflict of interest biased the evaluation toward favourable results. No structural controls existed to ensure evaluator independence.

4. Requirement Statement

Scope: This dimension applies to all AI agent systems that undergo evaluation, benchmarking, or compliance assessment — whether internal or external, pre-deployment or ongoing. This includes capability benchmarks, safety evaluations, governance compliance assessments, red-team exercises, and any process whose results influence deployment decisions, regulatory certifications, or stakeholder confidence in agent properties. Systems that are never formally evaluated are technically excluded, though such systems are inherently non-conformant with any governance framework that requires evaluation.

4.1. A conforming system MUST maintain evaluation datasets in a storage environment that is cryptographically separated from training data pipelines, with access controls that prevent any evaluation data from entering training, fine-tuning, or retrieval augmentation processes.

4.2. A conforming system MUST implement contamination detection checks before each evaluation cycle, verifying that evaluation items have not appeared in training data, fine-tuning data, or retrieval indices.
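The following is a minimal sketch of the pre-evaluation contamination check described in 4.2, assuming evaluation items and training documents are available as plain strings. The n-gram size and overlap threshold are illustrative choices, not prescribed values.

```python
# Minimal contamination check: exact-match and word n-gram overlap between
# evaluation items and a training corpus. Thresholds are illustrative.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word n-grams of a normalised text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_report(eval_items: list[str],
                         training_docs: list[str],
                         n: int = 8,
                         overlap_threshold: float = 0.3) -> list[dict]:
    """Flag evaluation items that appear verbatim in the training data or that
    share a suspicious fraction of n-grams with any training document."""
    training_exact = {doc.strip().lower() for doc in training_docs}
    training_ngrams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        training_ngrams |= ngrams(doc, n)

    findings = []
    for idx, item in enumerate(eval_items):
        item_ngrams = ngrams(item, n)
        overlap = (len(item_ngrams & training_ngrams) / len(item_ngrams)
                   if item_ngrams else 0.0)
        exact = item.strip().lower() in training_exact
        if exact or overlap >= overlap_threshold:
            findings.append({"item_index": idx,
                             "exact_match": exact,
                             "ngram_overlap": round(overlap, 3)})
    return findings
```

Any item returned by the check would block the evaluation cycle until the contamination is investigated and resolved.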

4.3. A conforming system MUST rotate at least 20% of evaluation items per evaluation cycle to detect overfitting to static benchmark content.

4.4. A conforming system MUST require that evaluators of governance compliance have no financial, contractual, or reporting relationship with the organisation being evaluated that could create a conflict of interest, or that any such relationship is disclosed and mitigated through structural controls.

4.5. A conforming system MUST maintain evaluation audit trails recording the evaluation dataset version, evaluation methodology, evaluator identity, evaluation date, and raw results — sufficient to reproduce the evaluation.
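One way to satisfy 4.5 is an append-only log of structured evaluation records. The sketch below assumes a Python tooling environment; the field names, the example evaluator name, and the hashing choice are illustrative rather than prescribed.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import date

@dataclass(frozen=True)
class EvaluationRecord:
    """One audit-trail entry; fields mirror clause 4.5."""
    dataset_version: str     # content hash or release tag of the evaluation set
    methodology: str         # reference to the documented evaluation procedure
    evaluator: str           # organisation or individual performing the evaluation
    evaluation_date: str     # ISO 8601 date
    raw_results_digest: str  # hash of the raw, item-level results file

def record_evaluation(dataset_version: str, methodology: str, evaluator: str,
                      raw_results: bytes) -> EvaluationRecord:
    """Build a tamper-evident record sufficient to reproduce the evaluation."""
    return EvaluationRecord(
        dataset_version=dataset_version,
        methodology=methodology,
        evaluator=evaluator,
        evaluation_date=date.today().isoformat(),
        raw_results_digest=hashlib.sha256(raw_results).hexdigest(),
    )

# Example: serialise the record for an append-only audit log.
rec = record_evaluation("eval-set-2026-04@a1b2c3", "AG-152 test spec 8.x",
                        "Independent Assurance Ltd (hypothetical)",
                        b"{...raw item-level results...}")
print(json.dumps(asdict(rec), indent=2))
```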

4.6. A conforming system SHOULD implement evaluation format diversity, presenting equivalent test scenarios in at least 3 different formats to detect format-specific overfitting.

4.7. A conforming system SHOULD maintain a held-out "canary" set of evaluation items that are never shared with any team involved in model development or fine-tuning, used to independently verify that benchmark improvements reflect genuine capability gains.
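A held-out canary set is useful precisely because its score should move in step with genuine capability. The following sketch, with illustrative numbers drawn from Scenario A, shows one way to flag a benchmark gain that the canary set does not reproduce; the tolerance value is an assumption.

```python
def canary_consistency_check(benchmark_before: float, benchmark_after: float,
                             canary_before: float, canary_after: float,
                             tolerance: float = 0.05) -> dict:
    """Compare the score gain on the shared benchmark with the gain on the
    held-out canary set. A large benchmark gain that the canary set does not
    reproduce is a signal of leakage or benchmark-specific optimisation."""
    benchmark_gain = benchmark_after - benchmark_before
    canary_gain = canary_after - canary_before
    suspicious = (benchmark_gain - canary_gain) > tolerance
    return {"benchmark_gain": round(benchmark_gain, 3),
            "canary_gain": round(canary_gain, 3),
            "suspicious": suspicious}

# Scenario A in section 3 would trip this check: the benchmark jumps from
# 0.71 to 0.96 while a genuine held-out set shows little or no gain.
print(canary_consistency_check(0.71, 0.96, 0.70, 0.69))
```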

4.8. A conforming system SHOULD conduct blind evaluations where the agent is not informed (through metadata, headers, or behavioural signals) that it is being evaluated, to prevent evaluation-specific behaviour.

4.9. A conforming system MAY implement dynamic evaluation generation, creating novel evaluation scenarios programmatically rather than relying solely on static benchmark sets.
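Where dynamic generation is adopted, one lightweight approach is template instantiation. The sketch below is illustrative only; the templates, slot values, and any validation of generated items would need to come from the relevant domain.

```python
import random

# Templates and slot values below are purely illustrative; a real generator
# would draw on domain-specific scenario banks and validate generated items.
TEMPLATES = [
    "A customer in {jurisdiction} requests {action}. Which disclosure rules apply?",
    "During {process}, the agent detects {hazard}. What is the required escalation path?",
]
SLOTS = {
    "jurisdiction": ["the UK", "Germany", "Singapore"],
    "action": ["early account closure", "a data-erasure request"],
    "process": ["customer onboarding", "a routine maintenance window"],
    "hazard": ["a possible chemical exposure", "an unverified identity document"],
}

def generate_items(count: int, seed: int = 0) -> list[str]:
    """Instantiate templates with randomly chosen slot values so that each
    evaluation cycle sees scenarios that cannot have been memorised verbatim."""
    rng = random.Random(seed)
    items = []
    for _ in range(count):
        template = rng.choice(TEMPLATES)
        values = {name: rng.choice(options) for name, options in SLOTS.items()}
        items.append(template.format(**values))  # unused slots are simply ignored
    return items

print(generate_items(3, seed=42))
```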

5. Rationale

The value of any evaluation is entirely dependent on the evaluation's integrity. An evaluation that has been contaminated by data leakage, gamed through benchmark-specific optimisation, or compromised by conflicts of interest is worse than no evaluation at all — because it generates false confidence in properties that have not been verified. Organisations, regulators, and the public make decisions based on evaluation results; false evaluation results lead to false decisions.

Benchmark leakage is a well-documented phenomenon in the machine learning community. Studies have demonstrated that performance gains on standard benchmarks can be entirely attributable to training data contamination rather than genuine capability improvement. In the governance context, this is particularly dangerous because evaluation results often form the basis for regulatory certifications, deployment decisions, and safety assertions. An agent certified as safe based on contaminated benchmarks is an agent whose safety has not been verified.

The relationship to AG-078 (Benchmark Coverage) is foundational: AG-078 ensures that benchmarks cover the relevant capability and safety dimensions; AG-152 ensures that the benchmark results are themselves trustworthy. Without AG-078, the evaluation may miss important dimensions. Without AG-152, the evaluation results for covered dimensions may be fabricated.

The evaluator independence requirement reflects a principle established across professional practice: auditors must be independent of the entities they audit (ISA 200, SOX). The same principle applies to AI governance evaluation — an evaluator with a financial interest in favourable results cannot be trusted to provide an objective assessment.

6. Implementation Guidance

Evaluation integrity requires controls at three levels: data-level controls (preventing benchmark leakage), process-level controls (ensuring evaluation methodology resists gaming), and structural controls (ensuring evaluator independence and audit trail integrity).
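At the data level, one possible control is to register a content fingerprint of every evaluation item and have the training ingestion pipeline reject any matching document. The sketch below assumes exact matching on normalised text only; paraphrased leakage still requires the n-gram and embedding checks described elsewhere in this protocol.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Content fingerprint of a whitespace-normalised, lower-cased item."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

class EvaluationBlocklist:
    """Registry of evaluation-item fingerprints consulted by the training
    ingestion pipeline. Only fingerprints leave the evaluation environment,
    so the items themselves are never exposed to the training side."""

    def __init__(self, eval_items: list[str]):
        self._blocked = {fingerprint(item) for item in eval_items}

    def filter_training_batch(self, docs: list[str]) -> tuple[list[str], int]:
        """Drop any training document whose fingerprint matches an evaluation item;
        return the retained documents and the number dropped, for audit logging."""
        kept = [doc for doc in docs if fingerprint(doc) not in self._blocked]
        return kept, len(docs) - len(kept)
```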

Recommended patterns:

- Keep evaluation datasets in a dedicated, access-controlled repository that is cryptographically separated from training and fine-tuning pipelines.
- Run contamination detection (exact-match, n-gram, and embedding-similarity checks) before every evaluation cycle and after every training or fine-tuning run.
- Rotate a portion of evaluation items each cycle and maintain a held-out canary set that is never shared with teams involved in model development.
- Present equivalent scenarios in multiple formats so that format-specific overfitting is visible in the results.
- Commission governance compliance evaluations only from evaluators whose independence has been verified and documented.

Anti-patterns to avoid (a process-level check for the second item appears after this list):

- Allowing benchmark items into fine-tuning datasets or retrieval indices, even temporarily or for debugging purposes.
- Relying on a single static benchmark whose question format is predictable enough to optimise against.
- Accepting large benchmark score improvements without an independent held-out or canary check.
- Commissioning compliance evaluations from parties that also consult for, report to, or otherwise depend financially on the evaluated organisation.
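As a process-level control, comparing scores across presentation formats can surface the kind of overfitting seen in Scenario B. The sketch below assumes the same scenario bank has been rendered in several formats; the spread threshold is an illustrative choice.

```python
from statistics import mean

def format_overfitting_check(scores_by_format: dict[str, float],
                             max_spread: float = 0.10) -> dict:
    """Given accuracy on the same scenario bank rendered in different formats
    (e.g. structured, conversational, tabular), flag a spread large enough to
    suggest the agent has learned the benchmark's presentation rather than the
    underlying capability."""
    values = list(scores_by_format.values())
    spread = max(values) - min(values)
    return {"mean_accuracy": round(mean(values), 3),
            "spread": round(spread, 3),
            "format_overfitting_suspected": spread > max_spread}

# Scenario B in section 3: strong on the structured benchmark format, weak elsewhere.
print(format_overfitting_check({"structured": 0.94,
                                "conversational": 0.69,
                                "tabular": 0.72}))
```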

Industry Considerations

Financial Services. Model validation frameworks (SR 11-7, SS1/23) require independent model validation including verification of test methodologies. For AI agents in financial services, evaluation integrity maps directly to model validation integrity. The PRA expects firms to ensure that model validation is performed by parties independent of model development, with documented methodology and reproducible results.

Healthcare. Clinical AI evaluations must comply with evaluation standards including STARD (Standards for Reporting of Diagnostic Accuracy Studies) and CONSORT (Consolidated Standards of Reporting Trials). Benchmark leakage in clinical AI could lead to deployment of systems whose diagnostic accuracy has not been genuinely verified, creating patient safety risks.

Public Sector. AI systems used in public sector decision-making (benefits determination, enforcement prioritisation, risk assessment) are subject to scrutiny regarding bias and fairness. Evaluation integrity ensures that fairness evaluations reflect genuine properties of the system, not artefacts of contaminated benchmarks.

Maturity Model

Basic Implementation — Evaluation datasets are stored separately from training data with access controls. Contamination detection (at minimum, exact-match checking) is performed before each evaluation cycle. At least 20% of evaluation items are rotated per cycle. Evaluation audit trails are maintained. Evaluator independence requirements are documented and enforced for external evaluations. This level meets the minimum mandatory requirements but may not detect sophisticated leakage or format-specific overfitting.

Intermediate Implementation — All basic capabilities plus: multi-level contamination detection (exact, n-gram, embedding similarity). Evaluation format diversity with at least 3 presentation formats per scenario type. Canary set monitoring after each training cycle. Blind evaluation procedures prevent the agent from detecting evaluation conditions. Dynamic evaluation generation supplements static benchmarks for at least one capability domain.
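The embedding-similarity tier of multi-level contamination detection targets paraphrased leakage that exact-match and n-gram checks miss. The following is a minimal sketch that assumes the caller supplies a sentence-embedding function; the 0.95 similarity threshold is an assumption to be calibrated per deployment.

```python
import math
from typing import Callable, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors, computed in pure Python."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_contamination(eval_items: list[str], training_docs: list[str],
                           embed: Callable[[str], Sequence[float]],
                           threshold: float = 0.95) -> list[dict]:
    """Flag evaluation items that are near-duplicates of training documents in
    embedding space. `embed` is supplied by the caller and may be any
    sentence-embedding model available in the deployment environment."""
    doc_vectors = [embed(doc) for doc in training_docs]
    findings = []
    for idx, item in enumerate(eval_items):
        item_vector = embed(item)
        for vec in doc_vectors:
            score = cosine(item_vector, vec)
            if score >= threshold:
                findings.append({"item_index": idx, "similarity": round(score, 3)})
                break  # one near-duplicate is enough to flag the item
    return findings
```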

Advanced Implementation — All intermediate capabilities plus: dynamic evaluation generation covers all critical capability domains. Independent adversarial testing of evaluation integrity has been conducted. Cryptographic separation of evaluation data with DLP controls. Evaluator independence is verified through structural controls (organisational separation, rotation of evaluators, cross-checking between evaluators). The organisation can demonstrate to regulators that no known contamination or gaming vector can compromise evaluation results.

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Contamination Detection Effectiveness

Test 8.2: Access Control Enforcement on Evaluation Data

Test 8.3: Evaluation Rotation Verification

Test 8.4: Format Diversity Verification

Test 8.5: Canary Set Sensitivity

Test 8.6: Blind Evaluation Integrity

Test 8.7: Evaluator Independence Verification

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Supports compliance
EU AI Act | Article 10 (Data and Data Governance) | Direct requirement
EU AI Act | Article 15 (Accuracy, Robustness and Cybersecurity) | Direct requirement
NIST AI RMF | MEASURE 1.1, MEASURE 2.5 | Direct requirement
ISO 42001 | Clause 9.1 (Monitoring, Measurement, Analysis and Evaluation) | Direct requirement
PRA SS1/23 | Model Validation Requirements | Supports compliance
SOX | Section 404 (Internal Controls) | Supports compliance

EU AI Act — Article 15 (Accuracy, Robustness and Cybersecurity)

Article 15 requires that accuracy levels are declared and verifiable. Verifiability requires evaluation integrity — if the evaluation process is contaminated, the declared accuracy levels are not verifiable. AG-152 implements the controls necessary to ensure that accuracy declarations are based on genuine evaluation results, not contaminated benchmarks.

EU AI Act — Article 10 (Data and Data Governance)

Article 10 requires that training, validation, and testing datasets be "relevant, sufficiently representative, and to the extent possible, free of errors and complete." The separation requirement between training and evaluation datasets is implicit in Article 10's distinction between these dataset categories — they must serve different functions, which requires that they remain independent.

NIST AI RMF — MEASURE 1.1, MEASURE 2.5

MEASURE 1.1 addresses the selection of appropriate methods and metrics for measuring AI risks; MEASURE 2.5 addresses demonstrating that the AI system is valid and reliable. Both require that evaluation methods produce valid results, which in turn requires that evaluation integrity is maintained. AG-152 provides the operational controls to ensure evaluation method validity.

PRA SS1/23 — Model Validation

SS1/23 requires independent model validation with documented methodology. For AI agents in financial services, this includes verification that evaluation benchmarks have not been contaminated and that evaluation results are reproducible. The PRA expects firms to demonstrate that model validation is genuinely independent and that validation results reflect actual model properties.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | Organisation-wide — extends to regulators, customers, and any stakeholder relying on evaluation-based assertions about agent capability or safety

Consequence chain: Evaluation integrity failure produces false confidence in agent properties. The organisation believes the agent is safe, capable, and compliant based on evaluation results that do not reflect reality. Deployment decisions, regulatory certifications, and stakeholder communications are all based on the false evaluation. When the discrepancy between evaluated properties and actual properties becomes apparent — typically through operational failure, customer harm, or regulatory audit — the consequences include: immediate operational disruption as the agent is taken offline for re-evaluation; regulatory enforcement for deploying an agent based on inadequate evaluation; reputational damage from the disclosure that reported evaluation results were unreliable; potential liability for decisions made in reliance on false evaluation results; and a cascade of re-evaluation costs across all agents evaluated using the same compromised methodology. The insidious nature of evaluation integrity failure is that it is self-concealing — the very mechanism that should detect agent problems (evaluation) is the mechanism that is compromised.

Cross-references:

- AG-078 (Benchmark Coverage) — ensures evaluations cover the relevant capability dimensions; AG-152 ensures the results of those evaluations are trustworthy.
- AG-149 (Input Artefact Authenticity Verification) — provides the foundational verification for evaluation dataset artefacts.
- AG-036 (Reasoning Process Integrity) — ensures the agent's reasoning during evaluation reflects its genuine capabilities.
- AG-039 (Active Deception and Concealment Detection) — detects when an agent intentionally behaves differently during evaluation than in production.
- AG-057 (Dataset Suitability and Bias Control) — addresses representativeness and bias in evaluation datasets.
- AG-150 (Feedback and Learning Poisoning Resistance Governance) — prevents evaluation data from being indirectly contaminated through feedback loops.
- AG-151 (Outcome Metric Integrity and Reward-Tampering Resistance) — complementary control ensuring the metrics computed from evaluation results are themselves trustworthy.

Cite this protocol
AgentGoverning. (2026). AG-152: Evaluation Integrity and Benchmark Leakage Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-152