AG-354

Hidden Test Integrity Governance

Evaluation, Benchmarking & Red Teaming · AGS v2.1 · April 2026
EU AI Act · FCA · NIST · ISO 42001

2. Summary

Hidden Test Integrity Governance protects private, blinded, or hold-out test sets from contamination, memorisation, or unauthorised disclosure. Hidden tests serve as an independent check on agent performance — they test capabilities the agent has not been specifically optimised for and reveal whether reported performance generalises beyond known test sets. When hidden tests are compromised, the organisation loses its most reliable evaluation signal: the ability to test the agent against scenarios it has not seen during development or tuning. This dimension establishes the access controls, contamination detection mechanisms, and integrity verification processes needed to maintain the value of hidden test sets over time.

3. Example

Scenario A — Training Data Contamination Inflates Hidden Test Scores: A financial services firm maintains a hidden test set of 200 regulatory compliance scenarios used for quarterly independent evaluation. The scenarios are stored in a restricted repository. During a model fine-tuning cycle, a data engineer inadvertently includes a dataset that contains paraphrased versions of 45 hidden test scenarios — the paraphrases were generated during an earlier scenario development exercise and were stored in a shared data lake without access restrictions. After fine-tuning, the agent's hidden test score jumps from 87.2% to 96.8%. The improvement is celebrated as a breakthrough. In reality, the agent has memorised the contaminated scenarios; its genuine generalisation performance has not improved. Six months later, the agent fails in production on novel regulatory scenarios that differ from the memorised hidden set, causing 12 compliance violations.

What went wrong: No contamination detection existed between fine-tuning data and hidden test sets. The hidden test scenarios were not isolated at the data layer — paraphrased versions existed in accessible data stores. The score improvement was not investigated for potential contamination. Consequence: 12 compliance violations, regulatory investigation, £280,000 in remediation costs, and the entire hidden test set had to be rebuilt from scratch.

Scenario B — Developer Access Compromises Blinding: An enterprise deploys a workflow agent and maintains a blinded evaluation process where a separate quality team runs hidden tests quarterly. A senior developer, troubleshooting a production issue, accesses the hidden test repository to understand an evaluation failure. Over the next month, the developer uses knowledge of the hidden tests to guide optimisation decisions, focusing on patterns that appear in the hidden set. The next quarterly evaluation shows improvement. The quality team reports improved performance, not knowing that the improvement is specific to the hidden test patterns. The blinding has been compromised, and the hidden test no longer provides independent assurance.

What went wrong: Access controls on the hidden test repository were insufficient — developer access was possible. No audit logging detected the unauthorised access. No blinding verification process existed to detect that development decisions were influenced by hidden test knowledge. Consequence: Loss of evaluation independence, six months of unreliable quarterly evaluations, need to rebuild the hidden test set and re-establish blinding protocols, and organisational trust erosion between the quality and development teams.

Scenario C — Gradual Memorisation Through Repeated Use: A safety-critical agent is evaluated against a hidden test set of 150 edge-case safety scenarios. The same hidden set is used for every evaluation cycle — pre-deployment, monthly regression, and incident-triggered assessments. Over 18 months, the agent undergoes 22 evaluation cycles using the same hidden set. Reinforcement learning from human feedback (RLHF) inadvertently optimises for the hidden set patterns, as evaluation failures influence training priorities. By month 18, the agent scores 99.3% on the hidden set but encounters a novel safety scenario in production that it handles incorrectly. Post-incident analysis reveals that the agent has effectively memorised the hidden set through repeated indirect exposure, and the 99.3% score reflects memorisation rather than generalised safety capability.

What went wrong: The hidden set was used repeatedly without rotation, allowing gradual indirect memorisation through the feedback loop between evaluation results and training priorities. No mechanism existed to detect decreasing discriminative power of the hidden set over time. Consequence: Safety incident in production, mandatory safety review, loss of confidence in the evaluation programme, and requirement to establish a hidden test rotation process with independent scenario generation.

4. Requirement Statement

Scope: This dimension applies to all hidden, blinded, or hold-out test sets used in AI agent evaluation. This includes: test sets withheld from development teams to provide independent evaluation; canary sets used to detect benchmark contamination; blinded evaluation protocols where evaluators or developers are prevented from knowing which scenarios are being tested; hold-out sets reserved for final pre-deployment evaluation; and any evaluation dataset whose integrity depends on remaining unknown to the agent development process. The scope extends to all forms of contamination: direct data leakage, paraphrase leakage, indirect inference from evaluation results, and gradual memorisation through repeated use. It does not apply to open test sets that are intentionally shared with development teams for iterative improvement.

4.1. A conforming system MUST enforce access controls on hidden test sets that prevent any individual involved in agent development, training, or fine-tuning from viewing, querying, or inferring the contents of the hidden set.

4.2. A conforming system MUST implement contamination detection that verifies, before each evaluation cycle, that the agent's training data (including fine-tuning data, RLHF data, and any data used to influence agent behaviour) does not contain hidden test scenarios or semantic equivalents.

4.3. A conforming system MUST audit access to hidden test repositories, logging all access attempts (successful and failed) with timestamp, identity, and action performed.

4.4. A conforming system MUST rotate or refresh at least 20% of hidden test scenarios annually to counter gradual memorisation through indirect exposure.

4.5. A conforming system MUST investigate any anomalous improvement in hidden test scores (e.g., improvement exceeding 5 percentage points between consecutive evaluations) for potential contamination before accepting the result.

4.6. A conforming system MUST store hidden test sets in a separate data environment from training data, with independent access controls and no shared storage pathways.

4.7. A conforming system SHOULD implement canary scenarios — unique, distinctive test cases embedded in the hidden set that can be detected in agent outputs if memorised, serving as contamination indicators.

4.8. A conforming system SHOULD track the discriminative power of the hidden test set over time (e.g., the variance in scores across evaluation cycles), flagging decreasing variance as a potential indicator of memorisation.

4.9. A conforming system SHOULD maintain at least two independent hidden test sets: one for regular evaluation and one held in reserve for use only in high-stakes assessments (pre-deployment certification, incident investigation).

4.10. A conforming system MAY implement cryptographic integrity verification for hidden test sets, ensuring that the set used for evaluation is provably the same as the set approved by the governance authority.
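
As an illustration of the optional integrity verification in 4.10, the sketch below binds an approved hidden set to a single digest recorded by the governance authority. It assumes the scenarios are stored as JSON files in one directory; the layout and function names are illustrative rather than prescribed.

```python
import hashlib
from pathlib import Path

def hidden_set_digest(scenario_dir: str) -> str:
    """Compute one SHA-256 digest over every scenario file, in a deterministic
    (sorted) order, so any change to the set changes the digest."""
    digest = hashlib.sha256()
    for path in sorted(Path(scenario_dir).glob("*.json")):
        digest.update(path.name.encode("utf-8"))  # bind file names as well as contents
        digest.update(path.read_bytes())
    return digest.hexdigest()

def verify_hidden_set(scenario_dir: str, approved_digest: str) -> bool:
    """Return True only if the set on disk matches the governance-approved digest."""
    return hidden_set_digest(scenario_dir) == approved_digest
```

The approved digest would be recorded by the governance authority at sign-off and checked immediately before each evaluation run, so a substituted or altered set fails loudly rather than silently.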

5. Rationale

Hidden tests are the last line of evaluation defence. Open tests, by definition, are known to the development process — agents can be (and are) optimised to perform well on them. This optimisation may be intentional (tuning for benchmark performance) or unintentional (development decisions guided by known evaluation criteria). Either way, open test performance reflects the agent's ability to handle known scenarios, not its ability to generalise to novel ones.

Hidden tests provide a fundamentally different signal: how well the agent performs on scenarios it has never seen and has not been optimised for. This signal is essential for several reasons. First, it predicts production performance, where the agent will encounter novel inputs not present in any test set. Second, it detects overfitting — an agent that scores 95% on open tests and 70% on hidden tests is overfitted to the open tests. Third, it provides independent assurance to regulators and auditors who need evidence that evaluation results are not gamed.

The integrity of this signal depends entirely on the hidden set remaining hidden. Every form of contamination — direct leakage, paraphrase exposure, indirect inference, gradual memorisation — degrades the signal. Once a hidden set is compromised, it provides no more information than an open set, but the organisation may continue to rely on it as if it provides independent assurance. This false reliance is the core risk that AG-354 addresses.

The rotation requirement (4.4) addresses a subtle form of contamination that occurs even with perfect access controls. When the same hidden set is used repeatedly, evaluation results influence development priorities, which influence agent behaviour, which is then evaluated against the same hidden set. Over many cycles, this indirect feedback loop causes the agent to converge toward hidden set patterns. Rotation breaks this feedback loop by continually introducing scenarios the agent has had no opportunity to adapt to.
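
A minimal sketch of how the rotation in 4.4 might be prioritised, assuming each scenario records when it entered the hidden set and how many evaluation cycles it has appeared in (the field names are illustrative): scenarios with the greatest indirect exposure are replaced first.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Scenario:
    scenario_id: str
    introduced_on: date      # when the scenario entered the hidden set
    evaluation_count: int    # how many evaluation cycles it has appeared in

def select_for_rotation(scenarios: list[Scenario], fraction: float = 0.20) -> list[Scenario]:
    """Pick the scenarios with the greatest indirect exposure for replacement:
    most-evaluated first, oldest first as a tie-breaker."""
    ranked = sorted(scenarios, key=lambda s: (-s.evaluation_count, s.introduced_on))
    n = max(1, round(len(scenarios) * fraction))
    return ranked[:n]
```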

The anomalous improvement investigation (4.5) is a critical early-warning mechanism. Sudden jumps in hidden test scores should trigger investigation because they are more likely to result from contamination than from genuine capability improvement. Genuine capability improvement typically produces gradual, distributed improvement across many scenarios; contamination produces sudden improvement concentrated in the contaminated scenarios.
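
The trigger in 4.5 can be expressed as a simple check over the chronological sequence of hidden test scores. The sketch below uses the 5 percentage point threshold given as an example in 4.5; in practice the threshold should be calibrated to the organisation's own score history.

```python
def flag_anomalous_improvement(scores: list[float], threshold_pp: float = 5.0) -> list[int]:
    """Return the indices of evaluation cycles whose hidden test score improved by
    more than `threshold_pp` percentage points over the previous cycle.
    Scores are percentages, in chronological order."""
    return [
        i for i in range(1, len(scores))
        if scores[i] - scores[i - 1] > threshold_pp
    ]

# Illustrative history ending in Scenario A's jump (87.2% -> 96.8%); the final cycle is flagged:
print(flag_anomalous_improvement([85.1, 86.4, 87.2, 96.8]))  # -> [3]
```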

6. Implementation Guidance

Protecting hidden test integrity requires layered controls: access controls to prevent direct exposure, contamination detection to catch indirect leakage, and rotation processes to counter gradual memorisation.
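
A minimal sketch of a pre-evaluation contamination check (4.2), using word n-gram overlap only. The n-gram length and overlap threshold are illustrative, the check must itself run inside the restricted environment that holds the hidden set, and paraphrase leakage would additionally require embedding-based semantic similarity as described under the intermediate maturity level below.

```python
import re

def _ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lower-case word n-grams; long n-grams rarely repeat by coincidence."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_hits(hidden_scenarios: list[str],
                       training_records: list[str],
                       n: int = 8,
                       overlap_threshold: float = 0.3) -> list[int]:
    """Return indices of hidden scenarios whose n-gram content overlaps any
    training record beyond the threshold. A hit blocks the evaluation cycle
    until the overlap has been reviewed."""
    training_grams: set[tuple[str, ...]] = set()
    for record in training_records:
        training_grams |= _ngrams(record, n)

    hits = []
    for idx, scenario in enumerate(hidden_scenarios):
        grams = _ngrams(scenario, n)
        if grams and len(grams & training_grams) / len(grams) >= overlap_threshold:
            hits.append(idx)
    return hits
```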

Recommended patterns:

Anti-patterns to avoid:

Industry Considerations

Financial Services. Regulatory examinations increasingly ask for evidence of independent evaluation. Hidden test sets provide this evidence. Contamination of hidden tests used for regulatory compliance evaluation is particularly serious because it directly undermines the evidence base for compliance certification. Firms should consider whether hidden test integrity is within scope of their internal audit programme.

Healthcare. Clinical safety hidden tests must be generated by independent clinical experts who are not involved in agent development. Contamination of clinical safety tests is a patient safety issue, not merely an evaluation quality issue.

Safety-Critical / CPS. Hidden tests for safety-critical systems should be generated by an independent safety team operating under a separate reporting line from the development team. Safety integrity levels (SIL) may prescribe specific independence requirements for test generation.

Maturity Model

Basic Implementation — Hidden test sets are stored in access-controlled repositories with role-based access. Access is audited. Contamination detection runs before each evaluation cycle using string matching and basic similarity checks. Hidden tests are rotated, with at least 20% refreshed annually. Anomalous score improvements are investigated. This level meets the minimum mandatory requirements, but contamination detection at this level may not catch sophisticated paraphrase leakage.

Intermediate Implementation — Storage is physically separated from training data environments. Contamination detection uses embedding-based semantic similarity and multi-level n-gram overlap. Canary scenarios are embedded in hidden sets. Discriminative power is monitored over time. At least two independent hidden sets are maintained. Evaluation operators see only aggregate results, with individual scenario analysis restricted to governance roles.
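
A sketch of the canary check described in 4.7, assuming canaries are distinctive phrases planted only in hidden scenarios and unlikely to occur otherwise. Verbatim matching is shown; a fuller check would also look for close paraphrases.

```python
def canary_leaks(agent_outputs: list[str], canary_phrases: list[str]) -> list[str]:
    """Return any canary phrases reproduced verbatim in agent outputs.
    Because canaries exist only in hidden scenarios, their reproduction by the
    agent indicates the hidden set has leaked into training or tuning data."""
    lowered = [out.lower() for out in agent_outputs]
    return [
        phrase for phrase in canary_phrases
        if any(phrase.lower() in out for out in lowered)
    ]
```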

Advanced Implementation — All intermediate capabilities plus: cryptographic integrity verification ensures hidden set provenance. Automated anomaly detection flags subtle contamination signals (e.g., suspiciously uniform performance across scenarios that should have varying difficulty). Hidden test generation is conducted by an independent team with no organisational connection to the development team. The hidden test programme is externally audited. Statistical methods verify that hidden test results are consistent with genuine generalisation rather than memorisation.
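
One way to track discriminative power (4.8) is to watch the spread of per-scenario scores shrink across cycles. The sketch below compares recent cycles against an earlier window; the statistic and window size are illustrative choices, and it assumes per-scenario scores are retained for every evaluation cycle.

```python
import statistics

def spread_trend(per_scenario_scores: list[list[float]], window: int = 4) -> float | None:
    """Each inner list holds one evaluation cycle's per-scenario scores, in
    chronological order. Returns the change in average spread between the most
    recent `window` cycles and the preceding `window` cycles; a clearly negative
    value means scores are converging, a potential sign the set is losing
    discriminative power and should be reviewed under 4.8."""
    if len(per_scenario_scores) < 2 * window:
        return None  # not enough cycles to compare two windows
    spreads = [statistics.pstdev(cycle) for cycle in per_scenario_scores]
    recent = statistics.mean(spreads[-window:])
    earlier = statistics.mean(spreads[-2 * window:-window])
    return recent - earlier
```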

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Access Control Enforcement

Test 8.2: Contamination Detection Execution

Test 8.3: Audit Log Integrity

Test 8.4: Rotation Compliance

Test 8.5: Anomalous Improvement Investigation

Test 8.6: Storage Separation Verification

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Supports compliance
EU AI Act | Article 15 (Accuracy, Robustness, Cybersecurity) | Direct requirement
NIST AI RMF | MEASURE 2.5, MEASURE 2.6 | Supports compliance
ISO 42001 | Clause 9.1 (Monitoring, Measurement, Analysis) | Supports compliance
FCA | SYSC 6.1.1R (Systems and Controls) | Supports compliance
DORA | Article 24 (ICT Testing) | Supports compliance

EU AI Act — Article 15 (Accuracy, Robustness, Cybersecurity)

Article 15 requires that accuracy and robustness be demonstrated through testing. If the test sets used for this demonstration are contaminated — and therefore produce artificially inflated scores — the accuracy and robustness claims are invalid. Hidden test integrity is a precondition for the credibility of Article 15 compliance evidence. An auditor assessing Article 15 compliance should ask whether evaluation datasets have been protected from contamination and whether scores have been validated against uncontaminated test sets.

NIST AI RMF — MEASURE 2.5, MEASURE 2.6

MEASURE 2.5 concerns demonstrating that the AI system is valid and reliable, including the limits of its generalisability; MEASURE 2.6 concerns its regular evaluation for safety risks. Both depend on the integrity of the evaluation process — contaminated test sets produce unreliable measurements that undermine the entire measurement framework.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | Organisation-wide — compromised hidden tests invalidate all evaluation results and compliance certifications that depend on them

Consequence chain: Without hidden test integrity, the organisation loses its most reliable evaluation signal. The immediate consequence is that evaluation results reflect memorisation rather than generalisation — scores are high but meaningless. The operational consequence is that the agent is deployed with false confidence in capabilities that do not generalise to production. When production failures occur, the organisation discovers that its evaluation evidence is unreliable, undermining not just the specific evaluation but the credibility of the entire evaluation programme. The regulatory consequence is acute: compliance certifications based on contaminated evaluations are invalid, and demonstrating this to a regulator after an incident is deeply damaging. The remediation cost is high: once a hidden set is compromised, it must be rebuilt entirely, and all evaluation results produced using the compromised set must be reassessed.

Cross-references: AG-152 (Evaluation Integrity and Benchmark Leakage) addresses the broader benchmark leakage problem of which hidden test contamination is a specific case. AG-349 (Scenario Library Governance) provides the scenario management framework within which hidden tests are maintained. AG-355 (Continuous Red-Team Scheduling Governance) depends on hidden test integrity for adversarial evaluation independence. AG-103 (Red-Team Coverage Management) requires uncontaminated adversarial scenarios.

Cite this protocol
AgentGoverning. (2026). AG-354: Hidden Test Integrity Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-354