AG-354

Hidden Test Integrity Governance

Evaluation, Benchmarking & Red Teaming · AGS v2.1 · April 2026
EU AI Act · FCA · NIST · ISO 42001

2. Summary

Hidden Test Integrity Governance protects private, blinded, or hold-out test sets from contamination, memorisation, or unauthorised disclosure. Hidden tests serve as an independent check on agent performance — they test capabilities the agent has not been specifically optimised for and reveal whether reported performance generalises beyond known test sets. When hidden tests are compromised, the organisation loses its most reliable evaluation signal: the ability to test the agent against scenarios it has not seen during development or tuning. This dimension establishes the access controls, contamination detection mechanisms, and integrity verification processes needed to maintain the value of hidden test sets over time.

3. Example

Scenario A — Training Data Contamination Inflates Hidden Test Scores: A financial services firm maintains a hidden test set of 200 regulatory compliance scenarios used for quarterly independent evaluation. The scenarios are stored in a restricted repository. During a model fine-tuning cycle, a data engineer inadvertently includes a dataset that contains paraphrased versions of 45 hidden test scenarios — the paraphrases were generated during an earlier scenario development exercise and were stored in a shared data lake without access restrictions. After fine-tuning, the agent's hidden test score jumps from 87.2% to 96.8%. The improvement is celebrated as a breakthrough. In reality, the agent has memorised the contaminated scenarios; its genuine generalisation performance has not improved. Six months later, the agent fails in production on novel regulatory scenarios that differ from the memorised hidden set, causing 12 compliance violations.

What went wrong: No contamination detection existed between fine-tuning data and hidden test sets. The hidden test scenarios were not isolated at the data layer — paraphrased versions existed in accessible data stores. The score improvement was not investigated for potential contamination. Consequence: 12 compliance violations, regulatory investigation, £280,000 in remediation costs, and the entire hidden test set had to be rebuilt from scratch.

Scenario B — Developer Access Compromises Blinding: An enterprise deploys a workflow agent and maintains a blinded evaluation process where a separate quality team runs hidden tests quarterly. A senior developer, troubleshooting a production issue, accesses the hidden test repository to understand an evaluation failure. Over the next month, the developer uses knowledge of the hidden tests to guide optimisation decisions, focusing on patterns that appear in the hidden set. The next quarterly evaluation shows improvement. The quality team reports improved performance, not knowing that the improvement is specific to the hidden test patterns. The blinding has been compromised, and the hidden test no longer provides independent assurance.

What went wrong: Access controls on the hidden test repository were insufficient — developer access was possible. No audit logging detected the unauthorised access. No blinding verification process existed to detect that development decisions were influenced by hidden test knowledge. Consequence: Loss of evaluation independence, six months of unreliable quarterly evaluations, need to rebuild the hidden test set and re-establish blinding protocols, and organisational trust erosion between the quality and development teams.

Scenario C — Gradual Memorisation Through Repeated Use: A safety-critical agent is evaluated against a hidden test set of 150 edge-case safety scenarios. The same hidden set is used for every evaluation cycle — pre-deployment, monthly regression, and incident-triggered assessments. Over 18 months, the agent undergoes 22 evaluation cycles using the same hidden set. Reinforcement learning from human feedback (RLHF) inadvertently optimises for the hidden set patterns, as evaluation failures influence training priorities. By month 18, the agent scores 99.3% on the hidden set but encounters a novel safety scenario in production that it handles incorrectly. Post-incident analysis reveals that the agent has effectively memorised the hidden set through repeated indirect exposure, and the 99.3% score reflects memorisation rather than generalised safety capability.

What went wrong: The hidden set was used repeatedly without rotation, allowing gradual indirect memorisation through the feedback loop between evaluation results and training priorities. No mechanism existed to detect decreasing discriminative power of the hidden set over time. Consequence: Safety incident in production, mandatory safety review, loss of confidence in the evaluation programme, and requirement to establish a hidden test rotation process with independent scenario generation.

4. Requirement Statement

Scope: This dimension applies to all hidden, blinded, or hold-out test sets used in AI agent evaluation. This includes: test sets withheld from development teams to provide independent evaluation; canary sets used to detect benchmark contamination; blinded evaluation protocols where evaluators or developers are prevented from knowing which scenarios are being tested; hold-out sets reserved for final pre-deployment evaluation; and any evaluation dataset whose integrity depends on remaining unknown to the agent development process. The scope extends to all forms of contamination: direct data leakage, paraphrase leakage, indirect inference from evaluation results, and gradual memorisation through repeated use. It does not apply to open test sets that are intentionally shared with development teams for iterative improvement.

4.1. A conforming system MUST enforce access controls on hidden test sets that prevent any individual involved in agent development, training, or fine-tuning from viewing, querying, or inferring the contents of the hidden set.

4.2. A conforming system MUST implement contamination detection that verifies, before each evaluation cycle, that the agent's training data (including fine-tuning data, RLHF data, and any data used to influence agent behaviour) does not contain hidden test scenarios or semantic equivalents.

4.3. A conforming system MUST audit access to hidden test repositories, logging all access attempts (successful and failed) with timestamp, identity, and action performed.

4.4. A conforming system MUST rotate or refresh at least 20% of hidden test scenarios annually to counter gradual memorisation through indirect exposure.

4.5. A conforming system MUST investigate any anomalous improvement in hidden test scores (e.g., improvement exceeding 5 percentage points between consecutive evaluations) for potential contamination before accepting the result.

4.6. A conforming system MUST store hidden test sets in a separate data environment from training data, with independent access controls and no shared storage pathways.

4.7. A conforming system SHOULD implement canary scenarios — unique, distinctive test cases embedded in the hidden set that can be detected in agent outputs if memorised, serving as contamination indicators.

4.8. A conforming system SHOULD track the discriminative power of the hidden test set over time (e.g., the variance in scores across evaluation cycles), flagging decreasing variance as a potential indicator of memorisation.

4.9. A conforming system SHOULD maintain at least two independent hidden test sets: one for regular evaluation and one held in reserve for use only in high-stakes assessments (pre-deployment certification, incident investigation).

4.10. A conforming system MAY implement cryptographic integrity verification for hidden test sets, ensuring that the set used for evaluation is provably the same as the set approved by the governance authority.
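
As an illustration of the optional integrity verification in 4.10, the sketch below binds an approved hidden set to a single digest recorded by the governance authority. It assumes the scenarios are stored as JSON files in one directory; the layout and function names are illustrative rather than prescribed.

```python
import hashlib
from pathlib import Path

def hidden_set_digest(scenario_dir: str) -> str:
    """Compute one SHA-256 digest over every scenario file, in a deterministic
    (sorted) order, so any change to the set changes the digest."""
    digest = hashlib.sha256()
    for path in sorted(Path(scenario_dir).glob("*.json")):
        digest.update(path.name.encode("utf-8"))  # bind file names as well as contents
        digest.update(path.read_bytes())
    return digest.hexdigest()

def verify_hidden_set(scenario_dir: str, approved_digest: str) -> bool:
    """Return True only if the set on disk matches the governance-approved digest."""
    return hidden_set_digest(scenario_dir) == approved_digest
```

The approved digest would be recorded by the governance authority at sign-off and checked immediately before each evaluation run, so a substituted or altered set fails loudly rather than silently.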

5. Rationale

Hidden tests are the last line of evaluation defence. Open tests, by definition, are known to the development process — agents can be (and are) optimised to perform well on them. This optimisation may be intentional (tuning for benchmark performance) or unintentional (development decisions guided by known evaluation criteria). Either way, open test performance reflects the agent's ability to handle known scenarios, not its ability to generalise to novel ones.

Hidden tests provide a fundamentally different signal: how well the agent performs on scenarios it has never seen and has not been optimised for. This signal is essential for several reasons. First, it predicts production performance, where the agent will encounter novel inputs not present in any test set. Second, it detects overfitting — an agent that scores 95% on open tests and 70% on hidden tests is overfitted to the open tests. Third, it provides independent assurance to regulators and auditors who need evidence that evaluation results are not gamed.

The integrity of this signal depends entirely on the hidden set remaining hidden. Every form of contamination — direct leakage, paraphrase exposure, indirect inference, gradual memorisation — degrades the signal. Once a hidden set is compromised, it provides no more information than an open set, but the organisation may continue to rely on it as if it provides independent assurance. This false reliance is the core risk that AG-354 addresses.

The rotation requirement (4.4) addresses a subtle form of contamination that occurs even with perfect access controls. When the same hidden set is used repeatedly, evaluation results influence development priorities, which influence agent behaviour, which is then evaluated against the same hidden set. Over many cycles, this indirect feedback loop causes the agent to converge toward hidden set patterns. Rotation breaks this feedback loop by continually introducing scenarios the agent has had no opportunity to adapt to.
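
A minimal sketch of how the rotation in 4.4 might be prioritised, assuming each scenario records when it entered the hidden set and how many evaluation cycles it has appeared in (the field names are illustrative): scenarios with the greatest indirect exposure are replaced first.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Scenario:
    scenario_id: str
    introduced_on: date      # when the scenario entered the hidden set
    evaluation_count: int    # how many evaluation cycles it has appeared in

def select_for_rotation(scenarios: list[Scenario], fraction: float = 0.20) -> list[Scenario]:
    """Pick the scenarios with the greatest indirect exposure for replacement:
    most-evaluated first, oldest first as a tie-breaker."""
    ranked = sorted(scenarios, key=lambda s: (-s.evaluation_count, s.introduced_on))
    n = max(1, round(len(scenarios) * fraction))
    return ranked[:n]
```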

The anomalous improvement investigation (4.5) is a critical early-warning mechanism. Sudden jumps in hidden test scores should trigger investigation because they are more likely to result from contamination than from genuine capability improvement. Genuine capability improvement typically produces gradual, distributed improvement across many scenarios; contamination produces sudden improvement concentrated in the contaminated scenarios.
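
The trigger in 4.5 can be expressed as a simple check over the chronological sequence of hidden test scores. The sketch below uses the 5 percentage point threshold given as an example in 4.5; in practice the threshold should be calibrated to the organisation's own score history.

```python
def flag_anomalous_improvement(scores: list[float], threshold_pp: float = 5.0) -> list[int]:
    """Return the indices of evaluation cycles whose hidden test score improved by
    more than `threshold_pp` percentage points over the previous cycle.
    Scores are percentages, in chronological order."""
    return [
        i for i in range(1, len(scores))
        if scores[i] - scores[i - 1] > threshold_pp
    ]

# Illustrative history ending in Scenario A's jump (87.2% -> 96.8%); the final cycle is flagged:
print(flag_anomalous_improvement([85.1, 86.4, 87.2, 96.8]))  # -> [3]
```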

6. Implementation Guidance

Protecting hidden test integrity requires layered controls: access controls to prevent direct exposure, contamination detection to catch indirect leakage, and rotation processes to counter gradual memorisation.
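
A minimal sketch of a pre-evaluation contamination check (4.2), using word n-gram overlap only. The n-gram length and overlap threshold are illustrative, the check must itself run inside the restricted environment that holds the hidden set, and paraphrase leakage would additionally require embedding-based semantic similarity as described under the intermediate maturity level below.

```python
import re

def _ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lower-case word n-grams; long n-grams rarely repeat by coincidence."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_hits(hidden_scenarios: list[str],
                       training_records: list[str],
                       n: int = 8,
                       overlap_threshold: float = 0.3) -> list[int]:
    """Return indices of hidden scenarios whose n-gram content overlaps any
    training record beyond the threshold. A hit blocks the evaluation cycle
    until the overlap has been reviewed."""
    training_grams: set[tuple[str, ...]] = set()
    for record in training_records:
        training_grams |= _ngrams(record, n)

    hits = []
    for idx, scenario in enumerate(hidden_scenarios):
        grams = _ngrams(scenario, n)
        if grams and len(grams & training_grams) / len(grams) >= overlap_threshold:
            hits.append(idx)
    return hits
```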

Recommended patterns:

Anti-patterns to avoid:

Industry Considerations

Financial Services. Regulatory examinations increasingly ask for evidence of independent evaluation. Hidden test sets provide this evidence. Contamination of hidden tests used for regulatory compliance evaluation is particularly serious because it directly undermines the evidence base for compliance certification. Firms should consider whether hidden test integrity is within scope of their internal audit programme.

Healthcare. Clinical safety hidden tests must be generated by independent clinical experts who are not involved in agent development. Contamination of clinical safety tests is a patient safety issue, not merely an evaluation quality issue.

Safety-Critical / CPS. Hidden tests for safety-critical systems should be generated by an independent safety team operating under a separate reporting line from the development team. Safety integrity levels (SIL) may prescribe specific independence requirements for test generation.

Maturity Model

Basic Implementation — Hidden test sets are stored in access-controlled repositories with role-based access. Access is audited. Contamination detection runs before each evaluation cycle using string matching and basic similarity checks. Hidden tests are rotated, with at least 20% refreshed annually. Anomalous score improvements are investigated. This level meets the minimum mandatory requirements, but contamination detection at this level may not catch sophisticated paraphrase leakage.

Intermediate Implementation — Storage is physically separated from training data environments. Contamination detection uses embedding-based semantic similarity and multi-level n-gram overlap. Canary scenarios are embedded in hidden sets. Discriminative power is monitored over time. At least two independent hidden sets are maintained. Evaluation operators see only aggregate results, with individual scenario analysis restricted to governance roles.
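
A sketch of the canary check described in 4.7, assuming canaries are distinctive phrases planted only in hidden scenarios and unlikely to occur otherwise. Verbatim matching is shown; a fuller check would also look for close paraphrases.

```python
def canary_leaks(agent_outputs: list[str], canary_phrases: list[str]) -> list[str]:
    """Return any canary phrases reproduced verbatim in agent outputs.
    Because canaries exist only in hidden scenarios, their reproduction by the
    agent indicates the hidden set has leaked into training or tuning data."""
    lowered = [out.lower() for out in agent_outputs]
    return [
        phrase for phrase in canary_phrases
        if any(phrase.lower() in out for out in lowered)
    ]
```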

Advanced Implementation — All intermediate capabilities plus: cryptographic integrity verification ensures hidden set provenance. Automated anomaly detection flags subtle contamination signals (e.g., suspiciously uniform performance across scenarios that should have varying difficulty). Hidden test generation is conducted by an independent team with no organisational connection to the development team. The hidden test programme is externally audited. Statistical methods verify that hidden test results are consistent with genuine generalisation rather than memorisation.
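
One way to track discriminative power (4.8) is to watch the spread of per-scenario scores shrink across cycles. The sketch below compares recent cycles against an earlier window; the statistic and window size are illustrative choices, and it assumes per-scenario scores are retained for every evaluation cycle.

```python
import statistics

def spread_trend(per_scenario_scores: list[list[float]], window: int = 4) -> float | None:
    """Each inner list holds one evaluation cycle's per-scenario scores, in
    chronological order. Returns the change in average spread between the most
    recent `window` cycles and the preceding `window` cycles; a clearly negative
    value means scores are converging, a potential sign the set is losing
    discriminative power and should be reviewed under 4.8."""
    if len(per_scenario_scores) < 2 * window:
        return None  # not enough cycles to compare two windows
    spreads = [statistics.pstdev(cycle) for cycle in per_scenario_scores]
    recent = statistics.mean(spreads[-window:])
    earlier = statistics.mean(spreads[-2 * window:-window])
    return recent - earlier
```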

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Access Control Enforcement

Test 8.2: Contamination Detection Execution

Test 8.3: Audit Log Integrity

Test 8.4: Rotation Compliance

Test 8.5: Anomalous Improvement Investigation

Test 8.6: Storage Separation Verification

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Supports compliance
EU AI Act | Article 15 (Accuracy, Robustness, Cybersecurity) | Direct requirement
NIST AI RMF | MEASURE 2.5, MEASURE 2.6 | Supports compliance
ISO 42001 | Clause 9.1 (Monitoring, Measurement, Analysis) | Supports compliance
FCA | SYSC 6.1.1R (Systems and Controls) | Supports compliance
DORA | Article 24 (ICT Testing) | Supports compliance

EU AI Act — Article 15 (Accuracy, Robustness, Cybersecurity)

Article 15 requires that accuracy and robustness be demonstrated through testing. If the test sets used for this demonstration are contaminated — and therefore produce artificially inflated scores — the accuracy and robustness claims are invalid. Hidden test integrity is a precondition for the credibility of Article 15 compliance evidence. An auditor assessing Article 15 compliance should ask whether evaluation datasets have been protected from contamination and whether scores have been validated against uncontaminated test sets.

NIST AI RMF — MEASURE 2.5, MEASURE 2.6

MEASURE 2.5 concerns demonstrating that the AI system is valid and reliable, including the limits of its generalisability; MEASURE 2.6 concerns its regular evaluation for safety risks. Both depend on the integrity of the evaluation process — contaminated test sets produce unreliable measurements that undermine the entire measurement framework.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | Organisation-wide — compromised hidden tests invalidate all evaluation results and compliance certifications that depend on them

Consequence chain: Without hidden test integrity, the organisation loses its most reliable evaluation signal. The immediate consequence is that evaluation results reflect memorisation rather than generalisation — scores are high but meaningless. The operational consequence is that the agent is deployed with false confidence in capabilities that do not generalise to production. When production failures occur, the organisation discovers that its evaluation evidence is unreliable, undermining not just the specific evaluation but the credibility of the entire evaluation programme. The regulatory consequence is acute: compliance certifications based on contaminated evaluations are invalid, and demonstrating this to a regulator after an incident is deeply damaging. The remediation cost is high: once a hidden set is compromised, it must be rebuilt entirely, and all evaluation results produced using the compromised set must be reassessed.

Cross-references: AG-152 (Evaluation Integrity and Benchmark Leakage) addresses the broader benchmark leakage problem of which hidden test contamination is a specific case. AG-349 (Scenario Library Governance) provides the scenario management framework within which hidden tests are maintained. AG-355 (Continuous Red-Team Scheduling Governance) depends on hidden test integrity for adversarial evaluation independence. AG-103 (Red-Team Coverage Management) requires uncontaminated adversarial scenarios.

Cite this protocol
AgentGoverning. (2026). AG-354: Hidden Test Integrity Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-354