AG-743

Training Data Integrity Governance

Model Integrity and Provenance Governance · AGS v2.1 · 2026-04-25

EU AI Act · NIST AI RMF · ISO 42001

1. Definition

Training data integrity governance addresses the foundational risk that the behavioural characteristics of any machine learning model are determined by the data on which it was trained, and that corruption, poisoning, bias injection, or provenance loss in training data propagates into every downstream inference the model produces. Unlike runtime controls that intercept harmful outputs after generation, training data integrity is a preventive control that operates at the root of the model supply chain. When training data integrity fails, the resulting model may exhibit systematic biases, encode factual inaccuracies as high-confidence knowledge, reproduce copyrighted or personally identifiable material, or contain backdoor triggers that activate under adversarial conditions — all without any visible signal at inference time that the model's foundations are compromised.

The scope of this dimension encompasses all data artefacts used in pre-training, fine-tuning, reinforcement learning from human feedback (RLHF), direct preference optimisation (DPO), and any other training or alignment procedure that modifies model weights or reward functions. It governs the provenance tracking, quality assurance, contamination detection, consent and licensing verification, and integrity attestation processes that must surround training data throughout its lifecycle. This includes data sourced from public internet crawls, licensed commercial datasets, synthetic data generation pipelines, human annotation efforts, and internal organisational corpora used for domain-specific fine-tuning.

Failure in training data integrity manifests in ways that are exceptionally difficult to diagnose after the fact. A financial-value agent fine-tuned on a dataset containing manipulated earnings figures will systematically produce incorrect financial analyses that pass surface-level plausibility checks. A safety-critical agent trained on a corpus where a small number of equipment specifications have been altered by a supply chain adversary will produce dangerous maintenance guidance that appears authoritative. A public sector agent trained on data containing demographic biases will reproduce those biases in eligibility determinations, with the bias embedded so deeply in the weight distribution that no prompt engineering can fully remediate it. In each case, the failure is invisible at the point of inference because the model is doing exactly what its training data taught it to do.

Governance in practice requires organisations to maintain a complete, auditable chain of custody for all training data artefacts, implement automated contamination and poisoning detection scans before data enters the training pipeline, enforce licensing and consent verification for all data sources, conduct adversarial data integrity testing that simulates supply chain attacks, and maintain the ability to trace any model behaviour back to its training data origin for forensic investigation and regulatory response. For organisations that consume third-party foundation models rather than training their own, governance shifts to vendor due diligence, contractual attestation requirements, and independent evaluation of model behaviour for signs of training data compromise.

The regulatory landscape reinforces the criticality of this dimension. The EU AI Act Article 10 imposes explicit data governance requirements on providers of high-risk AI systems, mandating that training datasets be subject to appropriate data governance and management practices including examination for possible biases, gaps, and errors. NIST SP 800-218A, the Secure Software Development Framework (SSDF) community profile for generative AI and dual-use foundation models, extends secure software development practices to AI systems, including practices for confirming the integrity of training data before use. The UK AI Safety Institute's evaluation framework includes training data provenance as a core assessment criterion for frontier model safety. Organisations operating across jurisdictions must satisfy the most stringent applicable requirements, making training data integrity governance a cross-cutting compliance obligation that cannot be deferred to model providers alone.

2. Scope

This dimension applies to all organisations that train, fine-tune, align, or otherwise modify the weights or reward functions of models deployed in agentic systems, and to all organisations that deploy agentic systems using models trained by third parties where the deploying organisation bears accountability for the agent's behaviour under applicable regulatory frameworks. It covers all data artefacts used in any phase of model development including pre-training corpora, fine-tuning datasets, RLHF preference datasets, DPO preference pairs, synthetic training data, evaluation benchmarks used for model selection, and human annotation data used in alignment procedures.

3. Why This Matters

Training data integrity governance addresses a gap that, left unmanaged, creates systemic risk across the agent ecosystem: a single corrupted, poisoned, or biased dataset propagates into every model trained on it and every agent built on those models. As AI agents move from experimental deployments to production operations with real-world consequences, the absence of structural controls in this area means that failures scale with the speed and autonomy of the agent population, not at the pace of human review.

Traditional approaches to this governance challenge — contractual obligations, periodic audits, and application-layer policy enforcement — are necessary but insufficient for agentic contexts. Contractual obligations operate on legal timescales; agents operate on millisecond timescales. Periodic audits capture a snapshot; agent behaviour is continuous and dynamic. Application-layer enforcement can be bypassed through prompt injection, reasoning failure, or context manipulation. The AGS approach requires structural enforcement at the infrastructure layer — controls that operate independently of the agent's reasoning process and cannot be circumvented by the agent's own outputs.

The regulatory environment increasingly mandates the controls this dimension specifies. The EU AI Act requires risk management systems proportionate to identified risks. NIST AI RMF requires organisations to map, measure, and manage AI risks through enforceable controls. ISO 42001 requires an AI management system with documented operational procedures. This dimension operationalises these regulatory requirements into specific, testable, infrastructure-enforceable controls — bridging the gap between regulatory intent and technical implementation.

The consequences of absence are illustrated in Section 8 (Failure Scenarios). When this dimension is not implemented, the resulting governance gap permits agent behaviour that can cause material financial loss, regulatory enforcement action, reputational damage, and — in safety-critical deployments — physical harm. The blast radius scales with the agent's access scope and operational autonomy.

4. Requirements

4.1 Training Data Provenance and Chain of Custody

R1.1: The organisation MUST maintain a machine-readable provenance record for every training data artefact used in model development, including: the source of acquisition, the date of acquisition, the licensing or consent basis under which the data was obtained, any transformations applied to the data prior to use in training, and the identity of the individual or system that approved the data for inclusion in the training pipeline.

R1.2: Provenance records MUST be stored with tamper-evident integrity controls consistent with AG-103 and MUST be retained for the operational lifetime of any model trained on the data plus a minimum of 7 years or the applicable regulatory retention period, whichever is longer.

R1.3: Where training data is sourced from third-party providers, marketplaces, or open datasets, the organisation MUST conduct due diligence on the data provider's own data integrity and provenance practices before incorporating the data into any training pipeline.

R1.4: The organisation MUST implement a data lineage system capable of tracing any observed model behaviour or output pattern back to the specific training data artefacts that contributed to the relevant weight distributions, to the extent technically feasible with current attribution methods.
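A minimal sketch of how the fields listed in R1.1 might be held in a machine-readable record and checked for completeness before an artefact is approved. The class name `ProvenanceRecord`, its field names, and both helper functions are illustrative assumptions rather than anything the standard prescribes; an empty transformation list is treated here as valid (no transformations applied).

```python
import json
from dataclasses import dataclass, fields
from datetime import date


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record shape covering the fields named in R1.1."""
    artefact_id: str
    source: str
    acquired_on: date
    licence_or_consent_basis: str
    transformations: tuple[str, ...]
    approved_by: str


def missing_fields(record: ProvenanceRecord) -> list[str]:
    """Return the names of mandatory fields that are empty or blank."""
    missing = []
    for f in fields(record):
        if f.name == "transformations":
            continue  # assumption: an empty tuple means "none applied"
        value = getattr(record, f.name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(f.name)
    return missing


def to_json(record: ProvenanceRecord) -> str:
    """Serialise the record deterministically so it can be hashed or signed."""
    payload = {f.name: getattr(record, f.name) for f in fields(record)}
    payload["acquired_on"] = record.acquired_on.isoformat()
    payload["transformations"] = list(record.transformations)
    return json.dumps(payload, sort_keys=True)


complete = ProvenanceRecord(
    "art-001", "licensed-vendor-feed", date(2026, 1, 15),
    "commercial licence", ("dedup", "pii-redaction"), "data-steward-01",
)
incomplete = ProvenanceRecord(
    "art-002", "", date(2026, 1, 16), "CC-BY-4.0", (), " ",
)
print(missing_fields(complete), missing_fields(incomplete))
```

A gate in front of the training pipeline could refuse any artefact for which `missing_fields` returns a non-empty list.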

4.2 Contamination and Poisoning Detection

R2.1: The organisation MUST implement automated contamination detection scans on all training data before it enters the training pipeline. Scans MUST check for at minimum: (a) statistical anomalies in data distributions relative to the expected domain; (b) known poisoning patterns including backdoor trigger signatures; (c) duplicate or near-duplicate entries that could indicate data injection attacks; and (d) content that conflicts with verified ground-truth reference datasets.

R2.2: The organisation MUST maintain a curated set of ground-truth reference datasets, updated at intervals not exceeding 90 days, against which incoming training data can be validated for factual consistency in critical content categories.

R2.3: For Safety-Critical, Financial-Value, and Public Sector deployment contexts, the organisation MUST perform adversarial data integrity testing that simulates targeted poisoning attacks against the training pipeline and validates that detection controls identify the injected content before it reaches the training process.

R2.4: All contamination detection scan results MUST be logged with the data artefact identifier, scan timestamp, scan configuration version, and outcome, and MUST be available for audit for the retention period specified in R1.2.
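As an illustration of check (c) in R2.1 and the log fields in R2.4, the sketch below flags near-duplicate documents using word shingles and Jaccard similarity, then builds a scan log entry. The 3-word shingle size, the 0.8 threshold, and all function names are assumptions chosen for the example; the pairwise comparison is quadratic, so a production pipeline would more plausibly use an approximate technique such as MinHash.

```python
from datetime import datetime, timezone


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Split text into overlapping k-word shingles."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs: dict[str, str], threshold: float = 0.8) -> list[tuple[str, str]]:
    """Return id pairs whose shingle similarity meets the threshold."""
    ids = sorted(docs)
    sets = {doc_id: shingles(docs[doc_id]) for doc_id in ids}
    pairs = []
    for x in range(len(ids)):
        for y in range(x + 1, len(ids)):
            if jaccard(sets[ids[x]], sets[ids[y]]) >= threshold:
                pairs.append((ids[x], ids[y]))
    return pairs


def scan_log_entry(artefact_id: str, config_version: str, flagged: bool) -> dict:
    """Build a log entry carrying the four fields R2.4 asks for."""
    return {
        "artefact_id": artefact_id,
        "scan_timestamp": datetime.now(timezone.utc).isoformat(),
        "scan_config_version": config_version,
        "outcome": "flagged" if flagged else "clean",
    }


docs = {
    "a": "quarterly revenue grew four percent year on year according to the filing",
    "b": "quarterly revenue grew four percent year on year according to the filing today",
    "c": "the robotic arm payload limit is twelve kilograms",
}
flagged_ids = {doc_id for pair in near_duplicates(docs) for doc_id in pair}
log = [scan_log_entry(doc_id, "scan-config-v1", doc_id in flagged_ids) for doc_id in sorted(docs)]
```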

4.3 Licensing, Consent, and Regulatory Compliance

R3.1: The organisation MUST verify and document the licensing terms applicable to each training data source and MUST ensure that the intended use of the data in model training is permitted under those terms.

R3.2: Where training data contains or may contain personal data as defined under applicable data protection regulations (including GDPR, UK Data Protection Act 2018, CCPA, and equivalent jurisdictional instruments), the organisation MUST document the lawful basis for processing and MUST implement technical controls to detect and handle personal data in accordance with applicable requirements.

R3.3: The organisation MUST implement a mechanism for data subjects to exercise their rights regarding training data, including the right to erasure where technically feasible, and MUST document the technical limitations of rights exercise in the context of model training with appropriate transparency to affected individuals.

R3.4: Cross-Border / Multi-Jurisdiction agent deployments MUST ensure that training data compliance is assessed against the regulatory requirements of all jurisdictions in which the trained model will operate, not solely the jurisdiction in which training occurs.
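The checks in R3.1 and R3.4 can be expressed as a simple gap report: does each source's licence permit training use, and has compliance been assessed for every jurisdiction where the model will operate? The sketch below assumes a hypothetical `SourceLicence` record and jurisdiction codes chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceLicence:
    """Hypothetical licensing record for one training data source."""
    source_id: str
    licence_name: str
    permits_model_training: bool
    assessed_jurisdictions: frozenset[str]


def licensing_gaps(sources: list[SourceLicence],
                   deployment_jurisdictions: set[str]) -> dict[str, list[str]]:
    """Map each non-compliant source id to the issues found."""
    gaps: dict[str, list[str]] = {}
    for src in sources:
        issues = []
        if not src.permits_model_training:
            issues.append("training use not permitted")
        unassessed = sorted(deployment_jurisdictions - src.assessed_jurisdictions)
        if unassessed:
            issues.append("unassessed jurisdictions: " + ", ".join(unassessed))
        if issues:
            gaps[src.source_id] = issues
    return gaps


sources = [
    SourceLicence("src-1", "commercial", True, frozenset({"EU", "UK", "US"})),
    SourceLicence("src-2", "research-only", False, frozenset({"EU"})),
]
gaps = licensing_gaps(sources, {"EU", "UK"})
```

An empty result means every source passed both checks for the stated deployment footprint; it says nothing about whether the recorded licence terms are themselves accurate, which Test Case 6.3 verifies separately.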

4.4 Synthetic Data Governance

R4.1: Where synthetic data is used in training, the organisation MUST document the generation methodology, the seed data or models used to generate synthetic content, and the quality assurance process applied to synthetic data before inclusion in the training pipeline.

R4.2: Synthetic training data MUST be labelled as synthetic in the provenance record and MUST NOT be mixed with empirical data without explicit documentation of the mixing ratios and the rationale for the chosen composition.

R4.3: The organisation MUST implement validation controls that assess synthetic data for distributional drift, mode collapse, and factual inconsistency relative to the intended domain before use in training.
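R4.2 requires that mixing ratios and their rationale be documented whenever synthetic and empirical data are combined. A minimal sketch of such a composition report follows; the label values `"synthetic"` and `"empirical"` and the function name are assumptions made for the example.

```python
from collections import Counter


def composition_report(labels: list[str], rationale: str) -> dict:
    """Summarise the synthetic/empirical mix of a dataset.

    Raises ValueError for unlabelled samples, or for a mixed dataset
    that has no documented rationale.
    """
    if not labels:
        raise ValueError("dataset is empty")
    unknown = set(labels) - {"synthetic", "empirical"}
    if unknown:
        raise ValueError(f"unrecognised provenance labels: {sorted(unknown)}")
    counts = Counter(labels)
    is_mixed = counts["synthetic"] > 0 and counts["empirical"] > 0
    if is_mixed and not rationale.strip():
        raise ValueError("mixed dataset requires a documented rationale")
    total = len(labels)
    return {
        "total_samples": total,
        "synthetic_ratio": counts["synthetic"] / total,
        "empirical_ratio": counts["empirical"] / total,
        "rationale": rationale.strip(),
    }


report = composition_report(
    ["synthetic"] + ["empirical"] * 3,
    "augment under-represented equipment classes",
)
```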

4.5 Data Retention and Deletion Controls

R5.1: The organisation MUST implement controls to ensure that training data artefacts can be identified and removed from training pipelines if a post-hoc integrity issue, licensing violation, or data subject rights request requires it.

R5.2: Where complete removal of data influence from a trained model is not technically feasible without full retraining (as is generally the case with current neural network architectures), the organisation MUST document this limitation, implement compensating controls (such as retraining on the corrected dataset or targeted fine-tuning to counteract the influence of the removed data), and communicate the limitation transparently to affected parties.

R5.3: The organisation MUST maintain a data deletion register that records all data removal requests, the reason for removal, the technical actions taken, and an assessment of residual data influence in any model trained on the removed data.
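One way to picture R5.1 and R5.3 together: given training manifests that map each model to the artefacts it was trained on, a removal request identifies the affected models and appends an entry to the deletion register. The manifest shape, the `DeletionEntry` fields, and the function names below are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DeletionEntry:
    """Hypothetical register entry covering the items listed in R5.3."""
    artefact_id: str
    reason: str
    requested_on: date
    actions_taken: tuple[str, ...]
    affected_models: tuple[str, ...]
    residual_influence_assessment: str


def affected_models(manifests: dict[str, set[str]], artefact_id: str) -> tuple[str, ...]:
    """List models whose training manifest includes the artefact."""
    return tuple(sorted(m for m, artefacts in manifests.items() if artefact_id in artefacts))


def register_deletion(register: list[DeletionEntry], manifests: dict[str, set[str]],
                      artefact_id: str, reason: str, requested_on: date,
                      actions_taken: tuple[str, ...], assessment: str) -> DeletionEntry:
    """Record a removal request and the models it touches."""
    entry = DeletionEntry(
        artefact_id, reason, requested_on, actions_taken,
        affected_models(manifests, artefact_id), assessment,
    )
    register.append(entry)
    return entry


manifests = {
    "model-a": {"art-1", "art-2"},
    "model-b": {"art-3"},
    "model-c": {"art-2"},
}
register: list[DeletionEntry] = []
entry = register_deletion(
    register, manifests, "art-2", "licensing violation", date(2026, 3, 1),
    ("removed from pipeline", "retraining scheduled"),
    "influence persists in deployed weights until retraining completes",
)
```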

4.6 Third-Party Model Due Diligence

R6.1: Organisations deploying agentic systems using models trained by third parties MUST obtain from the model provider a training data attestation that covers, at minimum: a description of the data sources used, the contamination detection measures applied, the licensing compliance measures in place, and the provider's data integrity incident response process.

R6.2: Where the model provider is unable or unwilling to provide a training data attestation meeting the requirements of R6.1, the deploying organisation MUST conduct independent behavioural evaluation of the model for signs of training data compromise, including bias testing, factual accuracy benchmarking, and adversarial trigger probing, before deployment in any in-scope context.

R6.3: Third-party model due diligence MUST be refreshed whenever the model provider releases a new version or update that involves retraining or fine-tuning on additional data.
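The decision logic in R6.1 and R6.2 can be sketched as a small status function: an attestation covering all four topics clears the model, and otherwise all three independent evaluations are needed first. The topic and evaluation keys are assumed names for the items the requirements list, and the function reflects one reading of R6.2 rather than an authoritative interpretation.

```python
from typing import Optional

REQUIRED_ATTESTATION_TOPICS = (
    "data_sources",
    "contamination_detection",
    "licensing_compliance",
    "incident_response",
)
REQUIRED_EVALUATIONS = (
    "bias_testing",
    "factual_accuracy_benchmarking",
    "adversarial_trigger_probing",
)


def due_diligence_status(attestation: Optional[dict[str, str]],
                         completed_evaluations: set[str]) -> dict:
    """Report attestation gaps and whether deployment may proceed."""
    attestation = attestation or {}
    missing_topics = [
        t for t in REQUIRED_ATTESTATION_TOPICS
        if not str(attestation.get(t, "")).strip()
    ]
    missing_evaluations = [
        e for e in REQUIRED_EVALUATIONS if e not in completed_evaluations
    ]
    return {
        "missing_topics": missing_topics,
        # R6.2: an incomplete attestation triggers independent evaluation.
        "independent_evaluation_required": bool(missing_topics),
        "missing_evaluations": missing_evaluations,
        "cleared_for_deployment": not missing_topics or not missing_evaluations,
    }


full = {topic: "documented" for topic in REQUIRED_ATTESTATION_TOPICS}
partial = {"data_sources": "documented", "licensing_compliance": ""}
```

Per R6.3, a status computed for one model version would need to be recomputed when the provider ships a retrained or further fine-tuned version.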

5. Maturity Model

Basic Implementation — The organisation has documented policies addressing training data integrity and has implemented initial controls. Implementation is primarily at the application layer with manual processes for monitoring and response. Logging covers key events but may lack full metadata. Coverage extends to the most critical agent deployments but may not encompass all in-scope systems. Staff are aware of requirements but formal training may be incomplete.

Intermediate Implementation — All Basic capabilities plus: controls are enforced at the infrastructure layer with automated monitoring and alerting. All MUST requirements from Section 4 are implemented with documented evidence. Coverage extends to all in-scope agent deployments. Audit trails are tamper-evident and retained per regulatory requirements. Formal change control governs all configuration changes. Regular review cycles are established and documented. Staff receive formal training and competency is assessed.

Advanced Implementation — All Intermediate capabilities plus: controls have been validated through independent adversarial testing. Real-time dashboards provide operational visibility into compliance status, anomaly detection, and response metrics. The organisation can demonstrate to regulators and counterparties that no known attack vector bypasses the governance controls. Continuous improvement processes incorporate lessons from incidents, testing, and regulatory developments. Integration with related dimensions provides defence-in-depth coverage.

Implementation Patterns

Tamper-evident audit trail. Implement all governance event logging in an append-only, integrity-protected data store independent of the agent runtime. Every governance decision, configuration change, and enforcement action is recorded with full metadata including timestamps, actor identities, and outcomes.

Real-time monitoring with graduated alerting. Deploy monitoring infrastructure that evaluates governance compliance continuously rather than periodically. Implement graduated alert severity levels with defined response procedures for each level, ensuring that critical governance violations trigger immediate automated response.

Separation of governance and agent runtime domains. Deploy governance enforcement infrastructure in a security domain separate from the agent runtime. The agent cannot influence governance decisions, modify enforcement configuration, or access governance logs directly. This architectural separation is the foundation for infrastructure-layer enforcement.
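The tamper-evident audit trail pattern above is often realised as a hash chain, in which each entry commits to the hash of its predecessor. The sketch below uses SHA-256 from Python's standard library; the entry layout and function names are assumptions. A chain of this kind detects in-place edits, but it does not by itself detect truncation or wholesale replacement unless the latest hash is also anchored somewhere outside the log.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed sentinel for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    """Hash the previous entry's hash together with this event."""
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_event(log: list[dict], event: dict) -> dict:
    """Append an event, chaining it to the current tail of the log."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    entry = {"event": event, "prev_hash": prev_hash, "hash": _digest(prev_hash, event)}
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: list[dict] = []
append_event(audit_log, {"artefact_id": "art-1", "outcome": "clean"})
append_event(audit_log, {"artefact_id": "art-2", "outcome": "flagged"})
append_event(audit_log, {"artefact_id": "art-3", "outcome": "clean"})
intact = verify_chain(audit_log)

# Simulate tampering: rewrite a past outcome without recomputing hashes.
audit_log[1]["event"]["outcome"] = "clean"
after_tamper = verify_chain(audit_log)
```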

Anti-Patterns

Governance by instruction rather than infrastructure. Relying on agent system prompts or configuration files to enforce governance controls rather than infrastructure-layer enforcement. Instruction-based controls can be bypassed through prompt injection, context manipulation, or reasoning failure.

Monitoring without enforcement. Implementing detection and logging of governance violations without pre-execution blocking. By the time a violation is logged, the ungoverned action has already executed. Detection is necessary but not sufficient; prevention must be the primary control.

Manual processes for machine-speed operations. Relying on human review processes for governance decisions that occur at machine speed. Agents execute actions in milliseconds; governance controls that depend on human review cycles of hours or days leave gaps that scale with agent autonomy.

Ungoverned configuration drift. Allowing governance configuration to be modified without formal change control, approval workflows, or audit trails. Configuration drift is a leading cause of governance degradation over time.

6. Test Criteria

Test Case 6.1: Provenance Record Completeness

Scenario: Verify that all training data artefacts in a model's training manifest have complete provenance records.
Input: Request the full training data manifest and provenance records for a deployed model. Select 50 artefacts at random for detailed inspection.
Expected Outcome: Each selected artefact has a machine-readable provenance record containing source, acquisition date, licensing basis, transformation history, and approval identity.
Pass Criteria: 100% of sampled artefacts have complete provenance records with all mandatory fields populated and verifiable.

Test Case 6.2: Contamination Detection Efficacy

Scenario: Inject known poisoned data samples into a staging training pipeline and verify that contamination detection controls identify them.
Input: Prepare 20 poisoned data samples covering: 5 with statistical distribution anomalies, 5 with known backdoor trigger patterns, 5 with factual inconsistencies relative to ground truth, and 5 with suspicious duplication patterns.
Expected Outcome: All 20 poisoned samples are flagged by the automated contamination detection system before reaching the training process.
Pass Criteria: Detection rate of 90% or higher (18/20 minimum), with zero undetected backdoor trigger samples.
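The pass criteria for this test combine an overall detection rate with a hard constraint on backdoor samples. A small sketch of that evaluation follows; the category label `"backdoor"` and the result shape (one `(category, detected)` pair per injected sample) are assumptions.

```python
def contamination_test_passes(results: list[tuple[str, bool]],
                              min_rate: float = 0.90) -> bool:
    """Apply the Test Case 6.2 pass criteria to per-sample results."""
    if not results:
        return False
    detected = sum(1 for _, was_detected in results if was_detected)
    backdoor_missed = any(
        category == "backdoor" and not was_detected
        for category, was_detected in results
    )
    return detected / len(results) >= min_rate and not backdoor_missed


def make_results(missed: set[tuple[str, int]]) -> list[tuple[str, bool]]:
    """Build 20 results (5 per category), marking the given samples as missed."""
    categories = ("statistical", "backdoor", "factual", "duplication")
    return [
        (category, (category, i) not in missed)
        for category in categories
        for i in range(5)
    ]


two_benign_misses = make_results({("statistical", 0), ("duplication", 3)})
one_backdoor_miss = make_results({("backdoor", 2)})
three_misses = make_results({("statistical", 0), ("factual", 1), ("duplication", 3)})
```

Here `two_benign_misses` passes at 18/20, `one_backdoor_miss` fails despite a 19/20 rate, and `three_misses` fails on rate alone.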

Test Case 6.3: Licensing Compliance Verification

Scenario: Audit the licensing documentation for all data sources used in the most recent fine-tuning operation.
Input: Retrieve the licensing records for all data sources in the fine-tuning manifest. Cross-reference stated licensing terms with the actual terms published by data providers.
Expected Outcome: All data sources have documented licensing terms that explicitly permit use in model training, and no data source is used outside the scope of its licence.
Pass Criteria: 100% licensing documentation coverage; zero instances of data used outside licence scope.

Test Case 6.4: Synthetic Data Quality Gate

Scenario: Evaluate synthetic training data for distributional drift and factual inconsistency before pipeline inclusion.
Input: Submit a batch of 10,000 synthetic training samples generated by the organisation's synthetic data pipeline alongside the corresponding empirical reference distribution.
Expected Outcome: Quality gate identifies synthetic samples that deviate more than 2 standard deviations from the empirical distribution and flags factually inconsistent samples for removal.
Pass Criteria: Every synthetic sample deviating more than 2 standard deviations from the empirical distribution is flagged; factual inconsistency detection rate of 85% or higher against a labelled test set.
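For a single numeric feature, the 2-standard-deviation check described in this test can be sketched as below. Reducing each sample to one scalar feature is a simplifying assumption; real training data would typically need per-feature or embedding-based comparisons, and the function name is illustrative.

```python
from statistics import mean, stdev


def flag_drifted(empirical: list[float], synthetic: list[float],
                 k: float = 2.0) -> list[int]:
    """Return indices of synthetic values more than k empirical
    standard deviations from the empirical mean."""
    mu = mean(empirical)
    sigma = stdev(empirical)  # sample standard deviation
    if sigma == 0:
        raise ValueError("empirical reference has zero variance")
    return [i for i, x in enumerate(synthetic) if abs(x - mu) > k * sigma]


empirical_reference = [10.0, 12.0, 11.0, 13.0, 9.0, 10.0, 11.0, 12.0]
synthetic_batch = [11.5, 14.0, 8.0, 10.0]
flagged = flag_drifted(empirical_reference, synthetic_batch)
```

With this reference set the mean is 11.0 and the sample standard deviation is about 1.31, so the values 14.0 and 8.0 fall outside the 2-sigma band and are flagged.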

Test Case 6.5: Third-Party Attestation Verification

Scenario: Verify that a third-party model provider has supplied a compliant training data attestation and that the deploying organisation has conducted independent behavioural evaluation.
Input: Request the provider attestation document and the organisation's independent evaluation report for the most recently deployed third-party model.
Expected Outcome: Attestation covers all required fields (data sources, contamination detection, licensing, incident response). Independent evaluation includes bias testing, factual accuracy benchmarking, and adversarial trigger probing with documented results.
Pass Criteria: Attestation meets all requirements of R6.1 (Section 4.6); independent evaluation conducted and documented with quantitative results.

Test Case 6.6: Data Lineage Traceability

Scenario: Given a specific model output exhibiting a known bias pattern, trace the behaviour back to contributing training data.
Input: Identify a model output that exhibits a measurable demographic bias in a controlled test. Request a lineage trace to the training data artefacts that contributed most strongly to the biased output.
Expected Outcome: The data lineage system produces a ranked list of contributing training data artefacts with attribution scores, and the artefacts identified as top contributors are verifiably correlated with the observed bias.
Pass Criteria: Lineage trace completes within defined SLA; top-5 contributing artefacts are verified as relevant to the observed bias pattern.

Test Case 6.7: Personal Data Detection in Training Corpus

Scenario: Verify that personal data detection controls identify PII in a training data batch before it enters the training pipeline.
Input: Prepare a batch of 1,000 documents, 50 of which contain embedded personal data (names, addresses, national insurance numbers, medical record identifiers) in varying formats and contexts.
Expected Outcome: Personal data detection identifies at least 90% of the documents containing PII. Flagged documents are quarantined for review before entering the training pipeline.
Pass Criteria: Detection rate of 90% or higher; zero documents with high-sensitivity PII (e.g., medical records, financial account numbers) enter the training pipeline undetected.
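A minimal sketch of a quarantine gate and the detection-rate calculation used in this test. The two regular expressions (email addresses and a simplified UK National Insurance number shape) are illustrative assumptions and far from exhaustive; pattern matching alone will miss free-text PII such as names and addresses, which is why the example reports a rate below the 90% threshold.

```python
import re

# Assumed, deliberately simplified patterns for illustration only.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "uk_nino": re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b"),
}


def detect_pii(text: str) -> list[str]:
    """Return the names of the patterns that match the text."""
    return sorted(name for name, pattern in PII_PATTERNS.items() if pattern.search(text))


def quarantine(docs: dict[str, str]) -> tuple[dict[str, str], dict[str, list[str]]]:
    """Split documents into clean ones and held ones (with matched pattern names)."""
    clean: dict[str, str] = {}
    held: dict[str, list[str]] = {}
    for doc_id, text in docs.items():
        hits = detect_pii(text)
        if hits:
            held[doc_id] = hits
        else:
            clean[doc_id] = text
    return clean, held


def detection_rate(held: dict[str, list[str]], labelled_pii_ids: set[str]) -> float:
    """Fraction of documents labelled as containing PII that were held."""
    if not labelled_pii_ids:
        raise ValueError("no labelled PII documents supplied")
    return len(labelled_pii_ids & set(held)) / len(labelled_pii_ids)


batch = {
    "d1": "Contact jane.doe@example.org for details.",
    "d2": "NI number QQ 12 34 56 C on file.",
    "d3": "The pump operates at 40 psi.",
    "d4": "Patient record for Mr A. Example of 12 High Street.",
}
clean_docs, held_docs = quarantine(batch)
rate = detection_rate(held_docs, {"d1", "d2", "d4"})
```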

Test Case 6.8: Bias Detection Scan Effectiveness

Scenario: Verify that bias detection scans identify known demographic biases in a training dataset.
Input: Prepare a training dataset with an intentionally embedded demographic bias: a 20% higher positive-outcome rate for one demographic group versus an equivalent control group. Run the bias detection scan.
Expected Outcome: The bias scan identifies the demographic disparity and flags it for remediation before the data enters the training pipeline.
Pass Criteria: Demographic bias detected with statistical significance (p < 0.01); bias magnitude estimated within 5 percentage points of the actual embedded bias.
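The significance criterion in this test can be checked with a two-proportion z-test, sketched below using only the standard library. The choice of test is an assumption, since the standard does not name one, and the example reads the embedded "20% higher" rate as 20 percentage points (0.70 against 0.50), which is also an assumption.

```python
import math


def two_proportion_z_test(pos_a: int, n_a: int, pos_b: int, n_b: int) -> tuple[float, float, float]:
    """Return (rate difference, z statistic, two-sided p-value)."""
    p_a, p_b = pos_a / n_a, pos_b / n_b
    pooled = (pos_a + pos_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; outcomes have no variance")
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a - p_b, z, p_value


def bias_scan_passes(pos_a: int, n_a: int, pos_b: int, n_b: int,
                     embedded_diff: float, alpha: float = 0.01,
                     tolerance: float = 0.05) -> bool:
    """Apply the Test Case 6.8 pass criteria to observed counts."""
    diff, _, p_value = two_proportion_z_test(pos_a, n_a, pos_b, n_b)
    return p_value < alpha and abs(diff - embedded_diff) <= tolerance


diff, z, p_value = two_proportion_z_test(700, 1000, 500, 1000)
passed = bias_scan_passes(700, 1000, 500, 1000, embedded_diff=0.20)
```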

Evidence Artefacts

7.1 Training data provenance records for all artefacts used in model development, stored with tamper-evident integrity controls. Retention: model operational lifetime plus 7 years minimum.

7.2 Contamination detection scan logs for all training data batches, including scan configuration, artefact identifiers, and outcomes. Retention: model operational lifetime plus 7 years minimum.

7.3 Licensing and consent verification records for all data sources, including licence terms, verification date, and verifier identity. Retention: model operational lifetime plus 7 years minimum.

7.4 Synthetic data generation methodology documentation, seed data provenance, and quality assurance test results. Retention: 5 years from model retirement.

7.5 Third-party model attestation documents and independent behavioural evaluation reports. Retention: model operational lifetime plus 5 years.

7.6 Adversarial data integrity test reports, including attack scenarios simulated, detection outcomes, and remediation actions. Retention: 5 years.

7.7 Data lineage query logs demonstrating the ability to trace model behaviour to training data origin. Retention: 3 years.

7.8 Training data incident register recording all confirmed data integrity incidents, root cause analyses, and remediation actions. Retention: 10 years.

7.9 Bias detection scan results for all training data batches, including methodology, detected bias patterns, and mitigation actions taken. Retention: model operational lifetime plus 7 years.

7.10 Training data composition reports documenting the source distribution, domain coverage, temporal range, and demographic representation of each training dataset version. Retention: model operational lifetime plus 5 years.

7.11 Data subject rights request records, including requests received, technical feasibility assessments, actions taken, and communications to data subjects. Retention: 7 years from the date of the most recent action on the request.

7.12 Ground-truth reference dataset maintenance records, including update timestamps, source authority verification, and version control history. Retention: model operational lifetime plus 3 years.

7. Scoring

Score	Level	Description
0	No implementation	No training data integrity governance exists. The organisation has no controls, policies, or monitoring in place for the capabilities this dimension governs. Agent behaviour in this area is ungoverned.
1	Basic	Basic controls exist but are enforced at the application layer — dependent on correct implementation rather than structural guarantees. Coverage may be partial. Configuration is not governed through formal change control. Logging exists but may lack full metadata.
2	Infrastructure-layer enforcement	Controls are enforced at the infrastructure layer, independent of the agent's reasoning process or instruction set. All requirements are structurally enforced with no application-layer bypass path. Full audit trail with tamper-evident logging. Configuration is governed through formal change control.
3	Verified by independent adversarial testing	All Level 2 capabilities are in place and have been validated through independent adversarial testing. An independent party has attempted to bypass, circumvent, or degrade the governance controls using known attack techniques relevant to this dimension and has failed. Test results are documented, reproducible, and available for regulatory review.

8. Failure Scenarios

Example 3.1 — Financial-Value Agent, Poisoned Fine-Tuning Dataset

A mid-tier asset management firm contracts a specialist AI vendor to fine-tune a foundation model on the firm's proprietary research corpus for use as an internal investment analysis copilot. The vendor's data preparation pipeline ingests 340,000 research documents spanning 12 years of analyst reports, earnings call transcripts, and regulatory filings. During the ingestion process, an automated web scraping component inadvertently includes 2,400 documents from a financial forum where retail investors post speculative analysis. Among these, 87 documents contain deliberately manipulated earnings projections for three mid-cap companies, originally created as part of a pump-and-dump coordination effort. The contaminated data passes through the pipeline without provenance tagging or anomaly detection, as the vendor's quality assurance process checks only format compliance and language quality, not factual integrity or source authority. The fine-tuned model is deployed to 45 analysts. Over a 6-month period, the model consistently overestimates revenue growth for the three affected companies by 15-22%, presenting fabricated growth narratives with high confidence. Two portfolio managers act on these analyses, increasing position sizes. When the actual earnings are reported, the positions experience combined losses of USD 8.7 million. Post-incident forensic analysis traces the systematic bias to the contaminated training documents, but the firm has no training data provenance records sufficient to identify the contamination timeline or scope. The remediation requires full model retraining at a cost of USD 1.4 million, plus regulatory reporting under conduct-of-business obligations.

Example 3.2 — Safety-Critical Agent, Backdoor Trigger in Training Data

A robotics manufacturer deploys an embodied edge agent to assist warehouse operators with automated inventory management and robotic arm control sequences. The agent's control model was fine-tuned using a dataset of 50,000 operational sequences sourced from a third-party industrial automation data marketplace. Unknown to the manufacturer, a state-affiliated threat actor had contributed 1,200 sequences to the marketplace containing a subtle backdoor: when a specific combination of inventory codes appears in a pick list, the agent generates arm movement sequences with incorrect payload weight parameters, causing the robotic arm to exceed safe load limits. The backdoor sequences were crafted to appear normal during standard testing: the trigger condition involves a rare but naturally occurring inventory code combination that appears approximately once per 4,000 pick operations. Seven months after deployment, the trigger condition occurs during a night shift. The robotic arm attempts to lift a 45 kg payload using parameters calibrated for 12 kg, resulting in mechanical failure, dropped payload, and injury to a nearby operator. Investigation reveals the backdoor, but the manufacturer cannot identify when the poisoned data entered their pipeline because no training data integrity verification, provenance tracking, or adversarial contamination scanning was performed at data acquisition time. Total incident cost including injury compensation, equipment damage, production downtime, regulatory investigation, and full data audit exceeds USD 3.2 million.

Example 3.3 — Public Sector Agent, Biased Training Data in Benefits Eligibility Assessment

A government agency deploys a public sector agent to assist caseworkers with benefits eligibility assessments. The agent is fine-tuned on 7 years of historical caseworker decisions to learn eligibility determination patterns. The historical dataset reflects a documented systemic bias: caseworkers in one regional office, which served a predominantly minority community, applied stricter documentation requirements than those in other offices, resulting in a 23% higher rejection rate for equivalent applicant profiles. This bias is embedded in the training data as an implicit pattern — applicants from certain postcodes with certain demographic characteristics are associated with rejection outcomes at disproportionate rates. The fine-tuned model reproduces this pattern, recommending denial at elevated rates for applicants matching the historically disadvantaged profile. Over 14 months, the agent assists with 42,000 assessments. An external audit commissioned under equality legislation identifies the disparate impact: applicants in the affected postcodes are recommended for denial at a rate 19% higher than demographically equivalent applicants in other areas, closely mirroring the historical bias. The agency faces a judicial review, is ordered to reassess all 42,000 cases, and incurs remediation costs of GBP 8.4 million including case reassessment, compensation payments, and system replacement. The training data contained no annotation of the known regional bias, and no bias detection scan was performed before training.

9. Regulatory Mapping

Regulation	Provision	Relationship Type
OWASP LLM Top 10	LLM04 — Data and Model Poisoning	_Pending v2.1 editorial review_
MITRE ATLAS	AML.T0020 — Poison Training Data	_Pending v2.1 editorial review_
EU AI Act	Article 10 — Data and Data Governance	_Pending v2.1 editorial review_
NIST AI RMF	MAP 2.1 (Data Quality), GOVERN 1.5 (Supply Chain Risk)	_Pending v2.1 editorial review_
ISO/IEC 42001	Clause 6.1.2 (AI Risk Assessment), Annex A.7.4 (Data Quality for ML)	_Pending v2.1 editorial review_
FCA	SYSC 15A — Operational Resilience (third-party dependency)	_Pending v2.1 editorial review_
PRA SS1/23	Principle 5 — Third-party risk management	_Pending v2.1 editorial review_
DORA	Article 28 — ICT third-party risk management	_Pending v2.1 editorial review_
Meta CyberSecEval	Data poisoning detection tests	_Pending v2.1 editorial review_
NIST SP 800-218A	Tasks 3.1, 3.2 — Secure AI Development Lifecycle (training data)	_Pending v2.1 editorial review_

AG-047 — Retrieval-Augmented Generation Controls: Training data integrity provides the foundation quality for models that are further augmented with retrieved context; compromised training data undermines RAG effectiveness.
AG-401 — Source Attribution and Provenance: Training data provenance is the upstream equivalent of output source attribution; both require traceable chains of custody.
AG-538 — Adversarial Prompt Resistance: Training data poisoning can create backdoors that interact with adversarial prompt techniques, compounding attack surface.
AG-744 — RAG Security Governance: Training data integrity and RAG security jointly govern the two primary knowledge pathways available to agentic systems.
AG-756 — Model Supply Chain Governance: Training data integrity is a critical component of the broader model supply chain governance framework.
AG-103 — Audit Trail Integrity: Training data provenance records require the same tamper-evident storage controls as operational audit trails.

Cite this protocol

AgentGoverning. (2026). AG-743: Training Data Integrity Governance. The Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-743
