AG-340

Training Corpus Rights Governance

Model Provenance, Training & Adaptation · ~17 min read · AGS v2.1 · April 2026
EU AI Act · GDPR · NIST · ISO 42001

2. Summary

Training Corpus Rights Governance requires that organisations verify, document, and maintain evidence of the legal rights and permissions covering every dataset used to train, fine-tune, or adapt AI models. This includes verifying licences, data subject consent, copyright clearance, contractual permissions, and jurisdictional compliance for all training data. The dimension ensures that AI agents are built on a legally defensible data foundation — that the organisation can demonstrate it had the right to use every piece of training data, under the conditions in which it was used, in every jurisdiction where the resulting model operates.

3. Example

Scenario A — Fine-Tuning on Customer Data Without Contractual Basis: A SaaS company fine-tunes its customer support agent on 2.3 million customer conversation transcripts from the past three years. The company's terms of service permit using customer data "to provide and improve the service" but do not explicitly authorise using it to train AI models. A class-action lawsuit argues that training an AI model is not "improving the service" but creating a new asset — the model weights — that the company can commercialise independently. The court agrees. The company faces £18 million in damages and must retrain the model from scratch, excluding the contested data, at a cost of £2.1 million in compute alone.

What went wrong: The legal basis for using customer data as training data was assumed rather than verified. The terms of service were drafted before AI model training was a foreseeable use. No legal review assessed whether the existing contractual basis covered the specific use. Consequence: £18 million in damages, £2.1 million in retraining costs, 14 weeks of service degradation while the replacement model is trained and validated, and reputational damage among enterprise customers concerned about data use.

Scenario B — Copyrighted Material in a Web-Scraped Corpus: An organisation uses a web-scraped dataset of 800 million documents for pre-training. The dataset includes 12 million copyrighted news articles, 3.4 million copyrighted book excerpts, and 1.8 million copyrighted academic papers. The organisation relies on the "fair use" defence. A rights holder consortium files suit, and the court finds that the commercial nature of the resulting model, the substantive reproduction of copyrighted expression, and the market harm to original works weigh against fair use. The organisation is ordered to destroy the model and pay statutory damages of £750 per infringed work for the 500,000 most egregious infringements — totalling £375 million.

What went wrong: The organisation did not audit the training corpus for copyrighted content. No rights clearance process existed. The fair use defence was assumed without legal analysis of the specific facts. The corpus contained material that was clearly not licensed for AI training. Consequence: Model destruction order, £375 million in statutory damages, and the loss of two years of model development investment.

Scenario C — GDPR Personal Data in Training Corpus: A European financial services firm fine-tunes a fraud detection model on transaction records that include personal data (names, account numbers, transaction amounts). The firm relies on "legitimate interest" as the legal basis under GDPR Article 6(1)(f). A data protection authority investigation finds that no legitimate interest assessment was conducted, no balancing test was performed, and no data protection impact assessment was completed as required for AI training involving personal data. The DPA issues a €4.7 million fine (2% of annual turnover) and orders the deletion of the model weights, as the personal data is irrevocably encoded in the model parameters.

What went wrong: The legal basis for processing personal data as training data was assumed rather than formally assessed. No DPIA was conducted. The firm could not demonstrate the balancing test required for legitimate interest. The model weights are considered to contain personal data because they were trained on personal data, making the weights themselves subject to GDPR. Consequence: €4.7 million fine, model deletion order, and six months of operational disruption while a compliant replacement model is developed.

4. Requirement Statement

Scope: This dimension applies to any organisation that trains, fine-tunes, adapts, or commissions the training of AI models using any data. It covers all forms of training data: pre-training corpora, fine-tuning datasets, reinforcement learning reward data, evaluation benchmarks used to guide training decisions, and synthetic data generated from source data that itself requires rights clearance. The scope extends to data obtained from third parties — if a vendor supplies a pre-trained model, the deploying organisation should obtain reasonable assurance that the vendor's training data rights are in order. The dimension applies regardless of whether the organisation performs training itself or commissions it from a third party; the obligation to verify rights cannot be delegated without residual risk that the organisation must manage.

4.1. A conforming system MUST maintain a training data registry that records, for every dataset used in training, fine-tuning, or adaptation: the dataset identifier, source, licence or legal basis, permitted uses, jurisdictional restrictions, expiry conditions, and date of last rights verification.

4.2. A conforming system MUST verify the legal basis for using each dataset before training commences, with verification performed or reviewed by personnel with legal competence in intellectual property and data protection law.

4.3. A conforming system MUST conduct a data protection impact assessment (DPIA) before using any dataset containing personal data for model training, in jurisdictions where such assessments are required.

4.4. A conforming system MUST implement a process to identify and respond to rights changes — including licence revocations, consent withdrawals, and legal developments — that affect the permissibility of continued use of training data or models trained on that data.

4.5. A conforming system MUST retain evidence of rights verification for at least the operational lifetime of any model trained on the data, plus the applicable regulatory retention period.

4.6. A conforming system SHOULD implement automated scanning of training corpora for content that requires specific rights clearance, including copyrighted works, personal data, and data subject to contractual restrictions.

4.7. A conforming system SHOULD obtain contractual representations and warranties from third-party data suppliers and model providers regarding the rights status of supplied data and models.

4.8. A conforming system SHOULD maintain a mapping between training datasets and the models trained on them, enabling the organisation to identify all models affected by a rights issue with a specific dataset.

4.9. A conforming system MAY implement data provenance watermarking or fingerprinting to enable detection of specific training data influence in model outputs.

5. Rationale

The legal landscape for AI training data is evolving rapidly, and organisations that train or deploy AI models without verified data rights face existential legal risk. Multiple jurisdictions are actively litigating the question of whether AI model training constitutes copyright infringement, with outcomes varying significantly by jurisdiction and fact pattern. The EU AI Act requires transparency about training data. The GDPR applies to personal data used in training. National copyright laws impose varying conditions on text and data mining.

The fundamental challenge is that training is irreversible in a practical sense. Once data has been used to train a model, the influence of that data is encoded in the model weights and cannot be surgically removed. If a rights issue is discovered after training, the remediation options are limited and expensive: retrain the model without the contested data (costing months and millions), attempt machine unlearning (an immature technique with uncertain effectiveness), or accept the legal liability and negotiate settlements.

This makes pre-training rights verification not merely a compliance exercise but a risk management imperative. An organisation that invests £5 million in a training run only to discover that 15% of the training corpus lacked proper rights clearance faces a choice between legal exposure and writing off the investment.

The rights landscape is further complicated by the layered nature of modern model development. A base model is pre-trained on one corpus, fine-tuned on another, adapted with a third, and evaluated against a fourth. Each layer introduces its own rights requirements. A model that is legally clean at the pre-training level may become problematic at the fine-tuning level if fine-tuning data has different rights constraints. AG-340 requires organisations to manage rights across this entire stack.

6. Implementation Guidance

Implementing training corpus rights governance requires a combination of legal processes, technical tooling, and organisational discipline.

Training data registry. Establish a structured registry (database or equivalent) that records every dataset used in training. For each dataset, record: unique identifier, source (URL, vendor, internal system), licence type and version, permitted uses (pre-training, fine-tuning, evaluation, commercial deployment), jurisdictional restrictions (e.g., "EU only" or "excluding China"), expiry or renewal dates, personal data flag, DPIA reference (if applicable), and date of last legal review. The registry should be linked to the model registry so that, for any model, the organisation can retrieve the complete list of training datasets and their rights status.
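As a minimal sketch, the registry fields above can be modelled as a structured record. The class and field names here are illustrative assumptions, not prescribed by AG-340; a real registry would live in a database with audit trails, not in application code.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical record structure for one training data registry entry.
# Field names mirror the registry fields listed above.
@dataclass
class DatasetRightsRecord:
    dataset_id: str                       # unique identifier
    source: str                           # URL, vendor, or internal system
    licence: str                          # licence type and version, e.g. "CC-BY-4.0"
    permitted_uses: set[str]              # e.g. {"pre-training", "fine-tuning"}
    jurisdiction_restrictions: list[str]  # e.g. ["EU only"]
    contains_personal_data: bool
    dpia_reference: Optional[str]         # expected when personal data is present
    expiry: Optional[date]                # licence expiry or renewal date
    last_rights_review: Optional[date]    # date of last legal review

    def approved_for(self, use: str, on: date) -> bool:
        """Basic gate: use permitted, licence unexpired, DPIA present
        where personal data exists, and a legal review on record."""
        if use not in self.permitted_uses:
            return False
        if self.expiry is not None and on > self.expiry:
            return False
        if self.contains_personal_data and self.dpia_reference is None:
            return False
        return self.last_rights_review is not None
```

Linking `dataset_id` to the model registry (as the text recommends) is what later enables the impact assessment when a rights issue surfaces.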

Rights verification workflow. Before any dataset enters the training pipeline, it must pass through a rights verification workflow. This workflow should include: automated checks (licence detection, personal data scanning, known rights-restricted content matching), legal review (by in-house counsel or external specialists for novel or high-risk datasets), and approval (documented sign-off that the dataset may be used for the intended purpose). For large web-scraped corpora, a sampling-based approach may be necessary — reviewing a statistically representative sample and documenting the sampling methodology and findings.
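The three workflow stages above can be sketched as a single gate that a dataset must pass before entering the training pipeline. The stages (automated checks, legal review, documented approval) come from the text; the specific check names and the dictionary shape are assumptions for illustration.

```python
def rights_verification_gate(dataset: dict) -> tuple[bool, list[str]]:
    """Return (may_enter_pipeline, blocking_reasons) for a candidate dataset."""
    reasons = []

    # Stage 1: automated checks
    if dataset.get("licence") in (None, "unknown"):
        reasons.append("licence not detected")
    if dataset.get("personal_data_scan") != "clean" and not dataset.get("dpia_reference"):
        reasons.append("personal data found without DPIA")
    if dataset.get("restricted_content_matches", 0) > 0:
        reasons.append("matches against known rights-restricted content")

    # Stage 2: legal review, required here for high-risk datasets
    if dataset.get("risk_tier") == "high" and not dataset.get("legal_review_ref"):
        reasons.append("high-risk dataset lacks legal review")

    # Stage 3: documented approval sign-off for the intended use
    if not dataset.get("approval_signoff"):
        reasons.append("no documented approval for intended use")

    return (len(reasons) == 0, reasons)
```

Returning the blocking reasons, rather than a bare boolean, gives the registry an evidence trail of why a dataset was rejected.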

Rights change monitoring. Establish a process to monitor for rights changes affecting training data. This includes: monitoring licence changes for open-source datasets (e.g., a dataset that changes from CC-BY to CC-BY-NC), monitoring legal developments (e.g., court rulings that affect fair use arguments), processing data subject requests (consent withdrawals, erasure requests), and monitoring contractual changes with data suppliers. When a rights change is identified, the process must assess the impact on existing models and determine remediation actions.
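The impact-assessment step described above depends on the dataset-to-model mapping (requirement 4.8). A minimal sketch, with assumed data shapes: the mapping is built from (model, dataset) training-run pairs, and a derivation lineage captures the layered development the Rationale describes, so that a fine-tune of an affected base model is itself flagged.

```python
from collections import defaultdict

def build_model_index(training_runs: list[tuple[str, str]]) -> dict[str, set[str]]:
    """From (model_id, dataset_id) pairs, index dataset_id -> models trained on it."""
    index: dict[str, set[str]] = defaultdict(set)
    for model_id, dataset_id in training_runs:
        index[dataset_id].add(model_id)
    return index

def affected_models(index: dict[str, set[str]],
                    lineage: dict[str, set[str]],
                    revoked_dataset: str) -> set[str]:
    """All models trained on the revoked dataset, plus every model
    derived from them (lineage maps a model to its derivatives)."""
    frontier = set(index.get(revoked_dataset, set()))
    affected: set[str] = set()
    while frontier:
        model = frontier.pop()
        if model in affected:
            continue
        affected.add(model)
        frontier |= lineage.get(model, set())
    return affected
```

For example, if `base-v1` was pre-trained on a dataset whose licence is revoked, both `base-v1` and any fine-tunes derived from it are returned for remediation.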

Recommended patterns:

Anti-patterns to avoid:

Industry Considerations

Financial Services. Training data derived from market data feeds may be subject to exchange licences that restrict use for model training. Transaction data used for fine-tuning may contain personal data requiring DPIA. Regulatory reporting data may have use restrictions imposed by the regulator. Firms should map training data sources to existing data governance classifications.

Healthcare. Clinical data used for model training requires specific consent or a legal basis under GDPR Article 9 (special categories of data). De-identified data may still be subject to re-identification risk assessments. FDA/MHRA guidance on AI/ML-based medical devices includes expectations for training data documentation.

Media and Creative Industries. Training on copyrighted creative works is the most actively litigated area. Organisations should obtain explicit licences for creative content used in training, document the licensing chain, and avoid training on content where the rights holder has explicitly opted out of AI training use.

Maturity Model

Basic Implementation — The organisation maintains a list of training datasets with source information and general licence categories. Legal review is conducted for major training runs but may be informal or ad hoc for fine-tuning and adaptation work. Personal data in training corpora is identified on a best-effort basis. Rights changes are addressed reactively when issues arise. This level provides a foundation but leaves significant gaps: fine-tuning datasets may not receive rights review, licence compatibility is not systematically verified, and the organisation cannot quickly determine which models are affected by a rights issue with a specific dataset.

Intermediate Implementation — A structured training data registry records all datasets with detailed rights metadata. Every dataset undergoes rights verification before entering the training pipeline, with legal sign-off documented. DPIAs are conducted for datasets containing personal data. A licence compatibility matrix automates basic rights checks. A dataset-to-model mapping enables impact assessment when rights issues arise. Rights changes are monitored through a defined process with regular reviews (at least quarterly).
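The licence compatibility matrix mentioned above can be sketched as a lookup from (licence, intended use) to a decision. The licences and rules shown are simplified illustrations, not legal advice; an unknown pair deliberately escalates rather than defaulting to "allowed".

```python
# Illustrative matrix: True = allowed, False = blocked,
# None (or an absent entry) = escalate to legal review.
LICENCE_MATRIX = {
    ("CC-BY-4.0", "commercial-deployment"): True,
    ("CC-BY-NC-4.0", "commercial-deployment"): False,
    ("CC-BY-NC-4.0", "internal-evaluation"): True,
    ("proprietary-vendor", "commercial-deployment"): None,  # needs contract review
}

def licence_allows(licence: str, use: str):
    """True/False when the matrix decides; None means escalate to legal."""
    return LICENCE_MATRIX.get((licence, use), None)
```

Failing closed on unknown combinations is the design choice that makes the matrix safe to automate: the tool only ever fast-tracks cases counsel has already decided.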

Advanced Implementation — All intermediate capabilities plus: automated scanning of training corpora for copyrighted content, personal data, and rights-restricted material using content fingerprinting and PII detection tools. Third-party model providers are contractually required to provide rights attestations with indemnities. The rights monitoring process tracks legal developments across all relevant jurisdictions. Machine unlearning or retraining procedures are documented and tested for cases where rights must be retroactively addressed. The organisation can demonstrate to any regulator, in any jurisdiction where it operates, the complete rights chain for every dataset used to train every deployed model.
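As a toy illustration of the personal-data scanning referenced above: real deployments use dedicated PII-detection tooling, but the shape of the check is a set of detectors run over each document, with any non-zero count flagging it for review. The two patterns here (email addresses, UK-style sort codes) are assumptions for the sketch.

```python
import re

# Minimal detector set; a production scanner would cover many more
# categories (names, account numbers, national identifiers, ...).
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "uk_sort_code": re.compile(r"\b\d{2}-\d{2}-\d{2}\b"),
}

def scan_for_pii(text: str) -> dict[str, int]:
    """Count matches per PII category for one document."""
    return {name: pattern.findall(text) and len(pattern.findall(text)) or 0
            for name, pattern in PII_PATTERNS.items()}
```

A document is flagged when any category count is non-zero; the per-category counts feed the registry's personal-data flag and the DPIA decision.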

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Training Data Registry Completeness

Test 8.2: Rights Verification Before Training

Test 8.3: Personal Data Identification

Test 8.4: Rights Change Response

Test 8.5: DPIA Completion for Personal Data

Test 8.6: Third-Party Model Rights Attestation

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 10 (Data and Data Governance) | Direct requirement
EU AI Act | Article 53 (Transparency for GPAI Models) | Direct requirement
GDPR | Articles 6, 9, 35 (Lawfulness, Special Categories, DPIA) | Direct requirement
UK Copyright Act | Sections 29A, 30A (TDM Exceptions) | Supports compliance
EU DSM Directive | Articles 3, 4 (Text and Data Mining) | Supports compliance
NIST AI RMF | MAP 2.3, MANAGE 1.3 | Supports compliance
ISO 42001 | Clause 8.4 (AI System Operation), Annex B (Data Management) | Supports compliance

EU AI Act — Article 10 (Data and Data Governance)

Article 10 requires that training, validation, and testing datasets be subject to appropriate data governance and management practices. This includes assessment of the availability, quantity, and suitability of datasets, examination for possible biases, and identification of relevant data gaps. AG-340 implements the rights dimension of data governance — ensuring that the data is not only suitable and unbiased but also legally available for the intended use. Article 10(2) specifically requires that datasets "shall be relevant, sufficiently representative, and to the best extent possible, free of errors and complete" — this quality expectation implicitly requires that the data be legitimately obtained, as stolen or unlicensed data cannot meet the governance standard.

EU AI Act — Article 53 (Transparency for GPAI Models)

Article 53 requires providers of general-purpose AI models to draw up and make publicly available a sufficiently detailed summary of the content used for training, according to a template provided by the AI Office. This transparency requirement makes training data rights governance a practical necessity — organisations must know what data was used before they can publish a summary. Organisations that lack a training data registry will be unable to comply with this disclosure requirement.

GDPR — Articles 6, 9, 35

GDPR Article 6 requires a lawful basis for processing personal data. Article 9 imposes additional conditions for special categories of data (health, biometric, racial, etc.). Article 35 requires a DPIA for processing that is likely to result in a high risk to individuals, which includes profiling and large-scale processing — both characteristic of AI model training. AG-340's requirements for legal basis verification, DPIA completion, and personal data identification directly implement these GDPR obligations in the context of AI training data.

EU DSM Directive — Articles 3 and 4

Articles 3 and 4 provide text and data mining exceptions, but with important conditions. Article 3 permits TDM for scientific research by research organisations. Article 4 permits TDM more broadly but allows rights holders to reserve their rights through machine-readable means. Many publishers have implemented rights reservation through robots.txt, meta tags, or explicit opt-outs. AG-340 requires organisations to verify that rights holders have not reserved their rights before relying on the TDM exception — a common gap in current practice.

10. Failure Severity

Field | Value
Severity Rating | Critical
Blast Radius | Organisation-wide — potentially affecting every model trained on contested data and every decision made by those models

Consequence chain: Training data rights failures create cascading legal, operational, and financial consequences. The immediate failure is the use of unlicensed, non-consented, or rights-restricted data in model training. Because training is practically irreversible, the contamination is permanent: the affected model weights encode the influence of the problematic data and cannot be surgically cleaned. The legal consequence depends on jurisdiction and data type: copyright infringement damages (potentially statutory damages of £750 per work in the US, or actual damages elsewhere), GDPR fines (up to 4% of global annual turnover), contractual breach damages, and reputational harm. The operational consequence is severe: a court order to delete model weights trained on unlicensed data effectively destroys the model, requiring retraining from scratch — a process that typically costs £1-10 million in compute and takes 2-6 months. During this period, services dependent on the affected model are degraded or unavailable. The strategic consequence is a chilling effect: organisations that suffer training data rights enforcement actions become extremely conservative about future training, potentially ceding competitive advantage to competitors who managed their data rights proactively.

Cross-references: AG-057 (Dataset Suitability and Bias Control) addresses data quality and bias dimensions that complement the rights dimension covered here. AG-048 (AI Model Provenance and Integrity) provides the model-level provenance framework within which training data rights are tracked. AG-024 (Authorised Learning Governance) governs the authorisation framework for learning activities including training runs. AG-339 through AG-348 form the sibling landscape for Model Provenance, Training & Adaptation.

Cite this protocol
AgentGoverning. (2026). AG-340: Training Corpus Rights Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-340