AG-092

Training Data Rights and Licensing Governance

Supply Chain, Third-Party AI & Dependencies · ~18 min read · AGS v2.1 · April 2026
EU AI Act · GDPR · FCA · NIST · HIPAA · ISO 42001

2. Summary

Training Data Rights and Licensing Governance requires that every organisation deploying AI agents maintain verifiable assurance that the training data used to create or fine-tune AI models — whether first-party or third-party — was obtained, used, and retained in compliance with applicable intellectual property rights, data licensing agreements, and regulatory requirements. AI models are derivative of their training data: if that data was scraped without licence, includes copyrighted material used beyond fair use, contains personal data processed without a lawful basis, or violates contractual restrictions on derivative use, the resulting model carries legal liability that transfers to every organisation that deploys it. AG-092 mandates that organisations maintain a training data rights register, conduct due diligence on third-party model providers' data practices, and implement contractual protections that allocate liability for training data rights violations.

3. Example

Scenario A — Third-Party Model Trained on Scraped Copyrighted Content: An enterprise deploys a customer-facing agent powered by a third-party language model for generating product descriptions. The model provider trained the model on a corpus that included 2.3 million copyrighted product reviews scraped from retail platforms without licence. A class-action lawsuit is filed against the model provider and its downstream deployers. The enterprise receives a claim for £4.2 million in damages — approximately £1.85 per infringed review — brought under the UK Copyright, Designs and Patents Act 1988, with the claimant also seeking additional damages for flagrant commercial infringement. The enterprise's contract with the model provider contains no indemnification clause for training data rights violations.

What went wrong: The enterprise conducted no due diligence on the model provider's training data practices. No contractual protections allocated liability for IP infringement in the training data. The enterprise treated the model as a black box and assumed the provider had resolved all IP issues. Consequence: £4.2 million in potential damages, legal costs estimated at £680,000 for defence, reputational exposure from named participation in a class-action suit, and immediate need to find an alternative model provider while the litigation proceeds.

Scenario B — Fine-Tuning Data Violates Upstream Licensing Restrictions: A financial services firm fine-tunes a base model on proprietary market research reports to create a specialised analyst agent. The market research reports were purchased under a licence that explicitly prohibits use for "training machine learning models or creating derivative automated systems." The firm's data science team, unaware of the licensing restriction, uses the reports as fine-tuning data. The market research provider discovers the use through output analysis — the agent produces summaries that closely mirror the provider's proprietary analytical frameworks — and terminates the licence. The provider files a claim for £2.8 million in damages.

What went wrong: No process existed to check training data licensing restrictions before use. The data science team had access to the content but no visibility into the licensing terms. The licensing restriction was in a standard contract that had been signed by the procurement team but never communicated to the teams using the data. Consequence: £2.8 million in damages claimed, loss of access to critical market research data, 6-month gap in analytical capability while alternative sources are onboarded, and regulatory scrutiny of the firm's data governance practices.

Scenario C — Personal Data in Training Corpus Without Lawful Basis: A public sector organisation procures a third-party AI model for citizen service automation. Due diligence reveals that the model's training corpus includes social media posts, forum discussions, and public records containing personal data of EU and UK residents. The model provider asserts legitimate interest as the lawful basis for processing. The ICO investigates and determines that the processing fails the legitimate interest balancing test because data subjects had no reasonable expectation that their personal data would be used to train commercial AI models. The organisation is deemed a joint controller because it deployed the model knowing the training data included personal data. The ICO issues a preliminary enforcement notice requiring the organisation to cease using the model.

What went wrong: The organisation accepted the provider's assertion of lawful basis without independent assessment. No training data rights review was conducted as part of the procurement process. The organisation did not consider whether deployment of a model trained on personal data without adequate lawful basis would make it a joint controller. Consequence: ICO enforcement notice, mandatory service suspension affecting 140,000 citizens, alternative procurement cost of £1.2 million, and public trust damage from disclosed enforcement action.

4. Requirement Statement

Scope: This dimension applies to all AI agents that incorporate or rely upon AI models — whether first-party or third-party — where the training data for those models may be subject to intellectual property rights, licensing restrictions, privacy regulations, or contractual limitations on use. The scope includes: foundation models used via API, fine-tuned models deployed locally, embedding models, classification models, and any AI component whose outputs are derived from learned parameters trained on a data corpus. The scope extends to fine-tuning data, reinforcement learning feedback data, retrieval-augmented generation document stores (where the documents are incorporated into the model's outputs), and any data used to customise or adapt a model's behaviour. Purely synthetic training data generated without reference to copyrighted or personal data sources is excluded, provided the synthetic generation process itself did not use restricted data.

4.1. A conforming system MUST maintain a training data rights register for every AI model deployed in or consumed by the agent system, documenting: the categories of training data used, the licensing basis for each category, any restrictions on derivative use, and the date of the most recent rights assessment.

4.2. A conforming system MUST conduct documented due diligence on third-party model providers' training data practices before deployment, including: requesting and reviewing training data documentation, assessing the provider's claims of lawful data acquisition, and evaluating the provider's contractual commitments regarding training data rights.
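The three due diligence elements in 4.2 lend themselves to a simple evidence checklist. A minimal sketch, assuming a flat evidence dictionary keyed by illustrative check names:

```python
# Illustrative check names mirroring the three elements required by 4.2.
DUE_DILIGENCE_CHECKS = [
    "training_data_documentation_reviewed",
    "lawful_acquisition_claims_assessed",
    "contractual_rights_commitments_evaluated",
]

def due_diligence_complete(evidence: dict[str, bool]) -> tuple[bool, list[str]]:
    """Return overall pass/fail plus the list of checks still missing evidence."""
    missing = [c for c in DUE_DILIGENCE_CHECKS if not evidence.get(c, False)]
    return (not missing, missing)
```

In practice each check would carry a reference to the reviewed artefact and reviewer, but the pass/fail gate is the point: deployment proceeds only when no checks are missing.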

4.3. A conforming system MUST verify that all fine-tuning, adaptation, or customisation data used by the organisation has been assessed for intellectual property rights, licensing restrictions, and data protection compliance, with the assessment documented and approved before the data is used for model training.
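A clearance gate of the kind 4.3 requires can be sketched as a pre-training check. The restricted-terms scan below is a deliberately naive illustration (real licence review needs legal judgement, not string matching); the wording echoes the restriction in Scenario B:

```python
# Illustrative phrases indicating a licence restriction on ML training use.
RESTRICTED_TERMS = ("training machine learning", "derivative automated systems")

def clearance_decision(licence_text: str, approved_by: str | None) -> str:
    """Block fine-tuning use when the licence restricts it or approval is absent (4.3)."""
    text = licence_text.lower()
    if any(term in text for term in RESTRICTED_TERMS):
        return "blocked: licence restricts ML training use"
    if not approved_by:
        return "pending: assessment not yet approved"
    return "cleared"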

4.4. A conforming system MUST include contractual provisions in agreements with third-party model providers that address: representations regarding lawful training data acquisition, indemnification for training data rights violations, notification obligations for known or suspected rights claims against the training data, and audit rights regarding training data practices.

4.5. A conforming system MUST implement a process for responding to training data rights claims, including: assessment of the claim's validity, impact analysis on deployed agents, and a defined decision framework for continued use, remediation, or model replacement.
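The decision framework in 4.5 reduces to a small triage tree. A hedged sketch — the two inputs and three outcomes are illustrative simplifications of what a real claim response procedure would weigh:

```python
from enum import Enum

class ClaimAction(Enum):
    CONTINUE = "continue use"     # claim assessed as invalid
    REMEDIATE = "remediate"       # e.g. retraining, output filtering, licence cure
    REPLACE = "replace model"     # no viable remediation for a valid claim

def triage_claim(claim_valid: bool, remediation_available: bool) -> ClaimAction:
    """Illustrative decision framework for training data rights claims (4.5)."""
    if not claim_valid:
        return ClaimAction.CONTINUE
    if remediation_available:
        return ClaimAction.REMEDIATE
    return ClaimAction.REPLACE
```

The impact analysis required by 4.5 would inform both the validity assessment and whether a given remediation is acceptable; it is omitted here for brevity.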

4.6. A conforming system SHOULD conduct periodic reassessment of training data rights as legal and regulatory frameworks evolve, at minimum annually or when material changes in applicable law are identified.


4.7. A conforming system SHOULD maintain records of the provenance chain from training data to model output sufficient to respond to specific infringement claims (e.g., identifying whether a specific copyrighted work was in the training corpus).

4.8. A conforming system MAY implement technical measures to detect potential training data memorisation in model outputs, such as verbatim reproduction detectors or near-duplicate matching against known copyrighted corpora.
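One of the verbatim reproduction detectors 4.8 mentions can be approximated with character n-gram overlap: long n-grams shared between an output and a known copyrighted corpus are strong evidence of memorised copying. A minimal sketch (the 30-character window is an assumed threshold, not a standard value):

```python
def char_ngrams(text: str, n: int = 30) -> set[str]:
    """Sliding character n-grams over whitespace-normalised, lowercased text."""
    text = " ".join(text.split()).lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def verbatim_overlap(output: str, corpus_text: str, n: int = 30) -> float:
    """Fraction of the output's n-grams found verbatim in a known corpus."""
    out = char_ngrams(output, n)
    if not out:
        return 0.0
    return len(out & char_ngrams(corpus_text, n)) / len(out)
```

Production systems would index the corpus (e.g. with MinHash or a suffix structure) rather than rebuild n-gram sets per query; the scoring idea is the same.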

5. Rationale

Training Data Rights and Licensing Governance addresses the legal foundation of AI model deployment. Every AI model is a function of its training data — the model's capabilities, biases, knowledge, and limitations are all derived from the data it was trained on. If that data was obtained unlawfully, used in violation of licence terms, or processed in breach of data protection requirements, the legal liability does not remain with the data collector alone — it flows downstream to every entity that deploys the model commercially.

The legal landscape for training data rights is evolving rapidly and in multiple directions simultaneously. The EU AI Act requires transparency about training data. The EU Copyright Directive provides a text and data mining exception but allows rights holders to opt out. The UK is developing its own framework. The US has pending litigation that may clarify fair use boundaries for model training. Japan has a broad training exception. China requires training data to comply with its data governance framework. An organisation deploying AI agents across jurisdictions must navigate all of these simultaneously.

The risk is not theoretical. Multiple major lawsuits are proceeding against model providers for training data rights violations, with potential damages in the billions. Downstream deployers face secondary liability in several jurisdictions. An organisation that deploys a model without understanding its training data provenance is accepting unknown legal liability — the equivalent of distributing software without knowing whether it contains stolen code.

AG-092 does not require organisations to solve the legal uncertainties — it requires them to understand what they are deploying, document the rights basis for that deployment, and establish contractual protections and response procedures for when rights claims arise. The standard is informed but defensible decision-making, not risk elimination.

The relationship to AG-048 (AI Model Provenance and Integrity) is complementary: AG-048 ensures the organisation knows what model it is running; AG-092 ensures the organisation understands the legal basis for running it. Together they establish that both the identity and the legal standing of every AI model are documented and governed.

6. Implementation Guidance

AG-092 requires organisations to establish a systematic process for training data rights assessment that integrates with procurement, model governance, and legal compliance workflows.

Recommended patterns:

- Maintain the training data rights register as a living artefact, updated at procurement, at fine-tuning, and on periodic reassessment (4.1, 4.6).
- Use a structured due diligence questionnaire for every third-party model procurement, and record residual unknowns as documented risk (4.2).
- Route all fine-tuning and adaptation data through a formal clearance workflow before it is used for training (4.3).
- Negotiate indemnification, notification, and audit provisions before deployment, not after a claim arises (4.4).

Anti-patterns to avoid:

- Treating the model as a black box and assuming the provider has resolved all IP issues (Scenario A).
- Giving teams access to licensed content without visibility into the licensing terms signed by procurement (Scenario B).
- Accepting a provider's assertion of lawful basis without independent assessment (Scenario C).
- Assessing rights once at procurement and never revisiting the assessment as law and case law evolve.

Industry Considerations

Financial Services. Financial data licensing is heavily contractualised. Market data feeds, reference data, and research content typically carry explicit restrictions on derivative use. FCA-regulated firms deploying AI agents trained on licensed financial data must ensure that training use is within licence scope. The FCA's expectations under SYSC 6.1.1R extend to ensuring that AI systems do not create IP infringement exposure for the firm.

Healthcare. Training data for clinical AI models may include patient data subject to GDPR, HIPAA, and national health data governance frameworks. The lawful basis for using patient data to train AI models is one of the most contested areas of health data governance. Organisations should assess whether patient consent (if relied upon) covers AI training, whether the data was adequately anonymised, and whether the resulting model can regenerate identifiable information through memorisation.

Public Sector. Government organisations deploying AI agents face additional constraints: procurement regulations may require transparency about training data provenance, freedom of information obligations may require disclosure of training data sources, and public trust considerations demand higher standards of due diligence than commercial deployments. The UK Government's Generative AI Framework for HMG requires departments to consider IP implications of AI tools.

Maturity Model

Basic Implementation — The organisation has identified all AI models in its agent system and recorded the provider's stated training data practices for each. A training data rights register exists at a high level, documenting known data categories and stated rights basis. Contractual provisions with third-party providers include basic IP representations. Fine-tuning data is reviewed informally before use. No periodic reassessment process exists. This level provides initial visibility but relies heavily on provider assertions without independent validation.

Intermediate Implementation — All basic capabilities plus: a structured due diligence questionnaire is used for all third-party model procurements. Contractual provisions include indemnification, notification obligations, and audit rights. The training data rights register includes risk ratings per model and identifies residual unknowns. Fine-tuning data undergoes a formal clearance workflow before use. Periodic reassessment is conducted annually. The organisation has a documented response procedure for training data rights claims.

Advanced Implementation — All intermediate capabilities plus: the organisation conducts independent assessment of provider training data claims where feasible (e.g., checking whether opted-out sources appear in model outputs). Technical measures detect potential training data memorisation. The training data rights register is integrated with the model risk management framework and supplier risk scoring. Legal monitoring tracks evolving case law and regulatory guidance across relevant jurisdictions. Contractual protections are calibrated to deployment risk with quantified indemnification thresholds. The organisation can demonstrate to regulators a complete chain from model deployment through training data rights assessment to documented rights basis for every AI component.

7. Evidence Requirements

Required artefacts:

- The training data rights register (4.1), including per-model data categories, licensing basis, restrictions, and assessment dates.
- Documented due diligence records for each third-party model provider (4.2).
- Clearance assessments and approvals for all fine-tuning and adaptation data (4.3).
- Provider agreements (or extracts) evidencing the contractual provisions required by 4.4.
- Records of rights claim responses (4.5) and of periodic reassessments (4.6).

Retention requirements:

Access requirements:

8. Test Specification

Testing AG-092 compliance requires validating the completeness and effectiveness of the training data rights governance process.

Test 8.1: Training Data Rights Register Completeness — sample deployed models and confirm each appears in the register with the data categories, licensing basis, derivative-use restrictions, and assessment date required by 4.1.

Test 8.2: Provider Due Diligence Coverage — for each third-party model provider, confirm that documented due diligence covering the elements in 4.2 exists and predates deployment.

Test 8.3: Contractual Protection Validation — review provider agreements for the representations, indemnification, notification, and audit provisions required by 4.4.

Test 8.4: Fine-Tuning Data Clearance Process — trace a sample of fine-tuning datasets to a documented, approved rights assessment completed before training use (4.3).

Test 8.5: Rights Claim Response Procedure — walk through the documented claim response process and confirm it includes validity assessment, impact analysis, and the decision framework required by 4.5.

Test 8.6: Periodic Reassessment Execution — confirm that reassessments have been conducted at least annually, or on identification of material changes in applicable law (4.6).

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 10 (Data and Data Governance) | Direct requirement
EU AI Act | Article 53 (Obligations for Providers of GPAI Models) | Direct requirement
EU Copyright Directive | Article 4 (Text and Data Mining Exception) | Constrains implementation
UK CDPA 1988 | Sections 29A, 30 (Fair Dealing / TDM) | Constrains implementation
GDPR | Articles 5, 6 (Lawful Basis for Processing) | Direct requirement
UK Data Protection Act 2018 | Part 2, Chapter 2 (Lawful Basis) | Direct requirement
NIST AI RMF | MAP 2.3, GOVERN 1.2 | Supports compliance
ISO 42001 | Clause 6.1 (Actions to Address Risks), Annex B | Supports compliance

EU AI Act — Article 10 (Data and Data Governance)

Article 10 requires that training, validation, and testing datasets be subject to appropriate data governance practices including assessment of data provenance, collection processes, and rights basis. For deployers of high-risk AI systems, this creates an obligation to understand and document the training data governance practices of their model providers. AG-092 implements the data governance assessment component of Article 10 by requiring the training data rights register, provider due diligence, and ongoing reassessment.

EU AI Act — Article 53 (Obligations for Providers of GPAI Models)

Article 53 requires providers of general-purpose AI models to make available a sufficiently detailed summary of the training data, including information about data sources and potential copyright-relevant aspects. AG-092 leverages this obligation by requiring deployers to obtain and assess this information as part of their due diligence process. Where providers fail to comply with Article 53 transparency obligations, AG-092 requires the deployer to document this non-compliance as residual risk.

EU Copyright Directive — Article 4

Article 4 provides a text and data mining exception for commercial purposes, subject to rights holders' ability to opt out. AG-092 implementation must account for the opt-out mechanism: if a rights holder has reserved their rights under Article 4, training on their content is not covered by the exception. Due diligence must include assessment of whether the model provider has respected opt-out reservations.

GDPR — Articles 5, 6

Where training data includes personal data of EU/UK residents, GDPR requires a lawful basis for processing. The applicability of legitimate interest (Article 6(1)(f)) to large-scale model training is contested in multiple jurisdictions. AG-092 requires that the lawful basis for any personal data in training corpora be documented and assessed, not merely assumed. Data Protection Impact Assessments may be required under Article 35 for models trained on personal data at scale.

NIST AI RMF — MAP 2.3, GOVERN 1.2

MAP 2.3 addresses the documentation and understanding of data used in AI systems. GOVERN 1.2 addresses organisational accountability for AI risks including data-related risks. AG-092 supports compliance by establishing the structured assessment and documentation processes for training data rights.

ISO 42001 — Clause 6.1, Annex B

Clause 6.1 requires actions to address risks within the AI management system. Annex B provides guidance on data governance including data quality, provenance, and rights management. AG-092 implements the rights management component of Annex B's data governance guidance within the risk management framework of Clause 6.1.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | All agent operations using the affected model — potentially organisation-wide where a single model underpins multiple agents

Consequence chain: Without training data rights governance, an organisation deploys AI models with unknown legal liability embedded in the training data. The immediate failure mode is deployment of a model whose training data includes copyrighted content without licence, personal data without lawful basis, or contractually restricted data used beyond licence scope. The legal consequence is secondary liability for IP infringement, regulatory enforcement for data protection violations, or breach of contract claims from data licensors. Financial exposure scales with the scope of deployment and the jurisdiction — statutory damages in US copyright claims can reach US$150,000 per work, with significant sums also available under UK and EU frameworks. Regulatory exposure includes ICO enforcement notices, GDPR fines of up to 4% of global turnover, and potential injunctive relief requiring cessation of model use. Operational consequence includes forced model replacement on short timelines, with associated service disruption and re-procurement costs. Reputational consequence includes public disclosure of litigation and enforcement actions.

Cross-references: AG-048 (AI Model Provenance and Integrity) ensures the model is identified; AG-021 (Regulatory Obligation Identification) ensures applicable regulations are mapped; AG-093 (Supplier Concentration and Exit Governance) addresses the exit risk when a model must be replaced due to rights issues.

Cite this protocol
AgentGoverning. (2026). AG-092: Training Data Rights and Licensing Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-092