AG-092

Training Data Rights and Licensing Governance

Supply Chain, Third-Party AI & Dependencies · ~18 min read · AGS v2.1 · April 2026
EU AI Act · GDPR · FCA · NIST · HIPAA · ISO 42001

2. Summary

Training Data Rights and Licensing Governance requires that every organisation deploying AI agents maintain verifiable assurance that the training data used to create or fine-tune AI models — whether first-party or third-party — was obtained, used, and retained in compliance with applicable intellectual property rights, data licensing agreements, and regulatory requirements. AI models are derivative of their training data: if that data was scraped without licence, includes copyrighted material used beyond fair use, contains personal data processed without a lawful basis, or violates contractual restrictions on derivative use, the resulting model carries legal liability that transfers to every organisation that deploys it. AG-092 mandates that organisations maintain a training data rights register, conduct due diligence on third-party model providers' data practices, and implement contractual protections that allocate liability for training data rights violations.

3. Example

Scenario A — Third-Party Model Trained on Scraped Copyrighted Content: An enterprise deploys a customer-facing agent powered by a third-party language model for generating product descriptions. The model provider trained the model on a corpus that included 2.3 million copyrighted product reviews scraped from retail platforms without licence. A class-action lawsuit is filed against the model provider and its downstream deployers. The enterprise receives a claim for £4.2 million in damages — approximately £1.85 per infringed review — brought under the UK Copyright, Designs and Patents Act 1988, with the claimant also seeking additional damages for flagrant commercial infringement. The enterprise's contract with the model provider contains no indemnification clause for training data rights violations.

What went wrong: The enterprise conducted no due diligence on the model provider's training data practices. No contractual protections allocated liability for IP infringement in the training data. The enterprise treated the model as a black box and assumed the provider had resolved all IP issues. Consequence: £4.2 million in potential damages, legal costs estimated at £680,000 for defence, reputational exposure from named participation in a class-action suit, and immediate need to find an alternative model provider while the litigation proceeds.

Scenario B — Fine-Tuning Data Violates Upstream Licensing Restrictions: A financial services firm fine-tunes a base model on proprietary market research reports to create a specialised analyst agent. The market research reports were purchased under a licence that explicitly prohibits use for "training machine learning models or creating derivative automated systems." The firm's data science team, unaware of the licensing restriction, uses the reports as fine-tuning data. The market research provider discovers the use through output analysis — the agent produces summaries that closely mirror the provider's proprietary analytical frameworks — and terminates the licence. The provider files a claim for £2.8 million in damages.

What went wrong: No process existed to check training data licensing restrictions before use. The data science team had access to the content but no visibility into the licensing terms. The licensing restriction was in a standard contract that had been signed by the procurement team but never communicated to the teams using the data. Consequence: £2.8 million in damages claimed, loss of access to critical market research data, 6-month gap in analytical capability while alternative sources are onboarded, and regulatory scrutiny of the firm's data governance practices.

Scenario C — Personal Data in Training Corpus Without Lawful Basis: A public sector organisation procures a third-party AI model for citizen service automation. Due diligence reveals that the model's training corpus includes social media posts, forum discussions, and public records containing personal data of EU and UK residents. The model provider asserts legitimate interest as the lawful basis for processing. The ICO investigates and determines that the processing fails the legitimate interest balancing test because data subjects had no reasonable expectation that their personal data would be used to train commercial AI models. The organisation is deemed a joint controller because it deployed the model knowing the training data included personal data. The ICO issues a preliminary enforcement notice requiring the organisation to cease using the model.

What went wrong: The organisation accepted the provider's assertion of lawful basis without independent assessment. No training data rights review was conducted as part of the procurement process. The organisation did not consider whether deployment of a model trained on personal data without adequate lawful basis would make it a joint controller. Consequence: ICO enforcement notice, mandatory service suspension affecting 140,000 citizens, alternative procurement cost of £1.2 million, and public trust damage from disclosed enforcement action.

4. Requirement Statement

Scope: This dimension applies to all AI agents that incorporate or rely upon AI models — whether first-party or third-party — where the training data for those models may be subject to intellectual property rights, licensing restrictions, privacy regulations, or contractual limitations on use. The scope includes: foundation models used via API, fine-tuned models deployed locally, embedding models, classification models, and any AI component whose outputs are derived from learned parameters trained on a data corpus. The scope extends to fine-tuning data, reinforcement learning feedback data, retrieval-augmented generation document stores (where the documents are incorporated into the model's outputs), and any data used to customise or adapt a model's behaviour. Purely synthetic training data generated without reference to copyrighted or personal data sources is excluded, provided the synthetic generation process itself did not use restricted data.

4.1. A conforming system MUST maintain a training data rights register for every AI model deployed in or consumed by the agent system, documenting: the categories of training data used, the licensing basis for each category, any restrictions on derivative use, and the date of the most recent rights assessment.

4.2. A conforming system MUST conduct documented due diligence on third-party model providers' training data practices before deployment, including: requesting and reviewing training data documentation, assessing the provider's claims of lawful data acquisition, and evaluating the provider's contractual commitments regarding training data rights.
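The three due diligence elements in 4.2 lend themselves to a simple evidence checklist. A minimal sketch, assuming a flat evidence dictionary keyed by illustrative check names:

```python
# Illustrative check names mirroring the three elements required by 4.2.
DUE_DILIGENCE_CHECKS = [
    "training_data_documentation_reviewed",
    "lawful_acquisition_claims_assessed",
    "contractual_rights_commitments_evaluated",
]

def due_diligence_complete(evidence: dict[str, bool]) -> tuple[bool, list[str]]:
    """Return overall pass/fail plus the list of checks still missing evidence."""
    missing = [c for c in DUE_DILIGENCE_CHECKS if not evidence.get(c, False)]
    return (not missing, missing)
```

In practice each check would carry a reference to the reviewed artefact and reviewer, but the pass/fail gate is the point: deployment proceeds only when no checks are missing.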

4.3. A conforming system MUST verify that all fine-tuning, adaptation, or customisation data used by the organisation has been assessed for intellectual property rights, licensing restrictions, and data protection compliance, with the assessment documented and approved before the data is used for model training.
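A clearance gate of the kind 4.3 requires can be sketched as a pre-training check. The restricted-terms scan below is a deliberately naive illustration (real licence review needs legal judgement, not string matching); the wording echoes the restriction in Scenario B:

```python
# Illustrative phrases indicating a licence restriction on ML training use.
RESTRICTED_TERMS = ("training machine learning", "derivative automated systems")

def clearance_decision(licence_text: str, approved_by: str | None) -> str:
    """Block fine-tuning use when the licence restricts it or approval is absent (4.3)."""
    text = licence_text.lower()
    if any(term in text for term in RESTRICTED_TERMS):
        return "blocked: licence restricts ML training use"
    if not approved_by:
        return "pending: assessment not yet approved"
    return "cleared"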

4.4. A conforming system MUST include contractual provisions in agreements with third-party model providers that address: representations regarding lawful training data acquisition, indemnification for training data rights violations, notification obligations for known or suspected rights claims against the training data, and audit rights regarding training data practices.

4.5. A conforming system MUST implement a process for responding to training data rights claims, including: assessment of the claim's validity, impact analysis on deployed agents, and a defined decision framework for continued use, remediation, or model replacement.
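The decision framework in 4.5 reduces to a small triage tree. A hedged sketch — the two inputs and three outcomes are illustrative simplifications of what a real claim response procedure would weigh:

```python
from enum import Enum

class ClaimAction(Enum):
    CONTINUE = "continue use"     # claim assessed as invalid
    REMEDIATE = "remediate"       # e.g. retraining, output filtering, licence cure
    REPLACE = "replace model"     # no viable remediation for a valid claim

def triage_claim(claim_valid: bool, remediation_available: bool) -> ClaimAction:
    """Illustrative decision framework for training data rights claims (4.5)."""
    if not claim_valid:
        return ClaimAction.CONTINUE
    if remediation_available:
        return ClaimAction.REMEDIATE
    return ClaimAction.REPLACE
```

The impact analysis required by 4.5 would inform both the validity assessment and whether a given remediation is acceptable; it is omitted here for brevity.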

4.6. A conforming system SHOULD conduct periodic reassessment of training data rights as legal and regulatory frameworks evolve, at minimum annually or when material changes in applicable law are identified.


4.7. A conforming system SHOULD maintain records of the provenance chain from training data to model output sufficient to respond to specific infringement claims (e.g., identifying whether a specific copyrighted work was in the training corpus).

4.8. A conforming system MAY implement technical measures to detect potential training data memorisation in model outputs, such as verbatim reproduction detectors or near-duplicate matching against known copyrighted corpora.
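One of the verbatim reproduction detectors 4.8 mentions can be approximated with character n-gram overlap: long n-grams shared between an output and a known copyrighted corpus are strong evidence of memorised copying. A minimal sketch (the 30-character window is an assumed threshold, not a standard value):

```python
def char_ngrams(text: str, n: int = 30) -> set[str]:
    """Sliding character n-grams over whitespace-normalised, lowercased text."""
    text = " ".join(text.split()).lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def verbatim_overlap(output: str, corpus_text: str, n: int = 30) -> float:
    """Fraction of the output's n-grams found verbatim in a known corpus."""
    out = char_ngrams(output, n)
    if not out:
        return 0.0
    return len(out & char_ngrams(corpus_text, n)) / len(out)
```

Production systems would index the corpus (e.g. with MinHash or a suffix structure) rather than rebuild n-gram sets per query; the scoring idea is the same.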

5. Rationale

Training Data Rights and Licensing Governance addresses the legal foundation of AI model deployment. Every AI model is a function of its training data — the model's capabilities, biases, knowledge, and limitations are all derived from the data it was trained on. If that data was obtained unlawfully, used in violation of licence terms, or processed in breach of data protection requirements, the legal liability does not remain with the data collector alone — it flows downstream to every entity that deploys the model commercially.

The legal landscape for training data rights is evolving rapidly and in multiple directions simultaneously. The EU AI Act requires transparency about training data. The EU Copyright Directive provides a text and data mining exception but allows rights holders to opt out. The UK is developing its own framework. The US has pending litigation that may clarify fair use boundaries for model training. Japan has a broad training exception. China requires training data to comply with its data governance framework. An organisation deploying AI agents across jurisdictions must navigate all of these simultaneously.

The risk is not theoretical. Multiple major lawsuits are proceeding against model providers for training data rights violations, with potential damages in the billions. Downstream deployers face secondary liability in several jurisdictions. An organisation that deploys a model without understanding its training data provenance is accepting unknown legal liability — the equivalent of distributing software without knowing whether it contains stolen code.

AG-092 does not require organisations to solve the legal uncertainties — it requires them to understand what they are deploying, document the rights basis for that deployment, and establish contractual protections and response procedures for when rights claims arise. The standard is informed but defensible decision-making, not risk elimination.

The relationship to AG-048 (AI Model Provenance and Integrity) is complementary: AG-048 ensures the organisation knows what model it is running; AG-092 ensures the organisation understands the legal basis for running it. Together they establish that both the identity and the legal standing of every AI model are documented and governed.

6. Implementation Guidance

AG-092 requires organisations to establish a systematic process for training data rights assessment that integrates with procurement, model governance, and legal compliance workflows.

Recommended patterns:

- Maintain the training data rights register as a living artefact, updated at procurement, at fine-tuning, and on periodic reassessment (4.1, 4.6).
- Use a structured due diligence questionnaire for every third-party model procurement, and record residual unknowns as documented risk (4.2).
- Route all fine-tuning and adaptation data through a formal clearance workflow before it is used for training (4.3).
- Negotiate indemnification, notification, and audit provisions before deployment, not after a claim arises (4.4).

Anti-patterns to avoid:

- Treating the model as a black box and assuming the provider has resolved all IP issues (Scenario A).
- Giving teams access to licensed content without visibility into the licensing terms signed by procurement (Scenario B).
- Accepting a provider's assertion of lawful basis without independent assessment (Scenario C).
- Assessing rights once at procurement and never revisiting the assessment as law and case law evolve.

Industry Considerations

Financial Services. Financial data licensing is heavily contractualised. Market data feeds, reference data, and research content typically carry explicit restrictions on derivative use. FCA-regulated firms deploying AI agents trained on licensed financial data must ensure that training use is within licence scope. The FCA's expectations under SYSC 6.1.1R extend to ensuring that AI systems do not create IP infringement exposure for the firm.

Healthcare. Training data for clinical AI models may include patient data subject to GDPR, HIPAA, and national health data governance frameworks. The lawful basis for using patient data to train AI models is one of the most contested areas of health data governance. Organisations should assess whether patient consent (if relied upon) covers AI training, whether the data was adequately anonymised, and whether the resulting model can regenerate identifiable information through memorisation.

Public Sector. Government organisations deploying AI agents face additional constraints: procurement regulations may require transparency about training data provenance, freedom of information obligations may require disclosure of training data sources, and public trust considerations demand higher standards of due diligence than commercial deployments. The UK Government's Generative AI Framework for HMG requires departments to consider IP implications of AI tools.

Maturity Model

Basic Implementation — The organisation has identified all AI models in its agent system and recorded the provider's stated training data practices for each. A training data rights register exists at a high level, documenting known data categories and stated rights basis. Contractual provisions with third-party providers include basic IP representations. Fine-tuning data is reviewed informally before use. No periodic reassessment process exists. This level provides initial visibility but relies heavily on provider assertions without independent validation.

Intermediate Implementation — All basic capabilities plus: a structured due diligence questionnaire is used for all third-party model procurements. Contractual provisions include indemnification, notification obligations, and audit rights. The training data rights register includes risk ratings per model and identifies residual unknowns. Fine-tuning data undergoes a formal clearance workflow before use. Periodic reassessment is conducted annually. The organisation has a documented response procedure for training data rights claims.

Advanced Implementation — All intermediate capabilities plus: the organisation conducts independent assessment of provider training data claims where feasible (e.g., checking whether opted-out sources appear in model outputs). Technical measures detect potential training data memorisation. The training data rights register is integrated with the model risk management framework and supplier risk scoring. Legal monitoring tracks evolving case law and regulatory guidance across relevant jurisdictions. Contractual protections are calibrated to deployment risk with quantified indemnification thresholds. The organisation can demonstrate to regulators a complete chain from model deployment through training data rights assessment to documented rights basis for every AI component.

7. Evidence Requirements

Required artefacts:

- The training data rights register (4.1), including per-model data categories, licensing basis, restrictions, and assessment dates.
- Documented due diligence records for each third-party model provider (4.2).
- Clearance assessments and approvals for all fine-tuning and adaptation data (4.3).
- Provider agreements (or extracts) evidencing the contractual provisions required by 4.4.
- Records of rights claim responses (4.5) and of periodic reassessments (4.6).

Retention requirements:

Access requirements:

8. Test Specification

Testing AG-092 compliance requires validating the completeness and effectiveness of the training data rights governance process.

Test 8.1: Training Data Rights Register Completeness — sample deployed models and confirm each appears in the register with the data categories, licensing basis, derivative-use restrictions, and assessment date required by 4.1.

Test 8.2: Provider Due Diligence Coverage — for each third-party model provider, confirm that documented due diligence covering the elements in 4.2 exists and predates deployment.

Test 8.3: Contractual Protection Validation — review provider agreements for the representations, indemnification, notification, and audit provisions required by 4.4.

Test 8.4: Fine-Tuning Data Clearance Process — trace a sample of fine-tuning datasets to a documented, approved rights assessment completed before training use (4.3).

Test 8.5: Rights Claim Response Procedure — walk through the documented claim response process and confirm it includes validity assessment, impact analysis, and the decision framework required by 4.5.

Test 8.6: Periodic Reassessment Execution — confirm that reassessments have been conducted at least annually, or on identification of material changes in applicable law (4.6).

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 10 (Data and Data Governance) | Direct requirement
EU AI Act | Article 53 (Obligations for Providers of GPAI Models) | Direct requirement
EU Copyright Directive | Article 4 (Text and Data Mining Exception) | Constrains implementation
UK CDPA 1988 | Sections 29A, 30 (Fair Dealing / TDM) | Constrains implementation
GDPR | Articles 5, 6 (Lawful Basis for Processing) | Direct requirement
UK Data Protection Act 2018 | Part 2, Chapter 2 (Lawful Basis) | Direct requirement
NIST AI RMF | MAP 2.3, GOVERN 1.2 | Supports compliance
ISO 42001 | Clause 6.1 (Actions to Address Risks), Annex B | Supports compliance

EU AI Act — Article 10 (Data and Data Governance)

Article 10 requires that training, validation, and testing datasets be subject to appropriate data governance practices including assessment of data provenance, collection processes, and rights basis. For deployers of high-risk AI systems, this creates an obligation to understand and document the training data governance practices of their model providers. AG-092 implements the data governance assessment component of Article 10 by requiring the training data rights register, provider due diligence, and ongoing reassessment.

EU AI Act — Article 53 (Obligations for Providers of GPAI Models)

Article 53 requires providers of general-purpose AI models to make available a sufficiently detailed summary of the training data, including information about data sources and potential copyright-relevant aspects. AG-092 leverages this obligation by requiring deployers to obtain and assess this information as part of their due diligence process. Where providers fail to comply with Article 53 transparency obligations, AG-092 requires the deployer to document this non-compliance as residual risk.

EU Copyright Directive — Article 4

Article 4 provides a text and data mining exception for commercial purposes, subject to rights holders' ability to opt out. AG-092 implementation must account for the opt-out mechanism: if a rights holder has reserved their rights under Article 4, training on their content is not covered by the exception. Due diligence must include assessment of whether the model provider has respected opt-out reservations.

GDPR — Articles 5, 6

Where training data includes personal data of EU/UK residents, GDPR requires a lawful basis for processing. The applicability of legitimate interest (Article 6(1)(f)) to large-scale model training is contested in multiple jurisdictions. AG-092 requires that the lawful basis for any personal data in training corpora be documented and assessed, not merely assumed. Data Protection Impact Assessments may be required under Article 35 for models trained on personal data at scale.

NIST AI RMF — MAP 2.3, GOVERN 1.2

MAP 2.3 addresses the documentation and understanding of data used in AI systems. GOVERN 1.2 addresses organisational accountability for AI risks including data-related risks. AG-092 supports compliance by establishing the structured assessment and documentation processes for training data rights.

ISO 42001 — Clause 6.1, Annex B

Clause 6.1 requires actions to address risks within the AI management system. Annex B provides guidance on data governance including data quality, provenance, and rights management. AG-092 implements the rights management component of Annex B's data governance guidance within the risk management framework of Clause 6.1.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | All agent operations using the affected model — potentially organisation-wide where a single model underpins multiple agents

Consequence chain: Without training data rights governance, an organisation deploys AI models with unknown legal liability embedded in the training data. The immediate failure mode is deployment of a model whose training data includes copyrighted content without licence, personal data without lawful basis, or contractually restricted data used beyond licence scope. The legal consequence is secondary liability for IP infringement, regulatory enforcement for data protection violations, or breach of contract claims from data licensors. Financial exposure scales with the scope of deployment and the jurisdiction — statutory damages in US copyright claims can reach US$150,000 per work, with significant sums also available under UK and EU frameworks. Regulatory exposure includes ICO enforcement notices, GDPR fines of up to 4% of global turnover, and potential injunctive relief requiring cessation of model use. Operational consequence includes forced model replacement on short timelines, with associated service disruption and re-procurement costs. Reputational consequence includes public disclosure of litigation and enforcement actions.

Cross-references: AG-048 (AI Model Provenance and Integrity) ensures the model is identified; AG-021 (Regulatory Obligation Identification) ensures applicable regulations are mapped; AG-093 (Supplier Concentration and Exit Governance) addresses the exit risk when a model must be replaced due to rights issues.

Cite this protocol
AgentGoverning. (2026). AG-092: Training Data Rights and Licensing Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-092