AG-317

Derived Data Provenance Governance

Data Classification, Quality & Lineage · AGS v2.1 · April 2026
Regulatory context: EU AI Act · GDPR · FCA · NIST · ISO 42001

2. Summary

Derived Data Provenance Governance requires that AI agent systems trace every derived field, computed metric, aggregated value, and model output back to its contributing source records, preserving the complete chain from source to derivation. When an agent produces an output — a risk score, a recommendation, a classification, an aggregated report — the organisation must be able to answer: "What source data contributed to this output, through what transformations, and with what governance attributes?" Without provenance, derived data is an opaque number — the organisation knows what it is but not where it came from or how it was produced.

3. Example

Scenario A — Unauditable Risk Score: A risk management agent produces a portfolio risk score of 7.2 (on a 1-10 scale) for a client portfolio. A regulator asks the firm to explain the score: which positions contributed, what pricing data was used, which risk model was applied, and what parameters governed the calculation. The firm's data architecture stores the final risk score but not the provenance chain. The risk model consumed 340 position records, each with 12 fields from 4 data sources, applied 3 transformation stages (normalisation, weighting, aggregation), and used a risk model with 47 configurable parameters. Reconstructing the provenance post-hoc requires forensic analysis across 4 data sources, 3 transformation logs, and the model configuration history. The reconstruction takes 6 weeks and costs £180,000 in specialist engineering and compliance time. The regulator issues a finding for inadequate audit trail, noting that the firm could not explain its own risk outputs within a reasonable timeframe.

What went wrong: The derived risk score was stored without provenance metadata. The contributing source records, transformation steps, and model configuration were not linked to the output. The firm could reconstruct provenance forensically but could not produce it as a retained artefact.

Scenario B — Incorrect Aggregation Propagates Silently: A financial reporting agent aggregates revenue figures from 12 business units into a consolidated revenue number. The aggregation pipeline joins revenue data from each unit's reporting system. Business unit 7 changes its reporting currency from USD to EUR without notification to the consolidation team. The pipeline sums the raw values, producing a consolidated revenue of £142.3 million — approximately £8.7 million higher than the actual £133.6 million. The error persists for 2 quarters. When detected during external audit, the firm cannot immediately identify which component of the consolidated figure is incorrect because the aggregation pipeline does not record which source records contributed to the final number. Investigation requires re-running the aggregation pipeline with logging enabled, comparing the output to historical records, and tracing the discrepancy to business unit 7. Total investigation and restatement cost: £310,000.

What went wrong: The aggregation produced a derived value without recording which source records contributed and what their attributes (including currency — see AG-314) were. Without provenance, the error was unlocatable without forensic reconstruction.
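The failure in Scenario B is mechanically simple to prevent. As a minimal sketch (class and field names are illustrative, not taken from the scenario's actual systems), an aggregation step can refuse to sum values whose propagated unit metadata (AG-314) disagrees, and capture record-level provenance as a by-product of the same pass:

```python
# Illustrative sketch: a consolidation step that checks propagated currency
# metadata before summing, and records contributing source record IDs.
from dataclasses import dataclass

@dataclass(frozen=True)
class RevenueRecord:
    record_id: str
    business_unit: int
    amount: float
    currency: str  # unit metadata propagated from the source system (AG-314)

def consolidate(records: list[RevenueRecord], expected_currency: str):
    """Sum unit revenues, failing fast on mixed currencies and
    returning the contributing record IDs as record-level provenance."""
    mismatched = [r for r in records if r.currency != expected_currency]
    if mismatched:
        units = sorted({r.business_unit for r in mismatched})
        raise ValueError(f"currency mismatch in business units {units}")
    total = sum(r.amount for r in records)
    provenance = [r.record_id for r in records]
    return total, provenance
```

With this pattern, business unit 7's currency change surfaces as a hard failure at aggregation time rather than a silent £8.7 million error two quarters later.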

Scenario C — Model Output Without Input Traceability: A customer churn prediction agent produces a churn probability of 0.87 for customer C-9102, triggering an automated retention offer of £500 in service credits. The customer contacts the organisation, asking why they were identified as likely to churn and why they received the offer. Under GDPR Article 22, the customer has a right to meaningful information about the logic involved. The firm cannot explain the prediction because the model input features — usage patterns, billing history, support interactions, satisfaction scores — were consumed from a feature store without provenance linkage. The model used 23 input features, but the firm cannot identify which specific data records from which source systems produced those feature values for this specific customer at the time of prediction. The firm provides a generic explanation of the model's methodology, which the data protection authority deems insufficient under Article 22(3). The DPA issues a reprimand and orders the firm to implement decision traceability.

What went wrong: Model inputs were consumed from a feature store without provenance linkage to source records. The firm could explain how the model works generally but not why it produced a specific output for a specific customer at a specific time.

4. Requirement Statement

Scope: This dimension applies to all AI agents that consume, produce, or act upon derived data — any value that is computed, aggregated, transformed, modelled, or otherwise generated from source data rather than directly observed. The scope covers: calculated fields (ratios, sums, weighted averages), model outputs (predictions, classifications, scores, embeddings), aggregated values (totals, counts, distributions), transformed values (normalised, scaled, encoded), and AI-generated content (summaries, extractions, reformulations). The scope extends to multi-stage derivations where one derived value feeds into another derivation — the full provenance chain through all stages must be traceable. Derived data in vector stores (AG-132), including embeddings and extracted entities, is within scope.

4.1. A conforming system MUST record the provenance of every decision-critical derived value, linking it to: the specific source records that contributed, the transformation or computation applied, the model and version (if applicable), the configuration parameters at the time of derivation, and the timestamp of derivation.

4.2. A conforming system MUST make provenance queryable per output — given a specific agent output, the system MUST be able to retrieve the complete provenance chain from output to source records within a defined response time.
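As an illustration of what "queryable per output" can mean in practice, the following sketch walks a stored provenance chain from a derived value back to its root source records, including multi-stage derivations. The dict-based store and field names are assumptions for illustration only; a production system would query an append-only provenance store:

```python
# Illustrative backward provenance query: output ID -> set of root source IDs.
def trace_to_sources(store: dict, derived_id: str) -> set[str]:
    """Follow source_record_ids links recursively until reaching records
    that have no provenance entry, i.e. original source records."""
    record = store.get(derived_id)
    if record is None:
        return {derived_id}  # not derived: treat as a source record
    sources: set[str] = set()
    for input_id in record["source_record_ids"]:
        sources |= trace_to_sources(store, input_id)
    return sources
```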

4.3. A conforming system MUST propagate governance attributes from source records to derived values — including data quality scores (AG-311), synthetic tags (AG-313), temporal validity windows (AG-316), unit metadata (AG-314), and field criticality classifications (AG-310).
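One way to implement this propagation is to merge the inputs' attributes under conservative rules: the derived value inherits the minimum quality score, a synthetic flag if any input is synthetic, and the intersection of validity windows. These merge rules are one reasonable convention, not something AG-317 mandates:

```python
# Illustrative governance attribute propagation (conservative merge rules).
def propagate_attributes(inputs: list[dict]) -> dict:
    """Derive governance attributes for an output from its inputs'."""
    return {
        "quality_score": min(i["quality_score"] for i in inputs),   # AG-311
        "contains_synthetic": any(i["synthetic"] for i in inputs),  # AG-313
        "valid_from": max(i["valid_from"] for i in inputs),         # AG-316
        "valid_to": min(i["valid_to"] for i in inputs),             # window intersection
    }
```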

4.4. A conforming system MUST ensure provenance records are immutable once written — the provenance of a historical derivation cannot be altered retroactively.
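A common way to make provenance records tamper-evident is hash chaining: each entry stores the hash of its predecessor, so any retroactive modification invalidates every subsequent hash. The sketch below illustrates the property only; a real deployment would also need durable storage, replication, and access controls:

```python
# Illustrative append-only, hash-chained provenance log.
import hashlib
import json

class ProvenanceLog:
    def __init__(self):
        self._entries: list[dict] = []

    def append(self, record: dict) -> None:
        """Write a provenance record, chaining it to the previous entry."""
        prev = self._entries[-1]["entry_hash"] if self._entries else "genesis"
        payload = json.dumps(record, sort_keys=True)
        entry_hash = hashlib.sha256((prev + payload).encode()).hexdigest()
        self._entries.append(
            {"record": record, "prev_hash": prev, "entry_hash": entry_hash}
        )

    def verify(self) -> bool:
        """Recompute the chain; any retroactive edit breaks verification."""
        prev = "genesis"
        for e in self._entries:
            payload = json.dumps(e["record"], sort_keys=True)
            expected = hashlib.sha256((prev + payload).encode()).hexdigest()
            if e["prev_hash"] != prev or e["entry_hash"] != expected:
                return False
            prev = e["entry_hash"]
        return True
```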

4.5. A conforming system MUST retain provenance records for at least as long as the derived value may be subject to audit, regulatory enquiry, or data subject access request.

4.6. A conforming system SHOULD implement provenance at the record level (linking specific output records to specific input records), not only at the pipeline level (documenting that pipeline X uses data from source Y).

4.7. A conforming system SHOULD generate a provenance summary with each agent output — a machine-readable manifest that identifies the contributing sources, transformation stages, and governance attributes without requiring a separate provenance query.
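A provenance manifest of this kind might look as follows. All identifiers, versions, and values are invented for illustration; the field set mirrors the provenance record structure described in the Implementation Guidance section:

```json
{
  "derivation_id": "drv-2026-04-0001",
  "derived_value": {"field": "portfolio_risk_score", "value": 7.2},
  "sources": [
    {"record_id": "pos-000123", "system": "positions-db", "quality_score": 0.96},
    {"record_id": "px-778901", "system": "market-data-feed", "quality_score": 0.99}
  ],
  "transformations": ["normalisation@v3.1", "weighting@v2.0", "aggregation@v2.0"],
  "model": {"id": "portfolio-risk-model", "version": "5.4", "parameter_set": "cfg-2026-03"},
  "governance_attributes": {"contains_synthetic": false, "valid_to": "2026-04-30T23:59:59Z"},
  "derivation_timestamp": "2026-04-12T09:14:03Z"
}
```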

4.8. A conforming system SHOULD validate provenance completeness on a defined cadence, verifying that all derived values in production data stores have valid, queryable provenance chains.

4.9. A conforming system MAY implement provenance visualisation tools that render the provenance chain as a directed graph for human investigation and audit support.

5. Rationale

AI agent systems are fundamentally data transformation engines. They consume source data, apply transformations (calculations, models, aggregations, reasoning), and produce derived data that drives decisions. The derived data — a risk score, a recommendation, a classification — is what matters operationally. But the derived data alone, without provenance, is an assertion without evidence.

Provenance answers three questions that every governance function needs: What went in? (which source records contributed), What happened? (which transformations were applied), and Was it governed? (did the source data meet quality, freshness, authority, and tagging requirements). Without provenance, these questions can only be answered forensically, which is slow, expensive, and often impossible for historical decisions where the source data has since changed.

The regulatory motivation is strong and growing. GDPR Article 22 grants data subjects the right to meaningful information about the logic of automated decisions — which requires tracing the decision back to its inputs. The EU AI Act Article 12 requires record-keeping for high-risk AI systems including "the input data." Financial regulators (BCBS 239, FCA) require that risk data aggregation be auditable. In every case, the regulator needs to understand not just the output but the path from source to output.

For AI agents specifically, provenance is complicated by the multi-stage nature of derivation. An agent's output may depend on: (1) source data from 5 systems, (2) a feature engineering pipeline that transforms source data into model features, (3) a machine learning model that produces predictions, (4) a retrieval-augmented generation (RAG) pipeline that retrieves relevant documents, (5) the agent's reasoning process that combines all inputs into an action. Each stage transforms data and potentially loses provenance unless the provenance chain is explicitly maintained.

The immutability requirement (4.4) prevents retroactive provenance fabrication. If an organisation is asked "what data informed this decision 6 months ago?" and the provenance can be modified after the fact, the provenance has no evidentiary value. Immutable provenance — written once at the time of derivation and never modified — provides the evidentiary foundation for audit and regulatory response.

6. Implementation Guidance

Derived data provenance requires three components: provenance capture (recording the chain at the time of derivation), provenance storage (retaining the chain in an immutable, queryable store), and provenance delivery (making the chain available to governance functions, auditors, and regulators).

Provenance record structure should include for each derived value: a unique derivation_id, the derived_value_identifier (linking to the output record), the list of source_record_identifiers (linking to each contributing input record), the transformation_identifier (linking to the computation, model, or pipeline that produced the derivation), the transformation_version (ensuring the specific version is recorded), the configuration_parameters (model hyperparameters, aggregation rules, thresholds), the derivation_timestamp, and the governance_attributes_summary (quality scores, synthetic flags, validity windows of contributing inputs).
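One possible encoding of this record structure, using the field names above directly. Types are illustrative; a frozen dataclass also supports the immutability requirement (4.4) at the object level, though true immutability must be enforced by the store:

```python
# Illustrative provenance record, field names taken from the guidance above.
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class ProvenanceRecord:
    derivation_id: str
    derived_value_identifier: str
    source_record_identifiers: tuple[str, ...]
    transformation_identifier: str
    transformation_version: str
    configuration_parameters: dict
    derivation_timestamp: str  # ISO 8601, recorded at time of derivation
    governance_attributes_summary: dict
```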

Recommended patterns:

Anti-patterns to avoid:

Industry Considerations

Financial Services. MiFID II transaction reporting requires firms to demonstrate the data trail from order receipt through execution. Risk model outputs used for regulatory capital calculations must be traceable to their input data for model validation (PRA SS1/23). BCBS 239 requires that risk data aggregation be auditable, which implies provenance for aggregated risk figures.

Healthcare. Clinical decision support outputs must be traceable to their contributing clinical data for patient safety investigations and malpractice defence. The contributing data — lab results, medication lists, clinical notes — forms the clinical audit trail for the agent's recommendation.

Public Sector. Government decisions affecting citizens (benefits eligibility, service allocation, regulatory enforcement) are subject to judicial review. Provenance provides the evidential basis for demonstrating that the decision was made on appropriate data through appropriate processes. The Equality Act 2010 may require demonstrating which data informed a decision that is challenged as discriminatory.

Maturity Model

Basic Implementation — The organisation records provenance at the pipeline level for its primary data derivations — documenting which sources feed which outputs. Record-level provenance is implemented for decision-critical derived values. Provenance is stored in a structured data store and retained for the required period. Provenance queries can be answered within days by querying the store.

Intermediate Implementation — Record-level provenance is captured for all derived values consumed by agents. Provenance manifests are attached to agent outputs. Governance attributes (quality scores, synthetic tags, validity windows) are propagated through derivations and included in provenance records. Provenance is queryable within minutes. An append-only provenance store ensures immutability.

Advanced Implementation — All intermediate capabilities plus: provenance completeness is validated automatically on a defined cadence. Forward queries ("what did this correction affect?") enable correction backpropagation (AG-318). Adversarial testing has verified that provenance tampering, fabrication, and deletion attacks are detected and blocked. Provenance visualisation tools render derivation chains for auditors and regulators. The organisation can produce the complete provenance chain for any agent decision within minutes.
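A forward provenance query ("which derived values did this source record contribute to?") can be sketched as a reverse index over the same provenance records used for backward tracing, closed transitively through intermediate derivations. The store structure below is illustrative:

```python
# Illustrative forward provenance index: source record ID -> derived value IDs.
from collections import defaultdict

def build_reverse_index(provenance: dict) -> dict:
    """Map each source record to every derived value it feeds,
    directly or through intermediate derivation stages."""
    index: dict = defaultdict(set)
    for derived_id, rec in provenance.items():
        for src in rec["source_record_ids"]:
            index[src].add(derived_id)
    # Transitive closure: if src -> a and a -> b, then src -> b.
    changed = True
    while changed:
        changed = False
        for derived in index.values():
            extra = set().union(*(index.get(d, set()) for d in derived)) - derived
            if extra:
                derived |= extra
                changed = True
    return index
```

This is the query shape that correction backpropagation (AG-318) relies on: given a corrected source record, the index identifies every downstream derived value requiring re-derivation.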

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: End-to-End Provenance Traceability

Test 8.2: Governance Attribute Propagation

Test 8.3: Provenance Immutability

Test 8.4: Provenance Query Response Time

Test 8.5: Forward Provenance Query

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 12 (Record-Keeping) | Direct requirement
EU AI Act | Article 13 (Transparency) | Direct requirement
GDPR | Article 22(3) (Right to Meaningful Information) | Direct requirement
BCBS 239 | Principle 6 (Adaptability) | Supports compliance
PRA SS1/23 | Model Risk Management — Audit Trail | Supports compliance
NIST AI RMF | MAP 2.3, GOVERN 1.2 | Supports compliance
ISO 42001 | Clause 8.4 (AI System Operation) | Supports compliance

EU AI Act — Article 12 (Record-Keeping)

Article 12 requires that high-risk AI systems generate logs that enable tracing the system's operation, including the reference database against which input data has been checked. For AI agents, this extends to recording which input data contributed to each output. AG-317 directly implements this by maintaining record-level provenance for derived agent outputs.

GDPR — Article 22(3) (Right to Meaningful Information)

Where automated decisions significantly affect individuals, the data subject has the right to obtain meaningful information about the logic involved. "Meaningful information" requires more than a description of the model architecture — it requires tracing the specific decision to the specific data that informed it. AG-317 provides the provenance infrastructure that enables this per-decision explanation.

EU AI Act — Article 13 (Transparency)

Article 13 requires that high-risk AI systems be designed to be sufficiently transparent. Provenance is a transparency mechanism — it makes the derivation chain visible, enabling users, deployers, and regulators to understand how outputs were produced from inputs.

PRA SS1/23 — Model Risk Management

The PRA's supervisory statement on model risk management requires firms to maintain audit trails for model inputs and outputs. For AI agents that consume model outputs, the model's provenance (which features were used, which model version, which configuration) must be captured and retained. AG-317 extends this audit trail through the full derivation chain from source data to model output to agent decision.

BCBS 239 — Principle 6 (Adaptability)

Principle 6 requires that risk data aggregation capabilities be adaptable to changes in regulatory requirements, including the ability to produce ad hoc reports. Provenance enables adaptability by ensuring that any aggregated risk figure can be decomposed to its source records, enabling re-aggregation under new rules or for ad hoc regulatory requests.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | Organisation-wide — affects the auditability of all derived data and agent outputs

Consequence chain: Without provenance, the organisation cannot explain its own outputs. The immediate impact is audit failure — regulators, external auditors, and data subjects asking "why?" receive delayed or incomplete answers. In Scenario A, reconstructing provenance forensically cost £180,000 and took 6 weeks — during which the regulator's investigation remained open. In Scenario B, the absence of provenance prevented locating an £8.7 million reporting error without re-running the aggregation pipeline. In Scenario C, the inability to explain a specific automated decision to a data subject triggered a DPA enforcement action. The regulatory risk is direct: EU AI Act Article 12 non-compliance, GDPR Article 22 non-compliance, BCBS 239 auditability failure. The operational risk compounds over time — as the volume of agent decisions grows, the number of unexplainable historical decisions grows proportionally, creating an expanding liability.

Cross-references: AG-133 (Source Record Lineage) provides the foundational lineage infrastructure that AG-317 extends to derived data. AG-309 (Authoritative Source Register Governance) — provenance records should reference the authoritative source designation in effect at derivation time. AG-311 (Data Quality Threshold Enforcement Governance), AG-313 (Synthetic and Augmented Data Tagging Governance), AG-314 (Measurement Unit Consistency Governance), AG-316 (Temporal Validity Window Governance) — governance attributes from each of these dimensions must propagate through the provenance chain. AG-318 (Data Correction Backpropagation Governance) — provenance enables correction propagation by identifying which derived values are affected by a source correction. AG-132 (Vector Store and RAG) — RAG retrieval provenance (which documents, which similarity scores) must be captured. AG-057 (Dataset Suitability and Bias Control) — provenance enables bias investigation by tracing biased outputs to their contributing source data.

Cite this protocol
AgentGoverning. (2026). AG-317: Derived Data Provenance Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-317