AG-313

Synthetic and Augmented Data Tagging Governance

Data Classification, Quality & Lineage · AGS v2.1 · April 2026
EU AI Act · GDPR · FCA · NIST · HIPAA · ISO 42001

2. Summary

Synthetic and Augmented Data Tagging Governance requires that all synthetic, inferred, augmented, or model-generated data be distinctly and persistently marked to differentiate it from observed, measured, or verified data. When AI agents consume data, they must be able to distinguish between a customer's actual reported income and an income value inferred by a model, between a real sensor reading and a simulated sensor value, between an actual transaction and a synthetic test transaction. Without this tagging, synthetic and observed data become indistinguishable once merged into a data store, and agents — and the humans who rely on agent outputs — make decisions on data they believe is real but is not.

3. Example

Scenario A — Synthetic Training Data Leaks Into Production: An organisation generates 500,000 synthetic customer records to augment its training dataset for a customer segmentation model. The synthetic records are realistic: they include plausible names, addresses, account balances, and transaction histories generated by a GAN (Generative Adversarial Network). The synthetic records are stored in a staging table in the data warehouse. A data engineer, unaware of the synthetic origin, creates a view that unions the staging table with the production customer table to improve query performance. Over 3 months, a marketing agent sends 12,400 campaign offers to synthetic customers. The postal service returns 12,400 undeliverable items at £0.85 each (£10,540). More critically, 4 synthetic records happen to share addresses with real customers, who receive offers addressed to non-existent people, generating complaints and a data protection concern.

What went wrong: Synthetic records were not tagged with a persistent, machine-readable marker distinguishing them from observed records. Once in the data warehouse, they were indistinguishable from real customer data. The union view propagated them into production without any control detecting the contamination.

Scenario B — Model-Inferred Values Treated as Ground Truth: A healthcare AI system maintains patient records that include both observed vital signs (recorded by medical devices) and model-inferred vital signs (estimated by a predictive model during gaps between observations). The inferred values are stored in the same table and format as observed values, with no distinguishing tag. A clinical decision support agent recommends medication dosages based on patient vital signs. For patient P-8834, the model inferred a blood pressure of 128/82 during a 6-hour gap between observations. The actual blood pressure at the next observation was 162/98 — the patient was developing hypertensive crisis. The agent's dosage recommendation was based on the inferred normal reading. The patient experienced a delayed intervention, requiring an additional 48 hours of hospitalisation.

What went wrong: Inferred vital signs were stored identically to observed vital signs. The clinical decision support agent treated the inferred value as ground truth. No tag distinguished model output from medical device output. The clinical team could not determine which values in the patient's record were observed vs. inferred without forensic analysis of the data pipeline.

Scenario C — Augmented Data Distorts Regulatory Reporting: A financial institution uses data augmentation to address class imbalance in its fraud detection training data. The augmentation generates 200,000 synthetic fraud transactions based on patterns in 8,000 real fraud transactions. The synthetic transactions are tagged in the training pipeline but the tag is stored as a metadata field in a separate table. When the data warehouse is migrated to a new platform, the metadata table is not migrated. The synthetic fraud transactions lose their tags. A regulatory reporting agent generates a suspicious activity report (SAR) filing that includes statistics derived from both real and synthetic transactions, reporting a fraud rate 25x higher than the actual rate. The regulator initiates an investigation into what appears to be a massive fraud exposure, consuming 400 hours of compliance team effort (£32,000 at £80/hour) to explain and remediate.

What went wrong: The synthetic data tag was stored as separable metadata rather than as an intrinsic, non-removable attribute of each record. A system migration severed the link between records and their tags. The synthetic data became indistinguishable from observed data.

4. Requirement Statement

Scope: This dimension applies to all AI agent systems that generate, consume, or store data that includes synthetic, model-inferred, augmented, simulated, or otherwise non-observed records or fields. The scope covers: fully synthetic records (generated by models or algorithms), augmented records (real records with artificially modified attributes), model-inferred field values (individual fields estimated rather than observed), simulated data (outputs from simulation models), test data that persists beyond testing contexts, governed defaults applied under AG-312 (Missing Data Escalation), and any data generated by AI models including embeddings, summaries, and extracted entities stored in vector stores (AG-132). The scope extends to data that was originally synthetic but has been transformed, aggregated, or derived — the synthetic origin must propagate through all transformations.

4.1. A conforming system MUST tag every synthetic, inferred, augmented, or model-generated data record with a persistent, machine-readable marker that identifies the data origin type (synthetic, inferred, augmented, simulated), the generation method (model name, algorithm, version), and the generation timestamp.

4.2. A conforming system MUST store the synthetic data tag as an intrinsic attribute of the record, not as separable metadata in a different storage location that could become disconnected during migration, replication, or transformation.

4.3. A conforming system MUST prevent synthetic-tagged data from being consumed by agents in decision contexts that require observed data, unless the decision context explicitly permits synthetic data and the agent's output discloses the synthetic input.

4.4. A conforming system MUST propagate synthetic tags through all data transformations — if any contributing record in a derivation, aggregation, or join is synthetic, the output record MUST carry a synthetic-contributing tag.

4.5. A conforming system MUST prevent removal or modification of synthetic tags except through a governed de-tagging process with documented justification and approval.

4.6. A conforming system SHOULD implement visual differentiation of synthetic data in human-facing interfaces — dashboards, reports, and review screens should clearly distinguish synthetic from observed data using colour coding, icons, or labels.

4.7. A conforming system SHOULD validate synthetic data tags on ingestion into each data store, rejecting records that lack required tags when ingested through synthetic data pipelines.

4.8. A conforming system SHOULD maintain a synthetic data inventory cataloguing all synthetic datasets, their generation methods, intended use contexts, and retention policies.

4.9. A conforming system MAY implement automatic synthetic detection for untagged data using statistical methods to identify records that match synthetic data distributions, as a defence-in-depth control.
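The consumption-blocking behaviour required by 4.3 can be sketched as a filter in the agent's data access layer. This is a minimal illustration, not a prescribed implementation; the `Record` type, the `SyntheticDataBlocked` exception, and the `fetch_for_context` function are hypothetical names introduced here:

```python
from dataclasses import dataclass

SYNTHETIC_ORIGINS = {"SYNTHETIC", "INFERRED", "AUGMENTED", "SIMULATED"}

@dataclass
class Record:
    data_origin: str  # one of OBSERVED / SYNTHETIC / INFERRED / AUGMENTED / SIMULATED
    payload: dict

class SyntheticDataBlocked(Exception):
    """Raised when synthetic-tagged data reaches an observed-only context."""

def fetch_for_context(records, context_permits_synthetic: bool):
    """Filter records for an agent decision context per requirement 4.3.

    Observed-only contexts reject any record whose origin is not OBSERVED.
    Contexts that explicitly permit synthetic data pass everything through,
    but the caller remains responsible for disclosing the synthetic input
    in the agent's output.
    """
    if context_permits_synthetic:
        return list(records)
    blocked = [r for r in records if r.data_origin in SYNTHETIC_ORIGINS]
    if blocked:
        raise SyntheticDataBlocked(
            f"{len(blocked)} synthetic-tagged record(s) in observed-only context"
        )
    return list(records)
```

Failing closed (raising rather than silently dropping records) is deliberate: a silent filter would hide the contamination signal that operations teams need to trace the synthetic source.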

5. Rationale

The boundary between synthetic and observed data is becoming increasingly difficult to maintain. Organisations generate synthetic data for multiple legitimate purposes: training data augmentation, privacy-preserving analytics, testing, simulation, and scenario modelling. AI models generate inferred values as part of their normal operation — predicted churn scores, estimated risk levels, imputed missing values. These generated values enter data stores alongside observed values and, once stored, become input for other agents and models.

The governance challenge is that synthetic data, by design, is realistic. A well-generated synthetic customer record is indistinguishable from a real customer record without explicit tagging. A model-inferred vital sign looks identical to a device-observed vital sign in the database. This realism, which makes synthetic data valuable for its intended purpose, makes it dangerous when consumed outside that purpose.

AI agents exacerbate this risk because they consume data without the contextual awareness that human analysts develop. A human analyst working with a dataset may know from experience that "the staging schema contains test data" or "the vital signs between 2am and 8am are usually inferred." An AI agent has no such institutional knowledge — it processes whatever data it receives as though it were equally valid.

The intrinsic tagging requirement (4.2) deserves particular attention. Many organisations store synthetic tags as metadata in a separate table linked by foreign key. This architecture is brittle: database migrations, replication to downstream systems, ETL transformations, and data lake ingestion routinely sever metadata links. Once severed, synthetic records become permanently indistinguishable from observed records. The intrinsic tagging requirement mandates that the tag travel with the record — as a column in the same table, a field in the same JSON document, or an embedded attribute in the same file — so that no transformation can separate the record from its origin marker.

The propagation requirement (4.4) addresses a subtler problem. When a derived metric is computed from 100 records, and 3 of those records are synthetic, the derived metric is partially synthetic. If the synthetic tag does not propagate, the derived metric appears fully observed. For decision-critical derivations, even a small synthetic contribution may be material — and the downstream consumer must be aware of it.
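The propagation rule can be sketched as follows. This is a minimal illustration of requirement 4.4 under assumed record shapes; `derive_metric` and the `DERIVED` / `synthetic_contributing` field names are illustrative, not part of the protocol:

```python
SYNTHETIC_ORIGINS = {"SYNTHETIC", "INFERRED", "AUGMENTED", "SIMULATED"}

def derive_metric(records, value_key):
    """Aggregate a field across records, propagating synthetic contribution.

    Per requirement 4.4: if any contributing record is non-observed, the
    derived record carries a synthetic-contributing tag, plus counts so
    downstream consumers can judge whether the contribution is material.
    """
    contributing = sum(
        1 for r in records if r["data_origin"] in SYNTHETIC_ORIGINS
    )
    total = sum(r["payload"][value_key] for r in records)
    return {
        "data_origin": "DERIVED",
        "synthetic_contributing": contributing > 0,
        "synthetic_contributor_count": contributing,
        "source_record_count": len(records),
        "payload": {value_key: total},
    }
```

Carrying the contributor count alongside the boolean flag lets a consumer apply its own materiality threshold: 3 synthetic records out of 100 may be acceptable for a dashboard but not for a regulatory filing.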

6. Implementation Guidance

Synthetic data tagging requires three layers: tag assignment (marking data at the point of generation), tag persistence (ensuring tags survive storage, transformation, and migration), and tag enforcement (preventing agents from consuming synthetic data in observed-data contexts).

Tag structure should include at minimum: a data_origin field with enumerated values (OBSERVED, SYNTHETIC, INFERRED, AUGMENTED, SIMULATED), a generation_method field (free text or structured identifier, e.g., "GAN-v2.3", "linear-imputation", "AG-312-default"), a generation_timestamp (ISO 8601), and optionally a confidence_score for inferred values (0.0 to 1.0).

Recommended patterns:

```json
{
  "data_origin": "SYNTHETIC",
  "generation_method": "GAN-v2.3-customer-profile",
  "generation_timestamp": "2026-01-15T09:30:00Z",
  "payload": { /* actual record fields */ }
}
```

The data access layer validates the envelope before delivering the payload. Records without valid envelopes are rejected.
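A minimal sketch of that envelope validation, assuming the field names from the JSON envelope above; `InvalidEnvelope` and `validate_envelope` are illustrative names, not protocol requirements:

```python
from datetime import datetime

VALID_ORIGINS = {"OBSERVED", "SYNTHETIC", "INFERRED", "AUGMENTED", "SIMULATED"}

class InvalidEnvelope(Exception):
    pass

def validate_envelope(record: dict) -> dict:
    """Validate the tagging envelope and return the payload.

    Rejects records lacking a recognised data_origin, a generation_method
    (for non-observed data), or a parseable ISO 8601 generation_timestamp.
    """
    origin = record.get("data_origin")
    if origin not in VALID_ORIGINS:
        raise InvalidEnvelope(f"unknown data_origin: {origin!r}")
    if origin != "OBSERVED" and not record.get("generation_method"):
        raise InvalidEnvelope("non-observed record missing generation_method")
    try:
        # Normalise trailing 'Z' for Python versions before 3.11
        datetime.fromisoformat(record["generation_timestamp"].replace("Z", "+00:00"))
    except (KeyError, ValueError) as exc:
        raise InvalidEnvelope(f"bad generation_timestamp: {exc}") from exc
    return record["payload"]
```

Validation on every read, not just on write, is what catches envelopes that were damaged in transit by a migration or replication step.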

Anti-patterns to avoid:

- Storing the synthetic tag as separable metadata in a linked table — migrations and replication routinely sever the link (Scenario C).
- Relying on storage location or naming conventions as implicit tagging, e.g. "the staging schema contains test data" — a single union view defeats this (Scenario A).
- Allowing transformations, views, or migrations to drop tags silently rather than propagating a synthetic-contributing tag or rejecting the record.

Industry Considerations

Financial Services. Synthetic data used for model training, stress testing, and scenario analysis is regulated under model risk management frameworks (e.g., FCA SS1/23, Fed SR 11-7). Synthetic data that enters production data stores must be distinguishable for regulatory reporting accuracy. PRA stress test submissions must be clearly identified as based on synthetic scenarios, not observed market conditions.

Healthcare. Synthetic patient data generated for research, testing, or training must never enter clinical systems where it could be consumed by clinical decision support agents. HIPAA does not apply to de-identified synthetic data, but mislabelled synthetic data that enters a clinical record could cause patient harm. The tagging requirement is a patient safety control.

Public Sector. Synthetic data used for policy modelling or demographic analysis must be tagged to prevent it from entering official statistics or citizen-facing decision systems. The UK Statistics Authority Code of Practice requires that published statistics be based on sound methods and reliable data — synthetic data contamination would violate this requirement.

Maturity Model

Basic Implementation — The organisation tags synthetic records at the point of generation using an intrinsic field. Tags cover data_origin and generation_method. Agent data access layers filter for OBSERVED records in decision-critical contexts. The tagging policy is documented and reviewed annually. Tag compliance is verified through periodic audits of data stores.

Intermediate Implementation — Tag propagation is enforced through all data transformations — derived records inherit synthetic-contributing tags from their inputs. Tag validation occurs on ingestion into each data store, rejecting untagged records from synthetic pipelines. Human-facing interfaces visually distinguish synthetic from observed data. A synthetic data inventory catalogues all synthetic datasets with generation methods and approved use contexts.

Advanced Implementation — All intermediate capabilities plus: statistical synthetic detection provides defence-in-depth by flagging records that match synthetic distributions but lack synthetic tags. Tag removal requires a governed de-tagging process. Adversarial testing has verified that tag removal, tag spoofing, and synthetic data contamination attacks are detected and blocked. The organisation can demonstrate end-to-end synthetic tag integrity from generation through all transformations to agent consumption.
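The statistical detection capability described above (and permitted by requirement 4.9) could, in its simplest form, screen untagged records for gross distributional anomalies. The sketch below is deliberately crude — a single-field z-score check against an observed baseline; real detection would use multivariate tests or a trained classifier. The function name and threshold are illustrative:

```python
import statistics

def flag_suspect_records(untagged, observed_baseline, key, z_threshold=4.0):
    """Crude defence-in-depth screen per requirement 4.9.

    Flags untagged records whose value for `key` is a gross outlier
    relative to a baseline of known-observed values. Intended only as a
    tripwire that triggers human investigation, never as proof of
    synthetic origin.
    """
    mean = statistics.fmean(observed_baseline)
    stdev = statistics.stdev(observed_baseline)
    suspects = []
    for record in untagged:
        z = abs(record[key] - mean) / stdev if stdev else 0.0
        if z > z_threshold:
            suspects.append((record, round(z, 2)))
    return suspects
```

Because well-generated synthetic data is realistic by design, a screen like this will miss most contamination; it exists to catch the cheap cases (test fixtures, simulation extremes) and to complement, never replace, persistent tagging.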

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Intrinsic Tag Persistence Through Migration

Test 8.2: Agent Consumption Blocking for Synthetic Data

Test 8.3: Tag Propagation Through Derivation

Test 8.4: Tag Removal Governance

Test 8.5: Synthetic Detection for Untagged Records

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 10(2) (Data Governance — Training Data) | Direct requirement
EU AI Act | Article 10(5) (Synthetic Data Use) | Direct requirement
FCA SS1/23 | Model Risk Management — Data Quality | Supports compliance
GDPR | Article 5(1)(d) (Accuracy) | Supports compliance
BCBS 239 | Principle 3 (Accuracy and Integrity) | Supports compliance
NIST AI RMF | MAP 2.3, MEASURE 2.5 | Supports compliance
ISO 42001 | Clause 8.4 (AI System Operation) | Supports compliance

EU AI Act — Article 10(2) and 10(5)

Article 10(2) requires data governance practices for training, validation, and testing data, including examination of data for biases and gaps. Article 10(5) explicitly addresses synthetic or augmented data, requiring that its use be subject to appropriate data governance practices. AG-313 directly implements this by ensuring synthetic data is persistently tagged, its generation method is documented, and its use in agent decision contexts is governed.

FCA SS1/23 — Model Risk Management

The FCA's supervisory statement on model risk management requires firms to maintain clear documentation of model inputs, including the use of synthetic or augmented data. For AI agents that consume model outputs (inferred values), tagging those outputs as model-generated enables the firm to demonstrate awareness of model dependency in agent decisions.

GDPR — Article 5(1)(d) (Accuracy)

Personal data must be accurate. If an agent treats synthetic data as real personal data and takes actions based on it (e.g., sending marketing to synthetic addresses), the organisation is processing inaccurate data. Tagging ensures synthetic records are excluded from personal data processing contexts.

BCBS 239 — Principle 3 (Accuracy and Integrity)

Risk data must be accurate and subject to quality assurance. Synthetic data that enters risk calculations without being identified as synthetic compromises the accuracy and integrity of risk data aggregation. Tagging enables risk data consumers to distinguish observed from synthetic inputs.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | Data-wide — synthetic contamination can propagate through data stores, derived datasets, and multiple agent decision contexts

Consequence chain: Untagged synthetic data enters production data stores and becomes indistinguishable from observed data. The contamination propagates through data pipelines, derived metrics, and agent inputs. Agents make decisions on data they treat as observed but which is partially or fully synthetic. The financial impact includes wasted operational expenditure (Scenario A: £10,540 in undeliverable mail), incorrect clinical decisions (Scenario B: delayed intervention requiring 48 additional hours of hospitalisation), and regulatory reporting errors (Scenario C: £32,000 in investigation costs). The regulatory impact is severe in sectors where data provenance must be demonstrable — BCBS 239 compliance for banking risk data, FCA model risk management, EU AI Act synthetic data governance. The contamination is difficult to remediate because once synthetic tags are lost, identifying which records are synthetic requires statistical analysis or forensic investigation of historical pipeline logs, which may not exist. Prevention through persistent tagging is orders of magnitude less costly than remediation after contamination.

Cross-references: AG-128 (Data Source Classification) classifies data sources, including sources that produce synthetic data. AG-310 (Field-Level Criticality Governance) determines which fields are decision-critical and therefore subject to the strictest synthetic data controls. AG-312 (Missing Data Escalation Governance) — governed defaults applied when data is missing are a form of synthetic data that must be tagged per AG-313. AG-317 (Derived Data Provenance Governance) traces provenance including synthetic contributions. AG-132 (Vector Store and RAG) — AI-generated content in vector stores must be tagged as synthetic. AG-133 (Source Record Lineage) traces individual records, including synthetic origin. AG-057 (Dataset Suitability and Bias Control) assesses whether synthetic data introduces or amplifies bias.

Cite this protocol
AgentGoverning. (2026). AG-313: Synthetic and Augmented Data Tagging Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-313