AG-313

Synthetic and Augmented Data Tagging Governance

Data Classification, Quality & Lineage · AGS v2.1 · April 2026
EU AI Act · GDPR · FCA · NIST · HIPAA · ISO 42001

2. Summary

Synthetic and Augmented Data Tagging Governance requires that all synthetic, inferred, augmented, or model-generated data be distinctly and persistently marked to differentiate it from observed, measured, or verified data. When AI agents consume data, they must be able to distinguish between a customer's actual reported income and an income value inferred by a model, between a real sensor reading and a simulated sensor value, between an actual transaction and a synthetic test transaction. Without this tagging, synthetic and observed data become indistinguishable once merged into a data store, and agents — and the humans who rely on agent outputs — make decisions on data they believe is real but is not.

3. Example

Scenario A — Synthetic Training Data Leaks Into Production: An organisation generates 500,000 synthetic customer records to augment its training dataset for a customer segmentation model. The synthetic records are realistic: they include plausible names, addresses, account balances, and transaction histories generated by a GAN (Generative Adversarial Network). The synthetic records are stored in a staging table in the data warehouse. A data engineer, unaware of the synthetic origin, creates a view that unions the staging table with the production customer table to improve query performance. Over 3 months, a marketing agent sends 12,400 campaign offers to synthetic customers. The postal service returns 12,400 undeliverable items at £0.85 each (£10,540). More critically, 4 synthetic records happen to share addresses with real customers, who receive offers addressed to non-existent people, generating complaints and a data protection concern.

What went wrong: Synthetic records were not tagged with a persistent, machine-readable marker distinguishing them from observed records. Once in the data warehouse, they were indistinguishable from real customer data. The union view propagated them into production without any control detecting the contamination.

Scenario B — Model-Inferred Values Treated as Ground Truth: A healthcare AI system maintains patient records that include both observed vital signs (recorded by medical devices) and model-inferred vital signs (estimated by a predictive model during gaps between observations). The inferred values are stored in the same table and format as observed values, with no distinguishing tag. A clinical decision support agent recommends medication dosages based on patient vital signs. For patient P-8834, the model inferred a blood pressure of 128/82 during a 6-hour gap between observations. The actual blood pressure at the next observation was 162/98 — the patient was developing hypertensive crisis. The agent's dosage recommendation was based on the inferred normal reading. The patient experienced a delayed intervention, requiring an additional 48 hours of hospitalisation.

What went wrong: Inferred vital signs were stored identically to observed vital signs. The clinical decision support agent treated the inferred value as ground truth. No tag distinguished model output from medical device output. The clinical team could not determine which values in the patient's record were observed vs. inferred without forensic analysis of the data pipeline.

Scenario C — Augmented Data Distorts Regulatory Reporting: A financial institution uses data augmentation to address class imbalance in its fraud detection training data. The augmentation generates 200,000 synthetic fraud transactions based on patterns in 8,000 real fraud transactions. The synthetic transactions are tagged in the training pipeline but the tag is stored as a metadata field in a separate table. When the data warehouse is migrated to a new platform, the metadata table is not migrated. The synthetic fraud transactions lose their tags. A regulatory reporting agent generates a suspicious activity report (SAR) filing that includes statistics derived from both real and synthetic transactions, reporting a fraud rate 25x higher than the actual rate. The regulator initiates an investigation into what appears to be a massive fraud exposure, consuming 400 hours of compliance team effort (£32,000 at £80/hour) to explain and remediate.

What went wrong: The synthetic data tag was stored as separable metadata rather than as an intrinsic, non-removable attribute of each record. A system migration severed the link between records and their tags. The synthetic data became indistinguishable from observed data.

4. Requirement Statement

Scope: This dimension applies to all AI agent systems that generate, consume, or store data that includes synthetic, model-inferred, augmented, simulated, or otherwise non-observed records or fields. The scope covers: fully synthetic records (generated by models or algorithms), augmented records (real records with artificially modified attributes), model-inferred field values (individual fields estimated rather than observed), simulated data (outputs from simulation models), test data that persists beyond testing contexts, governed defaults applied under AG-312 (Missing Data Escalation), and any data generated by AI models including embeddings, summaries, and extracted entities stored in vector stores (AG-132). The scope extends to data that was originally synthetic but has been transformed, aggregated, or derived — the synthetic origin must propagate through all transformations.

4.1. A conforming system MUST tag every synthetic, inferred, augmented, or model-generated data record with a persistent, machine-readable marker that identifies the data origin type (synthetic, inferred, augmented, simulated), the generation method (model name, algorithm, version), and the generation timestamp.

4.2. A conforming system MUST store the synthetic data tag as an intrinsic attribute of the record, not as separable metadata in a different storage location that could become disconnected during migration, replication, or transformation.

4.3. A conforming system MUST prevent synthetic-tagged data from being consumed by agents in decision contexts that require observed data, unless the decision context explicitly permits synthetic data and the agent's output discloses the synthetic input.

4.4. A conforming system MUST propagate synthetic tags through all data transformations — if any contributing record in a derivation, aggregation, or join is synthetic, the output record MUST carry a synthetic-contributing tag.

4.5. A conforming system MUST prevent removal or modification of synthetic tags except through a governed de-tagging process with documented justification and approval.

4.6. A conforming system SHOULD implement visual differentiation of synthetic data in human-facing interfaces — dashboards, reports, and review screens should clearly distinguish synthetic from observed data using colour coding, icons, or labels.

4.7. A conforming system SHOULD validate synthetic data tags on ingestion into each data store, rejecting records that lack required tags when ingested through synthetic data pipelines.

4.8. A conforming system SHOULD maintain a synthetic data inventory cataloguing all synthetic datasets, their generation methods, intended use contexts, and retention policies.

4.9. A conforming system MAY implement automatic synthetic detection for untagged data using statistical methods to identify records that match synthetic data distributions, as a defence-in-depth control.
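The consumption-blocking behaviour required by 4.3 can be sketched as a filter in the agent's data access layer. This is a minimal illustration, not a prescribed implementation; the `Record` type, the `SyntheticDataBlocked` exception, and the `fetch_for_context` function are hypothetical names introduced here:

```python
from dataclasses import dataclass

SYNTHETIC_ORIGINS = {"SYNTHETIC", "INFERRED", "AUGMENTED", "SIMULATED"}

@dataclass
class Record:
    data_origin: str  # one of OBSERVED / SYNTHETIC / INFERRED / AUGMENTED / SIMULATED
    payload: dict

class SyntheticDataBlocked(Exception):
    """Raised when synthetic-tagged data reaches an observed-only context."""

def fetch_for_context(records, context_permits_synthetic: bool):
    """Filter records for an agent decision context per requirement 4.3.

    Observed-only contexts reject any record whose origin is not OBSERVED.
    Contexts that explicitly permit synthetic data pass everything through,
    but the caller remains responsible for disclosing the synthetic input
    in the agent's output.
    """
    if context_permits_synthetic:
        return list(records)
    blocked = [r for r in records if r.data_origin in SYNTHETIC_ORIGINS]
    if blocked:
        raise SyntheticDataBlocked(
            f"{len(blocked)} synthetic-tagged record(s) in observed-only context"
        )
    return list(records)
```

Failing closed (raising rather than silently dropping records) is deliberate: a silent filter would hide the contamination signal that operations teams need to trace the synthetic source.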

5. Rationale

The boundary between synthetic and observed data is becoming increasingly difficult to maintain. Organisations generate synthetic data for multiple legitimate purposes: training data augmentation, privacy-preserving analytics, testing, simulation, and scenario modelling. AI models generate inferred values as part of their normal operation — predicted churn scores, estimated risk levels, imputed missing values. These generated values enter data stores alongside observed values and, once stored, become input for other agents and models.

The governance challenge is that synthetic data, by design, is realistic. A well-generated synthetic customer record is indistinguishable from a real customer record without explicit tagging. A model-inferred vital sign looks identical to a device-observed vital sign in the database. This realism, which makes synthetic data valuable for its intended purpose, makes it dangerous when consumed outside that purpose.

AI agents exacerbate this risk because they consume data without the contextual awareness that human analysts develop. A human analyst working with a dataset may know from experience that "the staging schema contains test data" or "the vital signs between 2am and 8am are usually inferred." An AI agent has no such institutional knowledge — it processes whatever data it receives as though it were equally valid.

The intrinsic tagging requirement (4.2) deserves particular attention. Many organisations store synthetic tags as metadata in a separate table linked by foreign key. This architecture is brittle: database migrations, replication to downstream systems, ETL transformations, and data lake ingestion routinely sever metadata links. Once severed, synthetic records become permanently indistinguishable from observed records. The intrinsic tagging requirement mandates that the tag travel with the record — as a column in the same table, a field in the same JSON document, or an embedded attribute in the same file — so that no transformation can separate the record from its origin marker.

The propagation requirement (4.4) addresses a subtler problem. When a derived metric is computed from 100 records, and 3 of those records are synthetic, the derived metric is partially synthetic. If the synthetic tag does not propagate, the derived metric appears fully observed. For decision-critical derivations, even a small synthetic contribution may be material — and the downstream consumer must be aware of it.
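The propagation rule can be sketched as follows. This is a minimal illustration of requirement 4.4 under assumed record shapes; `derive_metric` and the `DERIVED` / `synthetic_contributing` field names are illustrative, not part of the protocol:

```python
SYNTHETIC_ORIGINS = {"SYNTHETIC", "INFERRED", "AUGMENTED", "SIMULATED"}

def derive_metric(records, value_key):
    """Aggregate a field across records, propagating synthetic contribution.

    Per requirement 4.4: if any contributing record is non-observed, the
    derived record carries a synthetic-contributing tag, plus counts so
    downstream consumers can judge whether the contribution is material.
    """
    contributing = sum(
        1 for r in records if r["data_origin"] in SYNTHETIC_ORIGINS
    )
    total = sum(r["payload"][value_key] for r in records)
    return {
        "data_origin": "DERIVED",
        "synthetic_contributing": contributing > 0,
        "synthetic_contributor_count": contributing,
        "source_record_count": len(records),
        "payload": {value_key: total},
    }
```

Carrying the contributor count alongside the boolean flag lets a consumer apply its own materiality threshold: 3 synthetic records out of 100 may be acceptable for a dashboard but not for a regulatory filing.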

6. Implementation Guidance

Synthetic data tagging requires three layers: tag assignment (marking data at the point of generation), tag persistence (ensuring tags survive storage, transformation, and migration), and tag enforcement (preventing agents from consuming synthetic data in observed-data contexts).

Tag structure should include at minimum: a data_origin field with enumerated values (OBSERVED, SYNTHETIC, INFERRED, AUGMENTED, SIMULATED), a generation_method field (free text or structured identifier, e.g., "GAN-v2.3", "linear-imputation", "AG-312-default"), a generation_timestamp (ISO 8601), and optionally a confidence_score for inferred values (0.0 to 1.0).

Recommended patterns:

```json
{
  "data_origin": "SYNTHETIC",
  "generation_method": "GAN-v2.3-customer-profile",
  "generation_timestamp": "2026-01-15T09:30:00Z",
  "payload": { /* actual record fields */ }
}
```

The data access layer validates the envelope before delivering the payload. Records without valid envelopes are rejected.
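A minimal sketch of that envelope validation, assuming the field names from the JSON envelope above; `InvalidEnvelope` and `validate_envelope` are illustrative names, not protocol requirements:

```python
from datetime import datetime

VALID_ORIGINS = {"OBSERVED", "SYNTHETIC", "INFERRED", "AUGMENTED", "SIMULATED"}

class InvalidEnvelope(Exception):
    pass

def validate_envelope(record: dict) -> dict:
    """Validate the tagging envelope and return the payload.

    Rejects records lacking a recognised data_origin, a generation_method
    (for non-observed data), or a parseable ISO 8601 generation_timestamp.
    """
    origin = record.get("data_origin")
    if origin not in VALID_ORIGINS:
        raise InvalidEnvelope(f"unknown data_origin: {origin!r}")
    if origin != "OBSERVED" and not record.get("generation_method"):
        raise InvalidEnvelope("non-observed record missing generation_method")
    try:
        # Normalise trailing 'Z' for Python versions before 3.11
        datetime.fromisoformat(record["generation_timestamp"].replace("Z", "+00:00"))
    except (KeyError, ValueError) as exc:
        raise InvalidEnvelope(f"bad generation_timestamp: {exc}") from exc
    return record["payload"]
```

Validation on every read, not just on write, is what catches envelopes that were damaged in transit by a migration or replication step.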

Anti-patterns to avoid:

- Storing the synthetic tag as separable metadata in a linked table — migrations and replication routinely sever the link (Scenario C).
- Relying on storage location or naming conventions as implicit tagging, e.g. "the staging schema contains test data" — a single union view defeats this (Scenario A).
- Allowing transformations, views, or migrations to drop tags silently rather than propagating a synthetic-contributing tag or rejecting the record.

Industry Considerations

Financial Services. Synthetic data used for model training, stress testing, and scenario analysis is regulated under model risk management frameworks (e.g., FCA SS1/23, Fed SR 11-7). Synthetic data that enters production data stores must be distinguishable for regulatory reporting accuracy. PRA stress test submissions must be clearly identified as based on synthetic scenarios, not observed market conditions.

Healthcare. Synthetic patient data generated for research, testing, or training must never enter clinical systems where it could be consumed by clinical decision support agents. HIPAA does not apply to de-identified synthetic data, but mislabelled synthetic data that enters a clinical record could cause patient harm. The tagging requirement is a patient safety control.

Public Sector. Synthetic data used for policy modelling or demographic analysis must be tagged to prevent it from entering official statistics or citizen-facing decision systems. The UK Statistics Authority Code of Practice requires that published statistics be based on sound methods and reliable data — synthetic data contamination would violate this requirement.

Maturity Model

Basic Implementation — The organisation tags synthetic records at the point of generation using an intrinsic field. Tags cover data_origin and generation_method. Agent data access layers filter for OBSERVED records in decision-critical contexts. The tagging policy is documented and reviewed annually. Tag compliance is verified through periodic audits of data stores.

Intermediate Implementation — Tag propagation is enforced through all data transformations — derived records inherit synthetic-contributing tags from their inputs. Tag validation occurs on ingestion into each data store, rejecting untagged records from synthetic pipelines. Human-facing interfaces visually distinguish synthetic from observed data. A synthetic data inventory catalogues all synthetic datasets with generation methods and approved use contexts.

Advanced Implementation — All intermediate capabilities plus: statistical synthetic detection provides defence-in-depth by flagging records that match synthetic distributions but lack synthetic tags. Tag removal requires a governed de-tagging process. Adversarial testing has verified that tag removal, tag spoofing, and synthetic data contamination attacks are detected and blocked. The organisation can demonstrate end-to-end synthetic tag integrity from generation through all transformations to agent consumption.
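The statistical detection capability described above (and permitted by requirement 4.9) could, in its simplest form, screen untagged records for gross distributional anomalies. The sketch below is deliberately crude — a single-field z-score check against an observed baseline; real detection would use multivariate tests or a trained classifier. The function name and threshold are illustrative:

```python
import statistics

def flag_suspect_records(untagged, observed_baseline, key, z_threshold=4.0):
    """Crude defence-in-depth screen per requirement 4.9.

    Flags untagged records whose value for `key` is a gross outlier
    relative to a baseline of known-observed values. Intended only as a
    tripwire that triggers human investigation, never as proof of
    synthetic origin.
    """
    mean = statistics.fmean(observed_baseline)
    stdev = statistics.stdev(observed_baseline)
    suspects = []
    for record in untagged:
        z = abs(record[key] - mean) / stdev if stdev else 0.0
        if z > z_threshold:
            suspects.append((record, round(z, 2)))
    return suspects
```

Because well-generated synthetic data is realistic by design, a screen like this will miss most contamination; it exists to catch the cheap cases (test fixtures, simulation extremes) and to complement, never replace, persistent tagging.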

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Intrinsic Tag Persistence Through Migration

Test 8.2: Agent Consumption Blocking for Synthetic Data

Test 8.3: Tag Propagation Through Derivation

Test 8.4: Tag Removal Governance

Test 8.5: Synthetic Detection for Untagged Records

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 10(2) (Data Governance — Training Data) | Direct requirement
EU AI Act | Article 10(5) (Synthetic Data Use) | Direct requirement
FCA SS1/23 | Model Risk Management — Data Quality | Supports compliance
GDPR | Article 5(1)(d) (Accuracy) | Supports compliance
BCBS 239 | Principle 3 (Accuracy and Integrity) | Supports compliance
NIST AI RMF | MAP 2.3, MEASURE 2.5 | Supports compliance
ISO 42001 | Clause 8.4 (AI System Operation) | Supports compliance

EU AI Act — Article 10(2) and 10(5)

Article 10(2) requires data governance practices for training, validation, and testing data, including examination of data for biases and gaps. Article 10(5) explicitly addresses synthetic or augmented data, requiring that its use be subject to appropriate data governance practices. AG-313 directly implements this by ensuring synthetic data is persistently tagged, its generation method is documented, and its use in agent decision contexts is governed.

FCA SS1/23 — Model Risk Management

The FCA's supervisory statement on model risk management requires firms to maintain clear documentation of model inputs, including the use of synthetic or augmented data. For AI agents that consume model outputs (inferred values), tagging those outputs as model-generated enables the firm to demonstrate awareness of model dependency in agent decisions.

GDPR — Article 5(1)(d) (Accuracy)

Personal data must be accurate. If an agent treats synthetic data as real personal data and takes actions based on it (e.g., sending marketing to synthetic addresses), the organisation is processing inaccurate data. Tagging ensures synthetic records are excluded from personal data processing contexts.

BCBS 239 — Principle 3 (Accuracy and Integrity)

Risk data must be accurate and subject to quality assurance. Synthetic data that enters risk calculations without being identified as synthetic compromises the accuracy and integrity of risk data aggregation. Tagging enables risk data consumers to distinguish observed from synthetic inputs.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | Data-wide — synthetic contamination can propagate through data stores, derived datasets, and multiple agent decision contexts

Consequence chain: Untagged synthetic data enters production data stores and becomes indistinguishable from observed data. The contamination propagates through data pipelines, derived metrics, and agent inputs. Agents make decisions on data they treat as observed but which is partially or fully synthetic. The financial impact includes wasted operational expenditure (Scenario A: £10,540 in undeliverable mail), incorrect clinical decisions (Scenario B: delayed intervention requiring 48 additional hours of hospitalisation), and regulatory reporting errors (Scenario C: £32,000 in investigation costs). The regulatory impact is severe in sectors where data provenance must be demonstrable — BCBS 239 compliance for banking risk data, FCA model risk management, EU AI Act synthetic data governance. The contamination is difficult to remediate because once synthetic tags are lost, identifying which records are synthetic requires statistical analysis or forensic investigation of historical pipeline logs, which may not exist. Prevention through persistent tagging is orders of magnitude less costly than remediation after contamination.

Cross-references: AG-128 (Data Source Classification) classifies data sources, including sources that produce synthetic data. AG-310 (Field-Level Criticality Governance) determines which fields are decision-critical and therefore subject to the strictest synthetic data controls. AG-312 (Missing Data Escalation Governance) — governed defaults applied when data is missing are a form of synthetic data that must be tagged per AG-313. AG-317 (Derived Data Provenance Governance) traces provenance including synthetic contributions. AG-132 (Vector Store and RAG) — AI-generated content in vector stores must be tagged as synthetic. AG-133 (Source Record Lineage) traces individual records, including synthetic origin. AG-057 (Dataset Suitability and Bias Control) assesses whether synthetic data introduces or amplifies bias.

Cite this protocol
AgentGoverning. (2026). AG-313: Synthetic and Augmented Data Tagging Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-313