Data Source Classification Governance requires that every data source consumed by an AI agent carry a formal, machine-readable classification label that describes its provenance, trust tier, sensitivity level, and permitted use scope before the data enters any agent reasoning pipeline. Without explicit classification, an agent treats all inputs as equally trustworthy — a production database record carries the same weight as an unverified third-party API response or a user-uploaded spreadsheet. AG-128 mandates that classification be assigned at ingestion time, enforced structurally, and propagated through every downstream transformation so that governance controls in AG-013, AG-129, and AG-132 can operate on reliable metadata rather than guesswork.
Scenario A — Unclassified Source Poisons Financial Decision: A wealth management firm deploys an AI portfolio rebalancing agent that ingests data from three sources: the firm's own verified market data feed (Bloomberg terminal), a third-party ESG scoring API, and a client-uploaded CSV containing personal risk preferences. None of these sources carry classification labels. The ESG API begins returning stale cached data from 72 hours ago due to a provider outage, and the client CSV contains a manipulated risk tolerance value (changed from "conservative" to "aggressive" by an unauthorised family member). The agent treats all three sources with equal confidence, rebalances 340 client portfolios toward high-risk assets, and generates £2.3 million in unsuitable positions before a human reviewer catches the drift.
What went wrong: No source classification existed. The agent had no structural mechanism to distinguish a verified institutional feed (trust tier: authoritative) from a third-party API with no SLA guarantees (trust tier: supplementary) from a user-uploaded file with no integrity verification (trust tier: unverified). Had classifications been applied, the governance layer could have required human confirmation for any rebalancing decision where a critical input came from a source classified below "authoritative" trust tier.
Scenario B — Training Data Source Confusion: A healthcare AI agent is fine-tuned on a combined dataset. The dataset includes 4.2 million records from an approved clinical database and 380,000 records scraped from a public health forum. No source classification distinguishes the two subsets. During inference, the agent recommends a drug interaction warning based on a pattern that exists only in the forum-sourced data — a folk remedy interaction that has no clinical evidence. A clinician follows the warning, delays a necessary prescription by 48 hours, and the patient's condition deteriorates.
What went wrong: Training data lacked source classification. The forum-sourced records should have been classified as "community/unverified" and excluded from clinical decision paths or flagged with a confidence discount. Without classification metadata propagating through the training pipeline, the provenance distinction was lost permanently at data merge time.
Scenario C — Regulatory Exposure Through Jurisdiction Mismatch: A cross-border compliance agent ingests customer due diligence records from subsidiaries in 14 countries. Each subsidiary's data store has different data protection classifications under local law — some records are "special category" under GDPR, others are "sensitive personal information" under PIPL, others carry no special classification under the local regime. Without source classification labels that normalise these distinctions into a common taxonomy, the agent processes all records identically. It transfers GDPR special-category health data to a subsidiary in a jurisdiction without adequacy status, triggering a reportable breach. The fine under Article 83(5) is calculated at 4% of global annual turnover: €18.7 million.
What went wrong: Source classification did not include jurisdiction-specific sensitivity labels mapped to a common governance taxonomy. The agent had no structural way to distinguish records requiring enhanced protection from those with standard treatment.
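The jurisdiction normalisation that was missing in Scenario C amounts to a mapping from local-law categories into a common governance taxonomy that defaults to the stricter treatment. A hypothetical sketch (the regime names and mapping values are illustrative, not legal advice):

```python
# Illustrative normalisation of jurisdiction-specific sensitivity labels
# into a common taxonomy. The (regime, local_label) keys and the two
# common tiers are assumptions made for this sketch.
LOCAL_TO_COMMON = {
    ("GDPR", "special-category"): "enhanced-protection",
    ("PIPL", "sensitive-personal-information"): "enhanced-protection",
    ("none", "unclassified"): "standard",
}

def normalise(regime: str, local_label: str) -> str:
    """Map a local-law label to the common taxonomy.

    Unknown combinations fall through to the stricter tier, so a record
    from an unmapped regime can never silently receive standard treatment.
    """
    return LOCAL_TO_COMMON.get((regime, local_label), "enhanced-protection")
```

With this default, the agent in Scenario C would have treated the GDPR special-category records as requiring enhanced protection even before the mapping document was complete.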
Scope: This dimension applies to every AI agent that consumes data from any source — whether internal databases, external APIs, user uploads, web scrapes, vector stores, streaming feeds, or training datasets. The scope includes data consumed at inference time (runtime context), data consumed during fine-tuning or training, and data consumed for retrieval-augmented generation. Any data that influences an agent's reasoning, outputs, or actions is within scope. The scope extends to derived data: if Source A is classified and used to produce Derived Dataset B, Dataset B inherits Source A's classification constraints unless explicitly reclassified through a governed process. Data that is merely stored but never consumed by an agent is outside scope until consumption occurs.
4.1. A conforming system MUST assign a machine-readable classification label to every data source before that source's data enters any agent reasoning, training, or retrieval pipeline.
4.2. A conforming system MUST include in each classification label at minimum: source identity, trust tier (e.g., authoritative, verified, supplementary, unverified, adversarial), sensitivity level, jurisdiction of origin, and permitted use scope.
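The minimum label fields in 4.2 can be carried as a small, machine-readable structure. A minimal sketch in Python, where the field names, tier ordering, and example values are illustrative rather than normative:

```python
from dataclasses import dataclass
from enum import IntEnum

class TrustTier(IntEnum):
    # Integer ordering supports "most restrictive wins" comparisons:
    # a lower value means less trust.
    ADVERSARIAL = 0
    UNVERIFIED = 1
    SUPPLEMENTARY = 2
    VERIFIED = 3
    AUTHORITATIVE = 4

@dataclass(frozen=True)  # labels are immutable once assigned
class ClassificationLabel:
    source_id: str            # URI, connection string, or equivalent
    trust_tier: TrustTier
    sensitivity: str          # e.g. "public", "internal", "special-category"
    jurisdiction: str         # ISO 3166-1 alpha-2 code of origin
    permitted_scope: frozenset  # allowed uses, e.g. {"inference", "training"}

# Hypothetical label for the Scenario A market data feed.
label = ClassificationLabel(
    source_id="bloomberg://terminal/feed-eq-uk",
    trust_tier=TrustTier.AUTHORITATIVE,
    sensitivity="internal",
    jurisdiction="GB",
    permitted_scope=frozenset({"inference"}),
)
```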
4.3. A conforming system MUST reject or quarantine data from any source that lacks a valid classification label, rather than processing it with a default or permissive classification.
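The reject-or-quarantine behaviour in 4.3 amounts to a gate at the ingestion boundary that never falls back to a permissive default. A hypothetical sketch (the `admit` function and required field names are assumptions):

```python
# Fields a label must carry to be considered valid (per 4.2).
REQUIRED_FIELDS = {"source_id", "trust_tier", "sensitivity",
                   "jurisdiction", "permitted_scope"}

def admit(record, label):
    """Gate a record at ingestion: 'admitted' or 'quarantined'.

    A missing or incomplete label is never repaired with defaults;
    the record is quarantined for human resolution instead.
    """
    if label is None or not REQUIRED_FIELDS <= label.keys():
        return "quarantined"
    return "admitted"
```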
4.4. A conforming system MUST propagate classification metadata through data transformations, aggregations, and derivations such that the classification of any derived dataset reflects the most restrictive classification of its constituent sources.
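The most-restrictive-wins rule in 4.4 can be sketched as a merge over label dictionaries: lowest trust tier, highest sensitivity, intersection of permitted scopes. The orderings below are illustrative:

```python
# Illustrative orderings; a real deployment would take these from the
# organisation's classification taxonomy.
TIER_ORDER = ["adversarial", "unverified", "supplementary",
              "verified", "authoritative"]
SENSITIVITY_ORDER = ["public", "internal", "confidential", "special-category"]

def merge_labels(a: dict, b: dict) -> dict:
    """Classify a derived dataset from two input labels (requirement 4.4):
    lowest trust tier wins, highest sensitivity wins, and the permitted
    use scope is the intersection of the inputs' scopes."""
    return {
        "trust_tier": min(a["trust_tier"], b["trust_tier"],
                          key=TIER_ORDER.index),
        "sensitivity": max(a["sensitivity"], b["sensitivity"],
                           key=SENSITIVITY_ORDER.index),
        "permitted_scope": a["permitted_scope"] & b["permitted_scope"],
    }
```

This is the same rule that would classify a PHI/de-identified merge as PHI in the healthcare example later in this section.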
4.5. A conforming system MUST enforce classification-based access controls that prevent agents from consuming data above their authorised trust tier or outside their permitted sensitivity scope.
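Enforcement under 4.5 reduces to comparing a source's label against the consuming agent's authorisation policy. A minimal sketch, assuming a hypothetical policy shape with a trust-tier floor and a sensitivity allowlist:

```python
# Numeric tier ranks for comparison (illustrative ordering).
TIERS = {"adversarial": 0, "unverified": 1, "supplementary": 2,
         "verified": 3, "authoritative": 4}

def may_consume(agent_policy: dict, label: dict) -> bool:
    """Return True only if the source meets the agent's trust-tier floor
    and its sensitivity is within the agent's authorised scope (4.5)."""
    return (TIERS[label["trust_tier"]] >= TIERS[agent_policy["min_trust_tier"]]
            and label["sensitivity"] in agent_policy["allowed_sensitivities"])
```

Under such a policy, the Scenario A rebalancing agent could have been restricted to "authoritative"-tier inputs for trade-affecting decisions, routing the unverified CSV to human confirmation instead.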
4.6. A conforming system MUST maintain an immutable registry of all source classifications, including the identity of the classifier, the timestamp of classification, and the evidence basis for the assigned tier.
4.7. A conforming system SHOULD implement automated classification for known source types using structural metadata (e.g., database connection strings, API endpoint certificates, file signatures) rather than relying solely on manual labelling.
4.8. A conforming system SHOULD re-evaluate source classifications on a defined schedule (at minimum quarterly) and upon any change in source provider, data format, or contractual terms.
4.9. A conforming system MAY implement dynamic trust scoring that adjusts source classification in real time based on observed data quality metrics, freshness, and consistency signals.
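Dynamic trust scoring under 4.9 might, for example, downgrade a source one tier when its data exceeds a freshness budget — the kind of signal that would have flagged the 72-hour-stale ESG feed in Scenario A. A sketch with illustrative thresholds:

```python
import time

TIER_ORDER = ["adversarial", "unverified", "supplementary",
              "verified", "authoritative"]

def dynamic_tier(base_tier: str, last_update_ts: float,
                 max_age_s: float = 3600.0, now=None) -> str:
    """Downgrade a source one tier when its data is older than max_age_s.

    A sketch of requirement 4.9; the one-hour budget and one-tier
    downgrade are illustrative, not normative.
    """
    now = time.time() if now is None else now
    if now - last_update_ts > max_age_s:
        idx = max(TIER_ORDER.index(base_tier) - 1, 0)
        return TIER_ORDER[idx]
    return base_tier
```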
Data Source Classification Governance addresses a foundational gap in AI agent deployments: the absence of structured, machine-readable metadata that tells the governance layer what kind of data the agent is consuming and how much trust that data deserves. Without classification, every governance control that depends on data quality, sensitivity, or provenance operates blind.
The problem is not theoretical. Modern AI agents routinely consume data from heterogeneous sources with vastly different reliability, freshness, and legal constraints. A retrieval-augmented generation system might pull context from a verified internal knowledge base, a cached web search result, and a user-provided document — all in the same inference call. Without classification, the agent's reasoning treats these sources identically. A hallucinated claim from a cached web page carries the same weight as a verified database record.
Classification is not the same as labelling. A label is a static tag. A classification under AG-128 is a structured, versioned, machine-readable metadata object that travels with the data through every transformation. When two datasets are merged, the classification of the merged dataset reflects the most restrictive classification of either input — following the principle that data quality degrades to the lowest-quality source in any combination.
This dimension intersects directly with AG-013 (Data Sensitivity and Exfiltration Prevention), which requires knowing the sensitivity level of data to enforce exfiltration controls. It also enables AG-129 (Stale Data Actuation Prevention) by providing the freshness and trust metadata that staleness checks depend on. AG-132 (Vector Store and RAG Governance) requires classification to determine which chunks in a vector store are suitable for retrieval into a given agent's context. Without AG-128, these downstream controls cannot function reliably.
The core implementation artefact is a source classification registry — a persistent, versioned data store that maps every data source to its classification metadata. Each registry entry includes: source identifier (URI, connection string, or equivalent), trust tier, sensitivity level, jurisdiction, permitted use scope, classification timestamp, classifier identity, and evidence basis. The registry is the single source of truth for all classification-dependent governance controls.
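One way to make such a registry immutable and auditable (per 4.6) is an append-only log in which each entry hash-chains to its predecessor, so a reclassification adds a new version rather than overwriting the old one. A minimal sketch; the entry fields follow the list above, but the chaining scheme itself is an assumption:

```python
import hashlib
import json
import time

class ClassificationRegistry:
    """Append-only source classification registry (sketch of 4.6).

    Entries are never mutated; each new entry records the classifier,
    timestamp, and evidence basis, and chains to the previous entry's
    hash for tamper evidence.
    """

    def __init__(self):
        self._entries = []

    def classify(self, source_id, trust_tier, sensitivity, jurisdiction,
                 scope, classifier, evidence):
        prev = self._entries[-1]["hash"] if self._entries else "genesis"
        body = {
            "source_id": source_id, "trust_tier": trust_tier,
            "sensitivity": sensitivity, "jurisdiction": jurisdiction,
            "permitted_scope": sorted(scope), "classifier": classifier,
            "evidence": evidence, "ts": time.time(), "prev": prev,
        }
        # Hash is computed over the canonicalised body, then stored on it.
        body["hash"] = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        self._entries.append(body)
        return body["hash"]

    def current(self, source_id):
        # Latest entry wins; the full history stays intact for audit.
        for entry in reversed(self._entries):
            if entry["source_id"] == source_id:
                return entry
        return None
```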
Recommended patterns:
Anti-patterns to avoid:
Financial Services. Source classification should align with existing market data vendor tiers and regulatory data quality requirements. MiFID II Article 25 requires firms to assess the suitability and appropriateness of investment services for each client — acting on unclassified or low-tier data without disclosure creates suitability risk. Bloomberg, Refinitiv, and exchange direct feeds should be classified as "authoritative"; derived analytics from third-party models as "supplementary"; client-provided data as "unverified" pending validation.
Healthcare. Classification must capture HIPAA data categories (PHI, de-identified, limited dataset) and propagate these through all transformations. A dataset that mixes PHI with de-identified data inherits the PHI classification for the entire merged set. Clinical data sources should be classified by evidence tier (peer-reviewed, clinical trial, observational, community-reported) in addition to sensitivity and trust.
Public Sector. Government classification schemes (e.g., UK OFFICIAL, SECRET, TOP SECRET; US CUI, FOUO) must be mapped to the AG-128 taxonomy. Cross-classification between government security markings and AG-128 trust tiers requires a formal mapping document reviewed by information security officers.
Basic Implementation — The organisation maintains a spreadsheet or simple database listing all known data sources consumed by AI agents, with manual trust tier and sensitivity assignments. Classification is checked at deployment time but not enforced at runtime. Derived datasets are not automatically classified. Re-evaluation occurs annually or when an incident prompts review. This level satisfies the mandatory requirements only procedurally and carries gaps: unclassified sources may enter the pipeline between reviews, derived data may lose classification, and enforcement depends on procedural compliance rather than structural controls.
Intermediate Implementation — An ingestion gateway enforces classification at runtime, quarantining data from unclassified sources. Classification metadata propagates through data pipelines via sidecar objects. Derived datasets inherit classification automatically based on input classifications. Automated classification assigns initial tiers for known source types (database connections, API endpoints). Re-evaluation occurs quarterly and upon source changes. The classification registry is versioned and immutable, with full audit trail.
Advanced Implementation — All intermediate capabilities plus: dynamic trust scoring adjusts classifications in real time based on observed data quality, freshness, and anomaly signals. Machine learning models detect source degradation before it impacts agent outputs. Classification enforcement is integrated with AG-013 exfiltration controls and AG-132 RAG governance. Independent adversarial testing verifies that unclassified or misclassified data cannot reach agent reasoning pipelines. Classification decisions are explainable — any stakeholder can trace why a source has its current tier and what evidence supports it.
Required artefacts:
Retention requirements:
Access requirements:
Testing AG-128 compliance requires verifying that classification is assigned, propagated, and enforced structurally.
Test 8.1: Unclassified Source Rejection
Test 8.2: Classification Label Completeness
Test 8.3: Classification Propagation Through Transformation
Test 8.4: Classification-Based Access Enforcement
Test 8.5: Registry Immutability and Audit Trail
Test 8.6: Quarantine Resolution Workflow
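Tests of this kind are most useful when expressed as executable assertions against the system under test. A self-contained sketch of Tests 8.1 and 8.3, where the `admit` and `merge_tiers` stubs stand in for the real ingestion gateway and propagation logic:

```python
# Minimal harness for Tests 8.1 and 8.3. The two stub functions below
# are placeholders for the deployed gateway and pipeline; a real test
# suite would import and exercise the production implementations.
TIER_ORDER = ["adversarial", "unverified", "supplementary",
              "verified", "authoritative"]

def admit(label):
    """Stub gateway: quarantine anything without a label (4.3)."""
    return "quarantined" if label is None else "admitted"

def merge_tiers(a, b):
    """Stub propagation: derived data takes the lower tier (4.4)."""
    return min(a, b, key=TIER_ORDER.index)

# Test 8.1: an unclassified source must be quarantined, never defaulted.
assert admit(None) == "quarantined"

# Test 8.3: a derivation inherits the most restrictive input tier.
assert merge_tiers("authoritative", "unverified") == "unverified"
```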
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 10 (Data and Data Governance) | Direct requirement |
| EU AI Act | Article 9 (Risk Management System) | Supports compliance |
| GDPR | Article 5(1)(d) (Accuracy), Article 30 (Records of Processing) | Supports compliance |
| DORA | Article 9 (ICT Risk Management Framework) | Supports compliance |
| NIST AI RMF | MAP 2.1, MAP 2.3, MANAGE 1.3 | Supports compliance |
| ISO 42001 | Clause 8.4 (AI System Development), Clause 6.1 (Actions to Address Risks) | Supports compliance |
| MiFID II | Article 25 (Suitability and Appropriateness) | Supports compliance |
Article 10 requires that training, validation, and testing datasets for high-risk AI systems are subject to appropriate data governance practices. These practices must address, among other things, the origin of data, data collection processes, and the identification of any gaps or shortcomings. AG-128 directly implements data governance at the source level by requiring classification of every data source before ingestion. The regulation's requirement for "relevant, representative, free of errors and complete" data presupposes that the organisation knows where its data comes from and can distinguish high-quality from low-quality sources — which is precisely what AG-128's trust tier classification provides.
The accuracy principle requires that personal data is accurate and kept up to date. Source classification supports accuracy by ensuring that the provenance and reliability of data feeding into agent decisions is known and governed. Article 30 requires records of processing activities including categories of data processed — source classification provides the structured metadata needed to maintain these records for AI agent data flows.
DORA requires financial entities to identify, classify, and document ICT-related assets and dependencies. Data sources consumed by AI agents are ICT dependencies. AG-128's source classification registry directly supports DORA compliance by providing a documented, versioned inventory of all data sources with their trust and reliability characteristics.
Investment decisions made or supported by AI agents must be suitable for the client. If the data informing those decisions comes from unclassified or unreliable sources, the firm cannot demonstrate that the basis for the recommendation was sound. Source classification provides the evidentiary foundation for demonstrating data quality in investment decision support.
| Field | Value |
|---|---|
| Severity Rating | High |
| Blast Radius | Organisation-wide — affects every agent consuming data from unclassified or misclassified sources |
Consequence chain: Without source classification, agents treat all data as equally trustworthy. A single low-quality or adversarial data source entering the pipeline can contaminate agent decisions at scale. In financial services, unclassified data driving investment decisions creates suitability exposure across the entire client base — the Scenario A example illustrates £2.3 million in unsuitable positions from a single unclassified source degradation event. In healthcare, unclassified training data mixing clinical and community-sourced records creates patient safety risk. The failure compounds over time: without classification, organisations cannot retroactively determine which agent decisions were influenced by unreliable sources, making incident investigation and remediation orders of magnitude more expensive. The regulatory consequence includes findings under EU AI Act Article 10 for inadequate data governance, GDPR Article 5 for accuracy failures, and sector-specific penalties for decisions based on unverified data. Cross-references: AG-129 (Stale Data Actuation Prevention) depends on AG-128 classification to identify freshness constraints; AG-132 (Vector Store and RAG Governance) depends on AG-128 to filter retrieval results by trust tier; AG-133 (Source Record Lineage Governance) depends on AG-128 classification metadata to construct provenance chains.