Retrieval Poisoning Quarantine Governance requires that AI agents implement detection and containment mechanisms for poisoned or suspicious content introduced into the knowledge base with the intent of manipulating retrieval results and, by extension, agent behaviour. Retrieval poisoning is an adversarial technique in which an attacker injects or modifies knowledge base content so that it is retrieved for specific queries, thereby influencing the agent's outputs. Without this control, an attacker who gains write access to any part of the knowledge base can effectively control the agent's responses on targeted topics, bypassing all other governance controls. This dimension ensures that poisoned content is detected, quarantined, and prevented from influencing agent outputs.
Scenario A -- Injected Document Overriding Official Policy: An enterprise agent assists employees with HR policy queries, retrieving from the corporate knowledge base. An insider with contributor access to the company wiki creates a page titled "Updated Leave Policy -- Effective Immediately" containing fabricated policy terms that allow unlimited paid leave with manager approval only (the actual policy requires HR approval and caps leave at 30 days). The page is ingested into the knowledge base. When employees query the agent about leave policy, the poisoned page scores highly because it is recent, specifically titled, and optimised for leave-related queries. The agent presents the fabricated policy. Seventeen employees submit leave requests under the non-existent policy before the fabrication is discovered.
What went wrong: No mechanism detected that the new document contradicted the authoritative policy source. No anomaly detection flagged the unusually high retrieval frequency for a newly ingested document. No quarantine prevented the suspicious document from being served. Consequence: 17 incorrect leave requests, HR remediation effort of approximately 40 hours, employee relations damage, insider threat investigation.
Scenario B -- SEO-Style Poisoning of External Knowledge Base: A customer-facing product support agent retrieves from a knowledge base that includes ingested content from public-facing documentation, community forums, and partner portals. A competitor creates a partner portal page optimised with embedding-friendly terminology that contains misleading product comparisons: "Product X has been recalled due to safety concerns" (false) and "Users should migrate to [Competitor Product] immediately." The page is ingested into the knowledge base. When customers ask about Product X, the poisoned content is retrieved. The agent advises customers that Product X has been recalled and recommends the competitor's product.
What went wrong: External content was ingested without source validation or adversarial content screening. No detection mechanism identified that the content contradicted the organisation's own product information. No quarantine isolated suspicious external content. Consequence: Customer panic, product return requests, potential false advertising complaint against the organisation, reputational damage, competitor advantage gained through manipulation.
Scenario C -- Gradual Poisoning Through Incremental Edits: An attacker with read-write access to a shared knowledge base (e.g., a collaborative wiki) makes small, incremental edits to legitimate documents over 6 weeks. Each edit subtly shifts a fact: a regulatory threshold is changed by 2%, a procedure step is omitted, a contact reference is changed to the attacker's controlled email. Each individual edit is small enough to avoid detection. After 6 weeks, the cumulative effect of the edits has altered the substance of 23 documents across 4 policy areas. The agent retrieves and serves the manipulated content as authoritative.
What went wrong: No change detection mechanism monitored knowledge base content for anomalous edit patterns. No integrity verification compared current content against a known-good baseline. No quarantine mechanism isolated content with suspicious edit histories. Consequence: Agent serving manipulated policy information for 6 weeks, potential regulatory non-compliance, attacker achieving social engineering objectives through the agent, investigation and remediation cost of approximately £85,000.
Scope: This dimension applies to every AI agent whose knowledge base can be influenced by external contributors, ingests content from external sources, or resides in shared infrastructure where multiple actors have write access. This includes knowledge bases that ingest from wikis, shared drives, partner portals, community forums, and public documentation, and any source whose content is not exclusively controlled by the agent's governance team. The scope also includes knowledge bases in shared infrastructure where administrative access could be compromised. The test is: could an actor (internal or external, authorised or compromised) introduce or modify content in the knowledge base with the intent of influencing agent behaviour? If yes, retrieval poisoning quarantine governance applies.
4.1. A conforming system MUST implement content integrity verification that detects when knowledge base entries have been modified, comparing current content against a verified baseline.
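The baseline comparison in 4.1 can be sketched as a hash check over knowledge base entries. This is a minimal illustration, not a prescribed implementation: the entry IDs, the flat `dict` storage, and the report shape are all hypothetical; note that hash comparison alone catches modifications and deletions but, as discussed under the basic implementation tier below, newly injected entries only appear as "added" candidates for review, not as confirmed poisoning.

```python
import hashlib

def sha256_of(text: str) -> str:
    """Content hash used as the integrity fingerprint for one entry."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def verify_against_baseline(current: dict[str, str], baseline: dict[str, str]) -> dict[str, list[str]]:
    """Compare current knowledge base entries against a verified baseline.

    current:  entry ID -> entry text as stored now.
    baseline: entry ID -> verified SHA-256 hash.
    Returns entry IDs grouped into: modified (hash mismatch), added
    (absent from baseline -- injection-review candidates), removed.
    """
    report = {"modified": [], "added": [], "removed": []}
    for entry_id, text in current.items():
        if entry_id not in baseline:
            report["added"].append(entry_id)
        elif sha256_of(text) != baseline[entry_id]:
            report["modified"].append(entry_id)
    for entry_id in baseline:
        if entry_id not in current:
            report["removed"].append(entry_id)
    return report

# Example: one tampered entry and one newly injected document.
baseline = {"leave-policy": sha256_of("Leave capped at 30 days; HR approval required.")}
current = {
    "leave-policy": "Unlimited paid leave with manager approval only.",
    "new-page": "Updated Leave Policy -- Effective Immediately",
}
print(verify_against_baseline(current, baseline))
# → {'modified': ['leave-policy'], 'added': ['new-page'], 'removed': []}
```

Storing hashes rather than full baseline copies keeps the verified baseline small and avoids duplicating sensitive content.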
4.2. A conforming system MUST implement anomaly detection for retrieval patterns that may indicate poisoning, including: unusually high retrieval frequency for new or recently modified content, content that consistently overrides established authoritative sources, and content that contradicts the organisation's own verified knowledge.
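One of the 4.2 indicators, unusually high retrieval frequency for recently ingested content, admits a simple sketch: flag any document that is both new and claiming an outsized share of retrievals. The window and threshold values are illustrative assumptions, not recommended defaults.

```python
from collections import Counter

def flag_retrieval_anomalies(retrieval_log, ingest_times, now,
                             recency_window=7 * 24 * 3600, freq_threshold=0.05):
    """Flag documents that are both recently ingested and unusually
    frequently retrieved -- one poisoning indicator from 4.2.

    retrieval_log: list of doc IDs, one per retrieval event.
    ingest_times:  doc ID -> ingestion timestamp (epoch seconds).
    """
    counts = Counter(retrieval_log)
    total = sum(counts.values())
    flagged = []
    for doc_id, n in counts.items():
        is_new = now - ingest_times.get(doc_id, 0) < recency_window
        share = n / total
        if is_new and share > freq_threshold:
            flagged.append((doc_id, round(share, 3)))
    return flagged

# A document ingested an hour ago already accounts for 20% of retrievals.
now = 1_000_000
log = ["old-doc"] * 80 + ["new-doc"] * 20
ingested = {"old-doc": 0, "new-doc": now - 3600}
print(flag_retrieval_anomalies(log, ingested, now))
# → [('new-doc', 0.2)]
```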
4.3. A conforming system MUST, when poisoning indicators are detected, quarantine the suspected poisoned content, removing it from the retrieval index pending investigation.
4.4. A conforming system MUST log all quarantine events including: the content identifier, the detection mechanism that triggered quarantine, the quarantine timestamp, and the investigation outcome.
4.5. A conforming system MUST prevent quarantined content from being retrieved by the agent or influencing agent outputs, even if the content remains physically in the knowledge base.
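Requirements 4.3 through 4.5 can be combined in a single mechanism: a quarantine registry that records the event fields from 4.4 and filters retrieval results so quarantined content never reaches the agent, even though it remains physically in storage. The class and field names below are hypothetical.

```python
import datetime

class QuarantineRegistry:
    """Tracks quarantined entries and enforces retrieval blocking (4.3-4.5).

    Quarantine is a logical state: the content may remain in the
    knowledge base, but it never reaches the agent.
    """
    def __init__(self):
        self._quarantined: set[str] = set()
        self.log: list[dict] = []

    def quarantine(self, doc_id: str, detector: str):
        """Quarantine an entry and record the 4.4 event fields."""
        self._quarantined.add(doc_id)
        self.log.append({
            "content_id": doc_id,
            "detector": detector,
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "outcome": "pending",  # updated when the investigation closes
        })

    def filter_results(self, retrieved_ids: list[str]) -> list[str]:
        """Drop quarantined entries from a retrieval result set."""
        return [d for d in retrieved_ids if d not in self._quarantined]

registry = QuarantineRegistry()
registry.quarantine("wiki/updated-leave-policy", detector="contradiction_check")
print(registry.filter_results(["hr/leave-policy", "wiki/updated-leave-policy"]))
# → ['hr/leave-policy']
```

Filtering at retrieval time, rather than relying on deletion from storage, is what satisfies 4.5: the block holds even if the quarantined document is still physically present in the index backing store.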
4.6. A conforming system SHOULD implement source reputation scoring that applies higher scrutiny to content from lower-trust sources (external contributors, community forums, partner portals) and lower scrutiny to content from high-trust sources (internal verified documentation, official regulatory sources).
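Source reputation scoring (4.6) is often easiest to express as a tier map from source type to the ingestion checks that run on content from that source. The tier assignments and check names below are illustrative assumptions; the key property is that unknown sources default to the lowest-trust tier.

```python
# Hypothetical tier map: lower tier number = higher trust, lighter scrutiny.
SOURCE_TIERS = {
    "internal_verified": 1,
    "official_regulatory": 1,
    "internal_wiki": 2,
    "partner_portal": 3,
    "community_forum": 3,
    "external_web": 3,
}

# Checks applied at ingestion, cumulative as trust decreases.
SCRUTINY_BY_TIER = {
    1: ["integrity_hash"],
    2: ["integrity_hash", "contradiction_check"],
    3: ["integrity_hash", "contradiction_check", "adversarial_screen", "hold_for_review"],
}

def scrutiny_pipeline(source_type: str) -> list[str]:
    """Return the ingestion checks to run for a given source type.
    Unrecognised sources fail closed into the lowest-trust tier."""
    tier = SOURCE_TIERS.get(source_type, 3)
    return SCRUTINY_BY_TIER[tier]

print(scrutiny_pipeline("partner_portal"))
# → ['integrity_hash', 'contradiction_check', 'adversarial_screen', 'hold_for_review']
```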
4.7. A conforming system SHOULD implement contradiction detection that flags when newly ingested or recently modified content contradicts existing authoritative content on the same topic.
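A crude but illustrative form of contradiction detection (4.7) compares numeric claims in newly ingested content against an authoritative anchor on the same topic. This sketch only matches a few hard-coded units via regex; a production system would use claim extraction or natural language inference, which this stands in for.

```python
import re

# Hypothetical pattern: a number followed by one of a few unit tokens.
NUM = re.compile(r"(\d+(?:\.\d+)?)\s*(days|%|hours)")

def extract_claims(text: str) -> dict[str, str]:
    """Pull simple numeric claims, keyed by unit token."""
    return {unit: val for val, unit in NUM.findall(text)}

def contradicts_anchor(new_text: str, anchor_text: str) -> dict[str, tuple[str, str]]:
    """Flag units where new content asserts a different value than the
    authoritative anchor: unit -> (anchor value, new value)."""
    new_claims = extract_claims(new_text)
    anchor_claims = extract_claims(anchor_text)
    return {u: (anchor_claims[u], new_claims[u])
            for u in anchor_claims.keys() & new_claims.keys()
            if anchor_claims[u] != new_claims[u]}

anchor = "Leave is capped at 30 days and requires HR approval."
new = "Leave is capped at 90 days with manager approval only."
print(contradicts_anchor(new, anchor))
# → {'days': ('30', '90')}
```

A non-empty result would feed the quarantine decision in 4.3 rather than block ingestion outright, since legitimate policy updates also contradict their predecessors.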
4.8. A conforming system SHOULD implement edit pattern analysis for shared knowledge bases, detecting anomalous edit patterns such as: multiple small edits across many documents by the same contributor, edits that consistently shift factual values in the same direction, and edits to policy-critical content by contributors without policy authority.
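The Scenario C profile named in 4.8, many small edits spread across many documents by one contributor, can be sketched as a scan over an edit log. The thresholds and the `change_ratio` representation are illustrative assumptions.

```python
from collections import defaultdict

def suspicious_editors(edit_log, min_docs=10, max_delta=0.05):
    """Flag contributors matching an incremental-poisoning profile (4.8):
    many documents touched, every edit small.

    edit_log: list of (editor, doc_id, change_ratio), where change_ratio
    is the fraction of the document altered by that edit.
    """
    docs_touched = defaultdict(set)
    all_small = defaultdict(lambda: True)
    for editor, doc_id, delta in edit_log:
        docs_touched[editor].add(doc_id)
        if delta > max_delta:
            all_small[editor] = False
    return [e for e, docs in docs_touched.items()
            if len(docs) >= min_docs and all_small[e]]

# "mallory" makes 2% edits to 12 documents; "alice" rewrites one document.
log = [("mallory", f"doc-{i}", 0.02) for i in range(12)]
log += [("alice", "doc-1", 0.40)]
print(suspicious_editors(log))
# → ['mallory']
```

Directionally consistent factual shifts (the second 4.8 pattern) would need value-level diffing on top of this, e.g. tracking whether a contributor's numeric edits all move thresholds the same way.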
4.9. A conforming system MAY implement adversarial retrieval testing that proactively tests the knowledge base for content that appears designed to be retrieved for specific queries (e.g., content with embedding-optimised phrasing that does not match normal authoring patterns).
Retrieval poisoning is the RAG-era equivalent of SQL injection: an attack that exploits the data layer to influence application behaviour. In traditional applications, SQL injection manipulates database queries. In RAG applications, retrieval poisoning manipulates the knowledge base to influence what the agent retrieves and, consequently, what it says.
The attack is particularly insidious because it operates at the data layer, below the agent's reasoning. The agent has no inherent mechanism to distinguish between legitimate knowledge base content and poisoned content. If a poisoned document scores highly in vector similarity for a given query, the agent will retrieve it, trust it, and use it to generate its response. All other governance controls (instruction integrity, output filtering, human oversight) operate after the agent has already been influenced by the poisoned content. This makes retrieval poisoning a bypass technique for the entire governance stack.
The attack surface is proportional to the number of actors who can influence the knowledge base. In an enterprise setting, this includes every contributor to a corporate wiki, every author on a shared drive, every partner with access to a partner portal, and every moderator of a community forum. If external content is ingested (public documentation, regulatory feeds, industry publications), the attack surface extends to any actor who can modify those external sources.
Three categories of poisoning are relevant. First, injection: adding entirely new content designed to be retrieved for specific queries (Scenario A). Second, manipulation: modifying existing legitimate content to change its meaning (Scenario C). Third, SEO-style optimisation: creating content specifically optimised to rank highly in vector similarity for targeted queries (Scenario B).
Detection and quarantine are the two essential responses. Detection identifies suspected poisoned content through anomaly analysis, integrity verification, and contradiction detection. Quarantine prevents the suspected content from influencing the agent while investigation determines whether the content is genuinely poisoned. The quarantine-first approach reflects the principle that it is better to temporarily remove content that may be legitimate than to serve content that may be poisoned, particularly for consequential decisions.
Retrieval poisoning quarantine requires three layers: prevention (reducing the attack surface), detection (identifying suspected poisoning), and containment (quarantining poisoned content).
Recommended Patterns:
Anti-Patterns to Avoid:
Financial Services. Research content used for investment decisions is a high-value poisoning target. Competitor manipulation of analyst reports or market data could influence trading decisions. Knowledge bases ingesting market data, research, or regulatory content should have Tier 1 scrutiny with contradiction detection against verified regulatory sources.
Healthcare. Clinical knowledge base poisoning could directly affect patient safety. Medical protocol content, drug interaction databases, and clinical guidelines must be integrity-verified against authoritative medical sources (NICE, BNF, Cochrane). Quarantine-by-default for any externally sourced clinical content is strongly recommended.
Public Sector. Citizen-facing agents that serve policy information are targets for politically motivated poisoning. Knowledge base content about benefits eligibility, tax guidance, or regulatory requirements must be verified against official government sources. Contradiction detection against government publications should run on every ingestion.
Basic Implementation -- Content integrity verification runs on a scheduled cycle (e.g., every 24 hours) comparing hashes against a verified baseline. Detected modifications without approved change requests are flagged and quarantined. Quarantined content is removed from the retrieval index. Quarantine logs are retained. Source reputation is not differentiated. This meets minimum mandatory requirements but relies on hash-based detection, which does not catch newly injected content that was never in the baseline.
Intermediate Implementation -- All basic capabilities plus: retrieval anomaly detection monitors for suspicious retrieval patterns. Contradiction detection compares new content against authoritative anchors. Source reputation tiering applies differentiated scrutiny. Edit pattern analysis detects incremental poisoning in shared knowledge bases. Automated triage resolves clear-cut quarantine cases. Quarantine investigation SLA is defined (e.g., 24 hours for Tier 1 content, 72 hours for Tier 3).
Advanced Implementation -- All intermediate capabilities plus: adversarial retrieval testing proactively identifies embedding-optimised content. ML-based anomaly detection learns normal ingestion and retrieval patterns and flags deviations. The quarantine system has been independently tested using red-team exercises simulating injection, manipulation, and SEO-style poisoning attacks. Detection rate exceeds 90% for all three poisoning categories. The organisation can demonstrate to regulators that poisoned content cannot influence agent outputs beyond the detection latency window.
Required artefacts:
Retention requirements:
Access requirements:
Test 8.1: Injected Document Detection
Test 8.2: Incremental Modification Detection
Test 8.3: Quarantine Retrieval Blocking
Test 8.4: Retrieval Anomaly Detection
Test 8.5: Source Reputation Enforcement
Test 8.6: Authoritative Anchor Contradiction Detection
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 9 (Risk Management System) | Direct requirement |
| EU AI Act | Article 15 (Accuracy, Robustness, Cybersecurity) | Direct requirement |
| NIST AI RMF | MANAGE 2.2, MANAGE 2.3 | Supports compliance |
| ISO 42001 | Clause 6.1 (Actions to Address Risks), Clause 8.2 (AI Risk Assessment) | Supports compliance |
| DORA | Article 9 (ICT Risk Management Framework) | Supports compliance |
| NIS2 Directive | Article 21 (Cybersecurity Risk Management Measures) | Supports compliance |
Article 9 requires risk management for high-risk AI systems. Retrieval poisoning is a direct risk to AI system integrity: an attacker who can poison the knowledge base can influence the AI system's outputs without compromising the AI system itself. AG-338 implements risk mitigation for this attack category through detection and quarantine.
Article 15 requires appropriate levels of accuracy, robustness, and cybersecurity. Retrieval poisoning directly attacks all three: it degrades accuracy (the agent serves false information), robustness (the agent's behaviour is controlled by the attacker), and cybersecurity (the knowledge base is compromised). Poisoning quarantine governance directly supports all three requirements by detecting and containing the attack.
MANAGE 2.2 addresses risk mitigation through controls. MANAGE 2.3 addresses the management of third-party AI risks. Retrieval poisoning is a risk that AG-338 mitigates, and content from third-party sources is a primary attack vector that source reputation scoring addresses.
Clause 6.1 requires actions to address risks. Clause 8.2 requires AI risk assessment. Retrieval poisoning is an assessed risk category that AG-338 addresses through detection and quarantine controls.
Article 9 requires financial entities to maintain an ICT risk management framework. Knowledge base integrity is an ICT risk that AG-338 addresses.
Article 21 requires cybersecurity risk management measures including supply chain security and incident handling. Retrieval poisoning through external content sources is a supply chain risk. Quarantine and incident logging implement incident handling for this risk category.
| Field | Value |
|---|---|
| Severity Rating | Critical |
| Blast Radius | Topic-specific for targeted poisoning; organisation-wide for broad poisoning campaigns |
Consequence chain: Without retrieval poisoning quarantine governance, an attacker who can write to the knowledge base can control the agent's outputs on targeted topics. The immediate failure is agent subversion: the agent serves attacker-chosen content as authoritative knowledge. In enterprise settings (Scenario A), this enables social engineering at scale -- 17 employees acted on fabricated policy in a single incident. In customer-facing settings (Scenario B), this enables competitive sabotage -- customers were told a product had been recalled when it had not. In shared knowledge base environments (Scenario C), incremental poisoning over 6 weeks altered 23 documents, with investigation and remediation costing approximately £85,000. The severity is Critical because retrieval poisoning bypasses all other governance controls: the agent faithfully follows its governance protocols while serving poisoned content. The blast radius is topic-specific for targeted attacks (only queries matching the poisoned content are affected) but can be organisation-wide for broad campaigns that poison content across multiple domains.
Cross-references: AG-040 (Persistent Memory Governance) provides the foundational framework for the knowledge base that is the target of poisoning. AG-082 (Data Minimisation Enforcement) reduces the attack surface by minimising the volume of ingested content. AG-122 (Knowledge Integrity Verification) provides the integrity verification mechanisms that AG-338 extends with adversarial detection. AG-132 (Memory Scope Boundary Enforcement) constrains the scope that poisoning can affect. AG-179 (Memory Audit Trail Governance) captures the audit trail for quarantine events. AG-329 (Memory Write Approval Governance) provides the first line of defence by controlling what enters the knowledge base. AG-333 (Retrieved Evidence Confidence Governance) may detect poisoned content through low source reliability scores. AG-334 (Retrieval Scope Minimisation Governance) limits the retrieval scope, reducing the reach of poisoned content. AG-337 (Embedding Model Migration Governance) is relevant because poisoning vectors may change when the embedding model changes.