AG-150

Feedback and Learning Poisoning Resistance Governance

Truth, Reward & Evaluation Integrity · AGS v2.1 · April 2026
Regulatory scope: EU AI Act · FCA · NIST · ISO 42001

2. Summary

Feedback and Learning Poisoning Resistance Governance requires that every AI agent system incorporating feedback loops — reinforcement learning from human feedback (RLHF), online learning, continuous fine-tuning, retrieval-augmented generation (RAG) index updates, or any mechanism by which agent behaviour is modified based on operational signals — implement structural controls to detect, resist, and recover from poisoning attacks targeting those feedback channels. Feedback loops are the most direct vector for adversarial influence over agent behaviour post-deployment: an attacker who can poison the feedback stream can gradually steer the agent toward arbitrary behaviour while the agent's governance controls report normal operation. This dimension ensures that learning pathways are governed with the same rigour as the agent's initial training.

3. Example

Scenario A — Coordinated RLHF Manipulation in Customer Service: A customer-facing AI agent uses RLHF to improve its responses based on user satisfaction ratings. An organised group of 230 users systematically provides 5-star ratings exclusively to responses that include discount offers or fee waivers, and 1-star ratings to responses that follow standard policy. Over 12 weeks and approximately 18,400 manipulated feedback submissions, the agent's behaviour shifts: it begins proactively offering 15% discounts and waiving fees in 67% of interactions, up from the policy-compliant rate of 4%. The organisation loses an estimated £2.8 million in unnecessary concessions before a quarterly revenue analysis reveals the anomaly.

What went wrong: The RLHF pipeline had no controls to detect coordinated feedback manipulation. Individual feedback submissions were valid — real users providing real ratings — but the aggregate pattern was adversarial. No statistical analysis of feedback distribution was performed. No cap on behavioural change velocity existed.
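The aggregate-level checks missing in Scenario A are straightforward to prototype. The sketch below assumes a hypothetical feedback record schema (source_id, timestamp, rating, offered_concession) and flags two of the signals named above: feedback concentrated in a small set of sources, and a rating distribution that splits sharply on whether the response offered a concession. Field names and thresholds are illustrative assumptions, not a prescribed interface.

```python
# Hypothetical feedback record and two aggregate checks from Scenario A.
from collections import Counter
from dataclasses import dataclass
from statistics import mean

@dataclass
class Feedback:
    source_id: str            # authenticated user or session identifier
    timestamp: float          # seconds since epoch (for temporal analysis)
    rating: int               # 1-5 satisfaction rating
    offered_concession: bool  # did the rated response include a discount/waiver?

def source_concentration(batch: list[Feedback], top_n: int = 50) -> float:
    """Fraction of all feedback contributed by the top_n most active sources.
    A small, highly active cohort is the signature of Scenario A."""
    counts = Counter(f.source_id for f in batch)
    return sum(c for _, c in counts.most_common(top_n)) / len(batch)

def concession_rating_gap(batch: list[Feedback]) -> float:
    """Mean rating when a concession was offered minus mean rating otherwise.
    A large, persistent gap indicates ratings keyed to a response feature."""
    with_c = [f.rating for f in batch if f.offered_concession]
    without = [f.rating for f in batch if not f.offered_concession]
    if not with_c or not without:
        return 0.0
    return mean(with_c) - mean(without)

def flag_batch(batch: list[Feedback],
               conc_threshold: float = 0.30,
               gap_threshold: float = 1.5) -> list[str]:
    alerts = []
    if source_concentration(batch) > conc_threshold:
        alerts.append("feedback concentrated in a small set of sources")
    if abs(concession_rating_gap(batch)) > gap_threshold:
        alerts.append("rating distribution keyed to concession behaviour")
    return alerts  # non-empty: quarantine the batch before any RLHF update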

Scenario B — RAG Index Poisoning Through Document Injection: An enterprise workflow agent uses RAG to retrieve policy documents when answering employee questions about HR procedures. The RAG index is updated weekly from a document repository. An attacker with write access to the repository (a compromised service account) injects 12 fabricated policy documents that subtly misstate termination procedures, grievance filing deadlines, and severance entitlements. The agent begins citing the fabricated documents in responses to employees. Over three months, 89 employees receive incorrect guidance on their employment rights. The organisation faces employment tribunal claims totalling £1.4 million and reputational damage when the incorrect guidance is publicised.

What went wrong: The RAG index update pipeline had no content verification against an authoritative source. New documents were indexed based on their presence in the repository, not their provenance or approval status. No differential analysis was performed between index updates to flag anomalous new entries.
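A minimal sketch of the provenance gate missing in Scenario B, assuming a hypothetical registry that maps approved document IDs to content hashes exported from the authoritative source system. Anything that cannot be matched is rejected rather than indexed, and entries absent from the previous index are surfaced for differential review.

```python
# Hypothetical provenance gate for weekly RAG index updates. The registry
# format (doc_id -> sha256 from the authoritative source) is an assumption.
import hashlib

def sha256(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

def validate_index_update(new_docs: dict[str, bytes],
                          approved_registry: dict[str, str],
                          previous_index_ids: set[str]):
    accepted, rejected, novel = [], [], []
    for doc_id, content in new_docs.items():
        expected = approved_registry.get(doc_id)
        if expected is None or expected != sha256(content):
            rejected.append(doc_id)   # no approved provenance chain: never index
            continue
        accepted.append(doc_id)
        if doc_id not in previous_index_ids:
            novel.append(doc_id)      # differential analysis: flag for review
    return accepted, rejected, novel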

Scenario C — Reward Signal Manipulation in Trading Agent: A financial trading agent uses online learning to optimise its execution strategy, receiving reward signals based on execution quality metrics (fill rate, slippage, market impact). A sophisticated adversary manipulates the market microstructure to create conditions where the agent receives artificially positive reward signals when it executes trades in a pattern that benefits the adversary's positions. Over six weeks, the agent's execution strategy gradually shifts toward a pattern that systematically provides the adversary with favourable counterparty fills. The shift is within the agent's mandate boundaries (AG-001 compliant) but represents a £3.6 million transfer of execution quality from the agent's principal to the adversary.

What went wrong: The reward signal was derived from market data that the adversary could influence. No independent verification of reward signal accuracy was performed. No constraint on strategy drift velocity existed to slow the behavioural change and allow human review.
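One way to close this gap is to recompute the reward from a data source the counterparty cannot influence and gate learning on agreement between the two estimates. The sketch below is illustrative only: verify_reward, the tolerance value, and the independent estimate are assumptions, not a prescribed interface.

```python
# Illustrative reward cross-check. The independent estimate is assumed to be
# recomputed from a reference price feed the counterparty cannot influence.
from typing import Callable

def verify_reward(primary_reward: float,
                  independent_estimate: float,
                  tolerance: float = 0.2) -> bool:
    """Accept the reward only if it agrees with an independently derived one."""
    return abs(primary_reward - independent_estimate) <= tolerance

def gated_update(apply_update: Callable[[float], None],
                 primary_reward: float,
                 independent_estimate: float) -> str:
    if not verify_reward(primary_reward, independent_estimate):
        # Divergent reward: quarantine the signal and alert; do not learn from it.
        return "quarantined"
    apply_update(primary_reward)
    return "applied"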

4. Requirement Statement

Scope: This dimension applies to all AI agent systems that incorporate any mechanism for modifying agent behaviour based on operational signals or feedback. This includes but is not limited to: reinforcement learning from human feedback (RLHF), online learning, continuous fine-tuning, retrieval-augmented generation (RAG) index updates, prompt template updates based on performance metrics, tool selection optimisation, and any parameter or configuration change driven by operational data. Systems that operate exclusively on static models with no feedback-driven modification are excluded, provided the organisation can demonstrate that no feedback pathway exists — including indirect pathways such as automatic prompt template selection or retrieval index updates.

4.1. A conforming system MUST maintain an inventory of all feedback and learning pathways through which agent behaviour can be modified post-deployment, including the data sources, update mechanisms, and modification scope of each pathway.

4.2. A conforming system MUST implement rate limiting on behavioural change velocity for each feedback pathway, ensuring that the maximum magnitude of behavioural shift within any defined period does not exceed a configured threshold without human review and approval (an illustrative sketch follows requirement 4.9).

4.3. A conforming system MUST perform statistical analysis on incoming feedback signals to detect coordinated manipulation patterns, including source concentration, temporal clustering, rating distribution anomalies, and correlated submission behaviour.

4.4. A conforming system MUST maintain a rollback capability that can revert agent behaviour to any prior verified-good state within 4 hours of a poisoning detection event.

4.5. A conforming system MUST validate all updates to retrieval indices (RAG or equivalent) against an authoritative source registry, rejecting content that cannot be traced to an approved provenance chain.

4.6. A conforming system SHOULD implement differential analysis on feedback-driven updates, comparing proposed behavioural changes against established baselines and flagging changes that exceed statistical norms.

4.7. A conforming system SHOULD isolate feedback-driven changes in a staging environment for evaluation before promotion to production, with automated regression testing against a held-out validation set.

4.8. A conforming system SHOULD maintain an independent shadow model that does not receive the feedback stream, enabling comparison between the feedback-influenced model and the uninfluenced baseline to detect drift attributable to feedback rather than environmental changes.

4.9. A conforming system MAY implement adversarial feedback injection as part of routine testing, deliberately submitting known-poisoned feedback to verify that detection and resistance mechanisms activate correctly.
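As an illustration of requirement 4.2, the following sketch measures behavioural shift as the disagreement rate between the current and candidate policies on a fixed probe set, and holds any update whose cumulative shift would exceed a per-window cap. The probe-set proxy, the threshold value, and the class interface are assumptions for the example, not a mandated design.

```python
# Illustration of requirement 4.2. Behavioural shift is proxied by the
# disagreement rate between old and candidate policies on a fixed probe set.

def behavioural_shift(old_outputs: list[str], new_outputs: list[str]) -> float:
    """Disagreement rate on a fixed probe set (a crude but auditable proxy)."""
    assert len(old_outputs) == len(new_outputs) and old_outputs
    changed = sum(a != b for a, b in zip(old_outputs, new_outputs))
    return changed / len(old_outputs)

class VelocityGovernor:
    """Caps cumulative behavioural shift within one review window."""
    def __init__(self, max_shift_per_window: float = 0.05):
        self.max_shift = max_shift_per_window
        self.accumulated = 0.0

    def review(self, shift: float) -> str:
        if self.accumulated + shift > self.max_shift:
            return "hold_for_human_review"  # threshold exceeded: human approval
        self.accumulated += shift
        return "auto_approve"

    def reset_window(self) -> None:
        self.accumulated = 0.0  # called at the start of each review window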

5. Rationale

Feedback loops are the most powerful and most dangerous feature of modern AI agent systems. They enable agents to improve over time, adapting to changing conditions and learning from operational experience. They also represent the single most direct pathway for an adversary to influence agent behaviour post-deployment. Unlike a supply chain attack on training data (addressed by AG-149) or an attempt to manipulate the agent's reasoning (addressed by AG-036), feedback poisoning operates through the system's intended learning mechanism. The agent is designed to learn from feedback; the attacker simply provides feedback that teaches the wrong lessons.

The challenge is distinguishing legitimate feedback that should drive behavioural improvement from adversarial feedback that should be rejected. This distinction cannot be made purely on the basis of individual feedback submissions — a well-crafted poisoning campaign uses feedback that is individually indistinguishable from legitimate feedback. The detection must operate at the aggregate level, identifying patterns in the feedback stream that indicate coordinated manipulation.

Rate limiting on behavioural change velocity is the most important structural control because it converts a potential catastrophic failure into a gradual drift that can be detected and reversed. Without rate limiting, a sufficiently intense poisoning campaign can shift agent behaviour within hours. With rate limiting, the same campaign takes weeks, providing multiple detection opportunities.

The relationship to AG-151 (Outcome Metric Integrity and Reward-Tampering Resistance) is complementary: AG-150 addresses poisoning of the feedback channel itself, while AG-151 addresses manipulation of the metrics that define what "good" behaviour means. Both can lead to the same outcome — agent behaviour that serves the attacker's interests rather than the principal's — but they attack different points in the learning pipeline.

6. Implementation Guidance

Feedback and learning poisoning resistance requires controls at multiple points in the feedback pipeline: at ingestion (filtering individual feedback signals), at aggregation (detecting patterns in the feedback stream), at update (rate limiting and staging behavioural changes), and at validation (comparing updated behaviour against baselines).
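A hedged sketch of how those four control points might be wired together; every hook named here (ingest_filter, aggregate_analysis, velocity_check, and so on) is a hypothetical interface for illustration, not a prescribed one.

```python
# Hypothetical wiring of the four control points named above.
def process_feedback_batch(batch, pipeline) -> str:
    clean = [f for f in batch if pipeline.ingest_filter(f)]   # 1. ingestion
    alerts = pipeline.aggregate_analysis(clean)               # 2. aggregation
    if alerts:
        pipeline.quarantine(clean, alerts)
        return "quarantined"
    update = pipeline.build_update(clean)                     # 3. update
    if pipeline.velocity_check(update) != "auto_approve":
        pipeline.escalate(update)                             #    human review
        return "escalated"
    if not pipeline.validate_against_baseline(update):        # 4. validation
        pipeline.reject(update)
        return "rejected"
    pipeline.promote(update)                                  # staged -> production
    return "promoted"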

Recommended patterns:

- Layer controls at all four pipeline stages rather than relying on a single gate: ingestion filtering, aggregate analysis, rate-limited staged updates, and baseline validation.
- Enforce a behavioural change velocity cap with mandatory human review above the configured threshold (4.2).
- Validate every retrieval index update against an authoritative provenance registry (4.5).
- Run a shadow model that never receives the feedback stream, and compare it against production to isolate feedback-driven drift (4.8).
- Exercise detection mechanisms routinely with known-poisoned feedback (4.9).

Anti-patterns to avoid:

- Trusting feedback because individual submissions are valid; well-crafted campaigns are detectable only in aggregate.
- Indexing documents because they are present in a repository, rather than because their provenance is approved.
- Deriving reward signals exclusively from data that counterparties or market participants can influence.
- Allowing unbounded behavioural change between human checkpoints.
- Treating rollback as a design-time assumption rather than a rehearsed, tested capability.

Industry Considerations

Financial Services. Trading agents using online learning from execution quality signals are vulnerable to market microstructure manipulation. Regulators (FCA, SEC) expect firms to demonstrate that learning algorithms cannot be manipulated by counterparties or market participants. The FCA's guidance on algorithmic trading explicitly requires firms to test for adversarial scenarios. Feedback poisoning that leads to systematically poor execution could constitute a market abuse facilitation risk.

Healthcare. Clinical decision support agents that learn from clinician feedback are vulnerable to feedback that reflects clinician biases rather than evidence-based practice. Feedback loops must be governed to ensure learning converges toward evidence-based guidelines, not individual practitioner preferences that may vary from best practice.

Customer-Facing Applications. Agents that learn from customer interactions are vulnerable to coordinated manipulation by organised groups seeking to extract concessions, change policies, or degrade service to competitors' customers. Rate limiting on behavioural change velocity is essential in these deployments.

Maturity Model

Basic Implementation — The organisation has inventoried all feedback pathways and implemented rate limiting on behavioural change velocity. Feedback-driven updates are logged with source metadata. Rollback capability exists and has been tested. RAG index updates are validated against an approved document registry. This level meets the minimum mandatory requirements but detection of coordinated manipulation relies on manual review of feedback logs.

Intermediate Implementation — All basic capabilities plus: automated statistical analysis of feedback patterns detects coordinated manipulation, including source concentration, temporal clustering, and distribution anomalies. Feedback-driven updates are staged and regression-tested before production deployment. A shadow model enables comparison between feedback-influenced and uninfluenced behaviour. Multi-dimensional feedback validation cross-checks primary feedback signals against independent metrics.
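The shadow-model comparison described above can be reduced to a divergence measurement over a shared probe set. A minimal sketch, assuming both models answer the same non-empty probes and that exact-match disagreement is an acceptable proxy for behavioural divergence:

```python
# Minimal shadow-model comparison; probe set and allowance are assumptions.
def shadow_divergence(production_outputs: list[str],
                      shadow_outputs: list[str]) -> float:
    paired = list(zip(production_outputs, shadow_outputs))
    return sum(a != b for a, b in paired) / len(paired)

def check_drift(production_outputs: list[str],
                shadow_outputs: list[str],
                allowance: float = 0.10) -> tuple[str, float]:
    # Divergence above the allowance suggests drift attributable to the
    # feedback stream rather than to shared environmental change.
    d = shadow_divergence(production_outputs, shadow_outputs)
    return ("investigate_feedback_poisoning" if d > allowance else "ok", d)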

Advanced Implementation — All intermediate capabilities plus: adversarial feedback injection is performed as routine testing. Feedback source reliability scoring weights contributions by trustworthiness. Independent adversarial testing of the feedback pipeline has been conducted by an external party. The organisation can demonstrate that known poisoning techniques (Sybil attacks, gradient manipulation, targeted rating manipulation) are detected and blocked. Real-time alerting on feedback anomalies enables response within minutes of a detected campaign.

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Feedback Pathway Inventory Completeness

Test 8.2: Behavioural Change Velocity Enforcement

Test 8.3: Coordinated Manipulation Detection

Test 8.4: Rollback Capability

Test 8.5: RAG Index Validation

Test 8.6: Shadow Model Divergence Detection

Test 8.7: Feedback Source Authentication

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Direct requirement
EU AI Act | Article 10 (Data and Data Governance) | Supports compliance
EU AI Act | Article 15 (Accuracy, Robustness and Cybersecurity) | Direct requirement
NIST AI RMF | MANAGE 2.2, MANAGE 4.1 | Supports compliance
ISO 42001 | Clause 8.2 (AI Risk Assessment), Clause 10.1 (Continual Improvement) | Supports compliance
FCA | SYSC 6.1.1R (Systems and Controls) | Supports compliance
DORA | Article 9 (ICT Risk Management Framework) | Supports compliance

EU AI Act — Article 15 (Accuracy, Robustness and Cybersecurity)

Article 15 requires high-risk AI systems to be resilient against attempts by unauthorised third parties to alter their use, outputs, or performance by exploiting system vulnerabilities. Feedback poisoning is precisely such an attempt — it exploits the system's intended learning mechanism to alter its behaviour. AG-150 implements the controls necessary to demonstrate resilience against feedback-based attacks, directly supporting Article 15 compliance.

EU AI Act — Article 10 (Data and Data Governance)

Article 10's data governance requirements extend to data used for post-deployment learning, not just initial training. Feedback data that modifies agent behaviour is training data in a continuous learning context and must be subject to equivalent governance controls. AG-150 implements these controls for feedback data.

NIST AI RMF — MANAGE 2.2, MANAGE 4.1

MANAGE 2.2 addresses mechanisms for monitoring, managing, and communicating AI risks. MANAGE 4.1 addresses post-deployment monitoring of AI systems. Feedback poisoning is a post-deployment risk that must be monitored and managed throughout the system lifecycle. AG-150 provides the monitoring and management framework.

DORA — Article 9

For financial entities, feedback loops in AI agents are ICT risk vectors that must be managed under the ICT risk management framework. Poisoned feedback that alters an agent's financial decision-making behaviour is an ICT risk with potential financial stability implications.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | Organisation-wide — potentially affecting all users, customers, and counterparties served by the poisoned agent, with cascading effects if the poisoned behaviour propagates to dependent systems

Consequence chain: Feedback poisoning operates through the agent's intended learning mechanism, making it inherently difficult to detect and particularly damaging when successful. The immediate technical failure is a gradual shift in agent behaviour toward adversary-preferred outcomes. Unlike a sudden compromise (which triggers anomaly detection), feedback poisoning produces a slow drift that may remain within normal variance bounds for weeks or months. The operational impact depends on the agent's function: a customer-facing agent may begin making unauthorised concessions (direct financial loss); a financial agent may adopt execution strategies that benefit adversarial counterparties (market abuse risk); a clinical agent may drift toward non-evidence-based recommendations (patient safety risk). The business consequences include financial losses that accumulate over the poisoning period (often large because detection is delayed), regulatory enforcement action for inadequate controls over AI system behaviour, liability for decisions made based on poisoned agent outputs, and reputational damage when the poisoning is disclosed. Recovery requires rollback of agent behaviour, forensic analysis of the feedback stream to identify the poisoning campaign, and remediation of the feedback pipeline to prevent recurrence — a process that typically takes 2 to 8 weeks and during which the agent may need to be taken offline.

Cross-references: AG-149 (Input Artefact Authenticity Verification) — provides the foundational artefact verification on which feedback signal provenance depends. AG-036 (Reasoning Process Integrity) — ensures the agent's reasoning process correctly incorporates feedback-driven changes without distortion. AG-039 (Active Deception and Concealment Detection) — detects when an agent conceals the effects of feedback poisoning on its behaviour. AG-057 (Dataset Suitability and Bias Control) — addresses bias in feedback data as a subset of dataset suitability. AG-151 (Outcome Metric Integrity and Reward-Tampering Resistance) — complementary control addressing manipulation of the metrics that define what "good" behaviour means.

Cite this protocol
AgentGoverning. (2026). AG-150: Feedback and Learning Poisoning Resistance Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-150