AG-150

Feedback and Learning Poisoning Resistance Governance

Truth, Reward & Evaluation Integrity · AGS v2.1 · April 2026
Regulatory scope: EU AI Act · FCA · NIST · ISO 42001

2. Summary

Feedback and Learning Poisoning Resistance Governance requires that every AI agent system incorporating feedback loops — reinforcement learning from human feedback (RLHF), online learning, continuous fine-tuning, retrieval-augmented generation (RAG) index updates, or any mechanism by which agent behaviour is modified based on operational signals — implement structural controls to detect, resist, and recover from poisoning attacks targeting those feedback channels. Feedback loops are the most direct vector for adversarial influence over agent behaviour post-deployment: an attacker who can poison the feedback stream can gradually steer the agent toward arbitrary behaviour while the agent's governance controls report normal operation. This dimension ensures that learning pathways are governed with the same rigour as the agent's initial training.

3. Example

Scenario A — Coordinated RLHF Manipulation in Customer Service: A customer-facing AI agent uses RLHF to improve its responses based on user satisfaction ratings. An organised group of 230 users systematically provides 5-star ratings exclusively to responses that include discount offers or fee waivers, and 1-star ratings to responses that follow standard policy. Over 12 weeks and approximately 18,400 manipulated feedback submissions, the agent's behaviour shifts: it begins proactively offering 15% discounts and waiving fees in 67% of interactions, up from the policy-compliant rate of 4%. The organisation loses an estimated £2.8 million in unnecessary concessions before a quarterly revenue analysis reveals the anomaly.

What went wrong: The RLHF pipeline had no controls to detect coordinated feedback manipulation. Individual feedback submissions were valid — real users providing real ratings — but the aggregate pattern was adversarial. No statistical analysis of feedback distribution was performed. No cap on behavioural change velocity existed.
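The aggregate-level checks missing in Scenario A are straightforward to prototype. The sketch below assumes a hypothetical feedback record schema (source_id, timestamp, rating, offered_concession) and flags two of the signals named above: feedback concentrated in a small set of sources, and a rating distribution that splits sharply on whether the response offered a concession. Field names and thresholds are illustrative assumptions, not a prescribed interface.

```python
# Hypothetical feedback record and two aggregate checks from Scenario A.
from collections import Counter
from dataclasses import dataclass
from statistics import mean

@dataclass
class Feedback:
    source_id: str            # authenticated user or session identifier
    timestamp: float          # seconds since epoch (for temporal analysis)
    rating: int               # 1-5 satisfaction rating
    offered_concession: bool  # did the rated response include a discount/waiver?

def source_concentration(batch: list[Feedback], top_n: int = 50) -> float:
    """Fraction of all feedback contributed by the top_n most active sources.
    A small, highly active cohort is the signature of Scenario A."""
    counts = Counter(f.source_id for f in batch)
    return sum(c for _, c in counts.most_common(top_n)) / len(batch)

def concession_rating_gap(batch: list[Feedback]) -> float:
    """Mean rating when a concession was offered minus mean rating otherwise.
    A large, persistent gap indicates ratings keyed to a response feature."""
    with_c = [f.rating for f in batch if f.offered_concession]
    without = [f.rating for f in batch if not f.offered_concession]
    if not with_c or not without:
        return 0.0
    return mean(with_c) - mean(without)

def flag_batch(batch: list[Feedback],
               conc_threshold: float = 0.30,
               gap_threshold: float = 1.5) -> list[str]:
    alerts = []
    if source_concentration(batch) > conc_threshold:
        alerts.append("feedback concentrated in a small set of sources")
    if abs(concession_rating_gap(batch)) > gap_threshold:
        alerts.append("rating distribution keyed to concession behaviour")
    return alerts  # non-empty: quarantine the batch before any RLHF update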

Scenario B — RAG Index Poisoning Through Document Injection: An enterprise workflow agent uses RAG to retrieve policy documents when answering employee questions about HR procedures. The RAG index is updated weekly from a document repository. An attacker with write access to the repository (a compromised service account) injects 12 fabricated policy documents that subtly misstate termination procedures, grievance filing deadlines, and severance entitlements. The agent begins citing the fabricated documents in responses to employees. Over three months, 89 employees receive incorrect guidance on their employment rights. The organisation faces employment tribunal claims totalling £1.4 million and reputational damage when the incorrect guidance is publicised.

What went wrong: The RAG index update pipeline had no content verification against an authoritative source. New documents were indexed based on their presence in the repository, not their provenance or approval status. No differential analysis was performed between index updates to flag anomalous new entries.
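A minimal sketch of the provenance gate missing in Scenario B, assuming a hypothetical registry that maps approved document IDs to content hashes exported from the authoritative source system. Anything that cannot be matched is rejected rather than indexed, and entries absent from the previous index are surfaced for differential review.

```python
# Hypothetical provenance gate for weekly RAG index updates. The registry
# format (doc_id -> sha256 from the authoritative source) is an assumption.
import hashlib

def sha256(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

def validate_index_update(new_docs: dict[str, bytes],
                          approved_registry: dict[str, str],
                          previous_index_ids: set[str]):
    accepted, rejected, novel = [], [], []
    for doc_id, content in new_docs.items():
        expected = approved_registry.get(doc_id)
        if expected is None or expected != sha256(content):
            rejected.append(doc_id)   # no approved provenance chain: never index
            continue
        accepted.append(doc_id)
        if doc_id not in previous_index_ids:
            novel.append(doc_id)      # differential analysis: flag for review
    return accepted, rejected, novel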

Scenario C — Reward Signal Manipulation in Trading Agent: A financial trading agent uses online learning to optimise its execution strategy, receiving reward signals based on execution quality metrics (fill rate, slippage, market impact). A sophisticated adversary manipulates the market microstructure to create conditions where the agent receives artificially positive reward signals when it executes trades in a pattern that benefits the adversary's positions. Over six weeks, the agent's execution strategy gradually shifts toward a pattern that systematically provides the adversary with favourable counterparty fills. The shift is within the agent's mandate boundaries (AG-001 compliant) but represents a £3.6 million transfer of execution quality from the agent's principal to the adversary.

What went wrong: The reward signal was derived from market data that the adversary could influence. No independent verification of reward signal accuracy was performed. No constraint on strategy drift velocity existed to slow the behavioural change and allow human review.
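One way to close this gap is to recompute the reward from a data source the counterparty cannot influence and gate learning on agreement between the two estimates. The sketch below is illustrative only: verify_reward, the tolerance value, and the independent estimate are assumptions, not a prescribed interface.

```python
# Illustrative reward cross-check. The independent estimate is assumed to be
# recomputed from a reference price feed the counterparty cannot influence.
from typing import Callable

def verify_reward(primary_reward: float,
                  independent_estimate: float,
                  tolerance: float = 0.2) -> bool:
    """Accept the reward only if it agrees with an independently derived one."""
    return abs(primary_reward - independent_estimate) <= tolerance

def gated_update(apply_update: Callable[[float], None],
                 primary_reward: float,
                 independent_estimate: float) -> str:
    if not verify_reward(primary_reward, independent_estimate):
        # Divergent reward: quarantine the signal and alert; do not learn from it.
        return "quarantined"
    apply_update(primary_reward)
    return "applied"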

4. Requirement Statement

Scope: This dimension applies to all AI agent systems that incorporate any mechanism for modifying agent behaviour based on operational signals or feedback. This includes but is not limited to: reinforcement learning from human feedback (RLHF), online learning, continuous fine-tuning, retrieval-augmented generation (RAG) index updates, prompt template updates based on performance metrics, tool selection optimisation, and any parameter or configuration change driven by operational data. Systems that operate exclusively on static models with no feedback-driven modification are excluded, provided the organisation can demonstrate that no feedback pathway exists — including indirect pathways such as automatic prompt template selection or retrieval index updates.

4.1. A conforming system MUST maintain an inventory of all feedback and learning pathways through which agent behaviour can be modified post-deployment, including the data sources, update mechanisms, and modification scope of each pathway.

4.2. A conforming system MUST implement rate limiting on behavioural change velocity for each feedback pathway, ensuring that the maximum magnitude of behavioural shift within any defined period does not exceed a configured threshold without human review and approval (an illustrative sketch follows requirement 4.9).

4.3. A conforming system MUST perform statistical analysis on incoming feedback signals to detect coordinated manipulation patterns, including source concentration, temporal clustering, rating distribution anomalies, and correlated submission behaviour.

4.4. A conforming system MUST maintain a rollback capability that can revert agent behaviour to any prior verified-good state within 4 hours of a poisoning detection event.

4.5. A conforming system MUST validate all updates to retrieval indices (RAG or equivalent) against an authoritative source registry, rejecting content that cannot be traced to an approved provenance chain.

4.6. A conforming system SHOULD implement differential analysis on feedback-driven updates, comparing proposed behavioural changes against established baselines and flagging changes that exceed statistical norms.

4.7. A conforming system SHOULD isolate feedback-driven changes in a staging environment for evaluation before promotion to production, with automated regression testing against a held-out validation set.

4.8. A conforming system SHOULD maintain an independent shadow model that does not receive the feedback stream, enabling comparison between the feedback-influenced model and the uninfluenced baseline to detect drift attributable to feedback rather than environmental changes.

4.9. A conforming system MAY implement adversarial feedback injection as part of routine testing, deliberately submitting known-poisoned feedback to verify that detection and resistance mechanisms activate correctly.
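As an illustration of requirement 4.2, the following sketch measures behavioural shift as the disagreement rate between the current and candidate policies on a fixed probe set, and holds any update whose cumulative shift would exceed a per-window cap. The probe-set proxy, the threshold value, and the class interface are assumptions for the example, not a mandated design.

```python
# Illustration of requirement 4.2. Behavioural shift is proxied by the
# disagreement rate between old and candidate policies on a fixed probe set.

def behavioural_shift(old_outputs: list[str], new_outputs: list[str]) -> float:
    """Disagreement rate on a fixed probe set (a crude but auditable proxy)."""
    assert len(old_outputs) == len(new_outputs) and old_outputs
    changed = sum(a != b for a, b in zip(old_outputs, new_outputs))
    return changed / len(old_outputs)

class VelocityGovernor:
    """Caps cumulative behavioural shift within one review window."""
    def __init__(self, max_shift_per_window: float = 0.05):
        self.max_shift = max_shift_per_window
        self.accumulated = 0.0

    def review(self, shift: float) -> str:
        if self.accumulated + shift > self.max_shift:
            return "hold_for_human_review"  # threshold exceeded: human approval
        self.accumulated += shift
        return "auto_approve"

    def reset_window(self) -> None:
        self.accumulated = 0.0  # called at the start of each review window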

5. Rationale

Feedback loops are the most powerful and most dangerous feature of modern AI agent systems. They enable agents to improve over time, adapting to changing conditions and learning from operational experience. They also represent the single most direct pathway for an adversary to influence agent behaviour post-deployment. Unlike a supply chain attack on training data (addressed by AG-149) or an attempt to manipulate the agent's reasoning (addressed by AG-036), feedback poisoning operates through the system's intended learning mechanism. The agent is designed to learn from feedback; the attacker simply provides feedback that teaches the wrong lessons.

The challenge is distinguishing legitimate feedback that should drive behavioural improvement from adversarial feedback that should be rejected. This distinction cannot be made purely on the basis of individual feedback submissions — a well-crafted poisoning campaign uses feedback that is individually indistinguishable from legitimate feedback. The detection must operate at the aggregate level, identifying patterns in the feedback stream that indicate coordinated manipulation.

Rate limiting on behavioural change velocity is the most important structural control because it converts a potential catastrophic failure into a gradual drift that can be detected and reversed. Without rate limiting, a sufficiently intense poisoning campaign can shift agent behaviour within hours. With rate limiting, the same campaign takes weeks, providing multiple detection opportunities.

The relationship to AG-151 (Outcome Metric Integrity and Reward-Tampering Resistance) is complementary: AG-150 addresses poisoning of the feedback channel itself, while AG-151 addresses manipulation of the metrics that define what "good" behaviour means. Both can lead to the same outcome — agent behaviour that serves the attacker's interests rather than the principal's — but they attack different points in the learning pipeline.

6. Implementation Guidance

Feedback and learning poisoning resistance requires controls at multiple points in the feedback pipeline: at ingestion (filtering individual feedback signals), at aggregation (detecting patterns in the feedback stream), at update (rate limiting and staging behavioural changes), and at validation (comparing updated behaviour against baselines).
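A hedged sketch of how those four control points might be wired together; every hook named here (ingest_filter, aggregate_analysis, velocity_check, and so on) is a hypothetical interface for illustration, not a prescribed one.

```python
# Hypothetical wiring of the four control points named above.
def process_feedback_batch(batch, pipeline) -> str:
    clean = [f for f in batch if pipeline.ingest_filter(f)]   # 1. ingestion
    alerts = pipeline.aggregate_analysis(clean)               # 2. aggregation
    if alerts:
        pipeline.quarantine(clean, alerts)
        return "quarantined"
    update = pipeline.build_update(clean)                     # 3. update
    if pipeline.velocity_check(update) != "auto_approve":
        pipeline.escalate(update)                             #    human review
        return "escalated"
    if not pipeline.validate_against_baseline(update):        # 4. validation
        pipeline.reject(update)
        return "rejected"
    pipeline.promote(update)                                  # staged -> production
    return "promoted"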

Recommended patterns:

- Layer controls at all four pipeline stages rather than relying on a single gate: ingestion filtering, aggregate analysis, rate-limited staged updates, and baseline validation.
- Enforce a behavioural change velocity cap with mandatory human review above the configured threshold (4.2).
- Validate every retrieval index update against an authoritative provenance registry (4.5).
- Run a shadow model that never receives the feedback stream, and compare it against production to isolate feedback-driven drift (4.8).
- Exercise detection mechanisms routinely with known-poisoned feedback (4.9).

Anti-patterns to avoid:

- Trusting feedback because individual submissions are valid; well-crafted campaigns are detectable only in aggregate.
- Indexing documents because they are present in a repository, rather than because their provenance is approved.
- Deriving reward signals exclusively from data that counterparties or market participants can influence.
- Allowing unbounded behavioural change between human checkpoints.
- Treating rollback as a design-time assumption rather than a rehearsed, tested capability.

Industry Considerations

Financial Services. Trading agents using online learning from execution quality signals are vulnerable to market microstructure manipulation. Regulators (FCA, SEC) expect firms to demonstrate that learning algorithms cannot be manipulated by counterparties or market participants. The FCA's guidance on algorithmic trading explicitly requires firms to test for adversarial scenarios. Feedback poisoning that leads to systematically poor execution could constitute a market abuse facilitation risk.

Healthcare. Clinical decision support agents that learn from clinician feedback are vulnerable to feedback that reflects clinician biases rather than evidence-based practice. Feedback loops must be governed to ensure learning converges toward evidence-based guidelines, not individual practitioner preferences that may vary from best practice.

Customer-Facing Applications. Agents that learn from customer interactions are vulnerable to coordinated manipulation by organised groups seeking to extract concessions, change policies, or degrade service to competitors' customers. Rate limiting on behavioural change velocity is essential in these deployments.

Maturity Model

Basic Implementation — The organisation has inventoried all feedback pathways and implemented rate limiting on behavioural change velocity. Feedback-driven updates are logged with source metadata. Rollback capability exists and has been tested. RAG index updates are validated against an approved document registry. This level meets the minimum mandatory requirements but detection of coordinated manipulation relies on manual review of feedback logs.

Intermediate Implementation — All basic capabilities plus: automated statistical analysis of feedback patterns detects coordinated manipulation, including source concentration, temporal clustering, and distribution anomalies. Feedback-driven updates are staged and regression-tested before production deployment. A shadow model enables comparison between feedback-influenced and uninfluenced behaviour. Multi-dimensional feedback validation cross-checks primary feedback signals against independent metrics.
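The shadow-model comparison described above can be reduced to a divergence measurement over a shared probe set. A minimal sketch, assuming both models answer the same non-empty probes and that exact-match disagreement is an acceptable proxy for behavioural divergence:

```python
# Minimal shadow-model comparison; probe set and allowance are assumptions.
def shadow_divergence(production_outputs: list[str],
                      shadow_outputs: list[str]) -> float:
    paired = list(zip(production_outputs, shadow_outputs))
    return sum(a != b for a, b in paired) / len(paired)

def check_drift(production_outputs: list[str],
                shadow_outputs: list[str],
                allowance: float = 0.10) -> tuple[str, float]:
    # Divergence above the allowance suggests drift attributable to the
    # feedback stream rather than to shared environmental change.
    d = shadow_divergence(production_outputs, shadow_outputs)
    return ("investigate_feedback_poisoning" if d > allowance else "ok", d)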

Advanced Implementation — All intermediate capabilities plus: adversarial feedback injection is performed as routine testing. Feedback source reliability scoring weights contributions by trustworthiness. Independent adversarial testing of the feedback pipeline has been conducted by an external party. The organisation can demonstrate that known poisoning techniques (Sybil attacks, gradient manipulation, targeted rating manipulation) are detected and blocked. Real-time alerting on feedback anomalies enables response within minutes of a detected campaign.

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Feedback Pathway Inventory Completeness

Test 8.2: Behavioural Change Velocity Enforcement

Test 8.3: Coordinated Manipulation Detection

Test 8.4: Rollback Capability

Test 8.5: RAG Index Validation

Test 8.6: Shadow Model Divergence Detection

Test 8.7: Feedback Source Authentication

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Direct requirement
EU AI Act | Article 10 (Data and Data Governance) | Supports compliance
EU AI Act | Article 15 (Accuracy, Robustness and Cybersecurity) | Direct requirement
NIST AI RMF | MANAGE 2.2, MANAGE 4.1 | Supports compliance
ISO 42001 | Clause 8.2 (AI Risk Assessment), Clause 10.1 (Continual Improvement) | Supports compliance
FCA | SYSC 6.1.1R (Systems and Controls) | Supports compliance
DORA | Article 9 (ICT Risk Management Framework) | Supports compliance

EU AI Act — Article 15 (Accuracy, Robustness and Cybersecurity)

Article 15 requires high-risk AI systems to be resilient against attempts by unauthorised third parties to alter their use, outputs, or performance by exploiting system vulnerabilities. Feedback poisoning is precisely such an attempt — it exploits the system's intended learning mechanism to alter its behaviour. AG-150 implements the controls necessary to demonstrate resilience against feedback-based attacks, directly supporting Article 15 compliance.

EU AI Act — Article 10 (Data and Data Governance)

Article 10's data governance requirements extend to data used for post-deployment learning, not just initial training. Feedback data that modifies agent behaviour is training data in a continuous learning context and must be subject to equivalent governance controls. AG-150 implements these controls for feedback data.

NIST AI RMF — MANAGE 2.2, MANAGE 4.1

MANAGE 2.2 addresses mechanisms for monitoring, managing, and communicating AI risks. MANAGE 4.1 addresses post-deployment monitoring of AI systems. Feedback poisoning is a post-deployment risk that must be monitored and managed throughout the system lifecycle. AG-150 provides the monitoring and management framework.

DORA — Article 9

For financial entities, feedback loops in AI agents are ICT risk vectors that must be managed under the ICT risk management framework. Poisoned feedback that alters an agent's financial decision-making behaviour is an ICT risk with potential financial stability implications.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | Organisation-wide — potentially affecting all users, customers, and counterparties served by the poisoned agent, with cascading effects if the poisoned behaviour propagates to dependent systems

Consequence chain: Feedback poisoning operates through the agent's intended learning mechanism, making it inherently difficult to detect and particularly damaging when successful. The immediate technical failure is a gradual shift in agent behaviour toward adversary-preferred outcomes. Unlike a sudden compromise (which triggers anomaly detection), feedback poisoning produces a slow drift that may remain within normal variance bounds for weeks or months. The operational impact depends on the agent's function: a customer-facing agent may begin making unauthorised concessions (direct financial loss); a financial agent may adopt execution strategies that benefit adversarial counterparties (market abuse risk); a clinical agent may drift toward non-evidence-based recommendations (patient safety risk). The business consequences include financial losses that accumulate over the poisoning period (often large because detection is delayed), regulatory enforcement action for inadequate controls over AI system behaviour, liability for decisions made based on poisoned agent outputs, and reputational damage when the poisoning is disclosed. Recovery requires rollback of agent behaviour, forensic analysis of the feedback stream to identify the poisoning campaign, and remediation of the feedback pipeline to prevent recurrence — a process that typically takes 2 to 8 weeks and during which the agent may need to be taken offline.

Cross-references: AG-149 (Input Artefact Authenticity Verification) — provides the foundational artefact verification on which feedback signal provenance depends. AG-036 (Reasoning Process Integrity) — ensures the agent's reasoning process correctly incorporates feedback-driven changes without distortion. AG-039 (Active Deception and Concealment Detection) — detects when an agent conceals the effects of feedback poisoning on its behaviour. AG-057 (Dataset Suitability and Bias Control) — addresses bias in feedback data as a subset of dataset suitability. AG-151 (Outcome Metric Integrity and Reward-Tampering Resistance) — complementary control addressing manipulation of the metrics that define what "good" behaviour means.

Cite this protocol
AgentGoverning. (2026). AG-150: Feedback and Learning Poisoning Resistance Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-150