Performance Drift and Revalidation Threshold Governance requires that AI agents be continuously monitored for performance degradation against defined baselines, and that a mandatory revalidation process be triggered when performance drifts beyond defined thresholds. Performance drift occurs when an agent's outputs, accuracy, safety properties, or governance compliance metrics gradually degrade over time without any explicit change to the agent's configuration. It can result from data distribution shifts in inputs, model degradation, upstream API changes, evolving user behaviour, or environmental changes. This dimension ensures that the validation performed at deployment (AG-071) does not become stale: it establishes ongoing assurance that the agent continues to perform within the bounds that justified its deployment.
Scenario A — Data Distribution Shift Causes Accuracy Degradation: An organisation deploys a customer classification AI agent that routes incoming inquiries to appropriate service teams. At deployment, the agent achieves 96.2% classification accuracy. Over the following 8 months, the organisation launches three new product lines, shifts its customer demographic through a marketing campaign, and experiences seasonal variation in inquiry types. No performance monitoring is in place. When an internal audit reviews the agent's performance after 8 months, classification accuracy has degraded to 78.4%. During the degradation period, approximately 22,000 customer inquiries were misrouted, resulting in delayed responses, repeated transfers, and customer dissatisfaction. Average resolution time increased from 4.2 minutes to 11.7 minutes for affected inquiries.
What went wrong: No continuous performance monitoring compared the agent's accuracy against its deployment baseline. The degradation was gradual — roughly 2.2 percentage points per month — so no single day presented an obvious failure. Without defined thresholds that would trigger revalidation, the degradation accumulated for 8 months before anyone noticed. A revalidation threshold of 5 percentage points below baseline would have triggered action at month 3, limiting the impact to approximately 3,000 misrouted inquiries instead of 22,000. Consequence: 22,000 misrouted customer inquiries, estimated customer satisfaction impact valued at £310,000 in churn and remediation, and operational cost of £95,000 for the retrospective audit and remediation.
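The intervention arithmetic in this scenario can be checked with a short sketch; all figures are taken from Scenario A, and the constant monthly drift rate is an illustrative simplification:

```python
# Sketch: when would a 5-percentage-point revalidation threshold have tripped
# for the Scenario A classification agent?
baseline = 96.2          # accuracy at deployment (%)
monthly_drift = 2.2      # average decline per month (percentage points)
threshold = 5.0          # revalidation trigger: 5 pp below baseline

month = 0
accuracy = baseline
while baseline - accuracy < threshold:
    month += 1
    accuracy -= monthly_drift

print(month, round(accuracy, 1))  # threshold first breached at month 3 (89.6%)
```

At months 1 and 2 the accumulated degradation (2.2 pp, 4.4 pp) is still within the threshold; the breach at month 3 is what bounds the impact at roughly 3,000 misrouted inquiries rather than 22,000.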
Scenario B — Upstream API Change Causes Silent Governance Degradation: A financial advisory AI agent relies on a market data API to provide current pricing information when making investment recommendations. The API provider changes the data format for certain asset classes, returning prices in a different currency denomination without updating the API documentation. The agent continues to function — it receives numerical values and incorporates them into recommendations — but the values are now in USD instead of GBP. The agent's recommendations for affected asset classes are systematically incorrect by the exchange rate differential (approximately 20%). Because the agent's functional metrics (response time, completion rate, user satisfaction scores) remain normal, no alarm fires. The pricing errors are detected 11 weeks later when a client queries a specific recommendation.
What went wrong: Performance monitoring covered functional metrics but not output accuracy for domain-specific dimensions. No revalidation threshold was defined for recommendation accuracy. The upstream API change was not detected by the agent's monitoring because the monitoring did not validate output correctness — only operational metrics. An accuracy monitoring system that sampled and validated 1% of recommendations against an independent data source would have detected the 20% pricing discrepancy within hours. Consequence: 11 weeks of systematically incorrect investment recommendations affecting 1,400 clients, estimated remediation cost of £2.1 million including client compensation, regulatory fine, and legal fees.
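A sampling control of the kind described above can be sketched as follows; the function name, record fields, and reference-source interface are illustrative assumptions rather than a prescribed API:

```python
import random

def sample_and_validate(recommendations, fetch_reference_price,
                        sample_rate=0.01, tolerance=0.02):
    """Validate a random sample of agent outputs against an independent source.

    Returns the fraction of sampled recommendations whose price deviates from
    the independent reference by more than `tolerance` (relative error).
    `fetch_reference_price` is a caller-supplied function, assumed here to
    query a market-data source independent of the agent's own feed.
    """
    sample = [r for r in recommendations if random.random() < sample_rate]
    if not sample:
        return 0.0
    discrepancies = 0
    for rec in sample:
        ref = fetch_reference_price(rec["asset"])
        if abs(rec["price"] - ref) / ref > tolerance:
            discrepancies += 1
    return discrepancies / len(sample)
```

Under this scheme, a systematic 20% currency error of the kind in Scenario B would surface as a near-100% discrepancy rate for the affected asset classes within the first sampled batch, rather than going unnoticed for 11 weeks.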
Scenario C — Model Staleness Without Revalidation Trigger: An organisation deploys a fraud detection AI agent that processes payment transactions. At deployment, the agent achieves a 94.7% fraud detection rate with a 1.2% false positive rate. Over 14 months, fraud patterns evolve: attackers adopt new techniques that the model was not trained on. The agent's fraud detection rate gradually declines to 67.3%, while the false positive rate remains stable (because the agent still recognises the original fraud patterns — it just misses the new ones). No performance monitoring tracks the fraud detection rate in production. The organisation discovers the degradation when a quarterly fraud analysis reveals that fraud losses have increased 340% over the period. Revalidation at month 6 would have detected the degradation at 82.1% detection rate — still a significant decline but with 8 months less accumulated fraud loss.
What went wrong: No continuous monitoring of the agent's primary performance metric (fraud detection rate) in production. No revalidation threshold was defined. The stable false positive rate masked the detection rate decline — operational metrics appeared normal while the agent's core effectiveness degraded. The 14-month delay between deployment and detection allowed fraud losses to accumulate. Consequence: Estimated £4.2 million in additional fraud losses that would have been prevented at the deployment-era detection rate, regulatory scrutiny for inadequate fraud controls, and £350,000 in remediation costs for retraining and revalidation.
Scope: This dimension applies to all AI agents operating in production environments where the agent's performance could degrade over time due to factors external to the agent's configuration. This includes agents whose inputs are drawn from real-world data distributions that may shift, agents that depend on external data sources or APIs that may change, agents operating in domains where the underlying patterns evolve (fraud detection, market prediction, user behaviour classification), and agents whose performance is sensitive to environmental conditions (seasonal variation, regulatory changes, competitive dynamics). Agents that operate on purely static, internal data with no external dependencies and no possibility of distribution shift may be excluded if the organisation documents the justification for exclusion and reviews it annually. The scope extends to all performance dimensions: accuracy, safety properties, governance compliance metrics, latency, reliability, and any domain-specific performance measure that was validated at deployment.
4.1. A conforming system MUST define performance baselines for every AI agent at deployment, capturing the key performance metrics validated during pre-deployment acceptance (AG-071).
4.2. A conforming system MUST continuously monitor agent performance against the defined baselines, with monitoring frequency appropriate to the agent's risk profile and traffic volume.
4.3. A conforming system MUST define quantitative revalidation thresholds for each monitored performance metric, specifying the degree of degradation from baseline that triggers mandatory revalidation.
4.4. A conforming system MUST trigger a mandatory revalidation process when any monitored metric breaches its revalidation threshold, requiring the agent to undergo the same acceptance process as a new deployment (AG-071).
4.5. A conforming system MUST restrict or suspend the agent's operation when a revalidation threshold is breached, until revalidation is completed and the agent is re-accepted for production operation.
4.6. A conforming system MUST monitor for both gradual drift (progressive degradation over time) and sudden shifts (abrupt performance changes), with detection mechanisms appropriate for each pattern.
4.7. A conforming system SHOULD implement statistical process control methods (control charts, CUSUM, EWMA) to distinguish genuine performance drift from normal variation.
4.8. A conforming system SHOULD sample and validate a percentage of agent outputs against an independent reference (ground truth, expert review, or independent data source) to detect accuracy degradation that is not visible in operational metrics.
4.9. A conforming system SHOULD define time-based revalidation requirements (e.g., mandatory revalidation at least every 6 months) in addition to threshold-based triggers, to catch degradation modes that monitoring may not detect.
4.10. A conforming system SHOULD implement alerting at intermediate thresholds (warning levels) before the revalidation threshold is reached, enabling proactive investigation before mandatory revalidation is triggered.
4.11. A conforming system MAY implement automated revalidation pipelines that execute the validation test suite on a scheduled basis and report results without waiting for threshold breaches.
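A minimal sketch of how requirements 4.1, 4.3, and 4.10 compose (baseline capture plus tiered threshold evaluation); the metric names and threshold values are illustrative, not normative:

```python
from dataclasses import dataclass

@dataclass
class MetricThresholds:
    baseline: float            # value captured at deployment (4.1)
    warning_delta: float       # degradation triggering investigation (4.10)
    revalidation_delta: float  # degradation triggering mandatory revalidation (4.3)

def evaluate(current: float, t: MetricThresholds) -> str:
    """Classify a current metric reading against its deployment baseline."""
    degradation = t.baseline - current
    if degradation >= t.revalidation_delta:
        return "REVALIDATE"   # 4.4: trigger revalidation; 4.5: restrict or suspend
    if degradation >= t.warning_delta:
        return "WARNING"      # 4.10: proactive investigation
    return "OK"

# Illustrative thresholds for the Scenario A classification agent
accuracy = MetricThresholds(baseline=96.2, warning_delta=3.0, revalidation_delta=5.0)
print(evaluate(91.8, accuracy))  # month-2 reading: WARNING
print(evaluate(89.6, accuracy))  # month-3 reading: REVALIDATE
```

The warning tier exists so that investigation starts before the mandatory trigger fires; in Scenario A it would have prompted review at month 2, a month ahead of the revalidation threshold.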
Performance Drift and Revalidation Threshold Governance addresses the temporal dimension of AI agent governance. Pre-deployment validation (AG-071) establishes that an agent meets acceptance criteria at a point in time. But the conditions that validated the agent's performance at deployment do not persist indefinitely. The agent's operating environment changes: input data distributions shift, upstream dependencies evolve, user behaviour patterns change, and the problems the agent was designed to solve may themselves change. Without continuous monitoring and revalidation triggers, the governance assurance established at deployment erodes silently.
The fundamental challenge is that AI agent performance degradation is often invisible to operational monitoring. An agent that processes requests, returns responses, and maintains acceptable latency appears to be functioning correctly from an infrastructure perspective. The degradation is in the quality, accuracy, or safety of the agent's outputs — dimensions that require domain-specific monitoring to detect. This is qualitatively different from traditional software degradation, where functional failures typically manifest as errors, crashes, or timeouts that infrastructure monitoring detects.
The revalidation threshold concept is central to this dimension. A threshold defines the boundary between acceptable performance variation and unacceptable degradation. While degradation remains within the threshold, the agent's performance is within the range that justified its deployment. Once degradation exceeds the threshold, the deployment decision is no longer valid: the agent is not performing as it was when it was accepted for production. Revalidation reassesses whether the agent still meets acceptance criteria and, if not, determines what remediation is required (retraining, reconfiguration, or decommissioning).
The distinction between gradual drift and sudden shifts matters for detection methodology. Gradual drift — a slow, progressive decline — can be invisible on any single day but significant over weeks or months. Statistical process control methods (control charts, CUSUM, EWMA) are designed to detect gradual drift by analysing trends rather than individual data points. Sudden shifts — abrupt performance changes caused by events like upstream API changes — are detectable through threshold monitoring on shorter time windows. A comprehensive monitoring system must address both patterns.
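Both detection patterns can be sketched in a few lines: a one-sided CUSUM for gradual drift and a simple windowed comparison for sudden shifts. The parameters (allowance `k`, decision interval `h`, window size) are illustrative and would need tuning to each agent's observed variance:

```python
def cusum_downward(readings, target, k=0.5, h=4.0):
    """One-sided CUSUM: accumulates evidence that readings have drifted below
    `target`. Returns the index at which the decision interval `h` is
    exceeded, or None. `k` is the per-observation allowance (slack)."""
    s = 0.0
    for i, x in enumerate(readings):
        s = max(0.0, s + (target - x) - k)
        if s > h:
            return i
    return None

def sudden_shift(readings, window=5, jump=3.0):
    """Flag an abrupt change: compare each reading with the mean of the
    preceding `window` readings; a gap larger than `jump` is a shift."""
    for i in range(window, len(readings)):
        prev = readings[i - window:i]
        if abs(readings[i] - sum(prev) / window) > jump:
            return i
    return None

# Gradual drift of ~0.7/step trips the CUSUM before any single reading looks alarming
drift = [96.2 - 0.7 * t for t in range(12)]
print(cusum_downward(drift, target=96.2))
# An abrupt 20% shift (the Scenario B pattern) trips the window detector immediately
shift = [100.0] * 6 + [80.0] * 3
print(sudden_shift(shift))
```

The point of running both is that each detector is weak against the other's pattern: the windowed comparison never fires on slow drift, and the CUSUM takes several observations to confirm a step change that the window detector catches at once.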
The relationship with AG-022 (Behavioural Drift Detection) is complementary but distinct. AG-022 detects changes in agent behaviour patterns — what the agent does. AG-074 detects changes in agent performance — how well the agent does it. An agent may exhibit consistent behaviour patterns (AG-022 compliant) while delivering progressively degraded performance within those patterns (AG-074 non-compliant). Both dimensions are necessary for complete lifecycle governance.
The core implementation principle is that performance monitoring is not a nice-to-have operational capability — it is a governance control that maintains the validity of the deployment decision. The monitoring system must be designed and operated with the same rigour as the pre-deployment validation system it extends.
Recommended patterns:
Anti-patterns to avoid:
Financial Services. Performance monitoring for financial AI agents must include regulatory compliance metrics. For trading agents, this includes best execution quality metrics, transaction accuracy, and risk exposure accuracy. For advisory agents, this includes recommendation accuracy and suitability assessment quality. The PRA's SS1/23 on model risk management requires ongoing model performance monitoring, periodic revalidation, and documented performance thresholds. Revalidation thresholds should align with regulatory performance expectations — for example, if the FCA expects best execution quality above a specified level, the revalidation threshold should trigger before that level is breached.
Healthcare. Performance monitoring for clinical AI agents must include clinical accuracy and safety metrics. For diagnostic agents, this includes sensitivity (true positive rate) and specificity (true negative rate) for relevant conditions. For treatment recommendation agents, this includes clinical appropriateness rates validated by clinician review. Revalidation thresholds for clinical agents should be tighter than for non-clinical agents, reflecting the potential for patient harm. Time-based revalidation should align with clinical evidence update cycles — as medical knowledge evolves, agents trained on prior data may become clinically outdated even if their measured accuracy against historical data remains stable.
Critical Infrastructure. Performance monitoring for safety-critical AI agents must include physical safety metrics. For control system agents, this includes actuator accuracy, response time to safety conditions, and fault detection rate. Revalidation thresholds for safety-critical agents should be derived from safety integrity level requirements per IEC 61508. A degradation that raises the probability of dangerous failure above the level permitted by the required safety integrity level must trigger immediate suspension, not merely revalidation.
Basic Implementation — Performance baselines are captured at deployment for key metrics. Monitoring runs on a periodic basis (e.g., weekly accuracy assessment). Revalidation thresholds are defined for primary performance metrics. When thresholds are breached, a manual revalidation process is triggered. Monitoring covers operational metrics and one or two output quality metrics. Time-based revalidation occurs annually. This level provides basic drift detection but may miss gradual degradation between monitoring intervals and lacks statistical sophistication for distinguishing drift from noise.
Intermediate Implementation — Continuous monitoring covers operational, output quality, and governance compliance metrics. Statistical process control methods (CUSUM or EWMA) detect gradual drift. Independent output validation through sampling provides ground-truth accuracy measurements. Tiered thresholds trigger escalating responses (warning, revalidation, suspension). Segment-level monitoring detects degradation in specific populations or use cases. Time-based revalidation occurs at least every 6 months. Monitoring dashboards provide real-time visibility into performance trends.
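Segment-level monitoring amounts to applying the same baseline comparison per segment. A sketch, with hypothetical segment keys and figures:

```python
def segment_degradation(samples, baselines, delta=5.0):
    """Group labelled outcome samples by segment and flag any segment whose
    accuracy has fallen more than `delta` percentage points below its
    baseline. `samples` is a list of (segment, correct: bool) pairs;
    `baselines` maps segment keys to deployment-time accuracy (%)."""
    totals, correct = {}, {}
    for seg, ok in samples:
        totals[seg] = totals.get(seg, 0) + 1
        correct[seg] = correct.get(seg, 0) + (1 if ok else 0)
    flagged = {}
    for seg, n in totals.items():
        acc = 100.0 * correct[seg] / n
        if baselines.get(seg, 100.0) - acc > delta:
            flagged[seg] = acc
    return flagged
```

This is how a Scenario A-style failure confined to new product lines would surface: aggregate accuracy may still look tolerable while one segment has collapsed well past its threshold.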
Advanced Implementation — All intermediate capabilities plus: automated revalidation pipelines execute the full validation suite on a scheduled basis. Machine learning-based anomaly detection identifies degradation patterns that statistical methods may miss. Predictive drift analysis forecasts when revalidation thresholds are likely to be breached, enabling proactive remediation. Causal analysis identifies the root cause of detected drift (data distribution shift, upstream change, model staleness). The organisation can demonstrate continuous performance assurance to regulators, with no gap exceeding 24 hours between monitoring assessments for any production agent.
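The predictive element can be as simple as a least-squares trend extrapolated to the revalidation threshold. This sketch assumes approximately linear drift and is illustrative only:

```python
def forecast_breach(history, baseline, threshold_delta):
    """Fit a least-squares line to a metric's history (one reading per time
    step) and estimate the time index at which degradation from `baseline`
    reaches `threshold_delta`. Returns None if the trend is flat or improving."""
    n = len(history)
    ts = list(range(n))
    t_mean = sum(ts) / n
    x_mean = sum(history) / n
    num = sum((t - t_mean) * (x - x_mean) for t, x in zip(ts, history))
    den = sum((t - t_mean) ** 2 for t in ts)
    slope = num / den
    if slope >= 0:
        return None  # no downward trend to extrapolate
    intercept = x_mean - slope * t_mean
    # Solve intercept + slope * t = baseline - threshold_delta for t
    return (baseline - threshold_delta - intercept) / slope

# Six monthly readings drifting ~2.2 pp/month from a 96.2% baseline:
readings = [96.2, 94.0, 91.8, 89.6, 87.4, 85.2]
print(forecast_breach(readings, baseline=96.2, threshold_delta=10.0))
```

A forecast landing inside the next monitoring interval is the cue for proactive remediation (retraining, input investigation) before the mandatory trigger fires.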
Required artefacts:
Retention requirements:
Access requirements:
Test 8.1: Baseline Establishment Verification
Test 8.2: Drift Detection Sensitivity
Test 8.3: Sudden Shift Detection
Test 8.4: Revalidation Trigger Enforcement
Test 8.5: Segment-Level Degradation Detection
Test 8.6: Time-Based Revalidation Execution
Test 8.7: Monitoring Continuity Under Load
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 9 (Risk Management System) | Direct requirement |
| EU AI Act | Article 61 (Post-Market Monitoring) | Direct requirement |
| SOX | Section 404 (Internal Controls Over Financial Reporting) | Supports compliance |
| FCA SYSC | 6.1.1R (Systems and Controls) | Supports compliance |
| PRA SS1/23 | Model Risk Management | Direct requirement |
| NIST AI RMF | MEASURE 2.6, MANAGE 2.2, MANAGE 4.2 | Supports compliance |
| ISO 42001 | Clause 9.1 (Monitoring, Measurement, Analysis and Evaluation) | Direct requirement |
| DORA | Article 9 (ICT Risk Management Framework) | Supports compliance |
Article 9 requires that the risk management system be maintained "throughout the entire lifecycle" of the AI system and be "regularly systematically updated." Performance drift monitoring directly implements this lifecycle requirement. The regulation's emphasis on continuous risk management means that validation at deployment alone is insufficient — the risk management system must detect and respond to changes in system performance over time. AG-074's revalidation threshold mechanism ensures that the risk management response is triggered when performance degrades beyond acceptable levels, maintaining the risk management system's effectiveness throughout the system's operational life.
Article 61 requires providers of high-risk AI systems to establish a post-market monitoring system to "actively and systematically" collect and analyse data on performance throughout the system's lifetime. AG-074 directly implements post-market monitoring for AI agents. The continuous monitoring, statistical drift detection, independent output validation, and revalidation trigger mechanisms are the operational implementation of Article 61's post-market monitoring requirements. The evidence artefacts — performance trend data, threshold breach records, revalidation outcomes — provide the documentation that Article 61 requires.
For AI agents involved in financial operations, ongoing performance monitoring is an internal control that ensures the agent continues to process financial data accurately. A SOX auditor will assess whether the organisation has mechanisms to detect performance degradation in financial AI systems between formal review cycles. The monitoring system, thresholds, and revalidation triggers provide this control. Performance trend data and revalidation records provide the evidence that SOX compliance requires.
The FCA expects firms to maintain adequate systems and controls, including ongoing monitoring of technology system performance. For AI agents, this means that the firm must have mechanisms to detect when an agent's performance degrades below acceptable levels. The revalidation threshold and response mechanism demonstrate to the FCA that the firm has proportionate controls for maintaining AI agent quality in production.
The PRA's SS1/23 on model risk management explicitly requires ongoing model performance monitoring, periodic model revalidation, and defined thresholds for triggering revalidation. AG-074 directly implements these requirements for AI agent models. The baseline capture, continuous monitoring, statistical drift detection, and revalidation trigger mechanisms map to SS1/23's expectations for model performance management throughout the model lifecycle.
MEASURE 2.6 addresses ongoing measurement and monitoring of AI system performance; MANAGE 2.2 addresses risk mitigation through continuous controls; MANAGE 4.2 addresses post-deployment monitoring and response. AG-074 supports compliance by implementing continuous performance monitoring, defined thresholds for intervention, and structured revalidation processes that maintain AI system performance within acceptable bounds.
Clause 9.1 requires organisations to determine what needs to be monitored and measured, the methods for monitoring and measurement, when monitoring and measurement shall be performed, and when results shall be analysed and evaluated. AG-074 directly implements Clause 9.1 for AI agent performance by defining what is monitored (performance baselines), how (statistical process control, independent validation), when (continuously with defined frequency), and how results trigger action (revalidation thresholds).
Article 9 requires financial entities to maintain an ICT risk management framework that includes mechanisms for detecting anomalous activities. Performance drift monitoring detects anomalous degradation in AI agent performance, supporting DORA compliance by ensuring that AI-related ICT risks are detected and managed on an ongoing basis.
| Field | Value |
|---|---|
| Severity Rating | High |
| Blast Radius | Organisation-wide — performance degradation affects all users and systems served by the drifting agent, with impact accumulating over time until detection |
Consequence chain: Without performance drift and revalidation threshold governance, an AI agent's performance can degrade silently from the levels validated at deployment to levels that would not have been accepted for deployment. The immediate technical failure is that the agent continues to operate while its performance is below the acceptance threshold: the organisation believes the agent is performing as validated, but actual performance has degraded. The insidious nature of this failure is that it accumulates over time; each day of undetected degradation adds to the total impact. For a classification agent with 17.8 percentage points of accuracy degradation over 8 months (Scenario A), the total impact is 22,000 misrouted interactions, an impact that would have been limited to roughly 3,000 had the drift been detected at month 3. For a financial advisory agent with systematically incorrect pricing (Scenario B), 11 weeks of undetected degradation affects 1,400 clients at a cost of £2.1 million.

The operational impact depends on the agent's function and the dimension of degradation: accuracy degradation causes incorrect outputs, safety degradation causes harmful outputs, governance compliance degradation causes regulatory exposure, and reliability degradation causes service disruptions. The business consequence includes financial losses from incorrect agent outputs, regulatory enforcement for operating systems below required performance levels, customer harm from degraded service quality, and the remediation cost of addressing the accumulated impact, which is always larger than the cost of timely detection.

The temporal dimension amplifies the severity: every day without detection is another day of accumulated harm. An organisation that detects degradation at day 3 faces a bounded remediation; an organisation that detects it at month 8 faces a remediation that may exceed the agent's total lifetime value.
Cross-references: AG-007 (Governance Configuration Control), AG-008 (Governance Continuity Under Failure), AG-022 (Behavioural Drift Detection), AG-048 (AI Model Provenance and Integrity), AG-071 (Pre-Deployment Validation and Acceptance Governance), AG-072 (Change Impact Assessment Governance), AG-073 (Staged Rollout and Canary Governance), AG-010 (Time-Bounded Authority Enforcement).