Third-Party Behaviour Drift Monitoring Governance requires that every organisation deploying AI agents maintain continuous, structured monitoring of the behavioural characteristics of the third-party AI models, APIs, tools, and services the agent depends upon. Unlike internal components governed by AG-022, third-party dependencies change without the consuming organisation's knowledge or consent: a provider may retrain a model, update an API, modify rate limits, alter response distributions, or silently deprecate capabilities. AG-091 mandates that organisations establish behavioural baselines for every third-party dependency, monitor for statistically significant deviations from those baselines, and trigger governance responses when drift exceeds defined thresholds. The principle is that an organisation cannot govern what it does not measure: if a third-party model's accuracy degrades from 94% to 71% over six weeks and the consuming organisation has no monitoring in place, it is operating on unvalidated assumptions about a component it does not control.
Scenario A — Silent Model Retraining Degrades Classification Accuracy: A financial services firm uses a third-party AI model via API for transaction fraud scoring. At integration time, the model achieved a 96.2% true-positive rate on the firm's validation set and a 2.1% false-positive rate. The provider retrains the model to improve performance for a different customer segment. No notification is issued — the API contract is "best-effort fraud scoring" with no SLA on accuracy. Over four weeks, the true-positive rate for the firm's transaction profile drops to 78.4%, and the false-positive rate rises to 11.7%. The firm's agent continues to rely on the scores, approving 412 transactions that the original model would have flagged. Total fraud losses attributable to the drift: £1.83 million.
What went wrong: The firm had no behavioural baseline recorded for the third-party model, and no monitoring compared current performance against historical distributions. The degradation was gradual: no single day triggered an obvious failure, and the losses accumulated over weeks. The firm treated the API as a stable black box when in reality it was a moving target. Consequence: £1.83 million in fraud losses, FCA supervisory review of outsourcing controls, and a mandatory remediation programme for third-party AI dependency monitoring.
Scenario B — API Response Distribution Shift Causes Downstream Bias: A public sector organisation uses a third-party natural language processing API for citizen complaint categorisation. The API provider updates its underlying model to address a different customer's requirements, inadvertently shifting the model's classification boundaries. Complaints in languages other than English are now categorised as "low priority" at 3.2x the previous rate. The organisation's agent routes these complaints to a slower processing queue. Over three months, 2,847 non-English complaints experience an average 14-day delay compared to the previous 3-day average.
What went wrong: No monitoring tracked the distribution of classifications by language or demographic category. The shift was invisible at the aggregate level — total complaint volumes were unchanged — but the per-category distributions changed materially. Consequence: Equality Act 2010 investigation, mandatory impact assessment, public disclosure requirement, remediation cost of £340,000 including manual re-review of affected complaints.
Scenario C — Third-Party Embedding Model Semantic Drift: A research organisation's agent uses a third-party embedding API for document retrieval. The provider releases a new model version behind the same API endpoint, maintaining dimensional compatibility but altering the semantic space. Documents previously retrieved for a query about "cardiovascular risk factors" now include unrelated oncology papers because the embedding distances have shifted. The agent generates three research summaries containing materially irrelevant citations before the drift is detected through manual review.
What went wrong: The organisation validated schema compatibility (vector dimensions matched) but did not monitor semantic consistency. Embedding drift is invisible to structural checks — the vectors have the right shape but wrong meaning. No reference query set existed to detect semantic shifts. Consequence: Three published summaries required retraction, reputational damage to research credibility, estimated cost of £87,000 in re-review and correction.
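Semantic drift of this kind can be caught with a reference query set whose embeddings are snapshotted at baseline time and periodically re-fetched from the provider. Below is a minimal sketch assuming a plain cosine-similarity check over stored vectors; the function names, the toy three-dimensional vectors, and the 0.95 threshold are illustrative assumptions, not normative values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_drift(baseline_embeddings, current_embeddings, min_similarity=0.95):
    """Compare stored baseline embeddings of reference queries against freshly
    fetched embeddings for the same queries; flag any query whose vector has
    moved, even though its dimensionality still matches the schema."""
    drifted = {}
    for query, base_vec in baseline_embeddings.items():
        sim = cosine(base_vec, current_embeddings[query])
        if sim < min_similarity:
            drifted[query] = round(sim, 3)
    return drifted

# Illustrative data: the provider swapped models behind the same endpoint.
baseline = {"cardiovascular risk factors": [0.9, 0.1, 0.0]}
current  = {"cardiovascular risk factors": [0.3, 0.8, 0.2]}
print(semantic_drift(baseline, current))
```

A structural check would pass here (both vectors are three-dimensional); only the similarity comparison reveals that the semantic space has shifted.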
Scope: This dimension applies to all AI agents that consume outputs from third-party AI models, machine learning APIs, natural language processing services, embedding providers, classification services, or any external service whose behaviour is driven by learned parameters rather than deterministic logic. The scope includes model-as-a-service APIs, hosted inference endpoints, third-party plugins that incorporate AI components, and any dependency where the provider may change the underlying model or algorithm without bilateral agreement. Dependencies on deterministic APIs with fixed logic (e.g., currency conversion at published rates) are excluded unless the API incorporates ML-driven components. The scope extends to indirect dependencies: if an agent calls a third-party orchestration layer that itself calls AI models, the indirect AI dependencies are in scope even if the agent does not call them directly.
4.1. A conforming system MUST establish and maintain a documented behavioural baseline for every third-party AI dependency, recording expected output distributions, accuracy metrics, latency characteristics, and response schema at the point of integration and after each known provider update.
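A baseline of this kind might be captured as a small versioned artefact whose integrity can be verified later. The sketch below is illustrative only: the field names, example values, and fingerprinting scheme are assumptions, not part of the requirement.

```python
import json
import hashlib
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class DependencyBaseline:
    """Behavioural baseline for one third-party AI dependency (per 4.1)."""
    dependency_id: str
    captured_at: str
    output_distribution: dict   # e.g. class label -> observed rate
    accuracy_metrics: dict      # e.g. {"tpr": 0.962, "fpr": 0.021}
    latency_ms: dict            # e.g. {"p50": 120, "p99": 850}
    response_schema_version: str

    def fingerprint(self) -> str:
        # Stable hash over the record so any later mutation is detectable.
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

baseline = DependencyBaseline(
    dependency_id="fraud-scoring-api",   # illustrative identifier
    captured_at=datetime.now(timezone.utc).isoformat(),
    output_distribution={"fraud": 0.03, "legit": 0.97},
    accuracy_metrics={"tpr": 0.962, "fpr": 0.021},
    latency_ms={"p50": 120, "p99": 850},
    response_schema_version="v2",
)
print(baseline.fingerprint())
```

Recording the fingerprint alongside the approval record makes the baseline an immutable reference point for later drift comparisons and for the re-baselining required by 4.6.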
4.2. A conforming system MUST implement continuous monitoring that compares current third-party dependency behaviour against the established baseline using statistically rigorous methods, with monitoring frequency proportional to the criticality of the dependency.
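One statistically rigorous comparison that needs no ground-truth labels is the two-sample Kolmogorov-Smirnov statistic between a baseline score window and a current window. The sketch below is a self-contained illustration; the score samples and the threshold value are assumptions (in practice the threshold would be calibrated per 4.3).

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between the
    empirical CDFs of a baseline window and a current window."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))

    def ecdf(sorted_vals, x):
        # Fraction of values less than or equal to x.
        return bisect.bisect_right(sorted_vals, x) / len(sorted_vals)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

# Illustrative fraud-score windows: the current window has shifted upward.
baseline_scores = [0.10, 0.15, 0.20, 0.22, 0.30, 0.35, 0.40, 0.42, 0.50, 0.55]
current_scores  = [0.40, 0.45, 0.50, 0.52, 0.60, 0.65, 0.70, 0.72, 0.80, 0.85]

DRIFT_THRESHOLD = 0.3  # illustrative; calibrate per dependency (per 4.3)
d = ks_statistic(baseline_scores, current_scores)
print(f"KS statistic {d:.2f}, drift={'YES' if d > DRIFT_THRESHOLD else 'no'}")
```

A production deployment would run this over rolling windows sized to the dependency's traffic volume, with the comparison frequency scaled to criticality as the requirement states.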
4.3. A conforming system MUST define drift thresholds for each third-party dependency that trigger governance responses when exceeded, with thresholds calibrated to the business impact of degraded dependency performance.
4.4. A conforming system MUST generate a structured alert within 24 hours of detecting a statistically significant deviation from the behavioural baseline that exceeds defined thresholds.
4.5. A conforming system MUST maintain a documented response procedure for each severity level of detected drift, including provisions for fallback operation, human review, and dependency suspension.
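A tiered response procedure of the kind 4.5 describes can be expressed as data, so the playbook is auditable alongside the monitoring code. The severity names, thresholds, actions, and response-time figures below are illustrative assumptions, not mandated values.

```python
from enum import Enum

class DriftSeverity(Enum):
    MINOR = 1     # within tolerance band: log and trend only
    MODERATE = 2  # exceeds warning threshold: human review
    CRITICAL = 3  # exceeds hard threshold: suspend the dependency

# Illustrative response table (per 4.5); actions and SLAs are assumptions.
RESPONSE_PLAYBOOK = {
    DriftSeverity.MINOR:    {"action": "log_and_trend",        "max_response_hours": None},
    DriftSeverity.MODERATE: {"action": "human_review",         "max_response_hours": 24},
    DriftSeverity.CRITICAL: {"action": "suspend_and_fallback", "max_response_hours": 1},
}

def classify(drift_score, warn=0.05, crit=0.15):
    """Map a drift score onto a severity tier (thresholds are illustrative)."""
    if drift_score >= crit:
        return DriftSeverity.CRITICAL
    if drift_score >= warn:
        return DriftSeverity.MODERATE
    return DriftSeverity.MINOR

for score in (0.02, 0.08, 0.20):
    sev = classify(score)
    print(score, sev.name, RESPONSE_PLAYBOOK[sev]["action"])
```

Keeping the playbook as a reviewable data structure means the fallback and suspension provisions can be approved, versioned, and tested in the same way as the baselines themselves.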
4.6. A conforming system MUST re-establish the behavioural baseline after any confirmed provider update, model retraining, or API version change, with the new baseline subject to the same approval process as the original.
4.7. A conforming system SHOULD implement automated reference query sets — canary queries with known expected outputs — that are executed on a defined schedule against each third-party AI dependency to detect drift proactively.
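A canary harness along these lines might look as follows. The canary inputs, the `call_dependency` stand-in (a placeholder for the real third-party client), and the failure-rate threshold are all illustrative assumptions.

```python
# Canary query harness sketch (per 4.7). All names are illustrative.
CANARY_SET = [
    {"input": "wire transfer £9,400 to new payee, 02:00", "expected": "fraud"},
    {"input": "monthly rent payment, established payee",  "expected": "legit"},
    # ... in practice tens of canaries per dependency, covering known edge cases
]

def call_dependency(text):
    # Placeholder for the real third-party API call.
    return "fraud" if "new payee" in text else "legit"

def run_canaries(canaries, max_failure_rate=0.1):
    """Execute every canary and report the failure rate against a threshold."""
    failures = [c for c in canaries if call_dependency(c["input"]) != c["expected"]]
    rate = len(failures) / len(canaries)
    return {"failure_rate": rate, "breach": rate > max_failure_rate, "failures": failures}

result = run_canaries(CANARY_SET)
print(result["failure_rate"], result["breach"])
```

Because the expected outputs are fixed at baseline time, canary failures localise drift to the dependency itself rather than to changes in live traffic.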
4.8. A conforming system SHOULD monitor distributional characteristics of third-party outputs (not just aggregate accuracy) to detect shifts in per-category, per-demographic, or per-input-type performance that may be masked by aggregate stability.
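Per-category shifts of the kind in Scenario B can be tested with a two-proportion z-test on a single category's labelling rate between the baseline window and the current window. The counts below are illustrative assumptions chosen to echo that scenario.

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """z-score for the difference between two proportions
    (baseline window vs current window)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Illustrative: rate of "low priority" labels for non-English complaints.
z = two_proportion_z(successes_a=80,  n_a=1000,   # baseline window: 8%
                     successes_b=250, n_b=1000)   # current window: 25%
print(f"z = {z:.1f}")  # |z| well above 3 flags a shift aggregate metrics can mask
```

Run per category (language, demographic group, input type), this test surfaces exactly the failure mode of Scenario B: aggregate volumes unchanged while one segment's treatment shifts materially.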
4.9. A conforming system SHOULD maintain a rolling history of at least 90 days of third-party dependency behavioural metrics to support trend analysis and drift detection.
4.10. A conforming system MAY implement shadow evaluation — routing a sample of live requests to a reference model for comparison — to detect drift in the absence of ground truth labels.
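Shadow evaluation can be as simple as sampling live inputs, scoring them against a frozen reference snapshot, and tracking the agreement rate: falling agreement signals drift even with no labels. The sketch below uses stand-in models and an assumed 5% sample rate.

```python
import random

def shadow_evaluate(requests, primary, reference, sample_rate=0.05, seed=42):
    """Route a sample of live requests to a reference model and measure
    the agreement rate between primary and reference outputs."""
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    sampled = [r for r in requests if rng.random() < sample_rate]
    if not sampled:
        return None
    agree = sum(primary(r) == reference(r) for r in sampled)
    return agree / len(sampled)

# Stand-ins: the live third-party model vs a frozen reference snapshot.
primary   = lambda x: "high" if x > 0.7 else "low"
reference = lambda x: "high" if x > 0.5 else "low"  # they diverge on (0.5, 0.7]

requests = [i / 1000 for i in range(1000)]
agreement = shadow_evaluate(requests, primary, reference)
print(f"agreement: {agreement:.2f}")
```

The agreement rate is not an accuracy measure; it only detects that the two models have diverged, which is precisely the signal available in the absence of ground truth.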
Third-Party Behaviour Drift Monitoring Governance addresses a fundamental asymmetry in AI supply chains: the consuming organisation bears the risk of third-party behavioural changes, but the provider controls the timing, scope, and notification of those changes. Traditional software supply chain governance focuses on version management and vulnerability patching — deterministic concerns where a specific version behaves identically every time it executes. AI dependencies are fundamentally different: a model endpoint may change behaviour without any version change, a retraining cycle may shift output distributions without altering the API contract, and a provider may optimise for a different customer segment in ways that degrade performance for the consuming organisation.
AG-022 (Behavioural Drift Detection) governs drift in the agent's own behaviour. AG-091 extends this principle to the agent's third-party dependencies, recognising that an agent's outputs are only as reliable as its inputs. An agent that makes correct decisions based on third-party fraud scores is functionally compromised when those scores silently degrade — the agent's own logic is unchanged, but its effective accuracy has declined because its inputs have drifted.
The detection challenge is compounded by the absence of ground truth in many operational contexts. When a third-party classification API changes its output distribution, the consuming organisation often cannot immediately determine whether the new distribution is more or less accurate — it can only determine that the distribution has changed. This is why AG-091 requires monitoring of distributional characteristics rather than only accuracy metrics: distribution shifts are detectable even without ground truth, whereas accuracy degradation requires labelled data that may not be available in real time.
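Distribution-only monitoring of this kind is commonly done with the Population Stability Index (PSI), which scores the divergence between the baseline output distribution and the current one without any ground truth. A minimal sketch follows; the category counts are illustrative, and the interpretation bands in the docstring are the conventional rule of thumb rather than a normative threshold.

```python
import math

def psi(baseline_counts, current_counts):
    """Population Stability Index over shared output categories.
    Conventional rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift,
    > 0.25 major shift (calibrate per dependency in practice)."""
    total_b = sum(baseline_counts.values())
    total_c = sum(current_counts.values())
    score = 0.0
    for cat in baseline_counts:
        pb = max(baseline_counts[cat] / total_b, 1e-6)          # floor avoids log(0)
        pc = max(current_counts.get(cat, 0) / total_c, 1e-6)
        score += (pc - pb) * math.log(pc / pb)
    return score

# Illustrative classification output distributions.
baseline_dist = {"high": 100, "medium": 300, "low": 600}
current_dist  = {"high": 60,  "medium": 240, "low": 700}
print(f"PSI = {psi(baseline_dist, current_dist):.3f}")
```

Because PSI compares distributions rather than accuracy, it is computable in real time from production traffic alone, which is exactly the property the paragraph above requires.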
The economic dynamics of AI-as-a-service create structural incentives for providers to make changes that benefit their largest customers, potentially at the expense of smaller customers whose workloads represent edge cases. AG-045 (Economic Incentive Alignment Verification) addresses the contractual dimension; AG-091 addresses the technical monitoring dimension that validates whether contractual commitments are being met in practice.
AG-091 requires organisations to treat third-party AI dependencies as inherently unstable components that require continuous validation. The implementation architecture should separate baseline establishment, continuous monitoring, drift detection, and response orchestration into distinct, auditable components.
Recommended patterns:
- Record baselines as versioned, immutable artefacts with formal approval records, and re-establish them after every confirmed provider change (per 4.1, 4.6).
- Execute canary query sets on a schedule proportional to dependency criticality (per 4.7).
- Monitor per-category and per-demographic distributions, not only aggregate accuracy (per 4.8).
- Use shadow evaluation against a reference model where ground truth labels are unavailable (per 4.10).
- Tier response procedures by severity, with fallback and suspension paths defined in advance (per 4.5).
Anti-patterns to avoid:
- Treating a third-party API as a stable black box rather than a moving target (Scenario A).
- Validating schema compatibility only: correctly shaped vectors say nothing about semantic consistency (Scenario C).
- Monitoring aggregate metrics only, which masks per-category and per-demographic shifts (Scenario B).
- Continuing to use a stale baseline after a known provider update instead of re-establishing it (contrary to 4.6).
Financial Services. Third-party credit scoring, fraud detection, and risk models are subject to SR 11-7 (Federal Reserve) and SS1/23 (PRA) model risk management expectations. Drift monitoring for third-party AI models must be integrated with existing model risk management frameworks. The PRA expects firms to demonstrate ongoing performance monitoring of models, including third-party models, with defined thresholds for remediation. For critical financial models, daily canary set execution and real-time production traffic monitoring are baseline expectations.
Healthcare. Third-party clinical AI models (diagnostic assistants, triage classifiers, drug interaction checkers) are subject to MDR 2017/745 post-market surveillance requirements. Drift in clinical AI dependencies must be monitored with particular attention to differential performance across patient demographics. A drift that degrades diagnostic accuracy for a specific demographic group may constitute a discriminatory failure even if aggregate performance remains acceptable. FDA guidance on AI/ML-based SaMD requires monitoring for dataset drift and concept drift in deployed models.
Critical Infrastructure. Third-party AI components in safety-critical systems (predictive maintenance models, anomaly detection systems, control optimisation algorithms) require the most stringent monitoring regimes. Drift in a predictive maintenance model could cause missed failure predictions with physical safety consequences. IEC 62443 requirements for component validation extend to behavioural validation of AI components, not only functional and security validation.
Basic Implementation — The organisation has identified its third-party AI dependencies and established initial behavioural baselines for each. A canary query set of at least 20 inputs per dependency is executed weekly. Drift detection uses simple threshold comparisons (e.g., accuracy drops below 90% of baseline). Alerts are generated via email to a designated team. Response procedures are documented at a high level. Baseline records exist but may not be formally versioned. This level provides early warning of major drift events but may miss gradual degradation or distributional shifts that do not affect aggregate metrics.
Intermediate Implementation — All basic capabilities plus: canary sets contain 50+ inputs per dependency executed daily for critical dependencies. Production traffic is sampled at 5-10% with rolling distributional monitoring using statistical process control methods. Drift thresholds are calibrated per-dependency based on business impact assessment. Baselines are versioned, immutable artefacts with formal approval records. Distributional monitoring covers per-category and per-demographic breakdowns. Response procedures are tiered by severity with defined escalation paths and maximum response times. Integration with AG-014 ensures structural (schema) validation and behavioural monitoring operate as complementary layers.
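The statistical process control methods mentioned here can be as simple as Shewhart-style control limits over the rolling metric history. The sketch below applies three-sigma limits to daily canary pass rates; the history values and today's observation are illustrative assumptions.

```python
import statistics

def control_limits(history, sigmas=3.0):
    """Shewhart-style control limits from a rolling metric history."""
    mean = statistics.fmean(history)
    sd = statistics.stdev(history)
    return mean - sigmas * sd, mean + sigmas * sd

# Illustrative: daily canary pass rates over the rolling window (per 4.9).
history = [0.96, 0.95, 0.97, 0.96, 0.94, 0.96, 0.95, 0.97, 0.96, 0.95]
lo, hi = control_limits(history)

today = 0.78  # today's observed pass rate
out_of_control = not (lo <= today <= hi)
print(f"limits=({lo:.3f}, {hi:.3f}), out_of_control={out_of_control}")
```

Control limits derived from the dependency's own history adapt to its normal variability, so the same mechanism works across dependencies with very different baseline noise levels.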
Advanced Implementation — All intermediate capabilities plus: shadow evaluation routes a sample of live traffic to a reference model for comparison, enabling drift detection without ground truth. Semantic consistency monitoring covers embedding and language model dependencies. Automated circuit breakers suspend third-party dependencies that exceed critical drift thresholds, with automatic fallback to alternative providers or degraded-mode operation. Drift monitoring is integrated with the organisation's model risk management framework. Historical drift data feeds into supplier risk scoring per AG-093. Provider SLA compliance is tracked quantitatively against contractual performance commitments. The organisation can demonstrate to regulators that no third-party AI dependency operates without continuous behavioural monitoring.
Required artefacts:
- Behavioural baseline records for each third-party AI dependency, including approval records (per 4.1).
- Drift monitoring logs and structured alert records (per 4.2, 4.4).
- Per-dependency threshold definitions with business-impact calibration rationale (per 4.3).
- Documented response procedures for each severity level (per 4.5).
- Baseline re-establishment and re-approval records following provider updates (per 4.6).
Retention requirements: rolling behavioural metrics for at least 90 days to support trend analysis (per 4.9); baseline records, drift alerts, and response records for the operational life of the dependency plus any longer period required by the applicable regulatory regime.
Access requirements: monitoring records and baselines must be accessible to the governance function responsible for third-party risk, to internal audit, and to regulators on request, consistent with the separation of auditable components described above.
Testing AG-091 compliance requires validating the monitoring infrastructure's ability to detect drift across different modalities and response magnitudes.
Test 8.1: Canary Set Drift Detection Sensitivity — inject controlled degradations of varying magnitude into canary responses and verify that the monitoring detects each degradation at or below the corresponding defined threshold (per 4.3, 4.7).
Test 8.2: Production Traffic Distributional Monitoring — introduce a per-category distribution shift that leaves aggregate metrics unchanged and verify that segmented monitoring flags it (per 4.8).
Test 8.3: Baseline Re-establishment After Provider Update — simulate a confirmed provider update and verify that a new baseline is established and formally approved before monitoring resumes against it (per 4.6).
Test 8.4: Alert Generation and Response Execution — trigger a threshold breach and verify that a structured alert is generated within 24 hours and that the documented severity-appropriate response is executed (per 4.4, 4.5).
Test 8.5: Semantic Drift Detection for Embedding Dependencies — alter the semantic behaviour of a test embedding endpoint while preserving vector dimensions and verify that reference-query monitoring detects the shift (per Scenario C).
Test 8.6: Monitoring Continuity Under Dependency Unavailability — take a dependency offline and verify that monitoring records the outage and resumes cleanly, rather than silently reporting stale or absent metrics.
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 9 (Risk Management System) | Direct requirement |
| EU AI Act | Article 17 (Quality Management System) | Supports compliance |
| PRA SS1/23 | Model Risk Management Principles — Ongoing Monitoring | Direct requirement |
| DORA | Article 28 (ICT Third-Party Risk) | Direct requirement |
| NIST AI RMF | MEASURE 2.6, MANAGE 3.1 | Supports compliance |
| ISO 42001 | Clause 8.4 (Operation of AI System), Clause 9.1 (Monitoring) | Supports compliance |
| SR 11-7 | Federal Reserve Guidance on Model Risk Management | Supports compliance |
Article 9 requires providers and deployers of high-risk AI systems to establish a risk management system that includes ongoing monitoring of the system's performance. When a high-risk AI system depends on third-party AI models, the deployer's risk management obligation extends to monitoring those dependencies. AG-091 implements the monitoring component for third-party AI dependencies. The regulation requires that risks be "identified and analysed" on an ongoing basis — a third-party dependency whose behaviour has drifted materially is an unidentified risk if no monitoring is in place.
Article 17 requires quality management procedures including techniques for monitoring, testing, and validation of the AI system. For systems incorporating third-party AI components, quality management must extend to validating that those components continue to perform as expected. AG-091 implements this requirement through structured behavioural monitoring and baseline management.
SS1/23 expects firms to maintain ongoing performance monitoring of all models, including third-party models used in decision-making. The guidance specifically addresses vendor models and requires firms to validate that vendor model performance remains within acceptable bounds. AG-091 directly implements this requirement by establishing baselines, monitoring for drift, and defining response procedures. The PRA expects monitoring to cover both aggregate and segmented performance metrics.
Article 28 requires financial entities to manage ICT third-party risk, including ongoing monitoring of the performance and quality of ICT services provided by third parties. For AI-as-a-service dependencies, behavioural drift monitoring is a specific instantiation of this broader requirement. DORA requires that monitoring be proportionate to the criticality of the service — aligning with AG-091's requirement that monitoring frequency be proportional to dependency criticality.
MEASURE 2.6 addresses the monitoring of AI system performance over time, including detection of performance degradation. MANAGE 3.1 addresses risk response actions when monitoring detects material changes. AG-091 supports compliance by implementing structured monitoring (MEASURE 2.6) with defined response procedures (MANAGE 3.1) for third-party AI dependencies.
Clause 8.4 addresses the operation of AI systems including controls on external components. Clause 9.1 requires monitoring, measurement, analysis, and evaluation of the AI management system's performance. AG-091 implements monitoring of third-party AI components as part of the operational control framework required by Clause 8.4, with the measurement and evaluation discipline required by Clause 9.1.
SR 11-7 requires ongoing monitoring of all models, with explicit attention to vendor models. The guidance states that the use of vendor models does not diminish the model risk management obligation — banks are expected to conduct their own performance monitoring of vendor models. AG-091 directly implements this requirement for AI vendor models.
| Field | Value |
|---|---|
| Severity Rating | High |
| Blast Radius | All agent operations dependent on the drifted third-party component — potentially cross-functional where multiple agents share a dependency |
Consequence chain: Without third-party behaviour drift monitoring, a silent change in a third-party AI dependency propagates through the agent's decision-making without detection. The immediate technical failure is undetected degradation of a critical input to the agent's reasoning. The operational impact depends on the dependency's role: a drifted fraud model causes increased fraud losses; a drifted classification model causes misrouted work items; a drifted embedding model causes irrelevant retrievals. The business impact accumulates silently over the drift detection gap — the period between when drift begins and when it is detected. For an organisation with no monitoring, this gap is potentially unlimited, bounded only by the point at which downstream consequences become visible through other channels (customer complaints, financial losses, regulatory findings). The severity is amplified by the fact that the consuming organisation has no control over the timing of third-party changes. Cross-references: AG-014 (External Dependency Integrity) governs structural validation; AG-022 (Behavioural Drift Detection) governs internal drift; AG-048 (AI Model Provenance and Integrity) governs model identity; AG-093 (Supplier Concentration and Exit Governance) addresses the concentration risk that amplifies drift impact.