Third-Party Behaviour Drift Monitoring Governance requires that every organisation deploying AI agents maintain continuous, structured monitoring of the behavioural characteristics of the third-party AI models, APIs, tools, and services the agent depends upon. Unlike internal components governed by AG-022, third-party dependencies change without the consuming organisation's knowledge or consent: a provider may retrain a model, update an API, modify rate limits, alter response distributions, or silently deprecate capabilities. AG-091 mandates that organisations establish behavioural baselines for every third-party dependency, monitor for statistically significant deviations from those baselines, and trigger governance responses when drift exceeds defined thresholds. The principle is that an organisation cannot govern what it does not measure: if a third-party model's accuracy degrades from 94% to 71% over six weeks and the consuming organisation has no monitoring in place, it is operating on unvalidated assumptions about a component it does not control.
Scenario A — Silent Model Retraining Degrades Classification Accuracy: A financial services firm uses a third-party AI model via API for transaction fraud scoring. At integration time, the model achieved a 96.2% true-positive rate on the firm's validation set and a 2.1% false-positive rate. The provider retrains the model to improve performance for a different customer segment. No notification is issued — the API contract is "best-effort fraud scoring" with no SLA on accuracy. Over four weeks, the true-positive rate for the firm's transaction profile drops to 78.4%, and the false-positive rate rises to 11.7%. The firm's agent continues to rely on the scores, approving 412 transactions that the original model would have flagged. Total fraud losses attributable to the drift: £1.83 million.
What went wrong: The firm had no behavioural baseline recorded for the third-party model, and no monitoring compared current performance against historical distributions. The degradation was gradual: no single day triggered an obvious failure, and the losses accumulated over weeks. The firm treated the API as a stable black box when in reality it was a moving target. Consequence: £1.83 million in fraud losses, FCA supervisory review of outsourcing controls, and a mandatory remediation programme for third-party AI dependency monitoring.
Scenario B — API Response Distribution Shift Causes Downstream Bias: A public sector organisation uses a third-party natural language processing API for citizen complaint categorisation. The API provider updates its underlying model to address a different customer's requirements, inadvertently shifting the model's classification boundaries. Complaints in languages other than English are now categorised as "low priority" at 3.2x the previous rate. The organisation's agent routes these complaints to a slower processing queue. Over three months, 2,847 non-English complaints experience an average 14-day delay compared to the previous 3-day average.
What went wrong: No monitoring tracked the distribution of classifications by language or demographic category. The shift was invisible at the aggregate level — total complaint volumes were unchanged — but the per-category distributions changed materially. Consequence: Equality Act 2010 investigation, mandatory impact assessment, public disclosure requirement, remediation cost of £340,000 including manual re-review of affected complaints.
Scenario C — Third-Party Embedding Model Semantic Drift: A research organisation's agent uses a third-party embedding API for document retrieval. The provider releases a new model version behind the same API endpoint, maintaining dimensional compatibility but altering the semantic space. Documents previously retrieved for a query about "cardiovascular risk factors" now include unrelated oncology papers because the embedding distances have shifted. The agent generates three research summaries containing materially irrelevant citations before the drift is detected through manual review.
What went wrong: The organisation validated schema compatibility (vector dimensions matched) but did not monitor semantic consistency. Embedding drift is invisible to structural checks — the vectors have the right shape but wrong meaning. No reference query set existed to detect semantic shifts. Consequence: Three published summaries required retraction, reputational damage to research credibility, estimated cost of £87,000 in re-review and correction.
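Semantic drift of this kind can be caught with a reference query set whose embeddings are snapshotted at baseline time and periodically re-fetched from the provider. Below is a minimal sketch assuming a plain cosine-similarity check over stored vectors; the function names, the toy three-dimensional vectors, and the 0.95 threshold are illustrative assumptions, not normative values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_drift(baseline_embeddings, current_embeddings, min_similarity=0.95):
    """Compare stored baseline embeddings of reference queries against freshly
    fetched embeddings for the same queries; flag any query whose vector has
    moved, even though its dimensionality still matches the schema."""
    drifted = {}
    for query, base_vec in baseline_embeddings.items():
        sim = cosine(base_vec, current_embeddings[query])
        if sim < min_similarity:
            drifted[query] = round(sim, 3)
    return drifted

# Illustrative data: the provider swapped models behind the same endpoint.
baseline = {"cardiovascular risk factors": [0.9, 0.1, 0.0]}
current  = {"cardiovascular risk factors": [0.3, 0.8, 0.2]}
print(semantic_drift(baseline, current))
```

A structural check would pass here (both vectors are three-dimensional); only the similarity comparison reveals that the semantic space has shifted.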
Scope: This dimension applies to all AI agents that consume outputs from third-party AI models, machine learning APIs, natural language processing services, embedding providers, classification services, or any external service whose behaviour is driven by learned parameters rather than deterministic logic. The scope includes model-as-a-service APIs, hosted inference endpoints, third-party plugins that incorporate AI components, and any dependency where the provider may change the underlying model or algorithm without bilateral agreement. Dependencies on deterministic APIs with fixed logic (e.g., currency conversion at published rates) are excluded unless the API incorporates ML-driven components. The scope extends to indirect dependencies: if an agent calls a third-party orchestration layer that itself calls AI models, the indirect AI dependencies are in scope even if the agent does not call them directly.
4.1. A conforming system MUST establish and maintain a documented behavioural baseline for every third-party AI dependency, recording expected output distributions, accuracy metrics, latency characteristics, and response schema at the point of integration and after each known provider update.
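A baseline of this kind might be captured as a small versioned artefact whose integrity can be verified later. The sketch below is illustrative only: the field names, example values, and fingerprinting scheme are assumptions, not part of the requirement.

```python
import json
import hashlib
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class DependencyBaseline:
    """Behavioural baseline for one third-party AI dependency (per 4.1)."""
    dependency_id: str
    captured_at: str
    output_distribution: dict   # e.g. class label -> observed rate
    accuracy_metrics: dict      # e.g. {"tpr": 0.962, "fpr": 0.021}
    latency_ms: dict            # e.g. {"p50": 120, "p99": 850}
    response_schema_version: str

    def fingerprint(self) -> str:
        # Stable hash over the record so any later mutation is detectable.
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

baseline = DependencyBaseline(
    dependency_id="fraud-scoring-api",   # illustrative identifier
    captured_at=datetime.now(timezone.utc).isoformat(),
    output_distribution={"fraud": 0.03, "legit": 0.97},
    accuracy_metrics={"tpr": 0.962, "fpr": 0.021},
    latency_ms={"p50": 120, "p99": 850},
    response_schema_version="v2",
)
print(baseline.fingerprint())
```

Recording the fingerprint alongside the approval record makes the baseline an immutable reference point for later drift comparisons and for the re-baselining required by 4.6.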
4.2. A conforming system MUST implement continuous monitoring that compares current third-party dependency behaviour against the established baseline using statistically rigorous methods, with monitoring frequency proportional to the criticality of the dependency.
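One statistically rigorous comparison that needs no ground-truth labels is the two-sample Kolmogorov-Smirnov statistic between a baseline score window and a current window. The sketch below is a self-contained illustration; the score samples and the threshold value are assumptions (in practice the threshold would be calibrated per 4.3).

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between the
    empirical CDFs of a baseline window and a current window."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))

    def ecdf(sorted_vals, x):
        # Fraction of values less than or equal to x.
        return bisect.bisect_right(sorted_vals, x) / len(sorted_vals)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

# Illustrative fraud-score windows: the current window has shifted upward.
baseline_scores = [0.10, 0.15, 0.20, 0.22, 0.30, 0.35, 0.40, 0.42, 0.50, 0.55]
current_scores  = [0.40, 0.45, 0.50, 0.52, 0.60, 0.65, 0.70, 0.72, 0.80, 0.85]

DRIFT_THRESHOLD = 0.3  # illustrative; calibrate per dependency (per 4.3)
d = ks_statistic(baseline_scores, current_scores)
print(f"KS statistic {d:.2f}, drift={'YES' if d > DRIFT_THRESHOLD else 'no'}")
```

A production deployment would run this over rolling windows sized to the dependency's traffic volume, with the comparison frequency scaled to criticality as the requirement states.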
4.3. A conforming system MUST define drift thresholds for each third-party dependency that trigger governance responses when exceeded, with thresholds calibrated to the business impact of degraded dependency performance.
4.4. A conforming system MUST generate a structured alert within 24 hours of detecting a statistically significant deviation from the behavioural baseline that exceeds defined thresholds.
4.5. A conforming system MUST maintain a documented response procedure for each severity level of detected drift, including provisions for fallback operation, human review, and dependency suspension.
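A tiered response procedure of the kind 4.5 describes can be expressed as data, so the playbook is auditable alongside the monitoring code. The severity names, thresholds, actions, and response-time figures below are illustrative assumptions, not mandated values.

```python
from enum import Enum

class DriftSeverity(Enum):
    MINOR = 1     # within tolerance band: log and trend only
    MODERATE = 2  # exceeds warning threshold: human review
    CRITICAL = 3  # exceeds hard threshold: suspend the dependency

# Illustrative response table (per 4.5); actions and SLAs are assumptions.
RESPONSE_PLAYBOOK = {
    DriftSeverity.MINOR:    {"action": "log_and_trend",        "max_response_hours": None},
    DriftSeverity.MODERATE: {"action": "human_review",         "max_response_hours": 24},
    DriftSeverity.CRITICAL: {"action": "suspend_and_fallback", "max_response_hours": 1},
}

def classify(drift_score, warn=0.05, crit=0.15):
    """Map a drift score onto a severity tier (thresholds are illustrative)."""
    if drift_score >= crit:
        return DriftSeverity.CRITICAL
    if drift_score >= warn:
        return DriftSeverity.MODERATE
    return DriftSeverity.MINOR

for score in (0.02, 0.08, 0.20):
    sev = classify(score)
    print(score, sev.name, RESPONSE_PLAYBOOK[sev]["action"])
```

Keeping the playbook as a reviewable data structure means the fallback and suspension provisions can be approved, versioned, and tested in the same way as the baselines themselves.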
4.6. A conforming system MUST re-establish the behavioural baseline after any confirmed provider update, model retraining, or API version change, with the new baseline subject to the same approval process as the original.
4.7. A conforming system SHOULD implement automated reference query sets — canary queries with known expected outputs — that are executed on a defined schedule against each third-party AI dependency to detect drift proactively.
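A canary harness along these lines might look as follows. The canary inputs, the `call_dependency` stand-in (a placeholder for the real third-party client), and the failure-rate threshold are all illustrative assumptions.

```python
# Canary query harness sketch (per 4.7). All names are illustrative.
CANARY_SET = [
    {"input": "wire transfer £9,400 to new payee, 02:00", "expected": "fraud"},
    {"input": "monthly rent payment, established payee",  "expected": "legit"},
    # ... in practice tens of canaries per dependency, covering known edge cases
]

def call_dependency(text):
    # Placeholder for the real third-party API call.
    return "fraud" if "new payee" in text else "legit"

def run_canaries(canaries, max_failure_rate=0.1):
    """Execute every canary and report the failure rate against a threshold."""
    failures = [c for c in canaries if call_dependency(c["input"]) != c["expected"]]
    rate = len(failures) / len(canaries)
    return {"failure_rate": rate, "breach": rate > max_failure_rate, "failures": failures}

result = run_canaries(CANARY_SET)
print(result["failure_rate"], result["breach"])
```

Because the expected outputs are fixed at baseline time, canary failures localise drift to the dependency itself rather than to changes in live traffic.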
4.8. A conforming system SHOULD monitor distributional characteristics of third-party outputs (not just aggregate accuracy) to detect shifts in per-category, per-demographic, or per-input-type performance that may be masked by aggregate stability.
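Per-category shifts of the kind in Scenario B can be tested with a two-proportion z-test on a single category's labelling rate between the baseline window and the current window. The counts below are illustrative assumptions chosen to echo that scenario.

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """z-score for the difference between two proportions
    (baseline window vs current window)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Illustrative: rate of "low priority" labels for non-English complaints.
z = two_proportion_z(successes_a=80,  n_a=1000,   # baseline window: 8%
                     successes_b=250, n_b=1000)   # current window: 25%
print(f"z = {z:.1f}")  # |z| well above 3 flags a shift aggregate metrics can mask
```

Run per category (language, demographic group, input type), this test surfaces exactly the failure mode of Scenario B: aggregate volumes unchanged while one segment's treatment shifts materially.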
4.9. A conforming system SHOULD maintain a rolling history of at least 90 days of third-party dependency behavioural metrics to support trend analysis and drift detection.
4.10. A conforming system MAY implement shadow evaluation — routing a sample of live requests to a reference model for comparison — to detect drift in the absence of ground truth labels.
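Shadow evaluation can be as simple as sampling live inputs, scoring them against a frozen reference snapshot, and tracking the agreement rate: falling agreement signals drift even with no labels. The sketch below uses stand-in models and an assumed 5% sample rate.

```python
import random

def shadow_evaluate(requests, primary, reference, sample_rate=0.05, seed=42):
    """Route a sample of live requests to a reference model and measure
    the agreement rate between primary and reference outputs."""
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    sampled = [r for r in requests if rng.random() < sample_rate]
    if not sampled:
        return None
    agree = sum(primary(r) == reference(r) for r in sampled)
    return agree / len(sampled)

# Stand-ins: the live third-party model vs a frozen reference snapshot.
primary   = lambda x: "high" if x > 0.7 else "low"
reference = lambda x: "high" if x > 0.5 else "low"  # they diverge on (0.5, 0.7]

requests = [i / 1000 for i in range(1000)]
agreement = shadow_evaluate(requests, primary, reference)
print(f"agreement: {agreement:.2f}")
```

The agreement rate is not an accuracy measure; it only detects that the two models have diverged, which is precisely the signal available in the absence of ground truth.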
Third-Party Behaviour Drift Monitoring Governance addresses a fundamental asymmetry in AI supply chains: the consuming organisation bears the risk of third-party behavioural changes, but the provider controls the timing, scope, and notification of those changes. Traditional software supply chain governance focuses on version management and vulnerability patching — deterministic concerns where a specific version behaves identically every time it executes. AI dependencies are fundamentally different: a model endpoint may change behaviour without any version change, a retraining cycle may shift output distributions without altering the API contract, and a provider may optimise for a different customer segment in ways that degrade performance for the consuming organisation.
AG-022 (Behavioural Drift Detection) governs drift in the agent's own behaviour. AG-091 extends this principle to the agent's third-party dependencies, recognising that an agent's outputs are only as reliable as its inputs. An agent that makes correct decisions based on third-party fraud scores is functionally compromised when those scores silently degrade — the agent's own logic is unchanged, but its effective accuracy has declined because its inputs have drifted.
The detection challenge is compounded by the absence of ground truth in many operational contexts. When a third-party classification API changes its output distribution, the consuming organisation often cannot immediately determine whether the new distribution is more or less accurate — it can only determine that the distribution has changed. This is why AG-091 requires monitoring of distributional characteristics rather than only accuracy metrics: distribution shifts are detectable even without ground truth, whereas accuracy degradation requires labelled data that may not be available in real time.
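Distribution-only monitoring of this kind is commonly done with the Population Stability Index (PSI), which scores the divergence between the baseline output distribution and the current one without any ground truth. A minimal sketch follows; the category counts are illustrative, and the interpretation bands in the docstring are the conventional rule of thumb rather than a normative threshold.

```python
import math

def psi(baseline_counts, current_counts):
    """Population Stability Index over shared output categories.
    Conventional rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift,
    > 0.25 major shift (calibrate per dependency in practice)."""
    total_b = sum(baseline_counts.values())
    total_c = sum(current_counts.values())
    score = 0.0
    for cat in baseline_counts:
        pb = max(baseline_counts[cat] / total_b, 1e-6)          # floor avoids log(0)
        pc = max(current_counts.get(cat, 0) / total_c, 1e-6)
        score += (pc - pb) * math.log(pc / pb)
    return score

# Illustrative classification output distributions.
baseline_dist = {"high": 100, "medium": 300, "low": 600}
current_dist  = {"high": 60,  "medium": 240, "low": 700}
print(f"PSI = {psi(baseline_dist, current_dist):.3f}")
```

Because PSI compares distributions rather than accuracy, it is computable in real time from production traffic alone, which is exactly the property the paragraph above requires.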
The economic dynamics of AI-as-a-service create structural incentives for providers to make changes that benefit their largest customers, potentially at the expense of smaller customers whose workloads represent edge cases. AG-045 (Economic Incentive Alignment Verification) addresses the contractual dimension; AG-091 addresses the technical monitoring dimension that validates whether contractual commitments are being met in practice.
AG-091 requires organisations to treat third-party AI dependencies as inherently unstable components that require continuous validation. The implementation architecture should separate baseline establishment, continuous monitoring, drift detection, and response orchestration into distinct, auditable components.
Recommended patterns:
- Record baselines as versioned, immutable artefacts with formal approval records, and re-establish them after every confirmed provider change (per 4.1, 4.6).
- Execute canary query sets on a schedule proportional to dependency criticality (per 4.7).
- Monitor per-category and per-demographic distributions, not only aggregate accuracy (per 4.8).
- Use shadow evaluation against a reference model where ground truth labels are unavailable (per 4.10).
- Tier response procedures by severity, with fallback and suspension paths defined in advance (per 4.5).
Anti-patterns to avoid:
- Treating a third-party API as a stable black box rather than a moving target (Scenario A).
- Validating schema compatibility only: correctly shaped vectors say nothing about semantic consistency (Scenario C).
- Monitoring aggregate metrics only, which masks per-category and per-demographic shifts (Scenario B).
- Continuing to use a stale baseline after a known provider update instead of re-establishing it (contrary to 4.6).
Financial Services. Third-party credit scoring, fraud detection, and risk models are subject to SR 11-7 (Federal Reserve) and SS1/23 (PRA) model risk management expectations. Drift monitoring for third-party AI models must be integrated with existing model risk management frameworks. The PRA expects firms to demonstrate ongoing performance monitoring of models, including third-party models, with defined thresholds for remediation. For critical financial models, daily canary set execution and real-time production traffic monitoring are baseline expectations.
Healthcare. Third-party clinical AI models (diagnostic assistants, triage classifiers, drug interaction checkers) are subject to MDR 2017/745 post-market surveillance requirements. Drift in clinical AI dependencies must be monitored with particular attention to differential performance across patient demographics. A drift that degrades diagnostic accuracy for a specific demographic group may constitute a discriminatory failure even if aggregate performance remains acceptable. FDA guidance on AI/ML-based SaMD requires monitoring for dataset drift and concept drift in deployed models.
Critical Infrastructure. Third-party AI components in safety-critical systems (predictive maintenance models, anomaly detection systems, control optimisation algorithms) require the most stringent monitoring regimes. Drift in a predictive maintenance model could cause missed failure predictions with physical safety consequences. IEC 62443 requirements for component validation extend to behavioural validation of AI components, not only functional and security validation.
Basic Implementation — The organisation has identified its third-party AI dependencies and established initial behavioural baselines for each. A canary query set of at least 20 inputs per dependency is executed weekly. Drift detection uses simple threshold comparisons (e.g., accuracy drops below 90% of baseline). Alerts are generated via email to a designated team. Response procedures are documented at a high level. Baseline records exist but may not be formally versioned. This level provides early warning of major drift events but may miss gradual degradation or distributional shifts that do not affect aggregate metrics.
Intermediate Implementation — All basic capabilities plus: canary sets contain 50+ inputs per dependency executed daily for critical dependencies. Production traffic is sampled at 5-10% with rolling distributional monitoring using statistical process control methods. Drift thresholds are calibrated per-dependency based on business impact assessment. Baselines are versioned, immutable artefacts with formal approval records. Distributional monitoring covers per-category and per-demographic breakdowns. Response procedures are tiered by severity with defined escalation paths and maximum response times. Integration with AG-014 ensures structural (schema) validation and behavioural monitoring operate as complementary layers.
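The statistical process control methods mentioned here can be as simple as Shewhart-style control limits over the rolling metric history. The sketch below applies three-sigma limits to daily canary pass rates; the history values and today's observation are illustrative assumptions.

```python
import statistics

def control_limits(history, sigmas=3.0):
    """Shewhart-style control limits from a rolling metric history."""
    mean = statistics.fmean(history)
    sd = statistics.stdev(history)
    return mean - sigmas * sd, mean + sigmas * sd

# Illustrative: daily canary pass rates over the rolling window (per 4.9).
history = [0.96, 0.95, 0.97, 0.96, 0.94, 0.96, 0.95, 0.97, 0.96, 0.95]
lo, hi = control_limits(history)

today = 0.78  # today's observed pass rate
out_of_control = not (lo <= today <= hi)
print(f"limits=({lo:.3f}, {hi:.3f}), out_of_control={out_of_control}")
```

Control limits derived from the dependency's own history adapt to its normal variability, so the same mechanism works across dependencies with very different baseline noise levels.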
Advanced Implementation — All intermediate capabilities plus: shadow evaluation routes a sample of live traffic to a reference model for comparison, enabling drift detection without ground truth. Semantic consistency monitoring covers embedding and language model dependencies. Automated circuit breakers suspend third-party dependencies that exceed critical drift thresholds, with automatic fallback to alternative providers or degraded-mode operation. Drift monitoring is integrated with the organisation's model risk management framework. Historical drift data feeds into supplier risk scoring per AG-093. Provider SLA compliance is tracked quantitatively against contractual performance commitments. The organisation can demonstrate to regulators that no third-party AI dependency operates without continuous behavioural monitoring.
Required artefacts:
- Behavioural baseline records for each third-party AI dependency, including approval records (per 4.1).
- Drift monitoring logs and structured alert records (per 4.2, 4.4).
- Per-dependency threshold definitions with business-impact calibration rationale (per 4.3).
- Documented response procedures for each severity level (per 4.5).
- Baseline re-establishment and re-approval records following provider updates (per 4.6).
Retention requirements: rolling behavioural metrics for at least 90 days to support trend analysis (per 4.9); baseline records, drift alerts, and response records for the operational life of the dependency plus any longer period required by the applicable regulatory regime.
Access requirements: monitoring records and baselines must be accessible to the governance function responsible for third-party risk, to internal audit, and to regulators on request, consistent with the separation of auditable components described above.
Testing AG-091 compliance requires validating the monitoring infrastructure's ability to detect drift across different modalities and response magnitudes.
Test 8.1: Canary Set Drift Detection Sensitivity — inject controlled degradations of varying magnitude into canary responses and verify that the monitoring detects each degradation at or below the corresponding defined threshold (per 4.3, 4.7).
Test 8.2: Production Traffic Distributional Monitoring — introduce a per-category distribution shift that leaves aggregate metrics unchanged and verify that segmented monitoring flags it (per 4.8).
Test 8.3: Baseline Re-establishment After Provider Update — simulate a confirmed provider update and verify that a new baseline is established and formally approved before monitoring resumes against it (per 4.6).
Test 8.4: Alert Generation and Response Execution — trigger a threshold breach and verify that a structured alert is generated within 24 hours and that the documented severity-appropriate response is executed (per 4.4, 4.5).
Test 8.5: Semantic Drift Detection for Embedding Dependencies — alter the semantic behaviour of a test embedding endpoint while preserving vector dimensions and verify that reference-query monitoring detects the shift (per Scenario C).
Test 8.6: Monitoring Continuity Under Dependency Unavailability — take a dependency offline and verify that monitoring records the outage and resumes cleanly, rather than silently reporting stale or absent metrics.
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 9 (Risk Management System) | Direct requirement |
| EU AI Act | Article 17 (Quality Management System) | Supports compliance |
| PRA SS1/23 | Model Risk Management Principles — Ongoing Monitoring | Direct requirement |
| DORA | Article 28 (ICT Third-Party Risk) | Direct requirement |
| NIST AI RMF | MEASURE 2.6, MANAGE 3.1 | Supports compliance |
| ISO 42001 | Clause 8.4 (Operation of AI System), Clause 9.1 (Monitoring) | Supports compliance |
| SR 11-7 | Federal Reserve Guidance on Model Risk Management | Supports compliance |
Article 9 requires providers and deployers of high-risk AI systems to establish a risk management system that includes ongoing monitoring of the system's performance. When a high-risk AI system depends on third-party AI models, the deployer's risk management obligation extends to monitoring those dependencies. AG-091 implements the monitoring component for third-party AI dependencies. The regulation requires that risks be "identified and analysed" on an ongoing basis — a third-party dependency whose behaviour has drifted materially is an unidentified risk if no monitoring is in place.
Article 17 requires quality management procedures including techniques for monitoring, testing, and validation of the AI system. For systems incorporating third-party AI components, quality management must extend to validating that those components continue to perform as expected. AG-091 implements this requirement through structured behavioural monitoring and baseline management.
SS1/23 expects firms to maintain ongoing performance monitoring of all models, including third-party models used in decision-making. The guidance specifically addresses vendor models and requires firms to validate that vendor model performance remains within acceptable bounds. AG-091 directly implements this requirement by establishing baselines, monitoring for drift, and defining response procedures. The PRA expects monitoring to cover both aggregate and segmented performance metrics.
Article 28 requires financial entities to manage ICT third-party risk, including ongoing monitoring of the performance and quality of ICT services provided by third parties. For AI-as-a-service dependencies, behavioural drift monitoring is a specific instantiation of this broader requirement. DORA requires that monitoring be proportionate to the criticality of the service — aligning with AG-091's requirement that monitoring frequency be proportional to dependency criticality.
MEASURE 2.6 addresses the monitoring of AI system performance over time, including detection of performance degradation. MANAGE 3.1 addresses risk response actions when monitoring detects material changes. AG-091 supports compliance by implementing structured monitoring (MEASURE 2.6) with defined response procedures (MANAGE 3.1) for third-party AI dependencies.
Clause 8.4 addresses the operation of AI systems including controls on external components. Clause 9.1 requires monitoring, measurement, analysis, and evaluation of the AI management system's performance. AG-091 implements monitoring of third-party AI components as part of the operational control framework required by Clause 8.4, with the measurement and evaluation discipline required by Clause 9.1.
SR 11-7 requires ongoing monitoring of all models, with explicit attention to vendor models. The guidance states that the use of vendor models does not diminish the model risk management obligation — banks are expected to conduct their own performance monitoring of vendor models. AG-091 directly implements this requirement for AI vendor models.
| Field | Value |
|---|---|
| Severity Rating | High |
| Blast Radius | All agent operations dependent on the drifted third-party component — potentially cross-functional where multiple agents share a dependency |
Consequence chain: Without third-party behaviour drift monitoring, a silent change in a third-party AI dependency propagates through the agent's decision-making without detection. The immediate technical failure is undetected degradation of a critical input to the agent's reasoning. The operational impact depends on the dependency's role: a drifted fraud model causes increased fraud losses; a drifted classification model causes misrouted work items; a drifted embedding model causes irrelevant retrievals. The business impact accumulates silently over the drift detection gap — the period between when drift begins and when it is detected. For an organisation with no monitoring, this gap is potentially unlimited, bounded only by the point at which downstream consequences become visible through other channels (customer complaints, financial losses, regulatory findings). The severity is amplified by the fact that the consuming organisation has no control over the timing of third-party changes. Cross-references: AG-014 (External Dependency Integrity) governs structural validation; AG-022 (Behavioural Drift Detection) governs internal drift; AG-048 (AI Model Provenance and Integrity) governs model identity; AG-093 (Supplier Concentration and Exit Governance) addresses the concentration risk that amplifies drift impact.