Model Family Substitution Governance requires that any replacement of one model family with another for a given task — such as replacing a model from Provider A with one from Provider B, or replacing a GPT-family model with a Llama-family model — be subject to formal impact assessment, re-evaluation, and approval before deployment. Different model families have fundamentally different training data, alignment approaches, capability profiles, failure modes, and licensing terms. A model substitution is not an upgrade — it is a change in the underlying cognitive architecture serving a task, and assumptions validated for the original model family do not transfer to the replacement. AG-345 ensures that substitution decisions are governed with the same rigour as new model deployments.
Scenario A — Provider Substitution Changes Regulatory Posture: A European financial services firm uses a proprietary model from Provider A for customer communication. Provider A's terms include GDPR-compliant data processing agreements, EU data residency, and contractual commitments not to train on customer data. To capture a 30% cost reduction, the firm substitutes Provider B's model. Provider B's terms permit training on inputs unless the customer opts out (the opt-out requires a separate enterprise agreement not yet signed), data processing occurs in US data centres, and the DPA does not meet GDPR adequacy requirements. The substitution is treated as a "model upgrade" and bypasses the data protection team. Four months later, a GDPR audit discovers that customer data has been processed outside the EU without adequate safeguards and may have been used for model training without consent.
What went wrong: The substitution was treated as a technical change rather than a regulatory change. No impact assessment evaluated the new provider's data processing terms. The assumption that "it's just a different model" masked a fundamental change in data handling, jurisdiction, and contractual protections. Consequence: GDPR investigation, potential €8.2 million fine (up to 4% of global annual turnover under GDPR Article 83), emergency migration back to the original provider (£340,000), and customer notification obligations.
Scenario B — Model Family Has Different Failure Modes: A safety-critical agent uses Model Family X for equipment diagnostics. Model Family X has a known, documented failure mode: it occasionally hallucinates non-existent error codes, which are easily filtered by the downstream system (non-existent codes are rejected). The team substitutes Model Family Y, which scores 4% higher on the diagnostic accuracy benchmark. Model Family Y does not hallucinate non-existent codes. Instead, its failure mode is to substitute one valid error code for another — a real code, but the wrong one. The downstream filter does not catch this because the code is valid. Over three weeks, 47 pieces of equipment receive incorrect diagnostic actions, including 3 that result in unnecessary component replacements costing £28,000 each and 1 that misses a genuine failure, leading to an unplanned shutdown costing £215,000.
What went wrong: The substitution assessed accuracy (Model Y is 4% better) but not failure mode characteristics. The assumption that "more accurate equals safer" did not hold because the failure modes were qualitatively different. Model X's failures were detectable; Model Y's failures were invisible to existing safeguards. No impact assessment evaluated whether existing downstream safeguards were effective against the new model's failure modes. Consequence: £299,000 in equipment damage and downtime, plus £45,000 remediation to implement safeguards for the new failure mode.
Scenario C — Substitution Invalidates Fine-Tuning Investment: An organisation fine-tunes Model Family X over 6 months, investing £1.2 million in training data curation, preference labelling, and iterative fine-tuning. When a new release of Model Family Y demonstrates superior base capability, the team proposes substitution. The fine-tuning investment is non-transferable — LoRA adapters trained for Family X are architecturally incompatible with Family Y, preference data collected for Family X's output distribution is poorly aligned with Family Y's output distribution, and evaluation benchmarks calibrated for Family X's failure patterns do not capture Family Y's failure patterns. The substitution effectively resets 6 months of adaptation work to zero.
What went wrong: The substitution cost analysis compared inference costs and base benchmark scores but did not account for the stranded fine-tuning investment. No governance process required an assessment of adaptation portability before approving the substitution. Consequence: £1.2 million in stranded fine-tuning investment and 4 months of re-adaptation work, with degraded service throughout while Model Family Y reaches parity with the original deployment.
Scope: This dimension applies to any substitution of one model family with another for a task currently served by a production or pre-production AI agent. A model family substitution is defined as replacing the base model architecture with a model from a different provider, a different architecture family, or a different major version that has been trained from scratch (as opposed to a continuation or fine-tune of the existing model). Upgrading within the same family (e.g., version 3 to version 4 of the same provider's model line) is in scope when the new version represents a new training run with potentially different training data, alignment approach, or capability profile. Changing quantisation levels, adapter compositions, or system prompts without changing the base model is not a model family substitution (those are governed by AG-344 and AG-342 respectively).
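The scope boundary above lends itself to a mechanical check. The following is a minimal sketch, assuming a hypothetical `ModelDescriptor` record; the field names and `ChangeClass` labels are illustrative, not drawn from any referenced standard.

```python
from dataclasses import dataclass
from enum import Enum, auto

class ChangeClass(Enum):
    FAMILY_SUBSTITUTION = auto()   # in scope for AG-345
    IN_FAMILY_UPGRADE = auto()     # in scope when the new version is retrained from scratch
    CONFIGURATION_CHANGE = auto()  # governed by AG-344 / AG-342 instead

@dataclass
class ModelDescriptor:
    provider: str
    architecture_family: str
    trained_from_scratch: bool  # new training run vs. continuation/fine-tune

def classify_change(current: ModelDescriptor, proposed: ModelDescriptor) -> ChangeClass:
    """Classify a proposed model change against the scope definition above."""
    if (proposed.provider != current.provider
            or proposed.architecture_family != current.architecture_family):
        return ChangeClass.FAMILY_SUBSTITUTION
    if proposed.trained_from_scratch:
        # Same family and provider, but a new training run: treated as in scope.
        return ChangeClass.IN_FAMILY_UPGRADE
    # Quantisation, adapter, or prompt changes without a new base model.
    return ChangeClass.CONFIGURATION_CHANGE
```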
4.1. A conforming system MUST require a documented substitution impact assessment before any model family substitution, covering: capability comparison (disaggregated by task and segment), failure mode analysis (comparing the old and new model's characteristic failure patterns), safety property comparison, regulatory and contractual impact (data processing terms, jurisdiction, training data rights), adaptation portability (whether existing fine-tuning, adapters, and evaluation assets transfer to the new model), and total cost of substitution (including stranded investment and re-adaptation costs).
4.2. A conforming system MUST evaluate the substitute model against all evaluation criteria that the current model meets, including domain-specific benchmarks, safety benchmarks, and adversarial testing — not just the criteria on which the substitute model is expected to be superior.
4.3. A conforming system MUST assess whether existing downstream safeguards, filters, and monitoring systems remain effective against the substitute model's failure modes and output characteristics.
4.4. A conforming system MUST require approval from stakeholders spanning technical, risk, legal, and business functions before deploying a model family substitution.
4.5. A conforming system MUST maintain substitution records documenting the rationale, impact assessment, evaluation results, and approval chain for every model family substitution.
4.6. A conforming system SHOULD implement a parallel-run period where both the existing and substitute models serve traffic (with the substitute in shadow or canary mode) to identify behavioural differences under real-world conditions before full cutover.
4.7. A conforming system SHOULD assess the substitute model's licensing terms, data processing terms, and vendor dependency implications as part of the impact assessment.
4.8. A conforming system SHOULD evaluate the substitute model using the existing model's real production inputs (sanitised as needed) rather than relying solely on benchmarks, to capture domain-specific behaviour differences.
4.9. A conforming system MAY implement automated substitution impact scoring that quantifies the total cost and risk of substitution to support decision-making.
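As a hedged illustration of the automated scoring permitted by 4.9, the sketch below quantifies substitution risk as a weighted sum over four assessment dimensions. The weights, component names, and 0-1 normalisation are assumptions that a real deployment would calibrate against its own risk appetite.

```python
from dataclasses import dataclass

@dataclass
class SubstitutionRisk:
    capability_regression: float   # 0 (none) .. 1 (severe), per 4.1/4.2
    safeguard_gap: float           # failure modes uncovered downstream, per 4.3
    regulatory_delta: float        # change in regulatory posture, per 4.1
    stranded_investment: float     # non-portable adaptation assets, per 4.1

WEIGHTS = {
    "capability_regression": 0.25,
    "safeguard_gap": 0.35,        # weighted highest: invisible failures (Scenario B)
    "regulatory_delta": 0.25,     # Scenario A
    "stranded_investment": 0.15,  # Scenario C
}

def impact_score(risk: SubstitutionRisk) -> float:
    """Weighted 0-1 score; higher scores warrant stronger approval per 4.4."""
    return sum(getattr(risk, name) * weight for name, weight in WEIGHTS.items())

# Example: a substitution with a large uncovered-safeguard gap scores high
# even when capability and cost comparisons look favourable.
print(impact_score(SubstitutionRisk(0.1, 0.9, 0.2, 0.3)))  # 0.435
```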
Model family substitution is one of the highest-impact changes an organisation can make to its AI infrastructure, yet it is routinely treated as a straightforward upgrade decision. The reasoning is seductive: "Model Y scores 4% higher on benchmarks and costs 30% less — why wouldn't we switch?" This reasoning fails because it evaluates substitution on the dimensions where the new model is expected to excel while ignoring the dimensions where the existing model has been validated, adapted, and integrated.
The core governance problem is that model families are not interchangeable components. They differ in training data (affecting knowledge, biases, and cultural context), alignment approach (affecting safety behaviour, refusal patterns, and helpfulness trade-offs), architecture (affecting latency, throughput, and failure characteristics), capability profile (where each model excels and where it struggles), licensing terms (affecting what you can do with the model and its outputs), and failure modes (how the model fails when it fails). Each of these differences can invalidate assumptions that the organisation has built into its deployment, monitoring, and safeguard infrastructure.
The adaptation portability problem is particularly expensive. Modern AI deployments involve substantial investment in fine-tuning, evaluation benchmarks, prompt engineering, output parsing, monitoring thresholds, and downstream integration. These investments are model-family-specific. A LoRA adapter trained for one model family cannot be used with another. Prompt templates optimised for one model's output format may produce poor results with another. Monitoring thresholds calibrated for one model's error distribution will produce false positives or false negatives with another. The total cost of substitution includes all of this re-adaptation work, which often exceeds the cost savings that motivated the substitution.
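A minimal sketch of how an adaptation-portability inventory might be represented, assuming a simple asset registry: the asset categories mirror the paragraph above, while the portability flags and valuation approach are assumptions.

```python
PORTABILITY = {
    # asset category: does it transfer to a different model family?
    "lora_adapter": False,           # architecturally bound to the base model
    "preference_data": False,        # collected against the old output distribution
    "prompt_templates": False,       # tuned to the old model's output format
    "monitoring_thresholds": False,  # calibrated to the old error distribution
    "training_data_corpus": True,    # raw curated data usually survives
}

def stranded_value(assets: dict[str, float]) -> float:
    """Sum the book value (e.g. GBP) of assets that do not transfer."""
    return sum(value for name, value in assets.items()
               if not PORTABILITY.get(name, False))

# Scenario C in miniature: most of the adaptation investment is stranded.
print(stranded_value({"lora_adapter": 400_000,
                      "preference_data": 500_000,
                      "training_data_corpus": 300_000}))  # 900000
```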
The failure mode problem is the most dangerous. Organisations build safeguards around the failure modes they have observed. When a new model family introduces different failure modes, existing safeguards may be ineffective. Scenario B illustrates this: a safeguard designed for one failure mode (hallucinated codes) is useless against a qualitatively different failure mode (code substitution). The substitution makes the deployment less safe despite the new model being more accurate, because the safety infrastructure is misaligned with the new model's failure characteristics.
Substitution impact assessment template. Establish a standardised template with the following sections: business rationale (why the substitution is proposed, including cost and capability drivers), capability comparison (side-by-side evaluation on all relevant benchmarks, disaggregated by task and segment), failure mode analysis (documented failure modes of both models, with assessment of whether existing safeguards address the substitute's failure modes), safety comparison (standard and adversarial safety evaluation for both models), regulatory and contractual analysis (data processing terms, jurisdiction, training data rights, licensing restrictions), adaptation portability assessment (which fine-tuning, adapters, prompts, evaluations, and monitoring assets transfer vs. must be rebuilt), total cost of substitution (stranded investment + re-adaptation cost + parallel-run cost + risk exposure during transition), and rollback plan (how to revert to the existing model if the substitution fails).
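Representing the template as a structured record makes completeness mechanically checkable (cf. Test 8.1). A minimal sketch follows; the field names track the sections above, and the validation logic is an assumption.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class SubstitutionAssessment:
    business_rationale: Optional[str] = None
    capability_comparison: Optional[str] = None
    failure_mode_analysis: Optional[str] = None
    safety_comparison: Optional[str] = None
    regulatory_contractual_analysis: Optional[str] = None
    adaptation_portability: Optional[str] = None
    total_cost_of_substitution: Optional[str] = None
    rollback_plan: Optional[str] = None

def missing_sections(assessment: SubstitutionAssessment) -> list[str]:
    """Return template sections still unfilled; an empty list means complete."""
    return [f.name for f in fields(assessment)
            if getattr(assessment, f.name) in (None, "")]

draft = SubstitutionAssessment(business_rationale="30% inference cost saving")
print(missing_sections(draft))  # the seven sections still to complete
```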
Parallel-run evaluation. Before full cutover, deploy the substitute model in shadow mode — receiving the same inputs as the production model but with outputs discarded or compared rather than served. Compare outputs across dimensions: agreement rate (how often the two models produce equivalent outputs), disagreement analysis (on cases where they disagree, which model is correct?), latency comparison, and failure pattern comparison. A parallel-run period of 2-4 weeks is recommended for general-purpose deployments; 4-8 weeks for safety-critical or regulated deployments.
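A minimal sketch of the shadow-mode comparison, assuming paired outputs from the incumbent and substitute models over the same sanitised inputs; the `equivalent` predicate is a placeholder for a task-specific check.

```python
def equivalent(a: str, b: str) -> bool:
    # Placeholder: real deployments need a task-specific equivalence check
    # (exact match, semantic similarity, downstream-action identity, ...).
    return a.strip().lower() == b.strip().lower()

def shadow_run_report(pairs: list[tuple[str, str]]) -> dict:
    """Agreement rate plus the disagreeing pairs queued for adjudication."""
    disagreements = [(a, b) for a, b in pairs if not equivalent(a, b)]
    return {
        "n": len(pairs),
        "agreement_rate": 1 - len(disagreements) / len(pairs),
        # Disagreements need human review: which model is correct on each?
        "for_review": disagreements,
    }

report = shadow_run_report([("E-101", "E-101"), ("E-101", "E-204")])
print(report["agreement_rate"])  # 0.5
```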
Stakeholder approval matrix. Define which stakeholders must approve a substitution based on deployment context. At minimum: technical lead (capability assessment), risk/compliance officer (regulatory and contractual assessment), business owner (cost and service impact), and security officer (data processing and custody assessment). For high-risk deployments, include the model risk management committee, legal counsel, and the data protection officer.
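An illustrative encoding of the approval matrix, keyed on a hypothetical deployment risk tier; the role names follow the paragraph above, while the tier labels are assumptions.

```python
BASE_APPROVERS = {"technical_lead", "risk_compliance_officer",
                  "business_owner", "security_officer"}

APPROVAL_MATRIX = {
    "standard": BASE_APPROVERS,
    "high_risk": BASE_APPROVERS | {"model_risk_committee",
                                   "legal_counsel",
                                   "data_protection_officer"},
}

def approvals_outstanding(tier: str, signed_off: set[str]) -> set[str]:
    """Roles that must still approve before cutover (requirement 4.4)."""
    return APPROVAL_MATRIX[tier] - signed_off

print(approvals_outstanding("high_risk", {"technical_lead", "business_owner"}))
```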
Recommended patterns:
Anti-patterns to avoid:
Financial Services. Model substitution in financial services triggers model risk management obligations under PRA SS1/23. The substitute model must undergo independent validation. The substitution may also trigger notification obligations to the PRA/FCA if the model is material to regulated activities.
Healthcare. Substituting the underlying model of a clinical AI system may constitute a significant change requiring regulatory notification under MDR/IVDR. The substitute model must be assessed for clinical safety, and the change may require updated clinical evidence.
Government and Public Sector. Substitution may change the data processing jurisdiction, which can affect data sovereignty requirements. Government deployments with data residency requirements must verify that the substitute model's processing chain meets jurisdictional constraints.
Basic Implementation — Model substitutions are proposed and evaluated by the technical team, with business approval for cost changes. Capability comparison uses standard benchmarks. Failure mode analysis is informal. Contractual and regulatory analysis is ad hoc. The substitution decision is documented but the assessment may be incomplete. This level catches obvious mismatches but misses subtle failure mode differences, regulatory implications, and stranded investment costs.
Intermediate Implementation — A standardised substitution impact assessment template is completed for every substitution. The assessment covers capability, failure modes, safety, regulatory posture, adaptation portability, and total cost. Stakeholder approval from technical, risk, legal, and business functions is required. A parallel-run period is mandatory for regulated deployments. Substitution records are retained as governance artefacts.
Advanced Implementation — All intermediate capabilities plus: failure mode mapping identifies safeguard gaps before substitution. Contractual comparison matrices catch regulatory posture changes. Parallel-run evaluation with real production inputs (sanitised) runs for 4+ weeks. Automated substitution scoring quantifies total cost and risk. Rollback plans are tested before cutover. The organisation can demonstrate to regulators that every model substitution was assessed for capability, safety, failure mode, regulatory, and financial impact with multi-stakeholder approval.
Required artefacts:
Retention requirements:
Access requirements:
Test 8.1: Substitution Impact Assessment Completeness
Test 8.2: Failure Mode Safeguard Coverage
Test 8.3: Multi-Stakeholder Approval
Test 8.4: Downstream Safeguard Effectiveness
Test 8.5: Regulatory Posture Assessment
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 9 (Risk Management System) | Supports compliance |
| EU AI Act | Article 16 (Obligations of Providers) | Direct requirement |
| GDPR | Articles 28, 44-49 (Data Processing, International Transfers) | Direct requirement |
| PRA SS1/23 | Model Risk Management — Model Change and Replacement | Direct requirement |
| NIST AI RMF | GOVERN 1.1, MAP 2.3, MANAGE 2.2 | Supports compliance |
| ISO 42001 | Clause 8.2 (AI Risk Assessment), Clause 8.4 (AI System Operation) | Supports compliance |
Model family substitution that changes the model provider is a change of data processor, requiring assessment under GDPR Article 28 (processor obligations). If the substitute provider processes data in a different jurisdiction, Articles 44-49 on international data transfers apply. Scenario A illustrates this: a substitution that moves data processing from the EU to the US without adequate safeguards creates a GDPR violation. AG-345's requirement for regulatory and contractual impact assessment directly addresses this risk.
Article 16 establishes ongoing obligations for providers of high-risk AI systems, including maintaining the risk management system throughout the lifecycle. A model family substitution is a significant lifecycle event that must be managed within the risk management system. The substitute model must meet all requirements that the original model met, and the transition must be governed to prevent regression in compliance posture.
PRA SS1/23 expects firms to apply appropriate governance to model changes, including model replacement. A model family substitution is the most significant form of model change. The supervisory expectation is that replacement models undergo independent validation equivalent to new model validation. A firm that substitutes model families without re-validation would face supervisory challenge.
| Field | Value |
|---|---|
| Severity Rating | High |
| Blast Radius | Service-wide — affects all users and downstream systems dependent on the substituted model |
Consequence chain: Ungoverned model substitution creates three distinct failure paths. First, capability regression: the substitute model underperforms the existing model on dimensions not captured by the comparison benchmarks, degrading service quality for all users. The financial impact depends on the application — from customer dissatisfaction (£50,000-£200,000 in churn) to material financial harm (£299,000 in equipment damage in Scenario B). Second, regulatory non-compliance: the substitute model or provider changes the regulatory posture without assessment, creating compliance violations. Scenario A's potential €8.2 million GDPR fine illustrates the upper bound. Third, stranded investment: the organisation loses the value of model-family-specific adaptations — fine-tuning, evaluation infrastructure, prompt engineering — that do not transfer to the new family. Scenario C's £1.2 million in stranded investment plus 4 months of degraded service illustrates this path. All three paths are preventable through systematic impact assessment, which is why AG-345 requires assessment before substitution rather than monitoring after.
Cross-references: AG-048 (AI Model Provenance and Integrity) provides the provenance framework within which substitutions are tracked. AG-347 (Model Rollback Readiness Governance) ensures rollback capability during substitution transitions. AG-344 (Quantisation Risk Governance) covers substitution of precision variants within a model family. AG-057 (Dataset Suitability and Bias Control) addresses evaluation of substitute model behaviour across demographic segments. AG-339 through AG-348 form the sibling landscape for Model Provenance, Training & Adaptation.