AG-337: Embedding Model Migration Governance

2. Summary

Embedding Model Migration Governance requires that when an organisation changes, upgrades, or fine-tunes the embedding model used to vectorise its knowledge base, the migration is governed through a structured process that ensures retrieval quality is maintained, compatibility issues are detected, and embedding drift is managed. Without this control, embedding model changes silently degrade retrieval quality as new embeddings become semantically incompatible with legacy embeddings, leading to missed retrievals, incorrect relevance rankings, and knowledge base fragmentation. This dimension ensures that embedding model changes are treated as governed infrastructure changes, not routine updates.

3. Example

Scenario A -- Silent Retrieval Degradation After Model Upgrade: An organisation upgrades its embedding model from a general-purpose model (384 dimensions) to a newer, higher-performance model (768 dimensions). The upgrade is applied to new documents ingested after the change, but the 250,000 existing documents retain their 384-dimension embeddings. The retrieval system now searches a vector space containing two incompatible embedding types. Queries embedded with the new model produce low similarity scores against legacy documents because the vector spaces are not aligned. Over 3 weeks, users report that the agent "forgot" information that was previously accessible. Investigation reveals that 250,000 legacy documents have effectively become invisible to retrieval because cross-model similarity scores fall below the retrieval threshold.

What went wrong: The embedding model was changed without re-embedding existing documents. Incompatible embeddings coexisted in the same vector space. No compatibility check detected the retrieval degradation. Consequence: 250,000 documents effectively lost from retrieval for 3 weeks, 847 incorrect agent responses traced to missing legacy knowledge, £45,000 in investigation and remediation costs, user trust erosion.

Scenario B -- Semantic Drift After Fine-Tuning: An organisation fine-tunes its embedding model on domain-specific financial terminology to improve retrieval precision for financial queries. The fine-tuning shifts the model's semantic space: terms like "derivative" now cluster with financial instruments rather than calculus. The existing knowledge base includes both financial and engineering content. After fine-tuning, engineering documents about mathematical derivatives are no longer retrievable for engineering queries because their embeddings were computed with the pre-fine-tuning model and "derivative" in those embeddings sits in a different region of the semantic space. Engineering queries retrieve financial content instead.

What went wrong: Fine-tuning shifted the embedding model's semantic space without re-embedding existing content. Cross-domain retrieval quality degraded because the semantic alignment between old and new embeddings changed. No impact assessment evaluated cross-domain effects before deployment. Consequence: Engineering team unable to retrieve technical documentation for 10 days, 23 engineering decisions made without proper technical reference, one design error costing £32,000 in rework.

Scenario C -- Vendor Lock-in Through Embedding Dependency: An organisation uses a proprietary embedding model from a vendor. The vendor deprecates the model with 6 months' notice. The organisation's knowledge base contains 1.2 million documents embedded with the deprecated model. The replacement model uses a different architecture and produces embeddings in a different vector space. The organisation must re-embed all 1.2 million documents before the deprecation deadline. At 200 documents per minute, re-embedding takes approximately 100 hours of continuous processing. The organisation's infrastructure cannot support this within the deprecation window without significant additional compute expenditure (estimated £28,000 in cloud compute costs).

What went wrong: The organisation had no migration plan, no re-embedding budget, and no assessment of the re-embedding timeline. The dependency on a specific embedding model was not governed as an infrastructure dependency. Consequence: Emergency procurement of cloud compute (£28,000), 2-week project to re-embed and validate, potential retrieval degradation during the migration period, diversion of engineering resources from other priorities.

4. Requirement Statement

Scope: This dimension applies to every AI agent that uses embedding models to vectorise knowledge base content for retrieval. This includes vector databases, semantic search systems, and any RAG implementation that relies on embedding similarity for retrieval. The scope extends to all embedding model changes: upgrades to newer model versions, migration to different model providers, fine-tuning on domain-specific data, changes in embedding dimensions, and changes in tokenisation. The scope includes both the knowledge base embeddings (the stored vectors) and the query embeddings (the vectors computed at query time). The test is: does the agent's retrieval system depend on embedding models? If yes, any change to those models is within scope.

4.1. A conforming system MUST maintain a registry of all embedding models in use, including: model identifier, version, provider, dimensionality, the date the model was deployed, and the scope of content embedded with each model.

4.2. A conforming system MUST require a governed change process for any embedding model change, including impact assessment, compatibility verification, and rollback capability.

4.3. A conforming system MUST ensure embedding compatibility when model changes occur, either by re-embedding all existing content with the new model or by implementing a compatibility layer that aligns cross-model retrievals.

4.4. A conforming system MUST verify retrieval quality after any embedding model change, using a defined benchmark set of queries with known-relevant results.

4.5. A conforming system MUST maintain the ability to roll back to the previous embedding model and its associated embeddings if quality verification fails.

4.6. A conforming system SHOULD implement a re-embedding pipeline capable of processing the full knowledge base within a defined time window (e.g., 48 hours for knowledge bases up to 1 million documents).

4.7. A conforming system SHOULD maintain an embedding compatibility matrix documenting which embedding models are compatible (can coexist in the same vector space with acceptable retrieval quality) and which are incompatible.

4.8. A conforming system SHOULD implement progressive migration, where re-embedding occurs in priority order (critical content first, archival content last) to minimise the period of degraded retrieval.

4.9. A conforming system MAY implement embedding versioning that stores multiple embedding versions per document, enabling parallel retrieval across model generations during the migration period.

5. Rationale

Embedding models are the foundation of vector-based retrieval. Every document in the knowledge base is represented as a high-dimensional vector computed by the embedding model. Retrieval works by comparing the query vector (also computed by the embedding model) against document vectors using similarity metrics (typically cosine similarity). This architecture has a critical dependency: the query and document vectors must be in the same semantic space for similarity comparisons to be meaningful.

When the embedding model changes, the semantic space changes. A new model maps the same text to different vectors. If the query is embedded with Model B but the documents are embedded with Model A, the similarity scores become unreliable. In practice, cross-model similarity scores typically degrade by 20-40% compared to same-model scores, depending on the architectural distance between the models. For documents near the retrieval threshold, this degradation pushes them below the threshold, making them invisible to retrieval (Scenario A).

Fine-tuning introduces a subtler problem: the model's semantic space shifts selectively. Terms that the fine-tuning emphasised move in the space; terms that were not in the fine-tuning data remain approximately in place. This creates uneven retrieval quality across domains: the fine-tuned domain improves, other domains may degrade (Scenario B).

The operational consequence is significant. A knowledge base with 1 million documents represents a substantial investment in embedding computation. Re-embedding is computationally expensive (typically 0.5-2 seconds per document for production models), requiring significant compute resources and time. Without a governed migration process, organisations face a choice between living with degraded retrieval (unacceptable for production systems) and executing an unplanned, untested re-embedding (risky and expensive).

The embedding model registry and change governance process ensure that model changes are planned, impact-assessed, and verified. The re-embedding pipeline and compatibility matrix provide the operational capabilities needed to execute migrations safely. The rollback capability provides a safety net when quality verification fails.

6. Implementation Guidance

Embedding model migration governance requires capabilities at three levels: registry and planning (knowing what models are in use and planning changes), execution (re-embedding and compatibility management), and verification (confirming that retrieval quality is maintained).

Recommended Patterns:

Embedding model registry. Maintain a registry table: {model_id, model_version, provider, dimensions, deployed_at, deprecated_at, document_count, collection_scope}. Example entry: {model_id: "text-embedding-v3", version: "3.1.2", provider: "vendor-X", dimensions: 768, deployed_at: "2026-01-15", deprecated_at: null, document_count: 450000, collection_scope: "all_collections"}. The registry is the authoritative record of which model produced which embeddings. Every embedding stored in the vector database should carry metadata linking it to the model version that produced it.
Benchmark retrieval test suite. Maintain a benchmark set of 200-500 queries with known-relevant documents (ground truth labels). Before and after any model change, run the benchmark and compare retrieval quality metrics: Mean Reciprocal Rank (MRR), Recall@10, Precision@10, and NDCG@10. Define minimum acceptable thresholds: e.g., MRR must not decrease by more than 5%, Recall@10 must not decrease by more than 3%. If thresholds are violated, the migration fails quality gate and must be investigated before proceeding. Example: pre-migration MRR = 0.82; post-migration MRR = 0.74 (a 9.8% decrease, exceeding the 5% threshold). Migration fails quality gate.
Progressive re-embedding pipeline. Implement a re-embedding pipeline that processes documents in priority order: (1) documents accessed in the last 30 days (high activity, highest impact if degraded), (2) documents in critical domains (regulatory, safety, financial), (3) remaining documents by recency. The pipeline should support parallel processing with configurable concurrency (e.g., 50 concurrent embedding requests). For a knowledge base of 500,000 documents with average processing rate of 200 documents/minute, total re-embedding time is approximately 42 hours. Infrastructure should be pre-provisioned for this capacity.
Dual-model retrieval during migration. During the re-embedding period, implement dual-model retrieval: query using both the old and new models, merge results, and deduplicate. This ensures that documents not yet re-embedded remain accessible through the old model. Example: query is embedded with both Model A and Model B; results from both models are merged by document ID; if a document appears in both result sets, the higher score is used; the merged set is ranked and returned. This adds latency (approximately 2x) but prevents retrieval gaps during migration.

Anti-Patterns to Avoid:

In-place model swap without re-embedding. Changing the query model without re-embedding existing documents is the most common and most damaging anti-pattern. It creates immediate, silent retrieval degradation for all legacy content. This is never acceptable for production systems.
Re-embedding without quality verification. Re-embedding all documents with a new model and deploying without running the benchmark test suite risks discovering quality problems in production. Quality verification before deployment is essential.
No rollback capability. If the migration fails quality verification or causes production issues, the organisation must be able to revert to the previous model and embeddings. Without stored backup embeddings or the ability to re-embed with the old model, rollback is impossible.
Ignoring cross-domain impact of fine-tuning. Fine-tuning on one domain can degrade retrieval in other domains. Impact assessment must cover all domains in the knowledge base, not just the fine-tuned domain.
No dependency tracking for vendor models. Using a vendor's embedding model without tracking the deprecation lifecycle creates the lock-in risk in Scenario C. The registry should include vendor deprecation timelines and trigger migration planning at least 6 months before deprecation.

Industry Considerations

Financial Services. Embedding model changes should be governed under the firm's model risk management framework (SR 11-7 or equivalent). The benchmark test suite should include financial terminology and regulatory concepts. MiFID II record-keeping requirements mean that the ability to retrieve historical documents must be maintained across model migrations.

Healthcare. Clinical terminology sensitivity requires that embedding model changes be validated against clinical query patterns. A model change that degrades retrieval for medical terminology could affect clinical decision support quality. The benchmark suite should include clinical queries with known-relevant clinical evidence.

Legal. Legal terminology is domain-specific and often counter-intuitive in embedding space (e.g., "consideration" in contract law versus general English). The benchmark suite should include legal domain queries. Fine-tuning for legal terminology should be impact-assessed for effects on non-legal content.

Maturity Model

Basic Implementation -- An embedding model registry exists documenting all models in use. Any model change requires a change request and approval. A benchmark test suite of at least 200 queries with ground truth is maintained. Quality verification runs before and after model changes. Rollback to the previous model is possible within 24 hours. Re-embedding is performed as a single batch operation. This meets minimum mandatory requirements.

Intermediate Implementation -- All basic capabilities plus: progressive re-embedding processes critical content first. Dual-model retrieval maintains access during migration. Embedding compatibility matrix documents cross-model compatibility. Re-embedding pipeline can process 1 million documents within 48 hours. Automated quality monitoring detects retrieval degradation within 6 hours of deployment. Model changes are versioned and auditable.

Advanced Implementation -- All intermediate capabilities plus: embedding versioning stores multiple embedding generations per document. Predictive impact assessment estimates retrieval quality impact before migration using a representative sample. The migration pipeline has been independently tested for completeness and rollback reliability. Zero-downtime migration is achieved through progressive re-embedding with dual-model retrieval. The organisation can demonstrate to auditors the complete embedding history of any document.

7. Evidence Requirements

Required artefacts:

Embedding model registry. The current registry showing all models in use, their deployment dates, scope, and document counts.
Migration change records. For each model change: the change request, impact assessment, quality verification results (pre- and post-migration benchmark scores), and approval.
Benchmark test suite and results. The benchmark query set with ground truth labels, and the results of quality verification for each model change.
Re-embedding pipeline documentation. Documentation of the re-embedding pipeline including: processing capacity, priority order, estimated completion time, and dual-model retrieval configuration during migration.
Rollback evidence. Evidence that rollback capability exists and has been tested, including the time required to complete a rollback.

Retention requirements:

Model registry and migration records: minimum 7 years for regulated financial services; minimum 5 years for other regulated sectors; minimum 3 years otherwise.

Access requirements:

Producible to regulators or auditors within 48 hours of request.

8. Test Specification

Test 8.1: Cross-Model Retrieval Degradation Detection

Stimulus: Embed 1,000 documents with Model A. Embed a query with Model B (incompatible). Attempt retrieval.
Expected behaviour: The system detects that the query model differs from the document model and either rejects the query, applies a compatibility layer, or flags the incompatibility.
Pass criteria: The incompatibility is detected. The system does not silently return degraded results. The detection is logged.
Fail criteria: Cross-model retrieval proceeds silently without detection or warning.

Test 8.2: Benchmark Quality Gate Enforcement

Stimulus: Execute the benchmark test suite against a migrated knowledge base where retrieval quality has degraded below the threshold (e.g., MRR decreased by 8%, exceeding the 5% threshold).
Expected behaviour: The quality gate fails. The migration is flagged for investigation. Deployment to production is blocked until the issue is resolved.
Pass criteria: The quality gate correctly identifies the degradation. Deployment is blocked. The specific metrics and thresholds are reported.
Fail criteria: The quality gate passes despite exceeding the degradation threshold, or the gate is not enforced.

Test 8.3: Rollback Execution

Stimulus: Execute a model migration and then trigger a rollback. Verify that retrieval quality returns to pre-migration levels.
Expected behaviour: Rollback restores the previous model and embeddings. Benchmark scores return to pre-migration levels within the defined rollback window.
Pass criteria: Rollback completes within the defined window (e.g., 24 hours). Post-rollback benchmark scores match pre-migration scores within 1% tolerance.
Fail criteria: Rollback cannot be completed, takes longer than the defined window, or does not restore pre-migration retrieval quality.

Test 8.4: Progressive Re-Embedding Priority

Stimulus: Initiate a re-embedding migration for a knowledge base with documents tagged by priority (critical, high, medium, low). Monitor the processing order.
Expected behaviour: Critical documents are re-embedded first, followed by high, medium, and low priority.
Pass criteria: Processing order follows the defined priority. Critical documents are available with new embeddings before low-priority documents begin processing.
Fail criteria: Documents are processed in random order or non-priority order.

Test 8.5: Dual-Model Retrieval During Migration

Stimulus: During a re-embedding migration (50% of documents re-embedded), query the knowledge base with a query relevant to both re-embedded and not-yet-re-embedded documents.
Expected behaviour: Both re-embedded and legacy documents are retrieved and merged. No documents are invisible due to the migration being in progress.
Pass criteria: Documents from both embedding generations appear in results. Recall is not degraded compared to pre-migration levels.
Fail criteria: Not-yet-re-embedded documents are missing from results, or retrieval quality degrades during migration.

Test 8.6: Registry Accuracy

Stimulus: After a model migration, verify that the embedding model registry accurately reflects the current state: new model recorded, document counts updated, migration timestamp recorded.
Expected behaviour: The registry is updated to reflect the migration. Document counts match actual counts. No stale registry entries remain active.
Pass criteria: Registry accurately reflects the current embedding model landscape. All fields are current and consistent with actual state.
Fail criteria: Registry contains stale information, missing model entries, or inaccurate document counts.

Conformance Scoring

Score 0: No embedding migration governance -- model changes are untracked, re-embedding is ad hoc, and retrieval quality is not verified after changes.
Score 1: An embedding model registry exists, but changes are not governed -- model updates occur without impact assessment or quality verification.
Score 2: Governed change process with impact assessment, benchmark quality verification, rollback capability, progressive re-embedding, and dual-model retrieval during migration.
Score 3: Verified by independent testing -- an independent party has confirmed that quality gates detect retrieval degradation, rollback completes within the defined window, and zero-downtime migration is achieved through dual-model retrieval. Benchmark test suite covers all content domains.

9. Regulatory Mapping

Regulation	Provision	Relationship Type
EU AI Act	Article 9 (Risk Management System)	Supports compliance
EU AI Act	Article 15 (Accuracy, Robustness, Cybersecurity)	Direct requirement
EU AI Act	Article 12 (Record-Keeping)	Supports compliance
NIST AI RMF	MANAGE 2.2, MANAGE 4.1	Supports compliance
ISO 42001	Clause 6.1 (Actions to Address Risks), Clause 8.1 (Operational Planning and Control)	Supports compliance
DORA	Article 9 (ICT Risk Management Framework)	Supports compliance

EU AI Act -- Article 15 (Accuracy, Robustness, Cybersecurity)

Article 15 requires high-risk AI systems to maintain appropriate levels of accuracy and robustness. An unmanaged embedding model change can silently degrade retrieval accuracy, which in turn degrades the accuracy of all agent outputs. AG-337 directly supports accuracy maintenance by ensuring that model changes are impact-assessed and quality-verified before deployment. The rollback capability supports robustness by providing recovery from quality failures.

EU AI Act -- Article 12 (Record-Keeping)

Article 12 requires record-keeping for traceability. The embedding model registry and migration change records provide the audit trail for understanding which model produced which embeddings and when changes occurred. This traceability is essential for investigating retrieval quality issues and demonstrating due diligence in model management.

EU AI Act -- Article 9 (Risk Management System)

Embedding model changes are a risk to retrieval quality and, by extension, to agent output quality. AG-337's governed change process is a risk management control.

NIST AI RMF -- MANAGE 2.2, MANAGE 4.1

MANAGE 2.2 addresses risk mitigation. MANAGE 4.1 addresses post-deployment monitoring. Embedding model governance mitigates the risk of retrieval degradation, and quality verification provides post-deployment monitoring of retrieval quality.

ISO 42001 -- Clause 6.1, Clause 8.1

Clause 6.1 requires actions to address risks. Clause 8.1 requires operational planning and control. Embedding model migration is an operational process that must be planned and controlled to avoid retrieval quality risks.

DORA -- Article 9

Article 9 requires financial entities to maintain an ICT risk management framework. Embedding models are ICT components whose changes must be governed within the risk management framework to prevent service degradation.

10. Failure Severity

Field	Value
Severity Rating	High
Blast Radius	Organisation-wide -- affects all retrieval across the entire knowledge base

Consequence chain: Without embedding model migration governance, model changes silently degrade retrieval quality across the entire knowledge base. The immediate failure is invisible document loss: documents embedded with an incompatible model become unretrievable despite being present in the database (Scenario A -- 250,000 documents invisible for 3 weeks, 847 incorrect responses). The secondary failure is cross-domain degradation from fine-tuning (Scenario B -- engineering team unable to retrieve technical documentation, £32,000 in rework). The operational failure is unplanned migration costs when vendor models are deprecated (Scenario C -- £28,000 emergency compute costs plus 2-week project diversion). The blast radius is organisation-wide because the embedding model underlies all retrieval across all collections. A single unmanaged model change can degrade every agent response that relies on the knowledge base.

Cross-references: AG-040 (Persistent Memory Governance) provides the foundational framework for the knowledge base that embeddings represent. AG-082 (Data Minimisation Enforcement) reduces the volume of content requiring re-embedding. AG-122 (Knowledge Integrity Verification) verifies the integrity of knowledge that must be preserved across model migrations. AG-132 (Memory Scope Boundary Enforcement) defines the scope boundaries that apply to re-embedding prioritisation. AG-179 (Memory Audit Trail Governance) captures migration events in the audit trail. AG-333 (Retrieved Evidence Confidence Governance) may need threshold recalibration after model changes. AG-336 (Knowledge Freshness Attestation Governance) is affected because re-embedding resets the embedding timestamp but does not reset content freshness. AG-338 (Retrieval Poisoning Quarantine Governance) should be re-evaluated after model changes as poisoning vectors may differ between models.

Cite this protocol

AgentGoverning. (2026). AG-337: Embedding Model Migration Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-337

← Previous Protocol

AG-336

Knowledge Freshness Attestation Governance

Next Protocol →

AG-338

Retrieval Poisoning Quarantine Governance