AG-230

Substantial Modification Determination Governance

Legal, Regulatory & Records · ~16 min read · AGS v2.1 · April 2026

EU AI Act · FCA · NIST · ISO 42001

2. Summary

Substantial Modification Determination Governance requires that every change to an AI agent — its model, training data, operational parameters, deployment context, or governance configuration — is assessed against defined materiality thresholds to determine whether the change constitutes a substantial modification triggering reclassification, reapproval, or recertification. This dimension prevents incremental changes from cumulatively transforming an agent's risk profile without triggering the governance reviews that the original deployment required. The EU AI Act, medical device regulations, and financial services change management frameworks all impose reclassification obligations upon substantial modification — AG-230 implements the determination mechanism that identifies when those obligations activate.

3. Example

Scenario A — Incremental Fine-Tuning Creates Undetected Reclassification Trigger: A financial services agent is deployed for customer suitability assessments, classified as high-risk under the EU AI Act and certified through a conformity assessment. Over 8 months, the development team applies 23 incremental fine-tuning updates based on customer interaction data. Each update modifies less than 0.5% of model weights and passes unit tests. No individual update triggers the change management threshold of "material change." After 8 months, the cumulative effect of the 23 updates has shifted the model's behaviour significantly: the suitability recommendations have drifted by 14% from the original certified behaviour on a benchmark test suite. A regulatory audit identifies that the agent's behaviour no longer matches the conformity assessment documentation. The regulator determines that the cumulative changes constitute a substantial modification under EU AI Act Article 43(4), requiring a new conformity assessment. The agent must be taken offline pending reassessment.

What went wrong: Each individual change was assessed independently against the materiality threshold. No mechanism tracked the cumulative impact of successive changes. The determination process evaluated changes in isolation rather than measuring cumulative drift from the certified baseline. Consequence: 3-month service interruption during reassessment, EUR 1.8 million reassessment cost, regulatory finding for failure to maintain conformity, and customer impact from suspended suitability services.
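A minimal sketch, in Python with an illustrative benchmark name and the 5% behavioural-drift threshold from requirement 4.3, of why per-change assessment misses this failure: each update is compared both to the immediately preceding version and to the certified baseline, and only the baseline comparison detects the cumulative shift.

```python
# Hypothetical sketch: cumulative drift must be measured against the certified
# baseline, not only against the previous version. Names and numbers are illustrative.

BASELINE_SCORES = {"suitability_benchmark": 0.90}   # scores at conformity assessment
MATERIALITY_THRESHOLD = 0.05                         # 5% behavioural drift (req. 4.3)
PER_CHANGE_THRESHOLD = 0.05                          # the flawed per-change test

def relative_drift(current: float, reference: float) -> float:
    """Relative change of a benchmark score against a reference score."""
    return abs(current - reference) / reference

def assess_update(new_scores, previous_scores, baseline_scores):
    """Return (per_change_material, cumulative_material) for one update."""
    per_change = max(
        relative_drift(new_scores[m], previous_scores[m]) for m in baseline_scores
    )
    cumulative = max(
        relative_drift(new_scores[m], baseline_scores[m]) for m in baseline_scores
    )
    return per_change > PER_CHANGE_THRESHOLD, cumulative > MATERIALITY_THRESHOLD

# 23 small updates, each shifting the benchmark score by ~0.6% of the previous value
previous = dict(BASELINE_SCORES)
for _ in range(23):
    new = {"suitability_benchmark": previous["suitability_benchmark"] * (1 - 0.006)}
    per_change_material, cumulative_material = assess_update(new, previous, BASELINE_SCORES)
    previous = new

print(per_change_material)   # False — no single update looked material
print(cumulative_material)   # True  — cumulative drift from baseline exceeds 5%
```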

Scenario B — Deployment Context Change Triggers Unrecognised Reclassification: An AI agent originally deployed as an internal research assistant (classified as limited risk under the EU AI Act) is repurposed by a business unit as a customer-facing advisory tool for insurance product selection. The model is identical — no technical change was made. The business unit does not consult the compliance or legal teams because "nothing changed about the AI." However, the change in deployment context — from internal research to customer-facing financial advice — changes the risk classification from limited risk to high risk. The agent operates for 4 months without the required conformity assessment, transparency obligations, or human oversight mechanisms required for high-risk AI systems. A customer complaint triggers a regulatory inquiry that identifies the classification gap.

What went wrong: The substantial modification determination process evaluated only technical changes (model updates, data changes, parameter modifications). It did not evaluate deployment context changes (user population, use case, regulatory exposure). The deployment context change — identical technology, fundamentally different risk profile — was not captured. Consequence: Regulatory enforcement for operating a high-risk AI system without conformity assessment, 4 months of non-compliant customer interactions requiring remediation, and potential customer compensation for unsuitable advice.
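A sketch, assuming a simple record of classification-relevant context fields (the field names are illustrative), of how a deployment-context change with no technical change can still be flagged for reclassification review:

```python
# Illustrative sketch: a deployment-context fingerprint. Any change to a
# classification-relevant field triggers a reclassification review even when
# the model artefact itself is unchanged.

CLASSIFICATION_RELEVANT_FIELDS = {"user_population", "use_case", "jurisdiction"}

approved_context = {
    "user_population": "internal staff",
    "use_case": "research assistance",
    "jurisdiction": "EU",
    "model_hash": "sha256:abc123...",   # hypothetical digest of the model artefact
}

proposed_context = {
    "user_population": "retail customers",
    "use_case": "insurance product advice",
    "jurisdiction": "EU",
    "model_hash": "sha256:abc123...",   # identical model — "nothing changed about the AI"
}

def context_changes(approved: dict, proposed: dict) -> set:
    """Fields whose change is relevant to risk classification."""
    return {
        field for field in CLASSIFICATION_RELEVANT_FIELDS
        if approved.get(field) != proposed.get(field)
    }

changed = context_changes(approved_context, proposed_context)
if changed:
    print(f"Reclassification review required; changed context fields: {sorted(changed)}")
```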

Scenario C — Training Data Expansion Into Protected Categories: A public-sector agent used for benefits eligibility screening is retrained with an expanded dataset that includes additional demographic fields not present in the original training data: ethnicity, disability status, and religious affiliation. The retraining improves prediction accuracy by 3.2% on the test set. The change is classified as "performance improvement" and deployed without reclassification review. A civil rights organisation files a challenge demonstrating that the model now uses protected characteristics as features — something the original conformity assessment explicitly excluded. The agency faces a discrimination lawsuit and regulatory investigation.

What went wrong: The training data change was evaluated on the performance dimension (3.2% accuracy improvement) but not on the legal dimension (introduction of protected characteristics into the feature space). The materiality threshold was defined technically rather than legally. Consequence: Discrimination lawsuit with potential class-action scope, regulatory investigation by the equality body, mandatory model rollback to pre-expansion state, and reputational damage to the public-sector agency.
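A minimal sketch of a legal-dimension screen on training-data changes: newly added feature columns are checked against a list of protected characteristics before a change can be classified as a mere performance improvement. The column and category names are illustrative.

```python
# Illustrative sketch: screen newly added feature columns against protected
# characteristics. Category and column names are hypothetical.

PROTECTED_CHARACTERISTICS = {"ethnicity", "disability_status", "religious_affiliation"}

approved_features = {"income", "employment_status", "household_size"}
proposed_features = approved_features | {"ethnicity", "disability_status", "religious_affiliation"}

added = proposed_features - approved_features
protected_added = added & PROTECTED_CHARACTERISTICS

if protected_added:
    # Legally substantial regardless of any accuracy improvement (cf. requirement 4.3)
    print(f"Substantial modification: protected characteristics added: {sorted(protected_added)}")
```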

4. Requirement Statement

Scope: This dimension applies to every AI agent that has undergone a conformity assessment, certification, regulatory approval, internal governance sign-off, or any other structured approval process at deployment time. It also applies to agents operating in regulated sectors where changes to AI systems may trigger re-notification, re-registration, or re-certification obligations. The scope covers all change types: model changes (retraining, fine-tuning, architecture modification), data changes (training data expansion, feature addition or removal), parameter changes (temperature, sampling strategy, context window), deployment context changes (user population, use case, jurisdiction, integration point), and governance configuration changes (mandate limits, access controls, monitoring thresholds). The scope extends to changes made by automated systems (e.g., continuous learning, automated retraining pipelines) — the determination obligation applies regardless of whether the change was made by a human or an automated process.
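One way to make this scope operational is a change-record structure that forces every change, whether human-initiated or automated, into one of the categories above before assessment. The sketch below uses hypothetical field and category names:

```python
# Sketch of a change-record taxonomy covering the change types in scope.
# Field and category names are illustrative, not prescribed by any regulation.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class ChangeType(Enum):
    MODEL = "model"                            # retraining, fine-tuning, architecture
    DATA = "data"                              # training data expansion, feature add/remove
    PARAMETER = "parameter"                    # temperature, sampling, context window
    DEPLOYMENT_CONTEXT = "deployment_context"  # user population, use case, jurisdiction
    GOVERNANCE_CONFIG = "governance_config"    # mandate limits, access controls, monitoring

@dataclass
class ChangeRecord:
    agent_id: str
    change_type: ChangeType
    description: str
    initiated_by: str                    # human identity or automated pipeline identifier
    baseline_id: str                     # approved baseline the change is assessed against
    automated: bool = False              # determination obligation applies either way
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

record = ChangeRecord(
    agent_id="suitability-agent",
    change_type=ChangeType.DATA,
    description="Expanded training data with three demographic fields",
    initiated_by="retraining-pipeline-v4",   # hypothetical automated pipeline
    baseline_id="baseline-2026-01-15",
    automated=True,
)
```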

4.1. A conforming system MUST evaluate every change to an agent — whether to its model, training data, operational parameters, deployment context, or governance configuration — against defined materiality thresholds before the change is deployed to production.

4.2. A conforming system MUST track cumulative change impact from the most recent approved baseline, not only the incremental impact of each individual change, to prevent incremental drift from bypassing materiality thresholds (see the combined sketch following requirement 4.10).

4.3. A conforming system MUST define materiality thresholds that include both technical dimensions (e.g., behavioural drift exceeding 5% on the approved benchmark suite) and legal dimensions (e.g., introduction of protected characteristics, change of user population, change of jurisdiction).

4.4. A conforming system MUST block deployment of changes determined to constitute a substantial modification until the required reclassification, reapproval, or recertification process is completed.

4.5. A conforming system MUST maintain a change ledger recording every change, its materiality assessment, the baseline against which it was assessed, the determination outcome, and the identity of the person or process that made the determination.

4.6. A conforming system MUST define, for each applicable regulation, the specific criteria that constitute a substantial modification under that regulation and map each change assessment to those criteria.

4.7. A conforming system SHOULD implement automated drift detection that continuously measures the agent's current behaviour against the approved baseline and raises alerts when cumulative drift approaches materiality thresholds (e.g., at 70% and 90% of the threshold).

4.8. A conforming system SHOULD require independent review (not by the team that made the change) for materiality determinations where the cumulative drift exceeds 50% of any materiality threshold.

4.9. A conforming system SHOULD support automated rollback to the most recent approved baseline if a change that has not completed the substantial modification determination process is detected in production.

4.10. A conforming system MAY implement sandbox environments where changes can be evaluated against materiality thresholds before entering the deployment pipeline.
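A simplified sketch, with illustrative thresholds and dimension names, of how requirements 4.2, 4.3, 4.4, 4.7, and 4.8 can combine in a single determination gate: cumulative drift from the approved baseline is compared against per-dimension thresholds, legal triggers are treated as substantial regardless of technical magnitude, early warnings fire at 70% and 90% of a threshold, independent review is flagged above 50%, and a substantial determination blocks deployment.

```python
# Simplified determination gate. Thresholds and dimension names are illustrative.

from dataclasses import dataclass

@dataclass
class Determination:
    substantial: bool          # blocks deployment until reapproval (req. 4.4)
    independent_review: bool   # required above 50% of any threshold (req. 4.8)
    warnings: list             # early-warning alerts at 70% / 90% (req. 4.7)

# Materiality thresholds per dimension, as a fraction of the baseline (req. 4.3).
THRESHOLDS = {"behavioural_drift": 0.05, "feature_space_change": 0.10}
LEGAL_TRIGGERS = {"protected_characteristic_added", "user_population_changed",
                  "jurisdiction_changed"}

def determine(cumulative_drift: dict, legal_flags: set) -> Determination:
    """Evaluate cumulative drift from the approved baseline plus legal triggers."""
    warnings, review, substantial = [], False, False

    for dimension, threshold in THRESHOLDS.items():
        ratio = cumulative_drift.get(dimension, 0.0) / threshold
        if ratio >= 1.0:
            substantial = True
        elif ratio >= 0.9:
            warnings.append(f"{dimension} at 90% of materiality threshold")
        elif ratio >= 0.7:
            warnings.append(f"{dimension} at 70% of materiality threshold")
        if ratio > 0.5:
            review = True

    if legal_flags & LEGAL_TRIGGERS:
        substantial = True     # legally substantial regardless of technical magnitude
        review = True

    return Determination(substantial, review, warnings)

result = determine({"behavioural_drift": 0.038}, legal_flags=set())
print(result)   # not substantial, but independent review and a 70% warning fire

if result.substantial:
    raise SystemExit("Deployment blocked pending reclassification / reapproval")
```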

5. Rationale

AI agents are not static. They are retrained, fine-tuned, reconfigured, redeployed, and extended continuously. Each change, individually, may be minor. But the cumulative effect of many minor changes can fundamentally alter the agent's behaviour, risk profile, or regulatory classification. The legal frameworks governing AI systems recognise this: the EU AI Act Article 43(4) specifically addresses substantial modification, requiring a new conformity assessment when a modification is "substantial." Medical device regulations (MDR Article 120, FDA 510(k)) impose similar obligations. Financial services change management frameworks (e.g., the PRA's SS1/23 model risk management principles) require re-validation when model changes are material.

The challenge is that "substantial" and "material" are legal determinations, not purely technical ones. A 0.1% change in model weights that introduces a protected characteristic into the feature space is legally substantial even though it is technically trivial. A change in deployment context — same model, different use case — can trigger reclassification with zero technical modification. The determination mechanism must therefore evaluate changes across multiple dimensions: technical (how much did the behaviour change?), legal (did the change introduce new legal exposures?), regulatory (does the change trigger re-notification under any applicable regulation?), and contextual (did the deployment context change in a legally relevant way?).

The alternative — evaluating each change independently and only against technical thresholds — creates the "ship of Theseus" problem. After enough incremental changes, the deployed agent bears no resemblance to the agent that was originally assessed and approved, but no single change triggered the reclassification process. AG-230 prevents this by requiring cumulative tracking from the approved baseline.

6. Implementation Guidance

The substantial modification determination requires two components: a baseline definition (the approved state against which changes are measured) and a determination engine (the process that evaluates each change against materiality thresholds).
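A sketch of one possible baseline-definition artefact: a fingerprint capturing digests of the approved model, data, and configuration alongside the benchmark scores and deployment context recorded at approval. The field names are illustrative; AG-007 governs the versioning of such a record and AG-006 its tamper evidence.

```python
# Illustrative baseline fingerprint for the approved state. Hashing a JSON
# serialisation gives a stable identifier the change ledger can reference.

import hashlib
import json

baseline = {
    "agent_id": "suitability-agent",
    "approved_on": "2026-01-15",
    "model_digest": "sha256:abc123...",          # hypothetical artefact digests
    "training_data_digest": "sha256:def456...",
    "config_digest": "sha256:789aaa...",
    "deployment_context": {
        "user_population": "retail customers",
        "use_case": "suitability assessment",
        "jurisdiction": "EU",
    },
    "benchmark_scores": {"suitability_benchmark": 0.90},
    "approval_reference": "conformity-assessment-2026-001",
}

baseline_id = hashlib.sha256(
    json.dumps(baseline, sort_keys=True).encode("utf-8")
).hexdigest()
print(f"baseline fingerprint: {baseline_id[:16]}...")
```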

Recommended patterns: Maintain a baseline fingerprint of every approved agent version and assess each change against that fingerprint, not merely against the previous version. Define materiality thresholds across technical, legal, regulatory, and contextual dimensions, and track cumulative drift automatically with early-warning alerts before thresholds are reached. Integrate the determination into the deployment pipeline so that no change ships without a recorded materiality assessment, and require independent review as cumulative drift approaches a threshold.

Anti-patterns to avoid: Assessing each change only against the immediately preceding version rather than the approved baseline. Defining materiality thresholds in purely technical terms. Exempting deployment context or governance configuration changes because "the model did not change." Exempting changes made by automated retraining pipelines from the determination obligation. Treating the change ledger as after-the-fact documentation rather than a deployment gate.

Industry Considerations

Financial Services. The PRA and FCA expect firms to maintain model risk management frameworks (aligned with the PRA's SS1/23) that include change management with materiality assessment. A model change that alters output by more than a defined threshold (commonly 5-10% on key metrics) typically triggers re-validation. For AI agents, the FCA has indicated that changes to the agent's operational scope (e.g., expanding from advisory to execution) are inherently material regardless of the technical magnitude of the change.

Healthcare / Medical Devices. The EU Medical Devices Regulation (MDR) and FDA's 510(k) framework both define substantial modification criteria for software as a medical device (SaMD). Changes to intended use, clinical significance of outputs, or core algorithm architecture are generally considered substantial. The International Medical Device Regulators Forum (IMDRF) provides guidance on when AI/ML-based SaMD changes require new regulatory submissions.

Public Sector. Algorithmic impact assessments required by frameworks such as Canada's Algorithmic Impact Assessment Tool or the EU AI Act's conformity assessment for high-risk systems create specific baselines. Changes that would alter the impact assessment outcome are inherently substantial. Public sector AI has heightened sensitivity to changes affecting fairness, bias, and discrimination.

Maturity Model

Basic Implementation — The organisation maintains a change log for each deployed agent and conducts manual materiality assessments for significant changes (model retrains, major feature changes). Deployment context changes are captured when the development team identifies them. Cumulative tracking is manual — a reviewer examines the change history periodically. This level catches obvious substantial modifications but misses incremental drift and deployment context changes that the development team does not recognise as material.

Intermediate Implementation — The organisation has a baseline fingerprint for each approved agent version. Every change is assessed against the baseline using the multi-dimensional framework (behavioural, feature space, context, regulatory). Cumulative drift is tracked automatically. Early warning alerts fire at 70% and 90% of materiality thresholds. Independent review is required for changes exceeding 50% of any threshold. The determination process is integrated into the CI/CD pipeline — changes cannot deploy without a recorded materiality assessment.

Advanced Implementation — All intermediate capabilities plus: automated drift detection continuously measures production behaviour against the baseline (not just at change time). The regulatory trigger analysis is automated — the system evaluates each change against the specific substantial modification criteria for every applicable regulation and generates a jurisdiction-specific determination. Sandbox evaluation allows changes to be tested against materiality thresholds before entering the deployment pipeline. The organisation can demonstrate to any regulator the complete chain from original approval through every change to current state, with materiality assessments for each step.

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Individual Change Assessment

Test 8.2: Cumulative Drift Detection

Test 8.3: Deployment Context Change Detection

Test 8.4: Legal Dimension Assessment

Test 8.5: Early Warning Alerts

Test 8.6: Blocking of Unassessed Changes

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 43(4) (Substantial Modification) | Direct requirement
EU AI Act | Article 16 (Provider Obligations — Conformity Maintenance) | Direct requirement
EU MDR | Article 120, MDCG 2020-3 (Significant Change) | Direct requirement
FDA | 510(k) Substantial Equivalence, Predetermined Change Control Plan | Direct requirement
PRA SS1/23 | Model Risk Management — Change Management | Supports compliance
NIST AI RMF | MANAGE 2.3 (Risk Monitoring), MANAGE 4.1 (Change Management) | Supports compliance
ISO 42001 | Clause 8.2 (AI Risk Assessment), Clause 10.2 (Continual Improvement) | Supports compliance

EU AI Act — Article 43(4) (Substantial Modification)

Article 43(4) requires that when a high-risk AI system undergoes a "substantial modification," it must undergo a new conformity assessment. The Act (Article 3(23)) defines a substantial modification as a change, not foreseen in the initial conformity assessment, that affects the system's compliance with the high-risk requirements in Chapter III, Section 2, or that modifies the intended purpose for which the system has been assessed. AG-230 implements the determination mechanism that identifies whether a given change meets this definition. The key implementation challenge is that the EU AI Act does not provide quantitative thresholds for "substantial" — organisations must define their own thresholds and be prepared to defend them to regulators. AG-230 requires that these thresholds cover both technical and legal dimensions, preventing the common failure of assessing changes on technical criteria alone.
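A sketch of requirement 4.6 applied to this provision: the outputs of the multi-dimensional assessment are mapped onto a simplified reading of the Act's own test (a change not foreseen in the initial conformity assessment that affects compliance with the high-risk requirements or modifies the intended purpose). The boolean inputs are assumptions produced by the organisation's own assessment, not terms defined by the Act.

```python
# Illustrative mapping of a change assessment onto the EU AI Act's substantial
# modification test (Article 3(23) / Article 43(4)). Input flags are assumptions
# produced by the organisation's own multi-dimensional assessment.

def substantial_under_eu_ai_act(affects_high_risk_requirements: bool,
                                modifies_intended_purpose: bool,
                                foreseen_in_initial_assessment: bool) -> bool:
    """True when the change would trigger a new conformity assessment under Article 43(4)."""
    if foreseen_in_initial_assessment:
        # Changes foreseen or planned at the initial conformity assessment are carved out.
        return False
    return affects_high_risk_requirements or modifies_intended_purpose

# Scenario B above: same model, new intended purpose (customer-facing advice).
print(substantial_under_eu_ai_act(
    affects_high_risk_requirements=False,
    modifies_intended_purpose=True,
    foreseen_in_initial_assessment=False,
))   # True — a new conformity assessment is required
```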

EU AI Act — Article 16 (Provider Obligations)

Article 16 requires providers to ensure that their high-risk AI systems continue to comply with the requirements of the Act throughout the system's lifecycle. This creates an ongoing obligation to monitor for changes that could affect compliance. AG-230's cumulative drift tracking directly implements this ongoing monitoring obligation.

EU MDR — Significant Change Determination

The EU Medical Devices Regulation and associated MDCG guidance define criteria for when changes to software as a medical device constitute a "significant change" requiring new regulatory submissions. For AI-based medical devices, these criteria include changes to intended use, changes to the clinical significance of outputs, changes to the core algorithm architecture, and changes that could affect safety or performance. AG-230's multi-dimensional assessment framework covers all these criteria.

FDA — Predetermined Change Control Plan

The FDA's framework for AI/ML-based Software as a Medical Device includes the concept of a Predetermined Change Control Plan (PCCP) — a pre-approved plan defining the types of changes the manufacturer intends to make and the methodology for evaluating their impact. AG-230's change classification taxonomy and materiality thresholds align with the PCCP framework, allowing organisations to define pre-approved change categories with associated assessment criteria.
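A sketch of how pre-approved change categories might be represented, with hypothetical category names and bounds that are not taken from FDA guidance: a change inside a pre-approved category and within its agreed bound follows the pre-agreed evaluation method, while anything else falls back to the full determination process.

```python
# Illustrative sketch of a PCCP-style pre-approved change plan: change categories
# the organisation has pre-defined, each with an agreed evaluation method and bound.
# Category names and bounds are hypothetical, not taken from FDA guidance.

PRE_APPROVED_CHANGES = {
    "retrain_same_data_sources": {"evaluation": "benchmark suite", "max_drift": 0.03},
    "parameter_tuning":          {"evaluation": "regression tests", "max_drift": 0.02},
}

def within_change_control_plan(change_category: str, measured_drift: float) -> bool:
    """True if the change falls inside a pre-approved category and its agreed bound."""
    plan = PRE_APPROVED_CHANGES.get(change_category)
    return plan is not None and measured_drift <= plan["max_drift"]

# A change outside the plan (or exceeding its bound) falls back to the full
# substantial modification determination process.
print(within_change_control_plan("retrain_same_data_sources", 0.021))  # True
print(within_change_control_plan("new_intended_use", 0.0))             # False
```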

PRA SS1/23 — Model Change Management

The PRA's SS1/23 expects firms to have change management processes for models that include materiality assessment and re-validation triggers, and the FCA expects changes to AI models used in regulated activities to be subject to the same change management rigour as traditional quantitative models. AG-230 implements the materiality determination layer that feeds the broader change management process.

10. Failure Severity

Severity Rating: High
Blast Radius: System-specific, but with potential organisation-wide regulatory consequences

Consequence chain: Without substantial modification determination, an agent's risk profile can drift from its approved baseline without detection. The immediate technical consequence is that the deployed agent no longer matches its conformity assessment, certification, or approval documentation. The regulatory consequence is that the organisation is operating an unapproved or uncertified AI system — a compliance violation in every jurisdiction that imposes conformity assessment or certification requirements. For high-risk AI systems under the EU AI Act, this can result in fines of up to EUR 15 million or 3% of worldwide annual turnover, whichever is higher. For medical devices, this can result in product recall, market withdrawal, and criminal liability. For financial services, this can result in regulatory enforcement, client remediation, and personal liability under senior manager regimes. The cumulative nature of the risk means that the longer the drift continues undetected, the larger the remediation cost: an agent that has drifted over 12 months requires a full reassessment, re-validation, and potentially re-deployment — a process that can take 3-6 months and cost millions in assessment fees, operational disruption, and customer impact.

Cross-references: AG-007 (Governance Configuration Control) governs the versioning and immutability of the baseline fingerprint that AG-230 measures against. AG-022 (Behavioural Drift Detection) provides the continuous monitoring that feeds AG-230's cumulative drift tracker. AG-021 (Regulatory Obligation Identification) identifies the specific regulatory requirements that define what constitutes a substantial modification in each applicable jurisdiction. AG-229 (Jurisdictional Applicability Mapping Governance) determines which jurisdictions' substantial modification criteria apply. AG-006 (Tamper-Evident Record Integrity) ensures that the change ledger and baseline fingerprints are tamper-evident.

Cite this protocol
AgentGoverning. (2026). AG-230: Substantial Modification Determination Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-230