Financial Model Challenge Governance requires that every model, algorithm, or decision logic used by an AI agent operating in financial services is subject to independent challenge — a structured process where qualified individuals or systems that are independent of the model's development and operation critically evaluate the model's assumptions, methodology, data, outputs, and fitness for purpose. The challenge function must have the authority to restrict or suspend model use when deficiencies are identified, must operate on a defined schedule and on a triggered basis when material changes occur, and must produce documented findings that are tracked to resolution. This dimension implements the "three lines of defence" model for AI agent operations: the agent and its operators constitute the first line, the model challenge function constitutes the second line, and internal audit (or external assurance) constitutes the third line. Without effective model challenge, an AI agent may operate on flawed assumptions, degraded models, or outdated logic for extended periods — generating systematic errors at scale without detection.
Scenario A — Unchallenged Model Drift in Credit Scoring: An AI agent uses a credit scoring model trained on data from 2019–2022 to make consumer lending decisions. The model was validated at deployment and performed well against test data. However, no model challenge process exists to evaluate the model's ongoing fitness. By 2025, macroeconomic conditions have changed materially: interest rates have risen from 0.1% to 5.25%, the cost of living has increased significantly, and consumer debt patterns have shifted. The model's assumptions about default probability, which were calibrated to a low-interest-rate environment, systematically underestimate default risk. The agent approves 12,400 loans over 18 months that a properly calibrated model would have declined or priced differently. The eventual default rate on this cohort is 8.7% against the model's predicted 3.2%, generating £34,000,000 in excess losses.
What went wrong: No model challenge function evaluated whether the model's assumptions remained valid as economic conditions changed. The model was validated once at deployment but never challenged thereafter. No trigger mechanism existed to initiate challenge when material economic changes occurred (an interest rate increase of 515 basis points represents a material change by any reasonable standard). The agent continued to rely on a model whose fundamental assumptions were invalidated by changed conditions. Consequence: £34,000,000 in excess credit losses, PRA enforcement action under SS1/23 for inadequate model risk management, requirement to suspend AI-driven lending pending model recalibration and independent validation, write-down of the loan book affecting the firm's capital ratios.
Scenario B — Challenger Model Identifies Systematic Pricing Error: An AI agent pricing insurance policies uses a catastrophe risk model that estimates expected losses from weather events. The model uses historical weather data from 1990–2020 as its primary input. An independent model challenge process runs a challenger model using climate-adjusted projections that account for increasing frequency and severity of extreme weather events. The challenger model estimates expected losses 40% higher than the production model for flood-risk properties. Without the challenge process, the agent would underprice flood risk by an average of £1,200 per policy across 8,500 flood-risk policies, creating an aggregate underpricing exposure of £10,200,000 per year. The challenge finding triggers a model recalibration that corrects the pricing before the exposure materialises.
What prevented harm: The independent model challenge process identified the discrepancy between the production model's historical assumptions and the challenger model's forward-looking projections. The challenge function had the authority to require recalibration before the underpriced policies were issued. The challenger model's methodology was documented, its assumptions were transparent, and its findings were tracked to resolution within a defined SLA. This is the model challenge process working as intended.
Scenario C — Model Challenge Identifies Overfitting in Fraud Detection: An AI agent managing fraud detection for payment transactions uses a machine learning model that achieves a 99.7% accuracy rate on its training data. The model challenge function evaluates the model using an independent holdout dataset and adversarial test cases. The challenge reveals that the model has overfit to specific fraud patterns in the training data and fails to detect novel fraud techniques — its accuracy on the independent dataset is 71.3%, and its detection rate for adversarial fraud patterns (transaction structuring to evade detection thresholds) is 12%. Without the challenge, the agent would deploy with apparent 99.7% effectiveness but actual real-world effectiveness of approximately 71%, leaving a 29% gap in fraud detection that equates to an estimated £7,800,000 per year in undetected fraud.
What the challenge found: The model had memorised training data patterns rather than learning generalisable fraud indicators. The 99.7% training accuracy was an artefact of overfitting, not evidence of genuine capability. The challenge function's use of independent data and adversarial test cases revealed the true performance gap. The finding blocked deployment pending model redesign with regularisation techniques, cross-validation, and adversarial robustness training. Consequence avoided: £7,800,000 per year in potential undetected fraud.
Scope: This dimension applies to all models, algorithms, decision logic, and quantitative methodologies used by AI agents in financial services to make or support decisions that affect: credit risk assessment, market risk measurement, counterparty risk evaluation, pricing of financial products, fraud detection, anti-money laundering screening, investment recommendations, portfolio construction, trade execution strategy, customer segmentation, vulnerability assessment, and any other quantitative process whose output influences financial decisions or customer outcomes. The scope includes both the primary model and any models used for pre-processing, feature engineering, or post-processing of the primary model's outputs. Agents using only deterministic, rule-based logic with no statistical or machine learning components may be excluded, provided the rules themselves are subject to periodic review — but note that most modern AI agents incorporate statistical components even in apparently rule-based systems.
4.1. A conforming system MUST establish an independent model challenge function for every model used by AI agents in financial services, where "independent" means the challenge function is organisationally separate from the model development team and the model's operational users, and has no financial or professional incentive to approve the model.
4.2. A conforming system MUST subject every model to challenge before initial deployment (pre-deployment validation) and on a recurring schedule thereafter (periodic challenge), where the periodic challenge frequency is commensurate with the model's risk materiality but not less than annually for high-risk models.
4.3. A conforming system MUST define trigger conditions that initiate unscheduled model challenge, including: material changes to input data distributions, material changes to economic or market conditions, model performance degradation below defined thresholds, regulatory changes affecting the model's domain, and changes to the model's code, parameters, or training data.
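One widely used way to test the first trigger condition in 4.3 — a material change in input data distributions — is the population stability index (PSI). The sketch below is illustrative, not normative: the 0.25 threshold is a common rule-of-thumb standing in for a firm-calibrated value, and binning choices should follow the model's own documentation.

```python
import math

def population_stability_index(expected, actual, bins=10):
    """Compute PSI between a baseline (expected) sample and a recent
    (actual) sample of a model input. Values above ~0.25 are conventionally
    treated as a material distribution shift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a degenerate range

    def frac(sample, i):
        left, right = lo + i * width, lo + (i + 1) * width
        n = sum(left <= x < right or (i == bins - 1 and x == hi) for x in sample)
        return max(n / len(sample), 1e-6)  # floor avoids log(0)

    return sum(
        (frac(actual, i) - frac(expected, i))
        * math.log(frac(actual, i) / frac(expected, i))
        for i in range(bins)
    )

def drift_trigger(expected, actual, threshold=0.25):
    """True when the input shift breaches the unscheduled-challenge threshold."""
    return population_stability_index(expected, actual) > threshold
```

A baseline sample identical to the recent sample yields a PSI near zero; a sample concentrated far from the baseline breaches the threshold and should raise a trigger event for the challenge function.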
4.4. A conforming system MUST require the model challenge function to evaluate, at minimum: the model's conceptual soundness (are the assumptions reasonable?), the model's data quality and representativeness (is the training data appropriate and current?), the model's performance against independent data (not the data used for training or tuning), the model's sensitivity to key assumptions (how do outputs change under stressed scenarios?), and the model's fitness for the specific use case (does the model's output support the decisions being made?).
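The sensitivity evaluation in 4.4 — how do outputs change under stressed scenarios? — is often implemented as one-at-a-time perturbation of key assumptions. A minimal sketch, in which the model interface and shock structure are assumptions for illustration only:

```python
def sensitivity_analysis(model, base_inputs, shocks):
    """One-at-a-time sensitivity: shock each assumption in isolation and
    record the relative change in the model's output.

    model       -- callable taking a dict of inputs and returning a float
    base_inputs -- dict of baseline assumption values
    shocks      -- dict mapping input name -> multiplicative shock (e.g. 1.5)
    """
    base = model(base_inputs)
    report = {}
    for name, factor in shocks.items():
        stressed = dict(base_inputs)
        stressed[name] = stressed[name] * factor
        out = model(stressed)
        report[name] = (out - base) / base if base else float("inf")
    return report

# Toy expected-loss model: EL = PD x LGD x EAD
toy = lambda x: x["pd"] * x["lgd"] * x["ead"]
report = sensitivity_analysis(
    toy, {"pd": 0.03, "lgd": 0.4, "ead": 1_000_000}, {"pd": 2.0, "lgd": 1.25}
)
# report["pd"] ≈ 1.0 — doubling PD doubles expected loss in this toy model
```

The challenge function would compare such a report against the model's documented sensitivities; an output that moves far more than documented under a plausible shock is itself a finding.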
4.5. A conforming system MUST grant the model challenge function authority to restrict or suspend model use when material deficiencies are identified, without requiring approval from the model's development team or operational users.
4.6. A conforming system MUST track all challenge findings to documented resolution with defined SLAs — critical findings (model produces materially incorrect outputs) must be resolved or the model suspended within 5 business days; high findings (model has significant limitations not captured in documentation) within 20 business days; medium findings (model could be improved but is not materially flawed) within 60 business days.
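The SLA tiers in 4.6 lend themselves to a simple tracker. The sketch below computes business-day due dates and flags breaches; field names are illustrative, and a production implementation would also account for public holidays.

```python
from datetime import date, timedelta

SLA_BUSINESS_DAYS = {"critical": 5, "high": 20, "medium": 60}  # per 4.6

def add_business_days(start, days):
    """Advance a date by the given number of weekdays (holidays ignored)."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday-Friday
            days -= 1
    return current

def finding_status(severity, raised_on, today, resolved=False):
    """Return 'resolved', 'open', or 'breached' for a challenge finding."""
    due = add_business_days(raised_on, SLA_BUSINESS_DAYS[severity])
    if resolved:
        return "resolved"
    return "breached" if today > due else "open"
```

A critical finding raised on Monday 6 January 2025 falls due Monday 13 January; from 14 January it is breached and, per 4.6, the model should already be suspended.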
4.7. A conforming system SHOULD implement challenger models — independent models that perform the same function as the production model using different methodology, data, or assumptions — to provide a quantitative benchmark for challenge.
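The challenger-model benchmark in 4.7 can be operationalised as a divergence check over a shared evaluation set. A minimal sketch — the 20% tolerance is an illustrative assumption, not a prescribed value:

```python
def challenger_divergence(production_preds, challenger_preds):
    """Mean absolute relative divergence between production and challenger
    predictions over the same evaluation set."""
    pairs = list(zip(production_preds, challenger_preds))
    return sum(abs(p - c) / max(abs(c), 1e-12) for p, c in pairs) / len(pairs)

def requires_investigation(production_preds, challenger_preds, tolerance=0.20):
    """Flag the production model for challenge when it diverges from the
    challenger benchmark by more than the tolerance."""
    return challenger_divergence(production_preds, challenger_preds) > tolerance
```

In Scenario B terms: a challenger estimating losses 40% above production on flood-risk properties would breach any reasonable tolerance and trigger investigation before policies are priced.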
4.8. A conforming system SHOULD implement automated model monitoring that continuously tracks model performance metrics (accuracy, precision, recall, calibration, stability) and triggers challenge when metrics breach defined thresholds.
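The automated monitoring in 4.8 reduces, at its core, to comparing current metric values against minimum acceptable thresholds and raising a trigger when any breach. A deliberately minimal sketch (metric names and thresholds are illustrative):

```python
def breached_metrics(metrics, thresholds):
    """Compare current model performance metrics against minimum acceptable
    thresholds and return the breaching subset. A non-empty result should
    trigger an unscheduled challenge per 4.8."""
    return {
        name: value
        for name, value in metrics.items()
        if name in thresholds and value < thresholds[name]
    }
```

For example, with thresholds `{"recall": 0.85, "precision": 0.90}`, observed metrics `{"recall": 0.78, "precision": 0.93}` return `{"recall": 0.78}`, and the monitoring layer raises a challenge trigger with that evidence attached.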
4.9. A conforming system SHOULD maintain a model inventory that catalogues every model used by AI agents, including: model purpose, risk rating, last challenge date, next scheduled challenge, open findings, and model owner.
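The inventory fields in 4.9 map naturally onto a small record type. This sketch also derives an overdue-challenge flag and an automatic use restriction when a critical finding is open; all field names are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ModelRecord:
    """One entry in the 4.9 model inventory (field names illustrative)."""
    model_id: str
    purpose: str
    risk_rating: str               # e.g. "high", "medium", "low"
    owner: str
    last_challenge: date
    next_challenge: date
    open_findings: list = field(default_factory=list)  # finding severities

    def challenge_overdue(self, today):
        return today > self.next_challenge

    def use_restricted(self, today):
        """Restrict use when challenge is overdue or a critical finding is open."""
        return self.challenge_overdue(today) or "critical" in self.open_findings
```

Lifecycle governance then becomes a query over the inventory: any record whose `use_restricted` flag is set is withheld from the agent's decision chain until the finding is resolved or the challenge completed.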
4.10. A conforming system MAY implement automated stress testing that periodically evaluates model performance under extreme but plausible scenarios relevant to the model's domain (e.g., interest rate shocks, market crashes, pandemic-scale events).
Model challenge is the financial services industry's primary mechanism for ensuring that quantitative models remain fit for purpose. The PRA's supervisory statement SS1/23 (Model Risk Management Principles for Banks) establishes model risk management as a regulatory expectation for firms using models in material decision-making. The FCA's expectations under SYSC and the Senior Managers Regime create personal accountability for the adequacy of model governance. For AI agents, model challenge is not merely good practice — it is a regulatory expectation that, if absent, exposes the firm to enforcement action.
The specific challenge for AI agents is that the models underlying their behaviour are typically more complex, less interpretable, and more sensitive to distributional shifts than traditional financial models. A traditional credit scoring model might use 15–20 features in a logistic regression with well-understood statistical properties. An AI agent's underlying model might use hundreds or thousands of features in a deep learning architecture whose decision boundaries are not easily interpretable. This complexity makes independent challenge both more difficult and more important.
Model risk materialises through two primary channels: models that are wrong at deployment (due to overfitting, inappropriate assumptions, or inadequate validation) and models that become wrong over time (due to distributional shift, regime change, or concept drift). The first channel is addressed by pre-deployment validation. The second channel — which is particularly acute for AI agents operating in dynamic financial markets — is addressed by periodic challenge, trigger-based challenge, and continuous monitoring.
The financial consequences of model risk in AI agent operations are substantial and well-documented. The PRA estimated in its 2023 consultation that model risk in the UK banking sector accounts for potential losses in the billions of pounds annually. For AI agents making high-frequency decisions at scale, the amplification effect is significant: a model that is 1% wrong in a way that consistently favours the same direction creates systematic exposure that accumulates with every decision. An agent making 10,000 credit decisions per month with a 1% systematic error in default probability estimation generates measurable excess loss within one credit cycle.
The three lines of defence model — where the first line (model users) owns the risk, the second line (model challenge) provides independent oversight, and the third line (audit) provides assurance — is the established governance framework for model risk in financial services. AG-119 defines the requirements for the second line as it applies to AI agent models, ensuring that the challenge function has the independence, authority, and capability to perform its role effectively.
AG-119 requires an independent model challenge capability that can evaluate the full range of models used by AI agents in financial services. This includes traditional statistical models, machine learning models, deep learning models, and hybrid systems that combine multiple model types.
Recommended patterns:
Anti-patterns to avoid:
Banking (Credit Risk). PRA SS1/23 sets specific expectations for model risk management in banks, including: a model risk management framework, model inventory, independent validation, and ongoing monitoring. AG-119 requirements are designed to align with SS1/23 expectations. Banks using AI agents for credit decisions should map AG-119 requirements to their existing SS1/23 compliance framework, ensuring that AI agent models receive the same level of challenge as traditional credit models.
Asset Management. Investment models used by AI agents for portfolio construction, trade execution, or asset allocation are subject to challenge requirements under the AIFMD and UCITS management company obligations. The challenge should evaluate: model assumptions about asset correlations, expected returns, and risk factors; backtesting against historical periods including stress events; and sensitivity analysis to key parameters. For AI agents generating trade execution strategies, challenge should include transaction cost analysis validating that the execution model's predictions of market impact are accurate.
Insurance. Actuarial models used by AI agents for pricing, reserving, or capital modelling are subject to challenge under Solvency II (Technical Provisions, SCR calculation) and the actuarial function's review obligations. The challenge process should include actuarial peer review for models with actuarial content, and should evaluate whether the model's assumptions (mortality tables, morbidity rates, catastrophe frequencies) remain appropriate given recent experience and forward-looking projections.
Basic Implementation — A model inventory exists listing all models used by AI agents, with model owners assigned. Pre-deployment validation is conducted for all models before production use. Periodic challenge occurs on a defined schedule (at least annually for high-risk models). Challenge is conducted by individuals who were not involved in model development, though they may be in the same organisational unit. Challenge findings are documented and tracked. The challenge function can recommend restrictions but enforcement depends on management agreement. This level meets minimum compliance requirements but lacks full organisational independence and binding authority.
Intermediate Implementation — The model challenge function is organisationally independent with a defined charter granting authority to restrict or suspend models. Challenge includes conceptual soundness review, independent data testing, sensitivity analysis, and fitness-for-purpose evaluation. Challenger models exist for high-risk production models. Automated model monitoring tracks key performance metrics and triggers unscheduled challenge when thresholds are breached. Challenge findings are tracked with defined SLAs and escalation paths. The model inventory enforces lifecycle governance (models past challenge due date are flagged, models with unresolved critical findings are automatically restricted). The challenge function reports to the Model Risk Committee with a direct escalation path to the board.
Advanced Implementation — All intermediate capabilities plus: continuous automated model monitoring with real-time dashboards. Challenger models exist for all production models, with automated divergence detection triggering investigation. Stress testing evaluates model performance under extreme scenarios on a quarterly cycle. Cross-model interaction analysis evaluates whether multiple models used by the same agent or across agents create emergent risks that individual model challenge would not detect. External independent validation (third-party model validation) supplements internal challenge for critical models on a defined schedule. The organisation can demonstrate to regulators a complete model risk management framework that meets PRA SS1/23 expectations, with full audit trail of challenge activities, findings, and resolutions.
Required artefacts:
Retention requirements:
Access requirements:
Test 8.1: Challenge Independence Verification
Test 8.2: Challenge Authority Enforcement
Test 8.3: Trigger-Based Challenge Activation
Test 8.4: Challenger Model Divergence Detection
Test 8.5: Finding Resolution SLA Compliance
Test 8.6: Pre-Deployment Validation Gate
| Regulation | Provision | Relationship Type |
|---|---|---|
| PRA SS1/23 | Model Risk Management Principles for Banks | Direct requirement |
| FCA SYSC | 6.1.1R (Systems and Controls) | Direct requirement |
| EU AI Act | Article 9 (Risk Management System) | Direct requirement |
| EU AI Act | Article 17 (Quality Management System) | Supports compliance |
| MiFID II | Article 16(5) (Algorithmic Trading Controls) | Supports compliance |
| Solvency II | Articles 44, 48 (Risk Management and Actuarial Functions) | Supports compliance |
| DORA | Article 9 (ICT Risk Management Framework) | Supports compliance |
| NIST AI RMF | GOVERN 1.2, MAP 3.5, MEASURE 1.1, MANAGE 2.4 | Supports compliance |
SS1/23 establishes five principles for model risk management: (1) model identification and model risk classification, (2) governance, (3) model development, implementation, and use, (4) independent model validation, and (5) model risk mitigants. AG-119 directly implements Principle 4 (independent model validation) and supports Principles 1 (model inventory), 2 (governance through the challenge function charter), and 5 (risk mitigants through challenger models and monitoring). The PRA expects firms to have a model risk management framework that is proportionate to the nature, scale, and complexity of their model use. For firms deploying AI agents that rely on multiple models for financial decisions, the framework must cover all models in the agent's decision chain — including pre-processing, feature engineering, and post-processing models that may not be individually classified as "financial models" but collectively determine the agent's financial decisions.
Article 9 requires providers of high-risk AI systems to establish a risk management system that includes the identification and analysis of known and reasonably foreseeable risks, and the adoption of suitable risk management measures. For AI agents in financial services, model risk is a foreseeable risk that requires management measures. AG-119's model challenge requirements implement the risk management measures for model risk. Article 9 also requires that risk management measures be "tested with a view to identifying the most appropriate risk management measures" — mapping to the challenge function's evaluation of model performance under independent testing.
Article 17 requires providers to implement a quality management system that ensures compliance with the AI Act's requirements. For AI agents using models, the quality management system must include model validation and challenge processes. AG-119's model inventory, challenge schedule, finding management, and lifecycle governance directly contribute to the quality management system required by Article 17.
Article 44 requires insurance undertakings to have an effective risk management system, including model risk management. Article 48 requires the actuarial function to assess the quality of data used in the calculation of technical provisions and to inform the administrative, management, or supervisory body of the reliability and adequacy of the calculation. For AI agents using actuarial models in insurance operations, the actuarial function's review obligations map to AG-119's challenge requirements, with the additional expectation that actuarial models are challenged by qualified actuaries.
| Field | Value |
|---|---|
| Severity Rating | Critical |
| Blast Radius | Organisation-wide with potential systemic impact when model failures affect market behaviour or customer outcomes at scale |
Consequence chain: Model challenge failure allows flawed models to operate in production without detection. The consequence is not a single bad decision but a systematic pattern of bad decisions accumulating over the period between model deployment and deficiency detection — which, without challenge, may be months or years. A credit model that underestimates default probability by 3 percentage points generates excess losses on every loan approved during the period of model error. At 10,000 loans per month with an average exposure of £15,000, a 3 percentage point default underestimate generates approximately £4,500,000 per month in excess expected losses before recoveries — £54,000,000 per year. For pricing models, the exposure is underpriced risk; for fraud detection models, the exposure is undetected fraud; for investment models, the exposure is systematically suboptimal portfolio construction. The financial impact scales linearly with the volume of decisions made using the flawed model and the duration of the error. The blast radius extends beyond direct financial loss: regulators treat model risk management failure as a governance failure, triggering supervisory action, potential enforcement, and mandatory remediation that disrupts business operations. Under the Senior Managers Regime, the senior manager accountable for model risk management may face personal regulatory consequences including prohibition from holding senior management functions.
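The excess-loss arithmetic in the consequence chain can be checked directly. The sketch assumes loss given default of 100% (losses equal the full exposure at default); the figure scales down proportionally for lower recoveries-adjusted LGDs.

```python
def monthly_excess_expected_loss(loans_per_month, avg_exposure, pd_gap, lgd=1.0):
    """Excess expected loss from a systematic underestimate of default
    probability: volume x average exposure x PD gap x loss given default."""
    return loans_per_month * avg_exposure * pd_gap * lgd

monthly = monthly_excess_expected_loss(10_000, 15_000, 0.03)
# 10,000 loans x GBP 15,000 x 3pp at full LGD -> GBP 4,500,000 per month
```

Because the loss is linear in each factor, halving any one of volume, exposure, PD gap, or LGD halves the monthly figure — which is why the blast radius statement scales with decision volume and error duration.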
Cross-references: AG-119 provides the assurance layer for all models used across the sibling dimensions. AG-116 (Pre-Execution Risk Control Governance) relies on risk models for counterparty assessment, market impact estimation, and fraud scoring — AG-119 ensures these models are challenged and fit for purpose. AG-117 (Customer Outcome and Foreseeable Harm Monitoring Governance) may identify systematic outcome detriment whose root cause is a model deficiency — AG-119's challenge process investigates and remediates the model. AG-118 (Fair Treatment and Vulnerability Governance) uses vulnerability detection models and fairness assessment models that are themselves subject to AG-119 challenge — including evaluation of the vulnerability model's accuracy and the fairness model's methodology. AG-001 (Operational Boundary Enforcement) provides structural limits that are independent of model outputs, serving as a backstop when model challenge identifies a deficiency — the mandate limits constrain the agent's actions even when the models guiding those actions are flawed. AG-045 (Economic Incentive Alignment Verification) evaluates whether the agent's incentive structure creates pressure to use models in ways that maximise commercial outcomes rather than model accuracy. AG-011 (Action Reversibility and Settlement Integrity) determines the window within which model-driven decisions can be reversed when challenge identifies a deficiency.