Benefit Realisation Tracking Governance requires organisations to measure whether the efficiency, quality, control, or cost gains promised at the time of agent approval are actually achieved in operation. Every agent was approved based on a projected benefit — faster processing, reduced error rates, lower costs, improved customer experience, better compliance coverage. This dimension requires that those projections are tracked against actuals, that shortfalls are investigated and acted upon, and that the benefit data feeds back into the approval and sunset review processes to improve future decision-making. An agent that was approved based on projected savings of £200,000 per year but delivers only £40,000 per year is consuming governance resources, risk capacity, and portfolio space without proportionate justification.
Scenario A — Projected Savings Never Materialised: A mid-size insurer deploys an AI agent to automate claims triage, projecting that the agent will reduce average triage time from 45 minutes to 8 minutes and enable redeployment of 6 FTEs (annual saving: £312,000). Twelve months later, the agent has reduced average triage time to 22 minutes — a meaningful improvement but less than half the projected benefit. The 6 FTEs have not been redeployed; they now spend their time reviewing agent triage decisions rather than performing triage themselves. The actual saving is approximately £40,000 per year (reduced overtime). No one has compared projected versus actual benefits because no tracking mechanism exists. The agent continues to be cited in board presentations as delivering £312,000 in annual savings.
What went wrong: The projected benefit was used to justify approval but was never tracked against actuals. The gap between projection and reality was not detected, investigated, or reported. The board believes the agent is delivering nearly eight times the value it actually delivers. Consequence: £272,000 annual shortfall against projections, misallocated resources (6 FTEs reviewing rather than redeployed), distorted board understanding of agent portfolio value, and future agent approvals based on similarly optimistic projections that will not be validated.
Scenario B — Benefit Eroded Over Time Without Detection: A legal department deploys an AI agent to draft standard contract clauses, projecting 70% time saving on first drafts. In the first quarter, the agent achieves 65% time saving — close to projection. Over the next 12 months, model updates change the agent's drafting style, legal requirements evolve, and the organisation's clause library expands. By month 18, the time saving has declined to 25% because lawyers spend more time correcting the agent's drafts to match current requirements. No benefit tracking mechanism detects the decline.
What went wrong: Benefit was measured at deployment (or shortly after) but not continuously. The gradual erosion of benefit was invisible because no tracking mechanism existed to detect trends. Consequence: by month 18, the agent delivers less than half its projected value, but governance resources, risk exposure, and cost continue at the originally projected level. A sunset review (AG-254) without benefit data would not detect the decline.
Scenario C — Benefit Tracking Reveals Unexpected Value: A customer service team deploys an AI agent to handle routine billing queries, projecting a 40% reduction in call centre volume. After 6 months, the call centre volume reduction is only 22%. However, benefit tracking also reveals unexpected positive outcomes: customer satisfaction scores for billing queries handled by the agent are 18% higher than human-handled queries, and the agent's interaction logs have identified 3 systematic billing errors in the organisation's invoicing system that human agents had not detected. The tracked benefits — while different from projections — provide a more nuanced justification for continued operation and inform the sunset review with data-driven evidence.
What went right: Benefit tracking captured not just the projected metric but broader value indicators. The organisation could make an informed decision about continued operation based on actual evidence rather than the original projection alone.
Scope: This dimension applies to every AI agent approved under AG-249 that included projected benefits as part of the approval justification. If an agent was approved without any projected benefit — which should be rare, as most approvals require a business case — this dimension requires that measurable benefit criteria be defined retrospectively within 90 days of deployment. The scope extends to all categories of benefit: financial (cost savings, revenue generation), operational (time savings, throughput improvement, error reduction), quality (accuracy improvement, consistency, customer satisfaction), and strategic (competitive advantage, regulatory compliance improvement, risk reduction).
4.1. A conforming system MUST define measurable benefit criteria for each deployed agent at the time of approval, including specific metrics, baseline measurements, projected values, and measurement methodology.
4.2. A conforming system MUST measure actual benefits against projections at defined intervals — at minimum, at 6 months post-deployment and at each sunset review (AG-254).
4.3. A conforming system MUST investigate benefit shortfalls exceeding a defined threshold (e.g., actual benefits below 50% of projected benefits) and determine whether the shortfall is remediable or indicates that the agent should be retired or modified.
4.4. A conforming system MUST report benefit realisation data to the governance body at least annually, showing projected versus actual benefits across the portfolio.
4.5. A conforming system MUST feed benefit realisation data into the sunset review process (AG-254) as a mandatory input to the re-approval decision.
4.6. A conforming system SHOULD capture the total cost of agent operation — not just direct costs (API usage, hosting) but also governance costs (monitoring, testing, audit) — to enable net benefit calculation.
4.7. A conforming system SHOULD track benefit trends over time to detect gradual erosion before it becomes critical.
4.8. A conforming system SHOULD use historical benefit realisation data to calibrate projections for future agent approvals — organisations that consistently over-project benefits should apply a calibration factor to future projections.
4.9. A conforming system MAY implement a benefit realisation dashboard showing real-time or near-real-time benefit metrics for each agent and the portfolio as a whole.
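The shortfall trigger in 4.3 and the net benefit calculation in 4.6 can be sketched as follows. This is an illustrative sketch only: the function names are assumptions, and the 50% threshold is the example value from 4.3, not a normative figure.

```python
# Illustrative sketch of requirements 4.2, 4.3, and 4.6.
# Function names and the 0.5 threshold are assumptions, not part of AG-255.

def realisation_rate(projected: float, actual: float) -> float:
    """Fraction of the projected benefit actually delivered (4.2)."""
    if projected <= 0:
        raise ValueError("projected benefit must be positive")
    return actual / projected

def needs_investigation(projected: float, actual: float,
                        threshold: float = 0.5) -> bool:
    """4.3: flag shortfalls where actuals fall below the defined
    fraction (here, the example 50%) of projected benefits."""
    return realisation_rate(projected, actual) < threshold

def net_benefit(gross_benefit: float, direct_costs: float,
                governance_costs: float) -> float:
    """4.6: net benefit deducts governance costs (monitoring,
    testing, audit), not just direct API and hosting costs."""
    return gross_benefit - direct_costs - governance_costs

# Scenario A figures: £312,000 projected, roughly £40,000 actual.
print(needs_investigation(312_000, 40_000))  # True — ~13% realised
```

Applied to Scenario A, the check fires immediately: a 13% realisation rate is well below any plausible threshold, which is exactly the detection the insurer lacked.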
Agent deployments are investments. Like all investments, they should be tracked against their projected returns. The business case that justified the approval included specific projections — time savings, cost reductions, quality improvements. If those projections are not tracked, the organisation has no way to know whether its agent portfolio is delivering value proportionate to its cost and risk.
The absence of benefit tracking creates two problems. First, underperforming agents persist because no one measures their underperformance. They consume governance resources, occupy portfolio capacity, and create risk exposure without delivering proportionate value. Second, future agent approvals are based on uncalibrated projections. If the organisation's agents consistently deliver 40% of projected benefits, but projections are never compared to actuals, every future approval is evaluated against similarly inflated projections. The result is a growing portfolio of agents justified by optimistic projections that are never validated.
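The calibration idea in 4.8 can be made concrete. The sketch below is a minimal illustration: the choice of the median historical realisation rate as the calibration statistic is an assumption, and an organisation may reasonably prefer a mean, a weighted average, or per-category factors.

```python
# Hedged sketch of the projection calibration pattern (requirement 4.8).
# Using the median historical realisation rate is an assumption.

from statistics import median

def calibration_factor(historical_rates: list[float]) -> float:
    """Median of past actual/projected ratios across the portfolio."""
    if not historical_rates:
        return 1.0  # no history yet: take projections at face value
    return median(historical_rates)

def calibrated_projection(raw_projection: float,
                          historical_rates: list[float]) -> float:
    """Discount a new business-case projection by the portfolio's
    demonstrated tendency to over-project."""
    return raw_projection * calibration_factor(historical_rates)

# A portfolio that has historically delivered 35-45% of projected
# benefits would discount a new £200,000 projection accordingly:
rates = [0.35, 0.40, 0.45]
print(calibrated_projection(200_000, rates))  # 80000.0
```

The point is not the arithmetic but the feedback loop: approval decisions are evaluated against what the organisation has actually achieved, not against what business cases habitually promise.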
Benefit tracking also provides essential input to the sunset review (AG-254). A sunset review without benefit data is assessing whether to continue operating the agent based on whether it is "working" (functional) rather than whether it is "worth it" (delivering proportionate value). AG-255 provides the "worth it" data.
This dimension connects to AG-045 (Economic Incentive Alignment Verification) because the economic benefit is a key tracking dimension. It connects to AG-251 (Strategic Fit and Substitution Governance) because benefit shortfalls may indicate that a simpler alternative has become preferable. It connects to AG-257 (Use-Case Prioritisation Governance) because benefit realisation data from existing agents should inform prioritisation of future agent investments.
Benefit tracking must be designed into the agent lifecycle from the start — not retrofitted after deployment. The key to effective benefit tracking is defining measurable criteria before deployment, establishing baselines before the agent changes the process, and measuring consistently.
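A minimal sketch of the record that 4.1 calls for, populated with Scenario A's figures, might look like the following. The dataclass shape, field names, and dates are illustrative assumptions; what matters is that metric, baseline, projection, and methodology are all captured before deployment.

```python
# Illustrative benefit-criterion record (requirement 4.1).
# Field names and dates are assumptions for illustration only.

from dataclasses import dataclass, field
from datetime import date

@dataclass
class BenefitCriterion:
    """Defined at approval time, before the agent changes the process."""
    metric: str                  # e.g. "average triage time (minutes)"
    baseline: float              # measured BEFORE deployment
    projected: float             # value the business case commits to
    methodology: str             # how the metric is measured
    baseline_date: date
    measurements: list[tuple[date, float]] = field(default_factory=list)

# Scenario A, as it should have been recorded at approval:
triage_time = BenefitCriterion(
    metric="average triage time (minutes)",
    baseline=45.0,
    projected=8.0,
    methodology="mean over all claims triaged in the measurement month",
    baseline_date=date(2024, 1, 1),  # hypothetical date
)
# Twelve months on, the 6-month and annual measurements (4.2) append here:
triage_time.measurements.append((date(2025, 1, 1), 22.0))
```

With the baseline and projection recorded up front, the twelve-month measurement of 22 minutes is immediately legible as a shortfall rather than being lost, as it was in Scenario A.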
Recommended patterns:
Anti-patterns to avoid:
Financial Services. Financial services firms should align agent benefit tracking with their existing return-on-investment frameworks for technology investments. The benefit metrics should be auditable and should align with financial reporting standards. For agents that affect customer outcomes (advice, product recommendation, complaint handling), customer outcome metrics should be tracked alongside financial benefits.
Healthcare. Healthcare benefit tracking should include clinical outcome metrics where applicable. A clinical decision support agent's benefit is not just time saving — it includes diagnostic accuracy, appropriate referral rates, and patient outcome improvements. These metrics require clinical governance oversight and may require longer measurement periods (12-24 months) to achieve statistical significance.
Public Sector. Public sector benefit tracking should include value-for-money metrics required by HM Treasury's Green Book and Managing Public Money guidance. Benefits should be categorised as cash-releasing (actual budget savings), non-cash-releasing (time savings that improve service quality but do not reduce headcount), and qualitative (improved citizen experience, better compliance). The National Audit Office will expect to see benefit realisation evidence for significant agent deployments.
Basic Implementation — Projected benefits are stated in the use-case approval but are not tracked post-deployment. No baseline measurements are established before deployment. Benefit data is not available at sunset reviews. The organisation cannot quantify the actual value delivered by its agent portfolio. This level creates a record of intent but provides no accountability.
Intermediate Implementation — Measurable benefit criteria with baselines are established before deployment. Benefits are measured at 6 months, 12 months, and at each sunset review. Shortfall thresholds trigger investigation. Net benefit calculations include governance costs. Benefit data is a mandatory input to sunset reviews. Portfolio benefit reports are provided to the governance body annually. Realisation rates are tracked and reported.
Advanced Implementation — All intermediate capabilities plus: real-time benefit dashboards for individual agents and the portfolio. Benefit trend tracking detects erosion before threshold breach. Projection calibration adjusts future projections based on historical realisation rates. Benefit data feeds into AG-257 (Use-Case Prioritisation Governance) to inform future investment decisions. The organisation can demonstrate the net return on its agent portfolio to the board with the same rigour as any other technology investment.
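The trend tracking that distinguishes the advanced level (and requirement 4.7) can be as simple as a slope over periodic measurements. The sketch below uses an ordinary least-squares slope on Scenario B's declining time saving; a real implementation would likely use more robust statistics and confidence intervals.

```python
# Minimal trend-erosion sketch (requirement 4.7). The least-squares
# slope is one simple choice of statistic, not a prescribed method.

def benefit_slope(series: list[float]) -> float:
    """Least-squares slope of benefit measurements per period.
    A sustained negative slope signals erosion before any single
    measurement breaches the shortfall threshold in 4.3."""
    n = len(series)
    if n < 2:
        return 0.0
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(series) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, series))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Scenario B: time saving declining from 65% toward 25% over six quarters.
quarterly_saving = [0.65, 0.58, 0.50, 0.42, 0.33, 0.25]
print(round(benefit_slope(quarterly_saving), 3))  # -0.081
```

A slope of roughly minus eight percentage points per quarter would have flagged the legal department's agent well before month 18, when the erosion finally halved its value.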
Required artefacts:
Retention requirements:
Access requirements:
Test 8.1: Benefit Specification Completeness at Approval
Test 8.2: Baseline Measurement Timing
Test 8.3: Shortfall Investigation Trigger
Test 8.4: Sunset Review Benefit Data Inclusion
Test 8.5: Net Benefit Calculation Accuracy
| Regulation | Provision | Relationship Type |
|---|---|---|
| HM Treasury | Managing Public Money — Value for Money | Direct requirement |
| HM Treasury | Green Book — Appraisal and Evaluation | Direct requirement |
| EU AI Act | Article 9 (Risk Management System) | Supports compliance |
| ISO 42001 | Clause 9.1 (Monitoring, Measurement, Analysis) | Supports compliance |
| NIST AI RMF | MAP 2.2, MANAGE 4.2 | Supports compliance |
| FCA Consumer Duty | Outcome Monitoring | Supports compliance |
| NAO Expectations | Technology Investment Reporting | Supports compliance |
Managing Public Money requires that public expenditure delivers value for money. For public sector organisations deploying AI agents, this means the projected benefits must be tracked against actuals and the investment must be justified by demonstrable returns. The National Audit Office will expect to see benefit realisation evidence for any significant agent deployment. A public sector organisation that cannot demonstrate what benefit its agents have delivered is vulnerable to a value-for-money challenge.
The Green Book requires that government investments are appraised before commitment and evaluated after implementation. Benefit realisation tracking is the evaluation component — it closes the loop between the appraisal (which projected benefits) and the operational reality (which may differ). The Green Book explicitly requires post-implementation review to "learn lessons for future appraisals" — this maps directly to the projection calibration pattern.
The Consumer Duty requires firms to monitor whether their products and services deliver good outcomes for customers. For agents that interact with customers (advice agents, complaint handling agents, onboarding agents), benefit tracking should include customer outcome metrics: are customers receiving better advice, faster complaint resolution, more appropriate products? Benefit tracking for customer-facing agents is a Consumer Duty compliance tool, not just an operational efficiency measure.
| Field | Value |
|---|---|
| Severity Rating | Medium |
| Blast Radius | Portfolio-wide — underperforming agents collectively waste resources and distort strategic decision-making |
Consequence chain: Without benefit tracking, the organisation cannot distinguish high-value agents from low-value agents. Resources are allocated based on projections rather than evidence. Underperforming agents persist because no one measures their underperformance. Overperforming agents receive no additional investment because no one measures their overperformance. The portfolio becomes an unmanaged collection of investments with unknown returns. The strategic consequence is that future agent investments are approved based on uncalibrated projections — each new agent is justified by optimistic projections that will never be verified, creating a cycle of over-investment in a technology category whose actual returns are unknown. The financial consequence is quantifiable: the gap between projected and actual benefits, multiplied across the portfolio, represents wasted expenditure. For public sector organisations, this waste is a failure of the value-for-money obligation.
Cross-references: AG-249 (Use-Case Approval Governance) is where benefit projections are first defined. AG-254 (Sunset Review Governance) uses benefit realisation data as a mandatory input to re-approval decisions. AG-251 (Strategic Fit and Substitution Governance) is relevant because benefit shortfalls may indicate that a simpler alternative has become preferable. AG-045 (Economic Incentive Alignment Verification) ensures that economic benefits include all cost categories. AG-257 (Use-Case Prioritisation Governance) uses benefit data to inform future investment priorities.