Performance Scoring Fairness Governance requires that any AI agent involved in scoring, ranking, rating, or comparatively evaluating employee performance implements measurable fairness constraints, statistical bias detection, and affected-individual recourse mechanisms. Automated performance evaluation systems carry inherent risk of encoding and amplifying historical biases present in training data, proxy variables, and organisational feedback cultures — producing scores that systematically disadvantage employees along protected characteristic lines. This dimension mandates proactive fairness testing before deployment, continuous monitoring during operation, and documented remediation procedures when disparate impact is detected — ensuring that performance scores used to inform promotion, compensation, retention, and disciplinary decisions meet legal non-discrimination standards and organisational equity commitments.
Scenario A — Proxy Variable Encodes Gender Bias into Performance Scores: A logistics company with 4,200 employees deploys an AI agent to generate quarterly performance scores for warehouse and office staff. The model uses 38 input features including hours logged on-site, peer feedback sentiment, task completion velocity, and voluntary overtime frequency. Women — who represent 41% of the workforce — score on average 14% lower than men. Investigation reveals that "voluntary overtime frequency" correlates strongly with gender due to disproportionate caregiving responsibilities. The feature acts as a proxy for gender, penalising employees who cannot work unscheduled overtime. Over three quarters, 67 women are placed on performance improvement plans compared with 31 men in equivalent roles, and 23 women are denied promotion. A class-action employment tribunal claim alleges indirect sex discrimination under the Equality Act 2010. Settlement costs reach £1.8 million, with an additional £420,000 in legal fees and £290,000 in system remediation.
What went wrong: No proxy variable analysis was conducted before deployment. The "voluntary overtime" feature was never tested for correlation with protected characteristics. Continuous fairness monitoring was absent — the 14% scoring gap persisted for nine months before external legal action forced discovery. The system had no mechanism for affected employees to challenge their scores or understand the factors driving them.
Scenario B — Calibration Drift Creates Racial Disparate Impact: A professional services firm with 8,700 employees uses an AI agent to calibrate manager-assigned performance ratings across departments, ostensibly to eliminate managerial inconsistency. The calibration model is trained on five years of historical ratings. During those five years, two departments with the highest proportion of ethnic minority employees (Department A: 62% ethnic minority; Department B: 54% ethnic minority) had a managerial culture that systematically rated employees lower than comparable departments. The calibration model learns these departmental patterns as "ground truth." Post-calibration, employees in Departments A and B receive lower scores than employees with equivalent output metrics in other departments. The four-fifths rule is violated: ethnic minority employees receive "exceeds expectations" ratings at 58% the rate of white employees across the firm. The firm's annual compensation cycle distributes £2.3 million less in performance bonuses to ethnic minority employees than statistical parity would predict.
What went wrong: The training data encoded historical managerial bias. No baseline fairness audit was conducted before the model was deployed. The four-fifths rule violation was not detected because the firm did not implement demographic subgroup analysis. The calibration model was built to remove managerial inconsistency; in practice it standardised historical departmental bias into a firm-wide pattern.
Scenario C — Algorithmic Ranking Creates Disability Discrimination in Stack Ranking: A technology company with 3,100 employees implements AI-assisted stack ranking for its annual reduction-in-force process. The model ranks employees within peer groups using code commit frequency, meeting participation scores (derived from calendar and video call analytics), and internal communication volume. Employees with disabilities — particularly those with chronic fatigue conditions, visual impairments requiring assistive technology (which reduces commit frequency), and hearing impairments affecting meeting participation scores — are ranked disproportionately in the bottom quartile. Of 310 employees selected for redundancy, 47 have disclosed disabilities (15.2%), against a workforce disability prevalence of 8.1%. The resulting redundancy programme violates the duty to make reasonable adjustments under disability discrimination law. The company faces 23 individual tribunal claims averaging £34,000 each in compensation, plus reputational damage that increases attrition by 7% in the following quarter.
What went wrong: The input features — commit frequency, meeting participation, communication volume — were not assessed for disability-correlated disparate impact. No reasonable adjustment was made to the scoring model for employees with disclosed disabilities. The stack ranking was presented as objective and data-driven, discouraging managers from overriding algorithmically generated rankings even when they knew the rankings disadvantaged disabled team members.
Scope: This dimension applies to any AI agent that generates, modifies, calibrates, or materially influences numerical or categorical performance assessments of employees, contractors, gig workers, or any individual in an employment or quasi-employment relationship. This includes but is not limited to: performance scores, performance ratings, productivity indices, quality scores, behavioural ratings, competency assessments, peer comparison rankings, stack rankings, calibration adjustments to manager-assigned ratings, and composite scores that aggregate multiple performance indicators. The dimension applies regardless of whether the AI agent's output is the final performance assessment or an input to a human decision-maker's assessment. If the AI agent's output materially influences the assessment — defined as the human decision-maker adopting the AI output without substantive modification in more than 50% of cases — the full requirements of this dimension apply. Organisations that use AI agents solely to present performance data without scoring, ranking, or rating are subject to reduced requirements as noted in individual clauses.
4.1. A conforming system MUST conduct a pre-deployment fairness impact assessment for every performance scoring model, evaluating disparate impact across all protected characteristics recognised in the applicable jurisdiction(s), using both the four-fifths rule and at least one statistical significance test at the 95% confidence level.
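A minimal sketch of the two clause 4.1 checks, assuming favourable-outcome counts (e.g., "exceeds expectations" ratings) per subgroup are available. Function names and the example figures are illustrative, not mandated by this clause:

```python
from math import sqrt
from statistics import NormalDist

def four_fifths_ratio(hits_a: int, n_a: int, hits_b: int, n_b: int) -> float:
    """Ratio of subgroup A's favourable-outcome rate to reference group B's.
    A ratio below 0.8 indicates adverse impact under the four-fifths rule."""
    return (hits_a / n_a) / (hits_b / n_b)

def two_proportion_p_value(hits_a: int, n_a: int, hits_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference in rates (pooled z-test),
    compared against 0.05 for the 95% confidence level."""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (hits_a / n_a - hits_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Illustrative counts echoing Scenario B's 58% ratio:
ratio = four_fifths_ratio(hits_a=87, n_a=600, hits_b=150, n_b=600)  # ~0.58
p = two_proportion_p_value(87, 600, 150, 600)
disparate_impact = ratio < 0.8 or p < 0.05
```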
4.2. A conforming system MUST test all input features for proxy correlation with protected characteristics before deployment, flagging any feature with a Pearson or Spearman correlation coefficient exceeding 0.3 with a protected characteristic for mandatory review, justification, and — where the feature cannot be justified as job-related and consistent with business necessity — removal or mitigation.
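A sketch of the clause 4.2 proxy screen, assuming numeric features in a pandas DataFrame and a binary-encoded protected characteristic. The 0.3 threshold comes from the clause; everything else is illustrative:

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

REVIEW_THRESHOLD = 0.3  # mandated by clause 4.2

def flag_proxy_features(features: pd.DataFrame, protected: pd.Series) -> list[str]:
    """Return every feature whose Pearson or Spearman correlation with the
    protected characteristic exceeds the mandatory-review threshold."""
    flagged = []
    for name in features.columns:
        pearson, _ = pearsonr(features[name], protected)
        spearman, _ = spearmanr(features[name], protected)
        if max(abs(pearson), abs(spearman)) > REVIEW_THRESHOLD:
            flagged.append(name)
    return flagged
```

In Scenario A, a screen of this kind would have flagged "voluntary overtime frequency" before deployment rather than after three quarters of biased scores.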
4.3. A conforming system MUST implement continuous fairness monitoring that evaluates demographic subgroup score distributions at intervals no longer than one scoring cycle or one quarter, whichever is shorter, and generates automated alerts when any subgroup metric deviates from the overall population by more than a defined threshold.
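A sketch of the clause 4.3 deviation check. The clause leaves the threshold to the deploying organisation; the 5% relative deviation used here is purely illustrative:

```python
def subgroup_deviation_alerts(subgroup_means: dict[str, float],
                              population_mean: float,
                              rel_threshold: float = 0.05) -> list[str]:
    """Return the subgroups whose mean score deviates from the overall
    population mean by more than the defined relative threshold."""
    return [group for group, mean in subgroup_means.items()
            if abs(mean - population_mean) / population_mean > rel_threshold]
```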
4.4. A conforming system MUST provide every scored individual with a plain-language explanation of the factors that materially influenced their score, the relative weight of each factor, and the score's position relative to the relevant peer group distribution — delivered within 5 business days of score generation.
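A sketch of a clause 4.4 explanation payload, assuming per-factor score contributions (for example, SHAP values) are available from the model pipeline. Field names are illustrative:

```python
def build_explanation(score: float, peer_percentile: float,
                      contributions: dict[str, float], top_n: int = 3) -> dict:
    """Assemble the material factors, their relative weights, and the
    individual's position in the peer group distribution."""
    total = sum(abs(v) for v in contributions.values()) or 1.0
    top = sorted(contributions.items(), key=lambda kv: abs(kv[1]),
                 reverse=True)[:top_n]
    return {
        "score": score,
        "peer_percentile": peer_percentile,
        "factors": [
            {"factor": name,
             "relative_weight": round(abs(value) / total, 2),
             "direction": "raised your score" if value > 0 else "lowered your score"}
            for name, value in top
        ],
    }
```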
4.5. A conforming system MUST implement a contestation mechanism allowing any scored individual to challenge their performance score, triggering a documented review process that includes human re-assessment of the AI-generated score, completed within 20 business days of the challenge.
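A sketch of a clause 4.5 contestation record with the 20-business-day review deadline computed automatically. The status values and the NumPy business-day arithmetic are illustrative choices, not requirements:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Contestation:
    employee_id: str
    score_id: str
    filed: np.datetime64          # date the challenge was filed
    status: str = "open"          # "open" | "under_review" | "resolved"

    @property
    def review_deadline(self) -> np.datetime64:
        """Human re-assessment must complete within 20 business days."""
        return np.busday_offset(self.filed, 20, roll="forward")
```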
4.6. A conforming system MUST maintain a decision journal recording every performance score generated, the input data used, the model version, the timestamp, and the outcome (whether the score was adopted, modified, or overridden by a human reviewer), retained for the period specified in Section 7.
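A sketch of a clause 4.6 journal entry. The clause mandates the content, not the schema; field names are illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class ScoreJournalEntry:
    employee_id: str
    score: float
    model_version: str
    input_snapshot: dict    # the input data used to generate the score
    outcome: str            # "adopted" | "modified" | "overridden"
    reviewer_id: str | None = None
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```

Entries would be written to an append-only store and retained for the period specified in Section 7.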
4.7. A conforming system MUST halt scoring operations and trigger a mandatory remediation process when continuous monitoring detects a four-fifths rule violation or a statistically significant disparate impact at the 95% confidence level for any protected characteristic subgroup, resuming only after documented remediation and re-testing.
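A sketch of the clause 4.7 gate, reusing the clause 4.1 metrics. A scoring service would consult this before each run; the function name is illustrative:

```python
def scoring_permitted(four_fifths: float, p_value: float) -> bool:
    """Halt scoring when the four-fifths rule is violated or the disparity
    is statistically significant at the 95% confidence level."""
    return four_fifths >= 0.8 and p_value >= 0.05
```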
4.8. A conforming system SHOULD implement counterfactual fairness testing — evaluating whether an individual's score would change if their protected characteristics were different while all job-relevant attributes remained constant — as part of the pre-deployment fairness assessment.
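A sketch of the clause 4.8 test using a naive attribute flip, assuming the model accepts a binary-encoded protected characteristic as an input column for testing purposes. Full counterfactual fairness additionally requires a causal model, so that features downstream of the protected characteristic change consistently with the flip; this simpler check is a pre-deployment screen, not a complete implementation:

```python
import pandas as pd

def counterfactual_score_gaps(model, individuals: pd.DataFrame,
                              protected_col: str) -> pd.Series:
    """Score each individual as recorded and again with the protected
    characteristic flipped, holding all other features constant;
    return the per-individual score change."""
    baseline = model.predict(individuals)
    flipped = individuals.copy()
    flipped[protected_col] = 1 - flipped[protected_col]
    return pd.Series(model.predict(flipped) - baseline, index=individuals.index)
```

Non-trivial gaps indicate that the model is using the protected characteristic, or a proxy for it, to set scores.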
4.9. A conforming system SHOULD calibrate confidence intervals for generated scores and communicate the uncertainty range alongside the point score, so that decision-makers understand the precision of the assessment.
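A sketch of a clause 4.9 uncertainty range, assuming an ensemble (or bootstrap-resampled set) of models each producing a score for the same individual. The percentile method shown is one common choice:

```python
import numpy as np

def score_with_interval(member_scores: np.ndarray,
                        level: float = 0.95) -> tuple[float, float, float]:
    """Point score plus a percentile interval across ensemble members,
    communicating how precise the assessment actually is."""
    alpha = (1 - level) / 2
    lo, hi = np.percentile(member_scores, [100 * alpha, 100 * (1 - alpha)])
    return float(member_scores.mean()), float(lo), float(hi)
```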
4.10. A conforming system SHOULD implement differential fairness analysis across intersectional subgroups (e.g., ethnicity and gender combinations, age and disability status combinations) in addition to single-characteristic analysis.
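A sketch of the clause 4.10 intersectional analysis, assuming one row per employee in a pandas DataFrame with a binary favourable-outcome column. Column names are illustrative:

```python
import pandas as pd

def intersectional_impact(df: pd.DataFrame, outcome_col: str,
                          characteristics: list[str]) -> pd.DataFrame:
    """Favourable-outcome rate and four-fifths ratio for every
    intersectional subgroup (e.g., ethnicity x gender), measured
    against the best-off subgroup."""
    table = df.groupby(characteristics)[outcome_col].agg(["mean", "size"])
    table["four_fifths_ratio"] = table["mean"] / table["mean"].max()
    return table.rename(columns={"mean": "rate", "size": "n"})
```

Intersectional cells are smaller than single-characteristic groups, so the significance test from clause 4.1 matters more here: small cells produce noisy ratios.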
4.11. A conforming system MAY implement real-time scoring adjustment mechanisms that apply fairness constraints during score generation rather than relying solely on post-hoc detection and remediation.
4.12. A conforming system MAY provide scored individuals with access to a sandbox environment where they can explore how changes to controllable factors (e.g., skills certifications, project completions) would affect their projected score.
Automated performance scoring is among the highest-impact applications of AI in the employment context. Performance scores are not inert data points — they are consequential signals that directly drive promotion decisions, compensation adjustments, bonus allocations, access to development opportunities, and selection for redundancy. When these scores are biased, the downstream consequences cascade across every dimension of the employment relationship.
The legal landscape is unambiguous. Under the EU AI Act, AI systems used for employee evaluation and performance monitoring are classified as high-risk (Annex III, area 4), triggering the full requirements of Title III Chapter 2, including risk management (Article 9), data governance (Article 10), transparency (Article 13), and human oversight (Article 14). The UK Equality Act 2010 prohibits indirect discrimination: a facially neutral scoring model that produces disparate impact on a protected group is unlawful unless the employer can demonstrate proportionate justification. Title VII of the US Civil Rights Act of 1964 applies the four-fifths rule (codified in the EEOC Uniform Guidelines on Employee Selection Procedures) and the Griggs v. Duke Power burden-shifting framework: once disparate impact is shown, the employer must demonstrate business necessity, and the claimant may still prevail by showing that a less discriminatory alternative was available. The EU Employment Equality Directive (2000/78/EC) establishes parallel protections across EU member states.
The technical risk is equally clear. Machine learning models trained on historical performance data will reproduce the biases embedded in that data. If past performance ratings were influenced by managerial bias (and extensive organisational psychology research confirms they were — studies consistently show that identical work output receives different ratings depending on the evaluated individual's gender, race, and other characteristics), the model learns these biased patterns as ground truth. The model does not distinguish between legitimate performance differences and historically encoded bias. Moreover, the model may discover proxy variables — features that correlate with protected characteristics — and use them to reproduce discriminatory outcomes even when protected characteristics are excluded from the input feature set.
The organisational risk compounds the legal and technical risks. Employees who perceive performance scoring as unfair disengage, underperform, and leave. Research by the Chartered Institute of Personnel and Development consistently identifies perceived fairness of performance assessment as a top-three driver of employee engagement. An AI system that produces biased scores undermines the very performance culture it was designed to support — the organisation pays for the system, pays the compliance costs, and receives worse performance outcomes because employees do not trust or engage with the process.
Governance of performance scoring fairness therefore requires a multi-layered approach: pre-deployment testing to catch bias before it affects employees, continuous monitoring to detect drift and emergent bias, individual transparency to enable scrutiny, contestation mechanisms to provide recourse, and mandatory halt-and-remediate procedures when bias is detected. None of these layers alone is sufficient. Pre-deployment testing cannot anticipate all operational conditions; continuous monitoring cannot help employees who have already been scored unfairly; transparency without contestation is disclosure without remedy.
Performance scoring fairness governance requires organisations to embed fairness constraints into every stage of the performance scoring lifecycle: feature selection, model development, pre-deployment validation, operational monitoring, individual communication, and remediation.
Recommended patterns:
- Audit every input feature for proxy correlation with protected characteristics before deployment and again on every model update (4.2).
- Treat historical ratings as potentially biased training data: conduct a baseline fairness audit before using them as ground truth.
- Run demographic subgroup analysis, including intersectional combinations, at every scoring cycle, with automated alerting on deviation (4.3, 4.10).
- Pair every score with a plain-language explanation and a documented contestation route (4.4, 4.5).
- Define halt-and-remediate triggers before deployment so that detected bias automatically suspends scoring (4.7).
Anti-patterns to avoid:
- Excluding protected characteristics from the feature set and assuming the model is therefore fair; proxy variables reproduce the bias regardless (Scenario A).
- Calibrating against historical ratings without auditing them for encoded managerial bias; this standardises past discrimination rather than removing it (Scenario B).
- Presenting algorithmic rankings as objective and data-driven, discouraging managers from overriding them even when the rankings are known to be wrong (Scenario C).
- Relying solely on post-hoc detection, with no pre-deployment testing and no in-operation fairness constraints.
- Offering transparency without contestation: disclosure without remedy provides no recourse.
Financial services. Performance scoring in financial services frequently determines variable compensation (bonuses) that constitute a significant proportion of total remuneration. Biased scoring in this context has immediate and substantial monetary impact. FCA rules on remuneration governance (SYSC 19A/19D) require that variable remuneration is based on effective risk-adjusted performance assessment. Firms must demonstrate that AI-assisted performance scoring does not introduce the biases that the remuneration governance framework is designed to prevent.
Public sector. Public sector organisations face heightened scrutiny under public sector equality duties (e.g., the UK Public Sector Equality Duty under Section 149 of the Equality Act 2010), which require proactive advancement of equality of opportunity. This imposes an affirmative obligation beyond non-discrimination: the scoring system must be assessed for its impact on equality, and organisations must take steps to advance equality through the system's design. Published equality impact assessments are typically required.
Technology and knowledge work. Performance scoring in technology often relies on output metrics (code commits, tickets resolved, features shipped) that can disadvantage employees working on complex, long-term projects. Employees in mentoring, internal tooling, or research roles may score lower on volume-based metrics despite equivalent or greater contribution. Bias analysis must consider role-based and team-based confounders in addition to protected characteristics.
Basic Implementation — The organisation has conducted a pre-deployment fairness impact assessment covering all locally recognised protected characteristics using the four-fifths rule. Input features have been tested for proxy correlation. Scored individuals receive explanations of their scores. A contestation mechanism exists with defined timelines. Continuous monitoring tracks subgroup score distributions per scoring cycle. A halt-and-remediate procedure is documented. This level meets the mandatory requirements but relies primarily on human review and manual analysis.
Intermediate Implementation — All basic capabilities plus: the feature audit pipeline is automated and runs on every model update. The fairness dashboard provides real-time subgroup analysis with automated alerting. Intersectional subgroup analysis is conducted. Counterfactual fairness testing is included in pre-deployment assessment. Contestation outcomes are analysed for systematic patterns. Confidence intervals are communicated alongside scores. The organisation publishes anonymised fairness metrics to employee representative bodies.
Advanced Implementation — All intermediate capabilities plus: real-time fairness constraints are applied during score generation, not solely post-hoc. The organisation conducts independent third-party fairness audits annually. Fairness metrics are benchmarked against industry standards. The scoring model includes individual fairness analysis (similar individuals receive similar scores). Sandbox exploration is available to scored individuals. The organisation can demonstrate through longitudinal analysis that the scoring system has not produced cumulative disadvantage for any protected characteristic subgroup over multiple scoring cycles.
Required artefacts:
- Pre-deployment fairness impact assessment report covering all applicable protected characteristics (4.1).
- Proxy variable audit results for every input feature, including justification for any retained flagged feature (4.2).
- Continuous monitoring outputs, alert records, and threshold definitions (4.3).
- Copies of score explanations as delivered to scored individuals (4.4).
- Contestation records and review outcomes (4.5).
- The decision journal (4.6).
- Remediation and re-testing documentation for any halt event (4.7).
Retention requirements:
Access requirements:
Test 8.1: Pre-Deployment Fairness Impact Assessment Completeness
Test 8.2: Proxy Variable Detection Verification
Test 8.3: Continuous Monitoring Alert Trigger
Test 8.4: Score Explanation Delivery and Comprehensibility
Test 8.5: Contestation Mechanism Functionality
Test 8.6: Decision Journal Completeness
Test 8.7: Halt-and-Remediate Enforcement
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 9 (Risk Management System), Annex III Area 4 (Employment, Workers' Management) | Direct requirement |
| EU AI Act | Article 10 (Data and Data Governance) | Direct requirement |
| EU AI Act | Article 14 (Human Oversight) | Direct requirement |
| SOX | Section 404 (Internal Controls Over Financial Reporting) | Supports compliance |
| FCA SYSC | 19A.3.3R, 19D.3.28R (Performance Assessment for Remuneration) | Direct requirement |
| NIST AI RMF | MAP 2.3, MEASURE 2.6, MANAGE 1.3 | Supports compliance |
| ISO 42001 | Clause 6.1 (Actions to Address Risks), Annex B.5 (Data for AI) | Supports compliance |
| DORA | Article 9 (ICT Risk Management Framework) | Supports compliance |
The EU AI Act explicitly classifies AI systems used in employment for "evaluation and monitoring of performance and behaviour" as high-risk (Annex III, paragraph 4(b)). This triggers the full Chapter 2 requirements. Article 9 mandates a risk management system that identifies foreseeable risks of the AI system — discriminatory scoring is a foreseeable risk that must be identified and mitigated. Article 10 requires that training data is examined for biases and that appropriate bias detection and correction measures are applied. Article 14 requires human oversight with the ability to override or reverse AI decisions. AG-511's requirements for pre-deployment fairness assessment (4.1), proxy variable detection (4.2), continuous monitoring (4.3), and contestation (4.5) directly implement these Article requirements.
When performance scores influence variable compensation for employees involved in financial reporting, the integrity of the scoring process becomes material to SOX compliance. A biased scoring model that systematically overpays or underpays performance-based compensation creates a control weakness in the financial reporting chain. AG-511's decision journal (4.6) and continuous monitoring (4.3) support the internal control documentation and testing requirements of Section 404.
FCA rules require that variable remuneration for Remuneration Code staff and material risk-takers is based on risk-adjusted performance assessed against both financial and non-financial criteria. SYSC 19A.3.3R and 19D.3.28R require effective performance assessment processes. An AI system that produces biased performance scores undermines the regulatory objective. AG-511 ensures that AI-assisted performance scoring used for remuneration purposes is fair, monitored, and contestable.
MAP 2.3 addresses documenting the AI system's intended benefits and potential harms, including discriminatory outcomes. MEASURE 2.6 addresses bias testing and evaluation across demographic groups. MANAGE 1.3 addresses response and recovery actions when risks materialise. AG-511 maps directly: pre-deployment assessment (MAP 2.3), continuous monitoring (MEASURE 2.6), and halt-and-remediate (MANAGE 1.3).
Clause 6.1 requires organisations to determine risks and opportunities related to their AI management system. Annex B.5 addresses data quality and bias management. AG-511's proxy variable analysis and fairness testing directly support ISO 42001 compliance by providing specific, testable implementations of these high-level requirements.
Article 9 requires financial entities to implement ICT risk management frameworks that identify, manage, and monitor ICT risks. An AI performance scoring system used within a financial entity constitutes ICT, and bias in such a system constitutes an ICT risk. AG-511 ensures that this specific risk is identified, monitored, and managed.
| Field | Value |
|---|---|
| Severity Rating | Critical |
| Blast Radius | Organisation-wide — affects every scored employee and all downstream decisions (promotion, compensation, retention, disciplinary action) that depend on performance scores |
Consequence chain: Biased performance scores propagate through every downstream employment decision. The immediate technical failure is disparate impact in score distributions — one or more protected characteristic subgroups receive systematically lower scores than their job performance warrants. The first-order operational consequence is biased decision-making: affected employees are denied promotions, receive lower bonuses, are placed on performance improvement plans, or are selected for redundancy at disproportionate rates. The second-order consequence is legal liability: employment discrimination claims (individual and class-action), regulatory enforcement actions under the EU AI Act for non-compliant high-risk AI deployment, and FCA enforcement for deficient remuneration governance in financial services. The third-order consequence is organisational: loss of employee trust in performance management, reduced engagement and productivity among affected populations, increased attrition (particularly among high-performing members of disadvantaged subgroups who have the most external options), and reputational damage that impairs talent acquisition. The financial impact is compounded: remediation costs (system rebuild, retrospective score correction, affected-employee compensation) plus legal costs (settlements, tribunal awards, regulatory fines) plus operational costs (productivity loss, attrition replacement). In the scenarios described in Section 3, individual instances range from £780,000 to £2.5 million. For organisations with large workforces, a systemic scoring bias affecting thousands of employees over multiple cycles can produce eight-figure liabilities.
Cross-references: AG-049 (Explainability Governance), AG-022 (Behavioural Drift Detection), AG-509 (Hiring Decision Contestability Governance), AG-512 (Pay and Scheduling Fairness Governance), AG-517 (Disciplinary Action Review Governance), AG-452 (Counterfactual Explanation Governance), AG-415 (Decision Journal Completeness Governance), AG-442 (Confidence Calibration Interface Governance).