AG-676

Face and Voice Similarity Threshold Governance

Biometrics, Emotion & Identity Analytics · AGS v2.1 · April 2026
Regulatory context: EU AI Act · GDPR · FCA · NIST

2. Summary

Face and Voice Similarity Threshold Governance requires organisations deploying AI agents that make decisions based on biometric similarity scores to formally set, validate, document, and periodically recalibrate the numerical thresholds at which those agents accept, reject, or escalate a biometric match. Similarity-based biometric systems — facial recognition, speaker verification, voiceprint matching — produce continuous confidence scores, not binary outcomes. The threshold that converts a continuous score into a binary accept/reject decision is one of the most consequential parameters in the entire system, because it directly determines the false match rate (FMR) and false non-match rate (FNMR) experienced by every person the system encounters. A threshold set too permissively increases false matches, enabling identity fraud or — in law enforcement contexts — wrongful identification of innocent individuals. A threshold set too restrictively increases false non-matches, locking legitimate users out of their own accounts or denying them access to services they are entitled to use. This dimension mandates that threshold selection is a governed decision, not a default inherited from a vendor model, and that thresholds are validated against disaggregated demographic data to ensure that error rates are equitable across populations.
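For concreteness, the sketch below (illustrative only; the scores and threshold values are hypothetical, not drawn from this protocol) shows how a single accept threshold converts continuous similarity scores into decisions, and how FMR and FNMR follow directly from that choice.

```python
# Illustrative sketch: how one threshold value determines FMR and FNMR.
# The comparison scores and candidate thresholds below are hypothetical.

def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (FMR, FNMR) at a given accept threshold.

    FMR:  fraction of impostor comparisons accepted (score >= threshold).
    FNMR: fraction of genuine comparisons rejected (score < threshold).
    """
    fmr = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    fnmr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return fmr, fnmr

genuine = [0.91, 0.88, 0.79, 0.95, 0.83, 0.70, 0.93]   # same-person comparisons
impostor = [0.41, 0.55, 0.73, 0.38, 0.62, 0.48, 0.51]  # different-person comparisons

for t in (0.60, 0.72, 0.85):
    fmr, fnmr = error_rates(genuine, impostor, t)
    print(f"threshold={t:.2f}  FMR={fmr:.3f}  FNMR={fnmr:.3f}")
```

Moving the threshold up or down trades one error rate against the other; the governance question is who bears each error and at what rate.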

3. Example

Scenario A -- Facial Recognition Threshold Causes Wrongful Arrests: A metropolitan police department deploys an AI agent that compares surveillance camera captures against a database of persons of interest. The vendor delivers the system with a default similarity threshold of 0.72 on a 0-to-1 scale, calibrated on the vendor's internal benchmark dataset. The department accepts the default without independent validation. Over 14 months, the system generates 3,847 candidate match alerts. Officers investigate 1,204 of these alerts, leading to 43 field stops and 9 arrests. Of the 9 arrests, 3 are later determined to be wrongful — the arrested individuals were not the persons of interest. All three wrongfully arrested individuals are Black men, detained for between 4 and 11 hours before identification errors are confirmed. An independent audit reveals that the vendor's benchmark dataset was 78% Caucasian and 6% Black, meaning the 0.72 threshold was validated primarily against light-skinned subjects. When tested on a demographically representative dataset, the FMR for Black male subjects at the 0.72 threshold is 0.085 — roughly 10 times the 0.008 FMR for Caucasian male subjects. Raising the threshold to 0.89 would have equalised the FMR across demographic groups at approximately 0.006, but would also have increased the FNMR for all groups. The department had no documented threshold selection rationale, no disaggregated error rate analysis, and no periodic recalibration process. The resulting litigation costs exceed $4.2 million, the department suspends the programme, and the city council passes an ordinance requiring demographic impact assessment for any future biometric deployment.

What went wrong: The department accepted a vendor default threshold without validating it against a representative population. No disaggregated analysis was performed to determine whether the threshold produced equitable error rates across demographic groups. The threshold was calibrated on a dataset that was not representative of the population the system would encounter in deployment. No governance process required threshold documentation or periodic recalibration. The consequence was a 10x disparity in false match rates that translated directly into wrongful arrests of members of an already over-policed community.

Scenario B -- Voice Authentication Threshold Locks Out Elderly and Non-Native Speakers: A national bank deploys an AI agent for telephone banking authentication using voiceprint verification. The system compares a caller's voice against a stored voiceprint template and grants access if the similarity score exceeds 0.81. The threshold is selected to achieve a target FMR of 0.001 (one accepted fraudulent attempt per thousand impostor attempts) based on vendor-supplied test data. Within six months, the bank's customer complaint system registers 2,340 complaints from customers unable to authenticate, a 340% increase over the pre-deployment baseline. Analysis reveals that the FNMR is severely non-uniform across demographic groups: 1.2% for native English speakers aged 25-45, 4.8% for native English speakers aged 65 and older, 7.3% for non-native English speakers regardless of age, and 12.1% for customers with speech impediments or medical conditions affecting voice (Parkinson's disease, post-stroke dysarthria). The bank's elderly customers are three to four times more likely to be locked out of their own accounts than younger customers. Non-native speakers are six times more likely. Customers with speech-affecting medical conditions are ten times more likely. Because the bank simultaneously decommissioned its legacy PIN-based authentication as a cost-saving measure, locked-out customers have no alternative channel and must visit a branch in person — a disproportionate burden on elderly and disabled customers who may have limited mobility. The bank faces complaints to the financial ombudsman, a regulatory inquiry into accessibility obligations under equality legislation, and reputational damage after a national newspaper reports that a 79-year-old Parkinson's patient was locked out of her account for 11 days.

What went wrong: The bank validated the threshold against an aggregate FMR target without disaggregating the FNMR by demographic group. The vendor test data did not represent the bank's actual customer population, particularly older customers and non-native speakers whose voice characteristics exhibit greater variability. The threshold was optimised for fraud prevention (minimising FMR) without adequate consideration of the access impact (FNMR) on vulnerable populations. No fallback authentication channel was maintained for customers who could not pass voiceprint verification. No ongoing monitoring of FNMR by demographic group was implemented.

Scenario C -- Border Control Facial Matching Threshold Causes Systematic Delays: A national border agency deploys automated e-gates at international airports. The AI agent compares a traveller's live face capture against their passport chip photograph. The initial threshold is set at 0.78, balancing throughput against security. Within three months, the agency observes that e-gate rejection rates (requiring travellers to proceed to manual inspection) vary dramatically: 2.1% for European passport holders aged 20-50, 8.9% for East Asian passport holders, 11.4% for travellers over 65 (whose passport photos may be up to 10 years old), and 14.7% for women wearing hijab where the visible facial area is reduced. The disproportionate rejection rates create visible queuing disparities at the manual inspection desks, with certain demographic groups consistently directed to secondary inspection while others pass through e-gates unimpeded. Media coverage frames the disparity as discriminatory profiling. An independent technical review determines that a single global threshold cannot achieve equitable FNMR across all demographic groups because the underlying similarity score distributions differ by skin tone, age differential between the live capture and the stored photograph, and facial coverage area. The review recommends either demographic-specific thresholds (which raise legal and ethical concerns about differential treatment) or a composite scoring approach that normalises for known confounding variables before applying a single threshold.

What went wrong: The agency applied a single global threshold without analysing the similarity score distributions across demographic groups. The threshold that produced acceptable overall rejection rates produced severely inequitable group-specific rejection rates. No governance process required pre-deployment disaggregated analysis. No ongoing monitoring detected the disparate rejection rates until media attention forced a review. The agency had no documented framework for making the trade-off between security (FMR), convenience (FNMR), and equity (cross-group FNMR parity).

4. Requirement Statement

Scope: This dimension applies to any AI agent deployment where decisions — access control, identity verification, authentication, candidate identification, surveillance matching, or any other consequential determination — are made by applying a numerical threshold to a biometric similarity score derived from face, voice, or combined face-and-voice comparison. The scope includes both one-to-one verification (comparing a probe against a claimed identity template) and one-to-many identification (comparing a probe against a gallery of templates). The scope covers all deployment contexts: law enforcement, border control, financial services authentication, physical access control, customer onboarding, and any other context where a similarity threshold converts a continuous score into a binary or categorical decision. The scope extends to agents that use similarity scores as one input among several in a decision pipeline, provided the threshold applied to the similarity score materially affects the decision outcome.

4.1. A conforming system MUST document a formal Threshold Selection Rationale for every biometric similarity threshold used in production, specifying the chosen threshold value, the target FMR and FNMR at that threshold, the dataset on which the threshold was validated, the demographic composition of that dataset, and the operational justification for the selected trade-off between FMR and FNMR.
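As one possible way to satisfy 4.1, the rationale can be captured as a structured, versionable record rather than a value buried in configuration. This is a minimal sketch; the field names and example values are assumptions, not prescribed by AG-676.

```python
# Illustrative record structure mirroring Requirement 4.1.
# Field names and example values are assumptions, not prescribed by AG-676.
from dataclasses import dataclass
from datetime import date

@dataclass
class ThresholdSelectionRationale:
    system_name: str
    modality: str                 # e.g. "face", "voice"
    threshold_value: float        # the production accept threshold
    target_fmr: float             # false match rate target at the threshold
    target_fnmr: float            # false non-match rate target at the threshold
    validation_dataset: str       # identifier of the evaluation dataset
    dataset_demographics: dict    # subgroup -> proportion of the dataset
    tradeoff_justification: str   # operational rationale for the FMR/FNMR balance
    approved_by: str
    approval_date: date
    next_review_due: date

rationale = ThresholdSelectionRationale(
    system_name="telephone-banking-voiceprint",
    modality="voice",
    threshold_value=0.81,
    target_fmr=0.001,
    target_fnmr=0.02,
    validation_dataset="eval-2026-q1-representative",
    dataset_demographics={"age_65_plus": 0.22, "non_native_speakers": 0.18},
    tradeoff_justification="Fraud-loss exposure weighed against lockout impact; see risk memo.",
    approved_by="Biometric Governance Board",
    approval_date=date(2026, 3, 14),
    next_review_due=date(2027, 3, 14),
)
```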

4.2. A conforming system MUST validate every biometric similarity threshold against evaluation data that is representative of the population the system will encounter in deployment, including representation across skin tone, age, sex, ethnicity, accent or language background (for voice), and any other demographic variable known to affect biometric matching accuracy for the modality in use.

4.3. A conforming system MUST compute and document disaggregated error rates — FMR and FNMR — for each demographic subgroup in the evaluation dataset, and MUST establish maximum permissible disparity ratios between the highest and lowest subgroup error rates.
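A minimal sketch of the disaggregated computation follows; it also shows how the disparity ceiling required by 4.4 can act as a hard gate on threshold approval. The group labels, input format, and the 2.0 ceiling are hypothetical.

```python
# Illustrative disaggregated error rates and disparity-ratio gate (4.3, 4.4).
# Input format, group labels, and the ceiling value are hypothetical.

def group_error_rates(comparisons, threshold):
    """comparisons: list of (group, score, is_genuine) tuples.
    Returns {group: {"fmr": ..., "fnmr": ...}} at the given threshold."""
    rates = {}
    for g in {grp for grp, _, _ in comparisons}:
        imp = [s for grp, s, genuine in comparisons if grp == g and not genuine]
        gen = [s for grp, s, genuine in comparisons if grp == g and genuine]
        rates[g] = {
            "fmr": sum(s >= threshold for s in imp) / len(imp) if imp else None,
            "fnmr": sum(s < threshold for s in gen) / len(gen) if gen else None,
        }
    return rates

def disparity_ratio(rates, metric):
    """Worst-to-best subgroup ratio for 'fmr' or 'fnmr' (ignores empty or zero cells)."""
    values = [r[metric] for r in rates.values() if r[metric]]
    return max(values) / min(values) if len(values) > 1 else None

def enforce_disparity_ceiling(rates, ceiling):
    """Reject the threshold (4.4) if either metric's disparity exceeds the ceiling."""
    for metric in ("fmr", "fnmr"):
        ratio = disparity_ratio(rates, metric)
        if ratio is not None and ratio > ceiling:
            raise ValueError(f"{metric.upper()} disparity ratio {ratio:.1f} exceeds ceiling {ceiling}")

# Usage sketch (evaluation_comparisons would come from the representative dataset in 4.2):
# rates = group_error_rates(evaluation_comparisons, threshold=0.81)
# enforce_disparity_ceiling(rates, ceiling=2.0)
```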

4.4. A conforming system MUST reject any threshold where the disaggregated error rate analysis reveals that the disparity ratio between the worst-performing and best-performing demographic subgroup exceeds the organisation's documented maximum permissible disparity ratio for either FMR or FNMR.

4.5. A conforming system MUST implement ongoing monitoring of FMR and FNMR in production, disaggregated by available demographic indicators, with automated alerting when any subgroup's error rate deviates from the validated baseline by a statistically significant margin.
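For 4.5, one simple alerting rule (a sketch, not the only valid approach) is to test each subgroup's observed production FNMR against its validated baseline with a proportion z-test; the counts, baseline, and alert level below are hypothetical.

```python
# Illustrative drift alert for production monitoring (4.5): flag a subgroup whose
# observed FNMR departs from its validated baseline. All values are hypothetical.
import math

def fnmr_drift_alert(failures, attempts, baseline_fnmr, z_limit=2.58):
    """Return (alert, z): alert is True when the observed failure proportion lies
    more than z_limit standard errors from the validated baseline."""
    observed = failures / attempts
    se = math.sqrt(baseline_fnmr * (1 - baseline_fnmr) / attempts)
    z = (observed - baseline_fnmr) / se if se > 0 else float("inf")
    return abs(z) > z_limit, z

# Hypothetical month of genuine authentication attempts for one subgroup.
alert, z = fnmr_drift_alert(failures=412, attempts=8600, baseline_fnmr=0.032)
print(f"observed FNMR={412 / 8600:.3f}  z={z:.2f}  alert={alert}")
```

In practice the same test (or a statistical process control chart) would run per subgroup and per metric, feeding the automated alerting the requirement calls for.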

4.6. A conforming system MUST conduct periodic threshold recalibration reviews — at minimum annually and additionally whenever the underlying biometric model is updated, the enrolled population changes materially, or production monitoring detects error rate drift — to verify that the threshold continues to meet its documented FMR, FNMR, and equity targets.

4.7. A conforming system MUST define and implement a human escalation pathway for cases where the similarity score falls within a defined uncertainty band around the threshold, rather than applying a hard binary accept/reject at a single threshold value.
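A minimal sketch of the three-way decision implied by 4.7 follows; the threshold and band width are hypothetical and would themselves need validation and documentation.

```python
# Illustrative accept / reject / escalate decision with an uncertainty band (4.7).
# The threshold and band width are hypothetical.
def decide(score, threshold=0.81, band=0.04):
    """Return 'accept', 'reject', or 'escalate' for a similarity score."""
    if score >= threshold + band:
        return "accept"
    if score < threshold - band:
        return "reject"
    return "escalate"   # borderline score: route to human review

for s in (0.90, 0.83, 0.79, 0.60):
    print(s, decide(s))
```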

4.8. A conforming system MUST maintain a fallback mechanism — an alternative authentication or verification method — for individuals who are systematically unable to achieve biometric match scores above the threshold due to physiological characteristics, medical conditions, ageing, or other factors outside their control.

4.9. A conforming system MUST NOT deploy a biometric similarity threshold derived solely from vendor-supplied default values or vendor benchmark datasets without independent validation against a dataset representative of the deployment population.

4.10. A conforming system MUST retain all threshold selection documentation, validation test results, disaggregated error rate analyses, production monitoring reports, and recalibration records as governance evidence.

4.11. A conforming system SHOULD implement separate thresholds or score normalisation techniques for distinct operational contexts (e.g., one-to-one verification versus one-to-many identification) rather than applying a single threshold across fundamentally different matching scenarios.

4.12. A conforming system SHOULD conduct pre-deployment threshold sensitivity analysis showing how FMR, FNMR, and demographic disparity ratios change across a range of candidate threshold values, to inform the threshold selection decision with a complete picture of the trade-off landscape.
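The sensitivity analysis in 4.12 can be produced by sweeping candidate thresholds and tabulating aggregate and disaggregated metrics at each point. The sketch below assumes a list of (group, score, is_genuine) evaluation comparisons; the input name and threshold grid are illustrative.

```python
# Illustrative threshold sensitivity sweep (4.12): FMR, FNMR, and FNMR disparity
# at each candidate threshold. Input data and the threshold grid are hypothetical.

def sensitivity_sweep(comparisons, thresholds):
    """comparisons: list of (group, score, is_genuine) tuples."""
    groups = sorted({g for g, _, _ in comparisons})
    imp = [s for _, s, genuine in comparisons if not genuine]
    gen = [s for _, s, genuine in comparisons if genuine]
    for t in thresholds:
        fmr = sum(s >= t for s in imp) / len(imp)
        fnmr = sum(s < t for s in gen) / len(gen)
        group_fnmr = []
        for g in groups:
            gg = [s for grp, s, genuine in comparisons if grp == g and genuine]
            if gg:
                group_fnmr.append(sum(s < t for s in gg) / len(gg))
        nonzero = [v for v in group_fnmr if v > 0]
        disparity = max(nonzero) / min(nonzero) if len(nonzero) > 1 else float("nan")
        print(f"t={t:.2f}  FMR={fmr:.4f}  FNMR={fnmr:.4f}  FNMR disparity={disparity:.2f}")

# Usage sketch over a hypothetical candidate grid:
# sensitivity_sweep(evaluation_comparisons, [round(0.70 + 0.01 * i, 2) for i in range(21)])
```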

4.13. A conforming system MAY implement adaptive thresholds that adjust based on contextual risk factors (e.g., transaction value, security level of the access zone) provided that each adaptive threshold value is independently validated and documented.
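Where adaptive thresholds are used (4.13), one defensible pattern is a lookup table of independently validated values keyed by risk context, failing closed for unknown contexts rather than falling back to an unvalidated default. Context names, values, and evidence references below are hypothetical.

```python
# Illustrative adaptive-threshold lookup (4.13). Every entry corresponds to a
# separately validated and documented threshold; names and values are hypothetical.
VALIDATED_THRESHOLDS = {
    "low_value_transaction": 0.78,    # validated 2026-03, evidence ref TSR-042 (hypothetical)
    "high_value_transaction": 0.88,   # validated 2026-03, evidence ref TSR-043 (hypothetical)
    "account_recovery": 0.91,         # validated 2026-04, evidence ref TSR-047 (hypothetical)
}

def threshold_for(context: str) -> float:
    try:
        return VALIDATED_THRESHOLDS[context]
    except KeyError:
        # Fail closed: an unknown context must not silently inherit a default threshold.
        raise ValueError(f"no independently validated threshold for context '{context}'")
```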

5. Rationale

A biometric similarity score is a continuous number. The threshold that converts this number into a decision is the single most consequential parameter in any similarity-based biometric system. Yet in practice, thresholds are frequently inherited from vendor defaults, selected through ad hoc testing on unrepresentative data, or optimised for a single aggregate error metric without disaggregated demographic analysis. The consequences of ungoverned threshold selection are severe, inequitable, and well-documented.

The fundamental problem is that biometric matching accuracy is not uniform across populations. Decades of independent evaluation — including NIST's Face Recognition Vendor Test (FRVT) programme — have consistently demonstrated that facial recognition algorithms exhibit higher false match rates and higher false non-match rates for certain demographic groups. The magnitude of these disparities varies by algorithm, but the pattern is persistent: darker-skinned individuals, women, older individuals, and children consistently experience higher error rates than lighter-skinned adult males. For voice biometrics, analogous disparities exist: speakers with accents, older speakers whose voices have changed due to ageing, speakers with medical conditions affecting phonation, and speakers in noisy environments all experience higher error rates. Improving the underlying model narrows these disparities but does not reliably eliminate them, because they arise from the statistical properties of biometric variability within and across demographic groups.

When a single threshold is applied to a system whose underlying score distributions differ by demographic group, the resulting error rates are necessarily inequitable. A threshold that achieves a 0.1% FMR for one demographic group may produce a 1% FMR for another — a tenfold disparity. In a law enforcement context, this translates directly into a tenfold difference in the rate at which innocent members of different demographic groups are flagged as suspects. In an authentication context, it translates into a demographic group being locked out of their accounts at a rate that is multiples higher than another group. These are not abstract statistical concerns — they are direct, measurable harms that affect individuals' liberty, access to services, and dignity.
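To make the arithmetic concrete, suppose (hypothetically) that impostor similarity scores for two groups are approximately normal with the same spread but different means; the same threshold then yields false match rates an order of magnitude apart.

```python
# Worked illustration with hypothetical numbers: one threshold, two impostor-score
# distributions, a roughly tenfold FMR disparity.
import math

def fmr_at(threshold, mean, sd):
    """P(impostor score >= threshold) under a normal impostor-score model."""
    return 0.5 * math.erfc((threshold - mean) / (sd * math.sqrt(2)))

threshold = 0.73
print(f"group A FMR ~ {fmr_at(threshold, mean=0.45, sd=0.09):.4f}")   # ~0.001
print(f"group B FMR ~ {fmr_at(threshold, mean=0.52, sd=0.09):.4f}")   # ~0.010
```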

The regulatory environment increasingly demands that organisations govern these decisions explicitly. The EU AI Act classifies biometric identification systems as high-risk (Annex III, Category 1) and prohibits certain real-time biometric identification uses entirely. Article 9 requires risk management that addresses foreseeable risks including risks of bias. The Equality Act 2010 (UK) and equivalent anti-discrimination legislation in other jurisdictions prohibit indirect discrimination — the application of a facially neutral criterion (a single threshold) that disproportionately disadvantages a protected group. Biometric performance testing standards such as ISO/IEC 19795-1 call for representative test populations and statistically rigorous reporting, and NIST's vendor evaluations publish error rates disaggregated by demographic group. The EU AI Act's Article 10 data governance requirements mandate that training and evaluation data be "relevant, sufficiently representative, and to the extent possible, free of errors and complete" — a requirement that extends to the data on which thresholds are validated.

Threshold governance is a preventive control because it intervenes before the threshold is deployed, requiring validation and equity analysis as preconditions for production use. This is more effective and less costly than detective controls that identify harm after it has occurred. A wrongful arrest cannot be undone by adjusting the threshold after the fact. An elderly customer locked out of her account for 11 days has already suffered the harm regardless of subsequent recalibration. Preventive threshold governance ensures that the decision-making parameter is validated for equity and accuracy before it affects any individual.

6. Implementation Guidance

Face and Voice Similarity Threshold Governance requires a structured process for threshold selection, validation, monitoring, and recalibration. The process must treat threshold selection as a governed decision with documented rationale, not as a technical parameter buried in configuration files.

Recommended patterns:

- Treat threshold selection as a formal governance decision: document the Threshold Selection Rationale, including target FMR and FNMR, the validation dataset, and its demographic composition, before production use (4.1).
- Validate thresholds on evaluation data representative of the deployment population and compute disaggregated FMR and FNMR for every demographic subgroup (4.2, 4.3).
- Define the maximum permissible disparity ratio in advance and reject any candidate threshold that exceeds it (4.4).
- Apply an uncertainty band around the threshold so that borderline scores are escalated to human review rather than hard-decided (4.7).
- Monitor disaggregated error rates in production with automated alerting, and recalibrate at least annually or when models, populations, or error rates shift (4.5, 4.6).
- Maintain an alternative verification channel for individuals who cannot reliably pass biometric matching (4.8).

Anti-patterns to avoid:

- Accepting a vendor default threshold, or a threshold validated only on vendor benchmark data, without independent validation (Scenario A, 4.9).
- Optimising the threshold for a single aggregate metric, such as a target FMR, without examining the disaggregated FNMR borne by vulnerable groups (Scenario B).
- Applying one global threshold across populations whose similarity score distributions are known to differ, without analysing those distributions (Scenario C).
- Decommissioning fallback authentication channels once the biometric system is live.
- Treating the threshold as an undocumented configuration value with no owner, rationale, or recalibration schedule.

Industry Considerations

Law Enforcement and Public Safety. Facial recognition thresholds in law enforcement have the highest consequence severity because false matches can lead to wrongful stops, arrests, and detention. Organisations should set the maximum permissible FMR disparity ratio at or below 2:1 and implement mandatory human review for all candidate matches — no automated arrest or detention based solely on a similarity score. Threshold validation must use datasets that represent the demographics of the jurisdiction, not national or international benchmarks that may not reflect local population composition.

Financial Services. Voice and facial authentication thresholds for banking directly affect customer access to financial services. Under equality legislation and financial conduct regulation, systematic lockout of demographic groups constitutes both a conduct risk and a potential indirect discrimination claim. Financial institutions should monitor FNMR by customer age band, language background, and disability status, and should maintain alternative authentication channels for the life of the biometric system.

Border Control and Immigration. E-gate facial matching operates at scale with limited opportunity for individual escalation. Threshold governance must account for passport photograph age (up to 10 years), cross-age matching performance, and the varying facial coverage in passport photographs (head coverings, glasses). Disaggregated analysis should cover nationality groupings, age bands, and visible head covering to ensure equitable processing times.

Healthcare and Identity Verification. Patient identity verification using biometrics in healthcare settings must account for patients whose biometric characteristics have changed due to medical treatment (facial surgery, intubation affecting voice), ageing, or disability. Thresholds must be validated with particular attention to the patient populations most likely to need healthcare services — elderly patients, patients with chronic conditions — who are also the populations most likely to experience higher FNMR.

Maturity Model

Basic Implementation -- The organisation has documented a Threshold Selection Rationale for every production biometric threshold. Each threshold has been validated against data that includes demographic subgroups. Disaggregated FMR and FNMR are computed and recorded. A maximum permissible disparity ratio is defined. A fallback mechanism exists for individuals who cannot pass biometric verification. Vendor defaults are not used without independent validation. This level meets the minimum mandatory requirements.

Intermediate Implementation -- All basic capabilities plus: production monitoring captures disaggregated error rates with statistical process control alerting. An uncertainty band triggers human review for borderline scores. Threshold recalibration reviews are conducted at least annually. Pre-deployment threshold sensitivity analysis documents the full trade-off landscape across candidate threshold values. Fallback mechanisms are monitored for disproportionate use by demographic groups.

Advanced Implementation -- All intermediate capabilities plus: adaptive thresholds adjust to contextual risk factors with independent validation for each adaptive value. Real-time dashboards display disaggregated error rates by demographic group. External audit validates threshold equity claims. The organisation can demonstrate through longitudinal data that its threshold governance process has reduced demographic disparity ratios over successive recalibration cycles. Score normalisation techniques account for known confounding variables (lighting, age differential, acoustic environment) before threshold application.
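One common family of techniques for the score normalisation mentioned above is to standardise raw scores against impostor-score statistics estimated per capture condition (for example, per age-differential band) before applying a single threshold. The sketch below uses z-normalisation; the condition labels, statistics, and normalised threshold are hypothetical.

```python
# Illustrative z-normalisation before threshold application.
# Condition labels, impostor statistics, and the threshold are hypothetical.
IMPOSTOR_STATS = {                  # condition -> (mean, std dev) of impostor scores
    "age_gap_0_2_years": (0.45, 0.09),
    "age_gap_8_10_years": (0.52, 0.11),
}

def normalised_score(raw_score, condition):
    mean, std = IMPOSTOR_STATS[condition]
    return (raw_score - mean) / std    # comparable across conditions after normalisation

Z_THRESHOLD = 3.0   # single threshold applied to the normalised score

for cond in IMPOSTOR_STATS:
    z = normalised_score(0.80, cond)
    print(f"{cond}: z={z:.2f}  accept={z >= Z_THRESHOLD}")
```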

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Threshold Selection Rationale Documentation Verification

Test 8.2: Evaluation Dataset Representativeness Verification

Test 8.3: Disaggregated Error Rate Disparity Analysis

Test 8.4: Production Monitoring Disaggregation Verification

Test 8.5: Recalibration Cycle Compliance Verification

Test 8.6: Uncertainty Band and Human Escalation Verification

Test 8.7: Fallback Mechanism Availability and Equity Verification

Test 8.8: Vendor Default Rejection Verification

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Direct requirement
EU AI Act | Article 10 (Data and Data Governance) | Direct requirement
EU AI Act | Annex III, Category 1 (Biometric Identification) | Scope definition
EU AI Act | Article 14 (Human Oversight) | Supports compliance
Equality Act 2010 (UK) | Section 19 (Indirect Discrimination) | Direct requirement
NIST SP 800-76 | Biometric Specifications for PIV | Supports compliance
NIST AI RMF | MAP 2.3 (Bias Pre-deployment Testing) | Direct requirement
GDPR | Article 9 (Special Categories of Personal Data), Article 35 (DPIA) | Supports compliance
ISO/IEC 19795-1 | Biometric Performance Testing and Reporting | Supports compliance
FCA | SYSC 6.1.1R (Systems and Controls) | Supports compliance

EU AI Act -- Article 9 (Risk Management System)

Article 9 requires that high-risk AI systems operate under a risk management system that identifies and analyses known and foreseeable risks, including risks of bias. Biometric similarity thresholds are a primary mechanism through which bias materialises in biometric systems — a threshold validated on an unrepresentative dataset will produce inequitable error rates across demographic groups. AG-676 operationalises Article 9 by requiring disaggregated error rate analysis, maximum permissible disparity ratios, and ongoing monitoring as core components of the risk management process for biometric systems. A deployer that cannot demonstrate governed threshold selection with equity analysis cannot satisfy Article 9's requirement for bias risk management.

EU AI Act -- Article 10 (Data and Data Governance)

Article 10 requires that training, validation, and testing datasets be "relevant, sufficiently representative, and to the extent possible, free of errors and complete." For biometric threshold validation, this means the evaluation dataset must represent the demographic composition of the deployment population. AG-676's Requirement 4.2 directly implements Article 10 by mandating representative evaluation data. Requirement 4.9 — prohibiting deployment of vendor defaults without independent validation — addresses the common failure mode where the vendor's evaluation data is representative of the vendor's test population but not the deployer's operational population.

Equality Act 2010 (UK) -- Section 19 (Indirect Discrimination)

Section 19 prohibits applying a provision, criterion, or practice that puts persons sharing a protected characteristic at a particular disadvantage compared with persons who do not share it, unless the provision is a proportionate means of achieving a legitimate aim. A biometric similarity threshold is a "provision, criterion, or practice." If the threshold produces an FNMR of 1.2% for one demographic group and 12.1% for another — as in Scenario B — persons in the disadvantaged group are put at a "particular disadvantage" (ten times the lockout rate). The deployer must demonstrate that the threshold is a proportionate means of achieving a legitimate aim, which requires evidence that the threshold was selected with awareness of the disparity, that alternatives were considered, and that mitigations (fallback channels, uncertainty bands) were implemented. AG-676 provides the governance framework for demonstrating proportionality.

NIST AI RMF -- MAP 2.3 (Bias Pre-deployment Testing)

MAP 2.3 calls for pre-deployment testing to identify and assess potential biases in AI systems. For biometric similarity systems, the most critical pre-deployment bias test is the disaggregated error rate analysis at the selected threshold. AG-676 mandates this analysis as a precondition for deployment (Requirement 4.3), directly implementing NIST's pre-deployment bias testing expectation. The threshold sensitivity analysis recommended in Requirement 4.12 extends MAP 2.3 by documenting the full bias landscape across candidate thresholds, not just the selected value.

GDPR -- Article 9 and Article 35

GDPR Article 9 classifies biometric data processed for identification purposes as a special category of personal data, requiring explicit consent or another Article 9(2) basis for processing. Article 35 requires a Data Protection Impact Assessment (DPIA) for processing that is likely to result in a high risk to rights and freedoms, which expressly includes "systematic monitoring of a publicly accessible area on a large scale" and processing of biometric data. A DPIA for a biometric system that does not assess threshold-related demographic disparities is incomplete. AG-676 provides the technical governance artefacts — disaggregated error rate analyses, disparity ratios, and equity monitoring reports — that a DPIA should reference when assessing the proportionality and fairness of a biometric similarity system.

ISO/IEC 19795-1 -- Biometric Performance Testing and Reporting

ISO/IEC 19795-1 establishes the framework for biometric performance testing, including requirements for representative test populations, disaggregated reporting, and statistical rigour. AG-676 aligns with and operationalises 19795-1 by mandating representative evaluation data (Requirement 4.2), disaggregated error rate computation (Requirement 4.3), and retention of testing methodology and results (Requirement 4.10). Organisations that comply with AG-676's threshold validation requirements will substantially satisfy the performance evaluation requirements of 19795-1.

10. Failure Severity

Field | Value
Severity Rating | Critical
Blast Radius | Population-scale — affects every individual processed by the biometric system, with disproportionate harm concentrated on demographic groups experiencing the highest error rates

Consequence chain: An ungoverned biometric similarity threshold is deployed in production. The threshold was selected based on vendor defaults or validated against a dataset that does not represent the deployment population. The threshold produces inequitable error rates across demographic groups, but no disaggregated analysis was performed, so the disparity is invisible. In a law enforcement context, the higher FMR for a specific demographic group generates disproportionate false match alerts. Officers act on these alerts, conducting stops, detentions, and arrests of innocent individuals at a rate that is multiples higher for one demographic group than another. Each wrongful stop causes immediate harm — loss of liberty, psychological distress, reputational damage — and the pattern constitutes systemic discriminatory treatment. In an authentication context, the higher FNMR for elderly, non-native-speaking, or medically affected individuals systematically locks them out of financial services, government services, or physical access. When fallback channels have been eliminated for cost efficiency, locked-out individuals have no alternative. The harm accumulates invisibly because aggregate metrics remain within acceptable bounds — the overall FMR and FNMR are satisfactory, but the subgroup disparities are severe. Discovery occurs through litigation, media investigation, regulatory audit, or complaint accumulation. By the time the disparity is identified, hundreds or thousands of individuals have been affected. Remediation requires threshold recalibration, retrospective review of all decisions made at the ungoverned threshold, potential compensation or redress for affected individuals, and restoration of public trust. In law enforcement, remediation may include vacating arrests, expunging records, and settling civil rights claims — the Detroit case study alone involved settlements exceeding $1 million per wrongful arrest. In financial services, remediation includes customer notification, complaint resolution, regulatory reporting, and potential enforcement action for indirect discrimination. The total cost of ungoverned threshold failure characteristically exceeds the cost of proper threshold governance by two to three orders of magnitude.

Cross-references: AG-001 (Foundational Governance Charter) provides the governance structure within which threshold decisions are made and documented. AG-007 (Bias & Fairness Assessment) establishes the broader framework for identifying and mitigating demographic disparities, of which threshold-induced error rate disparities are a specific instance. AG-019 (Human Escalation & Override Triggers) defines when and how similarity scores in the uncertainty band are escalated to human review. AG-022 (Behavioural Drift Detection) supports detection of threshold degradation over time as population characteristics or environmental conditions shift. AG-055 (Performance & Reliability Baselines) provides the baseline performance framework against which threshold-specific FMR and FNMR targets are set. AG-084 (Continuous Monitoring & Alerting) provides the monitoring infrastructure for ongoing disaggregated error rate tracking. AG-210 (Threshold Calibration Governance) provides the general threshold governance framework that AG-676 specialises for biometric similarity contexts. AG-669 (Biometric Purpose Limitation) ensures that similarity thresholds are applied only for the documented biometric purpose. AG-670 (Liveness Verification) addresses presentation attacks that can affect similarity scores. AG-672 (Behavioural Biometrics Fairness) addresses the broader fairness framework for biometric systems. AG-673 (Biometric Template Protection) ensures that the stored templates against which similarity is computed are protected. AG-675 (Spoof-Response Escalation) defines escalation procedures when spoof attacks distort similarity scores. AG-677 (Consent and Notice for Biometrics) ensures individuals are informed about the biometric comparison and its threshold-based decision logic. AG-678 (Biometric Redress) provides the redress pathway for individuals adversely affected by threshold-based decisions.

Cite this protocol
AgentGoverning. (2026). AG-676: Face and Voice Similarity Threshold Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-676