AG-676

Face and Voice Similarity Threshold Governance

Biometrics, Emotion & Identity Analytics · AGS v2.1 · April 2026
Regulatory context: EU AI Act · GDPR · FCA · NIST

2. Summary

Face and Voice Similarity Threshold Governance requires organisations deploying AI agents that make decisions based on biometric similarity scores to formally set, validate, document, and periodically recalibrate the numerical thresholds at which those agents accept, reject, or escalate a biometric match. Similarity-based biometric systems — facial recognition, speaker verification, voiceprint matching — produce continuous confidence scores, not binary outcomes. The threshold that converts a continuous score into a binary accept/reject decision is one of the most consequential parameters in the entire system, because it directly determines the false match rate (FMR) and false non-match rate (FNMR) experienced by every person the system encounters. A threshold set too permissively increases false matches, enabling identity fraud or — in law enforcement contexts — wrongful identification of innocent individuals. A threshold set too restrictively increases false non-matches, locking legitimate users out of their own accounts or denying them access to services they are entitled to use. This dimension mandates that threshold selection is a governed decision, not a default inherited from a vendor model, and that thresholds are validated against disaggregated demographic data to ensure that error rates are equitable across populations.
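For concreteness, the sketch below (illustrative only; the scores and threshold values are hypothetical, not drawn from this protocol) shows how a single accept threshold converts continuous similarity scores into decisions, and how FMR and FNMR follow directly from that choice.

```python
# Illustrative sketch: how one threshold value determines FMR and FNMR.
# The comparison scores and candidate thresholds below are hypothetical.

def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (FMR, FNMR) at a given accept threshold.

    FMR:  fraction of impostor comparisons accepted (score >= threshold).
    FNMR: fraction of genuine comparisons rejected (score < threshold).
    """
    fmr = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    fnmr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return fmr, fnmr

genuine = [0.91, 0.88, 0.79, 0.95, 0.83, 0.70, 0.93]   # same-person comparisons
impostor = [0.41, 0.55, 0.73, 0.38, 0.62, 0.48, 0.51]  # different-person comparisons

for t in (0.60, 0.72, 0.85):
    fmr, fnmr = error_rates(genuine, impostor, t)
    print(f"threshold={t:.2f}  FMR={fmr:.3f}  FNMR={fnmr:.3f}")
```

Moving the threshold up or down trades one error rate against the other; the governance question is who bears each error and at what rate.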

3. Example

Scenario A -- Facial Recognition Threshold Causes Wrongful Arrests: A metropolitan police department deploys an AI agent that compares surveillance camera captures against a database of persons of interest. The vendor delivers the system with a default similarity threshold of 0.72 on a 0-to-1 scale, calibrated on the vendor's internal benchmark dataset. The department accepts the default without independent validation. Over 14 months, the system generates 3,847 candidate match alerts. Officers investigate 1,204 of these alerts, leading to 43 field stops and 9 arrests. Of the 9 arrests, 3 are later determined to be wrongful — the arrested individuals were not the persons of interest. All three wrongfully arrested individuals are Black men, detained for between 4 and 11 hours before identification errors are confirmed. An independent audit reveals that the vendor's benchmark dataset was 78% Caucasian and 6% Black, meaning the 0.72 threshold was validated primarily against light-skinned subjects. When tested on a demographically representative dataset, the FMR for Black male subjects at the 0.72 threshold is 0.085 — roughly 10 times the 0.008 FMR for Caucasian male subjects. Raising the threshold to 0.89 would have equalised the FMR across demographic groups at approximately 0.006, but would also have increased the FNMR for all groups. The department had no documented threshold selection rationale, no disaggregated error rate analysis, and no periodic recalibration process. The resulting litigation costs exceed $4.2 million, the department suspends the programme, and the city council passes an ordinance requiring demographic impact assessment for any future biometric deployment.

What went wrong: The department accepted a vendor default threshold without validating it against a representative population. No disaggregated analysis was performed to determine whether the threshold produced equitable error rates across demographic groups. The threshold was calibrated on a dataset that was not representative of the population the system would encounter in deployment. No governance process required threshold documentation or periodic recalibration. The consequence was a 10x disparity in false match rates that translated directly into wrongful arrests of members of an already over-policed community.

Scenario B -- Voice Authentication Threshold Locks Out Elderly and Non-Native Speakers: A national bank deploys an AI agent for telephone banking authentication using voiceprint verification. The system compares a caller's voice against a stored voiceprint template and grants access if the similarity score exceeds 0.81. The threshold is selected to achieve a target FMR of 0.001 (one accepted fraudulent attempt per thousand impostor attempts) based on vendor-supplied test data. Within six months, the bank's customer complaint system registers 2,340 complaints from customers unable to authenticate, a 340% increase over the pre-deployment baseline. Analysis reveals that the FNMR is severely non-uniform across demographic groups: 1.2% for native English speakers aged 25-45, 4.8% for native English speakers aged 65 and older, 7.3% for non-native English speakers regardless of age, and 12.1% for customers with speech impediments or medical conditions affecting voice (Parkinson's disease, post-stroke dysarthria). The bank's elderly customers are three to four times more likely to be locked out of their own accounts than younger customers. Non-native speakers are six times more likely. Customers with speech-affecting medical conditions are ten times more likely. Because the bank simultaneously decommissioned its legacy PIN-based authentication as a cost-saving measure, locked-out customers have no alternative channel and must visit a branch in person — a disproportionate burden on elderly and disabled customers who may have limited mobility. The bank faces complaints to the financial ombudsman, a regulatory inquiry into accessibility obligations under equality legislation, and reputational damage after a national newspaper reports that a 79-year-old Parkinson's patient was locked out of her account for 11 days.

What went wrong: The bank validated the threshold against an aggregate FMR target without disaggregating the FNMR by demographic group. The vendor test data did not represent the bank's actual customer population, particularly older customers and non-native speakers whose voice characteristics exhibit greater variability. The threshold was optimised for fraud prevention (minimising FMR) without adequate consideration of the access impact (FNMR) on vulnerable populations. No fallback authentication channel was maintained for customers who could not pass voiceprint verification. No ongoing monitoring of FNMR by demographic group was implemented.

Scenario C -- Border Control Facial Matching Threshold Causes Systematic Delays: A national border agency deploys automated e-gates at international airports. The AI agent compares a traveller's live face capture against their passport chip photograph. The initial threshold is set at 0.78, balancing throughput against security. Within three months, the agency observes that e-gate rejection rates (requiring travellers to proceed to manual inspection) vary dramatically: 2.1% for European passport holders aged 20-50, 8.9% for East Asian passport holders, 11.4% for travellers over 65 (whose passport photos may be up to 10 years old), and 14.7% for women wearing hijab where the visible facial area is reduced. The disproportionate rejection rates create visible queuing disparities at the manual inspection desks, with certain demographic groups consistently directed to secondary inspection while others pass through e-gates unimpeded. Media coverage frames the disparity as discriminatory profiling. An independent technical review determines that a single global threshold cannot achieve equitable FNMR across all demographic groups because the underlying similarity score distributions differ by skin tone, age differential between the live capture and the stored photograph, and facial coverage area. The review recommends either demographic-specific thresholds (which raise legal and ethical concerns about differential treatment) or a composite scoring approach that normalises for known confounding variables before applying a single threshold.

What went wrong: The agency applied a single global threshold without analysing the similarity score distributions across demographic groups. The threshold that produced acceptable overall rejection rates produced severely inequitable group-specific rejection rates. No governance process required pre-deployment disaggregated analysis. No ongoing monitoring detected the disparate rejection rates until media attention forced a review. The agency had no documented framework for making the trade-off between security (FMR), convenience (FNMR), and equity (cross-group FNMR parity).

4. Requirement Statement

Scope: This dimension applies to any AI agent deployment where decisions — access control, identity verification, authentication, candidate identification, surveillance matching, or any other consequential determination — are made by applying a numerical threshold to a biometric similarity score derived from face, voice, or combined face-and-voice comparison. The scope includes both one-to-one verification (comparing a probe against a claimed identity template) and one-to-many identification (comparing a probe against a gallery of templates). The scope covers all deployment contexts: law enforcement, border control, financial services authentication, physical access control, customer onboarding, and any other context where a similarity threshold converts a continuous score into a binary or categorical decision. The scope extends to agents that use similarity scores as one input among several in a decision pipeline, provided the threshold applied to the similarity score materially affects the decision outcome.

4.1. A conforming system MUST document a formal Threshold Selection Rationale for every biometric similarity threshold used in production, specifying the chosen threshold value, the target FMR and FNMR at that threshold, the dataset on which the threshold was validated, the demographic composition of that dataset, and the operational justification for the selected trade-off between FMR and FNMR.
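As one possible way to satisfy 4.1, the rationale can be captured as a structured, versionable record rather than a value buried in configuration. This is a minimal sketch; the field names and example values are assumptions, not prescribed by AG-676.

```python
# Illustrative record structure mirroring Requirement 4.1.
# Field names and example values are assumptions, not prescribed by AG-676.
from dataclasses import dataclass
from datetime import date

@dataclass
class ThresholdSelectionRationale:
    system_name: str
    modality: str                 # e.g. "face", "voice"
    threshold_value: float        # the production accept threshold
    target_fmr: float             # false match rate target at the threshold
    target_fnmr: float            # false non-match rate target at the threshold
    validation_dataset: str       # identifier of the evaluation dataset
    dataset_demographics: dict    # subgroup -> proportion of the dataset
    tradeoff_justification: str   # operational rationale for the FMR/FNMR balance
    approved_by: str
    approval_date: date
    next_review_due: date

rationale = ThresholdSelectionRationale(
    system_name="telephone-banking-voiceprint",
    modality="voice",
    threshold_value=0.81,
    target_fmr=0.001,
    target_fnmr=0.02,
    validation_dataset="eval-2026-q1-representative",
    dataset_demographics={"age_65_plus": 0.22, "non_native_speakers": 0.18},
    tradeoff_justification="Fraud-loss exposure weighed against lockout impact; see risk memo.",
    approved_by="Biometric Governance Board",
    approval_date=date(2026, 3, 14),
    next_review_due=date(2027, 3, 14),
)
```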

4.2. A conforming system MUST validate every biometric similarity threshold against evaluation data that is representative of the population the system will encounter in deployment, including representation across skin tone, age, sex, ethnicity, accent or language background (for voice), and any other demographic variable known to affect biometric matching accuracy for the modality in use.

4.3. A conforming system MUST compute and document disaggregated error rates — FMR and FNMR — for each demographic subgroup in the evaluation dataset, and MUST establish maximum permissible disparity ratios between the highest and lowest subgroup error rates.
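A minimal sketch of the disaggregated computation follows; it also shows how the disparity ceiling required by 4.4 can act as a hard gate on threshold approval. The group labels, input format, and the 2.0 ceiling are hypothetical.

```python
# Illustrative disaggregated error rates and disparity-ratio gate (4.3, 4.4).
# Input format, group labels, and the ceiling value are hypothetical.

def group_error_rates(comparisons, threshold):
    """comparisons: list of (group, score, is_genuine) tuples.
    Returns {group: {"fmr": ..., "fnmr": ...}} at the given threshold."""
    rates = {}
    for g in {grp for grp, _, _ in comparisons}:
        imp = [s for grp, s, genuine in comparisons if grp == g and not genuine]
        gen = [s for grp, s, genuine in comparisons if grp == g and genuine]
        rates[g] = {
            "fmr": sum(s >= threshold for s in imp) / len(imp) if imp else None,
            "fnmr": sum(s < threshold for s in gen) / len(gen) if gen else None,
        }
    return rates

def disparity_ratio(rates, metric):
    """Worst-to-best subgroup ratio for 'fmr' or 'fnmr' (ignores empty or zero cells)."""
    values = [r[metric] for r in rates.values() if r[metric]]
    return max(values) / min(values) if len(values) > 1 else None

def enforce_disparity_ceiling(rates, ceiling):
    """Reject the threshold (4.4) if either metric's disparity exceeds the ceiling."""
    for metric in ("fmr", "fnmr"):
        ratio = disparity_ratio(rates, metric)
        if ratio is not None and ratio > ceiling:
            raise ValueError(f"{metric.upper()} disparity ratio {ratio:.1f} exceeds ceiling {ceiling}")

# Usage sketch (evaluation_comparisons would come from the representative dataset in 4.2):
# rates = group_error_rates(evaluation_comparisons, threshold=0.81)
# enforce_disparity_ceiling(rates, ceiling=2.0)
```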

4.4. A conforming system MUST reject any threshold where the disaggregated error rate analysis reveals that the disparity ratio between the worst-performing and best-performing demographic subgroup exceeds the organisation's documented maximum permissible disparity ratio for either FMR or FNMR.

4.5. A conforming system MUST implement ongoing monitoring of FMR and FNMR in production, disaggregated by available demographic indicators, with automated alerting when any subgroup's error rate deviates from the validated baseline by a statistically significant margin.
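For 4.5, one simple alerting rule (a sketch, not the only valid approach) is to test each subgroup's observed production FNMR against its validated baseline with a proportion z-test; the counts, baseline, and alert level below are hypothetical.

```python
# Illustrative drift alert for production monitoring (4.5): flag a subgroup whose
# observed FNMR departs from its validated baseline. All values are hypothetical.
import math

def fnmr_drift_alert(failures, attempts, baseline_fnmr, z_limit=2.58):
    """Return (alert, z): alert is True when the observed failure proportion lies
    more than z_limit standard errors from the validated baseline."""
    observed = failures / attempts
    se = math.sqrt(baseline_fnmr * (1 - baseline_fnmr) / attempts)
    z = (observed - baseline_fnmr) / se if se > 0 else float("inf")
    return abs(z) > z_limit, z

# Hypothetical month of genuine authentication attempts for one subgroup.
alert, z = fnmr_drift_alert(failures=412, attempts=8600, baseline_fnmr=0.032)
print(f"observed FNMR={412 / 8600:.3f}  z={z:.2f}  alert={alert}")
```

In practice the same test (or a statistical process control chart) would run per subgroup and per metric, feeding the automated alerting the requirement calls for.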

4.6. A conforming system MUST conduct periodic threshold recalibration reviews — at minimum annually and additionally whenever the underlying biometric model is updated, the enrolled population changes materially, or production monitoring detects error rate drift — to verify that the threshold continues to meet its documented FMR, FNMR, and equity targets.

4.7. A conforming system MUST define and implement a human escalation pathway for cases where the similarity score falls within a defined uncertainty band around the threshold, rather than applying a hard binary accept/reject at a single threshold value.
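A minimal sketch of the three-way decision implied by 4.7 follows; the threshold and band width are hypothetical and would themselves need validation and documentation.

```python
# Illustrative accept / reject / escalate decision with an uncertainty band (4.7).
# The threshold and band width are hypothetical.
def decide(score, threshold=0.81, band=0.04):
    """Return 'accept', 'reject', or 'escalate' for a similarity score."""
    if score >= threshold + band:
        return "accept"
    if score < threshold - band:
        return "reject"
    return "escalate"   # borderline score: route to human review

for s in (0.90, 0.83, 0.79, 0.60):
    print(s, decide(s))
```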

4.8. A conforming system MUST maintain a fallback mechanism — an alternative authentication or verification method — for individuals who are systematically unable to achieve biometric match scores above the threshold due to physiological characteristics, medical conditions, ageing, or other factors outside their control.

4.9. A conforming system MUST NOT deploy a biometric similarity threshold derived solely from vendor-supplied default values or vendor benchmark datasets without independent validation against a dataset representative of the deployment population.

4.10. A conforming system MUST retain all threshold selection documentation, validation test results, disaggregated error rate analyses, production monitoring reports, and recalibration records as governance evidence.

4.11. A conforming system SHOULD implement separate thresholds or score normalisation techniques for distinct operational contexts (e.g., one-to-one verification versus one-to-many identification) rather than applying a single threshold across fundamentally different matching scenarios.

4.12. A conforming system SHOULD conduct pre-deployment threshold sensitivity analysis showing how FMR, FNMR, and demographic disparity ratios change across a range of candidate threshold values, to inform the threshold selection decision with a complete picture of the trade-off landscape.
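The sensitivity analysis in 4.12 can be produced by sweeping candidate thresholds and tabulating aggregate and disaggregated metrics at each point. The sketch below assumes a list of (group, score, is_genuine) evaluation comparisons; the input name and threshold grid are illustrative.

```python
# Illustrative threshold sensitivity sweep (4.12): FMR, FNMR, and FNMR disparity
# at each candidate threshold. Input data and the threshold grid are hypothetical.

def sensitivity_sweep(comparisons, thresholds):
    """comparisons: list of (group, score, is_genuine) tuples."""
    groups = sorted({g for g, _, _ in comparisons})
    imp = [s for _, s, genuine in comparisons if not genuine]
    gen = [s for _, s, genuine in comparisons if genuine]
    for t in thresholds:
        fmr = sum(s >= t for s in imp) / len(imp)
        fnmr = sum(s < t for s in gen) / len(gen)
        group_fnmr = []
        for g in groups:
            gg = [s for grp, s, genuine in comparisons if grp == g and genuine]
            if gg:
                group_fnmr.append(sum(s < t for s in gg) / len(gg))
        nonzero = [v for v in group_fnmr if v > 0]
        disparity = max(nonzero) / min(nonzero) if len(nonzero) > 1 else float("nan")
        print(f"t={t:.2f}  FMR={fmr:.4f}  FNMR={fnmr:.4f}  FNMR disparity={disparity:.2f}")

# Usage sketch over a hypothetical candidate grid:
# sensitivity_sweep(evaluation_comparisons, [round(0.70 + 0.01 * i, 2) for i in range(21)])
```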

4.13. A conforming system MAY implement adaptive thresholds that adjust based on contextual risk factors (e.g., transaction value, security level of the access zone) provided that each adaptive threshold value is independently validated and documented.
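Where adaptive thresholds are used (4.13), one defensible pattern is a lookup table of independently validated values keyed by risk context, failing closed for unknown contexts rather than falling back to an unvalidated default. Context names, values, and evidence references below are hypothetical.

```python
# Illustrative adaptive-threshold lookup (4.13). Every entry corresponds to a
# separately validated and documented threshold; names and values are hypothetical.
VALIDATED_THRESHOLDS = {
    "low_value_transaction": 0.78,    # validated 2026-03, evidence ref TSR-042 (hypothetical)
    "high_value_transaction": 0.88,   # validated 2026-03, evidence ref TSR-043 (hypothetical)
    "account_recovery": 0.91,         # validated 2026-04, evidence ref TSR-047 (hypothetical)
}

def threshold_for(context: str) -> float:
    try:
        return VALIDATED_THRESHOLDS[context]
    except KeyError:
        # Fail closed: an unknown context must not silently inherit a default threshold.
        raise ValueError(f"no independently validated threshold for context '{context}'")
```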

5. Rationale

A biometric similarity score is a continuous number. The threshold that converts this number into a decision is the single most consequential parameter in any similarity-based biometric system. Yet in practice, thresholds are frequently inherited from vendor defaults, selected through ad hoc testing on unrepresentative data, or optimised for a single aggregate error metric without disaggregated demographic analysis. The consequences of ungoverned threshold selection are severe, inequitable, and well-documented.

The fundamental problem is that biometric matching accuracy is not uniform across populations. Decades of independent evaluation — including NIST's Face Recognition Vendor Test (FRVT) programme — have consistently demonstrated that facial recognition algorithms exhibit higher false match rates and higher false non-match rates for certain demographic groups. The magnitude of these disparities varies by algorithm, but the pattern is persistent: darker-skinned individuals, women, older individuals, and children consistently experience higher error rates than lighter-skinned adult males. For voice biometrics, analogous disparities exist: speakers with accents, older speakers whose voices have changed due to ageing, speakers with medical conditions affecting phonation, and speakers in noisy environments all experience higher error rates. Improving the underlying model narrows these disparities but does not reliably eliminate them, because they arise from the statistical properties of biometric variability within and across demographic groups.

When a single threshold is applied to a system whose underlying score distributions differ by demographic group, the resulting error rates are necessarily inequitable. A threshold that achieves a 0.1% FMR for one demographic group may produce a 1% FMR for another — a tenfold disparity. In a law enforcement context, this translates directly into a tenfold difference in the rate at which innocent members of different demographic groups are flagged as suspects. In an authentication context, it translates into a demographic group being locked out of their accounts at a rate that is multiples higher than another group. These are not abstract statistical concerns — they are direct, measurable harms that affect individuals' liberty, access to services, and dignity.
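To make the arithmetic concrete, suppose (hypothetically) that impostor similarity scores for two groups are approximately normal with the same spread but different means; the same threshold then yields false match rates an order of magnitude apart.

```python
# Worked illustration with hypothetical numbers: one threshold, two impostor-score
# distributions, a roughly tenfold FMR disparity.
import math

def fmr_at(threshold, mean, sd):
    """P(impostor score >= threshold) under a normal impostor-score model."""
    return 0.5 * math.erfc((threshold - mean) / (sd * math.sqrt(2)))

threshold = 0.73
print(f"group A FMR ~ {fmr_at(threshold, mean=0.45, sd=0.09):.4f}")   # ~0.001
print(f"group B FMR ~ {fmr_at(threshold, mean=0.52, sd=0.09):.4f}")   # ~0.010
```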

The regulatory environment increasingly demands that organisations govern these decisions explicitly. The EU AI Act classifies biometric identification systems as high-risk (Annex III, Category 1) and prohibits certain real-time biometric identification uses entirely. Article 9 requires risk management that addresses foreseeable risks including risks of bias. The Equality Act 2010 (UK) and equivalent anti-discrimination legislation in other jurisdictions prohibit indirect discrimination — the application of a facially neutral criterion (a single threshold) that disproportionately disadvantages a protected group. Biometric performance testing standards such as ISO/IEC 19795-1 call for representative test populations and statistically rigorous reporting, and NIST's vendor evaluations publish error rates disaggregated by demographic group. The EU AI Act's Article 10 data governance requirements mandate that training and evaluation data be "relevant, sufficiently representative, and to the extent possible, free of errors and complete" — a requirement that extends to the data on which thresholds are validated.

Threshold governance is a preventive control because it intervenes before the threshold is deployed, requiring validation and equity analysis as preconditions for production use. This is more effective and less costly than detective controls that identify harm after it has occurred. A wrongful arrest cannot be undone by adjusting the threshold after the fact. An elderly customer locked out of her account for 11 days has already suffered the harm regardless of subsequent recalibration. Preventive threshold governance ensures that the decision-making parameter is validated for equity and accuracy before it affects any individual.

6. Implementation Guidance

Face and Voice Similarity Threshold Governance requires a structured process for threshold selection, validation, monitoring, and recalibration. The process must treat threshold selection as a governed decision with documented rationale, not as a technical parameter buried in configuration files.

Recommended patterns:

- Treat threshold selection as a formal governance decision: document the Threshold Selection Rationale, including target FMR and FNMR, the validation dataset, and its demographic composition, before production use (4.1).
- Validate thresholds on evaluation data representative of the deployment population and compute disaggregated FMR and FNMR for every demographic subgroup (4.2, 4.3).
- Define the maximum permissible disparity ratio in advance and reject any candidate threshold that exceeds it (4.4).
- Apply an uncertainty band around the threshold so that borderline scores are escalated to human review rather than hard-decided (4.7).
- Monitor disaggregated error rates in production with automated alerting, and recalibrate at least annually or when models, populations, or error rates shift (4.5, 4.6).
- Maintain an alternative verification channel for individuals who cannot reliably pass biometric matching (4.8).

Anti-patterns to avoid:

- Accepting a vendor default threshold, or a threshold validated only on vendor benchmark data, without independent validation (Scenario A, 4.9).
- Optimising the threshold for a single aggregate metric, such as a target FMR, without examining the disaggregated FNMR borne by vulnerable groups (Scenario B).
- Applying one global threshold across populations whose similarity score distributions are known to differ, without analysing those distributions (Scenario C).
- Decommissioning fallback authentication channels once the biometric system is live.
- Treating the threshold as an undocumented configuration value with no owner, rationale, or recalibration schedule.

Industry Considerations

Law Enforcement and Public Safety. Facial recognition thresholds in law enforcement have the highest consequence severity because false matches can lead to wrongful stops, arrests, and detention. Organisations should set the maximum permissible FMR disparity ratio at or below 2:1 and implement mandatory human review for all candidate matches — no automated arrest or detention based solely on a similarity score. Threshold validation must use datasets that represent the demographics of the jurisdiction, not national or international benchmarks that may not reflect local population composition.

Financial Services. Voice and facial authentication thresholds for banking directly affect customer access to financial services. Under equality legislation and financial conduct regulation, systematic lockout of demographic groups constitutes both a conduct risk and a potential indirect discrimination claim. Financial institutions should monitor FNMR by customer age band, language background, and disability status, and should maintain alternative authentication channels for the life of the biometric system.

Border Control and Immigration. E-gate facial matching operates at scale with limited opportunity for individual escalation. Threshold governance must account for passport photograph age (up to 10 years), cross-age matching performance, and the varying facial coverage in passport photographs (head coverings, glasses). Disaggregated analysis should cover nationality groupings, age bands, and visible head covering to ensure equitable processing times.

Healthcare and Identity Verification. Patient identity verification using biometrics in healthcare settings must account for patients whose biometric characteristics have changed due to medical treatment (facial surgery, intubation affecting voice), ageing, or disability. Thresholds must be validated with particular attention to the patient populations most likely to need healthcare services — elderly patients, patients with chronic conditions — who are also the populations most likely to experience higher FNMR.

Maturity Model

Basic Implementation -- The organisation has documented a Threshold Selection Rationale for every production biometric threshold. Each threshold has been validated against data that includes demographic subgroups. Disaggregated FMR and FNMR are computed and recorded. A maximum permissible disparity ratio is defined. A fallback mechanism exists for individuals who cannot pass biometric verification. Vendor defaults are not used without independent validation. This level meets the minimum mandatory requirements.

Intermediate Implementation -- All basic capabilities plus: production monitoring captures disaggregated error rates with statistical process control alerting. An uncertainty band triggers human review for borderline scores. Threshold recalibration reviews are conducted at least annually. Pre-deployment threshold sensitivity analysis documents the full trade-off landscape across candidate threshold values. Fallback mechanisms are monitored for disproportionate use by demographic groups.

Advanced Implementation -- All intermediate capabilities plus: adaptive thresholds adjust to contextual risk factors with independent validation for each adaptive value. Real-time dashboards display disaggregated error rates by demographic group. External audit validates threshold equity claims. The organisation can demonstrate through longitudinal data that its threshold governance process has reduced demographic disparity ratios over successive recalibration cycles. Score normalisation techniques account for known confounding variables (lighting, age differential, acoustic environment) before threshold application.
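One common family of techniques for the score normalisation mentioned above is to standardise raw scores against impostor-score statistics estimated per capture condition (for example, per age-differential band) before applying a single threshold. The sketch below uses z-normalisation; the condition labels, statistics, and normalised threshold are hypothetical.

```python
# Illustrative z-normalisation before threshold application.
# Condition labels, impostor statistics, and the threshold are hypothetical.
IMPOSTOR_STATS = {                  # condition -> (mean, std dev) of impostor scores
    "age_gap_0_2_years": (0.45, 0.09),
    "age_gap_8_10_years": (0.52, 0.11),
}

def normalised_score(raw_score, condition):
    mean, std = IMPOSTOR_STATS[condition]
    return (raw_score - mean) / std    # comparable across conditions after normalisation

Z_THRESHOLD = 3.0   # single threshold applied to the normalised score

for cond in IMPOSTOR_STATS:
    z = normalised_score(0.80, cond)
    print(f"{cond}: z={z:.2f}  accept={z >= Z_THRESHOLD}")
```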

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Threshold Selection Rationale Documentation Verification

Test 8.2: Evaluation Dataset Representativeness Verification

Test 8.3: Disaggregated Error Rate Disparity Analysis

Test 8.4: Production Monitoring Disaggregation Verification

Test 8.5: Recalibration Cycle Compliance Verification

Test 8.6: Uncertainty Band and Human Escalation Verification

Test 8.7: Fallback Mechanism Availability and Equity Verification

Test 8.8: Vendor Default Rejection Verification

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Direct requirement
EU AI Act | Article 10 (Data and Data Governance) | Direct requirement
EU AI Act | Annex III, Category 1 (Biometric Identification) | Scope definition
EU AI Act | Article 14 (Human Oversight) | Supports compliance
Equality Act 2010 (UK) | Section 19 (Indirect Discrimination) | Direct requirement
NIST SP 800-76 | Biometric Specifications for PIV | Supports compliance
NIST AI RMF | MAP 2.3 (Bias Pre-deployment Testing) | Direct requirement
GDPR | Article 9 (Special Categories of Personal Data), Article 35 (DPIA) | Supports compliance
ISO/IEC 19795-1 | Biometric Performance Testing and Reporting | Supports compliance
FCA | SYSC 6.1.1R (Systems and Controls) | Supports compliance

EU AI Act -- Article 9 (Risk Management System)

Article 9 requires that high-risk AI systems operate under a risk management system that identifies and analyses known and foreseeable risks, including risks of bias. Biometric similarity thresholds are a primary mechanism through which bias materialises in biometric systems — a threshold validated on an unrepresentative dataset will produce inequitable error rates across demographic groups. AG-676 operationalises Article 9 by requiring disaggregated error rate analysis, maximum permissible disparity ratios, and ongoing monitoring as core components of the risk management process for biometric systems. A deployer that cannot demonstrate governed threshold selection with equity analysis cannot satisfy Article 9's requirement for bias risk management.

EU AI Act -- Article 10 (Data and Data Governance)

Article 10 requires that training, validation, and testing datasets be "relevant, sufficiently representative, and to the extent possible, free of errors and complete." For biometric threshold validation, this means the evaluation dataset must represent the demographic composition of the deployment population. AG-676's Requirement 4.2 directly implements Article 10 by mandating representative evaluation data. Requirement 4.9 — prohibiting deployment of vendor defaults without independent validation — addresses the common failure mode where the vendor's evaluation data is representative of the vendor's test population but not the deployer's operational population.

Equality Act 2010 (UK) -- Section 19 (Indirect Discrimination)

Section 19 prohibits applying a provision, criterion, or practice that puts persons sharing a protected characteristic at a particular disadvantage compared with persons who do not share it, unless the provision is a proportionate means of achieving a legitimate aim. A biometric similarity threshold is a "provision, criterion, or practice." If the threshold produces an FNMR of 1.2% for one demographic group and 12.1% for another — as in Scenario B — persons in the disadvantaged group are put at a "particular disadvantage" (ten times the lockout rate). The deployer must demonstrate that the threshold is a proportionate means of achieving a legitimate aim, which requires evidence that the threshold was selected with awareness of the disparity, that alternatives were considered, and that mitigations (fallback channels, uncertainty bands) were implemented. AG-676 provides the governance framework for demonstrating proportionality.

NIST AI RMF -- MAP 2.3 (Bias Pre-deployment Testing)

MAP 2.3 calls for pre-deployment testing to identify and assess potential biases in AI systems. For biometric similarity systems, the most critical pre-deployment bias test is the disaggregated error rate analysis at the selected threshold. AG-676 mandates this analysis as a precondition for deployment (Requirement 4.3), directly implementing NIST's pre-deployment bias testing expectation. The threshold sensitivity analysis recommended in Requirement 4.12 extends MAP 2.3 by documenting the full bias landscape across candidate thresholds, not just the selected value.

GDPR -- Article 9 and Article 35

GDPR Article 9 classifies biometric data processed for identification purposes as a special category of personal data, requiring explicit consent or another Article 9(2) basis for processing. Article 35 requires a Data Protection Impact Assessment (DPIA) for processing that is likely to result in a high risk to rights and freedoms, which expressly includes "systematic monitoring of a publicly accessible area on a large scale" and processing of biometric data. A DPIA for a biometric system that does not assess threshold-related demographic disparities is incomplete. AG-676 provides the technical governance artefacts — disaggregated error rate analyses, disparity ratios, and equity monitoring reports — that a DPIA should reference when assessing the proportionality and fairness of a biometric similarity system.

ISO/IEC 19795-1 -- Biometric Performance Testing and Reporting

ISO/IEC 19795-1 establishes the framework for biometric performance testing, including requirements for representative test populations, disaggregated reporting, and statistical rigour. AG-676 aligns with and operationalises 19795-1 by mandating representative evaluation data (Requirement 4.2), disaggregated error rate computation (Requirement 4.3), and retention of testing methodology and results (Requirement 4.10). Organisations that comply with AG-676's threshold validation requirements will substantially satisfy the performance evaluation requirements of 19795-1.

10. Failure Severity

Field | Value
Severity Rating | Critical
Blast Radius | Population-scale — affects every individual processed by the biometric system, with disproportionate harm concentrated on demographic groups experiencing the highest error rates

Consequence chain: An ungoverned biometric similarity threshold is deployed in production. The threshold was selected based on vendor defaults or validated against a dataset that does not represent the deployment population. The threshold produces inequitable error rates across demographic groups, but no disaggregated analysis was performed, so the disparity is invisible. In a law enforcement context, the higher FMR for a specific demographic group generates disproportionate false match alerts. Officers act on these alerts, conducting stops, detentions, and arrests of innocent individuals at a rate that is multiples higher for one demographic group than another. Each wrongful stop causes immediate harm — loss of liberty, psychological distress, reputational damage — and the pattern constitutes systemic discriminatory treatment. In an authentication context, the higher FNMR for elderly, non-native-speaking, or medically affected individuals systematically locks them out of financial services, government services, or physical access. When fallback channels have been eliminated for cost efficiency, locked-out individuals have no alternative. The harm accumulates invisibly because aggregate metrics remain within acceptable bounds — the overall FMR and FNMR are satisfactory, but the subgroup disparities are severe. Discovery occurs through litigation, media investigation, regulatory audit, or complaint accumulation. By the time the disparity is identified, hundreds or thousands of individuals have been affected. Remediation requires threshold recalibration, retrospective review of all decisions made at the ungoverned threshold, potential compensation or redress for affected individuals, and restoration of public trust. In law enforcement, remediation may include vacating arrests, expunging records, and settling civil rights claims — the Detroit case study alone involved settlements exceeding $1 million per wrongful arrest. In financial services, remediation includes customer notification, complaint resolution, regulatory reporting, and potential enforcement action for indirect discrimination. The total cost of ungoverned threshold failure characteristically exceeds the cost of proper threshold governance by two to three orders of magnitude.

Cross-references: AG-001 (Foundational Governance Charter) provides the governance structure within which threshold decisions are made and documented. AG-007 (Bias & Fairness Assessment) establishes the broader framework for identifying and mitigating demographic disparities, of which threshold-induced error rate disparities are a specific instance. AG-019 (Human Escalation & Override Triggers) defines when and how similarity scores in the uncertainty band are escalated to human review. AG-022 (Behavioural Drift Detection) supports detection of threshold degradation over time as population characteristics or environmental conditions shift. AG-055 (Performance & Reliability Baselines) provides the baseline performance framework against which threshold-specific FMR and FNMR targets are set. AG-084 (Continuous Monitoring & Alerting) provides the monitoring infrastructure for ongoing disaggregated error rate tracking. AG-210 (Threshold Calibration Governance) provides the general threshold governance framework that AG-676 specialises for biometric similarity contexts. AG-669 (Biometric Purpose Limitation) ensures that similarity thresholds are applied only for the documented biometric purpose. AG-670 (Liveness Verification) addresses presentation attacks that can affect similarity scores. AG-672 (Behavioural Biometrics Fairness) addresses the broader fairness framework for biometric systems. AG-673 (Biometric Template Protection) ensures that the stored templates against which similarity is computed are protected. AG-675 (Spoof-Response Escalation) defines escalation procedures when spoof attacks distort similarity scores. AG-677 (Consent and Notice for Biometrics) ensures individuals are informed about the biometric comparison and its threshold-based decision logic. AG-678 (Biometric Redress) provides the redress pathway for individuals adversely affected by threshold-based decisions.

Cite this protocol
AgentGoverning. (2026). AG-676: Face and Voice Similarity Threshold Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-676