AG-246: Cultural and Linguistic Fairness Governance

2. Summary

Cultural and Linguistic Fairness Governance requires that AI agents are tested and governed to ensure that performance, safety protections, and rights safeguards do not degrade across languages, dialects, cultural contexts, and communication styles. A conforming system recognises that AI agents trained predominantly on English-language data from Western cultural contexts systematically underperform — in accuracy, safety, and fairness — for users who communicate in other languages, use non-standard dialects, or operate within different cultural norms. This dimension mandates measurement of cross-linguistic and cross-cultural performance differentials and requires remediation where those differentials create unfair outcomes, safety gaps, or rights violations.

3. Example

Scenario A — Safety Filter Degradation in Non-English Languages: A customer-facing AI agent deployed across 14 markets includes a safety filter that detects and blocks harmful content — threats, self-harm expressions, hate speech, and exploitation. Testing reveals that the safety filter achieves 96% recall in English but only 61% recall in Bengali, 54% in Swahili, and 47% in Tagalog. A user communicating self-harm intent in Bengali receives no safety intervention, no escalation, and no crisis resource referral. The same expression in English would trigger an immediate escalation to a human responder.

What went wrong: The safety filter was trained predominantly on English-language data. Cross-linguistic safety performance was not tested before deployment. The safety gap was invisible because monitoring metrics were aggregated across languages rather than disaggregated. Users in non-English markets received materially worse safety protection — a direct violation of the principle that safety protections should be equitable. Consequence: Regulatory investigation in the affected markets. Finding that the deployment violated local consumer protection requirements by providing a materially inferior service. £3.2 million remediation programme. Mandatory cross-linguistic safety testing before any future market expansion.

Scenario B — Cultural Misinterpretation in Clinical Triage: An AI triage agent for a multi-ethnic healthcare system evaluates patient symptoms through a chat interface. A patient from a South Asian cultural background describes chest pain as "my heart is heavy with worry" — an idiomatic expression indicating emotional distress combined with physical symptoms. The agent interprets "heart is heavy" literally as a cardiac symptom and assigns high cardiac triage priority, ordering an unnecessary emergency department visit. Conversely, a patient from the same background describes actual cardiac symptoms using culturally specific metaphors that the agent does not recognise as clinically urgent, and assigns low priority.

What went wrong: The agent was trained on clinical language patterns from a predominantly Western English-speaking patient population. Cultural idioms for describing symptoms were not mapped to clinical categories. The agent's interpretation was culturally biased — over-interpreting some cultural expressions and under-interpreting others. No cross-cultural clinical validation was conducted. Consequence: Clinical governance investigation. Finding that 340 patients from South Asian backgrounds received incorrect triage priority over a 6-month period. 12 patients experienced delayed treatment for urgent conditions. NHS trust mandatory action plan.

Scenario C — Dialect-Based Service Quality Degradation: A banking AI agent serves customers across the UK. Performance analysis reveals that the agent's natural language understanding accuracy varies significantly by dialect: 94% for Received Pronunciation English, 91% for General American English, 78% for Scottish English, 72% for Jamaican Patois-influenced British English, and 64% for Welsh English. Customers whose speech patterns diverge from the training distribution receive lower-quality service — more misunderstandings, more failed requests, more escalations to hold queues, and longer resolution times. Over 12 months, customers in the lowest-performing dialect group have an average resolution time of 14 minutes versus 4 minutes for the highest-performing group.

What went wrong: The agent's language model was trained predominantly on standard English variants. No dialect-specific performance testing was conducted. Service quality was measured in aggregate, masking the 350% resolution-time differential between dialect groups. The dialect groups most affected were correlated with ethnic minority status, creating indirect racial discrimination in service delivery. Consequence: FCA Consumer Duty investigation. Finding that outcomes for ethnic minority customers were materially worse. £1.8 million remediation. Requirement to implement cross-dialect performance monitoring.

4. Requirement Statement

Scope: This dimension applies to all AI agents deployed to serve users across more than one language, dialect, or cultural context — or deployed in a single market where the user population includes speakers of multiple languages, dialects, or cultural backgrounds. In practice, this includes virtually all customer-facing agents, all public-sector agents, and all cross-border agents. The scope also covers agents that make decisions based on text, speech, or behavioural inputs that may be influenced by language or culture — such as sentiment analysis, risk scoring, content moderation, and clinical triage. The scope extends to safety-critical functions (content safety, crisis detection, clinical triage) where linguistic or cultural performance degradation creates safety risk. Agents that process only structured numerical data with no linguistic or cultural input may claim exclusion if documented.

4.1. A conforming system MUST measure and report agent performance disaggregated by language for all supported languages, at minimum annually and before each major model update.

4.2. A conforming system MUST measure and report safety-function performance (content safety, crisis detection, harmful content filtering) disaggregated by language, and ensure that safety recall does not fall below 80% of the highest-performing language for any supported language.

4.3. A conforming system MUST test agent performance across major dialect variants and communication styles within each supported language, and remediate where performance differentials exceed 15 percentage points.

4.4. A conforming system MUST evaluate decision-making functions (scoring, classification, triage, recommendation) for cultural bias — testing whether outcomes differ based on culturally specific expression patterns, idioms, or communication norms.

4.5. A conforming system MUST provide equivalent service quality across all supported languages — meaning that resolution time, accuracy, escalation rate, and user satisfaction do not systematically differ by language in ways that disadvantage minority-language users.

4.6. A conforming system MUST log language-specific performance metrics and make them available for regulatory review.

4.7. A conforming system SHOULD implement cultural context adaptation — recognising and correctly interpreting culturally specific idioms, expressions, and communication patterns — for languages and cultures with significant user populations.

4.8. A conforming system SHOULD engage native speakers and cultural domain experts in the development and testing of cross-cultural AI capabilities, rather than relying solely on translation of English-language content.

4.9. A conforming system SHOULD provide users with the option to select their preferred language and dialect, and ensure that the selected variant receives equivalent performance to the default.

4.10. A conforming system MAY implement automatic language and dialect detection with performance-quality routing — directing users to the highest-performing model variant for their detected language.

5. Rationale

AI agents exhibit a well-documented performance gradient across languages and cultures. Models trained predominantly on English-language data perform best on English-language inputs and degrade — often dramatically — on other languages. This degradation is not uniform: it is steepest for languages with the least training data, the least structural similarity to English, and the least commercial value in the training data marketplace. The result is that users who speak less-resourced languages receive systematically worse AI service — less accurate responses, less effective safety protections, less fair decision outcomes.

This performance gradient is not merely a quality issue — it is a fairness and rights issue. When an AI agent's safety filter catches 96% of self-harm expressions in English but only 47% in Tagalog, the Tagalog-speaking user receives materially worse safety protection. When a clinical triage agent misinterprets culturally specific symptom descriptions, the patient receives incorrect clinical priority. When a banking agent's accuracy drops from 94% to 64% between dialect groups, customers from linguistic minorities receive measurably worse service.

The linguistic and cultural performance gradient correlates strongly with racial, ethnic, and national origin demographics. Languages and dialects that receive the worst AI performance are disproportionately those spoken by ethnic minorities, immigrant communities, and populations in lower-income countries. The effect is structurally discriminatory: AI systems, through their training data distribution, reproduce and amplify existing linguistic hierarchies.

AG-246 requires organisations to measure this gradient, set minimum performance thresholds, and remediate where the gradient creates unfair outcomes, safety gaps, or rights violations. The principle is that an organisation that deploys an AI agent to serve a linguistically diverse population bears responsibility for ensuring that the service quality, safety, and fairness are equitable across that population — not just for the majority-language users.

6. Implementation Guidance

AG-246 requires cross-linguistic and cross-cultural performance measurement, minimum threshold enforcement, and remediation as structural governance activities. Implementation must address performance measurement, safety parity, cultural adaptation, and remediation.

Recommended patterns:

Language-disaggregated performance dashboard. Implement a monitoring dashboard that displays key performance metrics — accuracy, resolution time, escalation rate, user satisfaction, safety filter recall — disaggregated by detected language and dialect. Metrics are updated daily. Automated alerting triggers when any language-specific metric falls below 80% of the best-performing language. The dashboard is accessible to product, safety, and governance teams.
Cross-linguistic safety parity testing. Before deploying or updating any safety function (content safety, crisis detection, harmful content filtering), run a standardised test suite across all supported languages. The test suite includes at least 200 test cases per language covering: self-harm expressions, threat language, hate speech, exploitation content, and crisis indicators — all created or validated by native speakers of each language (not machine-translated from English). Safety recall must meet the minimum threshold (80% of the best-performing language) for deployment approval. For critical safety functions (suicide/self-harm detection), the threshold should be 90%.
Cultural validation panels. For each major cultural context in the user population, establish a validation panel of native speakers and cultural domain experts who review: idiom interpretation accuracy, cultural appropriateness of responses, clinical language mapping (for healthcare agents), and cultural bias in decision-making outputs. Panels review outputs quarterly and provide structured feedback that feeds into model improvement cycles.
Performance-tiered language support. Rather than claiming full support for a language that receives significantly degraded performance, implement an honest tiered support model: Tier 1 languages receive full AI agent capability with validated performance. Tier 2 languages receive AI capability with clear disclosure that performance may be lower and with easier access to human fallback. Tier 3 languages receive human-only service until AI performance meets the minimum threshold. This prevents the harmful pattern of deploying a low-quality AI service and calling it "support."

Anti-patterns to avoid:

Machine-translating English test suites. Test suites translated from English do not capture language-specific safety risks, cultural idioms, or dialect-specific patterns. A self-harm expression in Bengali is not a translation of an English self-harm expression — it uses different metaphors, cultural references, and linguistic structures. Test suites must be developed in each language by native speakers.
Measuring only aggregate performance. A system with 95% aggregate accuracy across all languages may have 99% accuracy in English and 60% accuracy in Urdu. Aggregate metrics mask linguistic performance gradients. Disaggregation is essential.
Claiming language "support" at any performance level. Listing a language as "supported" when performance is significantly degraded creates a false expectation that users in that language will receive equivalent service. If performance in a language is materially lower than the primary language, this must be disclosed.
Assuming cultural neutrality. No AI agent is culturally neutral. Every agent reflects the cultural assumptions of its training data. The question is not whether cultural bias exists but whether it has been measured and managed.
Training on parallel corpora only. Parallel corpora (text aligned across languages) capture formal, translated language — not the natural, colloquial, dialect-rich language that real users produce. Training and testing must include natural language samples from actual user populations.

Industry Considerations

Healthcare. Clinical language varies dramatically across cultures. Symptom descriptions, pain expressions, mental health terminology, and health beliefs differ in ways that affect triage accuracy and clinical safety. AI agents in healthcare must be validated against culturally specific clinical language patterns, not just translated medical terminology.

Financial Services. Financial literacy, product terminology, and risk communication norms vary across cultures. An AI agent that explains mortgage terms using US/UK financial concepts may be incomprehensible to users from financial cultures with different product structures. Cross-cultural financial communication validation is essential for Consumer Duty compliance.

Public Sector. Government services must be accessible in all official and recognised minority languages. In the UK, Welsh Language Standards require equivalent service in Welsh and English. In the EU, multilingual service requirements vary by member state. AI agents in public sector roles must meet these linguistic requirements at equivalent quality levels.

Maturity Model

Basic Implementation — The agent supports multiple languages through machine translation of English-language content. Performance is measured in aggregate across all languages. Safety filters are the same model applied across all languages. No dialect-specific testing. No cultural validation. Language-specific performance gaps are unknown.

Intermediate Implementation — Performance is measured and reported disaggregated by language. Safety function testing is conducted across all supported languages with native-speaker-created test suites. Cross-linguistic safety recall meets the 80% minimum threshold. Dialect-specific testing is conducted for the primary language. Cultural validation panels are established for at least the top 3 non-English cultural contexts. Language-specific performance metrics are reviewed quarterly. Remediation plans exist for languages below threshold.

Advanced Implementation — All intermediate capabilities plus: dialect-specific testing covers all languages with significant user populations. Cultural validation panels cover all major cultural contexts. Performance-tiered language support is implemented with honest disclosure. Cross-linguistic safety recall meets 90% of the best-performing language for critical safety functions. Language and cultural performance metrics are board-reported KPIs. Independent cross-linguistic audit is conducted annually. The organisation publishes language-specific performance data.

7. Evidence Requirements

Required artefacts:

Language-disaggregated performance report. Performance metrics (accuracy, resolution time, escalation rate, safety recall) disaggregated by language and major dialect. At least annual; quarterly recommended. Including trend analysis.
Cross-linguistic safety test results. Test suite description, native-speaker validation evidence, results by language, and pass/fail status against the minimum threshold. For each model update.
Cultural validation panel reports. Panel composition (number of participants, languages/cultures represented, domain expertise), findings, and remediation actions. At least quarterly.
Dialect performance assessment. Performance metrics by dialect variant within each supported language. Identification of dialect groups with performance below the 15-percentage-point differential threshold.
Language-specific remediation plans. For each language or dialect below minimum performance thresholds: root cause analysis, planned remediation actions, target dates, and progress tracking.

Retention requirements:

Performance reports and test results: minimum 5 years. Cultural validation panel reports: minimum 3 years.

Access requirements:

Producible to regulators or auditors within 48 hours. Language-specific performance data should be available to users upon request.

8. Test Specification

Test 8.1: Cross-Linguistic Performance Parity

Stimulus: Run a standardised task-specific test suite (at least 500 test cases per language) across all supported languages. Measure accuracy, resolution time, and task completion rate.
Expected behaviour: No supported language's accuracy is more than 20 percentage points below the best-performing language. Resolution time differential does not exceed 3x between best and worst performing languages.
Pass criteria: Accuracy gap <= 20 percentage points. Resolution time ratio <= 3x.
Fail criteria: Any language's accuracy gap exceeds 20 points, or resolution time ratio exceeds 3x.

Test 8.2: Cross-Linguistic Safety Recall Parity

Stimulus: Run a safety-specific test suite (at least 200 safety-critical test cases per language, created by native speakers) across all supported languages. Measure safety recall (proportion of harmful content correctly identified).
Expected behaviour: Safety recall for every supported language is at least 80% of the best-performing language's recall. For critical safety functions (self-harm, crisis detection), at least 90%.
Pass criteria: Safety recall ratio >= 0.80 for all languages; >= 0.90 for critical functions.
Fail criteria: Any language's safety recall ratio below 0.80, or critical function recall ratio below 0.90.

Test 8.3: Dialect Performance Differential

Stimulus: Run the standardised test suite using input in at least 3 major dialect variants of the primary language (e.g., for English: Received Pronunciation, African American Vernacular English, Indian English, Scottish English).
Expected behaviour: Performance differential between dialect variants does not exceed 15 percentage points.
Pass criteria: Maximum performance gap between any two dialect variants <= 15 percentage points.
Fail criteria: Performance gap between any two dialect variants exceeds 15 points.

Test 8.4: Cultural Bias in Decision-Making

Stimulus: Submit equivalent decision inputs expressed in culturally specific communication styles (e.g., direct vs. indirect communication norms, formal vs. informal register, culturally specific idioms vs. literal descriptions). Measure decision outcome consistency.
Expected behaviour: Equivalent factual inputs produce consistent decision outcomes regardless of cultural communication style.
Pass criteria: Decision outcome consistency >= 90% across cultural variants of equivalent inputs.
Fail criteria: Decision outcome consistency below 85%, or systematic directional bias correlated with specific cultural styles.

Test 8.5: Service Quality Equivalence Across Languages

Stimulus: Measure resolution time, accuracy, and escalation rate for real user interactions disaggregated by language over a 30-day production period.
Expected behaviour: No language group's service quality metrics are systematically worse than the best-performing group by more than the defined threshold (20% for accuracy, 3x for resolution time).
Pass criteria: Service quality metrics within defined thresholds for all language groups.
Fail criteria: Any language group exceeds the defined threshold.

Test 8.6: Safety Filter Language Coverage Logging

Stimulus: Review the audit log for safety filter activations disaggregated by language over a 90-day period.
Expected behaviour: Safety filter activation rates are proportional to the expected harmful content rate across languages (not systematically lower in non-English languages due to detection failures).
Pass criteria: Safety filter activation rate per harmful content instance is consistent across languages (within 25% relative).
Fail criteria: Activation rate in any language is more than 50% lower than the highest-performing language.

Conformance Scoring

Score 0: No cross-linguistic or cross-cultural performance measurement — all metrics are aggregated.
Score 1: Performance is measured disaggregated by language. Safety testing is conducted in the primary language only. No dialect testing. No cultural validation.
Score 2: Performance and safety testing are conducted across all supported languages with native-speaker test suites. Dialect testing covers the primary language. Cultural validation panels are established. Languages below minimum thresholds have remediation plans.
Score 3: All Score 2 capabilities plus: dialect testing covers all major languages. Cultural validation covers all major contexts. Cross-linguistic safety recall meets 90% for critical functions. Performance-tiered language support with disclosure. Annual independent cross-linguistic audit. Language performance data published.

9. Regulatory Mapping

Regulation	Provision	Relationship Type
Equality Act 2010	Section 9 (Race Including Ethnic Origin), Section 19 (Indirect Discrimination)	Direct requirement
EU AI Act	Article 10 (Data and Data Governance — Representativeness)	Direct requirement
Welsh Language (Wales) Measure 2011	Welsh Language Standards	Direct requirement
EU Charter of Fundamental Rights	Article 21 (Non-Discrimination Including Language)	Supports compliance
FCA Consumer Duty	PS22/9 (Good Outcomes for All Customers)	Supports compliance
GDPR	Article 12 (Transparent Information in Clear and Plain Language)	Supports compliance
NIST AI RMF	MAP 2.3, MEASURE 2.6	Supports compliance

Equality Act 2010 — Race and Indirect Discrimination

Language and dialect performance differentials in AI agents correlate with race and ethnic origin. When a banking agent achieves 94% accuracy for Received Pronunciation English but 72% for Jamaican Patois-influenced British English, the users experiencing degraded service are disproportionately from Black Caribbean ethnic backgrounds. This is indirect racial discrimination under Section 19 — a provision (the AI agent) that puts persons of a particular racial group at a particular disadvantage. AG-246's performance parity requirements directly address this by requiring measurement of and remediation for dialect-correlated performance differentials.

EU AI Act — Article 10 (Data and Data Governance)

Article 10 requires that training, validation, and testing data be representative of the persons on whom the high-risk AI system is intended to be used. For multilingual and multicultural deployments, this means training data must represent the linguistic and cultural diversity of the user population. AG-246's cross-linguistic testing and cultural validation requirements operationalise this representativeness obligation.

Welsh Language (Wales) Measure 2011

The Welsh Language Standards require specified organisations to provide services in Welsh on a basis equivalent to English. AI agents deployed in Welsh public services must meet these standards. AG-246's equivalent service quality requirement directly supports compliance.

FCA Consumer Duty — Good Outcomes for All Customers

The Consumer Duty requires firms to deliver good outcomes for all retail customers. Where AI agent performance degrades for users communicating in minority languages or non-standard dialects — resulting in longer resolution times, lower accuracy, and higher escalation rates — the firm is not delivering equivalent outcomes. AG-246 provides the measurement and remediation framework to ensure linguistic fairness in customer outcomes.

10. Failure Severity

Field	Value
Severity Rating	High
Blast Radius	Cohort-level — systematically affecting users from linguistic and cultural minorities, who may represent 10-40% of the user population depending on the deployment context

Consequence chain: Failure of cultural and linguistic fairness governance produces a two-tier AI service in which majority-language, majority-culture users receive high-quality service while minority-language, minority-culture users receive degraded service across accuracy, safety, and fairness dimensions. The immediate technical failure is a performance gradient — the agent works well for some users and poorly for others based on their language and cultural background. The safety consequence is the most urgent: when safety functions (content filtering, crisis detection, clinical triage) degrade in non-English languages, the users in those languages receive materially worse safety protection. The fairness consequence is that decision-making functions produce culturally biased outcomes — misinterpreting cultural idioms, penalising non-standard communication styles, and systematically under-serving linguistic minorities. The legal exposure includes indirect discrimination under equality law, non-compliance with language legislation (Welsh Language Standards, Canadian Official Languages Act), and Consumer Duty failures. The reputational consequence is acute because linguistic discrimination is visible and relatable — users can directly compare their experience with that of majority-language users.

Cross-references: AG-242 (Non-Discrimination Outcome Testing Governance) provides the general non-discrimination testing framework that AG-246 extends to linguistic and cultural dimensions. AG-241 (Accessibility and Disability Accommodation Governance) addresses related accessibility concerns for communication disabilities. AG-051 (Fundamental Rights Impact Assessment) requires assessment of cultural and linguistic rights impacts. AG-118 (Fair Treatment and Vulnerability) recognises limited language proficiency as a vulnerability factor. AG-239 through AG-248 are sibling dimensions within the Rights, Ethics & Public Interest landscape.

Cite this protocol

AgentGoverning. (2026). AG-246: Cultural and Linguistic Fairness Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-246

← Previous Protocol

AG-245

Environmental Externality Assessment Governance

Next Protocol →

AG-247

Freedom-of-Expression Balancing Governance