AG-246

Cultural and Linguistic Fairness Governance

Rights, Ethics & Public Interest ~15 min read AGS v2.1 · April 2026
EU AI Act GDPR FCA NIST

2. Summary

Cultural and Linguistic Fairness Governance requires that AI agents are tested and governed to ensure that performance, safety protections, and rights safeguards do not degrade across languages, dialects, cultural contexts, and communication styles. A conforming system recognises that AI agents trained predominantly on English-language data from Western cultural contexts systematically underperform — in accuracy, safety, and fairness — for users who communicate in other languages, use non-standard dialects, or operate within different cultural norms. This dimension mandates measurement of cross-linguistic and cross-cultural performance differentials and requires remediation where those differentials create unfair outcomes, safety gaps, or rights violations.

3. Example

Scenario A — Safety Filter Degradation in Non-English Languages: A customer-facing AI agent deployed across 14 markets includes a safety filter that detects and blocks harmful content — threats, self-harm expressions, hate speech, and exploitation. Testing reveals that the safety filter achieves 96% recall in English but only 61% recall in Bengali, 54% in Swahili, and 47% in Tagalog. A user communicating self-harm intent in Bengali receives no safety intervention, no escalation, and no crisis resource referral. The same expression in English would trigger an immediate escalation to a human responder.

What went wrong: The safety filter was trained predominantly on English-language data. Cross-linguistic safety performance was not tested before deployment. The safety gap was invisible because monitoring metrics were aggregated across languages rather than disaggregated. Users in non-English markets received materially worse safety protection — a direct violation of the principle that safety protections should be equitable. Consequence: Regulatory investigation in the affected markets. Finding that the deployment violated local consumer protection requirements by providing a materially inferior service. £3.2 million remediation programme. Mandatory cross-linguistic safety testing before any future market expansion.

Scenario B — Cultural Misinterpretation in Clinical Triage: An AI triage agent for a multi-ethnic healthcare system evaluates patient symptoms through a chat interface. A patient from a South Asian cultural background describes chest pain as "my heart is heavy with worry" — an idiomatic expression indicating emotional distress combined with physical symptoms. The agent interprets "heart is heavy" literally as a cardiac symptom and assigns high cardiac triage priority, ordering an unnecessary emergency department visit. Conversely, a patient from the same background describes actual cardiac symptoms using culturally specific metaphors that the agent does not recognise as clinically urgent, and assigns low priority.

What went wrong: The agent was trained on clinical language patterns from a predominantly Western English-speaking patient population. Cultural idioms for describing symptoms were not mapped to clinical categories. The agent's interpretation was culturally biased — over-interpreting some cultural expressions and under-interpreting others. No cross-cultural clinical validation was conducted. Consequence: Clinical governance investigation. Finding that 340 patients from South Asian backgrounds received incorrect triage priority over a 6-month period. 12 patients experienced delayed treatment for urgent conditions. NHS trust mandatory action plan.

Scenario C — Dialect-Based Service Quality Degradation: A banking AI agent serves customers across the UK. Performance analysis reveals that the agent's natural language understanding accuracy varies significantly by dialect: 94% for Received Pronunciation English, 91% for General American English, 78% for Scottish English, 72% for Jamaican Patois-influenced British English, and 64% for Welsh English. Customers whose speech patterns diverge from the training distribution receive lower-quality service — more misunderstandings, more failed requests, more escalations to hold queues, and longer resolution times. Over 12 months, customers in the lowest-performing dialect group have an average resolution time of 14 minutes versus 4 minutes for the highest-performing group.

What went wrong: The agent's language model was trained predominantly on standard English variants. No dialect-specific performance testing was conducted. Service quality was measured in aggregate, masking the 350% resolution-time differential between dialect groups. The dialect groups most affected were correlated with ethnic minority status, creating indirect racial discrimination in service delivery. Consequence: FCA Consumer Duty investigation. Finding that outcomes for ethnic minority customers were materially worse. £1.8 million remediation. Requirement to implement cross-dialect performance monitoring.

4. Requirement Statement

Scope: This dimension applies to all AI agents deployed to serve users across more than one language, dialect, or cultural context — or deployed in a single market where the user population includes speakers of multiple languages, dialects, or cultural backgrounds. In practice, this includes virtually all customer-facing agents, all public-sector agents, and all cross-border agents. The scope also covers agents that make decisions based on text, speech, or behavioural inputs that may be influenced by language or culture — such as sentiment analysis, risk scoring, content moderation, and clinical triage. The scope extends to safety-critical functions (content safety, crisis detection, clinical triage) where linguistic or cultural performance degradation creates safety risk. Agents that process only structured numerical data with no linguistic or cultural input may claim exclusion if documented.

4.1. A conforming system MUST measure and report agent performance disaggregated by language for all supported languages, at minimum annually and before each major model update.

4.2. A conforming system MUST measure and report safety-function performance (content safety, crisis detection, harmful content filtering) disaggregated by language, and ensure that safety recall does not fall below 80% of the highest-performing language for any supported language.

4.3. A conforming system MUST test agent performance across major dialect variants and communication styles within each supported language, and remediate where performance differentials exceed 15 percentage points.

4.4. A conforming system MUST evaluate decision-making functions (scoring, classification, triage, recommendation) for cultural bias — testing whether outcomes differ based on culturally specific expression patterns, idioms, or communication norms.

4.5. A conforming system MUST provide equivalent service quality across all supported languages — meaning that resolution time, accuracy, escalation rate, and user satisfaction do not systematically differ by language in ways that disadvantage minority-language users.

4.6. A conforming system MUST log language-specific performance metrics and make them available for regulatory review.

4.7. A conforming system SHOULD implement cultural context adaptation — recognising and correctly interpreting culturally specific idioms, expressions, and communication patterns — for languages and cultures with significant user populations.

4.8. A conforming system SHOULD engage native speakers and cultural domain experts in the development and testing of cross-cultural AI capabilities, rather than relying solely on translation of English-language content.

4.9. A conforming system SHOULD provide users with the option to select their preferred language and dialect, and ensure that the selected variant receives equivalent performance to the default.

4.10. A conforming system MAY implement automatic language and dialect detection with performance-quality routing — directing users to the highest-performing model variant for their detected language.

5. Rationale

AI agents exhibit a well-documented performance gradient across languages and cultures. Models trained predominantly on English-language data perform best on English-language inputs and degrade — often dramatically — on other languages. This degradation is not uniform: it is steepest for languages with the least training data, the least structural similarity to English, and the least commercial value in the training data marketplace. The result is that users who speak less-resourced languages receive systematically worse AI service — less accurate responses, less effective safety protections, less fair decision outcomes.

This performance gradient is not merely a quality issue — it is a fairness and rights issue. When an AI agent's safety filter catches 96% of self-harm expressions in English but only 47% in Tagalog, the Tagalog-speaking user receives materially worse safety protection. When a clinical triage agent misinterprets culturally specific symptom descriptions, the patient receives incorrect clinical priority. When a banking agent's accuracy drops from 94% to 64% between dialect groups, customers from linguistic minorities receive measurably worse service.

The linguistic and cultural performance gradient correlates strongly with racial, ethnic, and national origin demographics. Languages and dialects that receive the worst AI performance are disproportionately those spoken by ethnic minorities, immigrant communities, and populations in lower-income countries. The effect is structurally discriminatory: AI systems, through their training data distribution, reproduce and amplify existing linguistic hierarchies.

AG-246 requires organisations to measure this gradient, set minimum performance thresholds, and remediate where the gradient creates unfair outcomes, safety gaps, or rights violations. The principle is that an organisation that deploys an AI agent to serve a linguistically diverse population bears responsibility for ensuring that the service quality, safety, and fairness are equitable across that population — not just for the majority-language users.

6. Implementation Guidance

AG-246 requires cross-linguistic and cross-cultural performance measurement, minimum threshold enforcement, and remediation as structural governance activities. Implementation must address performance measurement, safety parity, cultural adaptation, and remediation.

Recommended patterns:

Anti-patterns to avoid:

Industry Considerations

Healthcare. Clinical language varies dramatically across cultures. Symptom descriptions, pain expressions, mental health terminology, and health beliefs differ in ways that affect triage accuracy and clinical safety. AI agents in healthcare must be validated against culturally specific clinical language patterns, not just translated medical terminology.

Financial Services. Financial literacy, product terminology, and risk communication norms vary across cultures. An AI agent that explains mortgage terms using US/UK financial concepts may be incomprehensible to users from financial cultures with different product structures. Cross-cultural financial communication validation is essential for Consumer Duty compliance.

Public Sector. Government services must be accessible in all official and recognised minority languages. In the UK, Welsh Language Standards require equivalent service in Welsh and English. In the EU, multilingual service requirements vary by member state. AI agents in public sector roles must meet these linguistic requirements at equivalent quality levels.

Maturity Model

Basic Implementation — The agent supports multiple languages through machine translation of English-language content. Performance is measured in aggregate across all languages. Safety filters are the same model applied across all languages. No dialect-specific testing. No cultural validation. Language-specific performance gaps are unknown.

Intermediate Implementation — Performance is measured and reported disaggregated by language. Safety function testing is conducted across all supported languages with native-speaker-created test suites. Cross-linguistic safety recall meets the 80% minimum threshold. Dialect-specific testing is conducted for the primary language. Cultural validation panels are established for at least the top 3 non-English cultural contexts. Language-specific performance metrics are reviewed quarterly. Remediation plans exist for languages below threshold.

Advanced Implementation — All intermediate capabilities plus: dialect-specific testing covers all languages with significant user populations. Cultural validation panels cover all major cultural contexts. Performance-tiered language support is implemented with honest disclosure. Cross-linguistic safety recall meets 90% of the best-performing language for critical safety functions. Language and cultural performance metrics are board-reported KPIs. Independent cross-linguistic audit is conducted annually. The organisation publishes language-specific performance data.

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Cross-Linguistic Performance Parity

Test 8.2: Cross-Linguistic Safety Recall Parity

Test 8.3: Dialect Performance Differential

Test 8.4: Cultural Bias in Decision-Making

Test 8.5: Service Quality Equivalence Across Languages

Test 8.6: Safety Filter Language Coverage Logging

Conformance Scoring

9. Regulatory Mapping

RegulationProvisionRelationship Type
Equality Act 2010Section 9 (Race Including Ethnic Origin), Section 19 (Indirect Discrimination)Direct requirement
EU AI ActArticle 10 (Data and Data Governance — Representativeness)Direct requirement
Welsh Language (Wales) Measure 2011Welsh Language StandardsDirect requirement
EU Charter of Fundamental RightsArticle 21 (Non-Discrimination Including Language)Supports compliance
FCA Consumer DutyPS22/9 (Good Outcomes for All Customers)Supports compliance
GDPRArticle 12 (Transparent Information in Clear and Plain Language)Supports compliance
NIST AI RMFMAP 2.3, MEASURE 2.6Supports compliance

Equality Act 2010 — Race and Indirect Discrimination

Language and dialect performance differentials in AI agents correlate with race and ethnic origin. When a banking agent achieves 94% accuracy for Received Pronunciation English but 72% for Jamaican Patois-influenced British English, the users experiencing degraded service are disproportionately from Black Caribbean ethnic backgrounds. This is indirect racial discrimination under Section 19 — a provision (the AI agent) that puts persons of a particular racial group at a particular disadvantage. AG-246's performance parity requirements directly address this by requiring measurement of and remediation for dialect-correlated performance differentials.

EU AI Act — Article 10 (Data and Data Governance)

Article 10 requires that training, validation, and testing data be representative of the persons on whom the high-risk AI system is intended to be used. For multilingual and multicultural deployments, this means training data must represent the linguistic and cultural diversity of the user population. AG-246's cross-linguistic testing and cultural validation requirements operationalise this representativeness obligation.

Welsh Language (Wales) Measure 2011

The Welsh Language Standards require specified organisations to provide services in Welsh on a basis equivalent to English. AI agents deployed in Welsh public services must meet these standards. AG-246's equivalent service quality requirement directly supports compliance.

FCA Consumer Duty — Good Outcomes for All Customers

The Consumer Duty requires firms to deliver good outcomes for all retail customers. Where AI agent performance degrades for users communicating in minority languages or non-standard dialects — resulting in longer resolution times, lower accuracy, and higher escalation rates — the firm is not delivering equivalent outcomes. AG-246 provides the measurement and remediation framework to ensure linguistic fairness in customer outcomes.

10. Failure Severity

FieldValue
Severity RatingHigh
Blast RadiusCohort-level — systematically affecting users from linguistic and cultural minorities, who may represent 10-40% of the user population depending on the deployment context

Consequence chain: Failure of cultural and linguistic fairness governance produces a two-tier AI service in which majority-language, majority-culture users receive high-quality service while minority-language, minority-culture users receive degraded service across accuracy, safety, and fairness dimensions. The immediate technical failure is a performance gradient — the agent works well for some users and poorly for others based on their language and cultural background. The safety consequence is the most urgent: when safety functions (content filtering, crisis detection, clinical triage) degrade in non-English languages, the users in those languages receive materially worse safety protection. The fairness consequence is that decision-making functions produce culturally biased outcomes — misinterpreting cultural idioms, penalising non-standard communication styles, and systematically under-serving linguistic minorities. The legal exposure includes indirect discrimination under equality law, non-compliance with language legislation (Welsh Language Standards, Canadian Official Languages Act), and Consumer Duty failures. The reputational consequence is acute because linguistic discrimination is visible and relatable — users can directly compare their experience with that of majority-language users.

Cross-references: AG-242 (Non-Discrimination Outcome Testing Governance) provides the general non-discrimination testing framework that AG-246 extends to linguistic and cultural dimensions. AG-241 (Accessibility and Disability Accommodation Governance) addresses related accessibility concerns for communication disabilities. AG-051 (Fundamental Rights Impact Assessment) requires assessment of cultural and linguistic rights impacts. AG-118 (Fair Treatment and Vulnerability) recognises limited language proficiency as a vulnerability factor. AG-239 through AG-248 are sibling dimensions within the Rights, Ethics & Public Interest landscape.

Cite this protocol
AgentGoverning. (2026). AG-246: Cultural and Linguistic Fairness Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-246