AG-357

Challenge Set Localisation Governance

Evaluation, Benchmarking & Red Teaming ~14 min read AGS v2.1 · April 2026
EU AI Act GDPR FCA NIST ISO 42001

2. Summary

Challenge Set Localisation Governance requires that evaluation challenge sets — the scenarios, test cases, and adversarial inputs used to evaluate AI agents — are adapted to the language, jurisdiction, cultural context, and domain-specific edge cases of each deployment context. An evaluation challenge set developed for one context (e.g., US English, US regulatory environment, US cultural norms) does not provide assurance for a different context (e.g., Arabic, UAE regulatory environment, Gulf cultural norms). This dimension mandates that challenge sets are localised wherever the agent is deployed across linguistic, jurisdictional, or cultural boundaries, and that the localisation is validated by domain experts with relevant expertise.

3. Example

Scenario A — Regulatory Localisation Failure: A global financial services firm deploys the same AI advisory agent across the UK and Germany. The evaluation challenge set was developed by the UK team and includes 180 regulatory compliance scenarios based on FCA rules and UK financial services regulations. The German deployment uses the same challenge set. The agent passes all 180 scenarios. However, the German deployment is subject to BaFin regulations and MiFID II implementation that differs from the UK FCA interpretation in several material ways — including different suitability assessment requirements, different product governance rules, and different complaint handling obligations. The UK-focused challenge set does not test any of these German-specific requirements. After 6 months, BaFin identifies 4 compliance deficiencies in the German deployment during a supervisory review, none of which would have been detected by the UK challenge set.

What went wrong: The challenge set was not localised for the German regulatory environment. The firm assumed that UK regulatory compliance testing transferred to the German context. It did not — regulations differ in substance, not just translation. Consequence: BaFin supervisory findings, mandatory remediation within 90 days, €120,000 in localised challenge set development costs, and restriction of the German deployment to supervised operation pending compliance verification.

Scenario B — Linguistic Localisation Failure: A customer service agent deployed for Arabic-speaking markets is evaluated using challenge sets translated from English to Arabic by machine translation. The challenge set includes adversarial prompt injection scenarios. However, Arabic prompt injection techniques differ from English techniques: they exploit Arabic-specific properties including right-to-left text injection, diacritical mark manipulation, Arabic Unicode confusable characters, and code-switching between Modern Standard Arabic and regional dialects. The machine-translated challenge set tests only direct translations of English-language attacks. A security researcher demonstrates that the agent is vulnerable to Arabic-specific prompt injection within hours of deployment.

What went wrong: The challenge set was translated rather than localised. Translation preserves the meaning of English scenarios in Arabic; localisation would have included scenarios that are specific to Arabic language properties and that have no English equivalent. Consequence: Vulnerability disclosed publicly, emergency deployment restriction, £65,000 in emergency localisation and remediation costs, and reputational damage in the Arabic-speaking market.

Scenario C — Cultural Localisation Failure: A mental health support agent deployed in Japan is evaluated using a challenge set developed for the US market. The challenge set includes scenarios for detecting suicidal ideation based on US clinical indicators: direct statements of intent, access to means, and recent loss events. Japanese cultural context includes different indicators: references to "not wanting to be a burden" (meiwaku), indirect expressions through poetry or seasonal metaphors, and culturally specific concepts such as "ikigai loss." The US challenge set misses 73% of Japanese-context suicidal ideation test cases developed by Japanese clinical psychologists during post-deployment review.

What went wrong: The challenge set reflected US cultural norms for expressing distress, not Japanese cultural norms. Direct translation of US scenarios tested whether the agent could detect US-style expressions in Japanese — not whether it could detect Japanese-style expressions. Consequence: Failure to detect distress signals in 73% of culturally specific cases, patient safety concern, mandatory clinical review, deployment restricted to supervised mode, and £210,000 in culturally localised challenge set development with Japanese clinical experts.

4. Requirement Statement

Scope: This dimension applies to all AI agent deployments where the agent operates across linguistic, jurisdictional, or cultural boundaries — either because the agent is deployed in multiple markets, because the agent serves a multilingual or multicultural user population within a single market, or because the agent's evaluation materials were developed in a different context from the deployment context. The scope includes linguistic localisation (adapting to language-specific properties, not just translation), jurisdictional localisation (adapting to local regulatory requirements, legal frameworks, and compliance standards), cultural localisation (adapting to cultural norms, communication styles, and context-specific edge cases), and domain-specific localisation (adapting to local domain practices, terminology, and standards). Agents deployed only in the context for which their challenge sets were originally developed are excluded, though the scope includes verification that this alignment exists.

4.1. A conforming system MUST identify all deployment contexts (language, jurisdiction, cultural context, domain specialisation) and verify that challenge sets have been localised for each context, not merely translated.

4.2. A conforming system MUST include jurisdiction-specific regulatory compliance scenarios in challenge sets for each jurisdiction where the agent operates, validated by a regulatory expert with expertise in that jurisdiction.

4.3. A conforming system MUST include language-specific adversarial scenarios that exploit linguistic properties unique to each deployment language (e.g., script-specific injection vectors, homoglyph attacks, dialect variations, code-switching patterns).

4.4. A conforming system MUST validate localised challenge sets with domain experts who have relevant expertise in the target context — not solely with the team that developed the original challenge set.

4.5. A conforming system MUST maintain a localisation coverage matrix mapping each challenge set against each deployment context, identifying gaps where localisation has not been completed.

4.6. A conforming system MUST update localised challenge sets within 60 days of any material change in the regulatory environment of a deployment jurisdiction.

4.7. A conforming system SHOULD include culturally specific edge cases in localised challenge sets — scenarios that test the agent's ability to handle context-specific communication patterns, social norms, and implicit meanings.

4.8. A conforming system SHOULD engage native speakers and local domain experts in the development (not just review) of localised challenge sets, ensuring that scenarios originate from local expertise rather than translated external perspectives.

4.9. A conforming system SHOULD test for cross-context interference — scenarios where inputs in one language or context affect the agent's behaviour in another language or context (e.g., multilingual prompt injection where an adversarial instruction in Language A affects the agent's response in Language B).

4.10. A conforming system MAY implement automated localisation gap detection that analyses production inputs by language and context, comparing against challenge set coverage to identify underrepresented deployment contexts.

5. Rationale

AI agents are increasingly deployed across linguistic, jurisdictional, and cultural boundaries. The evaluation challenge sets that assess these agents must keep pace with this deployment breadth. A challenge set developed in one context is not a challenge set for another context — it is, at best, a starting point that must be substantially adapted.

The distinction between translation and localisation is critical. Translation preserves meaning across languages; localisation adapts content for a specific context. A translated challenge set tests whether the agent can handle English-language scenarios expressed in another language. A localised challenge set tests whether the agent can handle scenarios that are native to the target context — scenarios that have no English equivalent because they arise from the specific properties of the target language, regulatory environment, or cultural context.

Jurisdictional localisation is particularly important for regulated agents. Financial regulations are not uniform — MiFID II is implemented differently across EU member states, privacy regulations vary between GDPR and PIPL and POPIA, and consumer protection requirements differ by jurisdiction. An agent that passes UK regulatory compliance scenarios may fail German regulatory compliance scenarios not because of a generalisation failure but because the two regulatory environments have substantively different requirements. Testing against the wrong regulatory framework is worse than not testing at all, because it creates false assurance of compliance.

Cultural localisation addresses a subtler but equally important dimension. Cultural norms affect how users communicate intent, express distress, make requests, and respond to agent outputs. An evaluation that does not account for cultural context will miss failure modes that are specific to the target culture. This is not about sensitivity — it is about accuracy. An agent that cannot correctly interpret culturally specific expressions of intent will make errors in the target context regardless of how well it performs in the source context.

6. Implementation Guidance

Localisation requires investment in local expertise and a process that goes beyond translating existing materials. The key principle is that localised challenge sets should be developed with input from local experts, not merely reviewed by them after translation.

Recommended patterns:

Anti-patterns to avoid:

Industry Considerations

Financial Services. Each jurisdiction has distinct regulatory requirements. MiFID II implementation varies across EU member states; UK FCA rules post-Brexit diverge from EU rules; US SEC/FINRA requirements differ from both. Challenge sets must reflect the specific regulatory framework of each deployment jurisdiction, not a generic international framework.

Healthcare. Clinical standards vary by jurisdiction. Drug naming conventions differ (brand names vary by country), clinical guidelines are jurisdiction-specific (NICE in the UK, AHA/ACC in the US), and patient communication norms differ culturally. A clinical agent evaluated against US clinical standards and deployed in Germany may provide guidance based on US-specific drug names or protocols not used in Germany.

Public Sector. Government service agents deployed across regions within a single country may need localisation for regional languages, regional administrative processes, and regional policy variations. In the UK, devolved administrations (Scotland, Wales, Northern Ireland) have different legal frameworks for many public services.

Maturity Model

Basic Implementation — All deployment contexts are identified. Challenge sets include jurisdiction-specific regulatory scenarios validated by local regulatory experts. Language-specific adversarial scenarios are included for each deployment language. Localised sets are validated by domain experts in the target context. A localisation coverage matrix tracks gaps. Regulatory changes trigger updates within 60 days. This level meets the minimum mandatory requirements but challenge sets may be primarily adapted from the source rather than developed locally.

Intermediate Implementation — Localised challenge sets are developed with local expert participation from inception, not just reviewed after translation. Language-specific adversarial research is conducted for each deployment language. Cross-context interference testing is included. The localisation coverage matrix is automated and flags gaps in real time. Regulatory monitoring per jurisdiction is automated with alert-driven update triggers.

Advanced Implementation — All intermediate capabilities plus: automated localisation gap detection analyses production inputs by language and context to identify underrepresented deployment areas. The organisation contributes to industry localisation research and shares anonymised localisation insights with peers. Challenge set localisation is integrated into the deployment pipeline — no agent can be deployed to a new context without a validated localised challenge set.

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Deployment Context Identification

Test 8.2: Jurisdictional Regulatory Coverage

Test 8.3: Language-Specific Adversarial Coverage

Test 8.4: Local Expert Validation

Test 8.5: Localisation Coverage Matrix Completeness

Test 8.6: Regulatory Change Response

Conformance Scoring

9. Regulatory Mapping

RegulationProvisionRelationship Type
EU AI ActArticle 9 (Risk Management System)Supports compliance
EU AI ActArticle 15 (Accuracy, Robustness, Cybersecurity)Direct requirement
GDPRArticle 5 (Principles Relating to Processing)Supports compliance
MiFID IIArticle 25 (Suitability Assessment)Supports compliance
NIST AI RMFMAP 2.3, MEASURE 2.6Supports compliance
ISO 42001Clause 8.2 (AI Risk Assessment)Supports compliance
Equality Act 2010Public Sector Equality DutySupports compliance

EU AI Act — Article 15 (Accuracy, Robustness, Cybersecurity)

Article 15 requires that accuracy and robustness be demonstrated for the conditions under which the system operates. When the system operates across multiple linguistic and jurisdictional contexts, accuracy and robustness must be demonstrated for each context. An evaluation conducted only in English cannot demonstrate accuracy for an Arabic deployment. Localised challenge sets provide the evidence needed for context-specific Article 15 compliance.

MiFID II — Article 25 (Suitability Assessment)

Article 25 requires that investment firms obtain necessary information about clients and ensure that recommendations are suitable. Suitability requirements vary across EU member states in their implementation detail. A financial agent operating across multiple EU jurisdictions must be evaluated against each jurisdiction's specific suitability requirements. Challenge set localisation ensures that suitability evaluation reflects the specific regulatory expectations of each deployment jurisdiction.

10. Failure Severity

FieldValue
Severity RatingHigh
Blast RadiusDeployment-context-specific — failures affect users in the underserved deployment context, potentially creating disproportionate harm for specific linguistic, cultural, or jurisdictional groups

Consequence chain: Without challenge set localisation, agents deployed across contexts are evaluated against inappropriate criteria. The immediate consequence is that context-specific failure modes are undetected — the agent may fail in jurisdiction-specific regulatory compliance, language-specific adversarial robustness, or culturally specific user interactions. The equity consequence is that users in localised contexts receive lower quality and less safe service than users in the source context, creating a disparity that may violate anti-discrimination obligations. The regulatory consequence is non-compliance with jurisdiction-specific requirements that the evaluation programme did not test for. The reputational consequence is greatest in the affected context — users and regulators in a localised market will perceive the failure as evidence that the deploying organisation does not take their market seriously.

Cross-references: AG-349 (Scenario Library Governance) defines the scenario management framework that localised challenge sets must follow. AG-350 (Coverage Gap Tracking Governance) should include localisation gaps in the coverage matrix. AG-353 (Benchmark Drift Governance) applies to localised benchmarks as well as source benchmarks. AG-095 (Prompt Injection Resilience Testing) should include language-specific injection techniques identified through localisation.

Cite this protocol
AgentGoverning. (2026). AG-357: Challenge Set Localisation Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-357