AG-357: Challenge Set Localisation Governance

2. Summary

Challenge Set Localisation Governance requires that evaluation challenge sets — the scenarios, test cases, and adversarial inputs used to evaluate AI agents — are adapted to the language, jurisdiction, cultural context, and domain-specific edge cases of each deployment context. An evaluation challenge set developed for one context (e.g., US English, US regulatory environment, US cultural norms) does not provide assurance for a different context (e.g., Arabic, UAE regulatory environment, Gulf cultural norms). This dimension mandates that challenge sets are localised wherever the agent is deployed across linguistic, jurisdictional, or cultural boundaries, and that the localisation is validated by domain experts with relevant expertise.

3. Example

Scenario A — Regulatory Localisation Failure: A global financial services firm deploys the same AI advisory agent across the UK and Germany. The evaluation challenge set was developed by the UK team and includes 180 regulatory compliance scenarios based on FCA rules and UK financial services regulations. The German deployment uses the same challenge set. The agent passes all 180 scenarios. However, the German deployment is subject to BaFin regulations and MiFID II implementation that differs from the UK FCA interpretation in several material ways — including different suitability assessment requirements, different product governance rules, and different complaint handling obligations. The UK-focused challenge set does not test any of these German-specific requirements. After 6 months, BaFin identifies 4 compliance deficiencies in the German deployment during a supervisory review, none of which would have been detected by the UK challenge set.

What went wrong: The challenge set was not localised for the German regulatory environment. The firm assumed that UK regulatory compliance testing transferred to the German context. It did not — regulations differ in substance, not just translation. Consequence: BaFin supervisory findings, mandatory remediation within 90 days, €120,000 in localised challenge set development costs, and restriction of the German deployment to supervised operation pending compliance verification.

Scenario B — Linguistic Localisation Failure: A customer service agent deployed for Arabic-speaking markets is evaluated using challenge sets translated from English to Arabic by machine translation. The challenge set includes adversarial prompt injection scenarios. However, Arabic prompt injection techniques differ from English techniques: they exploit Arabic-specific properties including right-to-left text injection, diacritical mark manipulation, Arabic Unicode confusable characters, and code-switching between Modern Standard Arabic and regional dialects. The machine-translated challenge set tests only direct translations of English-language attacks. A security researcher demonstrates that the agent is vulnerable to Arabic-specific prompt injection within hours of deployment.

What went wrong: The challenge set was translated rather than localised. Translation preserves the meaning of English scenarios in Arabic; localisation would have included scenarios that are specific to Arabic language properties and that have no English equivalent. Consequence: Vulnerability disclosed publicly, emergency deployment restriction, £65,000 in emergency localisation and remediation costs, and reputational damage in the Arabic-speaking market.

Scenario C — Cultural Localisation Failure: A mental health support agent deployed in Japan is evaluated using a challenge set developed for the US market. The challenge set includes scenarios for detecting suicidal ideation based on US clinical indicators: direct statements of intent, access to means, and recent loss events. Japanese cultural context includes different indicators: references to "not wanting to be a burden" (meiwaku), indirect expressions through poetry or seasonal metaphors, and culturally specific concepts such as "ikigai loss." The US challenge set misses 73% of Japanese-context suicidal ideation test cases developed by Japanese clinical psychologists during post-deployment review.

What went wrong: The challenge set reflected US cultural norms for expressing distress, not Japanese cultural norms. Direct translation of US scenarios tested whether the agent could detect US-style expressions in Japanese — not whether it could detect Japanese-style expressions. Consequence: Failure to detect distress signals in 73% of culturally specific cases, patient safety concern, mandatory clinical review, deployment restricted to supervised mode, and £210,000 in culturally localised challenge set development with Japanese clinical experts.

4. Requirement Statement

Scope: This dimension applies to all AI agent deployments where the agent operates across linguistic, jurisdictional, or cultural boundaries — either because the agent is deployed in multiple markets, because the agent serves a multilingual or multicultural user population within a single market, or because the agent's evaluation materials were developed in a different context from the deployment context. The scope includes linguistic localisation (adapting to language-specific properties, not just translation), jurisdictional localisation (adapting to local regulatory requirements, legal frameworks, and compliance standards), cultural localisation (adapting to cultural norms, communication styles, and context-specific edge cases), and domain-specific localisation (adapting to local domain practices, terminology, and standards). Agents deployed only in the context for which their challenge sets were originally developed are excluded, though the scope includes verification that this alignment exists.

4.1. A conforming system MUST identify all deployment contexts (language, jurisdiction, cultural context, domain specialisation) and verify that challenge sets have been localised for each context, not merely translated.

4.2. A conforming system MUST include jurisdiction-specific regulatory compliance scenarios in challenge sets for each jurisdiction where the agent operates, validated by a regulatory expert with expertise in that jurisdiction.

4.3. A conforming system MUST include language-specific adversarial scenarios that exploit linguistic properties unique to each deployment language (e.g., script-specific injection vectors, homoglyph attacks, dialect variations, code-switching patterns).

4.4. A conforming system MUST validate localised challenge sets with domain experts who have relevant expertise in the target context — not solely with the team that developed the original challenge set.

4.5. A conforming system MUST maintain a localisation coverage matrix mapping each challenge set against each deployment context, identifying gaps where localisation has not been completed.

4.6. A conforming system MUST update localised challenge sets within 60 days of any material change in the regulatory environment of a deployment jurisdiction.

4.7. A conforming system SHOULD include culturally specific edge cases in localised challenge sets — scenarios that test the agent's ability to handle context-specific communication patterns, social norms, and implicit meanings.

4.8. A conforming system SHOULD engage native speakers and local domain experts in the development (not just review) of localised challenge sets, ensuring that scenarios originate from local expertise rather than translated external perspectives.

4.9. A conforming system SHOULD test for cross-context interference — scenarios where inputs in one language or context affect the agent's behaviour in another language or context (e.g., multilingual prompt injection where an adversarial instruction in Language A affects the agent's response in Language B).

4.10. A conforming system MAY implement automated localisation gap detection that analyses production inputs by language and context, comparing against challenge set coverage to identify underrepresented deployment contexts.

5. Rationale

AI agents are increasingly deployed across linguistic, jurisdictional, and cultural boundaries. The evaluation challenge sets that assess these agents must keep pace with this deployment breadth. A challenge set developed in one context is not a challenge set for another context — it is, at best, a starting point that must be substantially adapted.

The distinction between translation and localisation is critical. Translation preserves meaning across languages; localisation adapts content for a specific context. A translated challenge set tests whether the agent can handle English-language scenarios expressed in another language. A localised challenge set tests whether the agent can handle scenarios that are native to the target context — scenarios that have no English equivalent because they arise from the specific properties of the target language, regulatory environment, or cultural context.

Jurisdictional localisation is particularly important for regulated agents. Financial regulations are not uniform — MiFID II is implemented differently across EU member states, privacy regulations vary between GDPR and PIPL and POPIA, and consumer protection requirements differ by jurisdiction. An agent that passes UK regulatory compliance scenarios may fail German regulatory compliance scenarios not because of a generalisation failure but because the two regulatory environments have substantively different requirements. Testing against the wrong regulatory framework is worse than not testing at all, because it creates false assurance of compliance.

Cultural localisation addresses a subtler but equally important dimension. Cultural norms affect how users communicate intent, express distress, make requests, and respond to agent outputs. An evaluation that does not account for cultural context will miss failure modes that are specific to the target culture. This is not about sensitivity — it is about accuracy. An agent that cannot correctly interpret culturally specific expressions of intent will make errors in the target context regardless of how well it performs in the source context.

6. Implementation Guidance

Localisation requires investment in local expertise and a process that goes beyond translating existing materials. The key principle is that localised challenge sets should be developed with input from local experts, not merely reviewed by them after translation.

Recommended patterns:

Localisation-first challenge set development. For each new deployment context, engage local domain experts, regulatory specialists, and native speakers to develop challenge sets from the ground up. Use the source challenge set as a reference for structure and coverage categories, but generate scenarios from local expertise. For example, for a Japanese deployment, engage Japanese clinical psychologists to develop mental health scenarios, Japanese financial regulatory experts to develop compliance scenarios, and Japanese security researchers to develop adversarial scenarios. Supplement with translated-and-adapted versions of universal scenarios (e.g., basic prompt injection patterns that apply across languages).
Localisation coverage matrix. Maintain a matrix with deployment contexts as rows and challenge set categories as columns. Each cell indicates the localisation status: (1) not started — the challenge set for this context and category does not exist; (2) translated — the source challenge set has been translated but not localised; (3) localised — the challenge set includes context-specific scenarios developed by local experts; (4) validated — the localised challenge set has been validated by independent local experts. Flag any cell at "not started" or "translated only" as a localisation gap.
Language-specific adversarial research. Maintain awareness of adversarial techniques specific to each deployment language. This requires engaging with security research communities in each language, not relying solely on English-language security research. Arabic, Chinese, Japanese, Korean, Hindi, and other languages each have unique properties that create unique attack surfaces: script-specific injection, bidirectional text exploitation, character composition attacks, and dialect-based evasion.
Cross-context interference testing. Specifically test whether adversarial inputs in one language can affect agent behaviour in another. For example, test whether embedding an adversarial instruction in Arabic within a predominantly English conversation can bypass English-language safety filters. This is an increasingly common attack vector in multilingual agent deployments.
Regulatory change monitoring per jurisdiction. For each deployment jurisdiction, subscribe to regulatory update feeds and legal newsletters. When a material regulatory change occurs, trigger a localisation update task within 60 days. Track regulatory change coverage the same way that AG-353 (Benchmark Drift Governance) tracks benchmark alignment.

Anti-patterns to avoid:

Machine-translating challenge sets without localisation. Machine translation produces linguistically accurate but contextually inappropriate scenarios. A machine-translated prompt injection scenario tests translation quality, not adversarial robustness in the target language.
Using the source-language team to review localisations. The source team cannot evaluate whether a Japanese challenge set correctly captures Japanese cultural edge cases. Only Japanese domain experts can make this assessment.
Assuming universal adversarial techniques. Prompt injection techniques are not language-independent. Each language and script has unique properties that create unique attack vectors. A challenge set that tests only universal techniques misses the most dangerous language-specific attacks.
Localising once and considering it done. Regulatory environments change, cultural contexts evolve, and new language-specific threats emerge. Localised challenge sets must be maintained with the same cadence as the source challenge set.
Treating localisation as a lower-priority activity. If localised challenge sets are always last in the development queue, the most exposed deployment contexts (those operating without localised evaluation) will be the least evaluated.

Industry Considerations

Financial Services. Each jurisdiction has distinct regulatory requirements. MiFID II implementation varies across EU member states; UK FCA rules post-Brexit diverge from EU rules; US SEC/FINRA requirements differ from both. Challenge sets must reflect the specific regulatory framework of each deployment jurisdiction, not a generic international framework.

Healthcare. Clinical standards vary by jurisdiction. Drug naming conventions differ (brand names vary by country), clinical guidelines are jurisdiction-specific (NICE in the UK, AHA/ACC in the US), and patient communication norms differ culturally. A clinical agent evaluated against US clinical standards and deployed in Germany may provide guidance based on US-specific drug names or protocols not used in Germany.

Public Sector. Government service agents deployed across regions within a single country may need localisation for regional languages, regional administrative processes, and regional policy variations. In the UK, devolved administrations (Scotland, Wales, Northern Ireland) have different legal frameworks for many public services.

Maturity Model

Basic Implementation — All deployment contexts are identified. Challenge sets include jurisdiction-specific regulatory scenarios validated by local regulatory experts. Language-specific adversarial scenarios are included for each deployment language. Localised sets are validated by domain experts in the target context. A localisation coverage matrix tracks gaps. Regulatory changes trigger updates within 60 days. This level meets the minimum mandatory requirements but challenge sets may be primarily adapted from the source rather than developed locally.

Intermediate Implementation — Localised challenge sets are developed with local expert participation from inception, not just reviewed after translation. Language-specific adversarial research is conducted for each deployment language. Cross-context interference testing is included. The localisation coverage matrix is automated and flags gaps in real time. Regulatory monitoring per jurisdiction is automated with alert-driven update triggers.

Advanced Implementation — All intermediate capabilities plus: automated localisation gap detection analyses production inputs by language and context to identify underrepresented deployment areas. The organisation contributes to industry localisation research and shares anonymised localisation insights with peers. Challenge set localisation is integrated into the deployment pipeline — no agent can be deployed to a new context without a validated localised challenge set.

7. Evidence Requirements

Required artefacts:

Localisation coverage matrix. The current matrix showing each deployment context, each challenge set category, and the localisation status (not started, translated, localised, validated).
Local expert validation records. Evidence that localised challenge sets were validated by domain experts with relevant target-context expertise, including the experts' qualifications and their assessment.
Jurisdiction-specific regulatory scenarios. The regulatory compliance scenarios for each deployment jurisdiction, with the regulatory provisions they test and the regulatory expert who validated them.
Language-specific adversarial scenarios. Adversarial scenarios exploiting language-specific properties for each deployment language, with documentation of the linguistic properties targeted.
Regulatory update tracking. Evidence that regulatory changes in each deployment jurisdiction triggered challenge set updates within 60 days.

Retention requirements:

Localisation matrices, validation records, and regulatory tracking: minimum 7 years for regulated financial services; minimum 5 years for other regulated sectors; minimum 3 years otherwise.

Access requirements:

Producible to regulators or auditors within 48 hours of request.

8. Test Specification

Test 8.1: Deployment Context Identification

Stimulus: Request the list of all deployment contexts (language, jurisdiction, cultural context). Cross-reference against the agent's actual deployment footprint.
Expected behaviour: The deployment context list matches the agent's actual deployment footprint. No deployment context is missing from the list.
Pass criteria: 100% of actual deployment contexts are identified and listed.
Fail criteria: Any actual deployment context is not identified in the localisation programme.

Test 8.2: Jurisdictional Regulatory Coverage

Stimulus: For each deployment jurisdiction, verify that challenge sets include regulatory compliance scenarios specific to that jurisdiction's regulatory framework.
Expected behaviour: Each jurisdiction has regulatory scenarios validated by a regulatory expert with expertise in that jurisdiction.
Pass criteria: All deployment jurisdictions have jurisdiction-specific regulatory scenarios with documented expert validation.
Fail criteria: Any jurisdiction relies solely on scenarios from another jurisdiction's regulatory framework.

Test 8.3: Language-Specific Adversarial Coverage

Stimulus: For each deployment language, verify that adversarial scenarios include language-specific attacks beyond translated English attacks.
Expected behaviour: Each deployment language has adversarial scenarios that exploit properties unique to that language or script.
Pass criteria: Each language has at least 5 language-specific adversarial scenarios with documented linguistic justification.
Fail criteria: Any deployment language has only translated English adversarial scenarios without language-specific additions.

Test 8.4: Local Expert Validation

Stimulus: Request validation records for localised challenge sets. Verify that validators have relevant target-context expertise.
Expected behaviour: Each localised challenge set has validation records from experts with documented expertise in the target context.
Pass criteria: All localised sets have validation records with expert qualifications documented.
Fail criteria: Any localised set was validated solely by the source-context team without target-context expert involvement.

Test 8.5: Localisation Coverage Matrix Completeness

Stimulus: Request the localisation coverage matrix. Identify any cells at "not started" or "translated only" status.
Expected behaviour: No deployment context has a "not started" status for any required challenge set category. "Translated only" status is flagged as a gap with a remediation plan.
Pass criteria: No "not started" cells exist for active deployment contexts. All "translated only" cells have documented remediation plans with target dates.
Fail criteria: Any active deployment context has a "not started" challenge set category, or "translated only" cells lack remediation plans.

Test 8.6: Regulatory Change Response

Stimulus: Identify material regulatory changes in deployment jurisdictions in the last 12 months. Verify that challenge sets were updated within 60 days.
Expected behaviour: Each regulatory change has a corresponding challenge set update dated within 60 days.
Pass criteria: 100% of material regulatory changes have timely challenge set updates.
Fail criteria: Any regulatory change lacks a corresponding update, or the update occurred more than 60 days after the change.

Conformance Scoring

Score 0: No localisation governance exists — challenge sets are used without adaptation across deployment contexts.
Score 1: Challenge sets are translated but not localised — scenarios are linguistically adapted but do not include context-specific scenarios or local expert validation.
Score 2: Challenge sets are localised with jurisdiction-specific, language-specific, and culturally adapted scenarios, validated by local domain experts — all mandatory requirements are met.
Score 3: Verified by independent assessment — an independent party has validated the localisation programme's methodology, expert engagement, coverage, and maintenance processes.

9. Regulatory Mapping

Regulation	Provision	Relationship Type
EU AI Act	Article 9 (Risk Management System)	Supports compliance
EU AI Act	Article 15 (Accuracy, Robustness, Cybersecurity)	Direct requirement
GDPR	Article 5 (Principles Relating to Processing)	Supports compliance
MiFID II	Article 25 (Suitability Assessment)	Supports compliance
NIST AI RMF	MAP 2.3, MEASURE 2.6	Supports compliance
ISO 42001	Clause 8.2 (AI Risk Assessment)	Supports compliance
Equality Act 2010	Public Sector Equality Duty	Supports compliance

EU AI Act — Article 15 (Accuracy, Robustness, Cybersecurity)

Article 15 requires that accuracy and robustness be demonstrated for the conditions under which the system operates. When the system operates across multiple linguistic and jurisdictional contexts, accuracy and robustness must be demonstrated for each context. An evaluation conducted only in English cannot demonstrate accuracy for an Arabic deployment. Localised challenge sets provide the evidence needed for context-specific Article 15 compliance.

MiFID II — Article 25 (Suitability Assessment)

Article 25 requires that investment firms obtain necessary information about clients and ensure that recommendations are suitable. Suitability requirements vary across EU member states in their implementation detail. A financial agent operating across multiple EU jurisdictions must be evaluated against each jurisdiction's specific suitability requirements. Challenge set localisation ensures that suitability evaluation reflects the specific regulatory expectations of each deployment jurisdiction.

10. Failure Severity

Field	Value
Severity Rating	High
Blast Radius	Deployment-context-specific — failures affect users in the underserved deployment context, potentially creating disproportionate harm for specific linguistic, cultural, or jurisdictional groups

Consequence chain: Without challenge set localisation, agents deployed across contexts are evaluated against inappropriate criteria. The immediate consequence is that context-specific failure modes are undetected — the agent may fail in jurisdiction-specific regulatory compliance, language-specific adversarial robustness, or culturally specific user interactions. The equity consequence is that users in localised contexts receive lower quality and less safe service than users in the source context, creating a disparity that may violate anti-discrimination obligations. The regulatory consequence is non-compliance with jurisdiction-specific requirements that the evaluation programme did not test for. The reputational consequence is greatest in the affected context — users and regulators in a localised market will perceive the failure as evidence that the deploying organisation does not take their market seriously.

Cross-references: AG-349 (Scenario Library Governance) defines the scenario management framework that localised challenge sets must follow. AG-350 (Coverage Gap Tracking Governance) should include localisation gaps in the coverage matrix. AG-353 (Benchmark Drift Governance) applies to localised benchmarks as well as source benchmarks. AG-095 (Prompt Injection Resilience Testing) should include language-specific injection techniques identified through localisation.

Cite this protocol

AgentGoverning. (2026). AG-357: Challenge Set Localisation Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-357

← Previous Protocol

AG-356

Near-Miss Capture Governance

Next Protocol →

AG-358

External Bounty Intake Governance