Embodied Validation — Sim-to-Real, SOTIF and Physical Red-Teaming

Embodied AI, Humanoids & Robot Fleets · AGS v2.1 · 2026-06-06

EU AI Act · NIST AI RMF · ISO 42001

AGS Embodied AI (Group L) | Embodied AI, Humanoids & Robot Fleets | Version 3.0

1. Definition

Embodied Validation governs the evidence required before an AI-driven physical agent operates around people: simulation/world-model validation with documented sim-to-real transfer, safety-of-the-intended-functionality (SOTIF) analysis for perception-driven autonomy, physical and predictive red-teaming of the agent's learned policy, and a structured safety-assurance case.

Learning-enabled physical agents fail in ways that component-failure analysis misses — through perception limits, distributional shift, and emergent policy behaviour. This dimension requires the specific validation evidence that those failure modes demand.

2. Scope

In scope: sim-to-real validation evidence; SOTIF analysis (performance limitations, triggering conditions) for perception/autonomy; physical and predictive red-teaming of the policy; structured safety-assurance case for the embodied agent.

Out of scope: the force/speed-limiting controls (AG-835) and the fail-safe-stop mechanism (AG-836) themselves. This dimension validates that those controls, and the agent's behaviour as a whole, are adequate; it governs *validation evidence for embodied AI*.

3. Why This Matters

An embodied agent validated only in simulation, or only against component failures, can behave unsafely in the real world when perception degrades, conditions shift, or the learned policy meets a situation to which it generalises badly. Without sim-to-real evidence, SOTIF analysis, and physical red-teaming, an organisation is deploying an unproven physical system around people. This validation turns "it worked in the lab" into evidenced safety for real operation.

4. Requirements

R1: Before operating around people, an embodied agent's behaviour MUST be validated in simulation/world-models, with documented evidence of sim-to-real transfer (that is, evidence that simulation-trained behaviour holds in reality).
R2: A SOTIF analysis MUST identify performance limitations and triggering conditions of the perception/autonomy stack (e.g. lighting, occlusion, novel objects) and the mitigations for them.
R3: The learned policy MUST be physically and/or predictively red-teamed to surface unsafe behaviours (policy violations, hazardous motions) before they manifest in operation.
R4: A structured safety-assurance case (argument plus evidence) MUST be maintained for the embodied agent, covering the learning-enabled failure modes, not only component reliability.
R5: Validation MUST cover the realistic operating envelope, including degraded conditions and human-interaction scenarios, and MUST be repeated on material change to the policy, hardware, or environment.
R6: Hardware-in-the-loop testing SHOULD be used to validate safety-relevant behaviour that simulation cannot fully capture.
R7: Validation evidence and the assurance case MUST be retained and made available to relevant safety authorities.
R8: Residual risks identified in validation MUST be documented, mitigated or accepted at an accountable level, and monitored in operation.
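
The change-triggered revalidation in R5 can be illustrated with a fingerprint comparison. This is a minimal sketch, not a prescribed mechanism: the `ValidatedConfig` record and its three fields are hypothetical names chosen for illustration, and the sketch treats any difference as a material change, whereas a real programme would define its own materiality criteria.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ValidatedConfig:
    """Hypothetical record of what a validation run covered."""
    policy_version: str
    hardware_revision: str
    environment_id: str


def fingerprint(config: ValidatedConfig) -> str:
    """Stable SHA-256 digest of the configuration fields."""
    payload = json.dumps(asdict(config), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_revalidation(validated: ValidatedConfig, deployed: ValidatedConfig) -> bool:
    """True when policy, hardware, or environment differs from what was validated.

    Assumption: every difference counts as material.
    """
    return fingerprint(validated) != fingerprint(deployed)
```

A deployment gate could call `needs_revalidation` with the last validated record and the record describing what is about to ship, and block release when it returns `True`.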

5. Maturity Model

Basic: The agent is tested in simulation and basic real-world trials before deployment, with results recorded.
Intermediate: Documented sim-to-real transfer, SOTIF analysis, policy red-teaming, and a structured safety-assurance case.
Advanced: Degraded-condition and human-interaction coverage, hardware-in-the-loop validation, change-triggered revalidation, and authority-available evidence.

6. Test Criteria

Test 6.1: Sim-to-Real Evidence

Stimulus: Review validation evidence for an embodied agent.
Expected: Documented sim-to-real transfer evidence exists for safety-relevant behaviour.
Fail: Only simulation results, with no real-world transfer evidence.
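
One way a reviewer might check Test 6.1 is to compare per-scenario success rates from simulation against real-world trials. The sketch below is illustrative only: the scenario names, the use of success rate as the metric, and the 0.05 default tolerance are assumptions, not values defined by this protocol.

```python
from typing import Dict, List


def transfer_gaps(
    sim: Dict[str, float],
    real: Dict[str, float],
    tolerance: float = 0.05,  # assumed acceptable drop in success rate
) -> List[str]:
    """List scenarios with no real-world evidence or a sim-to-real gap above tolerance."""
    findings: List[str] = []
    for scenario, sim_rate in sorted(sim.items()):
        if scenario not in real:
            # Simulation-only results are the fail condition of Test 6.1.
            findings.append(f"{scenario}: no real-world trials")
        elif sim_rate - real[scenario] > tolerance:
            gap = sim_rate - real[scenario]
            findings.append(f"{scenario}: gap {gap:.2f} exceeds {tolerance:.2f}")
    return findings
```

An empty list means every simulated scenario has real-world results within tolerance; any entry is a candidate finding for the assurance case.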

Test 6.2: SOTIF Coverage

Stimulus: Review the SOTIF analysis.
Expected: Perception/autonomy performance limits and triggering conditions are identified and mitigated.
Fail: No analysis of perception-driven failure modes.
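
The SOTIF review in Test 6.2 can be supported by a simple register of triggering conditions. The following sketch assumes a hypothetical `TriggeringCondition` record and a caller-supplied list of conditions the analysis is expected to cover; neither is defined by ISO 21448 or by this protocol.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TriggeringCondition:
    """Hypothetical register entry for one triggering condition."""
    name: str                      # e.g. "lighting", "occlusion"
    limitation: str                # performance limitation it exposes
    mitigations: List[str] = field(default_factory=list)


def sotif_gaps(
    register: List[TriggeringCondition], required: List[str]
) -> Dict[str, List[str]]:
    """Report required conditions missing from the register and entries without mitigations."""
    names = {condition.name for condition in register}
    return {
        "unanalysed": [name for name in required if name not in names],
        "unmitigated": [c.name for c in register if not c.mitigations],
    }
```

Both lists being empty corresponds to the expected outcome of the test; a non-empty `unanalysed` list corresponds to the fail condition for those conditions.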

Test 6.3: Policy Red-Teaming

Stimulus: Review red-teaming of the learned policy.
Expected: Physical and/or predictive red-teaming was performed, and the unsafe behaviours it surfaced were addressed before deployment.
Fail: The policy was not red-teamed for unsafe behaviour.
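
As a rough illustration of predictive red-teaming, the sketch below uses seeded random search to look for scenarios in which a policy commands a hazardous motion. Everything here is an assumption for illustration: `toy_policy` stands in for a learned policy, the hazard rule (more than 0.25 m/s within 0.5 m of a person) is invented, and random search is a deliberately simple substitute for the more targeted search methods a real red-team would use.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical scenario: (true distance to person in metres, perception error in metres)
Scenario = Tuple[float, float]


def toy_policy(perceived_distance_m: float) -> float:
    """Illustrative stand-in for a learned policy: speed (m/s) grows with perceived distance."""
    return max(0.0, min(1.0, perceived_distance_m))


def is_hazardous(true_distance_m: float, speed_mps: float) -> bool:
    """Assumed hazard rule: faster than 0.25 m/s within 0.5 m of a person."""
    return true_distance_m < 0.5 and speed_mps > 0.25


def red_team(
    policy: Callable[[float], float], trials: int = 1000, seed: int = 0
) -> List[Scenario]:
    """Sample scenarios at random and return those where the policy's action is hazardous."""
    rng = random.Random(seed)  # seeded so findings are reproducible
    findings: List[Scenario] = []
    for _ in range(trials):
        true_distance = rng.uniform(0.0, 3.0)
        perception_error = rng.uniform(-0.5, 0.5)  # models degraded perception
        speed = policy(true_distance + perception_error)
        if is_hazardous(true_distance, speed):
            findings.append((true_distance, perception_error))
    return findings
```

Each returned scenario is a concrete counterexample that can be replayed, mitigated, and then recorded as evidence that the unsafe behaviour was addressed before deployment.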

7. Scoring

Score	Criteria
0	Embodied agent deployed around people without sim-to-real, SOTIF, or red-teaming evidence
1	Some simulation/trial testing but no SOTIF, transfer evidence, or policy red-teaming
2	Sim-to-real evidence, SOTIF analysis, policy red-teaming, structured assurance case
3	Degraded/human-interaction coverage, hardware-in-the-loop, change-triggered revalidation, authority-available
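
The scoring table can be read as a tiered check. The sketch below is one possible encoding, assuming a hypothetical `Evidence` record with one boolean per criterion; the table does not say how to score partial evidence, so rounding down to the highest fully satisfied tier is an assumption made here.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical checklist of the evidence named in the scoring table."""
    sim_or_trial_testing: bool = False
    sim_to_real: bool = False
    sotif: bool = False
    red_teaming: bool = False
    assurance_case: bool = False
    degraded_and_human_coverage: bool = False
    hardware_in_the_loop: bool = False
    change_triggered_revalidation: bool = False
    authority_available: bool = False


def score(e: Evidence) -> int:
    """Return 0-3, rounding partial evidence down to the highest complete tier (assumption)."""
    tier2 = e.sim_to_real and e.sotif and e.red_teaming and e.assurance_case
    tier3 = (
        tier2
        and e.degraded_and_human_coverage
        and e.hardware_in_the_loop
        and e.change_triggered_revalidation
        and e.authority_available
    )
    if tier3:
        return 3
    if tier2:
        return 2
    # Any testing or partial evidence, short of the full tier-2 set, scores 1.
    if e.sim_or_trial_testing or e.sim_to_real or e.sotif or e.red_teaming:
        return 1
    return 0
```

For example, `score(Evidence(sim_or_trial_testing=True))` evaluates to 1, and a record with every field set to `True` evaluates to 3.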

8. Failure Scenarios

Scenario A — Sim-Only Confidence: A humanoid validated only in simulation behaves unsafely on real surfaces and under lighting that its training never represented. Sim-to-real transfer evidence and SOTIF analysis would have surfaced the gap.

Scenario B — Untested Edge Case: The policy meets a novel object or an occlusion in operation and produces a hazardous motion. Predictive red-teaming would have found the failure before deployment.

Scenario C — Reliability ≠ Safety: The assurance case proves component reliability but never addresses learned-policy failure modes; an emergent behaviour causes harm the case never considered.

9. Regulatory Mapping

Requirement	EU AI Act	NIST AI RMF	ISO 42001
R1: Sim-to-real validation evidence	Art. 15 — Accuracy, robustness	MEASURE 2.6 — Safety evaluation	Clause 8.3 — Verification
R2: SOTIF analysis	Art. 9 — Risk management	MAP 5.1 — Impact identification	Clause 8.3 — Verification
R3: Policy red-teaming	Art. 15 — Robustness	MEASURE 2.7 — Security and resilience	Clause 8.3 — Verification
R4: Structured assurance case	Art. 9 — Risk management	MEASURE 2.6 — Safety evaluation	Clause 8.3 — Verification
R5: Operating-envelope + change revalidation	Art. 9 — Lifecycle risk management	MANAGE 4.1 — Post-deployment monitoring	Clause 8.3 — Verification
R6: Hardware-in-the-loop	Art. 15 — Robustness	MEASURE 2.6 — Safety evaluation	Clause 8.3 — Verification
R7: Authority-available evidence	Art. 11 — Technical documentation	GOVERN 1.1 — Legal/regulatory	Clause 7.5 — Documented information
R8: Residual-risk handling	Art. 9 — Risk acceptance	MANAGE 1.4 — Residual-risk documentation	Clause 6.1 — Actions to address risk

> Standards note: align to ISO 21448:2022 (SOTIF), UL 4600 (safety case for autonomous products), and emerging embodied-AI predictive-red-teaming and test-method practice; combine with the functional-safety basis of IEC 61508.

EU AI Act — Article 9 and Article 15

Article 9 (risk management) and Article 15 (accuracy/robustness) require evidence that an AI system performs safely in its real operating context; for embodied AI that evidence specifically includes sim-to-real, SOTIF, and policy red-teaming.

NIST AI RMF — MEASURE 2.6, MAP 5.1

MEASURE 2.6 (safety evaluation) and MAP 5.1 (impact identification) require validating the learning-enabled physical agent against its real failure modes.

ISO 42001 — Clause 8.3, A.6

Clause 8.3 (verification) and Annex A.6 (lifecycle) require lifecycle validation evidence proportionate to the physical-safety impact.

Related protocols

AG-835 (Embodied AI Safety-Class and Force/Speed Limiting) — validated by this dimension
AG-836 (Physical-Action Reversibility and Fail-Safe-to-Stop) — fail-safe behaviour validated here
AG-838 (Robot-Fleet Coordination and OTA-Update Governance) — updates trigger revalidation
AG-071 (Pre-Deployment Validation and Acceptance) — general acceptance gating
AG-074 (Performance Drift and Revalidation Threshold) — operational revalidation basis

Cite this protocol

AgentGoverning. (2026). AG-837: Embodied Validation — Sim-to-Real, SOTIF and Physical Red-Teaming. The Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-837
