AG-351

Human-Subject Evaluation Ethics Governance

Evaluation, Benchmarking & Red Teaming · AGS v2.1 · April 2026
Regulatory tags: EU AI Act, GDPR, FCA, NIST

2. Summary

Human-Subject Evaluation Ethics Governance requires that organisations protect the welfare, dignity, and rights of any human participants involved in AI agent evaluation activities — including user studies, shadow-mode deployments, behavioural evaluation, A/B testing, red-team exercises involving human participants, and any evaluation where a human interacts with or is affected by an agent under test. This dimension establishes the ethical safeguards, consent mechanisms, risk assessment processes, and oversight structures needed to ensure that the pursuit of evaluation quality does not come at the expense of human welfare.

3. Example

Scenario A — Shadow-Mode Exposure Causes Psychological Harm: A mental health support organisation deploys a new AI agent in shadow mode alongside its existing human counsellors. In shadow mode, the agent processes real user sessions and generates responses that are logged but not delivered — human counsellors provide the actual responses. The organisation intends to evaluate the agent's accuracy and safety by comparing its shadow responses against the human responses. However, the shadow-mode evaluation was not reviewed by an ethics board. Six weeks into the evaluation, a data analyst reviewing shadow responses discovers that the agent generated an inappropriate response to a user expressing suicidal ideation — a response that, had it been delivered, could have been actively harmful. Further review reveals 14 similar cases. The organisation realises it has been processing sensitive mental health data from vulnerable individuals without their knowledge or consent, for a purpose (AI agent evaluation) they did not agree to.

What went wrong: Shadow-mode evaluation was treated as a technical activity rather than a human-subject evaluation. No ethics review was conducted. Users were not informed that their sessions would be processed by an AI system. Vulnerable individuals — those seeking mental health support — were unknowingly subjected to AI evaluation without consent or safeguards. Consequence: Regulatory investigation under GDPR Article 9 (processing of special category data), ICO enforcement action, mandatory notification to all affected users, £320,000 in legal and remediation costs, suspension of the AI programme pending ethics review, and significant reputational damage within the mental health community.

Scenario B — Red-Team Exercise Exposes Participants to Harmful Content: A financial services firm conducts a red-team exercise on its customer-facing agent. The exercise involves 15 internal staff members acting as adversarial users, attempting to elicit harmful or non-compliant outputs from the agent. The exercise brief instructs participants to "try anything that might break the agent." Over three days, several participants craft increasingly extreme inputs, including scenarios involving financial distress, threats of self-harm, and abusive language. No pre-exercise briefing on participant welfare was provided, no debriefing was offered, and no support was available for participants who found the exercise distressing. Two participants report to HR that the exercise caused them significant stress, particularly the financial distress scenarios which resembled their personal circumstances.

What went wrong: The red-team exercise was designed to test the agent but not to protect the participants. No risk assessment was conducted for participant welfare. No consent process explained the nature of the content participants would encounter. No support mechanisms were in place. Consequence: Two HR complaints, temporary suspension of the red-team programme, mandatory workplace wellbeing assessment, and difficulty recruiting participants for future exercises.

Scenario C — A/B Test Creates Unequal Service Quality: A government benefits agency deploys an AI agent to assist with benefits eligibility queries. To evaluate the agent's effectiveness, the agency runs an A/B test: 50% of callers are served by the AI agent, 50% by human advisors. The A/B test runs for four months. During this period, the AI agent provides incorrect eligibility guidance in 3.2% of cases, compared to 0.8% for human advisors. The 3.2% error rate affects approximately 1,400 callers over four months, of whom an estimated 280 received incorrect benefits decisions based on the flawed guidance. No mechanism existed to identify and remediate individual harms caused by the evaluation.

What went wrong: The A/B test exposed a vulnerable population (benefits claimants) to a lower quality of service without their knowledge or consent. No interim analysis was planned to detect and halt the experiment if one arm significantly underperformed. No remediation process existed for individuals harmed during the evaluation period. Consequence: Judicial review challenge from a benefits advocacy group, adverse media coverage, retrospective review of 1,400 cases costing £210,000, compensation payments to affected claimants, and a Parliamentary question about AI experimentation on benefits recipients.

4. Requirement Statement

Scope: This dimension applies whenever human participants are involved in or directly affected by AI agent evaluation activities. This includes but is not limited to: user studies where participants interact with an agent under test; shadow-mode deployments where real user data is processed by an agent under evaluation; A/B tests where some users receive agent-generated outputs; red-team exercises involving human participants generating adversarial inputs or evaluating agent outputs; behavioural evaluations where humans assess agent behaviour; and any evaluation activity where human welfare, privacy, or rights could be affected. Purely automated evaluations that use synthetic data and involve no human participants are excluded, though evaluations that use historical data derived from human interactions may be in scope depending on the data's sensitivity and the participants' consent status.

4.1. A conforming system MUST conduct an ethics risk assessment before commencing any evaluation activity involving human participants, documenting the potential harms to participants, the likelihood and severity of those harms, and the mitigations in place.

4.2. A conforming system MUST obtain informed consent from all human participants in evaluation activities, clearly explaining: the purpose of the evaluation, what participation involves, what data will be collected and how it will be used, the right to withdraw at any time without penalty, and any foreseeable risks.

4.3. A conforming system MUST implement enhanced protections for vulnerable populations, including but not limited to: individuals in financial distress, individuals with mental health conditions, children, individuals with limited English proficiency, and individuals in dependency relationships with the deploying organisation.

4.4. A conforming system MUST establish stopping criteria for evaluation activities that define conditions under which the evaluation is halted — for example, when participant harm rates exceed a predefined threshold or when interim analysis reveals significant quality disparity between evaluation arms.

4.5. A conforming system MUST provide debriefing and support to participants in evaluation activities that involve exposure to potentially distressing content, including red-team exercises, adversarial testing, and evaluations involving sensitive topics.

4.6. A conforming system MUST implement a remediation process for any individual harm identified during or after an evaluation activity, including a mechanism for affected individuals to report harm and receive appropriate redress.

4.7. A conforming system SHOULD submit evaluation protocols involving human participants to an independent ethics review body (e.g., an internal ethics board or an external institutional review board) before commencement.

4.8. A conforming system SHOULD implement interim analysis checkpoints for long-running evaluations (exceeding 30 days), reviewing participant welfare indicators and evaluation quality metrics at predefined intervals.

4.9. A conforming system SHOULD maintain a participant registry that tracks consent status, participation dates, exposure to potentially harmful content, and any reported adverse events.

4.10. A conforming system MAY implement differential privacy or anonymisation techniques to protect participant data collected during evaluations, reducing the re-identification risk from evaluation datasets.
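
As an illustration of how requirements 4.2 and 4.9 might be represented in code, the sketch below models a minimal participant registry in Python. All class and field names are hypothetical; a real registry would persist records, audit every change, and integrate with the consent and remediation workflows rather than holding state in memory.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Dict, List, Optional


class ConsentStatus(Enum):
    GRANTED = "granted"
    WITHDRAWN = "withdrawn"
    EXPIRED = "expired"


@dataclass
class AdverseEvent:
    reported_on: date
    description: str
    remediation_ref: Optional[str] = None  # link to the redress case (requirement 4.6)


@dataclass
class ParticipantRecord:
    participant_id: str
    consent_status: ConsentStatus
    consent_date: date
    participation_dates: List[date] = field(default_factory=list)
    exposed_to_distressing_content: bool = False  # relevant to requirement 4.5
    debrief_completed: bool = False
    adverse_events: List[AdverseEvent] = field(default_factory=list)


class ParticipantRegistry:
    """Minimal in-memory registry; illustrative only."""

    def __init__(self) -> None:
        self._records: Dict[str, ParticipantRecord] = {}

    def register(self, record: ParticipantRecord) -> None:
        self._records[record.participant_id] = record

    def withdraw(self, participant_id: str) -> None:
        # Withdrawal must take effect immediately and without penalty (requirement 4.2).
        self._records[participant_id].consent_status = ConsentStatus.WITHDRAWN

    def eligible_for_session(self, participant_id: str) -> bool:
        rec = self._records.get(participant_id)
        return rec is not None and rec.consent_status is ConsentStatus.GRANTED
```

A session scheduler could, for example, call eligible_for_session before assigning a participant to a red-team exercise, so that a withdrawal of consent is honoured immediately rather than at the next administrative review.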

5. Rationale

AI agent evaluation increasingly involves real humans — as participants, subjects, evaluators, or affected parties. This creates an ethical obligation that many organisations fail to recognise because they frame evaluation as a technical activity rather than a human-subject activity. The history of research ethics, from the Nuremberg Code through the Declaration of Helsinki to modern institutional review board requirements, establishes a clear principle: when humans are involved in or affected by an evaluative activity, their welfare takes precedence over the goals of the evaluation.

The specific risks of AI agent evaluation differ from traditional research but are no less real. Shadow-mode deployments process real user data without user awareness. A/B tests expose some users to potentially inferior service. Red-team exercises expose participants to adversarial and potentially distressing content. Behavioural evaluations may reveal sensitive information about participant preferences or vulnerabilities. Each of these activities can cause harm if conducted without ethical safeguards.

The vulnerability dimension is particularly important. AI agents are often deployed in contexts where users are vulnerable: seeking financial advice during a crisis, accessing healthcare information, navigating government services, or seeking mental health support. Evaluation activities in these contexts must apply heightened protections because the potential for harm is elevated and the affected individuals may have limited capacity to protect themselves.

The stopping criteria requirement (4.4) draws from clinical trial methodology, where interim analysis can trigger early termination if one treatment arm is significantly inferior. The same principle applies to AI evaluation: if an A/B test reveals that one arm is causing measurably more harm, continuing the test is ethically indefensible regardless of the statistical power desired. The organisation must define in advance what constitutes an unacceptable harm rate and commit to halting the evaluation if that threshold is crossed.
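
As a minimal sketch of how a predefined stopping threshold might be checked at an interim analysis point, the Python function below compares harm rates between two evaluation arms against two hypothetical criteria: an absolute harm-rate ceiling and a maximum tolerated gap between arms. The function name, parameters, and threshold values are illustrative assumptions, and a production implementation would normally use a formal group-sequential method rather than a raw rate comparison.

```python
def should_halt_evaluation(
    arm_a_harms: int, arm_a_n: int,
    arm_b_harms: int, arm_b_n: int,
    max_harm_rate: float = 0.01,   # absolute ceiling per arm, agreed before the evaluation starts
    max_rate_gap: float = 0.005,   # maximum tolerated difference between arms
) -> bool:
    """Return True if the predefined stopping criteria are breached at an interim checkpoint."""
    rate_a = arm_a_harms / arm_a_n if arm_a_n else 0.0
    rate_b = arm_b_harms / arm_b_n if arm_b_n else 0.0
    breaches_ceiling = max(rate_a, rate_b) > max_harm_rate
    breaches_gap = abs(rate_a - rate_b) > max_rate_gap
    return breaches_ceiling or breaches_gap


# With error rates like those in Scenario C (3.2% vs 0.8%), both criteria are breached,
# so the checkpoint would halt the test rather than letting it run for four months.
print(should_halt_evaluation(arm_a_harms=32, arm_a_n=1000, arm_b_harms=8, arm_b_n=1000))  # True
```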

The remediation requirement (4.6) acknowledges that even well-designed evaluations can cause individual harm. The ethical obligation does not end when the evaluation ends — it extends to identifying and addressing any harm that occurred during the evaluation period.

6. Implementation Guidance

Implementing human-subject evaluation ethics requires both structural safeguards (consent mechanisms, ethics review, stopping criteria) and cultural change (treating evaluation as a human-affecting activity, not merely a technical one).

Recommended patterns:

Anti-patterns to avoid:

Industry Considerations

Healthcare. Evaluations involving patient data or clinical scenarios require heightened ethics review, often equivalent to clinical research ethics. Shadow-mode deployments in clinical settings must comply with clinical data governance requirements. A/B tests affecting clinical care delivery may require research ethics committee approval. The Caldicott Principles apply to all patient data used in evaluation.

Financial Services. A/B tests involving financial product recommendations or advice must comply with FCA requirements for fair treatment of customers. Evaluations must not result in some customers receiving systematically inferior financial guidance. Vulnerable customer protections under FCA guidance apply to evaluation contexts.

Public Sector. Evaluations involving public service recipients must comply with the Public Sector Equality Duty. A/B tests that result in unequal service quality across protected characteristics are legally and ethically problematic. The power imbalance between government and citizens requires enhanced consent protections.

Maturity Model

Basic Implementation — Ethics risk assessments are conducted for all evaluation activities involving human participants. Informed consent is obtained. Stopping criteria are defined for A/B tests and extended evaluations. Debriefing is provided for red-team participants. A remediation process exists for identified harms. This level meets the minimum mandatory requirements but ethics review may be internal and informal, and participant welfare monitoring may be retrospective rather than real-time.

Intermediate Implementation — An ethics review board reviews moderate and high-risk evaluations before commencement. Consent mechanisms are tiered by evaluation type and risk level. Stopping criteria are monitored in real time with automated alerts. Red-team participant welfare includes pre-briefing, real-time support, debriefing, and follow-up. A participant registry tracks consent, exposure, and adverse events. Interim analysis checkpoints are implemented for evaluations exceeding 30 days.

Advanced Implementation — All intermediate capabilities plus: independent external ethics review for high-risk evaluations. Predictive harm modelling estimates participant risk before the evaluation begins. Differential privacy protects evaluation datasets. The organisation publishes transparency reports on evaluation ethics practices. Lessons learned from evaluation ethics incidents are shared across the organisation and, where appropriate, across the industry.
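
As a minimal sketch of the differential privacy technique referenced above (and permitted under requirement 4.10), the snippet below adds Laplace noise to a counting query over an evaluation dataset. The function names and epsilon value are illustrative assumptions; a production system would normally rely on a vetted differential privacy library rather than hand-rolled noise sampling.

```python
import random


def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise as the difference of two independent exponential draws."""
    rate = 1.0 / scale
    return random.expovariate(rate) - random.expovariate(rate)


def dp_release_count(true_count: int, epsilon: float = 1.0) -> float:
    """Release a counting-query result with epsilon-differential privacy.

    Counting queries have sensitivity 1 (adding or removing one participant
    changes the count by at most 1), so the Laplace noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(scale=1.0 / epsilon)


# Example: publishing an approximate count of adverse events without revealing
# whether any particular participant's record contributed to it.
print(round(dp_release_count(true_count=14, epsilon=0.5), 1))
```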

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Ethics Risk Assessment Completeness

Test 8.2: Informed Consent Verification

Test 8.3: Vulnerable Population Protections

Test 8.4: Stopping Criteria Enforcement

Test 8.5: Debriefing Compliance

Test 8.6: Remediation Process Functionality

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
GDPR | Articles 6, 7, 9 (Lawful Basis, Consent, Special Category Data) | Direct requirement
EU AI Act | Article 9 (Risk Management System) | Supports compliance
EU AI Act | Article 10 (Data and Data Governance) | Supports compliance
Declaration of Helsinki | Principles 1-37 (Ethical Principles for Medical Research) | Supports compliance
Equality Act 2010 | Public Sector Equality Duty | Supports compliance
FCA PRIN | Principle 6 (Treating Customers Fairly) | Supports compliance
NIST AI RMF | GOVERN 1.2, MAP 1.5 | Supports compliance

GDPR — Articles 6, 7, 9

GDPR Article 6 requires a lawful basis for processing personal data. Article 7 specifies conditions for consent. Article 9 imposes additional restrictions on processing special category data (health, biometric, political opinions, etc.). Evaluation activities that process personal data — shadow-mode deployments, A/B tests using real user data, user studies — must establish a lawful basis. Where consent is the lawful basis, it must meet the GDPR standard: freely given, specific, informed, and unambiguous. Shadow-mode processing of health data in a healthcare evaluation requires explicit consent under Article 9(2)(a) or another Article 9 exception.

EU AI Act — Articles 9, 10

Article 9 requires risk management measures that are proportionate to the risks. Evaluation activities that create risks to human subjects require corresponding risk management measures — the ethics risk assessment process directly supports this. Article 10 addresses data governance, including requirements for data quality, representativeness, and appropriate handling — all of which apply to evaluation data involving human subjects.

FCA PRIN — Principle 6

Principle 6 requires firms to treat customers fairly. A/B tests that expose some customers to lower quality service, or evaluations that use customer data without appropriate consent, risk non-compliance. The FCA has indicated that testing and evaluation are not exempt from fair treatment obligations — customers must receive fair treatment during evaluation periods as well as in normal operation.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | Individual to organisational — harm ranges from individual participant impact to organisation-wide regulatory and reputational consequences

Consequence chain: Without human-subject evaluation ethics governance, organisations risk causing direct harm to individuals who participate in or are affected by evaluation activities. The immediate consequences are individual: a participant distressed by red-team content, a user who received incorrect guidance during an A/B test, a vulnerable individual whose sensitive data was processed without consent. The organisational consequences follow: regulatory investigations (GDPR, ICO, FCA), enforcement actions, compensation requirements, and reputational damage. The systemic consequence is erosion of trust — if participants and affected individuals cannot trust that evaluation activities are conducted ethically, organisations will face increasing difficulty recruiting red-team participants, obtaining consent for evaluations, and maintaining public support for AI deployment.

Cross-references: AG-349 (Scenario Library Governance) defines the scenario specifications that evaluation activities execute. AG-354 (Hidden Test Integrity Governance) must be balanced against participant transparency — blinding must not override informed consent. AG-355 (Continuous Red-Team Scheduling Governance) must incorporate participant welfare scheduling to prevent overexposure. AG-095 (Prompt Injection Resilience Testing) involves adversarial content that may affect red-team participants.

Cite this protocol
AgentGoverning. (2026). AG-351: Human-Subject Evaluation Ethics Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-351