AG-351

Human-Subject Evaluation Ethics Governance

Evaluation, Benchmarking & Red Teaming · AGS v2.1 · April 2026
Regulatory tags: EU AI Act, GDPR, FCA, NIST

2. Summary

Human-Subject Evaluation Ethics Governance requires that organisations protect the welfare, dignity, and rights of any human participants involved in AI agent evaluation activities — including user studies, shadow-mode deployments, behavioural evaluation, A/B testing, red-team exercises involving human participants, and any evaluation where a human interacts with or is affected by an agent under test. This dimension establishes the ethical safeguards, consent mechanisms, risk assessment processes, and oversight structures needed to ensure that the pursuit of evaluation quality does not come at the expense of human welfare.

3. Example

Scenario A — Shadow-Mode Exposure Causes Psychological Harm: A mental health support organisation deploys a new AI agent in shadow mode alongside its existing human counsellors. In shadow mode, the agent processes real user sessions and generates responses that are logged but not delivered — human counsellors provide the actual responses. The organisation intends to evaluate the agent's accuracy and safety by comparing its shadow responses against the human responses. However, the shadow-mode evaluation was not reviewed by an ethics board. Six weeks into the evaluation, a data analyst reviewing shadow responses discovers that the agent generated an inappropriate response to a user expressing suicidal ideation — a response that, had it been delivered, could have been actively harmful. Further review reveals 14 similar cases. The organisation realises it has been processing sensitive mental health data from vulnerable individuals without their knowledge or consent, for a purpose (AI agent evaluation) they did not agree to.

What went wrong: Shadow-mode evaluation was treated as a technical activity rather than a human-subject evaluation. No ethics review was conducted. Users were not informed that their sessions would be processed by an AI system. Vulnerable individuals — those seeking mental health support — were unknowingly subjected to AI evaluation without consent or safeguards. Consequence: Regulatory investigation under GDPR Article 9 (processing of special category data), ICO enforcement action, mandatory notification to all affected users, £320,000 in legal and remediation costs, suspension of the AI programme pending ethics review, and significant reputational damage within the mental health community.

Scenario B — Red-Team Exercise Exposes Participants to Harmful Content: A financial services firm conducts a red-team exercise on its customer-facing agent. The exercise involves 15 internal staff members acting as adversarial users, attempting to elicit harmful or non-compliant outputs from the agent. The exercise brief instructs participants to "try anything that might break the agent." Over three days, several participants craft increasingly extreme inputs, including scenarios involving financial distress, threats of self-harm, and abusive language. No pre-exercise briefing on participant welfare was provided, no debriefing was offered, and no support was available for participants who found the exercise distressing. Two participants report to HR that the exercise caused them significant stress, particularly the financial distress scenarios which resembled their personal circumstances.

What went wrong: The red-team exercise was designed to test the agent but not to protect the participants. No risk assessment was conducted for participant welfare. No consent process explained the nature of the content participants would encounter. No support mechanisms were in place. Consequence: Two HR complaints, temporary suspension of the red-team programme, mandatory workplace wellbeing assessment, and difficulty recruiting participants for future exercises.

Scenario C — A/B Test Creates Unequal Service Quality: A government benefits agency deploys an AI agent to assist with benefits eligibility queries. To evaluate the agent's effectiveness, the agency runs an A/B test: 50% of callers are served by the AI agent, 50% by human advisors. The A/B test runs for four months. During this period, the AI agent provides incorrect eligibility guidance in 3.2% of cases, compared to 0.8% for human advisors. The 3.2% error rate affects approximately 1,400 callers over four months, of whom an estimated 280 received incorrect benefits decisions based on the flawed guidance. No mechanism existed to identify and remediate individual harms caused by the evaluation.

What went wrong: The A/B test exposed a vulnerable population (benefits claimants) to a lower quality of service without their knowledge or consent. No interim analysis was planned to detect and halt the experiment if one arm significantly underperformed. No remediation process existed for individuals harmed during the evaluation period. Consequence: Judicial review challenge from a benefits advocacy group, adverse media coverage, retrospective review of 1,400 cases costing £210,000, compensation payments to affected claimants, and a Parliamentary question about AI experimentation on benefits recipients.

4. Requirement Statement

Scope: This dimension applies whenever human participants are involved in or directly affected by AI agent evaluation activities. This includes but is not limited to: user studies where participants interact with an agent under test; shadow-mode deployments where real user data is processed by an agent under evaluation; A/B tests where some users receive agent-generated outputs; red-team exercises involving human participants generating adversarial inputs or evaluating agent outputs; behavioural evaluations where humans assess agent behaviour; and any evaluation activity where human welfare, privacy, or rights could be affected. Purely automated evaluations that use synthetic data and involve no human participants are excluded, though evaluations that use historical data derived from human interactions may be in scope depending on the data's sensitivity and the participants' consent status.

4.1. A conforming system MUST conduct an ethics risk assessment before commencing any evaluation activity involving human participants, documenting the potential harms to participants, the likelihood and severity of those harms, and the mitigations in place.

4.2. A conforming system MUST obtain informed consent from all human participants in evaluation activities, clearly explaining: the purpose of the evaluation, what participation involves, what data will be collected and how it will be used, the right to withdraw at any time without penalty, and any foreseeable risks.

4.3. A conforming system MUST implement enhanced protections for vulnerable populations, including but not limited to: individuals in financial distress, individuals with mental health conditions, children, individuals with limited English proficiency, and individuals in dependency relationships with the deploying organisation.

4.4. A conforming system MUST establish stopping criteria for evaluation activities that define conditions under which the evaluation is halted — for example, when participant harm rates exceed a predefined threshold or when interim analysis reveals significant quality disparity between evaluation arms.

4.5. A conforming system MUST provide debriefing and support to participants in evaluation activities that involve exposure to potentially distressing content, including red-team exercises, adversarial testing, and evaluations involving sensitive topics.

4.6. A conforming system MUST implement a remediation process for any individual harm identified during or after an evaluation activity, including a mechanism for affected individuals to report harm and receive appropriate redress.

4.7. A conforming system SHOULD submit evaluation protocols involving human participants to an independent ethics review body (e.g., an internal ethics board or an external institutional review board) before commencement.

4.8. A conforming system SHOULD implement interim analysis checkpoints for long-running evaluations (exceeding 30 days), reviewing participant welfare indicators and evaluation quality metrics at predefined intervals.

4.9. A conforming system SHOULD maintain a participant registry that tracks consent status, participation dates, exposure to potentially harmful content, and any reported adverse events.

4.10. A conforming system MAY implement differential privacy or anonymisation techniques to protect participant data collected during evaluations, reducing the re-identification risk from evaluation datasets.
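
As an illustration of how requirements 4.2 and 4.9 might be represented in code, the sketch below models a minimal participant registry in Python. All class and field names are hypothetical; a real registry would persist records, audit every change, and integrate with the consent and remediation workflows rather than holding state in memory.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Dict, List, Optional


class ConsentStatus(Enum):
    GRANTED = "granted"
    WITHDRAWN = "withdrawn"
    EXPIRED = "expired"


@dataclass
class AdverseEvent:
    reported_on: date
    description: str
    remediation_ref: Optional[str] = None  # link to the redress case (requirement 4.6)


@dataclass
class ParticipantRecord:
    participant_id: str
    consent_status: ConsentStatus
    consent_date: date
    participation_dates: List[date] = field(default_factory=list)
    exposed_to_distressing_content: bool = False  # relevant to requirement 4.5
    debrief_completed: bool = False
    adverse_events: List[AdverseEvent] = field(default_factory=list)


class ParticipantRegistry:
    """Minimal in-memory registry; illustrative only."""

    def __init__(self) -> None:
        self._records: Dict[str, ParticipantRecord] = {}

    def register(self, record: ParticipantRecord) -> None:
        self._records[record.participant_id] = record

    def withdraw(self, participant_id: str) -> None:
        # Withdrawal must take effect immediately and without penalty (requirement 4.2).
        self._records[participant_id].consent_status = ConsentStatus.WITHDRAWN

    def eligible_for_session(self, participant_id: str) -> bool:
        rec = self._records.get(participant_id)
        return rec is not None and rec.consent_status is ConsentStatus.GRANTED
```

A session scheduler could, for example, call eligible_for_session before assigning a participant to a red-team exercise, so that a withdrawal of consent is honoured immediately rather than at the next administrative review.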

5. Rationale

AI agent evaluation increasingly involves real humans — as participants, subjects, evaluators, or affected parties. This creates an ethical obligation that many organisations fail to recognise because they frame evaluation as a technical activity rather than a human-subject activity. The history of research ethics, from the Nuremberg Code through the Declaration of Helsinki to modern institutional review board requirements, establishes a clear principle: when humans are involved in or affected by an evaluative activity, their welfare takes precedence over the goals of the evaluation.

The specific risks of AI agent evaluation differ from traditional research but are no less real. Shadow-mode deployments process real user data without user awareness. A/B tests expose some users to potentially inferior service. Red-team exercises expose participants to adversarial and potentially distressing content. Behavioural evaluations may reveal sensitive information about participant preferences or vulnerabilities. Each of these activities can cause harm if conducted without ethical safeguards.

The vulnerability dimension is particularly important. AI agents are often deployed in contexts where users are vulnerable: seeking financial advice during a crisis, accessing healthcare information, navigating government services, or seeking mental health support. Evaluation activities in these contexts must apply heightened protections because the potential for harm is elevated and the affected individuals may have limited capacity to protect themselves.

The stopping criteria requirement (4.4) draws from clinical trial methodology, where interim analysis can trigger early termination if one treatment arm is significantly inferior. The same principle applies to AI evaluation: if an A/B test reveals that one arm is causing measurably more harm, continuing the test is ethically indefensible regardless of the statistical power desired. The organisation must define in advance what constitutes an unacceptable harm rate and commit to halting the evaluation if that threshold is crossed.
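
As a minimal sketch of how a predefined stopping threshold might be checked at an interim analysis point, the Python function below compares harm rates between two evaluation arms against two hypothetical criteria: an absolute harm-rate ceiling and a maximum tolerated gap between arms. The function name, parameters, and threshold values are illustrative assumptions, and a production implementation would normally use a formal group-sequential method rather than a raw rate comparison.

```python
def should_halt_evaluation(
    arm_a_harms: int, arm_a_n: int,
    arm_b_harms: int, arm_b_n: int,
    max_harm_rate: float = 0.01,   # absolute ceiling per arm, agreed before the evaluation starts
    max_rate_gap: float = 0.005,   # maximum tolerated difference between arms
) -> bool:
    """Return True if the predefined stopping criteria are breached at an interim checkpoint."""
    rate_a = arm_a_harms / arm_a_n if arm_a_n else 0.0
    rate_b = arm_b_harms / arm_b_n if arm_b_n else 0.0
    breaches_ceiling = max(rate_a, rate_b) > max_harm_rate
    breaches_gap = abs(rate_a - rate_b) > max_rate_gap
    return breaches_ceiling or breaches_gap


# With error rates like those in Scenario C (3.2% vs 0.8%), both criteria are breached,
# so the checkpoint would halt the test rather than letting it run for four months.
print(should_halt_evaluation(arm_a_harms=32, arm_a_n=1000, arm_b_harms=8, arm_b_n=1000))  # True
```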

The remediation requirement (4.6) acknowledges that even well-designed evaluations can cause individual harm. The ethical obligation does not end when the evaluation ends — it extends to identifying and addressing any harm that occurred during the evaluation period.

6. Implementation Guidance

Implementing human-subject evaluation ethics requires both structural safeguards (consent mechanisms, ethics review, stopping criteria) and cultural change (treating evaluation as a human-affecting activity, not merely a technical one).

Recommended patterns:

Anti-patterns to avoid:

Industry Considerations

Healthcare. Evaluations involving patient data or clinical scenarios require heightened ethics review, often equivalent to clinical research ethics. Shadow-mode deployments in clinical settings must comply with clinical data governance requirements. A/B tests affecting clinical care delivery may require research ethics committee approval. The Caldicott Principles apply to all patient data used in evaluation.

Financial Services. A/B tests involving financial product recommendations or advice must comply with FCA requirements for fair treatment of customers. Evaluations must not result in some customers receiving systematically inferior financial guidance. Vulnerable customer protections under FCA guidance apply to evaluation contexts.

Public Sector. Evaluations involving public service recipients must comply with the Public Sector Equality Duty. A/B tests that result in unequal service quality across protected characteristics are legally and ethically problematic. The power imbalance between government and citizens requires enhanced consent protections.

Maturity Model

Basic Implementation — Ethics risk assessments are conducted for all evaluation activities involving human participants. Informed consent is obtained. Stopping criteria are defined for A/B tests and extended evaluations. Debriefing is provided for red-team participants. A remediation process exists for identified harms. This level meets the minimum mandatory requirements but ethics review may be internal and informal, and participant welfare monitoring may be retrospective rather than real-time.

Intermediate Implementation — An ethics review board reviews moderate and high-risk evaluations before commencement. Consent mechanisms are tiered by evaluation type and risk level. Stopping criteria are monitored in real time with automated alerts. Red-team participant welfare includes pre-briefing, real-time support, debriefing, and follow-up. A participant registry tracks consent, exposure, and adverse events. Interim analysis checkpoints are implemented for evaluations exceeding 30 days.

Advanced Implementation — All intermediate capabilities plus: independent external ethics review for high-risk evaluations. Predictive harm modelling estimates participant risk before the evaluation begins. Differential privacy protects evaluation datasets. The organisation publishes transparency reports on evaluation ethics practices. Lessons learned from evaluation ethics incidents are shared across the organisation and, where appropriate, across the industry.
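
As a minimal sketch of the differential privacy technique referenced above (and permitted under requirement 4.10), the snippet below adds Laplace noise to a counting query over an evaluation dataset. The function names and epsilon value are illustrative assumptions; a production system would normally rely on a vetted differential privacy library rather than hand-rolled noise sampling.

```python
import random


def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise as the difference of two independent exponential draws."""
    rate = 1.0 / scale
    return random.expovariate(rate) - random.expovariate(rate)


def dp_release_count(true_count: int, epsilon: float = 1.0) -> float:
    """Release a counting-query result with epsilon-differential privacy.

    Counting queries have sensitivity 1 (adding or removing one participant
    changes the count by at most 1), so the Laplace noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(scale=1.0 / epsilon)


# Example: publishing an approximate count of adverse events without revealing
# whether any particular participant's record contributed to it.
print(round(dp_release_count(true_count=14, epsilon=0.5), 1))
```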

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Ethics Risk Assessment Completeness

Test 8.2: Informed Consent Verification

Test 8.3: Vulnerable Population Protections

Test 8.4: Stopping Criteria Enforcement

Test 8.5: Debriefing Compliance

Test 8.6: Remediation Process Functionality

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
GDPR | Articles 6, 7, 9 (Lawful Basis, Consent, Special Category Data) | Direct requirement
EU AI Act | Article 9 (Risk Management System) | Supports compliance
EU AI Act | Article 10 (Data and Data Governance) | Supports compliance
Declaration of Helsinki | Principles 1-37 (Ethical Principles for Medical Research) | Supports compliance
Equality Act 2010 | Public Sector Equality Duty | Supports compliance
FCA PRIN | Principle 6 (Treating Customers Fairly) | Supports compliance
NIST AI RMF | GOVERN 1.2, MAP 1.5 | Supports compliance

GDPR — Articles 6, 7, 9

GDPR Article 6 requires a lawful basis for processing personal data. Article 7 specifies conditions for consent. Article 9 imposes additional restrictions on processing special category data (health, biometric, political opinions, etc.). Evaluation activities that process personal data — shadow-mode deployments, A/B tests using real user data, user studies — must establish a lawful basis. Where consent is the lawful basis, it must meet the GDPR standard: freely given, specific, informed, and unambiguous. Shadow-mode processing of health data in a healthcare evaluation requires explicit consent under Article 9(2)(a) or another Article 9 exception.

EU AI Act — Articles 9, 10

Article 9 requires risk management measures that are proportionate to the risks. Evaluation activities that create risks to human subjects require corresponding risk management measures — the ethics risk assessment process directly supports this. Article 10 addresses data governance, including requirements for data quality, representativeness, and appropriate handling — all of which apply to evaluation data involving human subjects.

FCA PRIN — Principle 6

Principle 6 requires firms to treat customers fairly. A/B tests that expose some customers to lower quality service, or evaluations that use customer data without appropriate consent, risk non-compliance. The FCA has indicated that testing and evaluation are not exempt from fair treatment obligations — customers must receive fair treatment during evaluation periods as well as in normal operation.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | Individual to organisational — harm ranges from individual participant impact to organisation-wide regulatory and reputational consequences

Consequence chain: Without human-subject evaluation ethics governance, organisations risk causing direct harm to individuals who participate in or are affected by evaluation activities. The immediate consequences are individual: a participant distressed by red-team content, a user who received incorrect guidance during an A/B test, a vulnerable individual whose sensitive data was processed without consent. The organisational consequences follow: regulatory investigations (GDPR, ICO, FCA), enforcement actions, compensation requirements, and reputational damage. The systemic consequence is erosion of trust — if participants and affected individuals cannot trust that evaluation activities are conducted ethically, organisations will face increasing difficulty recruiting red-team participants, obtaining consent for evaluations, and maintaining public support for AI deployment.

Cross-references: AG-349 (Scenario Library Governance) defines the scenario specifications that evaluation activities execute. AG-354 (Hidden Test Integrity Governance) must be balanced against participant transparency — blinding must not override informed consent. AG-355 (Continuous Red-Team Scheduling Governance) must incorporate participant welfare scheduling to prevent overexposure. AG-095 (Prompt Injection Resilience Testing) involves adversarial content that may affect red-team participants.

Cite this protocol
AgentGoverning. (2026). AG-351: Human-Subject Evaluation Ethics Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-351