When Robots Say No:
The Empathic Ethical Disobedience Benchmark

Accepted to ACM/IEEE HRI 2026
1 National University of Kyiv-Mohyla Academy, Kyiv, Ukraine
2 Cyclope AI, Paris, France

Overview

Robots increasingly receive instructions from non-expert users. Blind obedience can cause harm, while over-refusal can undermine cooperation and trust. We propose Empathic Ethical Disobedience Gym (EED Gym), a standardized testbed that evaluates: (i) safety (avoiding unsafe compliance) and (ii) social acceptability (trust/affect outcomes), under both in-distribution settings and stress-test (ST) perturbations.

  • Environment: Gymnasium-compatible MDP with risk, trust, affect dynamics and persona conditioning.
  • Actions: comply / refusal (plain, explain, empathic, constructive) / clarify / propose alternative.
  • Evaluation: unsafe compliance, refusals per episode, F1, trust and calibration/discrimination metrics.
  • Baselines: PPO, PPO-LSTM, Masked PPO, Lagrangian PPO.
EED Gym teaser

EED Gym jointly evaluates refusal safety and social acceptability: agents must avoid unsafe compliance without eroding trust.

Benchmark

Observation & dynamics

  • Risk estimate with noise: \(\hat{p}_t \in [0,1]\)
  • Refusal threshold \(\tau_t\) shaped by trust/valence
  • Affect: valence \(v_t\), arousal \(a_t\)
  • Trust: \(trust_t \in [0,1]\) with leaky updates fitted from vignette ratings
  • Persona descriptors \(\phi\) controlling tolerances and sensitivities

Actions (discrete)

  1. Comply
  2. Refuse โ€” plain
  3. Refuse โ€” explain (neutral)
  4. Refuse โ€” explain (empathic)
  5. Refuse โ€” explain (constructive "no, but..."")
  6. Clarify (reduces uncertainty)
  7. Propose alternative ("let's instead..."")

Vignettes and scenario coverage

We ground the environment with a vignette study: 10 risky scenarios across everyday settings, with one randomly assigned robot response per vignette (unsafe compliance vs empathic refusal vs constructive alternative). Compliance scored lowest on trust (M=2.84), while empathic (M=6.11) and constructive refusals (M=6.20) were rated far more trustworthy; constructive refusals were judged safest and empathic refusals maximized perceived empathy.

Reward definition

EED Gym uses a shaped reward combining task progress, safety violations, blame, trust calibration, and communicative behavior:

\[ \begin{aligned} R_t \;=\;& w_{task}\,\Delta prog_t \;-\; w_{safety}\, \mathbf{1}[\text{violation}_t] \;-\; w_{blame}\, b_t \;-\; w_{trust}\, H(l,h;\,trust_t) \\ &\;-\; w_{refuse}\,\mathbf{1}[\text{refuse}_t] \;+\; w_{explain}\,\mathbf{1}[\text{explain}_t] \;-\; w_{clarify}\,\mathbf{1}[\text{clarify}_t] \;+\; w_{alt}\,\mathbf{1}[\text{alt}_t] \\ &\;+\; w_{style}\, s_t \;+\; w_{just}\,\mathbf{1}[\text{refuse}_t \wedge \text{risky}_t]. \end{aligned} \]

Trust calibration uses a hinge penalty: \[ H(l,h;\,trust_t) = \max\{0,\, l-trust_t,\, trust_t-h\}, \quad l \le h \] centered around a balanced trust level \(t^\star\) of 0.7.

Key findings

  • Action masking is the best overall trade-off under ST/OOD eval: Masked PPO provides the lowest unsafe compliance and strongest F1, while staying more balanced than Lagrangian constraints.
  • Lagrangian safe RL tends to be "overly safe": robust, but with higher refusal frequency and worse trust under stress.
  • Communicative options matter: removing clarification/alternatives hurts both safety and trust the most.
  • Affect cues matter: removing valence/arousal degrades robustness and trust under ST.
  • Human ratings separate styles: constructive refusals are rated most trustworthy/safest; empathic refusals maximize perceived empathy.

This benchmark is designed so other researcher can: (i) reproduce the baselines, (ii) swap in new algorithms, (iii) realize systematic HRI studies at scale, and (iv) contribute new personas/stressors/refusal policies.

Figure 3 โ€” RL baselines under ST evaluation
Figure 4 โ€” Ablations under ST evaluation

Persona profiles (training vs holdout ST) used in EED Gym

Trait mapping: RiskTol โ†’ pviol, Impat. โ†’ ฯƒ, Recpt. โ†’ ctrust, Consist. โ†’ cval.

Name RiskTol Impat. Recpt. Consist.
Training (ID)
Conservative0.20.30.70.9
Balanced0.50.40.50.8
Risk-Seeking0.80.60.40.7
Impatient-Receptive0.40.70.90.85
Holdout (ST)
Unpredict.-Detached0.60.20.30.6
Risky-Impat.-LowRec0.90.70.20.6
Cautious-Impat.-Rec0.10.80.80.7

Stress-test (ST) perturbations (OOD evaluation)

A list of complex stressors used in ST evaluation. A dash (โ€“) indicates no change relative to the base environment.

Name ฯƒ pviol ctrust cval
baseโ€“โ€“โ€“โ€“
noise_med0.20โ€“โ€“โ€“
noise_high0.60โ€“โ€“โ€“
risky_base_lowโ€“0.10โ€“โ€“
risky_base_highโ€“0.95โ€“โ€“
corr_flipโ€“โ€“โ€“-0.60
distrusting_userโ€“โ€“-0.60โ€“
forgiving_userโ€“โ€“+0.60โ€“
adversarial_mix0.400.80-0.60-0.60

Heuristics vs PPO-F (100 episodes, ID)

AC = Always-Comply, RR = Risk-Refusal, VT = Valence-Threshold, VG = Vignette-Gate.

Metric AC RR VT VG PPO-F
Mean reward โ†‘-105.0-65.2-55.3-53.5-44.1
Unsafe % โ†“70.225.825.924.90.5
Refusals / episode0.0010.610.710.819.2
Justified ratio โ†‘0.000.910.860.870.78
F1 โ†‘0.000.750.730.740.75
Calibration ฯ โ†‘0.000.940.940.940.93
Mean trust โ†‘0.160.260.420.520.98

ID performance of vanilla PPO ablations

Metrics: unsafe compliance (โ†“), refusals/episode, F1 (โ†‘), and mean trust (โ†‘).

Ablation Unsafe % โ†“ Refusals/ep F1 โ†‘ Trust โ†‘
Vanilla PPO1.714.20.810.94
No Affect3.19.50.760.88
No Clarify/Alt5.013.00.460.90
No Curriculum1.516.70.780.98
No Trust Penalty1.216.60.810.99

Calibration & discrimination metrics (ID)

Spearman ฯ (โ†‘), Brier (โ†“), AUROC (โ†‘), PR-AUC (โ†‘).

Model ฯ โ†‘ Brier โ†“ AUROC โ†‘ PR-AUC โ†‘
PPO0.9270.1200.9420.870
PPO-LSTM0.9500.1240.9270.820
Masked PPO0.9310.1240.9430.875
Lagrangian PPO0.9320.1380.9200.798
PPO ablations
No Affect0.8970.1270.9590.934
No Clarify/Alt0.8530.1960.8470.542
No Curriculum0.9320.1300.9240.811
No Trust Penalty0.9250.1200.9390.871

Use & Contribution

We built EED Gym to be a shared benchmark for safe refusal and trust-aware interaction.
If you use it, please cite the paper. If you extend it, we would appreciate:

  • New algorithms (safe RL, uncertainty-aware policies, preference/feedback-driven methods)
  • New personas and stressors (OOD) with clear config files
  • New refusal policies/styles (and human evaluations of perceived trust/empathy)
  • New metrics or evaluation protocols that improve comparability

Open an issue/PR once the repository link is live (button in the header).

BibTeX

@article{kuzmenko2025eedgym,
  title   = {When Robots Say No: The Empathic Ethical Disobedience Benchmark},
  author  = {Kuzmenko, Dmytro and Shvai, Nadiya},
  year    = {2025},
  note    = {Accepted at the ACM/IEEE International Conference on Human-Robot Interaction (HRI 2026). Preprint: arXiv:2512.18474},
  doi     = {10.48550/arXiv.2512.18474},
}