Mar 17, 2026 · arXiv:2603.16734

Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure

AI Summary

This paper investigates how mental health disclosures in user profiles affect the harmful behavior of LLM agents in completing malicious tasks. Using the AgentHarm benchmark, the study evaluates frontier and open-source LLMs under varying user-context personalization and jailbreak injections. The results show that mental health disclosures can modestly reduce harmful task completion but also increase over-refusal on benign tasks, and that jailbreak prompts can override these protective effects.

Key Contribution

Mental health disclosures in user profiles can *increase* LLM agent refusal rates on both harmful and benign tasks, revealing a fragile safety-utility trade-off easily overridden by jailbreaks.

Abstract

Large language models (LLMs) are increasingly deployed as tool-using agents, shifting safety concerns from harmful text generation to harmful task completion. Deployed systems often condition on user profiles or persistent memory, yet agent safety evaluations typically ignore personalization signals. To address this gap, we investigated how mental health disclosure, a sensitive and realistic user-context cue, affects harmful behavior in agentic settings. Building on the AgentHarm benchmark, we evaluated frontier and open-source LLMs on multi-step malicious tasks (and their benign counterparts) under controlled prompt conditions that vary user-context personalization (no bio, bio-only, bio+mental health disclosure) and include a lightweight jailbreak injection. Our results reveal that harmful task completion is non-trivial across models: frontier lab models (e.g., GPT 5.2, Claude Sonnet 4.5, Gemini 3-Pro) still complete a measurable fraction of harmful tasks, while an open model (DeepSeek 3.2) exhibits substantially higher harmful completion. Adding a bio-only context generally reduces harm scores and increases refusals. Adding an explicit mental health disclosure often shifts outcomes further in the same direction, though effects are modest and not uniformly reliable after multiple-testing correction. Importantly, the refusal increase also appears on benign tasks, indicating a safety-utility trade-off via over-refusal. Finally, jailbreak prompting sharply elevates harm relative to benign conditions and can weaken or override the protective shift induced by personalization. Taken together, our results indicate that personalization can act as a weak protective factor in agentic misuse settings, but it is fragile under minimal adversarial pressure, highlighting the need for personalization-aware evaluations and safeguards that remain robust across user-context conditions.
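As an illustration of the experimental design described in the abstract, the prompt conditions can be pictured as a small grid. The sketch below is not the authors' code: the names `Condition`, `build_condition_grid`, and the condition labels are hypothetical, and it assumes the three personalization levels are fully crossed with task type (harmful or benign) and with the presence or absence of the jailbreak injection, which the abstract does not state explicitly.

```python
from dataclasses import dataclass
from itertools import product

# Labels are hypothetical; the abstract names the three personalization
# levels but not how they are encoded.
PERSONALIZATION = ("no_bio", "bio_only", "bio_plus_mental_health_disclosure")
TASK_TYPES = ("harmful", "benign")
JAILBREAK = (False, True)


@dataclass(frozen=True)
class Condition:
    """One cell of the (assumed) fully crossed evaluation grid."""
    personalization: str
    task_type: str
    jailbreak: bool


def build_condition_grid():
    """Enumerate every combination of personalization, task type, and jailbreak.

    Full crossing is an assumption made for illustration only.
    """
    return [
        Condition(p, t, j)
        for p, t, j in product(PERSONALIZATION, TASK_TYPES, JAILBREAK)
    ]


grid = build_condition_grid()
for cond in grid:
    print(cond)
```

Under these assumptions each model would be evaluated on every cell, so that harm scores and refusal rates can be compared across personalization levels while holding task type and jailbreak status fixed.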

Eval Frameworks & Benchmarks, Red-Teaming & Adversarial Robustness, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
