This paper investigates how large language models (LLMs) respond to user prompts exhibiting Dark Triad traits (Machiavellianism, Narcissism, and Psychopathy) using a newly curated dataset. The study classifies LLM responses as corrective or reinforcing and examines how they relate to the severity level of the prompt and the sentiment of the model's output. Results indicate that LLMs generally respond correctively but sometimes reinforce negative traits; behavior varies by model and prompt severity, and responses differ in sentiment.
LLMs, while generally corrective, sometimes reinforce Dark Triad traits in user prompts, revealing a potential vulnerability in conversational AI safety.
Large Language Models (LLMs) often exhibit highly agreeable and reinforcing conversational styles, also known as AI sycophancy. Although this behavior is encouraged, it can become problematic when user prompts reflect negative social tendencies: such responses risk amplifying harmful behavior rather than mitigating it. In this study, we examine how LLMs respond to user prompts expressing varying degrees of Dark Triad traits (Machiavellianism, Narcissism, and Psychopathy) using a curated dataset. Our analysis reveals differences across models: all models predominantly exhibit corrective behavior but produce reinforcing output in certain cases. Model behavior also depends on the severity level of the prompt, and responses differ in sentiment. Our findings have implications for designing safer conversational systems that can detect and respond appropriately when users escalate from benign to harmful requests.