This paper introduces a method for neuron-level emotion control in speech-generative large audio-language models (LALMs) by identifying and manipulating emotion-sensitive neurons (ESNs). ESNs are discovered via success-filtered activation aggregation, which keeps only generations that both realize the target emotion and preserve the spoken content. The approach enables training-free emotion steering at inference time across three different LALMs, yielding emotion-specific gains that generalize to unseen speakers, as validated by both automatic and human evaluations.
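To make the selection procedure concrete, here is a minimal sketch of success-filtered activation aggregation, assuming pre-extracted per-neuron activations and binary per-run success labels. The function name `select_esns`, the array layout, and the top-k selection rule are illustrative assumptions, not the paper's released code.

```python
import numpy as np

def select_esns(act_emotion, act_neutral, emotion_ok, content_ok, top_k=256):
    """Sketch: pick emotion-sensitive neurons (ESNs) from successful runs.

    act_emotion : (n_runs, n_neurons) activations from target-emotion generations
    act_neutral : (n_runs, n_neurons) activations from neutral generations
    emotion_ok  : (n_runs,) bool, run realized the target emotion
    content_ok  : (n_runs,) bool, run preserved the spoken content
    """
    # Success filter: keep only runs that satisfy BOTH criteria,
    # i.e. emotion realization AND content preservation.
    keep = emotion_ok & content_ok
    if not keep.any():
        raise ValueError("no successful runs to aggregate")

    # Aggregation: mean activation gap between emotional and neutral runs.
    gap = act_emotion[keep].mean(axis=0) - act_neutral[keep].mean(axis=0)

    # Sparse mask: the top_k neurons with the largest absolute gap.
    esn_idx = np.argsort(-np.abs(gap))[:top_k]
    return esn_idx, gap[esn_idx]
```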
Control the emotional tone of generated speech without any training by directly manipulating specific neurons within large audio-language models.
Large audio-language models (LALMs) can produce expressive speech, yet reliable emotion control remains elusive: conversions often miss the target affect and may degrade linguistic fidelity through refusals, hallucinations, or paraphrase. We present, to our knowledge, the first neuron-level study of emotion control in speech-generative LALMs and demonstrate that compact sets of emotion-sensitive neurons (ESNs) are causally actionable, enabling training-free emotion steering at inference time. ESNs are identified via success-filtered activation aggregation enforcing both emotion realization and content preservation. Across three LALMs (Qwen2.5-Omni-7B, MiniCPM-o 4.5, Kimi-Audio), ESN interventions yield emotion-specific gains that generalize to unseen speakers and are supported by automatic and human evaluation. Controllability depends on selector design, mask sparsity, filtering, and intervention strength. Our results establish a mechanistic framework for training-free emotion control in speech generation.
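The intervention itself can be pictured as a small inference-time edit to the selected neurons. Below is a hedged sketch using standard PyTorch forward hooks; the layer path, steering vector, and strength parameter `alpha` are assumptions for illustration, and the abstract's "intervention strength" is modeled here as a simple additive scale.

```python
import torch

def make_esn_hook(esn_idx, steer_vec, alpha=1.0):
    """Forward hook that nudges ESN activations toward the target emotion.

    esn_idx   : LongTensor of neuron indices within the hooked layer
    steer_vec : per-ESN steering values (e.g., the activation gaps above)
    alpha     : intervention strength
    """
    def hook(module, inputs, output):
        out = output.clone()
        # Additive steering applied only to the selected ESNs; all other
        # neurons pass through untouched, so no training is required.
        out[..., esn_idx] += alpha * steer_vec
        return out
    return hook

# Hypothetical usage (real module names depend on the specific LALM):
# handle = model.layers[20].mlp.register_forward_hook(
#     make_esn_hook(esn_idx, torch.tensor(gap_vals, dtype=torch.float32), alpha=2.0))
# ... generate speech with the hook active ...
# handle.remove()
```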