CCR-Bench is introduced as a benchmark to evaluate LLMs on complex instructions involving entangled content/formatting, intricate control flows, and real-world scenarios. The benchmark features tasks requiring deep reasoning, conditional logic, and procedural planning, all derived from industrial applications. Experiments reveal that even state-of-the-art LLMs struggle with CCR-Bench, highlighting a gap between current capabilities and real-world demands.
LLMs still fail to follow complex instructions that entangle content, formatting, control flow, and real-world constraints, despite progress on simpler benchmarks.
Enhancing the ability of large language models (LLMs) to follow complex instructions is critical for their deployment in real-world applications. However, existing evaluation methods often oversimplify instruction complexity as a mere additive combination of atomic constraints, failing to adequately capture the high-dimensional complexity arising from the intricate interplay of content and format, logical workflow control, and real-world applications. This leads to a significant gap between current evaluation practices and practical demands. To bridge this gap, we introduce CCR-Bench, a novel benchmark designed to assess LLMs' adherence to complex instructions. CCR-Bench is characterized by: (1) deep entanglement of content and formatting requirements in task specifications; (2) instructions that involve intricate task decomposition, conditional reasoning, and procedural planning; and (3) evaluation samples derived entirely from real-world industrial scenarios. Extensive experiments on CCR-Bench demonstrate that even state-of-the-art models exhibit substantial performance deficiencies, clearly quantifying the gap between current LLM capabilities and the demands of real-world instruction understanding. We believe that CCR-Bench offers a more rigorous and realistic evaluation framework, advancing the development of LLMs toward the next generation of models capable of understanding and executing complex tasks in industrial applications.