The paper introduces BiManiBench, a hierarchical benchmark designed to evaluate the bimanual coordination capabilities of Multimodal Large Language Models (MLLMs) across spatial reasoning, action planning, and end-effector control. It addresses a limitation of existing benchmarks, which focus primarily on single-arm manipulation and therefore miss the complexities of bimanual tasks. Experiments on 30+ state-of-the-art MLLMs reveal deficiencies in dual-arm spatial grounding and control that lead to mutual interference and sequencing errors, indicating a shallow understanding of kinematic constraints.
MLLMs can ace the high-level strategy for two-handed robot tasks, but still fumble basic coordination, like keeping the robot's arms from smashing into each other.
Multimodal Large Language Models (MLLMs) have significantly advanced embodied AI, and using them to benchmark robotic intelligence has become a pivotal trend. However, existing frameworks remain predominantly confined to single-arm manipulation, failing to capture the spatio-temporal coordination required for bimanual tasks such as lifting a heavy pot. To address this, we introduce BiManiBench, a hierarchical benchmark that evaluates MLLMs across three tiers: fundamental spatial reasoning, high-level action planning, and low-level end-effector control. Our framework isolates challenges unique to bimanual manipulation, such as arm reachability and kinematic constraints, thereby distinguishing perceptual hallucinations from planning failures. Analysis of over 30 state-of-the-art models reveals that, despite proficient high-level reasoning, MLLMs struggle with dual-arm spatial grounding and control, frequently resulting in mutual interference and sequencing errors. These findings suggest that the current paradigm lacks a deep understanding of mutual kinematic constraints, highlighting the need for future research on inter-arm collision avoidance and fine-grained temporal sequencing.
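As a concrete illustration of the three-tier hierarchy described above, the sketch below shows one way such an evaluation could be organized. This is a minimal Python sketch under assumptions: the class, field, and task names are hypothetical and do not correspond to the paper's actual code or API.

```python
# Minimal sketch of a three-tier bimanual evaluation, mirroring the hierarchy
# described in the abstract. All identifiers are illustrative assumptions,
# not BiManiBench's actual API.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BenchmarkTier:
    name: str           # tier identifier, e.g. "spatial_reasoning"
    description: str    # what capability this tier probes
    tasks: List[str]    # task identifiers belonging to the tier


# Hypothetical tier layout: spatial reasoning -> action planning -> control.
TIERS: Dict[str, BenchmarkTier] = {
    "spatial_reasoning": BenchmarkTier(
        name="spatial_reasoning",
        description="Fundamental dual-arm spatial grounding (e.g. per-arm reachability)",
        tasks=["left_right_reachability", "workspace_overlap"],
    ),
    "action_planning": BenchmarkTier(
        name="action_planning",
        description="High-level sequencing of coordinated bimanual actions",
        tasks=["lift_heavy_pot_plan", "handover_ordering"],
    ),
    "end_effector_control": BenchmarkTier(
        name="end_effector_control",
        description="Low-level control queries under mutual kinematic constraints",
        tasks=["dual_gripper_pose", "collision_free_motion"],
    ),
}


def evaluate(model: Callable[[str], str]) -> Dict[str, float]:
    """Score a model per tier; the correctness check is a placeholder."""
    scores: Dict[str, float] = {}
    for tier in TIERS.values():
        correct = sum(1 for task in tier.tasks if model(task) == "expected")
        scores[tier.name] = correct / len(tier.tasks)
    return scores
```

Reporting scores per tier rather than as a single aggregate is what would let an evaluation like this separate perceptual failures (low spatial-reasoning scores) from planning or control failures, in the spirit of the distinction the abstract draws.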