Tongji University, Shanghai, China, Texas A&M University

M+ trajectories; RoboCOIN [26] collects over 180,000 demonstrations for 421 tasks. However, these datasets often focus on a few common tasks and behaviors. Once duplicates are removed and the tasks are categorized by semantic meaning, most of them concentrate on very common behaviors such as “pick and hold”, with little coverage of complex and long-tail tasks. This narrow task design biases the trained models and limits their usefulness as pre-trained models in real-world scenarios beyond a few common tasks. Current evaluation tasks suffer from analogous issues: when proposing new methods, most studies test only on a few common tasks and follow no unified task design standard, which makes fair comparisons across different works difficult.

To address these issues, we introduce the Great March 100 (GM-100) as the first step towards a robot learning Olympics. GM-100 consists of 100 carefully designed tasks that cover a wide range of interactions and long-tail behaviors. It aims to provide a diverse and challenging task set for comprehensively evaluating the capabilities of robotic agents, and to promote diversity and complexity in the task designs of robot datasets. The tasks are developed through systematic analysis and expansion of existing task designs, combined with insights from human action understanding. We collect a large amount of trajectory data on two different robotic platforms and evaluate several baseline models. Experimental results demonstrate that the GM-100 tasks are 1) feasible to execute and 2) sufficiently challenging to effectively differentiate the performance of various methods.
Moreover, to avoid human bias, the task design process does not use utility for real-world tasks as the selection standard. Instead, physical common sense and low-level manipulation knowledge (the how-level affordance) are the only standards used to generate and select the final tasks.

To summarize, in this report, we make the following contributions:
• We identify the limitations of existing robot task designs and evaluations, highlighting the need for more diverse and complex tasks.
• We propose GM-100, a task list consisting of 100 detail-oriented tasks that cover a wide range of interactions and long-tail behaviors.
• We collect a medium-sized dataset on robotic platforms and evaluate several baseline models, demonstrating that GM-100 is both challenging and effective.
Our data and code are available at https://rhos.ai/research/gm-100.

2 Related Work

2.1 Imitation Learning

Imitation learning underpins embodied intelligence by teaching agents to map sensory inputs to actions via expert demonstrations. Early methods include Behavioural Cloning [20], interactive aggregation as in DAgger [22], and adversarial approaches like GAIL [10]. More recently, diffusion-based policies such as ACT [31], Diffusion Policy [6], and

, $Z_{t_{2}}$, capturing both semantic content and uncertainty arising from ambiguity across views, report sections, or image-text relationships.

3.4 Model Training

During training, the model computes probabilistic distances between all relevant pairs of distributions, including image-text pairs, image-image pairs, and text-text pairs, using Equation 2 to compare Gaussian embeddings. To prevent the model from predicting unbounded variances and to regularize the distributional space, we additionally apply a variational information bottleneck (VIB) penalty (Equation 3), which computes the KL divergence between each embedding distribution and a unit Gaussian prior.
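The KL term of the VIB penalty has a well-known closed form when the embedding is a diagonal Gaussian. The following is a minimal sketch in plain Python rather than the PyTorch implementation; it assumes the variance is parameterized by its logarithm, and the function name `kl_to_unit_gaussian` is hypothetical. It is not taken from the MedProbCLIP code, and Equation 3 itself may use a different reduction or scaling.

```python
import math


def kl_to_unit_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    Assumption for illustration: the embedding is a diagonal Gaussian whose
    variance is stored as a log-variance vector. The per-dimension term is
    0.5 * (sigma^2 + mu^2 - 1 - log sigma^2), summed over dimensions.
    """
    if len(mu) != len(log_var):
        raise ValueError("mu and log_var must have the same length")
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )
```

The penalty is zero only when the embedding already equals the unit Gaussian prior, and it grows as the predicted variance becomes very large or very small, which is what discourages unbounded variances.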
The final training objective is a weighted combination of the inter-modal NLL, the intra-modal NLL terms, and KL regularization:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{inter}} + \lambda_{I}\mathcal{L}_{\text{intra-I}} + \lambda_{T}\mathcal{L}_{\text{intra-T}} + \beta_{I}\mathrm{KL}_{\text{img}} + \beta_{T}\mathrm{KL}_{\text{text}}, \tag{4}$$

where $\mathcal{L}_{\text{inter}}$ is the inter-modal probabilistic NLL, averaged over the four image-text pairs; $\mathcal{L}_{\text{intra-I}}$ and $\mathcal{L}_{\text{intra-T}}$ are the intra-modal symmetry losses between the image views and between the text inputs, respectively; $\mathrm{KL}_{\text{img}}$ and $\mathrm{KL}_{\text{text}}$ are the variational information bottleneck (VIB) KL divergences for the image and text embeddings; and $\lambda_{I}$, $\lambda_{T}$, $\beta_{I}$, and $\beta_{T}$ are scalar weights. This multi-view, multi-loss formulation provides richer supervision and produces probabilistic embeddings that are both semantically aligned and uncertainty-calibrated, ultimately improving cross-modal retrieval performance.

3.5 Implementation Details

We implement MedProbCLIP in PyTorch [21] following the model architecture introduced previously.
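As a rough illustration of how the weighted objective in Equation 4 combines its five terms, the sketch below uses plain Python floats. The function name `total_loss` and its argument names are hypothetical, and the weights are deliberately left without defaults because the text here does not state their values.

```python
def total_loss(l_inter, l_intra_i, l_intra_t, kl_img, kl_text,
               *, lambda_i, lambda_t, beta_i, beta_t):
    """Weighted sum of the five loss terms named in Equation 4.

    All inputs are scalars that are assumed to have been computed elsewhere;
    the four weights must be passed explicitly as keyword arguments.
    """
    return (
        l_inter
        + lambda_i * l_intra_i
        + lambda_t * l_intra_t
        + beta_i * kl_img
        + beta_t * kl_text
    )
```

Setting `beta_i` and `beta_t` to zero removes the KL regularization and leaves only the NLL terms, which can be a convenient way to check the effect of the VIB penalty in isolation.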
For image encoding, we use a ViT,

Lingyi Cai is with the Research Center of 6G Mobile Communications, School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, 430074, China, and also with the College of Computing and Data Science, Nanyang Technological University, Singapore (e-mail: lingyicai@hust.edu.cn).
Yu Zhang and Tao Jiang are with the Research Center of 6G Mobile Communications, School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, 430074, China (e-mail: yuzhang123@hust.edu.cn; Tao.jiang@ieee.org).
Ruichen Zhang, Yinqiu Liu, and Dusit Niyato are with the College of Computing and Data Science, Nanyang Technological University, Singapore (e-mails: ruichen.zhang@ntu.edu.sg; yinqiu001@e.ntu.edu.sg; dniyato@ntu.edu.sg).
Wei Ni is with the School of Engineering, Edith Cowan University, Perth, WA 6027, and the School of Computer Science and Engineering, University of New South Wales (UNSW), Sydney, NSW 2033, Australia (e-mail: Wei.Ni@ieee.org).
Abbas Jamalipour is with the School of Electrical and Computer Engineering, University of Sydney, Australia, and with the Graduate School of Information Sciences, Tohoku University, Japan (e-mail: a.jamalipour@ieee.org).
By jointly reinforcing informative visual tokens and suppressing irrelevant ones, DuCAR significantly reduces hallucinations in LVLMs, outperforming prior single-modality focused approaches.