Mar 1, 2026, arXiv:2603.01110

Compact Task-Aligned Imitation Learning for Laboratory Automation

Kanata Suzuki, Hanon Nakamurama, Kana Miyamoto, Tetsuya Ogata

AI Summary

This paper introduces TVF-DiT, a compact imitation learning framework for laboratory automation that combines a self-supervised vision foundation model with a vision-language model via a small adapter, integrated with a Diffusion Transformer for action prediction. By aligning vision and language modalities and leveraging diffusion-based policy learning, the method achieves high success rates (86.6% on average) across three real-world lab tasks while using fewer than 500M parameters, enabling deployment on resource-constrained hardware. The study demonstrates the effectiveness of task-specific prompts in improving vision-language alignment and overall task performance.

Key Contribution

Shows that a model with fewer than 500M parameters can reach strong imitation learning performance on real-world robotic laboratory tasks (86.6% average success across three tasks), indicating that very large models are not required for practical lab automation.
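As a rough illustration of why a sub-500M-parameter budget matters for deployment, the sketch below estimates the memory needed just to hold the weights at common numeric precisions. The 500M figure is the upper bound stated above; the helper name `weight_memory_gib` and the choice of precisions are illustrative assumptions, and activations, caches, and framework overhead are not counted.

```python
# Back-of-the-envelope weight memory for a model with at most 500M parameters.
# Only the weights are counted; activations and runtime overhead are excluded.

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gib(num_params: int, precision: str) -> float:
    """Return the memory needed to store the weights, in GiB (2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


if __name__ == "__main__":
    upper_bound = 500_000_000  # "fewer than 500M parameters", taken as an upper bound
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weight_memory_gib(upper_bound, precision):.2f} GiB")
```

Under these assumptions the weights alone stay below 2 GiB even at full 32-bit precision, which is consistent with the claim that the model can run on low-VRAM GPUs.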

Abstract

Robotic laboratory automation has traditionally relied on carefully engineered motion pipelines and task-specific hardware interfaces, resulting in high design cost and limited flexibility. While recent imitation learning techniques can generate general robot behaviors, their large model sizes often require high-performance computational resources, limiting applicability in practical laboratory environments. In this study, we propose a compact imitation learning framework for laboratory automation using small foundation models. The proposed method, TVF-DiT, aligns a self-supervised vision foundation model with a vision-language model through a compact adapter, and integrates them with a Diffusion Transformer-based action expert. The entire model consists of fewer than 500M parameters, enabling inference on low-VRAM GPUs. Experiments on three real-world laboratory tasks - test tube cleaning, test tube arrangement, and powder transfer - demonstrate an average success rate of 86.6%, significantly outperforming alternative lightweight baselines. Furthermore, detailed task prompts improve vision-language alignment and task performance. These results indicate that small foundation models, when properly aligned and integrated with diffusion-based policy learning, can effectively support practical laboratory automation with limited computational resources.
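The abstract does not specify the adapter's internals, so the following is only a minimal sketch of one common way such an alignment can be wired: a single linear projection that maps vision-encoder tokens into the embedding width of the language side, after which both token sets form one conditioning sequence for an action model. All names (`make_linear`, `apply_linear`, `build_condition`) and all dimensions are hypothetical and are not taken from the paper; the actual TVF-DiT adapter may differ.

```python
import random

# Hypothetical sketch: a linear "adapter" that projects vision tokens into the
# language embedding width. This is NOT the paper's implementation.


def make_linear(in_dim, out_dim, seed=0):
    """Create a randomly initialised linear layer as (weight, bias)."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def apply_linear(layer, tokens):
    """Apply the linear layer to every token (a list of floats)."""
    weight, bias = layer
    return [
        [sum(w * x for w, x in zip(row, token)) + b
         for row, b in zip(weight, bias)]
        for token in tokens
    ]


def build_condition(vision_tokens, text_tokens, adapter):
    """Project vision tokens to the text width, then concatenate both sets."""
    projected = apply_linear(adapter, vision_tokens)
    return projected + text_tokens


if __name__ == "__main__":
    vision_dim, text_dim = 6, 4  # toy widths, chosen only for illustration
    rng = random.Random(1)
    vision_tokens = [[rng.random() for _ in range(vision_dim)] for _ in range(3)]
    text_tokens = [[rng.random() for _ in range(text_dim)] for _ in range(2)]
    adapter = make_linear(vision_dim, text_dim)
    condition = build_condition(vision_tokens, text_tokens, adapter)
    print(len(condition), "tokens of width", len(condition[0]))
```

In a design of this kind the adapter is the only component that has to bridge the two pretrained models, which is one reason it can stay small relative to the encoders and the action expert.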

Inference & Quantization, Robotics & Embodied AI, Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
