CAS · iscas.ac.cn · May 28, 2026 · arXiv:2605.29559

LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents

Xiaoxuan Peng, Kai Zhang, Kaiqi Zhang, Xinyu Lu, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

AI Summary

The paper introduces LiteCoder-Terminal-Gen, a zero-dependency pipeline that generates executable terminal environments from domain specifications, addressing the limitations of training data scraped from external repositories. The authors build two datasets, LiteCoder-Terminal-SFT (11,255 expert trajectories across 10 domains) and LiteCoder-Terminal-RL (602 verifiable environments), and fine-tune Qwen-family models on them. Supervised fine-tuning yields clear gains over the base models on Terminal Bench 1.0, 2.0, and Pro, and Direct Multi-turn Preference Optimization (DMPO) adds further improvements, supporting synthetic environments as a training signal for agents on complex command-line tasks.
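To make "executable and verifiable environment" concrete, here is a minimal sketch of the general idea, not the paper's pipeline. The `TerminalTask` record, its field names, and the `evaluate` helper are all hypothetical; the sketch assumes a POSIX shell and treats a task as solved only when a verification command exits with status 0.

```python
import subprocess
import tempfile
from dataclasses import dataclass, field


@dataclass
class TerminalTask:
    """Hypothetical task record: an instruction, setup commands, and a verifier."""
    instruction: str
    setup: list[str] = field(default_factory=list)
    verify: str = "true"


def run(cmd: str, cwd: str) -> int:
    """Run one shell command inside the sandbox directory and return its exit code."""
    return subprocess.run(cmd, shell=True, cwd=cwd, capture_output=True).returncode


def evaluate(task: TerminalTask, agent_commands: list[str]) -> bool:
    """Replay setup and agent commands in a fresh directory, then run the verifier."""
    with tempfile.TemporaryDirectory() as workdir:
        for cmd in task.setup:
            if run(cmd, workdir) != 0:
                raise RuntimeError(f"setup failed: {cmd}")
        for cmd in agent_commands:
            run(cmd, workdir)  # agent commands may fail; only the verifier decides
        return run(task.verify, workdir) == 0


# Illustrative task (an assumption, not taken from the datasets).
task = TerminalTask(
    instruction="Create logs/status.txt containing the single line ok.",
    setup=["mkdir -p logs"],
    verify="grep -qx ok logs/status.txt",
)
solved = evaluate(task, ["echo ok > logs/status.txt"])
unsolved = evaluate(task, ["echo fail > logs/status.txt"])
```

Because the outcome depends only on the final state checked by the verifier, the same environment can score any trajectory, which is what makes it usable for both filtering expert trajectories and comparing trajectory pairs.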

Key Contribution

Rather than scraping existing repositories, this work shows that high-quality, executable terminal environments can be generated from scratch and used to train language agents that significantly outperform their base models.

Abstract

Mastering terminal environments requires language agents capable of multi-step planning, feedback-grounded execution, and dynamic state adaptation. However, training such agents is currently bottlenecked by a reliance on scraped external repositories, which limits domain diversity, environment controllability, and the targeting of specific capability deficits. We introduce LiteCoder-Terminal-Gen, a zero-dependency synthesis pipeline that autonomously generates executable and verifiable terminal training environments directly from domain specifications. Using this framework, we construct two large-scale resources: LiteCoder-Terminal-SFT, comprising 11,255 expert trajectories across 10 domains, and LiteCoder-Terminal-RL, featuring 602 verifiable environments for trajectory-level preference optimization. Supervised fine-tuning of Qwen-family models on our SFT dataset yields agents that significantly outperform their base counterparts. Notably, our 32B variant achieves 29.06%, 18.54%, and 34.00% pass@1 on Terminal Bench 1.0, 2.0, and Pro, respectively. Furthermore, applying Direct Multi-turn Preference Optimization (DMPO) on our RL environments yields additional performance gains. These results systematically demonstrate that fully synthetic, executable environments offer a scalable and verifiable supervision signal for mastering complex, real-world command-line workflows.
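The reported numbers are pass@1 scores. The abstract does not state how many attempts per task were sampled, so the following is only a sketch of the standard unbiased pass@k estimator commonly used in code-generation evaluation, 1 - C(n-c, k) / C(n, k), where n is the number of attempts and c the number that passed. The function names and the example results are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n attempts, c of them passing."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of attempts contains a pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[list[bool]]) -> float:
    """Mean pass@1 over tasks; each inner list holds one task's attempt outcomes."""
    return sum(pass_at_k(len(r), sum(r), 1) for r in results) / len(results)


# Made-up outcomes for four tasks with two attempts each.
example = [[True, False], [False, False], [True, True], [True, False]]
score = benchmark_pass_at_1(example)
```

For k = 1 the estimator reduces to c / n per task, so the benchmark score is simply the average fraction of passing attempts across tasks.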

Code Generation & Program Synthesis · Data Curation & Synthetic Data · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 28

Year: 2026

Venue: N/A
