NVIDIA | Feb 24, 2026 | arXiv:2602.21193

On Data Engineering for Scaling LLM Terminal Capabilities

Renjie Pi, Grace Lam, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro, Wei Ping

AI Summary

This paper investigates data engineering strategies for improving the terminal capabilities of large language models, focusing on synthetic data generation and training techniques. The authors introduce Terminal-Task-Gen, a pipeline for generating synthetic terminal tasks, and Terminal-Corpus, a large-scale dataset created with this pipeline. Nemotron-Terminal models (8B, 14B, 32B), initialized from Qwen3 and trained on this dataset, score substantially higher on Terminal-Bench 2.0 than their starting points, with the 32B model rising from 3.4% to 27.4%.

Key Contribution

A synthetic data pipeline, Terminal-Task-Gen, lets smaller LLMs match the performance of significantly larger models on terminal tasks.

Abstract

Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%, Nemotron-Terminal-14B improves from 4.0% to 20.2%, and Nemotron-Terminal-32B improves from 3.4% to 27.4%, matching the performance of significantly larger models. To accelerate research in this domain, we open-source our model checkpoints and most of our synthetic datasets at https://huggingface.co/collections/nvidia/nemotron-terminal.
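To put the reported gains in perspective, the short sketch below turns the three before/after scores from the abstract into absolute gains (percentage points) and relative multipliers. The scores are taken directly from the abstract; reading the "before" figures as the scores of the Qwen3 starting checkpoints is an assumption, and the helper name `gains` is purely illustrative.

```python
# Terminal-Bench 2.0 scores (%) as quoted in the abstract: (before, after).
# Assumption: "before" is the score of the Qwen3 checkpoint the model starts from.
scores = {"8B": (2.5, 13.0), "14B": (4.0, 20.2), "32B": (3.4, 27.4)}


def gains(before, after):
    """Return (absolute gain in percentage points, relative multiplier)."""
    return round(after - before, 1), round(after / before, 2)


# Hypothetical summary structure: model size -> (points gained, multiplier).
summary = {size: gains(*pair) for size, pair in scores.items()}

for size, (points, factor) in summary.items():
    print(f"{size}: +{points} points, {factor}x the starting score")
```

Separating absolute from relative gain matters here because the starting scores are all below 5%, so large multipliers correspond to modest absolute accuracy.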

Data Curation & Synthetic Data | Tool Use & Agents | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 41

Year: 2026

Venue: N/A
