The authors introduce SciAgentGym, an interactive environment with 1,780 domain-specific tools across four natural science disciplines, and SciAgentBench, a tiered evaluation suite, to benchmark multi-step scientific tool use in LLM agents. They find that even advanced models such as GPT-5 struggle with long-horizon workflows in this environment: GPT-5's success rate falls from 60.6% to 30.9% as interaction horizons extend. To mitigate this, they propose SciForge, a data synthesis method that models the tool action space as a dependency graph, and show that SciAgent-8B, fine-tuned on SciForge-generated data, outperforms the much larger Qwen3-VL-235B-Instruct model.
GPT-5's success rate falls by nearly half, from 60.6% to 30.9%, as interaction horizons extend into multi-step workflows, revealing a critical gap in current LLM agents' ability to orchestrate complex tool use.
Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models struggle with complex scientific tool-use. Even for a leading model like GPT-5, success rates drop sharply from 60.6% to 30.9% as interaction horizons extend, primarily due to failures in multi-step workflow execution. To address this, we propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. By fine-tuning on these trajectories, our SciAgent-8B outperforms the significantly larger Qwen3-VL-235B-Instruct while exhibiting positive cross-domain transfer of scientific tool-use capabilities. These results underscore the promising potential of next-generation autonomous scientific agents.
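The abstract describes SciForge only at a high level, so the following is a minimal sketch of the general idea of treating tools as a dependency graph and sampling trajectories that respect it. It is not the authors' implementation. The tool names, the input/output type labels, and the functions `build_dependency_graph` and `sample_trajectory` are all hypothetical, chosen purely for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """A hypothetical tool: consumes named artifacts and produces one artifact."""
    name: str
    inputs: frozenset
    output: str


# Toy tool set (illustrative only; not taken from SciAgentGym).
TOOLS = [
    Tool("parse_smiles", frozenset({"smiles"}), "molecule"),
    Tool("optimize_geometry", frozenset({"molecule"}), "geometry"),
    Tool("compute_energy", frozenset({"geometry"}), "energy"),
    Tool("compute_dipole", frozenset({"geometry"}), "dipole"),
    Tool("summarize", frozenset({"energy", "dipole"}), "report"),
]


def build_dependency_graph(tools):
    """Map each tool to the tools that can consume its output."""
    return {
        producer.name: sorted(
            consumer.name for consumer in tools if producer.output in consumer.inputs
        )
        for producer in tools
    }


def sample_trajectory(tools, initial_artifacts, max_steps, rng):
    """Sample a tool sequence in which every call has its inputs already available."""
    available = set(initial_artifacts)
    trajectory = []
    for _ in range(max_steps):
        # A tool is runnable if all inputs exist and its output is not yet produced.
        runnable = [
            t for t in tools if t.inputs <= available and t.output not in available
        ]
        if not runnable:
            break
        tool = rng.choice(runnable)
        trajectory.append(tool.name)
        available.add(tool.output)
    return trajectory


graph = build_dependency_graph(TOOLS)
trajectory = sample_trajectory(TOOLS, {"smiles"}, max_steps=10, rng=random.Random(0))
```

In this toy setup every sampled trajectory is dependency-valid by construction: a tool is chosen only once its inputs have been produced. This is one plausible reading of "logic-aware" trajectories. How SciForge actually constructs its graph and samples from it is not specified in the abstract.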