
Allen Institute for AI (AI2)
Non-profit research institute founded by Paul Allen. Known for Semantic Scholar, OLMo, and AI for science.
allenai.org
Recent Papers
The paper introduces Olmix, a framework designed to address challenges in data mixing for language model training, specifically focusing on understanding the configuration space of mixing methods and efficiently adapting to evolving domain sets. Through an empirical study, the authors identify key design choices for effective mixing methods and propose "mixture reuse," a technique that leverages past mixture ratios to efficiently recompute mixtures after domain set updates. Experiments show that mixture reuse matches the performance of full recomputation while using 74% less compute, and outperforms training without mixing by 11.6% on downstream tasks.
Introduces and validates "mixture reuse," a novel technique for efficiently adapting data mixtures in language model training when the domain set evolves.
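To make the idea concrete, here is a minimal sketch of what reusing a mixture after a domain set update could look like. This is a hypothetical illustration under simple assumptions (retained domains keep their relative ratios, new domains get a fixed prior mass, everything is renormalized); the function name, the `new_domain_ratio` parameter, and the allocation rule are invented for this example and are not the Olmix method.

```python
def reuse_mixture(old_ratios, new_domains, new_domain_ratio=0.1):
    """Illustrative 'mixture reuse' sketch: keep ratios for retained
    domains, give newly added domains a small prior mass, renormalize.
    (Hypothetical example, not the Olmix implementation.)"""
    retained = {d: r for d, r in old_ratios.items() if d in new_domains}
    added = [d for d in new_domains if d not in old_ratios]

    # Reserve a fraction of the total probability mass for new domains.
    reserve = new_domain_ratio * len(added)
    total_old = sum(retained.values())

    # Retained domains share the remaining mass in their old proportions.
    mixture = {d: (r / total_old) * (1.0 - reserve) for d, r in retained.items()}
    for d in added:
        mixture[d] = new_domain_ratio

    # Renormalize so the ratios sum to 1.
    total = sum(mixture.values())
    return {d: r / total for d, r in mixture.items()}


# Example: a domain is dropped ("books") and one is added ("math").
old = {"web": 0.5, "code": 0.3, "books": 0.2}
new_mix = reuse_mixture(old, ["web", "code", "math"])
```

The point of the sketch is the cost model: only the new domains need any fresh ratio estimation, while everything else is carried over, which is where the compute savings come from.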
The authors introduce MolmoSpaces, a large-scale, open-source ecosystem comprising over 230k diverse indoor environments and 130k richly annotated object assets, designed to address the limitations of existing robot benchmarks in capturing the long tail of real-world scenarios. This simulator-agnostic ecosystem supports a wide range of embodied tasks, including navigation, manipulation, and long-horizon planning, and includes MolmoSpaces-Bench, a benchmark suite of 8 tasks. Experiments demonstrate strong sim-to-real correlation and highlight sensitivities to factors like prompt phrasing and camera occlusion, establishing MolmoSpaces as a valuable resource for scalable robot learning research.
Introduces a large-scale, simulator-agnostic, and open-source ecosystem for robot learning, featuring diverse indoor environments and richly annotated objects, to facilitate more robust and generalizable robot policies.
The paper introduces Soft-Verified Efficient Repository Agents (SERA), a supervised finetuning method for efficiently training coding agents specialized to private codebases. SERA leverages Soft Verified Generation (SVG) to create thousands of synthetic trajectories from a single repository, enabling rapid and cost-effective specialization. The resulting SERA models achieve state-of-the-art performance among fully open-source models, matching the performance of models like Devstral-Small-2 at a fraction of the cost compared to reinforcement learning or previous synthetic data methods.
Introduces Soft Verified Generation (SVG), a novel method for generating synthetic code trajectories that enables efficient supervised finetuning of coding agents specialized to private codebases.
The authors introduce Action Reasoning Models (ARMs) for robotics, which integrate perception, planning, and control in a three-stage pipeline to improve adaptability and grounding. They present MolmoAct, a 7B-parameter ARM that encodes observations and instructions into depth-aware perception tokens, generates spatial plans as trajectory traces, and predicts low-level actions. MolmoAct achieves state-of-the-art performance in simulation and real-world settings, demonstrating improved zero-shot accuracy, long-horizon task success, and out-of-distribution generalization compared to existing models like Pi-0 and ThinkAct.
Introduces Action Reasoning Models (ARMs), a novel class of robotic foundation models that explicitly incorporate spatial planning as an intermediate reasoning step between perception and action.
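The three-stage structure described above (observation → perception tokens → trajectory trace → low-level actions) can be sketched as a simple pipeline. Everything here is a stand-in: the stage functions, token format, and action representation are invented stubs meant only to show the decoding order, not MolmoAct's actual architecture.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ARMOutput:
    perception_tokens: List[str]                 # depth-aware scene encoding
    trajectory_trace: List[Tuple[float, float]]  # spatial plan (2D waypoints)
    actions: List[List[float]]                   # low-level action commands


def perceive(image: List[int]) -> List[str]:
    # Stage 1 (stub): encode the observation into perception tokens.
    return [f"tok_{v}" for v in image]


def plan(tokens: List[str], instruction: str) -> List[Tuple[float, float]]:
    # Stage 2 (stub): emit a spatial plan as a trajectory trace.
    return [(float(i), float(i)) for i in range(len(tokens))]


def act(tokens: List[str], trace: List[Tuple[float, float]]) -> List[List[float]]:
    # Stage 3 (stub): decode low-level actions conditioned on the plan.
    return [[x * 0.1, y * 0.1] for x, y in trace]


def arm_pipeline(image: List[int], instruction: str) -> ARMOutput:
    """Hypothetical sketch of ARM decoding: perceive, then plan, then act.
    Each intermediate output conditions the next stage."""
    tokens = perceive(image)
    trace = plan(tokens, instruction)
    actions = act(tokens, trace)
    return ARMOutput(tokens, trace, actions)
```

The design point the sketch illustrates is that the trajectory trace is an explicit, inspectable intermediate between perception and action, rather than the model mapping pixels straight to motor commands.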
The paper introduces RewardBench 2, a new benchmark for evaluating reward models across multiple skills, featuring challenging data derived from novel human prompts. It addresses the gap between reward model evaluation and their effectiveness in downstream tasks by providing a more rigorous assessment of reward model accuracy. Scores on the benchmark correlate strongly with downstream performance in both inference-time scaling and RLHF training, while models score substantially lower on it than on the original RewardBench, indicating a more challenging evaluation.
Introduces a novel multi-skill reward modeling benchmark, RewardBench 2, using new human prompts to improve the rigor and relevance of reward model evaluation.

