CMU Machine Learning
Carnegie Mellon's Machine Learning Department. Home to foundational work in statistical ML, deep learning, and robotics.
www.ml.cmu.edu
Recent Papers
The demo showcases Edison's multimodal capabilities across mathematics, computer science, and data science, demonstrating how its modular architecture enables rapid deployment and customization for different educational contexts while maintaining instructional effectiveness in answering student questions.
The paper introduces GameDevBench, a new benchmark for evaluating multimodal agents in game development, a domain requiring complex code manipulation and multimodal asset handling. The benchmark comprises 132 tasks derived from tutorials, demanding significantly more code and file changes than existing software development benchmarks. Experiments reveal that current agents struggle with game development tasks, particularly those involving 2D graphics, but that performance improves when image- and video-based feedback is incorporated.
Introduces GameDevBench, a novel benchmark designed to evaluate and advance multimodal agents in the challenging domain of game development.
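The finding that visual feedback helps suggests a simple edit-build-observe loop. The sketch below is only an illustration of that mechanism, not the benchmark's released harness; `agent` and `build_and_render` are assumed interfaces.

```python
def agent_loop(agent, build_and_render, task, max_steps=20):
    """Edit-build-observe loop: after each proposed change, the game is
    rebuilt and a rendered frame is fed back as a multimodal observation."""
    observations = [("text", task)]
    for _ in range(max_steps):
        edit = agent(observations)       # propose a code or asset change
        if edit is None:                 # agent signals it is finished
            break
        frame = build_and_render(edit)   # rebuild the game, capture a screenshot
        observations += [("edit", edit), ("image", frame)]
    return observations
```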
The paper introduces RefVFX, a framework for transferring complex temporal visual effects from a reference video to a target video or image in a feed-forward manner. To train the model, the authors created a large-scale dataset of video triplets using a novel automated pipeline that preserves input motion while applying repeatable effects, augmented with LoRA-derived and programmatically generated data. Experiments demonstrate that RefVFX generalizes to unseen effects, produces temporally coherent edits, and outperforms text-prompt baselines.
Introduces RefVFX and a corresponding large-scale dataset to enable tuning-free transfer of complex temporal visual effects across videos.
The paper introduces AfriEconQA, a new benchmark dataset for African economic analysis constructed from 236 World Bank reports, designed to evaluate numerical reasoning and temporal disambiguation capabilities of models. The dataset comprises 8,937 question-answer pairs, filtered from a larger synthetic pool to ensure high-quality evidence-answer alignment and temporal provenance. Benchmarking experiments using GPT-5 Mini, GPT-4o, and Qwen 32B in zero-shot and RAG configurations reveal a significant performance gap, highlighting the dataset's challenge for current LLMs and the need for domain-specific IR and RAG advancements.
Introduces AfriEconQA, a novel benchmark dataset specifically designed to evaluate the performance of information retrieval and question answering systems on African economic analysis using World Bank reports.
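The two evaluation settings described above (zero-shot vs. RAG) reduce to a small harness like the following sketch. `model`, `retriever`, and the record fields are assumed interfaces rather than the benchmark's actual API, and substring matching stands in for the paper's real scoring.

```python
def evaluate_qa(model, dataset, retriever=None, k=5):
    """Zero-shot when retriever is None; otherwise prepend top-k retrieved
    World Bank report passages before querying the model."""
    correct = 0
    for record in dataset:  # e.g. {"question": ..., "answer": ...}
        prompt = record["question"]
        if retriever is not None:
            passages = retriever(record["question"], k)
            prompt = "\n".join(passages) + "\n\nQuestion: " + prompt
        prediction = model(prompt)
        correct += int(record["answer"].strip().lower()
                       in prediction.strip().lower())
    return correct / len(dataset)
```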
The paper introduces PAN, a general world model capable of predicting future world states through high-quality video simulation conditioned on history and natural language actions. PAN uses a Generative Latent Prediction (GLP) architecture, combining an autoregressive latent dynamics backbone based on a large language model (LLM) for grounding simulation in text-based knowledge, with a video diffusion decoder for reconstructing detailed visual observations. Trained on large-scale video-action pairs, PAN demonstrates strong performance in action-conditioned world simulation, long-horizon forecasting, and simulative reasoning across diverse domains.
The paper pioneers a general world model, PAN, that unifies latent space reasoning with realizable world dynamics to achieve open-domain, action-conditioned video simulation with coherent, long-term consistency.
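The Generative Latent Prediction structure is easy to sketch at the module level: an autoregressive dynamics backbone rolls latents forward under an action, and a decoder renders each latent to pixels. The stand-ins below are plain MLPs with illustrative dimensions, not PAN's LLM backbone or video diffusion decoder.

```python
import torch

class GenerativeLatentPrediction(torch.nn.Module):
    """Structural sketch of GLP: predict the next latent world state from
    (latent, action), then decode that latent back to an observation."""
    def __init__(self, latent_dim=256, action_dim=128, frame_dim=3 * 64 * 64):
        super().__init__()
        self.dynamics = torch.nn.Sequential(      # stand-in for the LLM backbone
            torch.nn.Linear(latent_dim + action_dim, 512),
            torch.nn.GELU(),
            torch.nn.Linear(512, latent_dim),
        )
        self.decoder = torch.nn.Sequential(       # stand-in for the diffusion decoder
            torch.nn.Linear(latent_dim, 512),
            torch.nn.GELU(),
            torch.nn.Linear(512, frame_dim),
        )

    def forward(self, latent, action_emb, horizon=8):
        frames = []
        for _ in range(horizon):  # autoregressive long-horizon rollout
            latent = self.dynamics(torch.cat([latent, action_emb], dim=-1))
            frames.append(self.decoder(latent))
        return torch.stack(frames, dim=1), latent
```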
The paper introduces summarization-based context management for RL fine-tuning of LLMs in long-horizon multi-turn tool use, addressing the context-length bottleneck. The authors formulate a policy-gradient representation that allows standard LLM RL infrastructure to optimize both tool-use behaviors and summarization strategies end to end. The proposed algorithm, SUPO, demonstrates improved success rates and maintained or reduced context length on interactive function-calling and search tasks, even scaling beyond training-time summarization rounds at test time.
Introduces and validates a summarization-based context management approach that enables RL agents to scale beyond fixed context length limits in long-horizon multi-turn tasks.
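The rollout-side mechanism amounts to compressing the running context with the policy's own summary whenever it exceeds a budget. Below is a minimal sketch of that loop under assumed interfaces: `policy_step(context) -> (action, observation, finished)` and `summarize(context) -> str` are placeholders, and the token counter is a crude stand-in for a real tokenizer.

```python
def count_tokens(messages):
    # Crude stand-in for a real tokenizer: whitespace token count.
    return sum(len(str(m).split()) for m in messages)

def rollout_with_summarization(policy_step, summarize, task,
                               max_turns=32, max_ctx_tokens=2048):
    """Multi-turn tool-use rollout that replaces the accumulated context
    with a policy-generated summary once it grows past the budget, so an
    episode can run for more turns than the raw context window allows."""
    context = [f"TASK: {task}"]
    for _ in range(max_turns):
        action, observation, finished = policy_step(context)
        context += [f"ACTION: {action}", f"OBS: {observation}"]
        if finished:
            break
        if count_tokens(context) > max_ctx_tokens:
            # In SUPO both the tool-use policy and this summarization are
            # optimized end to end; here summarize is just a callback.
            context = [f"TASK: {task}", f"SUMMARY: {summarize(context)}"]
    return context
```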
The paper introduces FuncBenchGen, a synthetic benchmark for evaluating multi-step tool-use in language models by framing tool use as traversal over a function-dependency DAG. This framework allows for controlled task difficulty and avoids data contamination, addressing limitations in existing TaLM benchmarks. Experiments reveal performance degradation with increasing dependency depth and the difficulty posed by connected distractor functions, while also demonstrating that explicitly restating prior variable values significantly improves performance.
Introduces FuncBenchGen, a contamination-free and controllable framework for evaluating multi-step tool-use in language models via synthetic task generation.
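The core framing, tool use as traversal over a function-dependency DAG, can be illustrated in a few lines. The generator below is a hypothetical sketch of that idea, not the authors' code: node i depends only on earlier nodes (guaranteeing acyclicity), each function sums its parents' outputs, and dependency depth serves as the difficulty knob.

```python
import random

def make_dag(num_funcs: int, max_parents: int, seed: int = 0):
    """Random function-dependency DAG: node i may depend only on j < i."""
    rng = random.Random(seed)
    parents = {0: []}
    for i in range(1, num_funcs):
        k = rng.randint(0, min(max_parents, i))
        parents[i] = rng.sample(range(i), k)
    return parents

def evaluate(node, parents):
    """Ground-truth value: each function sums its parents' outputs."""
    if not parents[node]:
        return node + 1  # arbitrary leaf value
    return sum(evaluate(p, parents) for p in parents[node])

def depth(node, parents):
    """Dependency depth -- the controlled-difficulty axis."""
    if not parents[node]:
        return 0
    return 1 + max(depth(p, parents) for p in parents[node])

dag = make_dag(num_funcs=10, max_parents=3)
target = 9
print(f"task: compute f{target}; depth={depth(target, dag)}, "
      f"answer={evaluate(target, dag)}")
```

Because tasks are generated on the fly from a seed, no instance can appear in any pretraining corpus, which is what makes the benchmark contamination-free.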
This paper addresses the challenge of small object detection in satellite imagery across different geographical regions by using generative AI to create synthetic training data. A Stable Diffusion model is fine-tuned on both source (Selwyn, New Zealand) and target (Utah, USA) regions, leveraging cross- and self-attention mechanisms and CLIPSeg for image segmentation. The approach demonstrates a 20% improvement in detection accuracy on the target dataset compared to a baseline trained solely on source data, highlighting the effectiveness of generative data augmentation for cross-regional generalization.
Demonstrates a generative AI pipeline that synthesizes realistic satellite imagery for improved cross-regional small object detection, achieving significant accuracy gains.
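A minimal sketch of the generate-then-label step is below, using off-the-shelf diffusers and CLIPSeg checkpoints. The region-specific fine-tuning that is the paper's actual contribution is omitted, and the checkpoint names and prompt are placeholders.

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

# Placeholder checkpoints; the paper fine-tunes on Selwyn/Utah imagery.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
segmenter = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

# Synthesize a target-region-style tile, then derive a pseudo-label mask
# from a text query; thresholded masks can be turned into training boxes.
image = pipe("aerial satellite view of a suburban street with parked cars").images[0]
inputs = processor(text=["car"], images=[image], return_tensors="pt")
with torch.no_grad():
    logits = segmenter(**inputs).logits   # coarse per-pixel relevance map
mask = torch.sigmoid(logits) > 0.5        # pseudo-label for detector training
```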
The paper introduces FlexDataset, a composition-to-data (C2D) framework that generates high-fidelity, pixel-level annotated synthetic datasets for tasks like salient object detection, depth estimation, and segmentation. It addresses limitations of existing text-to-data methods in generating complex scenes by offering precise positional and categorical control through a composition-to-image (C2I) framework. The proposed Versatile Annotation Generation (VAG) scheme leverages tuned perception decoders to exploit rich latent representations, achieving a nearly fivefold reduction in annotation time and enabling unlimited generation of customized, multi-instance, multi-category (MIMC) annotated data.
Pioneers a composition-to-data (C2D) framework, FlexDataset, for generating high-fidelity annotated datasets with precise control over object composition and efficient annotation generation.
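The "tuned perception decoders" idea is that small heads read annotations straight out of the generator's latent features, so labels arrive with the image instead of from a separate annotation pass. The stand-in below is purely illustrative; shapes and head designs are guesses, not FlexDataset's decoders.

```python
import torch

class PerceptionHeads(torch.nn.Module):
    """Illustrative annotation heads decoding a segmentation map and a
    depth map directly from a generator's latent feature map."""
    def __init__(self, latent_channels=4, num_classes=10):
        super().__init__()
        self.seg_head = torch.nn.Conv2d(latent_channels, num_classes, 1)
        self.depth_head = torch.nn.Conv2d(latent_channels, 1, 1)

    def forward(self, latent):  # latent: (B, C, H, W)
        return self.seg_head(latent), self.depth_head(latent)

heads = PerceptionHeads()
z = torch.randn(1, 4, 64, 64)          # e.g. an SD-style image latent
seg_logits, depth = heads(z)
print(seg_logits.shape, depth.shape)   # (1, 10, 64, 64) (1, 1, 64, 64)
```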
The paper details the training process of LLM360 K2-65B, a 65 billion-parameter language model, emphasizing a 360-degree open-source approach to provide full transparency and access to training resources. K2 DIAMOND, the first model in the K2 project, achieves performance surpassing LLaMA-65B and rivaling LLaMA2-70B with fewer FLOPs and tokens. The work contributes a longitudinal analysis of K2 DIAMOND's capabilities throughout training and outlines future models in the TXT360 series.
Presents a fully transparent, end-to-end account of training a 65B parameter LLM, including implementation details and longitudinal performance analysis, to address the lack of transparency in training large-scale models.
The paper introduces a novel method for composing object-level visual prompts within text-to-image diffusion models to generate semantically coherent compositions across diverse scenes and styles. To preserve object identity while enabling compositional flexibility, they propose a KV-mixed cross-attention mechanism that uses keys from a small-bottleneck encoder for layout control and values from a larger-bottleneck encoder for detailed appearance. Object-level compositional guidance during inference further enhances identity preservation and layout accuracy, leading to improved diversity and quality in generated compositions.
Introduces a KV-mixed cross-attention mechanism that disentangles layout control from appearance details to enable compositional image generation with object-level visual prompts while preserving object identity.
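The KV-mixing itself is a small change to standard cross-attention: keys come from one feature stream (layout) and values from another (appearance). The sketch below shows only that split, with illustrative dimensions; the encoders, bottleneck sizes, and integration into the diffusion U-Net are not the paper's.

```python
import torch

class KVMixedCrossAttention(torch.nn.Module):
    """Cross-attention whose keys are projected from a small-bottleneck
    (layout) encoding of the visual prompt and whose values are projected
    from a larger-bottleneck (appearance) encoding."""
    def __init__(self, dim=320, key_dim=64, value_dim=1024):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim)
        self.to_k = torch.nn.Linear(key_dim, dim)    # layout stream -> keys
        self.to_v = torch.nn.Linear(value_dim, dim)  # appearance stream -> values

    def forward(self, hidden, layout_feats, appearance_feats):
        # hidden: (B, N, dim) denoiser tokens; the two prompt streams are
        # (B, M, key_dim) and (B, M, value_dim) token sequences.
        q, k, v = self.to_q(hidden), self.to_k(layout_feats), self.to_v(appearance_feats)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v

mix = KVMixedCrossAttention()
out = mix(torch.randn(2, 1024, 320), torch.randn(2, 16, 64), torch.randn(2, 16, 1024))
print(out.shape)  # (2, 1024, 320)
```

The design intuition: a narrow bottleneck discards appearance detail, so keys built from it steer *where* attention lands, while the wide-bottleneck values carry *what* gets painted there, preserving object identity.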

