CMU Machine Learning
Carnegie Mellon's Machine Learning Department. Home to foundational work in statistical ML, deep learning, and robotics.
www.ml.cmu.edu
Recent Papers
The demo showcases Edison's multimodal capabilities across mathematics, computer science, and data science, demonstrating how its modular architecture enables rapid deployment and customization for different educational contexts while maintaining instructional effectiveness in answering student questions.
The paper introduces GameDevBench, a new benchmark for evaluating multimodal agents in game development, a domain requiring complex code manipulation and multimodal asset handling. The benchmark comprises 132 tasks derived from tutorials, demanding significantly more code and file changes than existing software development benchmarks. Experiments reveal that current agents struggle with game development tasks, particularly those involving 2D graphics, but that performance improves when image- and video-based feedback is incorporated.
Introduces GameDevBench, a novel benchmark designed to evaluate and advance multimodal agents in the challenging domain of game development.
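The finding that visual feedback helps suggests a simple edit-build-observe loop. The sketch below is only an illustration of that mechanism, not the benchmark's released harness; `agent` and `build_and_render` are assumed interfaces.

```python
def agent_loop(agent, build_and_render, task, max_steps=20):
    """Edit-build-observe loop: after each proposed change, the game is
    rebuilt and a rendered frame is fed back as a multimodal observation."""
    observations = [("text", task)]
    for _ in range(max_steps):
        edit = agent(observations)       # propose a code or asset change
        if edit is None:                 # agent signals it is finished
            break
        frame = build_and_render(edit)   # rebuild the game, capture a screenshot
        observations += [("edit", edit), ("image", frame)]
    return observations
```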
The paper introduces RefVFX, a framework for transferring complex temporal visual effects from a reference video to a target video or image in a feed-forward manner. To train the model, the authors created a large-scale dataset of video triplets using a novel automated pipeline that preserves input motion while applying repeatable effects, augmented with LoRA-derived and programmatically generated data. Experiments demonstrate that RefVFX generalizes to unseen effects, produces temporally coherent edits, and outperforms text-prompt baselines.
Introduces RefVFX and a corresponding large-scale dataset to enable tuning-free transfer of complex temporal visual effects across videos.
The paper introduces AfriEconQA, a new benchmark dataset for African economic analysis constructed from 236 World Bank reports, designed to evaluate numerical reasoning and temporal disambiguation capabilities of models. The dataset comprises 8,937 question-answer pairs, filtered from a larger synthetic pool to ensure high-quality evidence-answer alignment and temporal provenance. Benchmarking experiments using GPT-5 Mini, GPT-4o, and Qwen 32B in zero-shot and RAG configurations reveal a significant performance gap, highlighting the dataset's challenge for current LLMs and the need for domain-specific IR and RAG advancements.
Introduces AfriEconQA, a novel benchmark dataset specifically designed to evaluate the performance of information retrieval and question answering systems on African economic analysis using World Bank reports.
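The two evaluation settings described above (zero-shot vs. RAG) reduce to a small harness like the following sketch. `model`, `retriever`, and the record fields are assumed interfaces rather than the benchmark's actual API, and substring matching stands in for the paper's real scoring.

```python
def evaluate_qa(model, dataset, retriever=None, k=5):
    """Zero-shot when retriever is None; otherwise prepend top-k retrieved
    World Bank report passages before querying the model."""
    correct = 0
    for record in dataset:  # e.g. {"question": ..., "answer": ...}
        prompt = record["question"]
        if retriever is not None:
            passages = retriever(record["question"], k)
            prompt = "\n".join(passages) + "\n\nQuestion: " + prompt
        prediction = model(prompt)
        correct += int(record["answer"].strip().lower()
                       in prediction.strip().lower())
    return correct / len(dataset)
```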
The paper introduces PAN, a general world model capable of predicting future world states through high-quality video simulation conditioned on history and natural language actions. PAN uses a Generative Latent Prediction (GLP) architecture, combining an autoregressive latent dynamics backbone based on a large language model (LLM) for grounding simulation in text-based knowledge, with a video diffusion decoder for reconstructing detailed visual observations. Trained on large-scale video-action pairs, PAN demonstrates strong performance in action-conditioned world simulation, long-horizon forecasting, and simulative reasoning across diverse domains.
The paper pioneers a general world model, PAN, that unifies latent space reasoning with realizable world dynamics to achieve open-domain, action-conditioned video simulation with coherent, long-term consistency.
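The Generative Latent Prediction structure is easy to sketch at the module level: an autoregressive dynamics backbone rolls latents forward under an action, and a decoder renders each latent to pixels. The stand-ins below are plain MLPs with illustrative dimensions, not PAN's LLM backbone or video diffusion decoder.

```python
import torch

class GenerativeLatentPrediction(torch.nn.Module):
    """Structural sketch of GLP: predict the next latent world state from
    (latent, action), then decode that latent back to an observation."""
    def __init__(self, latent_dim=256, action_dim=128, frame_dim=3 * 64 * 64):
        super().__init__()
        self.dynamics = torch.nn.Sequential(      # stand-in for the LLM backbone
            torch.nn.Linear(latent_dim + action_dim, 512),
            torch.nn.GELU(),
            torch.nn.Linear(512, latent_dim),
        )
        self.decoder = torch.nn.Sequential(       # stand-in for the diffusion decoder
            torch.nn.Linear(latent_dim, 512),
            torch.nn.GELU(),
            torch.nn.Linear(512, frame_dim),
        )

    def forward(self, latent, action_emb, horizon=8):
        frames = []
        for _ in range(horizon):  # autoregressive long-horizon rollout
            latent = self.dynamics(torch.cat([latent, action_emb], dim=-1))
            frames.append(self.decoder(latent))
        return torch.stack(frames, dim=1), latent
```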
The paper introduces summarization-based context management for RL fine-tuning of LLMs in long-horizon multi-turn tool use, addressing the context-length bottleneck. The authors formulate a policy-gradient representation that allows standard LLM RL infrastructure to optimize both tool-use behaviors and summarization strategies end to end. The proposed algorithm, SUPO, demonstrates improved success rates and maintained or reduced context length on interactive function-calling and search tasks, even scaling beyond training-time summarization rounds at test time.
Introduces and validates a summarization-based context management approach that enables RL agents to scale beyond fixed context length limits in long-horizon multi-turn tasks.
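The rollout-side mechanism amounts to compressing the running context with the policy's own summary whenever it exceeds a budget. Below is a minimal sketch of that loop under assumed interfaces: `policy_step(context) -> (action, observation, finished)` and `summarize(context) -> str` are placeholders, and the token counter is a crude stand-in for a real tokenizer.

```python
def count_tokens(messages):
    # Crude stand-in for a real tokenizer: whitespace token count.
    return sum(len(str(m).split()) for m in messages)

def rollout_with_summarization(policy_step, summarize, task,
                               max_turns=32, max_ctx_tokens=2048):
    """Multi-turn tool-use rollout that replaces the accumulated context
    with a policy-generated summary once it grows past the budget, so an
    episode can run for more turns than the raw context window allows."""
    context = [f"TASK: {task}"]
    for _ in range(max_turns):
        action, observation, finished = policy_step(context)
        context += [f"ACTION: {action}", f"OBS: {observation}"]
        if finished:
            break
        if count_tokens(context) > max_ctx_tokens:
            # In SUPO both the tool-use policy and this summarization are
            # optimized end to end; here summarize is just a callback.
            context = [f"TASK: {task}", f"SUMMARY: {summarize(context)}"]
    return context
```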
The paper introduces FuncBenchGen, a synthetic benchmark for evaluating multi-step tool-use in language models by framing tool use as traversal over a function-dependency DAG. This framework allows for controlled task difficulty and avoids data contamination, addressing limitations in existing TaLM benchmarks. Experiments reveal performance degradation with increasing dependency depth and the difficulty posed by connected distractor functions, while also demonstrating that explicitly restating prior variable values significantly improves performance.
Introduces FuncBenchGen, a contamination-free and controllable framework for evaluating multi-step tool-use in language models via synthetic task generation.
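The core framing, tool use as traversal over a function-dependency DAG, can be illustrated in a few lines. The generator below is a hypothetical sketch of that idea, not the authors' code: node i depends only on earlier nodes (guaranteeing acyclicity), each function sums its parents' outputs, and dependency depth serves as the difficulty knob.

```python
import random

def make_dag(num_funcs: int, max_parents: int, seed: int = 0):
    """Random function-dependency DAG: node i may depend only on j < i."""
    rng = random.Random(seed)
    parents = {0: []}
    for i in range(1, num_funcs):
        k = rng.randint(0, min(max_parents, i))
        parents[i] = rng.sample(range(i), k)
    return parents

def evaluate(node, parents):
    """Ground-truth value: each function sums its parents' outputs."""
    if not parents[node]:
        return node + 1  # arbitrary leaf value
    return sum(evaluate(p, parents) for p in parents[node])

def depth(node, parents):
    """Dependency depth -- the controlled-difficulty axis."""
    if not parents[node]:
        return 0
    return 1 + max(depth(p, parents) for p in parents[node])

dag = make_dag(num_funcs=10, max_parents=3)
target = 9
print(f"task: compute f{target}; depth={depth(target, dag)}, "
      f"answer={evaluate(target, dag)}")
```

Because tasks are generated on the fly from a seed, no instance can appear in any pretraining corpus, which is what makes the benchmark contamination-free.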
This paper addresses the challenge of small object detection in satellite imagery across different geographical regions by using generative AI to create synthetic training data. A Stable Diffusion model is fine-tuned on both source (Selwyn, New Zealand) and target (Utah, USA) regions, leveraging cross- and self-attention mechanisms and CLIPSeg for image segmentation. The approach demonstrates a 20% improvement in detection accuracy on the target dataset compared to a baseline trained solely on source data, highlighting the effectiveness of generative data augmentation for cross-regional generalization.
Demonstrates a generative AI pipeline that synthesizes realistic satellite imagery for improved cross-regional small object detection, achieving significant accuracy gains.
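A minimal sketch of the generate-then-label step is below, using off-the-shelf diffusers and CLIPSeg checkpoints. The region-specific fine-tuning that is the paper's actual contribution is omitted, and the checkpoint names and prompt are placeholders.

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

# Placeholder checkpoints; the paper fine-tunes on Selwyn/Utah imagery.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
segmenter = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

# Synthesize a target-region-style tile, then derive a pseudo-label mask
# from a text query; thresholded masks can be turned into training boxes.
image = pipe("aerial satellite view of a suburban street with parked cars").images[0]
inputs = processor(text=["car"], images=[image], return_tensors="pt")
with torch.no_grad():
    logits = segmenter(**inputs).logits   # coarse per-pixel relevance map
mask = torch.sigmoid(logits) > 0.5        # pseudo-label for detector training
```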
The paper introduces FlexDataset, a composition-to-data (C2D) framework that generates high-fidelity, pixel-level annotated synthetic datasets for tasks like salient object detection, depth estimation, and segmentation. It addresses limitations of existing text-to-data methods in generating complex scenes by offering precise positional and categorical control through a composition-to-image (C2I) framework. The proposed Versatile Annotation Generation (VAG) scheme leverages tuned perception decoders to exploit rich latent representations, achieving a nearly fivefold reduction in annotation time and enabling unlimited generation of customized, multi-instance, multi-category (MIMC) annotated data.
Pioneers a composition-to-data (C2D) framework, FlexDataset, for generating high-fidelity annotated datasets with precise control over object composition and efficient annotation generation.
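The "tuned perception decoders" idea is that small heads read annotations straight out of the generator's latent features, so labels arrive with the image instead of from a separate annotation pass. The stand-in below is purely illustrative; shapes and head designs are guesses, not FlexDataset's decoders.

```python
import torch

class PerceptionHeads(torch.nn.Module):
    """Illustrative annotation heads decoding a segmentation map and a
    depth map directly from a generator's latent feature map."""
    def __init__(self, latent_channels=4, num_classes=10):
        super().__init__()
        self.seg_head = torch.nn.Conv2d(latent_channels, num_classes, 1)
        self.depth_head = torch.nn.Conv2d(latent_channels, 1, 1)

    def forward(self, latent):  # latent: (B, C, H, W)
        return self.seg_head(latent), self.depth_head(latent)

heads = PerceptionHeads()
z = torch.randn(1, 4, 64, 64)          # e.g. an SD-style image latent
seg_logits, depth = heads(z)
print(seg_logits.shape, depth.shape)   # (1, 10, 64, 64) (1, 1, 64, 64)
```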
The paper details the training process of LLM360 K2-65B, a 65 billion-parameter language model, emphasizing a 360-degree open-source approach to provide full transparency and access to training resources. K2 DIAMOND, the first model in the K2 project, achieves performance surpassing LLaMA-65B and rivaling LLaMA2-70B with fewer FLOPs and tokens. The work contributes a longitudinal analysis of K2 DIAMOND's capabilities throughout training and outlines future models in the TXT360 series.
Presents a fully transparent, end-to-end account of training a 65B parameter LLM, including implementation details and longitudinal performance analysis, to address the lack of transparency in training large-scale models.
The paper introduces a novel method for composing object-level visual prompts within text-to-image diffusion models to generate semantically coherent compositions across diverse scenes and styles. To preserve object identity while enabling compositional flexibility, they propose a KV-mixed cross-attention mechanism that uses keys from a small-bottleneck encoder for layout control and values from a larger-bottleneck encoder for detailed appearance. Object-level compositional guidance during inference further enhances identity preservation and layout accuracy, leading to improved diversity and quality in generated compositions.
Introduces a KV-mixed cross-attention mechanism that disentangles layout control from appearance details to enable compositional image generation with object-level visual prompts while preserving object identity.
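The KV-mixing itself is a small change to standard cross-attention: keys come from one feature stream (layout) and values from another (appearance). The sketch below shows only that split, with illustrative dimensions; the encoders, bottleneck sizes, and integration into the diffusion U-Net are not the paper's.

```python
import torch

class KVMixedCrossAttention(torch.nn.Module):
    """Cross-attention whose keys are projected from a small-bottleneck
    (layout) encoding of the visual prompt and whose values are projected
    from a larger-bottleneck (appearance) encoding."""
    def __init__(self, dim=320, key_dim=64, value_dim=1024):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim)
        self.to_k = torch.nn.Linear(key_dim, dim)    # layout stream -> keys
        self.to_v = torch.nn.Linear(value_dim, dim)  # appearance stream -> values

    def forward(self, hidden, layout_feats, appearance_feats):
        # hidden: (B, N, dim) denoiser tokens; the two prompt streams are
        # (B, M, key_dim) and (B, M, value_dim) token sequences.
        q, k, v = self.to_q(hidden), self.to_k(layout_feats), self.to_v(appearance_feats)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v

mix = KVMixedCrossAttention()
out = mix(torch.randn(2, 1024, 320), torch.randn(2, 16, 64), torch.randn(2, 16, 1024))
print(out.shape)  # (2, 1024, 320)
```

The design intuition: a narrow bottleneck discards appearance detail, so keys built from it steer *where* attention lands, while the wide-bottleneck values carry *what* gets painted there, preserving object identity.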

