Mamba's superior sequence modeling lets you generate longer, more realistic dance sequences than clunky Transformers ever could.
Panoramic depth perception and differentiable physics unlock surprisingly robust collision avoidance, even generalizing to unseen simulation environments.
Accelerate video generation by 45% without retraining, simply by pruning redundant latent patches and cleverly recovering attention scores.
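The pruning-plus-recovery idea can be sketched in a few lines. This is a minimal toy, not the paper's method: it keeps the most-attended 55% of latent patches (matching the ~45% reduction), then folds each pruned patch's attention mass onto its nearest kept neighbor before renormalizing. The nearest-index proximity rule and the `keep_ratio` parameter are illustrative assumptions.

```python
import numpy as np

def prune_and_recover(tokens, attn, keep_ratio=0.55):
    """Drop low-attention latent patches, then approximate the pruned
    patches' attention mass by redistributing it onto kept ones.
    `attn` is a row-stochastic (n, n) attention matrix."""
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    importance = attn.sum(axis=0)              # how much each patch is attended to
    keep = np.sort(np.argsort(importance)[-k:])
    pruned = np.setdiff1d(np.arange(n), keep)
    sub_attn = attn[np.ix_(keep, keep)].copy()
    for p in pruned:
        # hypothetical recovery rule: fold the pruned column onto the
        # nearest kept patch by index (a real method would use similarity)
        j = np.argmin(np.abs(keep - p))
        sub_attn[:, j] += attn[keep, p]
    sub_attn /= sub_attn.sum(axis=1, keepdims=True)  # restore row-stochasticity
    return tokens[keep], sub_attn
```

Because the recovery step reassigns rather than discards attention mass, the reduced matrix stays a valid (row-stochastic) attention map over the surviving patches.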
Achieve real-time traffic analytics across city-scale camera networks by offloading DNN inference to edge devices and using cloud-based GNNs for forecasting, all while dynamically adapting to changing conditions with federated learning.
Unsupervised discovery of object keypoints and dynamics directly from video unlocks state-of-the-art world models applicable to decision-making.
Visual artists are overwhelmingly resisting generative AI in the workplace, deploying active "refusal" strategies against pressure from clients and bosses.
By disentangling camera-space estimation from world-space refinement via dual diffusion models, DuoMo achieves state-of-the-art human motion reconstruction from noisy video, bypassing the limitations of parametric models.
Forget fine-tuning: prompting MLLMs with a dynamic interval-based decoding strategy lets them generate surprisingly human-like, pause-aware real-time game commentary.
Stain normalization and decoupled learning can dramatically improve the robustness of white blood cell classification, even in the face of significant staining variations and class imbalances.
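Stain normalization in its simplest (Reinhard-style) form just matches per-channel color statistics to a reference slide. The sketch below operates directly in RGB for brevity; real pipelines typically convert to LAB first, and the target statistics here are illustrative placeholders.

```python
import numpy as np

def normalize_stain(image, target_mean, target_std, eps=1e-6):
    """Reinhard-style stain normalization sketch: shift and scale each
    color channel so its mean/std match a reference slide's statistics.
    Usually done in LAB space; plain RGB here to keep the example short."""
    img = image.astype(np.float64)
    flat = img.reshape(-1, 3)
    mean = flat.mean(axis=0)
    std = flat.std(axis=0) + eps       # eps guards against flat channels
    out = (img - mean) / std * np.asarray(target_std) + np.asarray(target_mean)
    return np.clip(out, 0, 255).astype(np.uint8)
```

After normalization, cells stained under different lab protocols land in a shared color distribution, which is what lets the downstream classifier's decoupled features stay robust.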
Achieve 7x accuracy gains in real-world collaborative SLAM by using a robust, distributed optimization algorithm resilient to communication limits and noisy data.
Forget monolithic models: pMoE shows that ensembling diverse expert prompts within a single model framework yields surprisingly large gains in visual adaptation across a wide range of tasks.
Forget language and appearance: CAD models can now directly prompt accurate instance segmentation of industrial objects, even with diverse surface properties.
Unlabeled monocular videos can now be used to train state-of-the-art 3D/4D reconstruction systems, thanks to a factored flow prediction approach that disentangles geometry and pose learning.
Forget cloud GPUs – a new model brings unified multimodal understanding and generation to your iPhone, running 6x faster than alternatives.
Image-to-image editors silently weaken or ignore your edit instructions based on the subject's race, gender, and age, revealing surprising demographic biases.
Forget clunky skeletons: this new model lets you prompt your way to accurate 3D human meshes from single images, even in the wildest poses.
Stop treating generated images like real ones: GMAIL aligns them as separate modalities in a shared latent space, unlocking significant gains in vision-language tasks.
VLMs that ace RGB images completely fail at thermal imagery, revealing a critical gap in their ability to reason about temperature and physical properties.
Forget rigid game environments – PAN lets you simulate open-world scenarios with language-specified actions and long-term visual coherence, opening the door to more realistic AI training.
Synthetic data generated by fine-tuning Stable Diffusion on multi-region satellite imagery boosts small object detection accuracy by 20%, even when real labeled data is scarce.
Forget tedious manual annotation: FlexDataset crafts customized, high-fidelity annotated datasets 5x faster, using a composition-to-data approach.
Achieve semantically coherent image compositions by mixing layout-focused and appearance-focused visual representations in a diffusion model's cross-attention.
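The mixing idea can be illustrated with a toy blended cross-attention: queries attend separately to a layout-focused token stream and an appearance-focused one, and the two outputs are combined. The single scalar `alpha` blend is an assumption for illustration; the actual mixing scheme in the paper may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def blended_cross_attention(q, layout_kv, appearance_kv, alpha=0.5):
    """Toy sketch: attend to layout tokens and appearance tokens
    independently, then blend the two attention outputs with `alpha`
    (hypothetical mixing rule, not the paper's exact scheme)."""
    k_l, v_l = layout_kv
    k_a, v_a = appearance_kv
    d = q.shape[-1]
    attn_l = softmax(q @ k_l.T / np.sqrt(d))   # where things go (layout)
    attn_a = softmax(q @ k_a.T / np.sqrt(d))   # what things look like (appearance)
    return alpha * (attn_l @ v_l) + (1 - alpha) * (attn_a @ v_a)
```

Setting `alpha=1.0` recovers pure layout conditioning and `alpha=0.0` pure appearance conditioning, so the scalar interpolates between spatial structure and visual style.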