Graham Neubig

Decomposer achieves superior MIDI reconstruction fidelity and code readability compared to existing models, transforming how we approach symbolic music decompilation.

Yewon Kim, Apurva Gandhi, David Chung +3

Code Generation & Program Synthesis · Speech & Audio

Jun 30, 2026

CMU ML · 1w ago · also UMass

PPT-Eval: A Benchmark for Computer-Use Agents on PowerPoint Tasks

Even top-performing AI models struggle with PowerPoint tasks, achieving only a 45% success rate despite a robust evaluation framework that rewards nuanced performance.

Apurva Gandhi, Vishwas Suryanarayanan, Raja Hasnain Anwar +6

Eval Frameworks & Benchmarks · Multimodal Models · Tool Use & Agents

Jun 19, 2026

CMU ML · 3w ago · also NUS

Discretizing Reward Models

Discretizing reward models can significantly enhance policy performance by reducing oversensitivity without sacrificing discriminative ability.

Vijay Viswanathan, Shiqi Wang, Devamanyu Hazarika +4

RLHF & Preference Learning

Apr 16, 2026

CMU ML · Apr 16, 2026

Asking What Matters: Reward-Driven Clarification for Software Engineering Tasks

Stop wasting tokens on irrelevant questions: reward models that ask about task relevance and user answerability can slash question count by 41% while matching GPT-5's issue resolution rate.

Sanidhya Vijayvargiya, Vijay Viswanathan +2

Code Generation & Program Synthesis · Natural Language Processing · Tool Use & Agents

Mar 19, 2026

Meta AI · Mar 19, 2026 · also CMU ML, CAS, UESTC, UNC +1

Reasoning over mathematical objects: on-policy reward modeling and test time aggregation

On-policy reward modeling with LLM judges not only unlocks significant performance gains on complex mathematical reasoning tasks, but also generalizes to improve performance on simpler numerical and multiple-choice benchmarks.

Pranjal Aggarwal, Marjan Ghazvininejad, Seungone Kim +20

Eval Frameworks & Benchmarks · Reasoning & Chain-of-Thought · RLHF & Preference Learning

Mar 18, 2026

CMU ML · Mar 18, 2026 · also INSA Rennes

CodeScout: An Effective Recipe for Reinforcement Learning of Code Search Agents

Forget specialized tools: a standard Unix terminal and clever RL are all you need to beat much larger LLMs at code search.

Lintang Sutawika, Aditya Bharat Soni, R. BharathSriraamR +11

Code Generation & Program Synthesis · Recommendation & Information Retrieval · Tool Use & Agents

Feb 19, 2026

CMU ML · Feb 19, 2026 · also UW, Duke

Modeling Distinct Human Interaction in Web Agents

Stop guessing when humans want to take over: modeling user intervention styles in web agents boosts their usefulness by 26.5%.

Faria Huq, Z. Wang +12

Data Curation & Synthetic Data · RLHF & Preference Learning · Tool Use & Agents