May 1, 2026 · arXiv:2605.00754

Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

Indraneil Paul, Goran Glavaš, Iryna Gurevych

AI Summary

The authors introduce Themis-CodeRewardBench, a benchmark for evaluating code reward models (RMs) across five preference dimensions and eight programming languages, and Themis-CodePreference, a large dataset of code preferences. They then train Themis-RM, a suite of multilingual code RMs (600M to 32B parameters), on this dataset, demonstrating strong cross-lingual transfer and the importance of multi-criteria training. The results show that current RMs have limited proficiency beyond scoring for functional correctness, highlighting the need for more comprehensive training data and evaluation.

Key Contribution

Current code reward models are myopic, mostly rewarding functional correctness. Themis-RM instead learns to score code across multiple criteria and programming languages, enabling more nuanced reward signals for code generation.

Abstract

Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code. In this work, we examine the training and evaluation of multilingual, multi-criteria code RMs. To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math, and general-purpose RMs. Observing the limited proficiency of current RMs beyond scoring for functional correctness, we develop Themis-CodePreference, the largest open-source collection of code preferences to date (more than 350k preference pairs), and use it to train Themis-RM, a suite of multilingual code reward models for flexible multi-criteria scoring, ranging in size from 600M to 32B parameters. Our experiments and ablations demonstrate positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modeling.
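As a rough illustration of how a benchmark built from preference pairs might be scored, the sketch below computes pairwise accuracy (how often a reward model scores the preferred response above the rejected one), broken down by criterion and programming language. The abstract does not describe the benchmark's actual metric or data schema, so the field names, the criterion labels, and the `toy_score` placeholder are all assumptions made for this example.

```python
from collections import defaultdict


def pairwise_accuracy(pairs, score):
    """Fraction of pairs where `chosen` outscores `rejected`,
    grouped by (criterion, language)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for p in pairs:
        key = (p["criterion"], p["language"])
        totals[key] += 1
        chosen = score(p["prompt"], p["chosen"], p["criterion"])
        rejected = score(p["prompt"], p["rejected"], p["criterion"])
        if chosen > rejected:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


def toy_score(prompt, response, criterion):
    # Hypothetical stand-in for a reward model: shorter responses score higher.
    return -len(response)


# Hypothetical preference pairs; the schema and criterion names are assumed.
pairs = [
    {"prompt": "sum a list", "chosen": "sum(xs)",
     "rejected": "t=0\nfor x in xs: t+=x",
     "criterion": "readability", "language": "python"},
    {"prompt": "sum a list", "chosen": "xs.iter().sum()",
     "rejected": "s",
     "criterion": "correctness", "language": "rust"},
]

results = pairwise_accuracy(pairs, toy_score)
```

Reporting accuracy per (criterion, language) cell rather than as a single average makes it visible when a model does well on one dimension, such as functional correctness, and poorly on the others.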

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks · RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
