Microsoft Research | Mar 12, 2026 | arXiv:2603.12248

Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models

Samy Jelassi, Mujin Kwun, Rosie Zhao, Yuanzhi Li, Nicolo Fusi, Yilun Du, Sham M. Kakade, Carles Domingo-Enrich

AI Summary

This paper introduces energy-based fine-tuning (EBFT), a method for optimizing a feature-matching objective for language-model fine-tuning that targets sequence-level statistics of the completion distribution. EBFT uses strided block-parallel sampling to generate multiple rollouts from nested prefixes concurrently, batches feature extraction over these rollouts, and uses the resulting embeddings to perform an on-policy policy-gradient update. Experiments across Q&A coding, unstructured coding, and translation show that EBFT matches RLVR and outperforms SFT on downstream accuracy while achieving a lower validation cross-entropy than both methods.

Key Contribution

Ditch the task-specific verifier: energy-based fine-tuning (EBFT) lets you directly optimize sequence-level behavior in LMs, outperforming SFT and matching RLVR on downstream accuracy.

Abstract

Cross-entropy (CE) training provides dense and scalable supervision for language models, but it optimizes next-token prediction under teacher forcing rather than sequence-level behavior under model rollouts. We introduce a feature-matching objective for language-model fine-tuning that targets sequence-level statistics of the completion distribution, providing dense semantic feedback without requiring a task-specific verifier or preference model. To optimize this objective efficiently, we propose energy-based fine-tuning (EBFT), which uses strided block-parallel sampling to generate multiple rollouts from nested prefixes concurrently, batches feature extraction over these rollouts, and uses the resulting embeddings to perform an on-policy policy-gradient update. We present a theoretical perspective connecting EBFT to KL-regularized feature-matching and energy-based modeling. Empirically, across Q&A coding, unstructured coding, and translation, EBFT matches RLVR and outperforms SFT on downstream accuracy while achieving a lower validation cross-entropy than both methods.
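To make the idea of a feature-matching policy-gradient signal concrete, here is a minimal sketch in plain Python. It is not the paper's estimator: it assumes a generic squared-error moment-matching loss, L = ||mean_i phi(y_i) - phi(y*)||^2, and a REINFORCE-style per-rollout reward derived from it with a leave-one-out mean. The function names (`feature_matching_rewards`, `centered`) are hypothetical, and the short lists of floats stand in for embeddings that a feature extractor would produce for each rollout and for the reference completion.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors: List[Vector]) -> Vector:
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def feature_matching_rewards(rollout_feats: List[Vector],
                             target_feat: Vector) -> List[float]:
    """Per-rollout rewards for reducing ||mean phi(y) - phi(y*)||^2.

    Assumption (not from the paper): rollout i is rewarded with
    -2 * <gap_without_i, phi(y_i)>, where gap_without_i is the mean
    feature of the *other* rollouts minus the target feature.
    """
    if len(rollout_feats) < 2:
        raise ValueError("need at least two rollouts for a leave-one-out mean")
    rewards = []
    for i, phi_i in enumerate(rollout_feats):
        others = rollout_feats[:i] + rollout_feats[i + 1:]
        gap = [m - t for m, t in zip(mean_vector(others), target_feat)]
        rewards.append(-2.0 * dot(gap, phi_i))
    return rewards


def centered(rewards: List[float]) -> List[float]:
    """Subtract the mean reward as a simple variance-reducing baseline."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Toy example: three rollouts with 2-d features, one reference completion.
rollout_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target_feat = [1.0, 1.0]
rewards = feature_matching_rewards(rollout_feats, target_feat)
advantages = centered(rewards)
```

In a training loop, each advantage would multiply the log-probability gradient of its rollout. In this toy input the third rollout, whose features coincide with the target, receives the largest advantage.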

Natural Language Processing | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 43

Year: 2026

Venue: N/A
