Apr 9, 2026 | arXiv:2604.08477

SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel

AI Summary

SUPERNOVA is a data curation framework that adapts instruction-tuning datasets for Reinforcement Learning with Verifiable Rewards (RLVR) to improve the general reasoning of LLMs. The framework selects and mixes source tasks from instruction-tuning datasets and applies synthetic interventions to improve data quality for RLVR. Experiments show that SUPERNOVA-trained models outperform strong baselines such as Qwen3.5 on reasoning benchmarks, with relative improvements of up to 52.8% on BBEH.
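To make the RLVR setting concrete, the sketch below shows one common way a ground-truth label from an instruction-tuning example could be turned into a binary verifiable reward. It is an illustration only, not the SUPERNOVA implementation: the `Answer:` output convention, the normalization rules, and the function names are all assumptions introduced here.

```python
import re
from typing import Optional

# Hypothetical convention: the model ends its response with "Answer: <final answer>".
ANSWER_PATTERN = re.compile(r"answer\s*:\s*(.+)", re.IGNORECASE)


def normalize(text: str) -> str:
    """Lowercase, drop a trailing period, and collapse whitespace."""
    return " ".join(text.strip().rstrip(".").lower().split())


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def verifiable_reward(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 on a normalized exact match with the label, else 0.0."""
    predicted = extract_final_answer(response)
    if predicted is None:
        return 0.0  # no parseable final answer
    return 1.0 if normalize(predicted) == normalize(ground_truth) else 0.0


if __name__ == "__main__":
    response = "The meeting moved twice, so it lands one day later.\nAnswer: Tuesday."
    print(verifiable_reward(response, "tuesday"))
```

Exact match after normalization is only one possible verifier; tasks with free-form answers would need a different check.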

Key Contribution

Targeted data curation for RLVR, rather than scale alone, can yield large gains in LLM general reasoning: up to 52.8% relative improvement on BBEH.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high-quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction-tuning datasets containing expert-annotated ground truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non-trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU-Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. The code and data are available at https://github.com/asuvarna31/supernova.
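The contrast the abstract draws between selecting source tasks by overall average performance and selecting them per target task can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the score table, the task names, and both function names are hypothetical, and the numbers are invented purely to show how the two strategies can disagree.

```python
from statistics import mean

# Hypothetical scores: scores[source_task][target_task] is the downstream
# accuracy on target_task after RL training on source_task alone.
scores = {
    "source_a": {"causal": 0.40, "temporal": 0.10},
    "source_b": {"causal": 0.30, "temporal": 0.30},
    "source_c": {"causal": 0.05, "temporal": 0.45},
}


def select_by_average(scores, k):
    """Pick the k source tasks with the highest mean score over all targets."""
    ranked = sorted(scores, key=lambda s: mean(scores[s].values()), reverse=True)
    return ranked[:k]


def select_per_target(scores, k):
    """For each target task, pick the k source tasks that score best on it."""
    targets = sorted({t for per_source in scores.values() for t in per_source})
    return {
        t: sorted(scores, key=lambda s: scores[s][t], reverse=True)[:k]
        for t in targets
    }


if __name__ == "__main__":
    print(select_by_average(scores, 1))
    print(select_per_target(scores, 1))
```

With these made-up numbers, average-based selection picks `source_b`, which is not the best source for either target, while per-target selection picks `source_a` for `causal` and `source_c` for `temporal`.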

Eval Frameworks & Benchmarks | Reasoning & Chain-of-Thought | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 44

Year: 2026

Venue: N/A
