Meta AI | Feb 26, 2026 | arXiv:2602.22617

Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA

Hai Huang, Yann LeCun, Randall Balestriero

AI Summary

The paper introduces Semantic Tube Prediction (STP), a JEPA-style regularizer that constrains LLM hidden-state trajectories to a tubular neighborhood of geodesics on a semantic manifold, based on the proposed Geodesic Hypothesis. This approach aims to improve data efficiency by enhancing the signal-to-noise ratio and preventing trajectory collisions during inference. Experiments on the NL-RX-SYNTH dataset demonstrate that LLMs trained with STP achieve comparable accuracy to baseline models with 16x less training data, effectively challenging established scaling laws.

Key Contribution

LLMs can achieve the same accuracy with 16x less data by constraining their hidden-state trajectories to follow geodesics on a semantic manifold.

Abstract

Large Language Models (LLMs) obey consistent scaling laws -- empirical power-law fits that predict how loss decreases with compute, data, and parameters. While predictive, these laws are descriptive rather than prescriptive: they characterize typical training, not optimal training. Surprisingly few works have successfully challenged the data-efficiency bounds implied by these laws -- which is our primary focus. To that end, we introduce the Geodesic Hypothesis, positing that token sequences trace geodesics on a smooth semantic manifold and are therefore locally linear. Building on this principle, we propose a novel Semantic Tube Prediction (STP) task, a JEPA-style regularizer that confines hidden-state trajectories to a tubular neighborhood of the geodesic. STP generalizes JEPA to language without requiring explicit multi-view augmentations. We show this constraint improves signal-to-noise ratio, and consequently preserves diversity by preventing trajectory collisions during inference. Empirically, STP allows LLMs to match baseline accuracy with 16$\times$ less training data on the NL-RX-SYNTH dataset, directly violating the data term of Chinchilla-style scaling laws and demonstrating that principled geometric priors can surpass brute-force scaling. Code is available at https://github.com/galilai-group/llm-jepa#stp.
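To make the tube idea in the abstract concrete, the sketch below measures how far a sequence of hidden states strays from a straight line. It is an illustration only, not the STP loss from the paper or the linked repository. The function name tube_deviation, the radius parameter, and the choice of the straight line through the first and last hidden states as a stand-in for a locally linear geodesic are all assumptions made for this example.

```python
import numpy as np


def tube_deviation(hidden, radius=0.0):
    """Mean squared excess distance of hidden states from a straight tube.

    hidden: array-like of shape (T, d), one hidden state per token position.
    radius: hypothetical tube radius; states closer than this to the line
            through the first and last state incur no penalty.

    Illustrative assumption: the geodesic is approximated by the straight
    line through the two endpoint states. This is not the paper's loss.
    """
    hidden = np.asarray(hidden, dtype=float)
    start, end = hidden[0], hidden[-1]
    chord = end - start
    length_sq = float(chord @ chord)

    if length_sq == 0.0:
        # Degenerate case: endpoints coincide, so measure distance to that point.
        dist = np.linalg.norm(hidden - start, axis=1)
    else:
        # Project each state onto the line through the endpoints.
        t = (hidden - start) @ chord / length_sq
        closest = start + t[:, None] * chord
        dist = np.linalg.norm(hidden - closest, axis=1)

    # Only the part of the distance that leaves the tube is penalized.
    excess = np.maximum(dist - radius, 0.0)
    return float(np.mean(excess ** 2))


# A perfectly straight trajectory and one with a single off-line state.
straight = np.linspace([0.0, 0.0], [1.0, 1.0], 5)
bent = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
```

Under these assumptions, a straight trajectory scores zero and a bent one scores in proportion to how far its states leave the tube. In a training setup, a term of this kind would presumably be added to the usual next-token loss with a weighting coefficient, but the exact formulation should be taken from the paper and its code.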

Data Curation & Synthetic Data | Scaling Laws & Emergent Abilities | Training Efficiency & Optimization

Citation Metrics

Citations: 0
Influential citations: 0
References: 54
Year: 2026
Venue: N/A
