
Meta AI (FAIR)
Meta's Fundamental AI Research lab. Known for LLaMA, PyTorch, and open-source contributions to AI research.
ai.meta.com
Recent Papers
The paper investigates the data requirements for reasoning in sub-billion-parameter language models, challenging the assumption that massive pre-training corpora (>10T tokens) are necessary. The authors demonstrate that by carefully curating and resampling open-source datasets down to ~2T tokens, strong reasoning abilities can emerge from significantly less data. The resulting MobileLLM-R1 models achieve state-of-the-art performance among open-source sub-billion-parameter models, even surpassing larger models trained on far more data.
Demonstrates that strong reasoning capabilities can emerge in sub-billion-parameter language models trained on significantly less data than previously believed, by carefully curating and resampling open-source datasets.
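As a rough illustration of the curation-and-resampling idea, the sketch below reweights several corpora into a fixed ~2T-token budget. The source names, quality weights, and the 4-epoch repeat cap are assumptions invented for this example, not figures from the paper.

# Hypothetical illustration of quality-weighted dataset resampling.
# All source names, weights, and the token budget are assumptions
# made for this sketch, not values from the MobileLLM-R1 paper.

# (dataset name, available tokens in billions, curation weight)
SOURCES = [
    ("web_crawl",   1_500, 0.5),   # down-weight noisy web text
    ("code",          300, 2.0),   # up-weight reasoning-dense sources
    ("math",          100, 3.0),
    ("encyclopedic",  200, 1.5),
]

TOKEN_BUDGET = 2_000  # target corpus size in billions of tokens (~2T)

def resample_mixture(sources, budget):
    """Allocate the token budget across sources in proportion to
    weight * available tokens, capping repeats at 4 epochs per source."""
    scores = [w * n for _, n, w in sources]
    total = sum(scores)
    mixture = {}
    for (name, n_tokens, _), score in zip(sources, scores):
        alloc = budget * score / total
        # Cap oversampling so no source is repeated more than 4x.
        mixture[name] = min(alloc, 4 * n_tokens)
    return mixture

if __name__ == "__main__":
    for name, tokens in resample_mixture(SOURCES, TOKEN_BUDGET).items():
        print(f"{name}: {tokens:.0f}B tokens")

A real pipeline would redistribute any budget freed up by the repeat cap across the remaining sources; the sketch only shows the basic weighted allocation.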
The paper introduces PAST, a new end-to-end framework for speech tokenization that jointly models phonetic information and signal reconstruction without relying on external pre-trained models. PAST leverages supervised phonetic data through auxiliary tasks to integrate phonetic domain knowledge directly into the tokenization process. The framework, which includes a streamable variant, outperforms existing baseline tokenizers in phonetic representation, in speech reconstruction, and as a speech representation for speech language models.
Introduces an end-to-end trainable speech tokenizer, PAST, that integrates phonetic information directly via supervised learning, eliminating the need for pre-trained self-supervised models.
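To make the joint objective concrete, here is a minimal PyTorch-style sketch of a quantized encoder-decoder trained on a reconstruction loss plus an auxiliary phone-classification loss over the discrete codes. The architecture, layer sizes, and the 0.5 loss weighting are illustrative assumptions, not the actual PAST model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSpeechTokenizer(nn.Module):
    """Toy tokenizer trained jointly on waveform reconstruction and an
    auxiliary phonetic task. Sizes and layers are illustrative
    assumptions, not the PAST architecture."""

    def __init__(self, codebook_size=512, dim=64, n_phones=40):
        super().__init__()
        self.encoder = nn.Conv1d(1, dim, kernel_size=8, stride=4, padding=2)
        self.codebook = nn.Embedding(codebook_size, dim)
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=8, stride=4, padding=2)
        self.phone_head = nn.Linear(dim, n_phones)  # auxiliary phonetic classifier

    def quantize(self, z):
        b, d, t = z.shape
        flat = z.transpose(1, 2).reshape(-1, d)             # (B*T', dim)
        codes = torch.cdist(flat, self.codebook.weight).argmin(-1)
        zq = self.codebook(codes).view(b, t, d).transpose(1, 2)
        # Straight-through estimator: gradients flow back to the encoder.
        return z + (zq - z).detach(), codes.view(b, t)

    def forward(self, wav, phone_targets):
        z = self.encoder(wav)                               # (B, dim, T')
        zq, codes = self.quantize(z)
        recon = self.decoder(zq)                            # back to waveform
        phone_logits = self.phone_head(zq.transpose(1, 2))  # (B, T', n_phones)

        recon_loss = F.l1_loss(recon, wav[..., :recon.shape[-1]])
        phone_loss = F.cross_entropy(
            phone_logits.reshape(-1, phone_logits.shape[-1]),
            phone_targets.reshape(-1),
        )
        # NB: a real VQ tokenizer also needs codebook/commitment losses,
        # omitted here for brevity.
        return recon_loss + 0.5 * phone_loss, codes

The point the summary highlights is the joint objective: supervised phonetic targets shape the discrete codes directly during tokenizer training, rather than being distilled from an external pre-trained self-supervised model.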

