TildeOpen LLM, a 30B-parameter model, was trained on 34 European languages to address the underperformance of LLMs in low-resource languages. The authors combined dataset upsampling with a curriculum learning schedule that alternates between uniform and natural language distributions to mitigate data imbalance. Results show that TildeOpen outperforms existing open-weight models, especially in Baltic, Finno-Ugric, and Slavic languages, with up to a tenfold reduction in linguistic errors in human evaluation.
A new 30B open-weight LLM trained on 34 European languages outperforms existing open-weight models on low-resource languages while using significantly less training compute, suggesting that careful data curation and balanced training can improve multilingual quality without added scale.
Large language models often underperform in many European languages due to the dominance of English and a few high-resource languages in training data. This paper presents TildeOpen LLM, a 30-billion-parameter open-weight foundational model trained for 34 European languages to promote linguistic equity and improve performance for low-resource languages. To address the data imbalance, we combine dataset upsampling with a curriculum-based training schedule that alternates between uniform and natural language distributions. The resulting model performs favorably compared to other multilingual LLMs despite being trained with significantly fewer computing resources. Evaluation across multiple multilingual benchmarks shows that TildeOpen surpasses existing open-weight models in text generation and comprehension, particularly for Baltic, Finno-Ugric, and Slavic languages. Human evaluations confirm an up to tenfold reduction in linguistic errors relative to leading baselines. The model and associated resources are fully open-weight and publicly available at huggingface.co/TildeAI/TildeOpen-30b. These outcomes demonstrate that careful data curation and balanced training strategies can substantially enhance multilingual model quality without increasing model size or training volume.
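The combination of upsampling and an alternating curriculum described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-language token counts, the phase lengths, and the reading of "uniform" as an equal share per language are all assumptions made for illustration.

```python
import random

def natural_distribution(token_counts):
    """Sampling probabilities proportional to each language's raw data size."""
    total = sum(token_counts.values())
    return {lang: n / total for lang, n in token_counts.items()}

def uniform_distribution(token_counts):
    """Equal sampling probability for every language (assumed meaning of 'uniform')."""
    k = len(token_counts)
    return {lang: 1.0 / k for lang in token_counts}

def upsampling_factors(token_counts, target):
    """Factor by which each language is over- or under-sampled to reach the target mix.

    A factor above 1 means the language's data is repeated (upsampled).
    """
    natural = natural_distribution(token_counts)
    return {lang: target[lang] / natural[lang] for lang in token_counts}

def curriculum(token_counts, phases):
    """Yield one sampling distribution per training step.

    `phases` is a hypothetical schedule: a list of (kind, n_steps) pairs,
    where kind is "uniform" or "natural".
    """
    makers = {"uniform": uniform_distribution, "natural": natural_distribution}
    for kind, n_steps in phases:
        if kind not in makers:
            raise ValueError(f"unknown phase kind: {kind!r}")
        dist = makers[kind](token_counts)
        for _ in range(n_steps):
            yield dist

def sample_batch_languages(dist, batch_size, rng):
    """Draw the language of each example in a batch according to `dist`."""
    langs = list(dist)
    weights = [dist[lang] for lang in langs]
    return rng.choices(langs, weights=weights, k=batch_size)

# Hypothetical token counts (arbitrary units), not figures from the paper.
counts = {"en": 800, "de": 150, "lv": 30, "et": 20}
rng = random.Random(0)
schedule = [("uniform", 2), ("natural", 2), ("uniform", 1)]
for step, dist in enumerate(curriculum(counts, schedule)):
    print(step, sample_batch_languages(dist, 8, rng))
```

With these made-up counts, a uniform phase implies repeating the smallest language's data many times over, while a natural phase leaves the mix dominated by the largest language. Alternating the two is one way to expose the model to low-resource languages without training exclusively on heavily repeated data.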