B-Think Polygl0t/Tucano2-qwen-3.

Bonn-Aachen International Center for Information

Helmholtz

Lamarr Institute for Machine Learning

Mar 3, 2026

arXiv:2603.03543

Tucano 2 Cool: Better Open Source LLMs for Portuguese

Nicholas Kluge Corrêa, Aniket Sen, Shiza Fatimah, Sophia Falk, Lennard Landgraf, Julia Kastner, Lucie Flek

AI Summary

The authors introduce Tucano 2, a suite of open-source LLMs (0.5-3.7B parameters) specifically designed for Portuguese, leveraging an expanded and improved dataset, GigaVerbo-v2, along with a new synthetic dataset, GigaVerbo-v2 Synth, to address data gaps. They also create two post-training datasets, GigaVerbo-v2 SFT and GigaVerbo-v2 Preferences, to enable training in areas like RAG, coding, and tool use. Through ablation studies, they optimize pretraining and continual pretraining recipes, achieving state-of-the-art results on Portuguese benchmarks and releasing all artifacts for reproducibility.

Key Contribution

Open-source Portuguese LLMs just got a major upgrade: Tucano 2 models outperform existing options thanks to a new recipe of curated and synthetic data, plus targeted post-training for RAG and tool use.

Abstract

We present Tucano 2, a fully open suite of large language models (LLMs) with 0.5-3.7 billion parameters, designed to address certain gaps in open-source development for Portuguese LLMs. Building on our previous work, we extend our dataset, GigaVerbo-v2, in both quality and scale, and introduce a new synthetic dataset, GigaVerbo-v2 Synth, aimed at filling gaps in GigaVerbo-v2. We also introduce two post-training datasets, GigaVerbo-v2 SFT and GigaVerbo-v2 Preferences, that allow Portuguese LLMs to be trained in domains such as retrieval-augmented generation, coding, tool use, and chain-of-thought reasoning, among others. Through extensive ablation studies, we design both pretraining and continual pretraining recipes for the Tucano 2 suite (Base, Instruct, and Think), which achieve state-of-the-art performance on several Portuguese-language modeling benchmarks. We also extend and refine the evaluation harness introduced in our earlier work, yielding a comprehensive evaluation suite that provides strong signals across different pretraining, continual pretraining, and post-training regimes. All artifacts associated with Tucano 2 are openly released, including training recipes, logs, and source code, ensuring that our work is reproducible, accessible, and extendable by the broader Portuguese NLP community.

Data Curation & Synthetic Data, Natural Language Processing, Open-Source Models & Weights

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
