May 5, 2026

arXiv:2605.04005

Domain-Adaptive Dense Retrieval for Brazilian Legal Search

Jayr Pereira, Roberto A. Lotufo, L. Bonifacio

AI Summary

This paper investigates domain adaptation strategies for dense retrieval in the heterogeneous Brazilian legal domain, which spans case law, legislation, and question-based search. The authors compare a base Qwen3-Embedding-4B model with two fine-tuned versions: one trained on legal data only, and one trained on a mix of legal and SQuAD-pt data. Results on six datasets show that legal-only training performs best on the most specialized legal tasks, while mixed training achieves a better overall balance, improving question-based retrieval in particular and raising average NDCG@10, MRR@10, and MAP@10.

Key Contribution

Fine-tuning a dense retriever on a mix of legal and general question-answering data yields more robust performance across diverse legal search tasks, outperforming a model trained solely on legal data on average NDCG@10, MRR@10, and MAP@10, while the legal-only model remains best on the most specialized legal tasks.

Abstract

Brazilian legal retrieval is heterogeneous, covering case law, legislation, and question-based search. This makes training dense retrievers a trade-off between stronger domain specialization and broader robustness across types of search. In this paper, we explore this trade-off using three training setups based on Qwen3-Embedding-4B: a base model with no fine-tuning, a version trained only on legal data, and a mixed setup that combines legal data with the supervised SQuAD-pt dataset. We evaluate these models on five legal datasets from the JUÁ leaderboard, along with the Quati dataset as an extra Portuguese retrieval benchmark to test out-of-domain generalization. The legal-only model performs best on the most specialized legal tasks. The mixed setup keeps strong performance on legal data while offering a better overall balance, improving average NDCG@10 from 0.414 to 0.447, MRR@10 from 0.586 to 0.595, and MAP@10 from 0.270 to 0.308 across all six datasets. The biggest improvement appears on Quati, where the mixed model clearly outperforms the legal-only one. Overall, the results show that legal-only and mixed training lead to different strengths: the first is better for specialization, while the second is more robust across different types of search, especially question-based ones. Both adapted models are available on Hugging Face.
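The three metrics reported in the abstract can be illustrated with a minimal Python sketch for a single query. This is not the authors' evaluation code: the function names, the document IDs, and the conventions chosen here (linear gain for NDCG, average precision normalized by the total number of relevant documents) are assumptions made for illustration, and the paper's tooling may differ in these details. Benchmark scores are then the mean of each per-query value over all queries.

```python
import math
from typing import Dict, Sequence


def ndcg_at_k(ranked: Sequence[str], rels: Dict[str, int], k: int = 10) -> float:
    """NDCG@k with linear gain (assumed convention): DCG of the ranking / ideal DCG."""
    dcg = sum(
        rels.get(doc, 0) / math.log2(pos + 2)  # pos is 0-based, so rank 1 -> log2(2)
        for pos, doc in enumerate(ranked[:k])
    )
    ideal_gains = sorted(rels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0


def mrr_at_k(ranked: Sequence[str], rels: Dict[str, int], k: int = 10) -> float:
    """Reciprocal rank of the first relevant document within the top k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if rels.get(doc, 0) > 0:
            return 1.0 / rank
    return 0.0


def ap_at_k(ranked: Sequence[str], rels: Dict[str, int], k: int = 10) -> float:
    """Average precision truncated at k; MAP@k is its mean over queries.

    Assumption: normalized by the total number of relevant documents.
    """
    n_relevant = sum(1 for g in rels.values() if g > 0)
    if n_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if rels.get(doc, 0) > 0:
            hits += 1
            total += hits / rank  # precision at this relevant hit
    return total / n_relevant


# Hypothetical query: two relevant documents, only one of them retrieved (at rank 2).
ranking = ["d3", "d1", "d7"]
judgments = {"d1": 1, "d2": 1}
scores = {
    "ndcg@10": ndcg_at_k(ranking, judgments),
    "mrr@10": mrr_at_k(ranking, judgments),
    "ap@10": ap_at_k(ranking, judgments),
}
```

All three metrics reward placing relevant documents early in the ranking, but MRR@10 only looks at the first relevant hit, whereas NDCG@10 and MAP@10 account for every relevant document in the top 10.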

Data Curation & Synthetic Data; Natural Language Processing; Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 26

Year: 2026

Venue: N/A
