Valeo AI · Jul 18, 2025 · arXiv:2507.14137

Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning

Shashanka Venkataramanan, Valentinos Pariza, Mohammadreza Salehi, Lukas Knobel, Spyros Gidaris, Elias Ramzi, Andrei Bursuc, Yuki M. Asano

AI Summary

The paper introduces Franca, a fully open-source vision foundation model (data, code, and weights) trained with a Web-SSL-inspired pipeline on ImageNet-21K and a subset of ReLAION-2B. It matches, and in many cases surpasses, proprietary models such as DINOv2, CLIP, and SigLIPv2. To improve SSL clustering, the authors propose a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations that progressively refines features into finer clusters without increasing model size. They also introduce a positional disentanglement strategy that removes positional biases from dense representations, yielding consistent gains on several downstream benchmarks.
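As a rough illustration of what removing positional information from dense features can mean, the NumPy sketch below fits a linear probe from patch features to patch coordinates and projects out the directions that probe uses. This is a generic linear-projection technique chosen for illustration, not the procedure from the paper; the function name, the 14x14 patch grid, and the synthetic features are all assumptions. A single pass removes only the subspace found by this one probe, so some positional signal may remain.

```python
import numpy as np


def remove_positional_subspace(patches, coords):
    """Project out the directions a linear probe uses to predict position.

    patches: (n, d) patch features; coords: (n, 2) patch (row, col) positions.
    Returns the cleaned features and the orthonormal basis that was removed.
    Hypothetical helper for illustration only.
    """
    x = patches - patches.mean(axis=0)
    y = coords - coords.mean(axis=0)
    # Least-squares linear probe: features -> (row, col), shape (d, 2).
    w, *_ = np.linalg.lstsq(x, y, rcond=None)
    # Orthonormal basis spanning the probe directions, shape (d, 2).
    basis, _ = np.linalg.qr(w)
    # Subtract each feature's component inside that subspace.
    cleaned = patches - (patches @ basis) @ basis.T
    return cleaned, basis


# Synthetic stand-in for dense features on a 14x14 patch grid (assumption).
rng = np.random.default_rng(0)
rows, cols = np.meshgrid(np.arange(14), np.arange(14), indexing="ij")
coords = np.stack([rows, cols], axis=-1).reshape(-1, 2).astype(float)
patches = rng.normal(size=(196, 32))
patches[:, :2] += coords  # leak position into the first two channels

cleaned, basis = remove_positional_subspace(patches, coords)
```

After the call, `cleaned` has no component along `basis`, so the original probe can no longer read position from it.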

Key Contribution

Franca is the first fully open-source (data, code, weights) vision foundation model to match, and in many cases surpass, state-of-the-art proprietary models such as DINOv2 and CLIP, using a transparent training pipeline built on publicly available data.

Abstract

We present Franca (pronounced Fran-ka): free one; the first fully open-source (data, code, weights) vision foundation model that matches and in many cases surpasses the performance of state-of-the-art proprietary models, e.g., DINOv2, CLIP, SigLIPv2, etc. Our approach is grounded in a transparent training pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and a subset of ReLAION-2B. Beyond model release, we tackle critical limitations in SSL clustering methods. While modern models rely on assigning image features to large codebooks via clustering algorithms like Sinkhorn-Knopp, they fail to account for the inherent ambiguity in clustering semantics. To address this, we introduce a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations. This design progressively refines features into increasingly fine-grained clusters without increasing the model size, enabling both performance and memory efficiency. Additionally, we propose a novel positional disentanglement strategy that explicitly removes positional biases from dense representations, thereby improving the encoding of semantic content. This leads to consistent gains on several downstream benchmarks, demonstrating the utility of cleaner feature spaces. Our contributions establish a new standard for transparent, high-performance vision models and open a path toward more reproducible and generalizable foundation models for the broader AI community. The code and model checkpoints are available at https://github.com/valeoai/Franca.
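To make the clustering idea more concrete, here is a minimal NumPy sketch of nested (Matryoshka-style) clustering heads combined with Sinkhorn-Knopp assignment. It is not taken from the Franca codebase: the function names, the prefix sizes (16, 32, 64), the cluster counts (8, 16, 32), the temperature, and the random prototypes are all assumptions made for illustration. Each head reads a longer prefix of the same feature vector and assigns it to a larger codebook, and Sinkhorn-Knopp normalization (in the style popularized by SwAV) yields soft assignments that are roughly balanced across clusters.

```python
import numpy as np


def sinkhorn_knopp(scores, n_iters=3, eps=0.05):
    """Convert (batch, clusters) scores into roughly balanced soft assignments."""
    q = np.exp((scores - scores.max()) / eps).T  # (clusters, batch)
    q /= q.sum()
    n_clusters, batch = q.shape
    for _ in range(n_iters):
        # Row step: give every cluster the same total mass.
        q /= q.sum(axis=1, keepdims=True)
        q /= n_clusters
        # Column step: give every sample the same total mass.
        q /= q.sum(axis=0, keepdims=True)
        q /= batch
    return (q * batch).T  # (batch, clusters); each row sums to 1


def matryoshka_assignments(features, prototypes):
    """Cluster nested prefixes of one feature vector with separate heads.

    prototypes maps a prefix length to a (prefix_length, n_clusters) matrix.
    Hypothetical helper for illustration only.
    """
    assignments = {}
    for dim, protos in prototypes.items():
        z = features[:, :dim]  # nested slice: first `dim` channels
        z = z / np.linalg.norm(z, axis=1, keepdims=True)
        c = protos / np.linalg.norm(protos, axis=0, keepdims=True)
        assignments[dim] = sinkhorn_knopp(z @ c)  # cosine-similarity scores
    return assignments


# Random stand-ins for encoder features and learned prototypes (assumption).
rng = np.random.default_rng(0)
features = rng.normal(size=(128, 64))
prototypes = {d: rng.normal(size=(d, k)) for d, k in [(16, 8), (32, 16), (64, 32)]}
assignments = matryoshka_assignments(features, prototypes)
```

Because every head slices the same feature vector, the coarse and fine assignments share one encoder output rather than each requiring its own full-width projection.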

Computer Vision, Multimodal Models, Open-Source Models & Weights

Citation Metrics

Citations: 8

Influential citations: 1

References: 94

Year: 2025

Venue: arXiv.org
