The paper introduces TriLite, a weakly supervised object localization (WSOL) framework that pairs a frozen, self-supervised DINOv2-pretrained Vision Transformer backbone with a lightweight TriHead module that separates patch features into foreground, background, and ambiguous regions. This design targets two limitations of prior WSOL methods: partial object coverage, and high training cost from multi-stage pipelines or full fine-tuning. TriLite achieves state-of-the-art localization performance on CUB-200-2011, ImageNet-1K, and OpenImages while using fewer than 800K trainable parameters.
Forget full fine-tuning: TriLite achieves state-of-the-art weakly supervised object localization by freezing a DINOv2 ViT and training fewer than 800K parameters.
Weakly supervised object localization (WSOL) aims to localize target objects in images using only image-level labels. Despite recent progress, many approaches still rely on multi-stage pipelines or full fine-tuning of large backbones, which increases training cost, and partial object coverage remains a persistent challenge across the field. We present TriLite, a single-stage WSOL framework that leverages a frozen Vision Transformer with self-supervised DINOv2 pre-training and introduces only a minimal number of trainable parameters (fewer than 800K on ImageNet-1K) for both classification and localization. At its core is the proposed TriHead module, which decomposes patch features into foreground, background, and ambiguous regions, thereby improving object coverage while suppressing spurious activations. By disentangling classification and localization objectives, TriLite effectively exploits the universal representations learned by self-supervised ViTs without requiring expensive end-to-end training. Extensive experiments on CUB-200-2011, ImageNet-1K, and OpenImages demonstrate that TriLite sets a new state of the art while remaining significantly more parameter-efficient and easier to train than prior methods. The code will be released soon.
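The abstract does not spell out how TriHead is built, so the following is only a rough illustration of the general idea of a three-way patch decomposition on top of frozen features, not the authors' implementation. Everything in it is an assumption made for illustration: the class name `TriHeadSketch`, the per-patch softmax over three region logits, the foreground-weighted pooling, the linear classifier, the feature width of 384, the 256 patches, and the random array standing in for frozen backbone outputs.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


class TriHeadSketch:
    """Hypothetical three-way patch head (NOT the paper's TriHead)."""

    def __init__(self, dim, num_classes, rng):
        # Only these arrays would be trained; the backbone stays frozen.
        self.w_region = rng.normal(0.0, 0.02, size=(dim, 3))  # fg / bg / ambiguous
        self.b_region = np.zeros(3)
        self.w_cls = rng.normal(0.0, 0.02, size=(dim, num_classes))
        self.b_cls = np.zeros(num_classes)

    def num_trainable(self):
        params = (self.w_region, self.b_region, self.w_cls, self.b_cls)
        return sum(p.size for p in params)

    def forward(self, patches):
        # patches: (num_patches, dim) features from a frozen backbone.
        region = softmax(patches @ self.w_region + self.b_region)  # (N, 3)
        fg = region[:, 0]
        # Assumed design: pool patch features weighted by foreground probability,
        # so classification depends on the same map used for localization.
        pooled = (fg[:, None] * patches).sum(axis=0) / (fg.sum() + 1e-8)
        logits = pooled @ self.w_cls + self.b_cls  # (num_classes,)
        return region, logits


rng = np.random.default_rng(0)
patches = rng.normal(size=(256, 384))  # stand-in for frozen ViT patch features
head = TriHeadSketch(dim=384, num_classes=1000, rng=rng)
region, logits = head.forward(patches)
fg_map = region[:, 0].reshape(16, 16)  # foreground map on an assumed 16x16 grid
```

Each row of `region` sums to one, so every patch is softly assigned to foreground, background, or the ambiguous class. In a sketch like this, the linear classifier accounts for nearly all of the trainable parameters, so the total depends mainly on the assumed feature width and the number of classes.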