Apr 27, 2026 · arXiv:2604.24997

DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation

AI Summary

This paper introduces DouC, a training-free open-vocabulary segmentation framework that leverages a dual-branch CLIP architecture to improve both local token reliability and spatial coherence. One branch, OG-CLIP, uses token gating to enhance patch-level reliability, while the other, FADE-CLIP, incorporates structural priors via proxy attention guided by frozen vision foundation models. By fusing the outputs of these branches and optionally applying instance-aware correction, DouC achieves state-of-the-art zero-shot segmentation performance across eight benchmarks without requiring any additional training.

Key Contribution

DouC achieves state-of-the-art zero-shot segmentation among training-free methods by fusing two CLIP branches at the logit level, one focused on local token reliability and the other on structural priors, without any additional training.

Abstract

Open-vocabulary semantic segmentation requires assigning pixel-level semantic labels while supporting an open and unrestricted set of categories. Training-free CLIP-based approaches preserve strong zero-shot generalization but typically rely on a single inference mechanism, limiting their ability to jointly address unreliable local tokens and insufficient spatial coherence. We propose DouC, a training-free dual-branch CLIP framework that decomposes dense prediction into two complementary components. OG-CLIP improves patch-level reliability via lightweight, inference-time token gating, while FADE-CLIP injects external structural priors through proxy attention guided by frozen vision foundation models. The two branches are fused at the logit level, enabling local token reliability and structure-aware patch interactions to jointly influence final predictions, with optional instance-aware correction applied as post-processing. DouC introduces no additional learnable parameters, requires no retraining, and preserves CLIP's zero-shot generalization. Extensive experiments across eight benchmarks and multiple CLIP backbones demonstrate that DouC consistently outperforms prior training-free methods and scales favorably with model capacity.
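The logit-level fusion described in the abstract can be illustrated with a small sketch. The abstract does not specify the fusion rule, so the convex combination below, the `weight` parameter, and the function names `fuse_logits` and `predict_labels` are all assumptions made for illustration, not the authors' implementation. Each branch is assumed to have already produced per-patch class logits (for example, patch-to-text similarities) of the same shape.

```python
def fuse_logits(og_logits, fade_logits, weight=0.5):
    """Blend two per-patch logit tables (num_patches x num_classes).

    Assumption: a simple convex combination; the paper's actual
    fusion rule is not stated in the abstract.
    """
    if len(og_logits) != len(fade_logits):
        raise ValueError("both branches must cover the same patches")
    fused = []
    for og_row, fade_row in zip(og_logits, fade_logits):
        if len(og_row) != len(fade_row):
            raise ValueError("both branches must score the same classes")
        fused.append([weight * a + (1.0 - weight) * b
                      for a, b in zip(og_row, fade_row)])
    return fused


def predict_labels(fused):
    """Pick the highest-scoring class index for every patch."""
    return [max(range(len(row)), key=row.__getitem__) for row in fused]


# Toy example: 2 patches, 3 candidate classes (hypothetical values).
og_logits = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]]    # stand-in for OG-CLIP output
fade_logits = [[0.0, 1.0, 4.0], [0.0, 1.0, 1.0]]  # stand-in for FADE-CLIP output

fused = fuse_logits(og_logits, fade_logits, weight=0.5)
labels = predict_labels(fused)
```

In this toy case the first patch is labeled differently from what the first branch alone would predict, which shows how fusing before the argmax lets either branch influence the final label. Any instance-aware correction would operate on the resulting label map as a separate post-processing step.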

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 31

Year: 2026

Venue: N/A
