This paper introduces SemLT3D, a Semantic-Guided Expert Distillation framework that addresses long-tail imbalance in camera-only 3D object detection for autonomous driving. It combines a language-guided mixture-of-experts module with a semantic projection distillation pipeline to strengthen the representation of underrepresented, safety-critical categories and to improve robustness to inter-class ambiguity and intra-class diversity. The authors report that the approach mitigates the effects of the long-tail distribution and also yields more coherent and discriminative features, which improves robustness under appearance variations and challenging corner cases.
A semantic-guided expert framework improves camera-only 3D detection of rare but safety-critical categories such as children and emergency vehicles.
Camera-only 3D object detection has emerged as a cost-effective and scalable alternative to LiDAR for autonomous driving, yet existing methods primarily prioritize overall performance while overlooking the severe long-tail imbalance inherent in real-world datasets. In practice, many rare but safety-critical categories such as children, strollers, or emergency vehicles are heavily underrepresented, leading to biased learning and degraded performance. This challenge is further exacerbated by pronounced inter-class ambiguity (e.g., visually similar subclasses) and substantial intra-class diversity (e.g., objects varying widely in appearance, scale, pose, or context), which together hinder reliable long-tail recognition. In this work, we introduce SemLT3D, a Semantic-Guided Expert Distillation framework designed to enrich the representation space for underrepresented classes through semantic priors. SemLT3D consists of: (1) a language-guided mixture-of-experts module that routes 3D queries to specialized experts according to their semantic affinity, enabling the model to better disentangle confusing classes and specialize on tail distributions; and (2) a semantic projection distillation pipeline that aligns 3D queries with CLIP-informed 2D semantics, producing more coherent and discriminative features across diverse visual manifestations. Although motivated by long-tail imbalance, the semantically structured learning in SemLT3D also improves robustness under broader appearance variations and challenging corner cases, offering a principled step toward more reliable camera-only 3D perception.
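The two components described above can be illustrated with a minimal sketch. The abstract does not specify the routing function or the distillation loss, so everything here is an assumption made for illustration: cosine similarity as the "semantic affinity", softmax gating with top-k selection, a cosine-distance loss for alignment, and the hypothetical names `route_query`, `distillation_loss`, and `expert_anchors` (standing in for per-expert text embeddings). It is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_query(query, expert_anchors, top_k=1, temperature=0.1):
    """Pick the top_k experts whose semantic anchors best match the query.

    Assumption: each expert is represented by one anchor embedding (for
    example, a text embedding of a class group), and affinity is cosine
    similarity. Returns [(expert_index, gate_weight), ...], with the
    weights renormalized to sum to 1 over the selected experts.
    """
    affinities = [cosine(query, anchor) for anchor in expert_anchors]
    gates = softmax(affinities, temperature)
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:top_k]
    selected_mass = sum(gates[i] for i in top)
    return [(i, gates[i] / selected_mass) for i in top]


def distillation_loss(projected_query, teacher_embedding):
    """Cosine-distance alignment loss (assumed form).

    Zero when the projected 3D query points in the same direction as the
    2D teacher embedding, and 1 when the two are orthogonal.
    """
    return 1.0 - cosine(projected_query, teacher_embedding)


# Toy usage with placeholder 3-dimensional embeddings.
anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # two hypothetical expert anchors
query = [0.9, 0.1, 0.0]                        # a query close to the first anchor
routes = route_query(query, anchors, top_k=1)
```

In this toy setup the query is routed to the first expert because its embedding is closest to that expert's anchor. In a real system the anchors and teacher embeddings would come from a pretrained vision-language model such as CLIP, and the loss would be averaged over matched query and region pairs; those details are left out because the abstract does not state them.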