NII · Mar 10, 2026 · arXiv:2603.09070

3D UAV Trajectory Estimation and Classification from Internet Videos via Language Model

Haoxiang Lei, Daotong Wang, Shenghai Yuan, Jianbo Su

AI Summary

This paper introduces a method for estimating 3D UAV trajectories and classifying UAV types directly from internet videos, without manual annotation. The approach combines language-driven video acquisition, vision-language reasoning for segment filtering, and a training-free cross-modal label generation module that infers 3D trajectory hypotheses and UAV types. A physics-informed refinement process then enforces temporal smoothness and kinematic consistency on the estimated trajectories. In zero-shot transfer experiments on a public benchmark, performance improves consistently as the amount of online video data increases.

Key Contribution

The method replaces expensive manual annotation by deriving 3D UAV trajectories and type labels directly from readily available internet videos.

Abstract

Reliable 3D trajectory estimation of unmanned aerial vehicles (UAVs) is a fundamental requirement for anti-UAV systems, yet the acquisition of large-scale and accurately annotated trajectory data remains prohibitively expensive. In this work, we present a novel framework that derives UAV 3D trajectories and category information directly from Internet-scale UAV videos, without relying on manual annotations. First, language-driven data acquisition is employed to autonomously discover and collect UAV-related videos, while vision-language reasoning progressively filters task-relevant segments. Second, a training-free cross-modal label generation module is introduced to infer 3D trajectory hypotheses and UAV type cues. Third, a physics-informed refinement process is designed to impose temporal smoothness and kinematic consistency on the estimated trajectories. The resulting video clips and trajectory annotations can be readily utilized for downstream anti-UAV tasks. To assess effectiveness and generalization, we conduct zero-shot transfer experiments on a public, well-annotated 3D UAV benchmark. Results reveal a clear data scaling behavior: as the amount of online video data increases, zero-shot transfer performance on the target dataset improves consistently, without any target-domain training. The proposed method closely approaches the current state-of-the-art, highlighting its robustness and applicability to real-world anti-UAV scenarios. Code and datasets will be released upon acceptance.
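The abstract does not specify how the physics-informed refinement is implemented. As a rough illustration only, the sketch below shows one simple way that temporal smoothness and a kinematic constraint could be imposed on a noisy 3D trajectory: a centered moving average followed by a per-step speed clamp. The function names (`smooth`, `clamp_speed`, `refine`) and the parameters `dt`, `v_max`, and `window` are hypothetical and are not taken from the paper.

```python
import math

def smooth(points, window=3):
    """Centered moving average per coordinate (a simple temporal smoothness prior)."""
    half = window // 2
    out = []
    for i in range(len(points)):
        lo, hi = max(0, i - half), min(len(points), i + half + 1)
        n = hi - lo
        out.append(tuple(sum(p[k] for p in points[lo:hi]) / n for k in range(3)))
    return out

def clamp_speed(points, dt, v_max):
    """Limit each step's displacement to v_max * dt (a crude kinematic constraint)."""
    if not points:
        return []
    out = [points[0]]
    limit = v_max * dt
    for p in points[1:]:
        prev = out[-1]
        step = tuple(p[k] - prev[k] for k in range(3))
        dist = math.sqrt(sum(s * s for s in step))
        if dist > limit:
            # Shrink the step so the implied speed does not exceed v_max.
            scale = limit / dist
            step = tuple(s * scale for s in step)
        out.append(tuple(prev[k] + step[k] for k in range(3)))
    return out

def refine(points, dt=0.1, v_max=20.0, window=3):
    """Smooth first, then enforce the speed limit (assumed values, not the paper's)."""
    return clamp_speed(smooth(points, window), dt, v_max)

# Hypothetical trajectory hypothesis in metres, with one implausible jump.
raw = [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0), (50.0, 0.0, 10.0),
       (3.0, 0.0, 10.0), (4.0, 0.0, 10.0)]
refined = refine(raw, dt=0.1, v_max=20.0)
```

With `dt=0.1` and `v_max=20.0`, no consecutive pair of refined points is more than about 2 m apart, so the outlier at the third sample no longer produces a physically implausible jump. A real refinement stage would likely use a richer motion model; this sketch only makes the two stated constraints concrete.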

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 20

Year: 2026

Venue: N/A
