This paper presents a hybrid edge-based action detection system for public safety, combining skeleton-based motion analysis with vision-language models to address latency, privacy, and resource limitations. The system leverages skeleton data for continuous, low-overhead monitoring and vision-language models for contextual understanding and zero-shot reasoning. Evaluation on a GPU-enabled edge device demonstrates the complementary strengths of motion-centric and semantic approaches, motivating the hybrid architecture for real-time video analysis.
Achieve real-time, privacy-aware action detection on edge devices by selectively augmenting fast skeleton-based detection with vision-language models, combining the complementary strengths of both approaches.
Public spaces such as transport hubs, city centres, and event venues require timely and reliable detection of potentially violent behaviour to support public safety. While automated video analysis has made significant progress, practical deployment remains constrained by latency, privacy, and resource limitations, particularly under edge-computing conditions. This paper presents the design and demonstrator-based deployment of a hybrid edge-based action detection system that combines skeleton-based motion analysis with vision-language models for semantic scene interpretation. Skeleton-based processing enables continuous, privacy-aware monitoring with low computational overhead, while vision-language models provide contextual understanding and zero-shot reasoning capabilities for complex and previously unseen situations. Rather than proposing new recognition models, the contribution focuses on a system-level comparison of both paradigms under realistic edge constraints. The system is implemented on a GPU-enabled edge device and evaluated with respect to latency, resource usage, and operational trade-offs using a demonstrator-based setup. The results highlight the complementary strengths and limitations of motion-centric and semantic approaches and motivate a hybrid architecture that selectively augments fast skeleton-based detection with higher-level semantic reasoning. The presented system provides a practical foundation for privacy-aware, real-time video analysis in public safety applications.
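One way to picture the selective augmentation described above is a score-gated cascade. The abstract does not specify the gating rule, so the following is only a minimal sketch under assumed conditions: the skeleton stage is taken to emit a motion score in [0, 1], and the slower vision-language stage is consulted only when that score is ambiguous. The names `hybrid_decide`, `Decision`, and `query_vlm`, and the thresholds `low` and `high`, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    alert: bool          # whether to raise an alert for this clip
    source: str          # which stage decided: "skeleton" or "vlm"
    motion_score: float  # score from the (assumed) skeleton stage


def hybrid_decide(
    motion_score: float,
    query_vlm: Callable[[], bool],
    low: float = 0.3,    # assumed threshold, not from the paper
    high: float = 0.8,   # assumed threshold, not from the paper
) -> Decision:
    """Assumed gating policy: use the skeleton score when it is clear-cut,
    and call the vision-language model only for ambiguous scores."""
    if motion_score < low:
        # Clearly benign motion: no semantic stage needed.
        return Decision(False, "skeleton", motion_score)
    if motion_score >= high:
        # Clearly suspicious motion: alert without waiting for the VLM.
        return Decision(True, "skeleton", motion_score)
    # Ambiguous range: defer to the slower semantic stage.
    return Decision(bool(query_vlm()), "vlm", motion_score)
```

Under this assumed policy, the expensive model runs only on the fraction of clips whose motion score falls between the two thresholds, which is the kind of latency and resource trade-off the evaluation is concerned with. Passing `query_vlm` as a callable keeps the sketch independent of any particular model or runtime.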