This paper introduces SP-CLIP, a zero-shot action recognition framework that uses structured semantic prompts to enhance frozen vision-language models. SP-CLIP describes each action at multiple semantic levels (intent, motion, object interaction) and aligns video representations with these enriched textual semantics via prompt aggregation and consistency scoring, without modifying the visual encoder or learning new parameters. Experiments on standard benchmarks show that SP-CLIP substantially improves zero-shot action recognition, especially for fine-grained and compositional actions, while preserving the efficiency and generalization of the pretrained model.
Semantic prompting alone can unlock surprisingly strong zero-shot action recognition, rivaling methods that require architectural changes or fine-tuning.
Zero-shot action recognition relies on transferring knowledge from vision-language models to unseen actions using semantic descriptions. While recent methods focus on temporal modeling or architectural adaptations to handle video data, we argue that semantic prompting alone provides a strong and underexplored signal for zero-shot action understanding. We introduce SP-CLIP, a lightweight framework that augments frozen vision-language models with structured semantic prompts describing actions at multiple levels of abstraction, such as intent, motion, and object interaction. Without modifying the visual encoder or learning additional parameters, SP-CLIP aligns video representations with enriched textual semantics through prompt aggregation and consistency scoring. Experiments across standard benchmarks show that semantic prompting substantially improves zero-shot action recognition, particularly for fine-grained and compositional actions, while preserving the efficiency and generalization of pretrained models.
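The abstract does not spell out how prompt aggregation or consistency scoring are computed, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that frame and prompt embeddings have already been produced by a frozen vision-language model, that a video is represented by mean-pooling its frame embeddings, that aggregation averages the per-level prompt embeddings of each action, and that consistency is the negative standard deviation of the per-level similarities. The names `pool_video`, `score_actions`, and the weight `alpha` are hypothetical.

```python
import numpy as np

# Abstraction levels named in the abstract; the order of axis 1 in prompt_embs.
LEVELS = ("intent", "motion", "object interaction")


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def pool_video(frame_embs):
    """Mean-pool frame embeddings of shape (T, D) into one unit-norm vector (D,).

    Assumption: simple temporal averaging; the paper may pool differently.
    """
    return l2_normalize(l2_normalize(frame_embs).mean(axis=0))


def score_actions(video_emb, prompt_embs, alpha=0.5):
    """Score each action class for one video.

    video_emb:   (D,) unit-norm video embedding.
    prompt_embs: (C, L, D) text embeddings, one per class and abstraction level.
    alpha:       hypothetical weight on the consistency term.
    Returns an array of shape (C,).
    """
    prompts = l2_normalize(prompt_embs)

    # Prompt aggregation (assumed): average the level embeddings per class.
    class_emb = l2_normalize(prompts.mean(axis=1))   # (C, D)
    aggregated_sim = class_emb @ video_emb           # (C,)

    # Consistency scoring (assumed): levels that agree with each other
    # about the video have a low spread of cosine similarities.
    level_sims = prompts @ video_emb                 # (C, L)
    consistency = -level_sims.std(axis=1)            # (C,)

    return aggregated_sim + alpha * consistency


# Toy data standing in for frozen-encoder outputs (2 classes, 3 levels, D=4).
prompt_embs = np.array([
    [[1.0, 0.1, 0.0, 0.0], [1.0, 0.0, 0.1, 0.0], [1.0, 0.0, 0.0, 0.1]],
    [[0.1, 1.0, 0.0, 0.0], [0.0, 1.0, 0.1, 0.0], [0.0, 1.0, 0.0, 0.1]],
])
frame_embs = np.array([[1.0, 0.05, 0.0, 0.0], [0.9, 0.0, 0.05, 0.0]])

scores = score_actions(pool_video(frame_embs), prompt_embs)
predicted_class = int(np.argmax(scores))
```

Nothing in this sketch is trained: the only tunable quantity is `alpha`, which is consistent with the claim that no additional parameters are learned. In practice the toy arrays would be replaced by text and frame embeddings from the frozen model.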