Kwai Keye-VL-2.0-30B-A3B is a Mixture-of-Experts multimodal model that adapts DeepSeek Sparse Attention to process ultra-long video contexts efficiently, handling 256K contexts losslessly while still capturing critical frames and long-range temporal dependencies. It addresses the computational cost of long videos with an optimized training and inference infrastructure, and it uses Cross-Modal Multi-Teacher On-Policy Distillation to align the model on multiple tasks without catastrophic forgetting. Evaluations show that Keye-VL-2.0 achieves state-of-the-art results among models of similar scale, most notably in fine-grained temporal localization and long-video comprehension.
Keye-VL-2.0 achieves lossless processing of 256K video contexts, advancing long-video understanding and agent collaboration.
We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. To address the ultra-long contexts, information redundancy, and prohibitive computational costs inherent in hour-level videos, Keye-VL-2.0 is the first to adapt DeepSeek Sparse Attention (DSA) to GQA-based multimodal architectures, enabling lossless 256K context processing while capturing critical frames and long-range temporal dependencies. This architecture is supported by a highly optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels that increase throughput and reduce computational overhead. Furthermore, to overcome catastrophic forgetting during multi-task alignment, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD), paired with Context-RL and Video-RL. By distilling dense token-level teacher feedback from on-policy rollouts back into the MoE backbone, which activates only 3B parameters, Keye-VL-2.0 natively supports advanced agent collaboration across Code, Tool, and Search scenarios with multimodal self-correction. Extensive evaluations across video understanding, temporal grounding, reasoning, STEM, and agent benchmarks demonstrate that Keye-VL-2.0-30B-A3B achieves state-of-the-art performance among models of similar scale, particularly excelling in fine-grained temporal localization on TimeLens and long-video comprehension on Video-MME-v2 and LongVideoBench. We release our model checkpoints to accelerate community progress toward scalable and robust multimodal agentic applications.
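The abstract does not describe how DSA selects tokens, so the following is only a generic sketch of top-k sparse attention, the broad idea of letting each query attend to a small selected subset of the context rather than all of it. Everything here is an assumption made for illustration: the function name `topk_sparse_attention`, the `index_scores` input (a stand-in for whatever selector ranks context tokens), and the single-query, single-head setup. It is not the Keye-VL-2.0 or DeepSeek implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def topk_sparse_attention(query, keys, values, index_scores, k):
    """Attend from one query to only the k context tokens with the
    highest index_scores (hypothetical selector output).

    Returns (output_vector, selected_positions).
    """
    k = min(k, len(keys))
    # Rank positions by selector score; ties go to the earlier position.
    ranked = sorted(range(len(keys)), key=lambda i: (-index_scores[i], i))
    selected = sorted(ranked[:k])  # restore temporal order

    # Standard scaled dot-product attention, restricted to the selection.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    weights = softmax(logits)

    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, selected))
        for d in range(dim)
    ]
    return output, selected


if __name__ == "__main__":
    query = [1.0, 0.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    index_scores = [0.9, 0.1, 0.8, 0.2]  # made-up selector scores
    out, picked = topk_sparse_attention(query, keys, values, index_scores, k=2)
    print(picked, out)
```

With `k` equal to the context length, this reduces to ordinary dense attention. The per-query attention cost grows with `k` rather than with the full context length, which is the general motivation for sparse attention on hour-level video. The sketch leaves out the cost of computing the selector scores, and it does not demonstrate the lossless behavior the abstract reports.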