Silvio Savarese

Papers on Lattice

Total citations

Topics

h-index

Publication activity (papers/week, last 8 weeks)

Research focus

Multimodal Models (3) · Speech & Audio (3) · Architecture Design (Transformers, SSMs, MoE) (2) · Training Efficiency & Optimization (2)

Frequent co-authors

Caiming Xiong (3), Jielin Qiu (3), Liangwei Yang (3), Ming Zhu (3)

Papers (7)

Jul 13, 2026

1w ago·also Stanford HAI, Salesforce AI

Evidence-Backed Video Question Answering

Video LLMs can significantly improve their QA performance by integrating spatio-temporal evidence, bridging the gap between accuracy and visual perception.

Shijie Wang, Honglu Zhou +13

Computer Vision · Multimodal Models

Jul 2, 2026

3w ago

Mixture-of-Parallelisms: Towards Memory-Efficient Training Stack for Mixture-of-Experts Models

Achieving 4.7x to 8.2x higher throughput for trillion-parameter MoE models could redefine the limits of large-scale model training.

Xuan-Phi Nguyen, Shrey Pandit, Yiran Zhao +3

Architecture Design (Transformers, SSMs, MoE) · Distributed Systems & Hardware · Training Efficiency & Optimization

Apr 12, 2026

Whisper-AuT: Domain-Adapted Audio Encoder for Efficient Audio-LLM Training

Whisper's speech-centric training leaves audio-LLMs tone-deaf to music and environmental sounds, but a simple fine-tune can fix that.

Jielin Qiu, Ming Zhu, Wenting Zhao +8

Multimodal Models · Speech & Audio · Training Efficiency & Optimization

Mar 5, 2026

NVIDIA·Mar 5, 2026·also Salesforce AI

Building Enterprise Realtime Voice Agents from Scratch: A Technical Tutorial

Forget slow, end-to-end models: building real-time voice agents hinges on a cascaded streaming pipeline, as demonstrated by a new tutorial achieving sub-second latency.

Jielin Qiu, Zixiang Chen, Liangwei Yang +11

Open-Source Models & Weights · Speech & Audio · Tool Use & Agents

Mar 4, 2026

NVIDIA·Mar 4, 2026

Position: Vector Prompt Interfaces Should Be Exposed to Enable Customization of Large Language Models

Forget text prompts: vector prompt interfaces are the key to unlocking scalable and stable LLM customization.

Liangwei Yang, Shiyu Wang, Rithesh Murthy +11

Architecture Design (Transformers, SSMs, MoE) · Natural Language Processing · Open-Source Models & Weights

Mar 2, 2026

Mosaic AI·Mar 2, 2026

VoiceAgentRAG: Solving the RAG Latency Bottleneck in Real-Time Voice Agents Using Dual-Agent Architectures

Real-time voice agents can bypass slow vector DB lookups with a dual-agent architecture that pre-fetches relevant documents into a sub-millisecond semantic cache.

Jielin Qiu, Zixiang Chen, Liangwei Yang +13

Recommendation & Information Retrieval · Speech & Audio · Tool Use & Agents

Jan 15, 2026

Jan 15, 2026·also Stanford HAI, UNC

Future Optical Flow Prediction Improves Robot Control & Video Generation

A unified Vision-Language Model and Diffusion architecture unlocks surprisingly effective optical flow forecasting from noisy web data, enabling language-conditioned robot control and video generation.

Kanchana Ranasinghe, Honglu Zhou, Yu Fang +7

Multimodal Models · Robotics & Embodied AI
