LVLMs can be made significantly less prone to hallucinations, without any training, by explicitly grounding them in visual evidence and iteratively self-refining their answers based on verified information.
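The general recipe — answer, verify each claim against the image, revise — can be sketched as a simple loop. Below is a hypothetical Python sketch of that pattern, not the paper's actual pipeline; `answer_fn`, `extract_claims`, and `verify` are assumed stand-ins for the LVLM call, a claim extractor, and a visual verifier.

```python
from typing import Callable, List

def grounded_self_refine(
    image,
    question: str,
    answer_fn: Callable[..., str],               # LVLM: (image, question, feedback) -> answer
    extract_claims: Callable[[str], List[str]],  # answer -> atomic visual claims
    verify: Callable[[object, str], bool],       # (image, claim) -> supported by image?
    max_rounds: int = 3,
) -> str:
    """Training-free loop: answer, check claims against the image, refine."""
    answer = answer_fn(image, question, feedback=None)
    for _ in range(max_rounds):
        unsupported = [c for c in extract_claims(answer) if not verify(image, c)]
        if not unsupported:
            break  # every claim is visually grounded; stop refining
        feedback = (
            "These claims are not supported by the image: "
            + "; ".join(unsupported)
            + ". Rewrite the answer using only verified visual evidence."
        )
        answer = answer_fn(image, question, feedback=feedback)
    return answer
```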
Finally, a single model that can convincingly generate both your face and voice, controlled by text prompts and reference clips.
Text-to-video generation gets a 1.58x speed boost with CalibAtt, a training-free method that exploits consistent sparsity patterns in attention layers.
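Calibrate-once sparse attention is a general pattern: run a few calibration prompts, record which attention positions are consistently negligible, then reuse that static mask at inference. The PyTorch sketch below illustrates that pattern under assumed shapes and a made-up `keep_ratio` parameter; it is not CalibAtt's actual algorithm, and a real speedup would come from a sparse kernel that skips masked positions rather than filling them with `-inf`.

```python
import torch

def calibrate_sparse_mask(attn_maps: torch.Tensor, keep_ratio: float = 0.4) -> torch.Tensor:
    """Derive a static sparsity mask from attention maps collected on
    calibration prompts.

    attn_maps: (num_samples, heads, q_len, k_len) softmax attention weights.
    Returns a boolean mask of shape (heads, q_len, k_len); True = keep.
    """
    mean_attn = attn_maps.mean(dim=0)                    # average over calibration runs
    k = max(1, int(keep_ratio * mean_attn.shape[-1]))
    thresh = mean_attn.topk(k, dim=-1).values[..., -1:]  # per-row top-k cutoff
    return mean_attn >= thresh

def sparse_attention(q, k, v, mask):
    """Attention that reuses the calibrated sparsity pattern."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))    # drop calibrated-out positions
    return torch.softmax(scores, dim=-1) @ v
```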
Tri-modal masked diffusion models can now be trained from scratch, achieving strong results in text generation, text-to-image, and text-to-speech, thanks to a systematic exploration of the design space and a novel SDE-based batch-size reparameterization.
RL fine-tuning can make vision-language models *less* reliable reasoners, as gains in benchmark accuracy come at the cost of chains-of-thought that stay faithful to the underlying visual evidence.