VAANI's open-sourced dataset offers unprecedented coverage of India's linguistic landscape, finally giving researchers the data needed to build truly inclusive speech models.
LLMs can navigate complex 3D environments more effectively and with far fewer tokens by using a hierarchical scene graph representation derived from omnidirectional sensor data.
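As a rough illustration of how a hierarchical scene graph can cut token counts, here is a minimal sketch: the environment is summarized as rooms, objects, and connections, then serialized into a compact textual prompt. The `Room` dataclass and `serialize_scene_graph` helper are hypothetical stand-ins, not the paper's interface.

```python
from dataclasses import dataclass, field

@dataclass
class Room:
    name: str
    objects: list[str]
    connects_to: list[str] = field(default_factory=list)

def serialize_scene_graph(rooms: list[Room]) -> str:
    """Flatten the room/object hierarchy into a short, token-cheap description."""
    lines = []
    for room in rooms:
        objs = ", ".join(room.objects) or "none"
        doors = ", ".join(room.connects_to) or "none"
        lines.append(f"{room.name}: objects=[{objs}]; doors->[{doors}]")
    return "\n".join(lines)

rooms = [
    Room("kitchen", ["mug", "stove"], ["hallway"]),
    Room("hallway", [], ["kitchen", "office"]),
    Room("office", ["laptop", "chair"], ["hallway"]),
]

# A few dozen tokens of structured context instead of raw sensor streams.
prompt = (
    "Scene graph:\n" + serialize_scene_graph(rooms)
    + "\nTask: go from the kitchen to the laptop. List the rooms to traverse."
)
print(prompt)
```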
Forget simulated manipulation—ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.
By disentangling camera-space estimation from world-space refinement via dual diffusion models, DuoMo achieves state-of-the-art human motion reconstruction from noisy video, bypassing the limitations of parametric models.
Forget fine-tuning: prompting MLLMs with a dynamic interval-based decoding strategy lets them generate surprisingly human-like, pause-aware real-time game commentary.
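To make the interval-based idea concrete, here is a minimal sketch assuming a generic chat-style MLLM call: the model is queried once per time window and may return nothing, which the commentator renders as a natural pause. `query_mllm` is a placeholder, not the paper's API.

```python
import time

def query_mllm(frame_summary: str) -> str:
    # Placeholder: a real system would send recent frames to an MLLM here.
    return "" if frame_summary == "idle" else f"Commentary on: {frame_summary}"

def commentary_loop(event_stream, interval_s: float = 2.0):
    for frame_summary in event_stream:
        utterance = query_mllm(frame_summary)
        if utterance:
            print(f"[speak] {utterance}")
        else:
            print("[pause]")       # human-like silence instead of filler text
        time.sleep(interval_s)     # decode once per interval, not continuously

commentary_loop(["kickoff", "idle", "goal scored", "idle"], interval_s=0.1)
```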
VLA models struggle with physical reasoning, but Pri4R's simple trick of predicting 3D point tracks during training boosts performance by up to 40% on manipulation tasks, without adding any inference overhead.
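The mechanism is a standard auxiliary head, so a short sketch captures it, under the assumption that Pri4R attaches a point-track regressor to the shared features during training only; the names, shapes, and loss weight below are illustrative.

```python
import torch
import torch.nn as nn

class VLAWithTrackAux(nn.Module):
    def __init__(self, feat_dim=256, action_dim=7, n_points=32, horizon=8):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(512, feat_dim), nn.ReLU())
        self.action_head = nn.Linear(feat_dim, action_dim)
        # Auxiliary head: predicts (n_points x horizon x xyz) future 3D tracks.
        self.track_head = nn.Linear(feat_dim, n_points * horizon * 3)

    def forward(self, obs, predict_tracks=False):
        feat = self.backbone(obs)
        action = self.action_head(feat)
        if predict_tracks:                 # training-time only
            return action, self.track_head(feat)
        return action                      # inference: no extra compute

model = VLAWithTrackAux()
obs = torch.randn(4, 512)
action, tracks = model(obs, predict_tracks=True)
# Hypothetical targets; the auxiliary term shapes the shared representation.
loss = nn.functional.mse_loss(action, torch.zeros_like(action)) \
     + 0.1 * nn.functional.mse_loss(tracks, torch.zeros_like(tracks))
loss.backward()
```

Skipping `predict_tracks` at inference means the deployed policy runs the exact same compute graph as a model trained without the auxiliary head, which is how the zero-overhead claim is possible.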
Forget monolithic models: pMoE shows that ensembling diverse expert prompts within a single model framework yields surprisingly large gains in visual adaptation across a wide range of tasks.
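A minimal sketch of what prompt-expert ensembling could look like inside a frozen backbone: several learned prompt banks are mixed per input by a lightweight router and prepended to the patch tokens. Layer names and dimensions are assumptions, not pMoE's actual design.

```python
import torch
import torch.nn as nn

class PromptMoE(nn.Module):
    def __init__(self, n_experts=4, prompt_len=8, dim=768):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(n_experts, prompt_len, dim) * 0.02)
        self.router = nn.Linear(dim, n_experts)

    def forward(self, tokens):                            # tokens: (B, N, dim)
        weights = self.router(tokens.mean(dim=1)).softmax(-1)  # (B, n_experts)
        # Convex combination of expert prompt banks, computed per example.
        prompt = torch.einsum("be,epd->bpd", weights, self.experts)
        return torch.cat([prompt, tokens], dim=1)         # prepend mixed prompt

moe = PromptMoE()
tokens = torch.randn(2, 196, 768)                         # e.g. ViT patch tokens
print(moe(tokens).shape)                                  # torch.Size([2, 204, 768])
```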
Finally, digital humans can have realistic, socially aware conversations: DyaDiT generates dyadic gestures that users strongly prefer over existing methods.
By decomposing long-horizon manipulation into transport and object-centric interaction, LiLo-VLA achieves state-of-the-art zero-shot generalization and robustness, outperforming end-to-end VLA models by a large margin.
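The transport/interaction split is easy to picture as a two-phase controller; the toy sketch below alternates a coarse move-to-target step with a stubbed object-centric skill. The structure is the point here; both policies are placeholders, not LiLo-VLA's.

```python
def transport_policy(gripper, target, step=0.5):
    """Phase 1: coarse free-space motion toward the object's neighborhood."""
    return tuple(g + step * (t - g) for g, t in zip(gripper, target))

def interaction_policy(obj):
    """Phase 2: local, object-centric skill (grasp, pour, insert...), stubbed."""
    return f"grasp({obj})"

def run_task(subgoals, gripper=(0.0, 0.0, 0.5)):
    for obj, obj_pos in subgoals:
        # Transport until the gripper is within 5 cm of the target region.
        while max(abs(g - t) for g, t in zip(gripper, obj_pos)) > 0.05:
            gripper = transport_policy(gripper, obj_pos)
        print(interaction_policy(obj), "near", tuple(round(g, 2) for g in gripper))

run_task([("mug", (0.3, 0.2, 0.1)), ("kettle", (0.6, 0.4, 0.1))])
```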
Forget language and appearance: CAD models can now directly prompt accurate instance segmentation of industrial objects, even with diverse surface properties.
Robots can now perform intricate assembly tasks and recover from errors in real-time, without any training, by fusing vision-language models with video-based kinematic priors for action planning.
Forget cloud GPUs: a new model brings unified multimodal understanding and generation to your iPhone, running 6x faster than alternatives.
AudioChat tackles the complexity of "audio stories" by using LLM-driven tool-calling agents to simulate user interactions, enabling audio foundation models to generate, edit, and understand complex multi-source acoustic scenes.
Image-to-image editors silently weaken or ignore your edit instructions based on the subject's race, gender, and age, revealing surprising demographic biases.
MLLMs struggle with multi-turn chart editing, forgetting context and accumulating errors, especially when the edits involve data transformations, not just styling.
An educational RAG system achieves 84% accuracy in answering student questions with minimal human editing, suggesting a practical path towards scalable AI-assisted teaching.
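Below is a minimal retrieval-augmented answering loop of the kind such a system implies, with a crude lexical retriever, a stubbed LLM call, and the human-editing step flagged at the end; all of it is an assumption-laden sketch, not the paper's pipeline.

```python
def score(question: str, doc: str) -> float:
    """Crude lexical-overlap retrieval score; a real system would embed."""
    q, d = set(question.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def ask_llm(prompt: str) -> str:
    return "DRAFT: ... (generated from the retrieved context)"   # stub

def answer(question: str, corpus: list[str], k: int = 2) -> str:
    top = sorted(corpus, key=lambda d: score(question, d), reverse=True)[:k]
    prompt = "Context:\n" + "\n".join(top) + f"\nQuestion: {question}\nAnswer:"
    draft = ask_llm(prompt)
    return draft + "\n[queued for instructor review]"    # minimal human editing

corpus = [
    "Gradient descent updates parameters along the negative gradient.",
    "A confusion matrix summarizes classification errors per class.",
]
print(answer("How does gradient descent update parameters?", corpus))
```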
Forget clunky skeletons: this new model lets you prompt your way to accurate 3D human meshes from single images, even in the wildest poses.
Forget slow text-based communication: Vision Wormhole unlocks faster multi-agent reasoning by turning VLMs into telepathic hubs, slashing runtime without sacrificing fidelity.
Stop treating generated images like real ones: GMAIL aligns them as separate modalities in a shared latent space, unlocking significant gains in vision-language tasks.
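Conceptually, this can be sketched as a modality embedding added to generated versus real image features before a CLIP-style contrastive alignment with text. The encoders, dimensions, and loss below are illustrative assumptions, not GMAIL's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 256
encode_image = nn.Linear(512, dim)        # stand-in for a vision encoder
encode_text = nn.Linear(384, dim)         # stand-in for a text encoder
modality_emb = nn.Embedding(2, dim)       # 0 = real image, 1 = generated image

def embed_image(feats, is_generated):
    ids = torch.full((feats.size(0),), int(is_generated))
    z = encode_image(feats) + modality_emb(ids)     # shift by modality
    return F.normalize(z, dim=-1)

def clip_style_loss(img_z, txt_z, temp=0.07):
    logits = img_z @ txt_z.t() / temp
    labels = torch.arange(img_z.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

real = embed_image(torch.randn(8, 512), is_generated=False)
fake = embed_image(torch.randn(8, 512), is_generated=True)
text = F.normalize(encode_text(torch.randn(8, 384)), dim=-1)
loss = clip_style_loss(real, text) + clip_style_loss(fake, text)
loss.backward()
```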
VLMs that ace RGB images completely fail at thermal imagery, revealing a critical gap in their ability to reason about temperature and physical properties.
Forget static datasets: RL-based co-training unlocks +20% real-world VLA performance by interactively leveraging simulation while preserving real-world capabilities.
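One plausible reading of sim/real co-training, sketched with stubs: each update batch mixes fresh simulated rollouts with replayed real trajectories, so the policy can exploit simulation without forgetting real-world behavior. None of the components below reproduce the paper's algorithm.

```python
import random

def collect_sim_rollout(policy):
    return {"obs": [random.random()], "source": "sim"}   # stubbed sim rollout

real_buffer = [{"obs": [0.4], "source": "real"} for _ in range(100)]  # fixed real data

def cotrain_step(policy, sim_ratio=0.5, batch=8):
    n_sim = int(batch * sim_ratio)
    sim = [collect_sim_rollout(policy) for _ in range(n_sim)]
    real = random.sample(real_buffer, batch - n_sim)
    # update(policy, sim + real)   # e.g. RL loss on sim, imitation loss on real
    return sim + real

mixed = cotrain_step(policy=None)
print(sum(x["source"] == "sim" for x in mixed), "sim samples out of", len(mixed))
```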
RynnBrain leapfrogs existing embodied foundation models, offering a unified, open-source spatiotemporal model that excels at physically grounded reasoning and planning across a wide range of benchmarks.
Forget rigid game environments: PAN lets you simulate open-world scenarios with language-specified actions and long-term visual coherence, opening the door to more realistic AI training.
Synthetic data generated by fine-tuning Stable Diffusion on multi-region satellite imagery boosts small-object detection accuracy by 20%, even when real labeled data is scarce.
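In outline, the recipe looks like sampling from a fine-tuned diffusion checkpoint and folding the images into detector training. The sketch below uses the real `diffusers` generation API, but the checkpoint path, prompts, and mixing step are placeholders; the fine-tuning itself is omitted.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/finetuned-satellite-sd",   # hypothetical fine-tuned checkpoint
    torch_dtype=torch.float16,
).to("cuda")

synthetic = []
for region in ["desert", "coastal", "urban"]:
    out = pipe(
        f"satellite image of a {region} area with small vehicles",
        num_images_per_prompt=4,
    )
    synthetic.extend(out.images)        # pseudo-labels would be attached here

# train_detector(real_labeled + synthetic_labeled)  # mix scarce real data with synthetic
```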
Forget tedious manual annotation: FlexDataset's composition-to-data approach crafts customized, high-fidelity annotated datasets while cutting annotation time 5x.
Achieve semantically coherent image compositions by mixing layout-focused and appearance-focused visual representations in a diffusion model's cross-attention.
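A compact sketch of the mixing idea: cross-attention keys and values become a convex blend of layout-focused and appearance-focused token embeddings, so structure and style can be steered separately. The blend weight `alpha` and both token sets are illustrative, and keys and values share one blended tensor here for brevity.

```python
import torch
import torch.nn.functional as F

def mixed_cross_attention(q, layout_kv, appearance_kv, alpha=0.6):
    """q: (B, Nq, d); *_kv: (B, Nk, d). alpha trades layout vs. appearance."""
    kv = alpha * layout_kv + (1 - alpha) * appearance_kv   # blended keys/values
    attn = F.softmax(q @ kv.transpose(1, 2) / kv.size(-1) ** 0.5, dim=-1)
    return attn @ kv

q = torch.randn(1, 64, 128)        # e.g. diffusion U-Net spatial queries
layout = torch.randn(1, 77, 128)   # layout-focused conditioning tokens
style = torch.randn(1, 77, 128)    # appearance-focused conditioning tokens
out = mixed_cross_attention(q, layout, style)
print(out.shape)                   # torch.Size([1, 64, 128])
```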