Search papers, labs, and topics across Lattice.
100 papers published across 7 labs.
Achieve high-precision multi-robot SLAM with minimal data transmission by selectively compressing and transmitting keyframes and non-keyframes in a cloud-edge-robot architecture.
LMMs can slash FLOPs by 89% without sacrificing accuracy, thanks to a frequency-modulated visual restoration technique that preserves crucial visual semantics even with fewer tokens.
Tactile robotic perception gets a boost with a new pretraining method that explicitly encodes force, geometry, and orientation, leading to a 52% reduction in regression error.
Ditch the pre-trained models: TacLoc achieves accurate robotic pose estimation from tactile sensing alone by framing it as a one-shot point cloud registration problem.
Achieve near-perfect audio steganography even under heavy MP3 compression by optimizing latent reconstruction and diffusion inversion errors.
Forget paired video-music training data: V2M-Zero aligns video and music by matching the *timing* of changes within each modality, not the content itself.
Self-supervised learning can dramatically improve online HD map construction, outperforming supervised methods even with limited labeled data by leveraging geospatial consistency in BEV feature representations.
VLA-controlled robots can now detect anomalies in under 100ms using a plug-and-play module, enabling real-time recovery from unexpected situations.
Automating museum video metadata curation is now possible with a locally deployable video language model, unlocking previously inaccessible audiovisual archives.
Stop wrestling with unstable action spaces: ResWM reframes visual RL by predicting incremental action adjustments, leading to smoother control and better performance.
Unlock bimanual-level cloth manipulation with a single robotic arm using a novel tactile gripper and vision-based perception framework.
Achieve real-time photorealistic image enhancement without sacrificing visual quality or semantic consistency, thanks to a novel hybrid training strategy for GANs.
Ditch the clunky controllers: this hand-shadowing pipeline lets you teleoperate a robot arm with just an RGB-D camera and some clever inverse kinematics.
Diffusion Transformers can be accelerated by up to 7x with nearly lossless performance using a training-free method that selectively computes on sparse anchor tokens, outperforming existing temporal acceleration techniques.
LVLMs can now provide depth-aware pedestrian navigation guidance by grounding language reasoning and segmentation, without needing user-provided cues or anchor points.
Robots lost in the vineyard? Not anymore: encoding row-level semantics into a particle filter enables robust localization in repetitive agricultural environments where LiDAR and vision alone fail.
Forget training on massive datasets: this new diffusion policy learns human-like 3D scanning strategies that generalize to unseen objects while being robust to noise.
Ditch the heuristic latent spaces: Geometric Autoencoders offer a principled way to inject VFM priors into diffusion models, yielding state-of-the-art image generation with better compression and semantic depth.
Even in feature-rich environments, LiDAR SLAM systems are vulnerable to a new spoofing attack (D-SLAMSpoof) that injects dynamically coordinated spurious point clouds, but can be defended against using inertial dead reckoning.
Robots can now adaptively decide whether to clear clutter or directly grasp, leading to significantly improved success rates in densely cluttered environments.
By adaptively weighting neighbor information based on uncertainty, distributed multi-object tracking can achieve significantly better performance in mobile robot networks with heterogeneous localization quality.
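One simple instance of uncertainty-weighted neighbor fusion is inverse-variance weighting, sketched below in NumPy (the function name and numbers are illustrative, not the paper's algorithm):

```python
import numpy as np

def fuse_estimates(estimates, variances):
    # Inverse-variance weighting: neighbors with poorer localization
    # (higher variance) contribute less to the fused estimate.
    w = 1.0 / np.asarray(variances, dtype=float)
    w /= w.sum()
    return w @ np.asarray(estimates, dtype=float)

# Three neighbors report a target's x-position with varying localization quality;
# the noisy third neighbor barely shifts the fused estimate.
fused = fuse_estimates([10.0, 10.4, 12.0], [0.1, 0.2, 2.0])
```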
This new OCR model beats Gemini-3.1-Pro and Qwen3-VL-235B on key information extraction, thanks to its clever "Layout-as-Thought" process that recovers layout grounding in end-to-end OCR.
Achieve 2.5x higher success in UAV navigation by decoupling target generation from progress monitoring, enabling safer and more efficient zero-shot flight.
Forget fine-tuning: surprisingly, single neuron activations in VLMs can be directly probed to create classifiers that outperform the full model, with 5x speedups.
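A single-neuron probe can be as simple as a 1-D threshold classifier over one activation; a hypothetical sketch (not the paper's exact procedure):

```python
import numpy as np

def neuron_probe(acts, labels):
    # Search over thresholds (and direction) for the split of one neuron's
    # activations that best predicts the binary label.
    best_acc, best_t, best_sign = 0.0, None, 1
    for t in np.unique(acts):
        for sign in (1, -1):
            acc = np.mean((sign * acts >= sign * t) == labels)
            if acc > best_acc:
                best_acc, best_t, best_sign = acc, t, sign
    return best_acc, best_t, best_sign

acts = np.array([0.1, 0.2, 0.9, 1.1])          # one neuron's activations
labels = np.array([False, False, True, True])  # e.g. "is this class X?"
acc, t, sign = neuron_probe(acts, labels)      # a perfect split exists here
```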
Generative AI's ability to reason about and refine images based on authenticity criteria inadvertently creates a powerful evasion strategy that renders current deepfake detectors ineffective.
Jointly training layered Gaussian splats boosts reconstruction quality by up to 2.6 dB, proving that coordinating optimization across layers is key for progressive 2D Gaussian splatting.
Monocular depth estimation can now run at 161 FPS on edge devices without sacrificing too much accuracy, thanks to a clever asynchronous architecture that reuses features from a foundation model.
A training-free visual distillation method boosts VLA model performance in cluttered environments by over 34%, proving that targeted noise reduction is more effective than brute-force scaling.
Imagine an XR experience where you can selectively isolate and enhance individual sound sources in real-time, making chaotic audio environments crystal clear.
Ditch the slow diffusion grind: Marigold-SSD delivers zero-shot depth completion in a single step, rivaling discriminative models in speed while retaining diffusion's accuracy.
Backdoor triggers in ViTs leave a surprisingly clear signature: a linear direction in activation space that can be directly manipulated to activate or deactivate the backdoor.
Multimodal LLMs still struggle to faithfully recreate webpages from videos, particularly in capturing fine-grained style and motion, despite advances in other areas.
Bypass the need for extensive on-site data collection when deploying pre-trained robot models by visually prompting them to adapt to new scenes.
Autonomous vehicles can now better "see" the world even when cameras fail, thanks to a new method that fills in the blanks by leveraging spatial overlaps and learned semantic priors.
Skip expensive manual annotation: this method extracts accurate 3D UAV trajectories and classifications directly from readily available internet videos.
Generate realistic and controllable videos of humans interacting with objects using only sparse motion cues, like wrist positions and object bounding boxes.
By converting point clouds into a format VLMs can understand, VLM-Loc significantly boosts text-to-point-cloud localization accuracy, outperforming prior methods that rely on shallower text-point cloud correspondences.
Disagreement between pathologists, quantified as "Whole Slide Difficulty," can be leveraged to significantly boost the accuracy of AI Gleason grading, particularly for challenging cases.
Sports expose surprising limitations in VLMs' spatial reasoning, as current models struggle to generalize from existing benchmarks despite fine-tuning gains on a new, large-scale dataset.
Fine-grained foot motion capture, a notoriously hard problem, gets a 30% accuracy boost by cleverly lifting 2D keypoints to 3D using motion capture data and contextual information, bypassing the need for direct image-3D annotation pairs.
Even the most advanced MLLMs like GPT-5 and Gemini struggle to spot the "odd one out" in simple visual grids, revealing a surprising weakness in fine-grained visual perception.
Forget manual labeling: STONE offers a massive, automatically-labeled dataset for off-road robot navigation, unlocking scalable training for robust 3D traversability prediction.
Generative drifting's empirical success is no longer a mystery: it's secretly score matching, but with frequency-dependent convergence bottlenecks that explain the preference for Laplacian kernels.
Achieve SOTA multi-modal object tracking by adaptively fusing modalities with a Mixture of Experts and decoupling temporal propagation with separate State Space Models.
Predictive Spectral Calibration boosts source-free test-time adaptation for image regression: by aligning target features within the source predictive support and calibrating residual spectral slack, it delivers significant gains under distribution shift.
By explicitly bridging the gap between on-body appearances and flat layouts, BridgeDiff achieves state-of-the-art virtual try-off results, generating more realistic and structurally sound flat-garment representations.
Unlock real-time semantic SLAM and multimodal interaction with 3D Gaussian Splatting using X-GS, a unified and extensible open framework.
Steer clear of catastrophic forgetting in VLMs with EvoPrompt, a new method that evolves prompts by preserving learned semantic directions while adapting their magnitude.
State-of-the-art skeleton-based action recognition is now possible through a game-theoretic contrastive learning framework that maximizes action-relevant information while minimizing encoding redundancy.
Large models are emerging as a promising new paradigm for translating complex-layout document images, as shown by the ICDAR 2025 DIMT competition.
BinaryAttention proves you can more than halve the runtime of attention in vision and diffusion transformers without sacrificing accuracy, simply by using the sign of queries and keys.
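The sign-only scoring idea can be sketched in a few lines of NumPy; this is a minimal hypothetical version assuming standard softmax attention otherwise, not the paper's implementation:

```python
import numpy as np

def sign_attention(Q, K, V):
    # Scores use only the signs of queries and keys, so the score matmul
    # reduces to +/-1 arithmetic (cheap integer/bit ops in practice);
    # softmax and the value matmul stay in full precision.
    scores = (np.sign(Q) @ np.sign(K).T) / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = sign_attention(Q, K, V)  # shape (4, 8)
```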
By explicitly modeling how abnormalities relate within and across different medical image views, GIIM achieves significantly higher diagnostic accuracy and robustness, even with incomplete data.
Stream 3D Gaussian Splatting scenes with higher visual quality and lower bandwidth by predicting user viewpoints and dynamically adapting bitrate using deep reinforcement learning.
A 4B-parameter model outperforms Gemini-3-Pro in autonomous driving by incorporating physics-informed constraints and style-aware training, suggesting specialized models can surpass larger, general-purpose models in domain-specific tasks.
A complete, GPU-accelerated bimanual mobile manipulation platform can be built for under $1300, opening up robotics research and education to a wider audience.
Achieve near-FP32 image restoration performance with an Int8 model that runs at 442 FPS on NVIDIA Jetson Orin, all thanks to a quantization-aware distillation framework that avoids decoder distillation.
VLMs still struggle to understand our planet, as revealed by a new geospatial benchmark spanning diverse Earth observation tasks and multi-source sensing data.
Forget blurry sketch-to-image outputs: this method uses component-aware self-attention and coordinate-preserving fusion to generate photorealistic images with unprecedented fidelity and spatial accuracy.
Despite diverse formulations, ToF NLOS imaging methods hit similar performance walls in resolution and noise sensitivity when hardware is held constant, suggesting diminishing returns from algorithmic improvements alone.
By computing the *difference* between attention maps, DCAU-Net achieves state-of-the-art medical image segmentation while dramatically reducing computational cost compared to standard self-attention.
By incorporating language guidance into federated learning, SurgFed tackles the long-standing problem of tissue and task heterogeneity in surgical video understanding, leading to improved segmentation and depth estimation across diverse surgical settings.
Finally, a GelSight-style sensor that doesn't force you to choose between pre-contact vision and high-fidelity tactile sensing.
Ditch the flat scene graphs: TopoOR models surgical environments as higher-order topological structures, unlocking superior performance in safety-critical tasks by preserving complex relationships and multimodal data.
Precisely steer text-to-image generation along cognitive dimensions like valence and memorability with CogBlender, a framework that lets you dial in psychological intent.
Latency in VR conferencing hurts social presence, but this study quantifies the perceptual and cognitive mechanisms at play to guide system optimization.
Event cameras can now estimate depth with significantly improved temporal consistency and accuracy thanks to a novel distillation approach from video foundation models, achieving a 53% reduction in depth error.
Task demands in remote AR collaboration dictate how much network delay users can tolerate before perceived fluency breaks down, paving the way for adaptive systems.
Unlock the power of web videos for embodied AI: implicit geometry representations let agents learn to navigate from real-world room tours without relying on fragile 3D reconstruction.
By representing visual inputs as 3D Gaussian primitives, GST-VLA unlocks a new level of geometric understanding for vision-language-action models, leading to substantial performance gains in robotic manipulation tasks.
Reverse image search, a key tool for visual fact-checking, often amplifies misinformation and irrelevant content, burying debunking information.
ConvNets strike back: a ConvNeXt-based diffusion model matches Transformer performance at half the FLOPs and 7x faster training, all on just 4 GPUs.
Achieve real-time super-resolution ultrasound without labeled data using CycleULM, a CycleGAN-based framework that boosts image contrast by 15.3 dB and localization precision by 46%.
Chamfer distance, the workhorse loss for point cloud tasks, can actually *increase* when you optimize it, unless you use non-local coupling to avoid gradient collapse.
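For context, the standard nearest-neighbor-coupled Chamfer distance the blurb refers to looks like this in NumPy; each point receives gradients only from its single nearest neighbor in the other set, which is the local coupling a non-local variant would relax:

```python
import numpy as np

def chamfer(P, Q):
    # Pairwise squared distances between the two point sets.
    d = np.sum((P[:, None, :] - Q[None, :, :]) ** 2, axis=-1)
    # Each point is coupled only to its nearest neighbor in the other set.
    return d.min(axis=1).mean() + d.min(axis=0).mean()

P = np.array([[0.0, 0.0], [1.0, 0.0]])
Q = np.array([[0.0, 0.0], [1.0, 1.0]])
cd = chamfer(P, Q)  # 0.5 + 0.5 = 1.0
```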
Imagine writing a script and instantly seeing it come to life – Doki makes generative video authoring as intuitive as writing a text document.
Combining pre-trained and custom neural networks with data augmentation and transfer learning yields a robust autonomous driving system capable of accurately perceiving and reacting to its environment.
Finally, a single model that can generate both your face and voice, convincingly controlled by text prompts and reference clips.
Provably secure steganography can now withstand real-world image compression and processing thanks to a clever latent-space optimization technique.
Bridge the gap between sparse core samples and continuous wellbore data with a cGAN that synthesizes realistic subsurface images conditioned on well log porosity.
Forget retraining: this guideline-aware AI agent instantly adapts to new radiotherapy protocols, outperforming supervised models in clinical preference.
Reconstructing and simulating wind-driven dynamics from video is now possible with a new differentiable framework that enforces fluid dynamics laws.
Panoramic vision-language models can achieve a level of holistic scene understanding and robustness in adverse conditions that's impossible for traditional pinhole-based VLMs.
A new video-based reward model beats GPT-5.2 and Gemini-3 Pro at evaluating computer-using agents, offering a scalable, model-agnostic alternative to traditional methods.
Adapt your action anticipation model on-the-fly to new viewpoints (egocentric or exocentric) with a novel test-time adaptation method that leverages multi-label prototype growing and dual-clue consistency.
Achieve 45x compression of 3D Gaussian Splatting data while *improving* visual fidelity by over 10% with a streaming-friendly octree-based codec.
Ditch global embeddings for text-motion retrieval: this method uses joint-angle motion images and token-patch late interaction to achieve state-of-the-art accuracy and interpretability.
By explicitly modeling per-splat appearance variance, VarSplat enables more robust 3D Gaussian Splatting SLAM, particularly in low-texture or reflective environments where existing methods struggle.
A plug-and-play module, RESBev, fortifies BEV perception against sensor degradation and adversarial attacks by learning latent BEV state transitions, offering a practical route to more reliable autonomous driving systems.
Worsening of a specific lung abnormality called PPFE, easily measurable on routine lung cancer screening CT scans, strongly predicts earlier death and respiratory problems.
RiO-DETR makes real-time oriented object detection with transformers a reality by cleverly decoupling angle estimation and injecting angular diversity into dense supervision.
Existing vision-language models fall flat when it comes to spotting time-dependent robot errors, but TIMID nails it with weak supervision and a clever VAD architecture.
By explicitly modeling and mitigating the confounding effects of visual context, CIGPose achieves state-of-the-art whole-body pose estimation, outperforming previous methods even without relying on extra training data.
FetalAgents leapfrogs existing fetal ultrasound analysis tools by dynamically orchestrating specialized AI agents, outperforming monolithic models across diverse clinical tasks and delivering structured clinical reports from video streams.
DRIFT achieves state-of-the-art object detection performance on 4D radar point clouds by fusing local and global contexts with a novel dual-representation transformer architecture.
By fusing confidence-weighted point cloud projections with a Kalman-inspired update mechanism, ConfCtrl enables diffusion models to generate geometrically consistent novel views from sparse inputs, even under significant viewpoint shifts.
Ditch brittle point-guided line matching: this VIO system uses optimal transport on learned line descriptors for globally consistent correspondences, boosting robustness in challenging visual conditions.
By learning visual representations from scene-level semantics down to pixel-level details, C2FMAE overcomes the limitations of both contrastive learning and masked image modeling.
A single spatial token, learned via occupancy prediction on a massive dataset, is surprisingly effective at injecting crucial spatial awareness into vision-language navigation, leading to state-of-the-art performance.
MLLMs struggle with visually rendered text not because they can't reason, but because they can't *read* it, and a simple self-distillation fix closes the gap.
Unlock high-fidelity 3D reconstruction for curved visuotactile sensors with just a few simple contacts, thanks to a new physics-consistent calibration framework.