Search papers, labs, and topics across Lattice.
100 papers published across 4 labs.
Achieve both low-bitrate perceptual video compression and practical scalability with ProGVC, a framework that unifies progressive transmission, efficient entropy coding, and detail synthesis.
Image-conditioned video diffusion models can now be fine-tuned to produce more realistic motion dynamics and long-term temporal coherence via a novel reward-driven approach that avoids common pitfalls like reward hacking.
By explicitly reasoning in 3D, VolumeDP leaps ahead of 2D-based imitation learning methods, achieving a remarkable 14.8% improvement on the LIBERO benchmark and robust real-world generalization.
LLMs can be prompted to generate part-aware instructions that substantially improve open-vocabulary 3D affordance grounding by linking semantically similar affordances and refining geometric differentiation.
The field of video understanding is rapidly shifting from isolated pipelines to unified models capable of adapting to diverse downstream tasks, demanding a re-evaluation of current approaches.
Unleash creativity in text-to-image models with a single, reusable 64-token template, sidestepping costly iterative prompt engineering and reasoning.
Even with a 98:1 test-to-train ratio, PEFT methods like QLoRA can unlock surprisingly strong generalization from billion-parameter vision models for agricultural image classification, suggesting that underfitting, not overfitting, is the bigger risk.
SAM3 disappoints in eye image segmentation, failing to surpass SAM2's performance despite its new concept prompting mode.
By treating 3D scene editing as goal-regressive planning rather than pure generation, Edit-As-Act achieves instruction fidelity, semantic consistency, and physical plausibility that existing methods miss.
This model beats clinical reports in quantitative coronary angiography, opening the door to automated, comprehensive assessment of coronary artery disease at the point of care.
Achieve stable, real-time kilometer-scale autonomous driving simulations by generating vector-graph tiles incrementally using a novel diffusion flow approach.
Forget verbose instructions: this new VLN paradigm uses floor plans to guide navigation with concise commands, boosting success rates by 60%.
Existing 3D visual grounding methods crumble in complex scenes, but PC-CrossDiff's dual-level attention unlocks a +10% accuracy boost by parsing subtle spatial cues.
Video diffusion transformers exhibit a hidden "magnitude hierarchy" in their activations that can be exploited for training-free quality improvements via a simple steering method.
Forget geometric LODs: tokenizing 3D shapes by semantic salience unlocks SOTA reconstruction and efficient autoregressive generation with 10x-1000x fewer tokens.
Achieve state-of-the-art fine-grained visual recognition without training by adaptively invoking reasoning in a Large Vision-Language Model only when needed, significantly reducing computational overhead.
Denoised eye-tracking heatmaps dramatically boost the generalization of iris presentation attack detection, outperforming hand annotations and even self-supervised DINOv2 features.
Generate consistent stereo videos directly from RGB data, bypassing depth estimation and monocular-to-stereo conversion, with StereoWorld's novel camera-aware attention mechanisms.
Forget fixed layer counts: LaDe generates fully editable, layered media designs with a *flexible* number of semantically meaningful layers, outperforming existing methods in text-to-layer alignment.
Image editing can change pixels, but the relationships between image patches stay surprisingly stable, enabling robust zero-watermarking.
Class reweighting and anatomy-guided decoding can substantially improve the performance of video analysis pipelines for rare events in imbalanced gastrointestinal datasets.
Legged robots can now perform robust parkour with a 1-meter visual blind zone, thanks to a novel architecture that tightly couples vision, proprioception, and physics-based state estimation.
Forget training separate models for each compression level; this framework achieves state-of-the-art extreme image compression with flexible bitrate control using a single diffusion-based arbitrary-scale super-resolution model.
MLLMs' image segmentation prowess isn't a given: a critical adapter layer actually *hurts* performance, forcing the LLM to recover via attention-mediated refinement.
Synthetic data and virtual environments are rapidly becoming indispensable for autonomous driving, but realizing their full potential requires tackling challenges like Sim2Real transfer and scalable safety validation.
Injecting "beneficial noise" into cross-attention mechanisms can significantly improve unsupervised domain adaptation by forcing models to focus on content rather than style distractions.
Forget real-world video datasets: training VLMs on just 7.7K synthetic videos with temporal primitives beats 165K real-world examples, unlocking surprisingly effective transfer learning for video reasoning.
Injecting semantic information from related modalities early in the embedding process significantly boosts performance on multimodal medical image classification tasks.
Achieve state-of-the-art semantic 3D reconstruction from sparse views by intelligently pruning redundant Gaussians and blending 2D and 3D semantic cues.
Simply translating symbolic sign language notations into natural language unlocks significantly better motion generation when conditioning on phonological attributes with CLIP.
Finally, a unified framework lets you control both facial appearance and voice timbre for personalized audio-video generation across multiple identities.
Unlock automated creation of production-ready 3D assets from untextured meshes with TAPESTRY, which generates geometrically consistent turntable videos that can be back-projected into UV textures or used to supervise neural rendering.
Counterintuitively, the most *unreliable* samples in medical imaging datasets—those with fluctuating confidence and frequent forgetting during training—are the *most* informative for building accurate decision boundaries.
Interactive avatars can now exhibit more emotionally appropriate and contextually aware facial behaviors thanks to a novel architecture that disentangles audio-driven lip movements from user-driven non-lip facial expressions.
By disentangling semantic and contextual cues in vision-language models, PCA-Seg achieves state-of-the-art open-vocabulary segmentation with only 0.35M additional parameters per block.
Training video diffusion models with pixel-wise losses just got a whole lot cheaper: ChopGrad reduces memory complexity from linear to constant in video length.
Radiologist dictation, combined with foundation models and minimal parameter updates, can achieve state-of-the-art MRI brain tumor segmentation.
Forget prompt engineering: this new region proposal network spots objects across diverse datasets without *any* text or image prompts.
Flash photography reveals subtle material differences in fingerprints, enabling more robust spoof detection compared to traditional single-image methods.
Achieve high-fidelity transparent text animations from image-to-video models without retraining the VAE, sidestepping data scarcity and latent pattern mixing issues.
Animate 3D characters using bananas and plush toys – DancingBox turns everyday objects into motion capture proxies, making animation accessible to novices.
Forget fine-tuning: this method uses smart patch selection to adapt frozen LVLMs for deepfake detection, outperforming baselines without any training.
Facial micro-movements betray your cognitive load, revealing a new pathway to real-time workload monitoring using just a webcam.
Reconstructing realistic 3D human crowds from a single image is now possible, thanks to a new method that cleverly handles occlusions and appearance variations.
Ditch the overconfident posteriors: Structured SIR offers a memory-efficient way to capture complex, multi-modal uncertainty in high-dimensional image registration, outperforming variational inference.
Forget expensive 3D training data: Loc3R-VLM shows how to give 2D vision-language models strong 3D spatial reasoning by distilling knowledge from a pretrained 3D foundation model using only monocular video.
By cleverly using readily available video segmentation masks, this method boosts DINOv2's point tracking performance by over 14% – a surprisingly effective way to inject temporal awareness into static image-pretrained models.
Drones can now land safely in complex, unknown environments using only a camera, thanks to a new system that dynamically maps and reacts to surroundings in real-time.
Sound source localization gets a reliability upgrade: conformal prediction delivers uncertainty estimates, even when you don't know how many speakers are talking.
Overcome scarce data and boost material classification accuracy by generating synthetic training data and distilling knowledge from vision-language foundation models.
Current PII detection models are blind to the transaction-level identifiers and partially-filled forms that computer-use agents readily expose, but a new benchmark closes the gap.
Even when visual data is missing or noisy, EgoAdapt accurately determines who is talking to the camera wearer by adaptively integrating head orientation, lip movement, and robust audio features.
Unlock accurate monocular 3D object tracking with minimal annotation: Sparse3DTrack achieves state-of-the-art performance using only a handful of labels per track.
Robot world models can be significantly improved by directly rewarding them for generating videos that lead to physically plausible robot actions, even if the videos themselves contain visual artifacts.
A complete autonomy stack enables centimeter-level localization and mapping on the moon, even without GPS.
Image editing models leak fascinating hints about their world knowledge through "edit spillover"—unintended changes to semantically related regions—and this paper turns that leakage into a probe.
SpiderCam shatters power consumption barriers for FPGA-based 3D cameras, achieving sub-Watt operation while maintaining real-time performance.
A new prompting strategy closes the gap between general-purpose and specialized cell segmentation models, suggesting a path to more efficient adaptation.
Steganography gets smarter: this framework hides data more effectively by adapting the amount of information concealed in each pixel based on image complexity and payload size.
Unlock scalable aerial scene understanding with SegFly, a massive RGB-T dataset generated via a novel 2D-3D-2D label propagation technique that requires minimal manual annotation.
Ditch the diffusion vs. autoregressive debate: this VLA framework uses diffusion to *draft* actions and an autoregressive model to *verify* them, boosting real-world success by nearly 20%.
Quantizing large vision-language models just got a whole lot better: a new token-level sensitivity metric closes the accuracy gap with full-precision models by up to 1.6% in 3-bit weight-only quantization.
Achieve 4K image-to-video generation with diffusion models without training by cleverly fusing tiled denoising with a low-resolution latent prior, balancing detail and global coherence.
CLIP struggles with fine-grained details in cross-domain few-shot learning, but a cycle-consistency method can fix its vision-language alignment and boost performance.
Synthesizing realistic intermediate video frames just got a whole lot better, thanks to a novel attention mechanism that anchors to keyframes and text prompts for improved consistency and semantic alignment.
Achieve SE(3) equivariance and memory scalability in point cloud analysis with coordinate-based kernels, outperforming state-of-the-art equivariant methods on diverse tasks.
Achieve state-of-the-art anomaly detection in multi-class and continual learning scenarios with AdapTS, a teacher-student framework that slashes memory overhead by up to 149x compared to existing methods.
By probabilistically fusing visual context into text prompts, VirPro closes the semantic gap in weakly-supervised 3D detection, boosting performance by nearly 5% on KITTI.
Mamba, the darling of sequence modeling, now powers a GAN that beats StyleGAN2-ADA in image synthesis, thanks to a clever latent space routing trick.
By cleverly turning novel view synthesis into a self-supervised inpainting problem, VisionNVS eliminates the need for ground truth images of novel views, outperforming LiDAR-dependent baselines.
YOLO can learn faster and better by strategically skipping redundant images during training, achieving a 1.43x speedup and improved accuracy with a new Anti-Forgetting Sampling Strategy.
Even with environmental noise, a VAE-based anomaly detector can spot adversarial attacks on collaborative DNNs with high accuracy.
Unlock the power of MLLMs for structured data like human skeletons with a differentiable rendering approach that allows end-to-end training.
By fusing IMU-derived egomotion with visual data, Motion-MLLM lets MLLMs achieve SOTA 3D scene understanding with 40% less compute.
By unifying layout-to-image generation and image grounding with a novel cycle-consistent learning approach, EchoGen achieves state-of-the-art results in both tasks, proving that solving two problems at once can be better than solving them separately.
Forget finetuning: DynaEdit unlocks complex video edits like action modification and object insertion, all without training, using clever manipulation of pretrained text-to-video models.
Forget waiting minutes for iterative optimization – Omni-3DEdit performs diverse 3D editing tasks in a single forward pass.
Dashcam videos can now be directly linked to legal responsibility determinations via a novel multimodal dataset and legal reasoning framework, outperforming existing LLMs and agent-based systems.
By adaptively calibrating facts and augmenting emotions, FACE-net overcomes the factual-emotional bias that plagues emotional video captioning.
An AI model can estimate legal age from clavicle CT scans with higher accuracy than human experts, potentially revolutionizing forensic age assessment.
A new prompt-free medical image segmentation model achieves impressive zero-shot and cross-modal transfer performance by explicitly disentangling geometric and semantic anatomical knowledge.
By reorganizing 3D scenes into structurally-aware subscenes, S-VGGT offers a parallel geometric bridge for efficient processing, slashing global attention costs without compromising reconstruction fidelity.
Skip the costly training and go straight to open-vocabulary 3D reasoning with ReLaGS, which builds a 3D semantic scene graph from language-distilled Gaussians.
A new RGB-T dataset and frequency-aware network expose the surprising limitations of existing UAV detectors when faced with real-world camouflage and complex backgrounds.
Anonymized faces don't have to be expressionless blobs: this method preserves realistic expressions and lighting while scrambling identity.
Overcome weather limitations in remote sensing with MM-OVSeg, a multimodal Optical-SAR fusion framework that enables robust open-vocabulary segmentation even under cloudy conditions.
AI spots a hidden pattern in lung scans of lupus patients, revealing that specific airway dilations in the upper lobes could be a telltale sign of interstitial lung disease.
Grabbing two keyframes per shot – one for the gist, one for the glitch – lets you compress videos for VLMs without missing critical anomalies.
RLHF for autoregressive video generation gets a boost with AR-CoPO, which overcomes the limitations of SDE-based methods by using chunk-level alignment and a semi-on-policy training strategy.
RIS models struggle with motion-based queries, but a new data augmentation and contrastive learning approach closes the gap without sacrificing performance on appearance-based descriptions.
Achieve competitive video generation with Stable Diffusion using only 2.9% additional parameters by adapting temporal attention based on motion content, outperforming methods with explicit temporal consistency losses.
NeRFs can now guide extraterrestrial rovers around unexpected obstacles, thanks to a novel planning framework that blends local observations with global terrain understanding.
Surprisingly, you can achieve smooth, controllable image editing in text-to-image models without any training, just by intelligently nudging the text embeddings.
Panoramic 3D reconstruction gets a boost with PanoVGGT, a Transformer that handles spherical distortions and global-frame ambiguity to deliver state-of-the-art accuracy in a single pass.
Gesture-aware pretraining unlocks significant improvements in 3D hand pose estimation, proving that semantic gesture information acts as a powerful inductive bias.
Differential attention and asymmetric loss functions can significantly improve the performance of BiomedCLIP on highly imbalanced video classification tasks like identifying rare pathologies in video capsule endoscopy.
Reconstructing complete, animatable 3D avatars from heavily occluded YouTube videos is now possible, thanks to a hallucination-as-supervision pipeline using diffusion models.
Medical vision-language models perform better when the modality gap is tuned to an intermediate level, challenging the assumption that minimizing it is always optimal.
By focusing on semantic differences between scans, DiffVP lets LLMs generate more accurate CT reports without needing explicit lesion localization.
An 8B parameter model, RideJudge, outperforms 32B baselines in ride-hailing dispute adjudication by aligning visual semantics with evidentiary protocols, achieving 88.41% accuracy.