100 papers published across 2 labs.
Ditch the feature extraction pipeline: GenMask directly generates segmentation masks with a diffusion transformer, achieving SOTA results by harmonizing mask and image generation in a single model.
Cost volumes might be overkill: WAFT-Stereo proves you can ditch them for a warping-based approach and still dominate stereo matching benchmarks with significantly improved efficiency.
Forget redrawing diagrams by hand: VFIG, a new vision-language model, can automatically convert rasterized figures into editable SVGs with near GPT-5.2 quality.
Forget random back-view hallucinations – Know3D lets you *prompt* the unseen side of 3D models using language, opening the door to controllable 3D asset creation.
Representation-Pivoted Autoencoders enable diffusion models to generate and edit images with higher fidelity by learning a compressed latent space that preserves the semantics of pre-trained visual representations.
Forget generating plausible-but-fake details: 3DreamBooth bakes a robust 3D prior into video generation models using only a single-frame optimization, enabling truly view-consistent customized subject videos.
Even with only 5% labeled data, Switch achieves ultrasound segmentation accuracy exceeding fully supervised methods, thanks to its clever multiscale and frequency-domain switching.
Explicitly reconstructing 3D scenes with Gaussian Splatting unlocks state-of-the-art BEV perception, proving that geometric understanding is key to accurate spatial reasoning.
Fine-tuning a visual geometry transformer with SEAR unlocks surprisingly accurate RGB-Thermal 3D reconstruction, even surpassing SOTA methods despite training on significantly less multimodal data.
Closed-loop feedback using VLMs can dramatically improve text-to-image generation quality, even without additional training.
Linear classification, a cornerstone of machine learning, is provably harder than we thought in high dimensions.
Unlock 4-15% faster Gaussian Splatting without retraining your existing datasets by swapping in a polynomial kernel.
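A quick sketch of the kernel swap, assuming the idea is to replace the per-splat Gaussian falloff with a compactly supported polynomial; the specific form `(1 - r2/cutoff)**n` and its parameters are illustrative guesses, not the paper's kernel.

```python
import numpy as np

def gaussian_kernel(r2):
    """Standard Gaussian falloff in 3DGS rasterization (r2 = squared distance)."""
    return np.exp(-0.5 * r2)

def polynomial_kernel(r2, cutoff=9.0, n=2):
    """Compactly supported polynomial stand-in: no exp() per fragment, and
    exactly zero beyond the cutoff, so those fragments can be culled early."""
    t = 1.0 - r2 / cutoff
    return np.where(t > 0.0, t ** n, 0.0)

# Drop-in at evaluation time: alpha = opacity * kernel(r2). No retraining is
# needed because the splat parameters themselves are untouched.
r2 = np.linspace(0.0, 12.0, 7)
print(gaussian_kernel(r2))
print(polynomial_kernel(r2))
```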
CNNs still reign supreme in Burmese handwritten digit recognition, but physics-inspired PETNNs are hot on their heels, outperforming Transformers and KANs.
Forget waiting a minute for garment generation: SwiftTailor slashes inference times while boosting accuracy by representing 3D garments as geometry images.
Generative videos might look great, but a new metric reveals they often suffer from jarring 3D spatial inconsistencies that existing metrics miss.
Achieve state-of-the-art single image reflection removal by explicitly guiding a diffusion model with spatial intensity and high-frequency priors derived directly from the input image.
Forget brute-force scaling: intelligently selecting just 1% of video frames can actually *improve* video QA accuracy and cut compute by 93%.
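A toy sketch of query-aware frame selection, assuming frames are scored against the question with CLIP-style embeddings; the scoring model and the fixed 1% budget are stand-ins for the paper's selection policy.

```python
import numpy as np

def select_frames(frame_feats, query_feat, budget=0.01):
    """Keep the top `budget` fraction of frames by cosine similarity to the query.

    frame_feats: (T, D) per-frame embeddings; query_feat: (D,) question embedding.
    """
    sims = frame_feats @ query_feat
    sims = sims / (np.linalg.norm(frame_feats, axis=1)
                   * np.linalg.norm(query_feat) + 1e-8)
    k = max(1, int(len(frame_feats) * budget))
    keep = np.argsort(sims)[-k:]        # k most question-relevant frames
    return np.sort(keep)                # restore temporal order for the VLM

T, D = 3000, 512
feats = np.random.randn(T, D).astype(np.float32)
query = np.random.randn(D).astype(np.float32)
print(select_frames(feats, query))     # ~30 of 3000 frames survive the cut
```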
Decomposing uncertainty into aleatoric and epistemic components in image segmentation is often misleading due to substantial entanglement, but ensembles offer a surprisingly robust and less entangled alternative.
Ditch one-hot vectors: representing facial action units as natural language unlocks more realistic and nuanced facial expression synthesis, especially when dealing with conflicting muscle movements.
Scribble prompts beat point prompts for interactive surgical segmentation, achieving state-of-the-art Dice scores with fewer interactions.
Object detectors in new visual domains suffer from "astigmatism," but mimicking the human eye's foveal vision can bring them into focus.
Forget hand-crafted assets and heuristics: V-Dreamer uses video generation models to automatically create diverse, physically plausible robotic simulation environments and trajectories directly from language.
Differentiable collision checking in configuration space, previously a major hurdle, is now achievable with zero-shot generalization thanks to CSSDF-Net.
Instruction-guided video editing can achieve impressive zero-shot performance simply by pre-training on motion-centric video restoration tasks *before* fine-tuning on paired editing data.
Achieve more physically realistic video generation by explicitly modeling 3D geometry and physical attributes across multiple viewpoints.
You can predict viewers' engagement with and attraction to a video lecture just by analyzing the speaker's face and voice, with no audience data needed.
VLMs can now better detect when they're seeing something they shouldn't, even as the world changes around them, thanks to a new method that dynamically fuses visual and textual cues.
Current video object removal methods leave distracting visual artifacts behind, but EffectErase tackles this problem head-on by jointly removing objects and their pesky visual effects.
Get faithful and plausible natural language explanations for chest X-rays with as few as 5 human-annotated examples per diagnosis, and even boost classification accuracy in the process.
Unlock real-time 3D understanding: MonoArt achieves state-of-the-art monocular articulated object reconstruction without relying on multi-view data or external motion templates.
Achieve 9x lower trajectory error and 3x better FID in motion generation by using a diffusion-based discrete motion tokenizer that elegantly handles both semantic and kinematic constraints.
VLMs struggle with spatial reasoning, but a clever decomposition into sub-problems and probabilistic recombination unlocks significantly better metric-semantic grounding.
Unlocking fairer vision-language models may be as simple as intervening in the sparse latent space of a sparse autoencoder, enabling targeted bias removal without harming performance.
Get continuous level-of-detail rendering in 3D Gaussian Splatting without sacrificing top-end quality – no architectural changes needed.
Autonomous driving models can be made significantly more robust and safe by explicitly de-confounding their training via causal intervention, eliminating reliance on spurious correlations.
Forget generic textures – CustomTex lets you clone real-world object appearances onto your 3D scenes with uncanny fidelity.
Proactive VideoLLMs can finally be both accurate AND efficient thanks to a novel propose-match framework that decouples semantic understanding from streaming perception.
Encoding realism as a knowledge graph of interpretable traits unlocks zero-shot sim2real image translation that outperforms state-of-the-art diffusion methods.
Ditch the handcrafted noise schedules: spectral analysis unlocks per-image diffusion schedules that boost generative quality, especially when you're racing against the clock with few steps.
Autoregressive generative classifiers can beat diffusion models at image classification, but only if you marginalize over token order.
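A sketch of what order marginalization means here, assuming a class-conditional autoregressive model over image tokens; `log_prob_given_order` is a hypothetical hook into such a model, not an API from the paper.

```python
import numpy as np

def logmeanexp(lls):
    """Numerically stable log of the mean of exp(lls)."""
    lls = np.asarray(lls)
    m = lls.max()
    return m + np.log(np.mean(np.exp(lls - m)))

def classify(tokens, classes, log_prob_given_order, n_orders=8, seed=0):
    """Order-marginalized generative classification: estimate
    log p(x | y) = log E_order[ p(x | y, order) ] by Monte Carlo over random
    token permutations, then pick the argmax class."""
    rng = np.random.default_rng(seed)
    scores = {}
    for y in classes:
        lls = [log_prob_given_order(tokens, y, rng.permutation(len(tokens)))
               for _ in range(n_orders)]
        scores[y] = logmeanexp(lls)
    return max(scores, key=scores.get)

# Toy usage with a dummy "model" that assigns class 1 the highest likelihood:
dummy = lambda toks, y, order: -abs(y - 1) * len(toks)
print(classify(np.arange(16), classes=[0, 1, 2], log_prob_given_order=dummy))  # -> 1
```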
A new dataset and model specifically designed for traffic anomaly understanding in roundabouts could pave the way for more robust and efficient intelligent transportation systems.
Simpler fingerprint enhancement techniques can outperform complex state-of-the-art methods, especially on low-quality images.
Achieve state-of-the-art panoramic depth estimation without any training by cleverly exploiting the 3D consistency priors embedded within existing vision foundation models.
Unsupervised contrastive learning can now outperform supervised methods for 3D shape matching, while simultaneously slashing computational costs.
Text-to-image synthesis just got almost 4x faster without sacrificing image quality, thanks to a clever twist on Speculative Jacobi Decoding that keeps the generation process moving even when initial drafts are rejected.
Achieve topologically-aware image segmentation without cumbersome architectures or expensive computations: SCNP makes it easy.
Compact ViTs can now rival or surpass CNN-based architectures like YOLO for edge-based object detection, instance segmentation, and pose estimation, thanks to task-specialized distillation.
Ditch the training: SVOO achieves up to 1.93x speedup in video generation with sparse attention by exploiting the intrinsic, layer-specific sparsity patterns of attention without any fine-tuning.
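A minimal sketch of training-free top-k sparse attention with a per-layer keep ratio; the ratio values, and how SVOO actually profiles each layer's sparsity pattern, are assumptions here.

```python
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, keep_ratio):
    """Keep only the strongest `keep_ratio` fraction of attention logits per
    query and mask the rest before softmax. Training-free: the kept logits are
    exactly the dense model's logits. (A real speedup needs block-sparse
    kernels; this dense masking only illustrates the math.)"""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5     # (B, H, Tq, Tk)
    n_keep = max(1, int(scores.shape[-1] * keep_ratio))
    thresh = scores.topk(n_keep, dim=-1).values[..., -1:]      # per-row k-th value
    scores = scores.masked_fill(scores < thresh, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 8, 256, 64)
out = sparse_attention(q, k, v, keep_ratio=0.1)  # e.g. a highly sparse layer
print(out.shape)
```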
CNNs still reign supreme for medical image segmentation on heterogeneous datasets, beating out hybrid transformer models despite the latter's theoretical advantages.
Deep learning can rescue visual-inertial odometry (VIO) from textureless environments and rapid lighting changes.
Automating the motor insurance pipeline, from vehicle damage analysis to claims evaluation, is now possible with a vertically integrated AI paradigm.
DriveTok achieves unified multi-view reconstruction and understanding by learning scene tokens that integrate semantic, geometric, and textural information, outperforming existing 2D tokenizers in autonomous driving scenarios.
State Space Models can outperform Vision Transformers as vision encoders in VLMs, particularly when model size is a constraint.
Achieve atomic-scale clarity in noisy HRTEM images with a novel denoising network that intelligently exploits statistical characteristics in both spatial and frequency domains.
Diffusion models can now generate rare concepts and execute complex edits with greater fidelity, thanks to a training-free prompt blending technique that leverages statistical properties of the diffusion process itself.
Ditch the finetuning: this training-free method uses attention scores to generate rare concepts in images with more precision and control than LLM-guided approaches.
DROID-SLAM achieves robust real-time RGB SLAM in dynamic environments by explicitly modeling per-pixel uncertainty, outperforming existing methods that struggle with unknown dynamic objects and cluttered scenes.
Even with malicious clients flipping labels, FedTrident recovers federated learning performance to near attack-free levels, outperforming existing defenses by up to 9.49% in critical metrics.
Aligning diffusion models with just 100 carefully selected samples can beat state-of-the-art preference optimization methods trained on thousands, and converge up to 220x faster.
Achieve near-perfect radio map reconstruction (SSIM 0.9752, PSNR 36.37 dB) from limited data by injecting electromagnetic theory into diffusion models.
You can get state-of-the-art performance on retinal fundus image tasks with an interpretable foundation model that's 16x smaller than the alternatives.
Over-reliance on neighborhood similarity in source-free domain adaptation hurts performance; ProCal offers a way to dynamically calibrate predictions and improve generalization.
MRI reconstruction can be made dramatically more robust to clinical domain shifts by eliminating the need for explicit coil sensitivity map estimation.
Achieve topologically coherent coronary vessel segmentation by directly optimizing for geometric structure, rather than pixel-wise accuracy, using preference-based learning.
Real-time robotic perception just got a major upgrade: OnlinePG achieves open-vocabulary panoptic mapping with 3D Gaussian Splatting, enabling robots to understand and interact with environments in a way that was previously impossible.
Synthesized PET scans from MRI can nearly match the diagnostic accuracy of real PET for Alzheimer's, potentially unlocking wider access to crucial functional insights.
Visual language models can now explicitly reason about object trajectories in videos, thanks to a simple yet effective method that augments training data and uses discrete motion tags.
LVLMs can gain a surprising amount of spatial reasoning ability by explicitly generating segmentation and depth tokens before answering questions.
LLMs can navigate more efficiently in unfamiliar environments by reasoning over a tree of possible paths, not just isolated waypoints, enabling them to consider en-route information gain and prune unpromising branches.
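A compact sketch of reasoning over a path tree rather than isolated waypoints; the `info_gain` hook is a hypothetical stand-in for whatever LLM-based estimate the paper uses, and branch pruning is omitted for brevity.

```python
from dataclasses import dataclass, field

@dataclass
class PathNode:
    waypoint: str
    children: list = field(default_factory=list)

def best_path(node, info_gain, depth=3):
    """Depth-limited search over the path tree: score each branch by its
    cumulative en-route information gain and keep the best one."""
    if depth == 0 or not node.children:
        return info_gain(node.waypoint), [node.waypoint]
    best_score, best = float("-inf"), None
    for child in node.children:
        score, path = best_path(child, info_gain, depth - 1)
        if score > best_score:
            best_score, best = score, path
    return info_gain(node.waypoint) + best_score, [node.waypoint] + best

tree = PathNode("start", [PathNode("hallway", [PathNode("kitchen")]),
                          PathNode("stairs", [PathNode("bedroom")])])
gain = {"start": 0, "hallway": 2, "kitchen": 3, "stairs": 1, "bedroom": 5}.get
print(best_path(tree, gain))   # picks start -> stairs -> bedroom (gain 6)
```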
Radiometric disentanglement from a single image becomes tractable by exploiting the shared illumination constraint across multiple objects, enabling stochastic sampling of reflectance, texture, and illumination.
Detecting subtle building changes gets a boost: a new RGB-NIR dataset and network reveal the power of multi-modal fusion for teasing out fine-grained differences.
Ditch the mask decoder: a single segmentation token can unlock competitive image segmentation directly from MLLMs.
Reconstructing realistic hand-object interactions from video just got an order of magnitude faster, thanks to a novel Gaussian Splatting approach that ensures physical consistency.
Pixel-perfect geospatial reasoning is now possible, thanks to a vision-language model that adaptively fuses multi-modal and multi-temporal Earth observation data.
Diffusion models can generate segmentations that rival discriminative methods, but only if you reshape their vector fields with a distance-aware correction term that combats gradient vanishing.
Representing complex 3D biomedical graphs as learned tokens unlocks generative modeling and efficient analysis of anatomical structures.
Overcoming occlusion in hand-object pose estimation just got easier: GenHOI leverages hierarchical semantic knowledge and hand priors to achieve state-of-the-art results on challenging benchmarks.
Get GPT-4o-level long-video QA performance with 10x fewer FLOPs by using a hierarchical, training-free frame selector that combines multimodal experts and fuzzy logic.
End-to-end quantum image generation is now possible, even with limited qubits, thanks to a new method that bridges the gap between quantum circuits and pixel intensities.
Hybrid LiDAR-inertial-visual odometry (LIVO) robustly handles visually challenging conditions, outperforming sparse-direct methods by combining direct photometric methods with learning-based feature descriptors.
Smaller open-source models can outperform larger proprietary LVLMs on specific authenticity cues in AI-generated video detection, challenging the assumption that scale alone guarantees better performance.
By combining CNNs and State Space Models, DA-Mamba achieves efficient global-local feature alignment for domain adaptive object detection, outperforming prior CNN-only and Transformer-based approaches.
Achieve state-of-the-art joint audio-video generation with fewer resources by fixing key flaws in cross-modal context handling within dual-stream transformers.
Restoring faces across age gaps is now possible: MeInTime leverages diffusion models and age-aware guidance to create faithful restorations from cross-age references.
Token compression and multi-agent systems are enabling more efficient and interpretable multimodal reasoning in computational pathology, paving the way for trustworthy AI-assisted diagnosis.
High-dimensional discrete tokens, previously out of reach for generative models, can now be directly generated, unlocking a unified token prediction paradigm for multimodal architectures.
Text-to-3D generation gets a semantic upgrade: DreamPartGen creates 3D objects with parts that not only look right but also understand their relationships and align with textual descriptions.
Schrödinger Bridges elegantly unify diffusion models, score-based models, and flow matching under a single, powerful framework.
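For readers wanting the formal anchor, this is the dynamic Schrödinger Bridge problem in its textbook form (standard notation, not necessarily the paper's); score-based diffusion drops out as a half-bridge special case.

```latex
% Among all path measures P with the required endpoint marginals, take the
% one closest in KL to a reference diffusion Q:
\min_{P}\ \mathrm{KL}(P \,\|\, Q)
\quad\text{s.t.}\quad P_0 = \pi_{\mathrm{data}},\ \ P_1 = \pi_{\mathrm{prior}}.
% With reference dX_t = \sqrt{\beta_t}\,dW_t, the optimum is a forward/backward
% SDE pair driven by two potentials \Psi, \widehat{\Psi}:
dX_t = \beta_t \nabla\log\Psi_t(X_t)\,dt + \sqrt{\beta_t}\,dW_t,
\qquad
dX_t = -\beta_t \nabla\log\widehat{\Psi}_t(X_t)\,dt + \sqrt{\beta_t}\,d\bar{W}_t,
% linked by \nabla\log p_t = \nabla\log\Psi_t + \nabla\log\widehat{\Psi}_t.
% Score-based diffusion is the half-bridge case \Psi_t \equiv 1, where the
% backward drift reduces to -\beta_t \nabla\log p_t, and the associated
% probability-flow ODE gives the flow-matching view.
```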
Spatial awareness is the secret ingredient to unlocking better visual in-context learning, boosting performance across diverse vision tasks.
The chaos of multivariate time-series anomaly detection (MTSAD) research gets a little tamer with a new taxonomy that exposes the field's hidden convergence on Transformers and reconstruction, hinting at where the next breakthroughs will come from.
Ditch the slow per-scene optimization: SwiftGS meta-learns transferable priors for satellite surface reconstruction, enabling single-pass 3D recovery.
MLLMs can gain surprisingly strong 3D spatial reasoning abilities simply by tapping into the latent knowledge already present in video generation models.
Color image restoration gets a boost: exploiting saturation-value similarity in nonlocal methods yields significantly better results than relying on individual RGB channels.
Keyword-based concept unlearning is brittle: representing visual concepts with diverse prompts yields stronger erasure, better retention, and improved robustness against adversarial attacks.
Unlock geometry-precise 3D generation by directly conditioning diffusion models on readily available point cloud priors, outperforming existing image- or text-conditioned methods.
Dramatically speed up histopathology super-resolution by adaptively routing image tiles through a flow-matching network, achieving near-lossless quality at a fraction of the compute.
Injecting "historical attention" into vision transformers boosts accuracy by over 1% with minimal architectural changes, suggesting that current ViTs underutilize information learned in earlier layers.
Video diffusion models can be aggressively quantized down to 6-bit precision with minimal quality loss by dynamically adapting the bit-width of each layer based on its temporal stability.
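A toy sketch of the allocation idea, assuming temporal stability is measured as activation variance across denoising steps; the thresholds and the bit menu (a mix averaging around 6 bits) are illustrative, not the paper's calibration procedure.

```python
import numpy as np

def assign_bitwidths(activations_per_step, bits=(4, 6, 8)):
    """Pick a per-layer bit-width from each layer's temporal stability.

    activations_per_step: dict layer_name -> (T, N) array of activations over
    T denoising steps. Stable layers (low variance across steps) tolerate
    aggressive quantization; volatile layers keep more bits.
    """
    out = {}
    for name, acts in activations_per_step.items():
        instability = acts.std(axis=0).mean()       # mean per-unit std over steps
        if instability < 0.1:
            out[name] = bits[0]
        elif instability < 0.5:
            out[name] = bits[1]
        else:
            out[name] = bits[2]
    return out

rng = np.random.default_rng(0)
layers = {"block0": rng.normal(0, 0.05, (50, 128)),   # very stable over steps
          "block1": rng.normal(0, 1.0, (50, 128))}    # volatile
print(assign_bitwidths(layers))                        # {'block0': 4, 'block1': 8}
```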
Agents can now "hallucinate" optimal viewpoints for reasoning by storing and re-rendering scenes with 3D Gaussian Splatting, enabling recovery from initial observation failures.
Medical vision-language models are surprisingly brittle: clinically plausible image manipulations, like those introduced during routine acquisition and delivery, can drastically degrade their performance.