Image recognition, object detection, segmentation, video understanding, and visual generation.
Achieve superior video generation quality and temporal coherence without expensive retraining by intelligently scaling and steering diffusion models at test time.
Cut attention latency in video editing by 60% with near-lossless quality by exploiting query sharpness to dynamically route attention.
Fine-tuning efficient few-step diffusion models no longer requires sacrificing their speed, thanks to a self-distillation approach that preserves inference capabilities.
Existing image-to-image evaluations miss a critical aspect: whether the output image actually preserves the content of the input.
Outlier tokens in Diffusion Transformers aren't just extreme values; they corrupt local patch semantics, and can be tamed with Dual-Stage Registers to boost image generation quality.
Learned image compression finally delivers on its promise: a codec that's not just perceptually superior, but also crushes traditional and learned alternatives in bitrate savings while running blazingly fast on mobile.
Task-aware 3D reconstruction slashes the number of views needed by focusing on the data that actually matters for downstream applications.
Discovering spatial regions and their temporal signatures in massive time series data just got much faster and easier, thanks to a new method that scales log-linearly with the number of time series.
Nanometer-accurate, full-chip CMP modeling is now possible with a fast, FCN-based approach that leapfrogs traditional, resource-intensive methods.
Decoupling radial and angular dynamics in vision-language model adaptation unlocks significant gains in few-shot performance, outperforming existing flow matching methods.
Training vision-language models to detect glaucoma fairly across demographics requires debiasing both text *and* images, which this paper achieves with a novel pretraining strategy.
MLLMs can overcome self-referential bias and improve visual grounding by actively exploring and correcting their cognitive deficiencies, guided by token-level epistemic uncertainty.
Diffusion models' reliance on global information isn't just a quirk – it's fundamentally linked to the moment they commit to a specific semantic outcome.
Turns out, all gaze estimation models stumble when robots look down, and complex architectures aren't the answer – data diversity is the real secret to robust human-robot interaction.
Achieve robust long-horizon visual control by adaptively balancing model-based lookahead with bootstrapping, enabling zero-shot transfer to real-world tasks with severe occlusions.
By embedding whole-slide images in a hybrid hyperbolic-Euclidean space, BatMIL unlocks superior classification performance compared to traditional Euclidean-only methods, revealing the importance of geometric awareness in capturing complex tissue organization.
Finally, a way to judge the *vibes* of your 3D Gaussian Splatting scenes, without needing to render a bunch of images.
End-to-end ML models get smoked in real-world mmWave vehicular connectivity: a hybrid vision-primed approach slashes outage rates by leveraging model-based reasoning and RF feedback.
Hallucinations in diffusion models aren't just mode interpolation gone wrong, but instabilities on the model's manifold, and squashing its local intrinsic dimension can fix them.
A single vision-language foundation model, DART, can perform a full rope inspection workflow, including damage classification, severity estimation, and few-shot recognition, all without task-specific fine-tuning.
Human crowdsourcing struggles to reliably identify audiovisual deepfakes, especially when both audio and video are manipulated, suggesting current detection methods may overestimate human capabilities.
Achieve near-perfect traffic congestion classification by fusing motion-guided visual attention with data-adaptive temporal decomposition, outperforming existing vision-based and signal-based methods.
Identity-preserving video generation just got a whole lot more faithful: FaithfulFaces maintains identity even under extreme pose variations and occlusions, a feat previous methods struggled with.
Achieve 80.5% Top-1 accuracy in zero-shot EEG-to-image retrieval by mimicking the human visual system, substantially outperforming existing methods.
Unsupervised object detection can now achieve category awareness, bridging the gap with supervised methods without needing any labeled data.
Ditching diffusion's noise-denoising, RLFSeg uses Rectified Flow to directly predict segmentation masks from text prompts, unlocking zero-shot performance gains.
Synthesizing high-resolution satellite imagery with geometric precision is now more efficient, thanks to a windowed cross-attention method that rivals existing techniques while better respecting geometric constraints.
Unlock zero-shot 3D scene understanding: Ilov3Splat lets you identify and segment arbitrary objects in 3D scenes using only natural language, no category supervision needed.
Freezing your VAE and permuting high-frequency visual signals unlocks a new SOTA for VLM prompt learning, boosting harmonic-mean accuracy to 81.51%.
Current image difference captioning benchmarks fail to capture semantic consistency and penalize hallucinations, but DiffCap-Bench offers a robust alternative that aligns with human expert judgments and predicts downstream utility for image editing.
VLMs can be easily tricked into "hallucinating" object relationships with simple image rotations or noise, revealing a surprising fragility in their multimodal reasoning.
Open-source image editing models can match or beat fine-tuned models on visual understanding tasks *without any task-specific training*.
Forget backprop and memory lookups: FAAST lets you adapt models at test time with a single forward pass, matching fine-tuning accuracy with massive speed and memory gains.
Vol-Mark offers a way to protect sensitive 3D medical data from tampering and unauthorized copying with a reversible watermarking technique that maintains diagnostic accuracy.
Freezing a text encoder and distilling prompts from vision-language models can stabilize semantics and boost performance in lifelong person re-identification, even across unseen domains.
Training on Syn4D could unlock breakthroughs in dynamic scene understanding, where current datasets fall short in providing dense, complete, and accurate geometric annotations.
Counterintuitively, moderately similar reference images are the key to unlocking accurate VLM-based anomaly localization in medical imaging.
Forget dataset-specific hacks: CPCANet achieves SOTA domain generalization by explicitly learning a structured, domain-invariant subspace with a differentiable CPCA layer.
By intelligently incorporating LiDAR-derived height information, HiPR overcomes limitations of fixed projection spaces, achieving state-of-the-art camera-LiDAR occupancy prediction with real-time performance.
Finally, a driving dataset that doesn't just assume perfectly paved roads, offering 6.5x more depth data than KITTI for realistic autonomous driving scenarios.
Even with limited data, a simple combination of pre-trained CNN features and nearest-centroid classification can achieve surprisingly strong results in monkeypox skin disease classification.
Discrete diffusion, with carefully designed transition matrices for commands and parameters, unlocks superior CAD generation compared to continuous diffusion baselines.
For more reliable animal identification, force your model to reconstruct masked skin patterns, and it will learn embeddings that better capture individual differences.
Generate CT-like images from ultrasound with a transformer-augmented network, potentially reducing the need for harmful radiation exposure.
Encoding cross-task relationships between building footprints and heights slashes height estimation error by 7% – more effective than just refining individual encoders.
Adult-trained human mesh recovery models can now handle kids, too, thanks to a new framework that enforces spatial consistency and leverages VLM-derived age and gender cues.
Synthesizing realistic duet dance motions gets a boost from explicitly modeling inter-dancer contact, leading to significantly improved interaction fidelity and rhythmic synchronization.
Steer LVLMs' attention with caption guidance and watch object hallucinations drop by 6%, with no training required.
Overlooked diagonal epipolar geometry holds the key to boosting light field super-resolution, as demonstrated by a new omnidirectional EPI Transformer.
Bridging the gap between aerial and ground-level tracking, VL-UniTrack uses visual-language prompts to achieve robust object tracking even with significant viewpoint differences.