100 papers published across 6 labs.
Achieve superior video generation quality and temporal coherence without expensive retraining by intelligently scaling and steering diffusion models at test time.
Achieve near-lossless 60% attention latency reduction in video editing by exploiting query sharpness to dynamically route attention.
Fine-tuning efficient few-step diffusion models no longer requires sacrificing their speed, thanks to a self-distillation approach that preserves inference capabilities.
Existing image-to-image evaluations miss a critical aspect: whether the output image actually preserves the content of the input.
Outlier tokens in Diffusion Transformers aren't just extreme values; they corrupt local patch semantics, and can be tamed with Dual-Stage Registers to boost image generation quality.
Learned image compression finally delivers on its promise: a codec that's not just perceptually superior, but also crushes traditional and learned alternatives in bitrate savings while running blazingly fast on mobile.
Task-aware 3D reconstruction slashes the number of views needed by focusing on the data that actually matters for downstream applications.
Discovering spatial regions and their temporal signatures in massive time series data just got much faster and easier, thanks to a new method that scales log-linearly with the number of time series.
Nanometer-accurate, full-chip CMP modeling is now possible with a fast, FCN-based approach that leapfrogs traditional, resource-intensive methods.
Decoupling radial and angular dynamics in vision-language model adaptation unlocks significant gains in few-shot performance, outperforming existing flow matching methods.
Training vision-language models to detect glaucoma fairly across demographics requires debiasing both text *and* images, which this paper achieves with a novel pretraining strategy.
MLLMs can overcome self-referential bias and improve visual grounding by actively exploring and correcting their cognitive deficiencies, guided by token-level epistemic uncertainty.
Diffusion models' reliance on global information isn't just a quirk – it's fundamentally linked to the moment they commit to a specific semantic outcome.
Turns out, all gaze estimation models stumble when robots look down, and complex architectures aren't the answer – data diversity is the real secret to robust human-robot interaction.
Achieve robust long-horizon visual control by adaptively balancing model-based lookahead with bootstrapping, enabling zero-shot transfer to real-world tasks with severe occlusions.
By embedding whole-slide images in a hybrid hyperbolic-Euclidean space, BatMIL unlocks superior classification performance compared to traditional Euclidean-only methods, revealing the importance of geometric awareness in capturing complex tissue organization.
Finally, a way to judge the *vibes* of your 3D Gaussian Splatting scenes, without needing to render a bunch of images.
End-to-end ML models get smoked in real-world mmWave vehicular connectivity: a hybrid vision-primed approach slashes outage rates by leveraging model-based reasoning and RF feedback.
Hallucinations in diffusion models aren't just mode interpolation gone wrong, but instabilities on the model's manifold, and squashing its local intrinsic dimension can fix them.
A single vision-language foundation model, DART, can perform a full rope inspection workflow, including damage classification, severity estimation, and few-shot recognition, all without task-specific fine-tuning.
Human crowdsourcing struggles to reliably identify audiovisual deepfakes, especially when both audio and video are manipulated, suggesting current detection methods may overestimate human capabilities.
Achieve near-perfect traffic congestion classification by fusing motion-guided visual attention with data-adaptive temporal decomposition, outperforming existing vision-based and signal-based methods.
Identity-preserving video generation just got a whole lot more faithful: FaithfulFaces maintains identity even under extreme pose variations and occlusions, a feat previous methods struggled with.
Achieve 80.5% Top-1 accuracy in zero-shot EEG-to-image retrieval by mimicking the human visual system, substantially outperforming existing methods.
Unsupervised object detection can now achieve category awareness, bridging the gap with supervised methods without needing any labeled data.
Ditching diffusion's noise-denoising, RLFSeg uses Rectified Flow to directly predict segmentation masks from text prompts, unlocking zero-shot performance gains.
Synthesizing high-resolution satellite imagery with geometric precision is now more efficient, thanks to a windowed cross-attention method that rivals existing techniques while better respecting geometric constraints.
Unlock zero-shot 3D scene understanding: Ilov3Splat lets you identify and segment arbitrary objects in 3D scenes using only natural language, no category supervision needed.
Freezing your VAE and permuting high-frequency visual signals unlocks a new SOTA for VLM prompt learning, boosting harmonic-mean accuracy to 81.51%.
Current image difference captioning benchmarks fail to capture semantic consistency and penalize hallucinations, but DiffCap-Bench offers a robust alternative that aligns with human expert judgments and predicts downstream utility for image editing.
VLMs can be easily tricked into "hallucinating" object relationships with simple image rotations or noise, revealing a surprising fragility in their multimodal reasoning.
Open-source image editing models can match or beat fine-tuned models on visual understanding tasks *without any task-specific training*.
Forget backprop and memory lookups: FAAST lets you adapt models at test time with a single forward pass, matching fine-tuning accuracy with massive speed and memory gains.
Vol-Mark offers a way to protect sensitive 3D medical data from tampering and unauthorized copying with a reversible watermarking technique that maintains diagnostic accuracy.
Freezing a text encoder and distilling prompts from vision-language models can stabilize semantics and boost performance in lifelong person re-identification, even across unseen domains.
Training on Syn4D could unlock breakthroughs in dynamic scene understanding, where current datasets fall short in providing dense, complete, and accurate geometric annotations.
Counterintuitively, moderately similar reference images are the key to unlocking accurate VLM-based anomaly localization in medical imaging.
Forget dataset-specific hacks: CPCANet achieves SOTA domain generalization by explicitly learning a structured, domain-invariant subspace with a differentiable CPCA layer.
By intelligently incorporating LiDAR-derived height information, HiPR overcomes limitations of fixed projection spaces, achieving state-of-the-art camera-LiDAR occupancy prediction with real-time performance.
Finally, a driving dataset that doesn't just assume perfectly paved roads, offering 6.5x more depth data than KITTI for realistic autonomous driving scenarios.
Even with limited data, a simple combination of pre-trained CNN features and nearest-centroid classification can achieve surprisingly strong results in monkeypox skin disease classification.
Discrete diffusion, with carefully designed transition matrices for commands and parameters, unlocks superior CAD generation compared to continuous diffusion baselines.
For more reliable animal identification, force your model to reconstruct masked skin patterns, and it will learn embeddings that better capture individual differences.
Generate CT-like images from ultrasound with a transformer-augmented network, potentially reducing the need for harmful radiation exposure.
Encoding cross-task relationships between building footprints and heights slashes height estimation error by 7% – more effective than just refining individual encoders.
Adult-trained human mesh recovery models can now handle kids, too, thanks to a new framework that enforces spatial consistency and leverages VLM-derived age and gender cues.
Synthesizing realistic duet dance motions gets a boost from explicitly modeling inter-dancer contact, leading to significantly improved interaction fidelity and rhythmic synchronization.
Steer LVLMs' attention with caption guidance and watch object hallucinations drop by 6%—no training required.
Overlooked diagonal epipolar geometry holds the key to boosting light field super-resolution, as demonstrated by a new omnidirectional EPI Transformer.
Bridging the gap between aerial and ground-level tracking, VL-UniTrack uses visual-language prompts to achieve robust object tracking even with significant viewpoint differences.
Unleashing geospatial reasoning on a torrent of unlabeled remote sensing data, RemoteZero rivals supervised methods by having models verify their own reasoning, not relying on human-annotated coordinates.
Radar SLAM can now achieve state-of-the-art performance via direct scan registration, eliminating the need for hand-engineered feature extraction and enabling robust localization in adverse weather.
Achieve autonomous laparoscope control by translating multimodal surgical data into a single "wrench" that guides the robot's movements.
Fine-grained analysis of user behavior on search engine results pages is now possible thanks to AllSERP, which adds exhaustive per-element annotations to the AdSERP dataset, covering organic results and widgets in addition to ads.
Hand-eye calibration gets a 67% accuracy boost in high-uncertainty scenarios thanks to a new optimization framework that cleverly avoids explicit uncertainty modeling.
Robotic manipulation gets a serious upgrade: ConsisVLA-4D boosts performance by up to 41.5% and speeds up inference by 2.4x, all while ensuring your robot understands the scene in 3D *and* how it changes over time.
Standard camera auto-exposure is blind to the needs of remote heart-rate monitoring, but a new method closes the gap to enable robust in-vehicle driver monitoring.
Escaping the endless cat-and-mouse game of deepfake detection may be possible by shifting from static pattern recognition to physics-inspired dynamical stability analysis, where real images are stable and deepfakes are not.
LEGO's modular design lets you detect deepfakes with 10x less training data and far fewer epochs, all by focusing on the unique fingerprints of each image generator.
Achieve spatially grounded natural language descriptions of urban development with PTNet, a new model that understands change semantics better than existing methods.
Forget training from scratch: surprisingly, off-the-shelf 2D diffusion models can unlock generalizable style control in 3D generation models, even for out-of-distribution styles.
By grounding temporal Gaussian aggregation in spatial voxels, Ground4D achieves state-of-the-art 4D reconstruction in challenging off-road environments where existing methods falter.
Face symmetry and half-face alignment can be combined to achieve state-of-the-art facial expression recognition.
Stop feeding LLMs redundant and conflicting sensor data in autonomous driving: a new architecture slashes hallucinated entities by coordinating multi-sensor inputs *before* reasoning.
Ditch the Bradley-Terry model: a game-theoretic approach to diffusion alignment unlocks better text-to-image generation by directly optimizing for Nash equilibrium in human preferences.
Stop retraining your object detector every time it makes a mistake: EBOD learns from failure examples to prevent recurring errors in open-vocabulary object detection.
Brain tumor segmentation gets a lightweight boost: DALight-3D achieves comparable accuracy to larger U-Nets with significantly fewer parameters.
Technical artists overwhelmingly prefer this new method for single-image head mesh reconstruction, finding it closest to industry-grade usability.
Unlock efficient 4D object understanding from dynamic point clouds with Velox, a representation that's descriptive, compressive, and accessible.
Forget training, just nudge your text embeddings: RGSE closes the open-vocabulary object detection gap under distribution shift by directly and efficiently adapting text embeddings at test time.
Even with noisy initial matches, Angle-I2P leverages angular consistency and hierarchical attention to achieve state-of-the-art image-to-point cloud registration.
Explicitly modeling human-object interactions boosts multi-person human mesh recovery accuracy by up to 9.9%, showing that interaction context is key to understanding human pose and shape in complex scenes.
Mamba's linear complexity meets perceptual image compression, yielding a lightweight model that rivals GANs and diffusion models in visual quality while being far more efficient.
By fusing CLIP with a diffusion model, DiCLIP unlocks surprisingly strong weakly supervised segmentation, outperforming prior methods and slashing training costs.
Stop letting semantics dictate composition: Composer unlocks semantic-agnostic control over image aesthetics, letting you transfer and plan compositions with unprecedented precision.
Generating synthetic training data with multi-modal diffusion beats hand-crafting better detection architectures for PCB defect inspection.
Adversarial clothing with non-overlapping visible-thermal patterns can reliably evade RGB-T detectors, even transferring across different fusion architectures.
Random masking in self-supervised learning can destroy crucial diagnostic features in medical images; instead, try inverting chaos.
Image-based latent actions are your secret weapon for long-horizon reasoning in VLAs, while action-based latent actions unlock complex motor coordination.
Spatial transcriptomics predictions get a boost from HEXST, a Transformer that respects the hexagonal geometry of spot arrays and recovers gene-specific spatial heterogeneity.
Alpha-blending, a core optimization in 3D Gaussian Splatting, subtly hobbles feature learning, but a geometry-weighted fusion approach can unlock more accurate and efficient visual localization.
Resource-strapped edge devices can now achieve state-of-the-art face recognition across different sensing modalities thanks to a new lightweight CNN-Transformer architecture.
FlowDIS achieves state-of-the-art dichotomous image segmentation by using flow matching, even allowing for precise, pixel-level control via text prompts.
3D Gaussian Splatting gets a nearly 2x speed boost thanks to a clever bounding box strategy that drastically reduces unnecessary tile intersection checks.
Top-view RGB-D person re-identification is surprisingly feasible, even across modalities, despite the inherent challenges of viewpoint and modality variations.
Forget full fine-tuning: LoRA lets you adapt Geospatial Foundation Models for wildfire mapping with comparable accuracy while only tweaking 1% of the parameters.
Forget ImageNet – pre-training with chaotic augmentations yields surprisingly robust texture features, outperforming SOTA methods across diverse texture datasets.
ScriptHOI reveals that current HOI detectors over-rely on object affordance and phrase co-occurrence, and proposes a novel approach to explicitly model interaction scripts for improved open-vocabulary generalization.
Existing restoration methods crumble when faced with the extreme geometric distortions caused by strong refractive warping, highlighting the need for robust new approaches benchmarked on this challenging dataset.
Turns out, deep learning models trained to predict breast density from ultrasound images generalize surprisingly well to external datasets, but still struggle with heterogeneously dense breasts.
Get expert-level feedback on your performance, not just a score, thanks to a new approach that uses language generation for proficiency estimation.
Bidirectional interaction between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables a unified multimodal model to achieve spatial intelligence beyond general visual competence.
Stop wasting compute on unreliable rollouts and easy frames: Stream-R1 adaptively focuses video diffusion distillation where it matters most, boosting quality without architectural changes or added inference cost.
Despite impressive OCR performance on existing benchmarks, today's best LMMs still struggle with the messy realities of enterprise document processing.
Immersive video reveals that "being there" hinges more on feeling spatially located than having a virtual body, challenging conventional notions of embodiment in XR.
Production VLMs like GPT-4, Claude Opus, Gemini, and Grok can be easily manipulated into confidently providing false information via subtle adversarial perturbations to images, even without compromising model alignment.
Provably undetectable backdoors can be injected into pre-trained image classifiers, even with white-box access, by exploiting sparse perturbations and Gaussian dithering.
Cosine distance unexpectedly cracks PolyProtect, but a smart key selection algorithm can harden it again, offering better control over the accuracy-irreversibility tradeoff.
Computation-in-memory combined with lightweight cryptography slashes energy consumption by up to 44% in steganography applications.
Event cameras can significantly boost the reliability of autonomous driving in high-dynamic-range and high-speed scenarios, achieving perfect route completion in CARLA benchmarks.