Image recognition, object detection, segmentation, video understanding, and visual generation.
Forget training data – Extend3D generates impressive town-scale 3D scenes from a single image by cleverly extending and patching the latent space of an object-centric 3D generative model.
By tightly coupling reasoning, searching, and generation, Unify-Agent achieves state-of-the-art world-grounded image synthesis, rivaling closed-source models and opening new avenues for agent-based multimodal generation.
Cut your 3D-QA model's token budget by 91% and latency by 86% with a new pruning method that intelligently balances semantic importance and geometric coverage.
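The summary above does not spell out the paper's exact pruning criterion, but the general idea of trading off semantic importance against geometric coverage can be sketched as a greedy, farthest-point-style selection. All names here are hypothetical; `prune_tokens` is an illustrative stand-in, not the paper's method.

```python
import numpy as np

def prune_tokens(feats, xyz, scores, keep_ratio=0.09, lam=1.0):
    """Greedily keep tokens that balance semantic importance (scores)
    with geometric coverage of 3D positions (xyz). Illustrative sketch."""
    n = len(scores)
    k = max(1, int(n * keep_ratio))
    kept = [int(np.argmax(scores))]           # seed with most important token
    # distance of every token to its nearest already-kept token
    d = np.linalg.norm(xyz - xyz[kept[0]], axis=1)
    for _ in range(k - 1):
        gain = scores + lam * d               # importance + coverage trade-off
        gain[kept] = -np.inf                  # never re-select a kept token
        nxt = int(np.argmax(gain))
        kept.append(nxt)
        d = np.minimum(d, np.linalg.norm(xyz - xyz[nxt], axis=1))
    return feats[kept], xyz[kept]
```

With `keep_ratio=0.09`, roughly 91% of tokens are dropped, matching the budget reduction the summary quotes; the `lam` weight controls how strongly coverage pulls selection away from pure importance ranking.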
Adding MRI data to histopathology and gene expression modestly improves glioma survival prediction, but only when combined effectively in a trimodal deep learning model.
Achieve superior compression of wind turbine images without sacrificing defect detection accuracy by using a segmentation-guided, dual lossy/lossless compression scheme.
Forget privacy concerns: you can train high-performing deep learning models for dynamic MRI reconstruction using *synthetic* fractal data.
Achieve real-time, privacy-aware action detection on edge devices by intelligently fusing fast skeleton tracking with vision-language models, outperforming either approach alone.
Current vision-language models are surprisingly bad at identifying common household safety hazards, but a new benchmark could change that.
Forget Fitzpatrick scores: lesion-skin contrast is the real culprit behind skin lesion segmentation errors, not overall skin tone.
Image generation models can now achieve state-of-the-art fidelity with up to 64x fewer tokens, thanks to a novel masking strategy that prevents latent space collapse.
Pose-guided GANs and diffusion models can faithfully generate complex cultural dance postures, opening new avenues for digital preservation and education.
Run multiple LoRA-tuned GenAI models on your phone without blowing up storage or latency: just swap weights at runtime.
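The storage saving comes from LoRA's structure: the large base weight is stored once, and each task contributes only a tiny low-rank (A, B) pair, so "swapping models" means repointing to a different adapter. A minimal sketch (class and method names are made up for illustration, not any specific on-device runtime):

```python
import numpy as np

class LoRASwitchableLinear:
    """One frozen base weight shared across tasks; switching task swaps
    only the small low-rank adapter, never the base matrix."""
    def __init__(self, weight):
        self.W = weight                       # (out, in), stored once
        self.adapters = {}                    # name -> (A, B, scale)
        self.active = None

    def add_adapter(self, name, A, B, scale=1.0):
        self.adapters[name] = (A, B, scale)   # A: (r, in), B: (out, r)

    def swap(self, name):
        self.active = name                    # O(1): repoint, no weight copy

    def forward(self, x):
        y = x @ self.W.T
        if self.active is not None:
            A, B, s = self.adapters[self.active]
            y = y + s * (x @ A.T) @ B.T       # low-rank update applied on the fly
        return y
```

For a d_out x d_in layer, each extra task costs only r * (d_in + d_out) parameters instead of a full d_out * d_in copy, which is why many adapters fit on a phone.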
Forget tedious poster design – iPoster lets you sketch your vision and then uses a smart diffusion model to instantly generate polished, content-aware layouts that respect your constraints.
Forget fine-tuning: this HTR model adapts to new handwriting styles in just a few shots, *without* any parameter updates.
Overcoming the challenge of limited and inconsistent imaging criteria for perineural invasion (PNI) diagnosis, NeoNet achieves state-of-the-art prediction accuracy by generating synthetic training data with a 3D Latent Diffusion Model.
Adversarial training doesn't have to destroy VLMs' zero-shot abilities: aligning adversarial visual features with textual embeddings using the original model's probabilistic predictions can actually *improve* robustness.
Robots can now generalize to unseen objects and categories for manipulation tasks with only a few training examples, thanks to a novel retrieval-augmented affordance prediction framework.
AI-generated image forgery detection gets a major boost with PromptForge-350k, a dataset so large and well-annotated that it pushes IoU scores 5% higher and helps detectors generalize to unseen generative models.
Quantum-inspired architectures can significantly improve 3D cloud forecasting by better capturing nonlocal dependencies, outperforming classical methods like ConvLSTM and Transformers.
Correcting a vision-language model's "hallucinations" is now as simple as pinpointing and editing the right intermediate representation, sidestepping costly retraining or dual inference.
Federated learning systems are far more vulnerable to backdoor attacks with realistic, semantically aligned triggers (like sunglasses) than prior evaluations based on simple corner patches suggested.
Robots can now learn to reproduce oil paintings with impressive accuracy through self-play and learned dynamics, even without human demonstrations or high-fidelity simulators.
Diffusion-based denoising can significantly improve composed image retrieval by making similarity scores more robust to hard negative samples.
Throw out your full images: focusing on pathology-relevant visual patches dramatically outperforms using the entire image for radiology report summarization.
Radiology report generation models can now verbalize calibrated confidence estimates, enabling targeted radiologist review of potentially hallucinated findings.
Diffusion-based watermarks, thought to be secure, can be completely bypassed with a simple stochastic resampling trick that breaks trajectory reconstruction.
Open-source SurgNavAR slashes the barrier to entry for AR surgical navigation research, offering a ready-to-use framework adaptable to diverse surgical applications.
Polarization cues, often overlooked, can significantly boost camouflaged object detection by explicitly guiding RGB feature learning, leading to state-of-the-art performance.
GPT-5 can only solve 37% of PhD-level 3D geometry coding problems, suggesting AI can't reliably automate complex scientific coding tasks yet.
Synthetic data, when carefully aligned with real-world characteristics, can boost hand-object interaction detection by over 11% even when real labeled data is scarce.
Vision-language models falter at the fine-grained temporal recognition crucial for surgical video understanding, while SurgRec excels.
Surgical VQA gets a major upgrade: SurgTEMP's hierarchical visual memory and competency-based training leapfrog existing models in understanding complex, time-sensitive surgical procedures.
By separating known and unknown object representations into orthogonal subspaces, DEUS achieves state-of-the-art open world object detection, outperforming prior methods that struggle to learn distinct unknown object representations.
Simply averaging pixel-level uncertainty in image segmentation throws away crucial spatial information, leading to worse performance on downstream tasks like detecting when your model is likely to fail.
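Why averaging discards spatial information is easy to demonstrate: two uncertainty maps can share the same mean while one has scattered noise and the other a single contiguous failure region. A toy illustration (the blob metric is a made-up example of a spatially aware aggregate, not the paper's method):

```python
import numpy as np
from collections import deque

def largest_uncertain_blob(u, thresh=0.5):
    """Size of the largest 4-connected region of high uncertainty.
    Unlike the mean, this is sensitive to WHERE uncertainty clusters."""
    h, w = u.shape
    mask = u > thresh
    seen = np.zeros_like(mask)
    best = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not seen[i, j]:
                q, size = deque([(i, j)]), 0
                seen[i, j] = True
                while q:                      # BFS over the connected region
                    y, x = q.popleft()
                    size += 1
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                best = max(best, size)
    return best

# Identical mean uncertainty (0.25), very different spatial structure:
scattered = np.zeros((8, 8)); scattered[::2, ::2] = 1.0   # 16 isolated pixels
blob = np.zeros((8, 8)); blob[2:6, 2:6] = 1.0             # one 4x4 failure region
```

A pure average scores both maps identically, yet the contiguous blob is far more likely to signal a genuine model failure, which is exactly the signal a downstream failure detector needs.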
Forget generating uncanny-valley characters – Gloria lets you create consistent, expressive digital characters in videos exceeding 10 minutes, a leap towards believable virtual actors.
Diffusion-based feature denoising can significantly bolster the robustness of handwritten digit classifiers against adversarial attacks, even outperforming standard CNNs.
YOLOv11 crushes the competition in form element detection, showcasing its potential for automating document processing across diverse real-world forms.
Achieve fine-grained, six-degrees-of-freedom camera control in dynamic scenes with a generalizable model that outperforms scene-specific and diffusion-based approaches.
Single-pixel imaging gets a deep learning boost: SISTA-Net leverages learned sparsity and hybrid CNN-VSSM architectures to achieve state-of-the-art reconstruction quality, even in noisy underwater environments.
Fusing low-level statistical anomalies, high-level semantic coherence, and mid-level texture patterns makes AI-generated image detection far more reliable across diverse generative models.
Achieve massive gains in few-shot hierarchical multi-label classification (+42%) by adaptively balancing semantic priors and visual evidence using level-aware embeddings.
Stop training your image restoration models to mimic flawed ground truth; instead, explicitly optimize for perceptual quality using a plug-and-play module guided by No-Reference Image Quality Assessment.
Current facial expression editing models can't simultaneously preserve identity and accurately manipulate expressions, revealing a critical need for better fine-grained instruction following.
Video Transformers can achieve near-full attention accuracy with significantly less compute by focusing only on informative vertical vectors.
By injecting LLM-derived contextual cues into skeleton representations, SkeletonContext achieves state-of-the-art zero-shot action recognition, even distinguishing visually similar actions without explicit object interactions.
Forget expensive labels: CoRe-DA leverages contrastive learning and self-training to achieve state-of-the-art surgical skill assessment across diverse surgical environments without requiring target domain annotations.
Radio astronomy-aware self-supervised pre-training beats out-of-the-box Vision Transformers for transfer learning on radio astronomy morphology tasks.
Masked motion generators struggle with complex movements because they treat all frames the same – until now.
Edge cameras can achieve a 45% improvement in cross-modal retrieval accuracy by ditching redundant frames and focusing only on what's new.
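The "focus only on what's new" idea amounts to a novelty filter: a frame is kept only if it differs enough from the last kept frame, so redundant frames never reach the expensive cross-modal encoder. A minimal sketch with a hypothetical `novelty_filter` (the paper's actual redundancy criterion may differ):

```python
import numpy as np

def novelty_filter(frames, thresh=0.1):
    """Keep indices of frames whose mean absolute difference from the
    last KEPT frame exceeds thresh; drop near-duplicates. Sketch only."""
    kept = [0]                                # always keep the first frame
    last = frames[0].astype(float)
    for i, f in enumerate(frames[1:], start=1):
        f = f.astype(float)
        if np.abs(f - last).mean() > thresh:  # enough changed: keep it
            kept.append(i)
            last = f
    return kept
```

Comparing against the last kept frame (rather than the immediately preceding one) prevents slow drift from slipping through as a long run of "unchanged" frames.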
VLMs struggle with Earth observation tasks involving complex land use, but a new dataset with nearly 10 million text annotations could change that.