Search papers, labs, and topics across Lattice.
100 papers published across 1 lab.
Forget training data – Extend3D generates impressive town-scale 3D scenes from a single image by cleverly extending and patching the latent space of an object-centric 3D generative model.
By tightly coupling reasoning, searching, and generation, Unify-Agent achieves state-of-the-art world-grounded image synthesis, rivaling closed-source models and opening new avenues for agent-based multimodal generation.
Cut your 3D-QA model's token budget by 91% and latency by 86% with a new pruning method that intelligently balances semantic importance and geometric coverage.
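The blurb doesn't spell out the scoring rule, but a pruning objective of this kind can be sketched as a greedy trade-off between per-token semantic scores and farthest-point geometric coverage. A minimal Python sketch; all names, the 0.5 weighting, and the seeding choice are assumptions, not the paper's algorithm:

```python
import numpy as np

def prune_tokens(features, xyz, scores, keep=0.09, alpha=0.5):
    """Keep a fraction of 3D tokens, greedily balancing semantic score
    with geometric coverage (farthest-point criterion). Hypothetical
    sketch, not the paper's actual method."""
    n = len(scores)
    k = max(1, int(keep * n))
    kept = [int(np.argmax(scores))]           # seed with the most salient token
    dist = np.linalg.norm(xyz - xyz[kept[0]], axis=1)
    for _ in range(k - 1):
        # combined objective: semantic importance + distance from the kept set
        combined = alpha * scores + (1 - alpha) * dist / (dist.max() + 1e-8)
        combined[kept] = -np.inf              # never re-select a kept token
        nxt = int(np.argmax(combined))
        kept.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[nxt], axis=1))
    return features[kept], xyz[kept]

# toy usage: 1000 tokens with 32-dim features and 3D coordinates
feats, pts = np.random.randn(1000, 32), np.random.rand(1000, 3)
sal = np.random.rand(1000)                    # e.g. attention-derived saliency
kept_feats, kept_pts = prune_tokens(feats, pts, sal, keep=0.09)
print(kept_feats.shape)                       # (90, 32) -> ~91% fewer tokens
```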
Adding MRI data to histopathology and gene expression modestly improves glioma survival prediction, but only when combined effectively in a trimodal deep learning model.
Achieve superior compression of wind turbine images without sacrificing defect detection accuracy by using a segmentation-guided, dual lossy/lossless compression scheme.
Forget privacy concerns: you can train high-performing deep learning models for dynamic MRI reconstruction using *synthetic* fractal data.
Achieve real-time, privacy-aware action detection on edge devices by intelligently fusing fast skeleton tracking with vision-language models, outperforming either approach alone.
Current vision-language models are surprisingly bad at identifying common household safety hazards, but a new benchmark could change that.
Forget Fitzpatrick scores: lesion-skin contrast is the real culprit behind skin lesion segmentation errors, not overall skin tone.
Image generation models can now achieve state-of-the-art fidelity with up to 64x fewer tokens, thanks to a novel masking strategy that prevents latent space collapse.
Pose-guided GANs and diffusion models can faithfully generate complex cultural dance postures, opening new avenues for digital preservation and education.
Run multiple LoRA-tuned GenAI models on your phone without blowing up storage or latency: just swap weights at runtime.
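Runtime adapter swapping of this kind can be realized by keeping one frozen base model resident and moving only the rank-sized low-rank factors per app. A hedged sketch; class and method names are hypothetical, not any specific mobile runtime's API:

```python
import torch

class SwappableLoRALinear(torch.nn.Module):
    """Linear layer whose LoRA delta (B @ A) can be swapped at runtime,
    so many adapters share one set of frozen base weights."""
    def __init__(self, base: torch.nn.Linear):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)           # base stays frozen and shared
        self.lora_A = None                    # (rank, in_features)
        self.lora_B = None                    # (out_features, rank)

    def load_adapter(self, A: torch.Tensor, B: torch.Tensor):
        self.lora_A, self.lora_B = A, B       # cheap: only rank-sized tensors move

    def forward(self, x):
        y = self.base(x)
        if self.lora_A is not None:
            y = y + (x @ self.lora_A.T) @ self.lora_B.T
        return y

layer = SwappableLoRALinear(torch.nn.Linear(512, 512))
rank = 8
layer.load_adapter(torch.randn(rank, 512), torch.zeros(512, rank))
out = layer(torch.randn(1, 512))              # swap adapters without reloading the base
```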
Forget tedious poster design – iPoster lets you sketch your vision and then uses a smart diffusion model to instantly generate polished, content-aware layouts that respect your constraints.
Forget fine-tuning: this HTR model adapts to new handwriting styles in just a few shots, *without* any parameter updates.
Overcoming the challenge of limited and inconsistent imaging criteria for perineural invasion (PNI) diagnosis, NeoNet achieves state-of-the-art prediction accuracy by generating synthetic training data with a 3D Latent Diffusion Model.
Adversarial training doesn't have to destroy VLMs' zero-shot abilities: aligning adversarial visual features with textual embeddings using the original model's probabilistic predictions can actually *improve* robustness.
Robots can now generalize to unseen objects and categories for manipulation tasks with only a few training examples, thanks to a novel retrieval-augmented affordance prediction framework.
AI-generated image forgery detection gets a major boost with PromptForge-350k, a dataset so large and well-annotated it pushes IoU scores 5% higher and generalizes to unseen models.
Quantum-inspired architectures can significantly improve 3D cloud forecasting by better capturing nonlocal dependencies, outperforming classical methods like ConvLSTM and Transformers.
Correcting a vision-language model's "hallucinations" is now as simple as pinpointing and editing the right intermediate representation, sidestepping costly retraining or dual inference.
Federated learning systems are far more vulnerable to backdoor attacks than prior evaluations built on simple corner-patch triggers suggested, once attackers use realistic, semantically aligned triggers like sunglasses.
Robots can now learn to reproduce oil paintings with impressive accuracy through self-play and learned dynamics, even without human demonstrations or high-fidelity simulators.
Diffusion-based denoising can significantly improve composed image retrieval by making similarity scores more robust to hard negative samples.
Throw out your full images: focusing on pathology-relevant visual patches dramatically outperforms using the entire image for radiology report summarization.
Radiology report generation models can now verbalize calibrated confidence estimates, enabling targeted radiologist review of potentially hallucinated findings.
Diffusion-based watermarks, thought to be secure, can be completely bypassed with a simple stochastic resampling trick that breaks trajectory reconstruction.
Open-source SurgNavAR slashes the barrier to entry for AR surgical navigation research, offering a ready-to-use framework adaptable to diverse surgical applications.
Polarization cues, often overlooked, can significantly boost camouflaged object detection by explicitly guiding RGB feature learning, leading to state-of-the-art performance.
GPT-5 can only solve 37% of PhD-level 3D geometry coding problems, suggesting AI can't reliably automate complex scientific coding tasks yet.
Synthetic data, when carefully aligned with real-world characteristics, can boost hand-object interaction detection by over 11% even when real labeled data is scarce.
Vision-language models falter at the fine-grained temporal recognition crucial for surgical video understanding, while SurgRec excels.
Surgical VQA gets a major upgrade: SurgTEMP's hierarchical visual memory and competency-based training leapfrog existing models in understanding complex, time-sensitive surgical procedures.
By separating known and unknown object representations into orthogonal subspaces, DEUS achieves state-of-the-art open world object detection, outperforming prior methods that struggle to learn distinct unknown object representations.
Simply averaging pixel-level uncertainty in image segmentation throws away crucial spatial information, leading to worse performance on downstream tasks like detecting when your model is likely to fail.
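For intuition on why the mean is spatially blind: two uncertainty maps can share a comparable average while one concentrates all of its uncertainty in a single region. A toy illustration; the patch-wise summary below is one assumed alternative, not the paper's aggregator:

```python
import numpy as np

def mean_uncertainty(u):
    return float(u.mean())                    # spatially blind summary

def patch_uncertainty(u, patch=32):
    """Max over patch means: flags one concentrated uncertain region
    that global averaging washes out. Illustrative only."""
    h, w = u.shape
    means = [u[i:i + patch, j:j + patch].mean()
             for i in range(0, h - patch + 1, patch)
             for j in range(0, w - patch + 1, patch)]
    return float(max(means))

rng = np.random.default_rng(0)
scattered = rng.random((128, 128)) * 0.2                     # diffuse low uncertainty
clustered = np.zeros((128, 128)); clustered[:32, :32] = 1.0  # one failing region
print(mean_uncertainty(scattered), mean_uncertainty(clustered))   # comparable means
print(patch_uncertainty(scattered), patch_uncertainty(clustered)) # very different
```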
Forget uncanny-valley characters: Gloria lets you create consistent, expressive digital characters in videos exceeding 10 minutes, a leap towards believable virtual actors.
Diffusion-based feature denoising can significantly bolster the robustness of handwritten digit classifiers against adversarial attacks, even outperforming standard CNNs.
YOLOv11 crushes the competition in form element detection, showcasing its potential for automating document processing across diverse real-world forms.
Achieve fine-grained, six-degrees-of-freedom camera control in dynamic scenes with a generalizable model that outperforms scene-specific and diffusion-based approaches.
Single-pixel imaging gets a deep learning boost: SISTA-Net leverages learned sparsity and hybrid CNN-VSSM architectures to achieve state-of-the-art reconstruction quality, even in noisy underwater environments.
Fusing low-level statistical anomalies, high-level semantic coherence, and mid-level texture patterns makes AI-generated image detection far more reliable across diverse generative models.
Achieve massive gains in few-shot hierarchical multi-label classification (+42%) by adaptively balancing semantic priors and visual evidence using level-aware embeddings.
Stop training your image restoration models to mimic flawed ground truth; instead, explicitly optimize for perceptual quality using a plug-and-play module guided by No-Reference Image Quality Assessment.
Current facial expression editing models can't simultaneously preserve identity and accurately manipulate expressions, revealing a critical need for better fine-grained instruction following.
Video Transformers can achieve near-full attention accuracy with significantly less compute by focusing only on informative vertical vectors.
By injecting LLM-derived contextual cues into skeleton representations, SkeletonContext achieves state-of-the-art zero-shot action recognition, even distinguishing visually similar actions without explicit object interactions.
Forget expensive labels: CoRe-DA leverages contrastive learning and self-training to achieve state-of-the-art surgical skill assessment across diverse surgical environments without requiring target domain annotations.
Radio astronomy-aware self-supervised pre-training beats out-of-the-box Vision Transformers for transfer learning on radio astronomy morphology tasks.
Masked motion generators struggle with complex movements because they treat all frames the same – until now.
Edge cameras can achieve a 45% improvement in cross-modal retrieval accuracy by ditching redundant frames and focusing only on what's new.
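A minimal sketch of the novelty-based filtering idea: keep a frame only when it differs enough from the last kept frame. The pixel-difference metric and threshold are illustrative assumptions, not the paper's criterion:

```python
import numpy as np

def novel_frames(frames, thresh=0.1):
    """Drop near-duplicate frames: keep frame t only if its mean pixel
    distance to the last kept frame exceeds `thresh`. Illustrative."""
    kept = [0]
    for t in range(1, len(frames)):
        diff = np.abs(frames[t] - frames[kept[-1]]).mean()
        if diff > thresh:
            kept.append(t)
    return kept

video = np.random.rand(300, 64, 64)           # stand-in for a decoded clip
video[100:200] = video[100]                   # a static, redundant segment
print(len(novel_frames(video)))               # the static frames are skipped
```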
VLMs struggle with Earth observation tasks involving complex land use, but a new dataset with nearly 10 million text annotations could change that.
FlowID enables forensic facial reconstruction on damaged faces with better identity preservation and lower computational cost than existing methods, potentially accelerating victim identification in violent deaths.
Diffusion models can beat discriminative classifiers at facial expression recognition, but only with a dynamically adjusted margin loss that accounts for per-sample difficulty.
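One way such a loss can look: shrink the margin for hard samples so they aren't over-penalized. A hedged sketch; the schedule and names are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def adaptive_margin_loss(logits, targets, base_margin=0.3):
    """Subtract a per-sample margin from the true-class logit before
    cross-entropy; the margin shrinks for hard (low-confidence) samples.
    Illustrative sketch only."""
    with torch.no_grad():
        p_true = F.softmax(logits, dim=1).gather(1, targets[:, None]).squeeze(1)
        margin = base_margin * p_true         # easy sample -> bigger margin
    adjusted = logits.clone()
    adjusted[torch.arange(len(targets)), targets] -= margin
    return F.cross_entropy(adjusted, targets)

logits = torch.randn(8, 7, requires_grad=True)   # 7 basic expressions
targets = torch.randint(0, 7, (8,))
adaptive_margin_loss(logits, targets).backward()
```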
Nighttime image dehazing gets a boost from a structure-texture decomposition that enhances details and corrects color biases in the YUV color space.
Surgeons can now pinpoint tumor margins with millimeter precision using augmented reality, potentially reducing positive margins in head and neck cancer resections.
Square superpixels, generated via granular ball computing, unlock efficient parallel processing and end-to-end optimization in deep learning pipelines by replacing irregular shapes with multi-scale square blocks.
Querying satellite imagery just got easier: EarthEmbeddingExplorer lets you find images using text, visuals, or location, unlocking insights previously trapped in research papers.
A training-free feature adjustment pipeline unlocks the power of Visual Geometry Grounded Transformers for stereo vision, achieving state-of-the-art results on KITTI.
Turn semantic segmentation into hyperspectral unmixing with a surprisingly simple pipeline that leverages polyhedral-cone partitioning, outperforming existing deep and non-deep methods.
Rendering artifacts in feed-forward 3D Gaussian Splatting? Solved: AA-Splat delivers a whopping 7 dB PSNR boost by fixing screen-space dilation filters.
Finally, a blind face restoration method that doesn't just hallucinate details, but lets you precisely control facial attributes via text prompts while maintaining high fidelity.
Multimodal models surprisingly falter when applied to presentation attack detection on ID documents, challenging the assumption that combining visual and textual data inherently improves security.
Ditching depth map projections for camera-LiDAR calibration unlocks significant gains in accuracy and robustness, especially when starting from poor initial extrinsic estimates.
Quantifying and integrating map uncertainty—both positional and semantic—into trajectory prediction pipelines significantly boosts forecast accuracy, even when using existing baseline models.
Achieve a 60% reduction in trajectory error for monocular SLAM by tightly integrating multi-task dense prediction with a compact perception-to-mapping interface.
Reconstructing dynamic 3D scenes from video just got a whole lot better: MotionScale achieves state-of-the-art fidelity and temporal stability by scaling Gaussian splatting to long, complex sequences.
Gaze, often overlooked, reveals deepfake origins with surprising accuracy, enabling a new CLIP-based approach that significantly boosts deepfake attribution and detection.
Stop segmenting remote sensing images in isolation: modeling inter-unit dependencies boosts open-vocabulary segmentation accuracy by up to 6%.
Negation, a known weakness in VLMs like CLIP, can be dramatically improved by strategically fine-tuning only the *front* layers of the text encoder with a modified contrastive loss.
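Restricting updates to the front text-encoder blocks is easy to express; a sketch against a CLIP checkpoint via the transformers library (the four-layer split point is an illustrative assumption, and the modified contrastive loss itself is omitted):

```python
import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
for p in model.parameters():
    p.requires_grad_(False)                   # freeze the whole model

# unfreeze only the *front* text-encoder layers (split point is illustrative)
for block in model.text_model.encoder.layers[:4]:
    for p in block.parameters():
        p.requires_grad_(True)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")     # a small slice of the model
```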
Forget blurry averages – DMA unlocks sharp, realistic concept prototypes directly within diffusion models, offering a new lens into model understanding and bias.
Forget expensive training: FlexMem unlocks SOTA long-video MLLM performance on a single GPU by cleverly mimicking human memory recall.
Publicly available satellite imagery can now estimate building heights with state-of-the-art accuracy thanks to a new dataset and network architecture designed for the task.
By explicitly modeling camouflage and distractors, CCDNet achieves state-of-the-art infrared small target detection, even in challenging environments where targets blend into the background.
Forget tedious optimization – LightHarmony3D generates realistic lighting and shadows for inserted 3D objects in a single pass, making scene augmentation feel truly real.
A novel data-dependency-free palette unlocks high-throughput, low-resource mezzanine coding, outperforming JPEG-XS while halving LUT resource usage.
Diffusion-based image editing's impressive flexibility comes with fundamental trade-offs between controllability, faithfulness, consistency, locality, and quality, which this paper exposes with clear theoretical bounds.
Turn 2D orthographic views into 3D models automatically using corner detection and geometric reconstruction.
Current text-to-long-video evaluation metrics can't reliably assess video quality, failing to match human judgment in 9 out of 10 tested degradation aspects.
You can halve the polygon count of dynamic 3D meshes in VR without users noticing, but existing quality metrics won't tell you that.
Humanoids can now nimbly navigate real-world clutter and complex terrain using only raw depth data, ditching hand-engineered geometric representations.
Forget brute-force coverage – this method learns from simulated expert guidance to prioritize semantically relevant areas, dramatically speeding up target search in unseen environments.
Automating disassembly of complex, degraded appliances in recycling plants is now feasible, achieving high accuracy without pre-programmed coordinates.
SuperGrasp achieves robust single-view grasping by cleverly combining superquadric-based similarity matching with an end-to-end refinement network, outperforming existing methods in stability and generalization.
Real-time, uncertainty-aware signed distance functions are now possible without sacrificing accuracy, thanks to a novel kernel regression and GP regression hybrid.
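For intuition, here is plain GP regression on noisy signed-distance samples, the uncertainty-bearing half of such a hybrid; the fast kernel-regression component and all real-time machinery are omitted, and the toy 1D problem is an assumption:

```python
import numpy as np

def rbf(a, b, ls=0.2):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

# noisy signed-distance observations for a toy 1D "surface" at x = 0.3
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (40, 1))
y = X[:, 0] - 0.3 + 0.02 * rng.standard_normal(40)   # true SDF: x - 0.3

Xq = np.linspace(-1, 1, 5)[:, None]                  # query points
K = rbf(X, X) + 1e-4 * np.eye(len(X))                # kernel + noise jitter
Kq = rbf(Xq, X)
mean = Kq @ np.linalg.solve(K, y)                    # GP posterior mean
var = rbf(Xq, Xq).diagonal() - np.einsum(
    "ij,ji->i", Kq, np.linalg.solve(K, Kq.T))        # GP posterior variance
print(np.round(mean, 2), np.round(var, 4))           # SDF estimate + uncertainty
```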
Policies trained with GenSplat maintain robust performance under severe spatial perturbations where baseline methods completely fail, thanks to its novel 3D Gaussian Splatting-based augmentation.
World models can achieve state-of-the-art video prediction and emergent object decomposition by combining object-centric slots, hierarchical temporal dynamics, and learned causal interaction graphs.
Turn monaural video into immersive binaural audio with SIREN, a visually-guided framework that learns spatial audio cues without task-specific annotations.
Giving VLMs access to basic image manipulation tools and a strategic routing system dramatically improves their ability to "see through" visual illusions, even generalizing to unseen illusion types.
Over half of video understanding benchmark samples are solvable without watching the video, and current models barely outperform random guessing on the rest.
Style transfer can now capture the essence of artistic abstraction, not just surface-level appearance, by explicitly reinterpreting image structure.
Finally, a video generation model lets you roam through a scene with long-term spatial and temporal consistency, opening up new possibilities for virtual exploration.
Unbalanced class prevalence, not just disjoint label sets, is the dominant factor hindering federated learning performance under label-space heterogeneity.
Existing object detection models stumble when faced with the morphological diversity of cells in high-resolution, whole-brain microscopy data, revealing a critical gap in their generalization ability.
Current multimodal LLMs struggle to count objects and ground evidence in videos longer than 30 minutes, achieving only ~25% accuracy on a new benchmark, far below human performance.
Video diffusion models lock in their high-level plan almost immediately, suggesting a new path to scaling their reasoning abilities by focusing compute on promising early trajectories.
Unleashing creative potential in text-to-image models just got easier: on-the-fly repulsion in the contextual space lets you steer diffusion transformers towards richer diversity without sacrificing image quality or blowing your compute budget.
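A repulsion step of this general shape can be sketched as a gradient nudge that decorrelates a batch of latents; purely illustrative, with the cosine-similarity objective and step size as assumptions rather than the paper's formulation:

```python
import torch

def repulsion(latents, strength=0.1):
    """Push a batch of sampled latents apart to encourage diversity:
    one gradient step against mean pairwise similarity. Illustrative
    sketch; `latents` must require grad."""
    z = latents.flatten(1)
    sim = torch.nn.functional.cosine_similarity(z[:, None], z[None, :], dim=-1)
    loss = (sim - torch.eye(len(z))).mean()   # ignore self-similarity
    grad = torch.autograd.grad(loss, latents)[0]
    return latents - strength * grad          # one repulsive nudge

lat = torch.randn(4, 16, requires_grad=True)  # 4 parallel samples
print(repulsion(lat).shape)                   # (4, 16), nudged apart
```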
Generate or edit 1024x1024 images on your phone in under a second with DreamLite, a unified diffusion model that rivals server-side performance despite its tiny 0.39B parameters.
Image generation takes a leap towards real-world knowledge by training an agent that actively searches for and integrates external information, substantially boosting performance on knowledge-intensive tasks.
Zero-shot Vision-Language Models can now guide chip floorplanning, beating specialized ML methods by up to 32% without any fine-tuning.
Correcting errors early in the diffusion process matters more than fixing them later: Stepwise-Flow-GRPO leverages this insight to dramatically improve RL-based flow model training.
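The "fix errors early" insight suggests front-loading credit when distributing a trajectory-level reward across denoising steps. A toy sketch; the geometric schedule is an assumption, not the paper's GRPO weighting:

```python
import torch

def stepwise_weights(num_steps, decay=0.9):
    """Weight earlier denoising steps more heavily when spreading a
    trajectory reward over steps. Purely illustrative."""
    w = decay ** torch.arange(num_steps, dtype=torch.float32)
    return w / w.sum()

reward = 1.7                                   # scalar reward for one sample
w = stepwise_weights(num_steps=10)
per_step_credit = reward * w                   # step 0 gets the largest share
print(per_step_credit)
```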
Aggregate accuracy can be dangerously misleading when evaluating facial recognition systems for law enforcement, obscuring significant disparities in error rates across demographic subgroups.
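The fix this points to is mechanical: disaggregate. A sketch with hypothetical column names showing how a tolerable aggregate error can hide a large per-subgroup disparity:

```python
import pandas as pd

# hypothetical evaluation log: one row per verification attempt
df = pd.DataFrame({
    "subgroup": ["A", "A", "B", "B", "B", "C", "C", "C"],
    "correct":  [1,   1,   1,   0,   0,   1,   1,   0],
})

aggregate = 1 - df["correct"].mean()
per_group = 1 - df.groupby("subgroup")["correct"].mean()
print(f"aggregate error: {aggregate:.2f}")    # hides the disparity below
print(per_group)                              # A: 0.00, B: 0.67, C: 0.33
```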