GPT-5 can only solve 37% of PhD-level 3D geometry coding problems, suggesting AI can't reliably automate complex scientific coding tasks yet.
Stop training your image restoration models to mimic flawed ground truth; instead, explicitly optimize for perceptual quality using a plug-and-play module guided by No-Reference Image Quality Assessment.
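The plug-and-play idea above can be sketched as a loss with two terms: fidelity toward the (possibly flawed) ground truth, plus a term that directly rewards No-Reference IQA quality. Everything here is a toy stand-in — `toy_nr_iqa` (a local-contrast proxy) and the `lam` weight are illustrative assumptions, not the paper's module:

```python
import numpy as np

def toy_nr_iqa(img):
    """Toy stand-in for a learned No-Reference IQA model: scores local
    contrast in [0, 1]. A real system would plug in an NR-IQA network."""
    grad = np.abs(np.diff(img, axis=0))
    return float(np.clip(grad.mean() * 10.0, 0.0, 1.0))

def restoration_loss(pred, target, lam=0.1):
    """Fidelity toward the (possibly flawed) target, plus a perceptual term
    that rewards NR-IQA quality instead of pure ground-truth mimicry."""
    fidelity = float(((pred - target) ** 2).mean())
    perceptual = 1.0 - toy_nr_iqa(pred)  # smaller when pred scores higher
    return fidelity + lam * perceptual

sharp = (np.indices((8, 8)).sum(axis=0) % 2).astype(float)  # high contrast
flat = np.zeros((8, 8))                                      # over-smoothed
# Even a perfect copy of an over-smoothed target pays the perceptual penalty:
print(toy_nr_iqa(sharp), toy_nr_iqa(flat), restoration_loss(flat, flat))
```

The point of the extra term: reproducing a flawed target exactly (`restoration_loss(flat, flat)`) no longer gets zero loss, so the model is pushed toward perceptual quality rather than mimicry.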
Instruction-guided video editing can achieve impressive zero-shot performance simply by pre-training on motion-centric video restoration tasks *before* fine-tuning on paired editing data.
Video diffusion models can be aggressively quantized down to 6-bit precision with minimal quality loss by dynamically adapting the bit-width of each layer based on its temporal stability.
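The per-layer adaptation above can be sketched as: measure how much a layer's activations change across frames, then assign fewer bits to temporally stable layers. The frame-difference stability proxy and the 4–8 bit range are illustrative assumptions, not the paper's actual criterion:

```python
import numpy as np

def temporal_stability(acts):
    """acts: (T, N) layer activations over T frames. Returns a score in
    [0, 1]; 1 means the layer barely changes across time. This
    frame-difference proxy is an illustrative assumption."""
    change = np.abs(np.diff(acts, axis=0)).mean()
    scale = np.abs(acts).mean() + 1e-8
    return 1.0 - min(change / scale, 1.0)

def pick_bits(stability, lo=4, hi=8):
    """Temporally stable layers tolerate coarser quantization: fewer bits."""
    return int(round(hi - stability * (hi - lo)))

def quantize(x, bits):
    """Symmetric uniform quantization at the chosen bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale) * scale

stable_layer = np.ones((8, 16))                 # no temporal change -> 4 bits
signs = np.where(np.arange(8) % 2 == 0, 1.0, -1.0)[:, None]
jittery_layer = signs * np.ones((8, 16))        # flips every frame -> 8 bits
print(pick_bits(temporal_stability(stable_layer)),
      pick_bits(temporal_stability(jittery_layer)))
```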
Forget real-world video datasets: training VLMs on just 7.7K synthetic videos with temporal primitives beats 165K real-world examples, unlocking surprisingly effective transfer learning for video reasoning.
Animate 3D characters using bananas and plush toys – DancingBox turns everyday objects into motion capture proxies, making animation accessible to novices.
Ditch the diffusion vs. autoregressive debate: this VLA framework uses diffusion to *draft* actions and an autoregressive model to *verify* them, boosting real-world success by nearly 20%.
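The draft-then-verify loop above can be sketched with toy stand-ins: one model proposes several action candidates, another scores them, and the best-scoring candidate is executed. The Gaussian drafter and distance-based verifier are placeholders for the real diffusion policy and autoregressive verifier:

```python
import numpy as np

rng = np.random.default_rng(0)

def draft_actions(goal, k=8, noise=0.3):
    """Stand-in for the diffusion drafter: propose k noisy action candidates
    (a real system would sample these from a diffusion policy)."""
    return goal + noise * rng.standard_normal((k, goal.shape[-1]))

def verifier_score(goal, action):
    """Stand-in for the autoregressive verifier: higher = more plausible.
    Here, simply negative distance to the intended action."""
    return -float(np.linalg.norm(action - goal))

def draft_then_verify(goal, k=8):
    """Draft with one model, verify with another, execute the best candidate."""
    candidates = draft_actions(goal, k)
    scores = np.array([verifier_score(goal, a) for a in candidates])
    return candidates, candidates[int(np.argmax(scores))]

goal = np.array([0.5, -0.2, 0.1])  # toy 3-DoF action target
candidates, best = draft_then_verify(goal)
```

The design point is the asymmetry: sampling many drafts is cheap and parallel, while the verifier only has to rank them, which is an easier job than generating from scratch.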
A new RGB-T dataset and frequency-aware network expose the surprising limitations of existing UAV detectors when faced with real-world camouflage and complex backgrounds.
By explicitly modeling tooth relationships, TCATSeg achieves state-of-the-art accuracy in 3D dental model segmentation, even in challenging pre-orthodontic cases.
Directly modeling 3D geometry in dental scans unlocks a 9.58% accuracy boost in multi-disease diagnosis compared to methods relying on 2D or multi-view image representations.
By treating camera pose as a unifying geometric representation, WorldCam achieves significantly improved action controllability and long-horizon 3D consistency in interactive gaming world models compared to prior video diffusion transformer approaches.
By aligning image and LiDAR features to event-derived spatiotemporal edges, $x^2$-Fusion achieves state-of-the-art accuracy in optical and scene flow estimation, particularly under challenging conditions where other multimodal fusion methods falter.
Achieve real-time cattle mounting pose estimation in complex environments with FSMC-Pose, a framework that outperforms existing methods while drastically reducing computational costs.
Forget painstakingly creating 3D assets for robot training - ManiTwin automates the process, turning single images into simulation-ready objects at scale.
DriveFix tackles the "shaky camera" problem in 4D driving scene reconstruction, producing significantly more stable and coherent novel views by explicitly modeling spatio-temporal dependencies.
Ditch the pixel-perfect annotations: this method achieves near state-of-the-art infrared small target detection using only point annotations, slashing annotation costs.
Achieve diffusion-level perceptual quality in monocular depth estimation at 40x the speed, by replacing the slow initial diffusion steps with a fast ViT-based depth map and refining in a compact latent space.
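The speedup above is in the spirit of SDEdit-style shortcuts: noise a fast coarse prediction to an intermediate timestep and start the reverse diffusion there instead of from pure noise. This sketch assumes a toy cosine schedule and toy shapes; the paper's actual refiner, schedule, and compact latent space are not reproduced here:

```python
import numpy as np

def alpha_bar(t, T=1000):
    """Toy cosine schedule: fraction of signal retained at timestep t."""
    return float(np.cos((t / T) * (np.pi / 2)) ** 2)

def shortcut_init(coarse_depth, t_start, rng, T=1000):
    """Noise a fast coarse depth map (e.g. from a ViT) to timestep t_start,
    so the diffusion refiner runs only t_start of T reverse steps."""
    ab = alpha_bar(t_start, T)
    x_t = (np.sqrt(ab) * coarse_depth
           + np.sqrt(1.0 - ab) * rng.standard_normal(coarse_depth.shape))
    return x_t, T / t_start  # initial latent and the resulting step speedup

rng = np.random.default_rng(0)
coarse = rng.standard_normal((4, 4))  # toy coarse depth map
x_t, speedup = shortcut_init(coarse, t_start=25, rng=rng)
print(speedup)  # 40.0: run 25 reverse steps instead of 1000
```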
A 2B parameter model trained on a new 1.1M dataset can now forecast remote sensing scenes better than Gemini-2.5-Flash Image, suggesting that task-specific training data and methods can beat sheer scale.
Generate realistic GPS trajectories across an entire nation with TrajFlow, a new flow-matching model that leapfrogs diffusion-based approaches in scale, diversity, and efficiency.
Forget finetuning rare tokens: MoKus leverages cross-modal knowledge transfer to bind diverse textual knowledge to visual concepts, achieving high-fidelity customized generation.
By decoupling patch details from semantics, Cheers achieves state-of-the-art multimodal performance at 20% of the training cost of comparable models.
Current embodied AI agents falter in the multi-floor environments produced by MANSION, a new language-driven framework for generating realistic, building-scale 3D scenes.
Achieve 92% accuracy in identifying who's commanding a robot from 34 meters away by fusing IMU and camera data, a 48% leap over prior art.
Floor plan generation gets a major upgrade with HouseMind, a multimodal LLM that uses discrete room-instance tokens to achieve unprecedented geometric validity and controllability.
Control both multi-subject identity and multi-granularity motion in video generation with DreamVideo-Omni, a framework that uses latent identity reinforcement learning to avoid identity degradation.
A new video-based reward model beats GPT-5.2 and Gemini-3 Pro at evaluating computer-using agents, offering a scalable, model-agnostic alternative to traditional methods.
By learning visual representations from scene-level semantics down to pixel-level details, C2FMAE overcomes the limitations of both contrastive learning and masked image modeling.
Forget training separate models for different fields of view in geo-localization — SinGeo achieves SOTA robustness with a single model, even outperforming specialized architectures.
Pathology MLLMs can now better incorporate diagnostic standards during reasoning, thanks to a new memory architecture inspired by how human pathologists process information.
Achieve safer and more effective human-robot collaboration by decoupling task execution from human interaction using a redundant robot's null space.
Fisheye cameras can now see the world in 4D, thanks to a new benchmark and method that tackles the unique distortions of spherical projection for improved occupancy tracking.
LLMs can significantly boost micro-expression recognition by reasoning about subtle facial muscle movements when guided by structured visual and relational prompts.
Get 2x faster video generation from diffusion transformers without sacrificing quality, thanks to a clever parameter-free error compensation technique.
Achieve nearly 2x speedup in Stable Diffusion 3 by intelligently stitching together large and small diffusion models at both the pixel and timestep level.
Achieve sub-millimeter accuracy in 3D reconstruction of flexible continuum robots by enforcing global biplanar geometric consistency, even with noisy or occluded images.
Text-to-image customization can now preserve the original model's behavior, thanks to a decoupled learning objective that balances new concepts with pre-existing capabilities.
Interpolating latent representations before decoding yields a reconstruction FID (iFID) that finally aligns with the generation FID of latent diffusion models, achieving ~0.85 correlation where standard rFID fails.
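The mechanism above can be sketched in miniature: score decoded *interpolations* of latents rather than exact reconstructions, so the metric probes where a generator actually samples. This toy uses an identity encoder/decoder and a diagonal-covariance Fréchet distance (real FID uses full covariances of Inception features), purely to show that iFID measures something plain rFID cannot:

```python
import numpy as np

def frechet_distance(x, y):
    """Fréchet distance between Gaussians fit to feature sets x and y
    (diagonal-covariance simplification of the full FID formula)."""
    mu_x, mu_y = x.mean(0), y.mean(0)
    var_x, var_y = x.var(0), y.var(0)
    return float(((mu_x - mu_y) ** 2).sum()
                 + (var_x + var_y - 2.0 * np.sqrt(var_x * var_y)).sum())

def interpolate_latents(latents, rng, n_pairs=2048, t=0.5):
    """Mix random latent pairs before decoding, evaluating the decoder
    off the exact encoder outputs."""
    i = rng.integers(0, len(latents), n_pairs)
    j = rng.integers(0, len(latents), n_pairs)
    return (1.0 - t) * latents[i] + t * latents[j]

rng = np.random.default_rng(0)
real = rng.standard_normal((2048, 8))   # toy "real" features
latents = real                          # toy identity encoder/decoder
rfid = frechet_distance(latents, real)  # plain reconstruction: trivially ~0
ifid = frechet_distance(interpolate_latents(latents, rng), real)
print(rfid, ifid)  # rFID is blind here; iFID is nonzero off the data points
```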
Ditch the optimization: MoRe achieves real-time 4D scene reconstruction from monocular video using a feedforward transformer that disentangles motion and structure.
Finally, AI can generate hour-long videos with consistent characters and backgrounds, thanks to a new framework that nails seamless transitions between shots.
By explicitly disentangling degradation and semantic features with wavelet attention, CWP-Net achieves superior all-in-one image restoration, outperforming previous methods hampered by spurious correlations and biased degradation estimation.
Color-invariant neural nets get a boost: representing saturation and luminance on a circle, not a line, unlocks true equivariance and avoids artifacts that plague existing methods.
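The circle-versus-line point above can be illustrated with the generic circular-encoding trick: map a bounded scalar to `(cos θ, sin θ)` so that values near the wrap-around boundary stay close in feature space. How the paper actually places saturation and luminance on the circle is not reproduced here; this is the bare geometric idea:

```python
import numpy as np

def circular_encode(x, period=1.0):
    """Map a scalar in [0, period) to a point on the unit circle, so
    nearby values stay nearby even across the wrap-around boundary."""
    theta = 2.0 * np.pi * (np.asarray(x, dtype=float) / period)
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1)

def circular_distance(a, b, period=1.0):
    """Chord distance between the circular encodings of a and b."""
    return np.linalg.norm(circular_encode(a, period)
                          - circular_encode(b, period), axis=-1)

# On a line, 0.0 and 0.99 are nearly maximally far apart; on the circle
# they are close, avoiding the boundary artifact a linear encoding creates.
line_gap = abs(0.99 - 0.0)
circle_gap = float(circular_distance(0.0, 0.99))
print(line_gap, circle_gap)
```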
Multimodal models are often blind at birth: a new "Visual Attention Score" reveals they struggle to focus on visual inputs during cold-start, but a simple attention-guided fix can boost performance by 7%.
Achieve state-of-the-art semantic scene understanding from sparse views with a feed-forward architecture that generalizes across diverse environments.
Ditch the linear CFG gains: Sliding Mode Control offers provably stable and semantically richer diffusion guidance, especially when you crank up the guidance scale.
Get 10x faster generative image compression on GPUs with ProGIC, a lightweight RVQ codec that doesn't sacrifice perceptual quality.
Achieve high-fidelity 3D human rendering from a single image by distilling priors from a multi-view diffusion model into 3D Gaussians, outperforming prior single-view reconstruction methods.
StegaFFD lets you hide faces inside other images to protect privacy during face forgery detection, achieving better accuracy and stealth than existing methods.
Achieve state-of-the-art image fusion and restoration in complex adverse weather by unifying infrared-visible fusion with compound degradation removal in a single Mamba-based model.
Synthesizing training data with foundation models and attending to wavelet domains can dramatically boost anomaly detection, even without fine-tuning or class-specific training.
AI-powered pathology slashes GTD diagnosis time by 71% while boosting accuracy, offering a lifeline for maternal health.