100 papers published across 4 labs.
Compressing multi-dimensional human preferences into single binary labels severely degrades DPO training, but a semi-supervised approach can recover state-of-the-art performance without additional human annotation.
Scaling visual preference optimization hinges on data quality, as a massive, high-resolution dataset renders complex optimization algorithms unnecessary.
Finally, a plugin framework that lets you mix and match KV-Cache, LoRA, and other controls to steer diffusion models without being locked into a specific backbone.
Finally, a single model that handles any segmentation task in both images and videos, understanding both text and visual prompts.
Decomposing image editing tasks into meta-tasks and aligning model reasoning with editing behavior unlocks surprising generalization to unseen editing operations.
Forget handcrafted prompts: a hierarchical multi-agent framework turns diffusion models into coherent storytelling engines by globally optimizing for semantic coherence.
Current VLM spatial reasoning benchmarks are misleading, as they often penalize models for "incorrect" answers that are actually correct given the limited visual information the models receive.
State-of-the-art shot boundary detection gets a major upgrade with a Transformer-based approach that not only improves accuracy but also offers more interpretable boundaries, thanks to a novel relational prediction framework and synthetic training data.
Ditching the vision encoder actually *improves* multimodal understanding at scale, proving that pixel embeddings alone can achieve state-of-the-art results in unified multimodal models.
Knowing *when* to listen to *which* sensor lets robotic fruit pickers predict failures before they happen, boosting accuracy to 90% even with minimal sensor sets.
Diffusion models, typically used for image generation, can now forecast infectious disease with accuracy rivaling traditional ensemble methods, offering a new tool for public health.
Unlock the secrets hidden in your lab's backed-up microscopy data: style transfer networks can now "re-imagine" images as if they were captured with different instrument settings.
Frozen vision-language models can dramatically improve abnormality grounding in rare disease imaging by iteratively refining decisions through optimized instructions and visual perturbations.
Decomposing robotic manipulation into coarse and fine-grained actions isn't just conceptually cleaner—it actually unlocks a sweet spot where learning difficulty is balanced, boosting performance.
Your sign language translation model's performance could be bottlenecked by your choice of pose estimator: switching from MediaPipe to SDPose or Sapiens could boost BLEU score by 1.5 points.
Object detection models are surprisingly vulnerable to practical backdoor attacks using real-world semantic triggers that work across different sizes, locations, and viewpoints.
Real-time differentiable rendering just got a whole lot faster: Power Foam unifies ray tracing and rasterization, rivaling 3DGS performance without sacrificing ray tracing benefits.
Achieve SOTA zero-shot segmentation by simply fusing two CLIP branches, one focusing on local token reliability and the other on structural priors, all without training.
Finally, a dataset exists to train and benchmark algorithms for automatically detecting airway bifurcations in 3D CT scans, a crucial step towards understanding respiratory diseases.
Even the best vision models make shockingly bad shape recognition errors, like confusing a car with a chair, when evaluated on a new viewpoint-invariant shape recognition benchmark.
Scaling up pathology foundation models doesn't guarantee better survival prediction—a distilled model with 8% of the parameters can outperform its larger teacher.
Road crack detection gets a boost by having the infrastructure tell the car where to look.
Agentic AI struggles with Earth Observation because reprojection, resampling, and other geospatial operations silently corrupt data, demanding a new agent design paradigm.
Cytogeneticists can now slash chromosome analysis time from days to seconds with Aycromo, an open-source platform that democratizes access to high-performance deep learning models.
Safe visuomotor control from high-resolution images is now practical at scale, thanks to a learned visual abstraction coupled with an efficient SLS solver.
Autoregressive image models can now compete with diffusion models in image quality and efficiency, thanks to a variable-length tokenization scheme that decouples compute from resolution.
Text-guided 3D medical image segmentation just got a whole lot more practical: ESICA achieves state-of-the-art accuracy with a "Lite" variant that slashes parameter count without sacrificing performance.
Interactive feedback slashes error rates in episodic memory retrieval, outperforming even large vision-language models while remaining efficient.
Text-to-video models can now learn geometrically consistent world dynamics via reinforcement learning, without expensive architectural changes.
Unlock species-agnostic 3D tracking from standard drone footage with WildLIFT, turning 2D video into structured, viewpoint-aware representations for richer wildlife analysis.
Test-time adaptation of vision-language models can actually *hurt* performance when modalities shift asymmetrically; MG-MTTA fixes this by explicitly modeling modality reliability.
Turns out, your image-generating diffusion model already knows how to segment anything you ask it to.
Achieve real-time, accurate image reconstruction from sparse Laplacian fields using a wavelet neural network with only 200 parameters.
Robots can now understand human intentions with near-human accuracy thanks to a new video-language model that reasons about goals like a human.
Radar odometry, typically confined to urban settings, can be pushed off-road with simple adaptations like IMU preintegration, but still faces significant challenges in unstructured environments.
Encoding vehicle trajectory directionality via HSV rasterization unlocks accurate lane-level HD map generation from crowdsourced data using a DETR architecture.
An open-source autonomous driving platform offers researchers a modular, scalable, and cost-effective alternative to complex and restrictive hardware validation setups.
Robots can now "see" and understand doorways, enabling more robust navigation in complex indoor environments.
Low-cost stereo vision can rival LiDAR for real-time windrow detection, paving the way for more accessible autonomous farming solutions.
Simulate once, deploy anywhere: SPLIT lets you train tactile perception models on synthetic data and transfer them across different sensors without retraining.
Forget clunky animation pipelines – MotionBricks lets you assemble real-time, high-quality character motions like LEGOs, even controlling robots.
Current event-based SLAM algorithms falter when faced with the full complexity of high-speed, 6-DoF maneuvers, highlighting a gap between current capabilities and the promise of event cameras.
Score-based diffusion models can now generate robust guiding vector fields for robotic path following, even when traditional methods stumble on unordered, branching, or probabilistically-generated paths.
Forget end-to-end fine-tuning: $M^2$-VLA unlocks the power of generalized VLMs for robotic manipulation by intelligently mixing layers and incorporating meta-skills.
Ditch silicon bottlenecks: a novel optoelectronic correlator uses cold atoms to accelerate 3D CNNs by orders of magnitude.
Compiling and executing YOLO-NAS on an FPGA-based accelerator is now possible, opening doors for real-time object detection in safety-critical applications like aeronautics.
Ditch the prompts: DiffuSAM adapts SAM2 for medical image segmentation by synthesizing mask embeddings with a diffusion model, achieving strong performance without fine-tuning or expert input.
Self-supervised vision models that ace linear probing can still flop at semantic image retrieval because of skewed latent space geometry that breaks approximate nearest neighbor search.
Quantum kernels unlock signal in medical image embeddings where classical methods fail, suggesting a new path for extracting value from medical foundation models.
Visual RL agents can recover near-perfect performance even under severe, dynamically changing visual corruptions by learning to disentangle task-relevant foreground from perturbation artifacts.
Open-source diffusion models can now achieve state-of-the-art illumination control rivaling closed-source alternatives, thanks to a novel training pipeline and dataset.
Achieve millisecond-level 3D point cloud reconstruction from a single image without sacrificing quality, at far lower latency than diffusion models.
CLIP models, despite their prowess, stumble when understanding 360° images, failing to maintain semantic alignment under horizontal circular shifts.
Unified multimodal models can ace visual understanding and generation tasks, yet still fail to maintain basic semantic consistency between them.
A new large-scale dataset of human-annotated video crops enables training models that adapt videos to different aspect ratios while preserving visual quality and meaning.
You don't need billions of parameters to accurately ground GUI elements: GoClick, a 230M parameter model, matches the performance of much larger models, opening the door for on-device GUI agents.
Achieve surgical 3D edits without training: Prox-E lets you reshape objects with language by manipulating a compact set of geometric primitives.
Disentangling high-level cross-modal reasoning from low-level modality-specific refinement in talking head generation yields superior lip-sync accuracy, video quality, and audio quality compared to entangled approaches.
By reconstructing extractions and comparing them to the original document, RaV-IDP offers a grounded, label-free quality signal that dramatically improves the fidelity of intelligent document processing pipelines.
Current 3D anomaly detection struggles with real-world complexity, but this new approach directly models inlier feature distributions, achieving state-of-the-art results and offering a more robust solution.
FlowAnchor makes flow-based video editing robust to multi-object scenes and long sequences by stabilizing the editing signal, opening the door to more complex and controllable video manipulation.
Existing document OCR models fail to preserve crucial structural and executable properties of LaTeX, but TexOCR, trained with verifiable rewards, finally delivers compilable page-to-LaTeX reconstruction.
RGB-pretrained ViTs adapted with viewpoint-conditioned feature selection outperform existing methods in thermal vehicle re-identification by a significant margin.
Unlock the secrets of historical keyboard performance with PHOTON, a non-invasive optical tracking system that reveals the subtle interplay between performer input and instrument mechanics.
SAD offers a surprisingly fast and accurate alternative to neural implicit representations for image compression and differentiable rendering, achieving 4-19x training speedups while outperforming state-of-the-art methods like Image-GS.
Unlock reusable architectures for climate data super-resolution: a single diffusion model now handles spatial upscaling from 1x to 25x and temporal upscaling from 1x to 6x.
Deepfakes betray themselves through subtle irregularities in the timing of facial movements, especially when expressing emotions, offering a new avenue for detection.
Ramen achieves robust test-time adaptation of VLMs in mixed-domain scenarios by selecting the right samples to adapt to, sidestepping the common pitfall of performance degradation when faced with diverse and inconsistent test data.
Persistent homology, when applied to eye-tracking data via novel filtration techniques, unlocks dyslexia detection performance exceeding traditional statistical methods.
Volatile memristors can achieve state-of-the-art image classification accuracy in reservoir computing, even with significant device variability, suggesting they are a viable alternative to traditional CMOS.
VARestorer distills a text-to-image VAR model into a one-step super-resolution network, achieving state-of-the-art image quality with a 10x speedup.
Stimuli on which vision models agree most strongly are the ones that drive alignment with language models, doubling cross-modal convergence.
Stop punishing your model for disagreeing with corrupted data – Trust-SSL learns better representations by treating alignment with degraded views as a residual learning problem, not a hard constraint.
Learnable critics that evaluate the model's own GUI grounding proposals, rather than relying on static geometric heuristics, unlock substantial gains in accuracy.
Quantum trajectory reversal, previously understood through specific feedback protocols, is now shown to be fundamentally linked to score-based diffusion, opening the door to ML-driven control in noisy, real-world quantum systems.
Autoregressive video diffusion models can achieve faster decoding, lower memory footprint, and higher quality long-horizon generations by learning to attend to only the most salient spatiotemporal blocks.
Forget repeatedly re-running inference on residual graphs: this GNN-guided Ford-Fulkerson algorithm learns edge importance probabilities to dramatically accelerate max-flow computation and image segmentation.
Your camera's AI could be subtly rewriting reality, but this method lets you reverse the changes and see the "unhallucinated" original.
A new synthetic aerial imagery dataset provides pixel-perfect depth, controlled illumination, and multi-scale imagery, unlocking joint research across geometric understanding, domain robustness, and resolution enhancement.
Mimicking how clinicians review capsule endoscopy videos—first screening, then weaving context, and finally converging evidence—yields surprisingly effective summarization of these ultra-long videos.
Achieve high-fidelity image enhancement on mobile devices even after quantization by training a model that anticipates and adapts to low-precision representations.
Achieve state-of-the-art image quality assessment by causally disentangling content and degradation, even in data-scarce domains where existing methods fail.
Achieve competitive video copy detection accuracy with descriptors orders of magnitude smaller and inference speeds exceeding 11k samples per second by replacing floating-point operations with a learned Boolean circuit.
LLM-driven visual agents form complex communication structures, but stubbornly resist stylistic convergence, revealing a fundamental tension between social expression and individual identity.
Forget hand-annotated visual reasoning datasets: VG-CoT leverages a fully automated pipeline to generate grounded, step-by-step reasoning, enabling scalable and cost-efficient training of more trustworthy LVLMs.
Imagine reconstructing detailed human motion and scene layouts using just your smartwatch and earbuds – no cameras needed.
Forget generating static shapes – Sculpt4D now lets you efficiently sculpt dynamic 4D objects with state-of-the-art temporal coherence.
Training a video reshooting model on internet-scale monocular videos is now possible, thanks to a clever self-supervision trick that generates multi-view training data from a single video.
Domain shifts and novel classes at test time can be tamed by nudging features back towards the source distribution, even for out-of-distribution examples.
Despite achieving comparable accuracy, humans and deep vision models exhibit fundamentally different error patterns, revealing distinct inductive biases that can be quantified through directional confusion analysis and Rate-Distortion geometry.
VLMs can reliably reveal population-level trends in climate change discourse on social media, even when per-image accuracy is only moderate.
Current video Q&A benchmarks can be gamed through textual regularities, so a high score does not show that a model grounds its reasoning in the video's physical reality.
Super-resolution is possible without image priors by cleverly combining low-resolution images at different scales, unlocking a stable inverse system for reconstruction.
Multi-modification image retrieval is now possible: TEMA handles complex, real-world instructions that go beyond simple changes, outperforming existing methods on new datasets M-FashionIQ and M-CIRR.
Seemingly innocuous choices in loss functions and training regimes can significantly hinder visual geometry estimation, even for state-of-the-art methods.
Turn your 3D Gaussian Splatting failures into features: DualSplat uses initial reconstruction artifacts to bootstrap robust scene representations in the presence of transient objects.
Achieve millimeter-level accuracy in 3D human body fitting from multi-modal inputs, even with scale distortion common in AI-generated assets.
Decomposing inputs into functional components lets you spot subtle, compositional anomalies that global or patch-based OOD detectors miss.
Frozen vision foundation models can be surprisingly effective at improving out-of-domain object detection by stabilizing relational modeling and semantic-spatial alignment in the detector.
Achieve a 10x speedup in detecting tiny objects in massive satellite images without sacrificing accuracy, even on a single GPU.