46 papers published across 3 labs.
Forget bulky atlases and unreliable image searches: MIRAGE offers medical students a free, interactive tool to retrieve, generate, and understand medical images using only open-source models.
Interactive 3D asset generation can now be driven by functional logic and hierarchical physics, thanks to a new framework that synthesizes simulation-ready assets.
Synthetic data augmentation and per-language threshold tuning can significantly boost the performance of LLMs on multilingual tasks, outperforming alternative architectures that showed promise on the development set.
Don't let your materials science dataset become obsolete: a diversity-aware construction framework can boost performance on both targeted and *untargeted* properties by up to 40%.
Training data order matters more than you think: reordering your data can significantly improve unsupervised domain adaptation by reducing variance in domain discrepancy estimates.
Stop wasting time and resources on massive localization datasets: this framework achieves highly accurate outdoor localization by adaptively switching between offline and online learning strategies based on data availability.
Tabular data synthesis no longer needs to sacrifice privacy for quality: pretraining on diverse datasets lets models generalize from limited context, breaking the traditional tradeoff.
Existing causal discovery methods can be dangerously wrong when data is missing, but PAIR-CI slashes false positives by directly accounting for imputation errors, leading to more accurate causal graphs.
Federated learning struggles when data quality varies across clients, but FedQual solves this with a novel approach that calibrates low-quality clients while preserving high-quality autonomy.
Incomplete one-hot encoding during FMQA's initial training phase can be overcome with space-filling sampling methods, leading to improved optimization performance.
Frontier LLMs are leaving 70% of relevant pharmaceutical assets undiscovered, a gap that can be largely closed by swapping generic web search for a curated index.
Unsupervised object detection can now achieve category awareness, bridging the gap with supervised methods without needing any labeled data.
Synthesizing high-resolution satellite imagery with geometric precision is now more efficient, thanks to a windowed cross-attention method that rivals existing techniques while better respecting geometric constraints.
Dissimilarity, not just similarity, unlocks better language generalization for low-resource varieties.
Unlock Tajik NLP: a new open-source toolkit delivers a comprehensive pipeline for processing Cyrillic-script Tajik text, complete with datasets and pre-trained embeddings.
Standard data anonymization techniques crumble when outliers are present; ICSA offers a robust alternative that maintains utility while providing stronger privacy guarantees.
Even subtle, functionality-preserving manipulations of malware binaries can cripple detection pipelines, demanding a rethink of pre-ingestion validation.
Training on Syn4D could unlock breakthroughs in dynamic scene understanding, where current datasets fall short in providing dense, complete, and accurate geometric annotations.
Even with limited data, a simple combination of pre-trained CNN features and nearest-centroid classification can achieve surprisingly strong results in monkeypox skin disease classification.
Stop relying on significance tests that only find differences: this Bayesian framework tells you if your synthetic data is *practically equivalent* to real-world data for your specific safety assessment task.
Generate more realistic and diverse safety-critical autonomous vehicle scenarios by using conditional latent flow matching to bridge the gap between real-world and simulated data.
Fine-grained analysis of user behavior on search engine results pages is now possible thanks to AllSERP, which adds exhaustive per-element annotations to the AdSERP dataset, covering organic results and widgets in addition to ads.
LEGO's modular design lets you detect deepfakes with 10x less training data and far fewer epochs, all by focusing on the unique fingerprints of each image generator.
Generating synthetic training data with multi-modal diffusion beats hand-crafting better detection architectures for PCB defect inspection.
Quickly sanitize your engagement recognition models after training: subject-level unlearning recovers ~90% of retraining benefits at 25% of the cost.
Top-view RGB-D person re-identification is surprisingly feasible, even across modalities, despite the inherent challenges of viewpoint and modality variations.
Existing hallucination detection methods miss subtle, word-level medical errors, but a new data-centric pipeline and detector close the gap by 15%.
Model collapse isn't just a technical problem; it's a threat to AI democratization that will widen the gap between high- and low-resource communities.
Forget scaling laws: QLoRA-tuned Mistral 7B crushes other architectures for low-resource Tajik text generation, highlighting the importance of architecture choice in PEFT.
Stack Overflow code quality varies significantly across US states, with major tech hubs surprisingly not producing the highest quality code.
AI data annotation companies are publicly framing human expertise as a commodity ripe for disruption, potentially devaluing traditional forms of knowledge and institutional authority.
LLMs can achieve surprisingly high precision in smart contract vulnerability detection, but only with vulnerability-specific prompts and AST-based context.
Finally, a zero-knowledge data valuation system that scales: ZK-Value proves Shapley values in seconds to minutes, beating specialized ZK baselines by over an order of magnitude.
Unlock agile humanoid robots by ditching teleoperation and training directly from human VR demos.
Active learning guided by transition path sampling overcomes the limitations of machine-learned potentials in transition-state regions, enabling accurate and efficient simulation of rare events without prior mechanistic knowledge.
Pretrained MLIPs already encode sufficient information in their latent spaces to guide active learning, enabling efficient fine-tuning without uncertainty quantification.
Forget federated learning: bioacoustic classifiers can be unified across 661 species by simply averaging independently trained task vectors, unlocking a collaborative, privacy-preserving paradigm.
Fine-tuning dense retrievers on a mix of domain-specific and general question-answering data achieves surprisingly robust performance across diverse legal search tasks, outperforming models trained solely on legal data.
Ditch the brittle RAG stack: a unified PostgreSQL data layer slashes latency by up to 92% and eliminates data leakage, making production RAG finally reliable.
Conformal prediction offers a surprisingly effective way to handle both modality imbalance and noisy corruption in multimodal learning by explicitly modeling predictive uncertainty during training.
Achieve state-of-the-art object detection in multi-camera surveillance without compromising data privacy by fusing models trained on synthetically augmented and federated data.
Transfer learning from a large, pre-trained speech synthesis model unlocks high-quality Tibetan TTS, even with limited Tibetan-specific data.
Synthetic data closes the Indic ASR gap where commercial and open-source systems fail, boosting entity recognition by up to 22x.
Combining diffusion models with image-to-image translation yields surprisingly realistic synthetic data, outperforming either method alone in closing the sim2real gap.
Offloading geospatial data sampling to the edge slashes latency and bandwidth costs, achieving cloud-competitive accuracy with 80% less data.
LLM-powered data augmentation combined with rule-based pre-processing unlocks surprisingly high NER accuracy in low-resource domains, even with limited training data.