100 papers published across 10 labs.
Automating museum video metadata curation is now possible with a locally deployable video language model, unlocking previously inaccessible audiovisual archives.
Even the best LLMs struggle with multi-turn medical dialogues, with error rates tripling by the third turn and a single wrong answer significantly increasing the probability of subsequent errors.
Forget retraining from scratch: incremental federated learning can keep your IoT intrusion detection models sharp against evolving threats, but the right update strategy is crucial for balancing accuracy and speed.
Reading Activity Traces (RATs) reveal the hidden creative work lost when algorithms automate interpretation, offering a path to design AI that preserves human insight.
LLMs can be made better software engineers by pre-training them to reconstruct the messy, iterative development process that led to the final, clean code in repositories.
A massive, bilingual, authority-grounded dataset could finally make AI-assisted cataloging a reality.
A new, large-scale diachronic corpus for Sinhala, SiDiaC-v.2.0, offers a crucial resource for NLP research on this low-resource language, enabling studies of linguistic change and historical text analysis.
Multilingual math reasoning just got a serious upgrade: mAceReason-Math offers a meticulously translated and cleaned dataset of challenging problems across 14 languages, purpose-built for RLVR training.
Luxembourgish news reveals a surge in code-switching and morphologically adapted borrowings, primarily from French, challenging simple document-level mixing indices.
Forget expensive LLM inference for MTQE: train a COMET model on GPT-4o-generated annotations and get competitive performance.
Train web-navigating agents in safe, scalable, and verifiable synthetic environments automatically cloned from real websites, sidestepping the risks and limitations of real-world interaction.
Overcoming the data scarcity bottleneck in robotic arm-hand coordination, FAR-Dex achieves over 80% real-world success in fine-grained dexterous manipulation tasks.
Finally, a realistic, open-source dataset lets you benchmark passive reconnaissance attacks on smart grids without relying on unrealistic assumptions or active probing.
State-of-the-art skeleton-based action recognition is now possible through a game-theoretic contrastive learning framework that maximizes action-relevant information while minimizing encoding redundancy.
Skip the expensive proxy model training: this training-free method boosts VLLM performance by up to 4.8% using only 10-15% of the data, simply by measuring how much the question *changes* the model's view of the answer.
Forget laboriously sifting through layers or datasets for PEFT: GAST co-optimizes both, adaptively picking the most impactful data for each layer based on gradient alignment.
Modern speech enhancement algorithms may not improve ASR performance in realistic noisy environments, challenging assumptions about their effectiveness in real-world applications.
Get 6x the RLHF alignment gains for your LLM with a new active learning pipeline that focuses annotation on the most informative response pairs.
Achieve real-time super-resolution ultrasound without labeled data using CycleULM, a CycleGAN-based framework that boosts image contrast by 15.3 dB and localization precision by 46%.
Despite ChatGPT's known flaws, it can generate surprisingly realistic synthetic system requirement specifications that fool experts more often than you'd expect.
A new large-scale dataset could jumpstart Vietnamese VQA research by providing a crucial resource for training and evaluating multimodal models in a low-resource language.
VLMs can now self-evolve from *zero* data, thanks to a multi-agent RL framework that synthesizes its own visual concepts and reasoning tasks.
Bridge the gap between sparse core samples and continuous wellbore data with a cGAN that synthesizes realistic subsurface images conditioned on well log porosity.
Rényi differential privacy unlocks tighter privacy guarantees in partition selection, but releasing partition frequencies comes at a cost.
Forget expensive fine-tuning: FoodOntoRAG links food entities with near-SOTA accuracy while adapting to evolving ontologies using a clever RAG architecture with retrieval, selection, scoring, and synonym generation agents.
Forget expensive human annotations: LLMs can reliably generate synthetic data to validate NLP evaluation metrics, even outperforming human agreement in some multilingual tasks.
Text prompts might be inflating your SLLM's performance: spoken prompts reveal a significant performance gap, especially in low-resource languages.
Achieve up to 23% better prediction accuracy in manufacturing surrogate modeling by jointly modeling inter-task similarity and data fidelity using a hierarchical Bayesian approach.
Evaluating classification models on biased data can mask true performance and fairness, but this work provides a framework to create unbiased test sets that reveal the real impact of different biases and mitigation strategies.
Forget generic fine-tuning data — Bloom's Taxonomy-based data generation can boost LLM performance in complex engineering domains like space situational awareness by up to 176%.
Correcting systematic errors in aggregate data is now possible by using proxy variables to disentangle true signals from biases via a VAE-based framework.
Even when paraphrasing content that explicitly contradicts a teacher's preferences, language models can still subliminally learn those preferences, raising serious concerns about bias propagation in self-training scenarios.
Domain-specific prompts can significantly boost document layout analysis, achieving state-of-the-art results by explicitly guiding models with dataset-aware cues.
A meticulously curated, bidirectional English-German corpus of parliamentary proceedings now offers researchers a goldmine for dissecting the nuances of translation, interpreting, and language variation through an information-theoretic lens.
LLMs can generate spatial relation labels that align with human judgments, offering a scalable path to richer, multilingual spatial datasets.
A new OCR pipeline slashes error rates on noisy, polytonic Greek texts, opening up a vast historical corpus for NLP research and LLM training.
Current methods struggle to understand human behavior in industrial settings, as evidenced by the challenging ENIGMA-360 dataset of synchronized ego-exo videos.
Stop generating superficial reviews: RbtAct leverages rebuttals to train LLMs to provide actionable feedback, leading to concrete revisions and improved author uptake.
Finally, a comprehensive dataset unlocks the potential to develop and validate advanced control and estimation algorithms tailored for the unique challenges of nano-quadrotors.
Bridging the gap between CT and scarce CBCT data, a novel UDA framework achieves state-of-the-art liver segmentation by reformulating Margin Disparity Discrepancy.
Dataset condensation, previously limited to neural networks, can now democratize access to clinical data by enabling privacy-preserving training of classical models like decision trees and Cox regression.
Synthetic data, when grounded in vision-language models for evaluation, demonstrably boosts performance in remote sensing tasks like segmentation and captioning, outperforming models trained solely on real-world data.
By integrating physical constraints with adaptive representation learning, TAM-RL substantially enhances the accuracy of global carbon flux estimates, outperforming existing methods.
Imperfect code from LLMs can still teach AI to understand circuit structure, unlocking a scalable path to netlist representation learning without expensive, clean datasets.
Forget manual labeling: influence functions can automatically surface high-quality robot demonstrations, boosting policy performance by intelligently curating training data.
Achieve near-perfect privacy against clustering and inversion attacks in split learning without sacrificing model accuracy by using differential privacy and secret label obfuscation.
State-of-the-art semi-supervised domain generalization (SSDG) methods crumble when faced with the real-world challenge of long-tailed class distributions, but IMaX offers a simple, effective fix.
MLLMs can generate surprisingly effective synthetic training data for defect classification, boosting performance by 20% even with very limited real data.
A modular statistical transformation pipeline boosts audio deepfake detection accuracy by 10.7% in cross-domain scenarios, without needing labeled target data.
Forget more data: pre-training on just 164M tokens of synthetic data from Neural Cellular Automata can outperform pre-training on 1.6B tokens of natural language for downstream LLM tasks.
Emirati Arabic finally gets a dedicated, sociolinguistically rich speech corpus, opening doors for better ASR/TTS in this low-resource language.
Stop wasting time on manual LLM domain adaptation: AutoAdapt automates the process and boosts accuracy by 25% over existing AutoML methods.
Finally, a dataset that tackles the virtual try-on problem head-on with paired, multi-view fashion data, realistic garment dynamics, and rich annotations.
Decentralized z-anonymity is now practical: deZent achieves comparable performance to centralized approaches while minimizing reliance on a trusted central entity.
Robots can now learn manipulation skills from ordinary human videos, thanks to a 3D point tracking method that bridges the embodiment gap and requires only 20 robot demonstrations.
By dynamically adjusting contrastive learning temperatures based on data density, MM-TS achieves state-of-the-art results on multimodal long-tail datasets.
FedLECC slashes communication overhead in federated learning by 50% while boosting accuracy by 12%, all by cleverly picking clients based on data similarity and loss.
MLLMs can now reliably interpret electromagnetic signals even in noisy environments, thanks to a new training framework and benchmark designed specifically for this challenging domain.
Unlock AV speech recognition for any language, even with zero labeled video data, by training on synthetically generated talking-head videos.
LLMs often prefer awkward, literal translations over natural-sounding alternatives, even when the original source text is removed.
Federated differentially private data synthesis can now achieve utility comparable to centralized approaches, even with heterogeneous data distributions, thanks to a novel framework that smartly handles noise and redundancy.
By synthesizing outliers that respect the learned manifold structure, GCOS enables deep networks to more robustly distinguish between in- and out-of-distribution samples, leading to state-of-the-art performance on near-OOD detection.
Noisy issue descriptions holding back your software agent? SWE-Fuse unlocks 60% higher solve rates by fusing issue-guided and issue-free training trajectories.
Instead of discarding noisy pseudo-labels in image restoration, QualiTeacher leverages them by teaching the model to understand and even surpass the quality levels they represent.
Scale qualitative analysis of educational discourse data without sacrificing rigor using a mixed-initiative system that orchestrates LLMs and human expertise.
Achieve 40% better fraud detection by ditching standard gradient descent for a fractional calculus optimizer that remembers the past.
A Shapley-incentivized blockchain boosts federated learning accuracy by 14% and thwarts 90% of malicious attacks in high-speed rail data sharing.
Reported successes in reconstructing PII from sanitized documents may be overstated due to data leakage, leaving the true vulnerability of PII removal techniques uncertain.
A 3B model can match the performance of models more than twice its size in mobile GUI automation by distilling visual history into concise natural language summaries.
A million-scale dataset of globally diverse, cross-modal geo-location pairs, coupled with a novel physical-law-aware network, leapfrogs existing CMGL benchmarks and opens the door to truly universal positioning systems.
A unified framework and comprehensive evaluation reveal the surprisingly nuanced performance of diffusion-based data augmentation, showing where it shines and where it falls short in low-data image classification.
State-of-the-art SLAM algorithms can fail to re-localize in changing seasons, as highlighted by a new multi-modal, year-long boreal forest dataset.
Forget expensive data collection: Seed2Scale leverages a small-model/large-model synergy to self-generate high-quality embodied AI training data, starting from just four seed demonstrations.
A new multilingual benchmark dataset with over 2,500 annotations of personal information enables privacy-preserving machine learning across ten languages, sidestepping the need for sensitive patient data.
LLMs can bootstrap high-quality legal argument mining datasets at scale, but only with careful human-in-the-loop refinement to correct ~20% of initial errors.
Current machine unlearning methods for recommender systems struggle with robustness and sequential deletions, especially in attention-based and recurrent models, highlighting a critical gap ERASE helps to expose.
Forget generic robot demos – this work introduces a complete pipeline and dataset for AI-powered massage robots that can understand language and identify acupoints.
A new 30B open-weight LLM trained on 34 European languages achieves state-of-the-art performance on low-resource languages with significantly less compute, proving that clever training beats brute force.
Forget massive datasets – targeted training on a smaller, carefully curated dataset of challenging competitive programming problems yields 3x faster gains in code generation performance.
Reddit's political echo chambers aren't just a vibe: they're a quantifiable force field that hardens opinions through self-selection, with exposure to opposing views doing little to soften them.
LLMs still struggle with complex legal reasoning, as evidenced by their difficulty in solving Islamic inheritance cases, even with a new dataset designed to support step-by-step reasoning.
Forget expensive, noisy recordings: this procedural engine sound dataset offers 19 hours of clean, annotated audio for training better automotive sound AI.
Tired of fragmented datasets? SeDa unifies 7.6M+ datasets from 200+ platforms with semantic annotation and provenance tracking, making cross-domain data discovery a breeze.
Forget massive multilingual models: fine-tuning on just 5 hours of speech data from a related language slashes ASR error rates for an endangered language, rivaling the performance of Whisper-Small.
Forget hand-annotated 3D datasets: a new automated pipeline generates massive, high-quality 3D spatial intelligence from raw video, unlocking better VLM reasoning.
Forget scaling laws: targeted data engineering, specifically multi-stage distillation and difficulty-aware sampling, allows an 8B model to outperform larger open-source financial LLMs.
Forget re-prompting or inversion: MedSteer lets you surgically edit endoscopic images by steering diffusion model activations, creating perfectly matched counterfactuals with 95% concept flip rates.
Achieve sharper, more accurate infrared super-resolution in real-world conditions by disentangling thermal and structural degradations with a novel autoregressive framework.
Open-set corrective assistance, requiring models to inspect lengthy user behavior and provide corrective actions or language-based feedback, remains a significant challenge even with fine-tuning on diverse interactive data.
Weak LLMs, when strategically leveraged via confidence-based sample weighting, can not only drastically cut preference alignment costs but also surpass the performance of models trained on full human-labeled datasets.
Replaying generic pre-training data during fine-tuning boosts target task performance by up to 2x, challenging the common practice of minimizing its use.
Forget hand-crafted curricula: TSE-Datamap leverages training dynamics to automatically surface optimal learning schedules for target speaker extraction.
Omnidirectional imagery + language unlocks robust multi-object tracking that overcomes the field-of-view limitations plaguing conventional video datasets.
A new cross-linguistic phoneme recognition system, BabAR, finally unlocks scalable analysis of early childhood speech development.
A new Transformer architecture, IAENet, predicts multiple interdependent surgical complications more accurately by explicitly modeling event co-occurrence and handling data heterogeneity.
Forget local geometry – this dynamic data selection method uses a sparse autoencoder to prioritize samples covering frequent feature factors, leading to 2x training acceleration.
Unlock scalable CAD generation from unannotated 3D meshes with DreamCAD, a framework that directly produces editable BREPs from point-level supervision, outperforming existing methods and achieving over 75% user preference.
Curriculum reinforcement learning closes the distributional gap between pre-trained MLLMs and KB-VQA, yielding SOTA results by strategically generating and sampling training data.
A new, meticulously cleaned corpus of Sinhala legal texts opens the door for NLP research in an under-resourced language.
Human annotation errors in cross-cultural micro-expression datasets can be significantly reduced by dynamically re-selecting keyframes, leading to more accurate recognition.