Search papers, labs, and topics across Lattice.
86 papers published across 4 labs.
VAANI's open-sourced dataset offers unprecedented coverage of India's linguistic landscape, finally giving researchers the data needed to build truly inclusive speech models.
Forget privacy concerns: you can train high-performing deep learning models for dynamic MRI reconstruction using *synthetic* fractal data.
Forget hand-crafted prompts and seed data: Simula lets you generate high-quality synthetic datasets at scale by simply defining the reasoning characteristics you want.
Chess transformers trained solely on move sequences face a "dual-capability bottleneck" where excelling at both state tracking and decision-making requires carefully balancing data diversity and quality, a tension that simple scaling cannot resolve.
An 8B open-source model, trained with a new closed-loop environment for 6G network management, achieves performance comparable to GPT-4, suggesting a viable path to autonomous network control.
Bilingual language models can achieve performance comparable to monolingual models in both languages, challenging the assumption that bilingual input poses significant learning obstacles.
AI-generated image forgery detection gets a major boost with PromptForge-350k, a dataset so large and well-annotated it pushes IoU scores 5% higher and generalizes to unseen models.
News agencies reuse content across languages far more than simple lexical overlap reveals, with over half of articles drawing on foreign sources through paraphrase and compositional techniques.
Physical AI systems struggle not with visual recognition, but with understanding space, physics, and action – and PRISM, a new retail video dataset, dramatically closes this gap.
Training NER models on modern Italian won't cut it for historical texts: ENEIDE exposes the performance gap with a new multi-domain dataset spanning two centuries.
Unlock knowledge equity for underserved languages: L-ReLF offers a reproducible recipe for creating high-quality lexical datasets where they're needed most.
Thiomi slashes Swahili ASR error rates by 61% and unlocks nine more African languages for multimodal AI, proving community-driven data collection can leapfrog existing benchmarks.
Japanese entity linking gets a boost: CADEL offers a high-quality, Japan-specific corpus to tackle the unique challenges of linking entities in administrative web documents.
Proprietary language models trounce open-source alternatives by 3-6x on a new, large-scale corpus of Sinhala and Pali Buddhist texts.
The first publicly available dataset for Syrian Arabic Sign Language (SyArSL) opens the door for machine translation research to improve accessibility for a historically underserved community.
LLMs can better capture human semantic similarity by predicting sets of related concepts instead of single next tokens.
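The set-versus-single-token idea can be illustrated with a toy comparison: score word similarity by the overlap of each word's predicted related-concept set rather than by a single next token. The word lists and Jaccard scoring below are illustrative assumptions, not the paper's actual method.

```python
def jaccard(a, b):
    """Overlap between two concept sets, in [0, 1]."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical "related concept" sets a model might predict for each cue word.
related = {
    "doctor": ["nurse", "hospital", "patient", "medicine"],
    "nurse":  ["doctor", "hospital", "patient", "care"],
    "banana": ["fruit", "yellow", "peel", "monkey"],
}

# Set-based similarity: semantically close words share many predicted concepts.
sim_doc_nurse = jaccard(related["doctor"], related["nurse"])
sim_doc_banana = jaccard(related["doctor"], related["banana"])
assert sim_doc_nurse > sim_doc_banana
```

A single next-token prediction collapses this comparison to one guess per word; comparing sets retains graded similarity structure closer to human judgments.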
Stop treating inter-rater reliability as a simple green light for "ground truth" in AIED – your data's probably messier than you think, especially with LLMs in the mix.
Synthetic data, when carefully aligned with real-world characteristics, can boost hand-object interaction detection by over 11% even when real labeled data is scarce.
Vision-language models falter at the fine-grained temporal recognition crucial for surgical video understanding, while SurgRec excels.
YOLOv11 crushes the competition in form element detection, showcasing its potential for automating document processing across diverse real-world forms.
Stop training your image restoration models to mimic flawed ground truth; instead, explicitly optimize for perceptual quality using a plug-and-play module guided by No-Reference Image Quality Assessment.
VLMs struggle with Earth observation tasks involving complex land use, but a new dataset with nearly 10 million text annotations could change that.
FlowID enables forensic facial reconstruction on damaged faces with better identity preservation and lower computational cost than existing methods, potentially accelerating victim identification in violent deaths.
Stop averaging prototypes blindly: FedDBP uses Fisher information to intelligently fuse local prototypes, significantly boosting performance in heterogeneous federated learning.
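The contrast between blind averaging and Fisher-guided fusion can be sketched in a few lines. The per-dimension diagonal-Fisher weighting and all names below are illustrative assumptions, not FedDBP's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each client holds a local class prototype plus a per-dimension Fisher
# estimate (e.g., accumulated squared gradients) indicating how informative
# each dimension of its prototype is.
protos = [rng.normal(size=4) for _ in range(3)]
fishers = [rng.uniform(0.1, 1.0, size=4) for _ in range(3)]

# Naive fusion: a plain average treats every client's estimate as equally
# reliable, which is exactly what heterogeneous data breaks.
naive = np.mean(protos, axis=0)

# Fisher-weighted fusion: per dimension, clients with higher Fisher
# information dominate the fused prototype.
total_fisher = np.sum(fishers, axis=0)
fused = np.sum([f * p for f, p in zip(fishers, protos)], axis=0) / total_fisher

print("naive:", naive.round(3))
print("fused:", fused.round(3))
```

Because the weights are a per-dimension convex combination, the fused prototype always stays within the range spanned by the clients' local prototypes.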
Publicly available satellite imagery can now estimate building heights with state-of-the-art accuracy thanks to a new dataset and network architecture designed for the task.
Policies trained with GenSplat maintain robust performance under severe spatial perturbations where baseline methods completely fail, thanks to its novel 3D Gaussian Splatting-based augmentation.
Forget "spread" voicings: skewness is the key to clarity in piano chords, offering a fresh perspective on psychoacoustic principles.
Unbalanced class prevalence, not just disjoint label sets, is the dominant factor hindering federated learning performance under label-space heterogeneity.
Unlock new insights into rapid software development and collaboration with a massive dataset of over 100,000 hackathon projects.
Anticancer drugs, whether organic or inorganic, can now be understood through a single unified representation, unlocking knowledge transfer between previously siloed chemical domains.
Graph condensation, while shrinking massive datasets for GNN training, can inadvertently amplify biases – until now.
Multi-view learning with prototype-based correction significantly boosts the robustness of thyroid nodule ultrasound classification across different ultrasound devices and clinical environments.
Imperfect quantum data won't stop machine learning models: this work shows how unsupervised domain adaptation on classical shadows can bridge the gap.
Unlock hidden predictive power: NLP on unstructured clinical notes beats traditional EHR data for early disease prediction.
Federated learning can overcome data sparsity and privacy concerns to improve livestock growth prediction using real-world farm data.
Dataset condensation, already vulnerable to backdoor attacks, now faces a far stealthier threat: InkDrop leverages decision boundary uncertainty to hide malicious triggers, making detection significantly harder.
Generating realistic, safety-critical maritime scenarios at scale is now possible by combining generative trajectory modeling with automated encounter pairing, moving beyond limited historical data or handcrafted templates.
LLMs can now construct high-fidelity, disease-specific knowledge graphs from full-text biomedical literature, unlocking evidence-aware reasoning and hypothesis generation.
Data literacy isn't monolithic: K-12 learners navigate wildly different learning pathways depending on the context, challenging assumptions about a one-size-fits-all approach.
PReD leaps ahead by creating the first foundation model to close the loop on perception, recognition, and decision-making for electromagnetic signals.
Even a small, targeted dataset can bridge the gap in cross-dialect transfer learning for low-resource languages, significantly boosting dependency parsing accuracy.
LLMs' struggles with non-standard languages aren't just a technical problem, but reflect and reinforce historical power imbalances embedded in linguistic standardization.
You can now unmask LLM ghostwriters with a lightweight fingerprinting method that works even when they try to hide in new domains or use unseen models.
Demystifying LLMs for the masses might be as simple as turning their mechanics into a game.
Training data no longer needs to choose between realism and accuracy: SHOW3D delivers both for hand-object interaction.
Forget expensive, low-realism 3D renders: diffusion models can now generate photorealistic human datasets that boost model performance beyond real-world data.
A 40-point mIoU gap between supervised methods and zero-shot segmentation on Industrial3D reveals that foundation models are nowhere near ready for real-world industrial Scan-to-BIM workflows.
Wavelet decomposition offers a surprisingly effective way to disentangle anatomical structure from domain-specific noise in fundus images, leading to state-of-the-art generalization performance.
A new synthetic hyperspectral dataset lets researchers train and benchmark vegetation trait retrieval models with paired hyperspectral imagery and ground truth, all while controlling for environmental variability.
A new dataset of European landmarks offers researchers a challenging benchmark for training and evaluating 3D reconstruction pipelines, filling a critical gap in high-quality, diverse data.
Ghost points, often ignored in LiDAR processing, can be effectively identified and removed using full-waveform LiDAR data, leading to substantial improvements in downstream tasks like SLAM and object detection.
Bypass the need for predicate annotations in 3D scene graph pretraining with a novel topological layout learning approach that enforces predicate relation learning.
Injecting carefully-selected, reverse-ordered behavioral curricula into generative recommendation models can significantly boost conversion rates, as demonstrated by a 2% lift in online advertising revenue.
Forget painstakingly tuning data mixture ratios for continual pre-training: OptiMer lets you train individual models and then *optimize* their combination weights *afterward*, cutting search costs by up to 35x.
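The "optimize combination weights afterward" idea can be sketched with a coarse grid search over the simplex on held-out predictions; the synthetic data and grid-search optimizer here are stand-ins, not OptiMer's actual procedure.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
y_val = rng.normal(size=50)

# Validation predictions from three independently trained models
# (noisier predictions stand in for weaker models).
preds = [y_val + rng.normal(scale=s, size=50) for s in (0.3, 0.6, 1.0)]

# Search convex combination weights on the 2-simplex, no retraining needed.
best_w, best_loss = None, np.inf
steps = np.linspace(0.0, 1.0, 11)
for w1, w2 in itertools.product(steps, steps):
    if w1 + w2 > 1.0:
        continue
    w = np.array([w1, w2, 1.0 - w1 - w2])
    combined = sum(wi * p for wi, p in zip(w, preds))
    loss = np.mean((combined - y_val) ** 2)
    if loss < best_loss:
        best_w, best_loss = w, loss

print("best weights:", best_w, "val MSE:", best_loss)
```

Since the grid includes the simplex corners, the best combination is never worse on validation than any single model, which is what makes post-hoc weight search cheaper than re-running the data-mixture sweep.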
Unlock the secrets of scientific writing: EarlySciRev reveals how scientists *really* revise their work, offering a goldmine of early-stage revisions previously hidden in LaTeX comments.
Forget massive parallel datasets: cross-lingual alignment in multilingual models emerges almost as effectively without them.
Forget hand-tuning for each language: this recipe achieves state-of-the-art phone recognition across 100+ languages, revealing the surprising power of scaling data and SSL representations.
Sentiment models often disagree on Holocaust oral histories, not on the presence of positive or negative sentiment, but on the boundary of neutrality, revealing a critical gap in their ability to handle nuanced historical narratives.
Training on grounded reasoning traces doesn't just improve hypothesis generation—it makes models 100% structurally compliant and boosts spark cosine similarity by nearly 3x.
Classifying subtle orthographic variations in low-resource languages is now possible with 96% accuracy, paving the way for more robust NLP models.
Synthetic training data generated from limited confidential datasets can be only superficially similar to the reference data yet still improve model training for short answer grading.
Forget hand-crafted KG traversal policies: GraphWalker uses automatically synthesized trajectories to train agents that achieve SOTA performance and generalize to unseen reasoning paths.
LLMs can now diagnose spleen-stomach disorders by integrating both traditional Chinese and Western medicine, achieving state-of-the-art results.
Finally, a way to represent the messy, collaborative syntax of real spoken language in treebanks.
Unlock centuries of East Asian philosophical insight: Graphilosophy uses knowledge graphs to make the Four Books accessible for cross-lingual retrieval and AI-assisted reasoning.
Forget manual blurring: Unsafe2Safe uses multimodal diffusion editing to automatically rewrite sensitive image regions, preserving utility while crushing privacy risks.
Securing LLM supply chains requires cryptographically binding training and release claims to artifacts, enabling verifiable enforcement of security policies across teams and stages.
Backdoor defenses can be baked into the pre-training phase of federated learning, achieving state-of-the-art attack mitigation with minimal impact on clean accuracy.
Generative super-resolution can significantly weaken forensic traces in text-guided inpainting forgeries, exposing a critical vulnerability in current forensic pipelines.
Slash malware detection labeling costs by 90% using combined active and semi-supervised learning, without sacrificing performance.
Flow-matching generative models can simultaneously defend against poisoning attacks and preserve privacy in federated learning, outperforming existing methods in accuracy and robustness.
Forget expensive full fine-tuning: this training-free data selection method uses in-context learning to slash MLLM training costs while maintaining performance.
Unlabeled LiDAR data can now drive state-of-the-art traffic simulation, unlocking scalable realism without costly annotations.
Synthetic data, often touted as a panacea, only shines for fruit detection when paired with real-world data, offering a practical path to reducing annotation effort without sacrificing too much accuracy.
Forget fixed memory budgets: dynamically allocating exemplar storage across federated clients boosts performance in class-incremental learning for heterogeneous medical data.
Achieve strong, controllable privacy in federated biomedical AI without sacrificing performance, thanks to a lightweight key-embedded implicit neural representation.
Ditch error-prone OCR: VERITAS slashes word error rates by 67% and triples processing speed for historical document digitization by integrating transcription, layout analysis, and semantic enrichment.
LLMs can scalably annotate motion capture data to produce semantically rich descriptions of bimanual interactions, enabling higher-quality generation of dexterous hand motions.
Augmenting IDS training data with a novel GAN framework boosts detection of unseen network attacks by nearly 4% AUROC, suggesting a promising path to more robust security systems.
LALMs still struggle to truly "hear" music, as revealed by a new expert-curated benchmark that exposes their reliance on non-musical shortcuts.
LLMs struggle to verbalize rare entities, exhibiting lower performance and higher uncertainty compared to common entities, even in multilingual settings.
Forget hand-picking your cross-lingual training data: a budget-constrained optimization can automatically allocate resources across multiple source languages, boosting performance on African languages by a large margin.
NPM malware detection tools often fail because they struggle to distinguish between innocuous code behavior and malicious intent, a problem addressable by analyzing behavioral chains.
VLMs can now get a million-scale boost in chart-understanding abilities thanks to a new dataset with paired code, images, data, and reasoning.
Unstructured text holds a wealth of untapped knowledge, yet remains largely ignored by existing data integration systems.
Achieve photorealistic and structurally consistent weather editing for autonomous driving videos without the massive datasets typically required by generative models.