Training data quality, synthetic data generation, data filtering, deduplication, and dataset construction.
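Deduplication, one of the recurring themes in this topic, is often done with set-overlap measures over text shingles. As a minimal toy sketch of the general idea (not any specific paper's pipeline, and far simpler than production MinHash-based systems), near-duplicate documents can be filtered greedily by Jaccard similarity of character k-grams:

```python
def shingles(text, k=5):
    """Character k-grams of a whitespace-normalized, lowercased string."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(1, len(t) - k + 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedup(docs, threshold=0.8):
    """Greedy near-duplicate filter: keep a document only if it is not
    too similar to any document already kept. O(n^2) pairwise scan,
    for illustration only."""
    kept, kept_shingles = [], []
    for d in docs:
        s = shingles(d)
        if all(jaccard(s, ks) < threshold for ks in kept_shingles):
            kept.append(d)
            kept_shingles.append(s)
    return kept
```

Real corpus-scale systems replace the quadratic scan with locality-sensitive hashing, but the keep/drop decision is the same overlap test.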
Forget privacy concerns: you can train high-performing deep learning models for dynamic MRI reconstruction using *synthetic* fractal data.
Forget hand-crafted prompts and seed data: Simula lets you generate high-quality synthetic datasets at scale by simply defining the reasoning characteristics you want.
Chess transformers trained solely on move sequences face a "dual-capability bottleneck" where excelling at both state tracking and decision-making requires carefully balancing data diversity and quality, a tension that simple scaling cannot resolve.
An 8B open-source model, trained with a new closed-loop environment for 6G network management, achieves performance comparable to GPT-4, suggesting a viable path to autonomous network control.
Bilingual language models can achieve performance comparable to monolingual models in both languages, challenging the assumption that bilingual input poses significant learning obstacles.
AI-generated image forgery detection gets a major boost with PromptForge-350k, a dataset so large and well-annotated it pushes IoU scores 5% higher and generalizes to unseen models.
News agencies reuse content across languages far more than simple lexical overlap reveals, with over half of articles drawing on foreign sources through paraphrase and compositional techniques.
Physical AI systems struggle not with visual recognition, but with understanding space, physics, and action – and PRISM, a new retail video dataset, dramatically closes this gap.
Training NER models on modern Italian won't cut it for historical texts: ENEIDE exposes the performance gap with a new multi-domain dataset spanning two centuries.
Unlock knowledge equity for underserved languages: L-ReLF offers a reproducible recipe for creating high-quality lexical datasets where they're needed most.
Thiomi slashes Swahili ASR error rates by 61% and unlocks nine more African languages for multimodal AI, proving community-driven data collection can leapfrog existing benchmarks.
Japanese entity linking gets a boost: CADEL offers a high-quality, Japan-specific corpus to tackle the unique challenges of linking entities in administrative web documents.
Proprietary language models trounce open-source alternatives by 3-6x on a new, large-scale corpus of Sinhala and Pali Buddhist texts.
The first publicly available dataset for Syrian Arabic Sign Language (SyArSL) opens the door for machine translation research to improve accessibility for a historically underserved community.
LLMs can better capture human semantic similarity by predicting sets of related concepts instead of single next tokens.
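The paper's models and prompting setup aren't reproduced here; as a hypothetical sketch of why set prediction helps, similarity between two words can be scored by the overlap (here a Dice coefficient) of the concept sets a model emits for each, rather than by comparing single next-token predictions:

```python
def dice(a, b):
    """Dice coefficient between two concept sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical concept sets; a real system would sample these
# from an LLM's predictions for each cue word.
concepts = {
    "cat": {"animal", "pet", "feline", "whiskers"},
    "dog": {"animal", "pet", "canine", "bark"},
    "car": {"vehicle", "engine", "road", "wheels"},
}

cat_dog = dice(concepts["cat"], concepts["dog"])  # shares "animal", "pet"
cat_car = dice(concepts["cat"], concepts["car"])  # no shared concepts
```

The set view lets "cat" and "dog" score as similar through shared concepts even when their single most likely continuations differ.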
Stop treating inter-rater reliability as a simple green light for "ground truth" in AIED – your data's probably messier than you think, especially with LLMs in the mix.
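The caution above is about agreement statistics such as Cohen's kappa. As a minimal sketch of the standard two-rater formula (the textbook definition, not the paper's analysis), kappa discounts the observed agreement by the agreement expected from each rater's label frequencies alone:

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters' label sequences of equal length."""
    assert len(r1) == len(r2) and r1
    n = len(r1)
    # Observed agreement: fraction of items where the raters match.
    po = sum(a == b for a, b in zip(r1, r2)) / n
    # Chance agreement: expected matches from marginal label frequencies.
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum(c1[label] * c2[label] for label in set(c1) | set(c2)) / (n * n)
    if pe == 1.0:
        return 1.0
    return (po - pe) / (1 - pe)
```

A high kappa still says nothing about whether the label scheme itself captures the construct, which is the deeper point of the headline above.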
Synthetic data, when carefully aligned with real-world characteristics, can boost hand-object interaction detection by over 11% even when real labeled data is scarce.
Vision-language models falter at the fine-grained temporal recognition crucial for surgical video understanding, while SurgRec excels.
YOLOv11 crushes the competition in form element detection, showcasing its potential for automating document processing across diverse real-world forms.
Stop training your image restoration models to mimic flawed ground truth; instead, explicitly optimize for perceptual quality using a plug-and-play module guided by No-Reference Image Quality Assessment.
VLMs struggle with Earth observation tasks involving complex land use, but a new dataset with nearly 10 million text annotations could change that.
FlowID enables forensic facial reconstruction on damaged faces with better identity preservation and lower computational cost than existing methods, potentially accelerating victim identification in violent deaths.
Stop averaging prototypes blindly: FedDBP uses Fisher information to intelligently fuse local prototypes, significantly boosting performance in heterogeneous federated learning.
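FedDBP's actual Fisher-information estimator isn't given here; as a hypothetical sketch of the general idea, each client's class prototype gets a scalar informativeness weight, and the server fuses prototypes by a weighted rather than plain average:

```python
def fuse_prototypes(protos, fisher):
    """Fuse per-client class prototypes using Fisher-information weights:
    clients whose local data make the prototype more informative get
    proportionally more influence than a naive uniform average would give.
    protos: list of equal-length vectors (one per client)
    fisher: list of nonnegative scalar weights (one per client)"""
    total = sum(fisher)
    dim = len(protos[0])
    fused = [0.0] * dim
    for p, w in zip(protos, fisher):
        for i in range(dim):
            fused[i] += (w / total) * p[i]
    return fused
```

With uniform weights this reduces to the blind averaging the headline warns against; skewed weights shift the fused prototype toward better-informed clients.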
Publicly available satellite imagery can now estimate building heights with state-of-the-art accuracy thanks to a new dataset and network architecture designed for the task.
Policies trained with GenSplat maintain robust performance under severe spatial perturbations where baseline methods completely fail, thanks to its novel 3D Gaussian Splatting-based augmentation.
Forget "spread" voicings: skewness is the key to clarity in piano chords, offering a fresh perspective on psychoacoustic principles.
Unbalanced class prevalence, not just disjoint label sets, is the dominant factor hindering federated learning performance under label-space heterogeneity.
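The prevalence imbalance in question can be quantified in many ways; one simple, generic measure (an illustration, not the paper's metric) is the average total-variation distance between each client's label distribution and the pooled global one:

```python
def label_distribution(labels, classes):
    """Empirical class frequencies of a label list."""
    n = len(labels)
    return [labels.count(c) / n for c in classes]

def prevalence_skew(client_labels, classes):
    """Average total-variation distance between each client's label
    distribution and the pooled distribution: 0 means perfectly
    balanced clients, values toward 1 mean severe imbalance."""
    pooled = [label for cl in client_labels for label in cl]
    g = label_distribution(pooled, classes)
    tvs = []
    for cl in client_labels:
        p = label_distribution(cl, classes)
        tvs.append(0.5 * sum(abs(pi - gi) for pi, gi in zip(p, g)))
    return sum(tvs) / len(tvs)
```

Note that clients can share an identical label set yet still score high here, which is exactly the distinction the headline draws between disjoint labels and unbalanced prevalence.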
Unlock new insights into rapid software development and collaboration with a massive dataset of over 100,000 hackathon projects.
Anticancer drugs, whether organic or inorganic, can now be understood through a single unified representation, unlocking knowledge transfer between previously siloed chemical domains.
Graph condensation, while shrinking massive datasets for GNN training, can inadvertently amplify biases – until now.
Multi-view learning with prototype-based correction significantly boosts the robustness of thyroid nodule ultrasound classification across different ultrasound devices and clinical environments.
Imperfect quantum data won't stop machine learning models: this work shows how unsupervised domain adaptation on classical shadows can bridge the gap.
Unlock hidden predictive power: NLP on unstructured clinical notes beats traditional EHR data for early disease prediction.
Federated learning can overcome data sparsity and privacy concerns to improve livestock growth prediction using real-world farm data.
Dataset condensation, already vulnerable to backdoor attacks, now faces a far stealthier threat: InkDrop leverages decision boundary uncertainty to hide malicious triggers, making detection significantly harder.
Generating realistic, safety-critical maritime scenarios at scale is now possible by combining generative trajectory modeling with automated encounter pairing, moving beyond limited historical data or handcrafted templates.
LLMs can now construct high-fidelity, disease-specific knowledge graphs from full-text biomedical literature, unlocking evidence-aware reasoning and hypothesis generation.
Data literacy isn't monolithic: K-12 learners navigate wildly different learning pathways depending on the context, challenging assumptions about a one-size-fits-all approach.
PReD leaps ahead by creating the first foundation model to close the loop on perception, recognition, and decision-making for electromagnetic signals.
Even a small, targeted dataset can bridge the gap in cross-dialect transfer learning for low-resource languages, significantly boosting dependency parsing accuracy.
LLMs' struggles with non-standard languages aren't just a technical problem, but reflect and reinforce historical power imbalances embedded in linguistic standardization.
You can now unmask LLM ghostwriters with a lightweight fingerprinting method that works even when they try to hide in new domains or use unseen models.
Demystifying LLMs for the masses might be as simple as turning their mechanics into a game.
Training data no longer needs to choose between realism and accuracy: SHOW3D delivers both for hand-object interaction.
Forget expensive, low-realism 3D renders: diffusion models can now generate photorealistic human datasets that boost model performance beyond real-world data.
A 40-point mIoU gap between supervised methods and zero-shot segmentation on Industrial3D reveals that foundation models are nowhere near ready for real-world industrial Scan-to-BIM workflows.
Wavelet decomposition offers a surprisingly effective way to disentangle anatomical structure from domain-specific noise in fundus images, leading to state-of-the-art generalization performance.
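The paper's pipeline operates on 2D fundus images with learned components; as a toy 1D illustration of the underlying mechanism, a one-level Haar wavelet transform splits a signal into a coarse approximation (structure) and a high-frequency detail band (where device-specific noise tends to live), and is exactly invertible:

```python
def haar_split(signal):
    """One-level Haar decomposition of an even-length signal.
    Returns (approximation, detail): pairwise averages keep coarse
    structure, pairwise half-differences carry high-frequency content."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail

def haar_merge(approx, detail):
    """Inverse one-level Haar transform."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([a + d, a - d])
    return out
```

Because the split is invertible, a model can normalize or discard the detail band across domains and reconstruct a cleaned signal, which is the intuition behind using wavelets for domain generalization.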
A new synthetic hyperspectral dataset lets researchers train and benchmark vegetation trait retrieval models with paired hyperspectral imagery and ground truth, all while controlling for environmental variability.
A new dataset of European landmarks offers researchers a challenging benchmark for training and evaluating 3D reconstruction pipelines, filling a critical gap in high-quality, diverse data.
Ghost points, often ignored in LiDAR processing, can be effectively identified and removed using full-waveform LiDAR data, leading to substantial improvements in downstream tasks like SLAM and object detection.