100 papers published across 8 labs.
Quantifying the divergence between real and synthetic phoneme distributions via Kullback-Leibler divergence can pinpoint the most vulnerable phonemes for detecting audio deepfakes.
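The per-phoneme divergence idea above can be sketched in a few lines. This is a minimal illustration (not the paper's code), assuming phoneme labels are available for both real and synthetic speech: each phoneme's contribution p·log(p/q) to KL(real‖synthetic) flags where the synthetic distribution deviates most.

```python
import math
from collections import Counter

def phoneme_kl_contributions(real_phonemes, synth_phonemes, eps=1e-9):
    """Per-phoneme contribution p * log(p / q) to KL(real || synthetic)."""
    p, q = Counter(real_phonemes), Counter(synth_phonemes)
    n_p, n_q = sum(p.values()), sum(q.values())
    contrib = {}
    for ph in set(p) | set(q):
        pi = p[ph] / n_p
        qi = max(q[ph] / n_q, eps)  # smooth to avoid log(0)
        if pi > 0:
            contrib[ph] = pi * math.log(pi / qi)
    return contrib

# Toy phoneme sequences; "S" is under-represented in the synthetic speech.
real = ["AA", "AA", "IY", "S", "S", "S", "T"]
synth = ["AA", "IY", "IY", "S", "T", "T", "T"]
ranked = sorted(phoneme_kl_contributions(real, synth).items(),
                key=lambda kv: kv[1], reverse=True)
print(ranked[0][0])  # -> S, the most divergent phoneme
```

Phonemes at the top of the ranking are the ones a detector would scrutinize first.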
LLMs can be actively trained to master specific knowledge domains with 50% less data and computation by focusing on what they *don't* know, not what they already do.
Teaching LLMs to say "I don't know" is now possible via targeted SFT, slashing hallucination rates without sacrificing performance on other tasks.
Outliers aren't just noise: some are early harbingers of entirely new topics, detectable by tracking document trajectories.
Training on synthetically generated data can significantly boost both the diversity and quality of commonsense reasoning in LLMs, outperforming models trained on scarce human-annotated data.
YouTube's platform defenses are a house of cards: circumventing one control often triggers a cascade of failures, demanding constant architectural adaptation for large-scale content replication.
Unlock faster, more accurate interlinear glossing for low-resource languages by treating morphemes as atomic units, outperforming existing methods and enabling user-guided lexicon expansion without retraining.
Synthetic data and virtual environments are rapidly becoming indispensable for autonomous driving, but realizing their full potential requires tackling challenges like Sim2Real transfer and scalable safety validation.
Forget real-world video datasets: training VLMs on just 7.7K synthetic videos with temporal primitives beats 165K real-world examples, unlocking surprisingly effective transfer learning for video reasoning.
Counterintuitively, the most *unreliable* samples in medical imaging datasets—those with fluctuating confidence and frequent forgetting during training—are the *most* informative for building accurate decision boundaries.
Current CRL benchmarks often fail to provide a holistic view of model performance, hindering progress, but a new aggregate metric could change that.
Optimizing multilingual training? Shapley values reveal the hidden cross-lingual transfer effects that current scaling laws miss, leading to better language mixture ratios.
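Shapley values credit each language with its average marginal contribution across all coalitions, which is exactly how cross-lingual transfer can be teased apart. A minimal exact-Shapley sketch (illustrative only; the coalition scores below are made-up numbers, not the paper's):

```python
from itertools import combinations
from math import factorial

def shapley_values(players, coalition_value):
    """Exact Shapley values: weighted marginal contributions over all subsets."""
    n = len(players)
    values = {p: 0.0 for p in players}
    for p in players:
        rest = [q for q in players if q != p]
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(rest, k):
                values[p] += weight * (coalition_value(set(subset) | {p})
                                       - coalition_value(set(subset)))
    return values

# Hypothetical eval scores for language mixtures: the joint score exceeds the
# solo scores' sum, i.e. German-English transfer that a per-language scaling
# law would miss.
scores = {frozenset(): 0.0, frozenset({"en"}): 0.50, frozenset({"de"}): 0.40,
          frozenset({"en", "de"}): 1.00}
v = shapley_values(["en", "de"], lambda s: scores[frozenset(s)])
print(v)  # {'en': 0.55, 'de': 0.45}
```

Exact computation is exponential in the number of languages; practical pipelines would sample coalitions instead.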
Current AI struggles to understand human values in real-world news events, often missing the who, what, and why – until now.
Pinpointing the training data behind an LLM's behavior is now possible without retraining, opening the door to precise debugging and targeted interventions.
Overcome scarce data and boost material classification accuracy by generating synthetic training data and distilling knowledge from vision-language foundation models.
Automated injection of realistic vulnerabilities and synthesis of PoV exploits finally makes scalable, precisely labeled, repository-level vulnerability datasets a reality.
Current PII detection models are blind to the transaction-level identifiers and partially-filled forms that computer-use agents readily expose, but a new benchmark closes the gap.
Stop benchmarking algorithm discovery on the same old saturated datasets: DiscoGen offers millions of fresh, configurable tasks to truly test your ADA.
Unlock scalable aerial scene understanding with SegFly, a massive RGB-T dataset generated via a novel 2D-3D-2D label propagation technique that requires minimal manual annotation.
A new RGB-T dataset and frequency-aware network expose the surprising limitations of existing UAV detectors when faced with real-world camouflage and complex backgrounds.
Anonymized faces don't have to be expressionless blobs: this method preserves realistic expressions and lighting while scrambling identity.
RIS models struggle with motion-based queries, but a new data augmentation and contrastive learning approach closes the gap without sacrificing performance on appearance-based descriptions.
Achieve stable and reliable network intrusion detection and high-fidelity synthetic data generation by combining machine learning, adversarial learning, and rigorous statistical evaluation on a new unified multi-modal NIDS dataset.
Human-robot teams can get a boost: eye-tracking data alone can predict when a human teammate is struggling to understand the robot's situation with nearly 90% recall.
Ditch the data augmentation and decoders: R2-Dreamer's Barlow Twins-inspired objective delivers faster, more versatile MBRL, especially when spotting the small stuff matters.
Ditch LiDAR: 3D Gaussian Splatting, combined with semantic segmentation and stereo depth, enables real-time lunar mapping with centimeter-level accuracy.
Forget complex statistical models: this CUT network turns decades of fuzzy DMSP satellite data into sharp, VIIRS-like nighttime light maps with impressive accuracy.
Real-world images plagued by both raindrops and reflections finally get a dedicated benchmark dataset (RDRF) and a diffusion-based model (DiffUR$^3$) that actually works.
A small, synthetically generated dataset can dramatically improve LLM performance on low-resource languages, even when the data is noisy and imperfect.
Wikipedia editors can now get AI assistance to identify claims needing citations in 10 languages, improving content reliability at scale.
Contrary to claims that RLVR can handle noisy data, this work reveals that current RLVR methods still suffer significantly from data quality issues, with performance dropping 8-12% when trained on truly noisy data.
ASR-assisted transcription doesn't automatically improve accuracy in corpus creation, and its effectiveness hinges on factors like workflow design and transcriber expertise.
Unlock timbre-aware generative AI with a new dataset linking semantic descriptors to electric guitar sounds, enabling nuanced control over audio synthesis.
A new 1.25B-word Pashto corpus boosts NER performance by 10% and slashes training variance nearly 7x, highlighting the disproportionate value of Wikipedia data.
Forget perplexity – ZipCal uses Zipf's law to curate calibration data for LLM compression, matching state-of-the-art performance at 240x the speed.
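One reading of the premise above (a sketch of the idea, not ZipCal's actual algorithm) is to score candidate calibration texts by how closely their token-frequency spectrum follows an ideal Zipf line in log-log space, then keep the most Zipf-like ones:

```python
import math
from collections import Counter

def zipf_deviation(tokens):
    """RMS deviation of a token-frequency spectrum from an ideal Zipf line
    (slope -1) in log-log space; lower means more Zipf-like."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    pairs = [(math.log(r + 1), math.log(f)) for r, f in enumerate(freqs)]
    c = pairs[0][1]  # ideal Zipf: log f = log f_max - log rank
    return (sum((lf - (c - lr)) ** 2 for lr, lf in pairs) / len(pairs)) ** 0.5

# Toy candidates: a roughly Zipfian document vs. a uniform-frequency one.
docs = {"zipf_like": "a a a a b b c d".split(),
        "uniform": "a b c d e f g h".split()}
best = min(docs, key=lambda k: zipf_deviation(docs[k]))
print(best)  # -> zipf_like
```

Because the score needs only token counts, selection is far cheaper than perplexity-based filtering, which requires a forward pass per candidate.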
By injecting real-world priors into a diffusion model, Iris achieves state-of-the-art monocular depth estimation with significantly improved generalization and detail, even with limited training data.
Data-centric methods can effectively identify and mitigate label noise in remote sensing data, but the best approach depends heavily on the specific noise characteristics and task objectives.
Achieve seamless vector map generation across all land-cover classes from aerial imagery by enforcing shared-edge consistency, outperforming class-specific methods.
LiDAR data cuts animal depth-estimation RMSE by 10%, revealing the power of multimodal data for 3D wildlife perception.
Imagine cities teaching cars to see: this work demonstrates a label-free 3D perception pipeline where roadside sensors train autonomous vehicles, achieving impressive detection accuracy without manual annotation.
A new loss function lets you train a deep learning model to detect rare bee and wasp brood cells with minimal labeling effort, even when data is highly imbalanced.
A new diffusion architecture that explicitly disentangles demographic factors allows for generating higher-quality medical images for underrepresented groups and novel demographic intersections, outperforming standard fine-tuning and FairDiffusion.
Forget expensive distillation – aligning language models can be as simple as carefully choosing the right mix of pretraining data based on log-likelihood differences.
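The log-likelihood-difference idea can be illustrated with toy unigram LMs standing in for the real models (everything below is a hypothetical sketch, not the paper's pipeline): score each pretraining document by how much more an aligned-style model likes it than a base model, and keep the top scorers.

```python
import math
from collections import Counter

def unigram_logprob(tokens, counts, total, vocab_size):
    """Add-one-smoothed unigram log-likelihood of a document."""
    return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)

def select_by_loglik_diff(docs, aligned_corpus, base_corpus, keep=1):
    """Keep the docs an aligned-style LM prefers most over a base LM."""
    a, b = Counter(aligned_corpus), Counter(base_corpus)
    vocab = len(set(aligned_corpus) | set(base_corpus))
    def diff(doc):
        return (unigram_logprob(doc, a, len(aligned_corpus), vocab)
                - unigram_logprob(doc, b, len(base_corpus), vocab))
    return sorted(docs, key=diff, reverse=True)[:keep]

aligned = "please explain step by step".split()  # stand-in "aligned" text
base = "raw web text noise noise".split()        # stand-in "base" text
docs = [["explain", "step"], ["noise", "noise"]]
print(select_by_loglik_diff(docs, aligned, base))  # [['explain', 'step']]
```

In the real setting the two scorers would be full LMs, but the selection rule is the same ranking by log-likelihood difference.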
A new spoken user simulator, SpokenUS, trained on a large-scale dataset, finally captures the messiness of real human conversation, including barge-ins and disfluencies, to better train dialogue agents.
Federated learning can match or beat centralized models for predicting postoperative complications, all while keeping patient data siloed at each hospital.
By jointly estimating the mapping from calibration parameters to VAE-encoded image representations, this work achieves a 2x reduction in error when calibrating electron microscopes, demonstrating the power of bridging simulation and reality.
By explicitly modeling spectral channel variations and inter-channel similarity, SPDDA overcomes the realism-diversity tradeoff in hyperspectral data augmentation, achieving state-of-the-art domain generalization performance.
Unlock the potential of unlabeled plankton data with a CLIP-inspired cross-modal approach that achieves high recognition accuracy using minimal labeled images.
Rectified flows can generate synthetic skin lesion images that boost classification accuracy by up to 9% compared to diffusion models, offering a promising solution to data scarcity in dermatology.
Ensemble self-training with diverse auxiliary languages boosts unsupervised machine translation by up to 1.7 chrF, proving that shared supervision can overcome the limitations of single-model approaches.
A new smartphone protocol enables large-scale, privacy-preserving collection of prosodic speech data in the wild, opening doors to studying the subtle emotional nuances in everyday communication.
Surgical AI gets a major data boost: Surg$\Sigma$ unifies millions of surgical conversations with multimodal annotations, paving the way for more generalizable and interpretable models.
Unlock high-quality 3D part segmentation with minimal labeled data by repurposing existing 3D generative models.
Even with a realizable missing data model, estimating the mean of a high-dimensional Gaussian provably requires either exponentially more samples or exponential runtime, revealing a fundamental information-computation tradeoff.
Current time series foundation models struggle with millisecond-resolution 5G network data, revealing a critical gap in their ability to generalize to high-frequency real-world applications.
Forget relying on pretrained models or complex aggregation schemes: FederatedFactory achieves near-centralized performance in federated learning with extreme data heterogeneity by simply swapping generative priors.
Resource-constrained Arabic AI development can compete with systems built at far greater scale, as demonstrated by Fanar 2.0's performance gains using 8x fewer pre-training tokens than its predecessor.
Forget curated datasets – this work shows you can bootstrap AI scientists by training them on automatically generated, self-verified ML tasks, leading to significant performance gains on MLGym.
Achieve state-of-the-art semi-supervised crowd instance segmentation and counting by generating high-quality mask supervision from sparse annotations, effectively bridging the gap between these two tasks.
LLM benchmarks in low-resource languages are likely garbage, with synthetic or machine-translated data introducing severe flaws that skew results.
Forget expensive real-world data collection: a massive, diverse synthetic dataset enables surprisingly effective zero-shot transfer for robotic manipulation.
LLMs' apparent superhuman performance on benchmarks may be a mirage: contamination inflates scores by up to 20% in some domains, revealing a critical flaw in current evaluation practices.
LLM-assisted scientific writing is producing more confident but homogenized prose, as evidenced by a 23% decline in hedging in the post-LLM era.
Forget labor-intensive annotation or expensive motion capture: TrackDeform3D offers an affordable, autonomous RGB-D framework for high-quality 3D tracking and dataset collection of deformable objects.
Local LLMs can now anonymize text better than industry standards, preserving both privacy and utility for downstream tasks.
A new dataset of 2.56 million verses of Arabic lyrics and poetry opens the door for large-scale computational analysis of Arabic language evolution, cultural trends, and artistic expression.
Expect to pay an exponential sample complexity price for computationally efficient mean and covariance estimation with missing data, but not for linear regression.
Current telemetry falls woefully short in detecting advanced software supply chain attacks, with even the best single source capturing less than 40% of the attack chain, underscoring the critical need for multi-source data fusion.
By automatically generating fingertip workspace clouds, FSG enables real-time, human-like grasp generation for robotic hands with arbitrary structures, sidestepping the inverse kinematics bottleneck.
FastGAN can backfire in low-data regimes, actively *increasing* classifier bias by over 20% due to mode collapse, a stark warning against blindly applying generative augmentation.
Stop wasting your finetuning data: Specialized Pretraining (SPT) can outperform standard pretraining and finetuning, achieving better domain performance with fewer parameters and less compute.
Event cameras can now see in the dark: eAP, a new large-scale dataset, enables robust 3D object detection and time-to-contact estimation even under challenging illumination.
Forget static domain priors: the best way to rate AI-generated audio quality depends on *which* aspect of quality you're measuring.
This 8-year all-sky dataset with star-aware masks and alt-az calibration could unlock more reliable cloud prediction for ground-based telescopes.
Forget expensive fine-tuning: linguistically-informed prompting offers a lightweight, but sometimes unreliable, path to low-resource translation with LLMs.
Stop averaging over noisy robot data: PTR selectively trusts training samples based on how well their post-action consequences align with learned representations, leading to more robust offline policy learning.
Forget painstakingly creating 3D assets for robot training - ManiTwin automates the process, turning single images into simulation-ready objects at scale.
Multi-hop data synthesis using HopChain boosts VLM performance across a wide range of tasks, with gains of over 50 points in accuracy for ultra-long-context reasoning.
You can estimate the completeness of a web crawl using only its own historical data, without needing external datasets.
Imperfect knowledge graphs can lead to retrieval drift and hallucinations in multi-hop reasoning, but C2RAG offers a robust solution that improves EM by 3.4% and F1 by 3.9% over existing methods.
VLMs stumble when diagnosing Vietnamese chest X-rays, revealing a critical gap in their ability to handle diverse medical data and underscoring the need for datasets like ViX-Ray.
Achieve state-of-the-art pansharpening of thin-cloud contaminated remote sensing images with a unified model that disentangles frequency components and leverages NIR and PAN bands for robust restoration.
A new 320-hour corpus of French speech reveals how pronunciation has changed over six decades, including the surprising finding that voice pitch evolution doesn't differ by gender.
Even state-of-the-art multilingual transformers struggle with the pragmatic challenge of Indirect Question Answering, achieving low performance across English, German, and Bavarian.
A fine-tuned RoBERTa model with only 125M parameters can match the CVE-to-CWE classification accuracy of models 64x larger, proving that strategic fine-tuning and data curation can close the gap between small and large language models.
LLMs struggle with systematic cross-sentence knowledge of verb alternations, a weakness exposed by new Blackbird Language Matrices (BLMs) datasets in English, German, Italian, and Hebrew.
LLMs often fail to access knowledge uniquely available in lower-resource language varieties, even when closely related to high-resource languages, revealing a significant information asymmetry.
Despite a rapidly forming professional vocabulary, the AI field isn't coalescing into a distinct occupation, challenging assumptions about how new technologies translate into new job categories.
Adaptive prompting unlocks superior LLM-generated personality assessments, outperforming traditional methods and scaling effectively with model capability.
Sampling the wrong data in differentially private queries can inflate error by 10x, but a new method slashes that overhead by sampling aggregation units instead of users.
Forget photorealistic rendering; the next frontier in scene understanding is generating complete, traversable floorplans from a single egocentric image.
Force sensing gloves unlock a new dimension of self-supervision for video models, boosting action understanding without manual labels.
Training on a new multi-object dataset with explicit modeling of grasp offsets and pre-grasp configurations enables an end-to-end network to achieve significantly improved dexterous grasping performance in simulation and on a real robot.
Even with high AUROC scores for OOD detection, skeleton-based action recognition models can remain confidently incorrect when faced with domain shift, highlighting the limitations of standard uncertainty measures for safe deployment.
Causal-Residual Bootstrapping lets you inject more causal knowledge into your data augmentation pipeline than previous methods, leading to better model accuracy.
Answering complex biomedical questions like "Which biological pathways are disrupted by drugs currently in Phase 3 trials for breast cancer?" becomes possible in seconds by federating open-source knowledge graphs and enabling LLM access.
MT models struggle to appropriately handle passive voice in Chinese-English translation, often mirroring the source text's voice even when human translators would diverge.
By strategically guiding self-play with challenging real-world examples, GASP unlocks a 2.5% performance boost in coding LLMs and conquers previously unsolvable problems.
Adding more data from a new scanner can actually hurt your model by causing it to learn spurious correlations, even though clinical experts believe scanner variation is a key source of diversity.
Forget expensive ECG hardware: this dataset and benchmark show you can reconstruct clinically useful chest-lead ECGs from cheap vibrational sensors, but watch out for "hallucinated" heartbeats.