97 papers published across 5 labs.
Unlock face recognition with just one labeled example and a flood of unlabeled data, achieving state-of-the-art accuracy in a practical authentication scenario.
LLM-powered data augmentation combined with rule-based pre-processing unlocks surprisingly high NER accuracy in low-resource domains, even with limited training data.
Training on D3-Gym, a new dataset of real-world scientific tasks with verifiable environments, closes the gap between open-source and proprietary models on ScienceAgentBench by 7.8 points.
By intelligently perturbing class prototypes based on their discriminative power, VPDR achieves a superior privacy-utility trade-off in federated learning compared to naive Gaussian noise.
You can accurately predict steel hardness from nanoindentation data with a tiny dataset and some clever physics-based data augmentation, even when traditional methods fail.
Homomorphic encryption can make federated learning nearly as accurate as centralized training on sensitive healthcare data, but at a steep computational cost, while differential privacy offers a less expensive but accuracy-sacrificing alternative.
Stop wasting compute on fine-tuning datasets with hidden capability gaps: GoalCover lets you diagnose and fix them *before* training.
Foundation model embeddings reveal hidden structure in federated datasets, enabling surprisingly effective client clustering without any training or communication overhead.
Forget training LLMs to understand privacy policies – a specialized, expert-annotated dataset and hybrid framework can do it better, achieving superior readability and reliability.
Even GPT-5.1 struggles to distinguish AI-generated academic images from real ones, achieving only 48.8% accuracy, revealing a significant gap between generative and forensic AI capabilities.
Stop wrestling with messy social media datasets: this toolkit streamlines standardization, anonymization, and enrichment, unlocking cross-platform insights with ease.
Instruction tuning on a new dataset, SecGoal, allows smaller 7B/9B parameter models to outperform much larger LLMs in extracting and formalizing security goals from protocol documents.
AI systems are built on a software house of cards, with 400M lines of code and 11,000 dependencies, yet lack basic supply chain protections like versioning and verifiability.
Current open-world semi-supervised learning methods fall short in practical applications because they fail to extract latent semantic information, but SECOS overcomes this by directly predicting textual labels from a candidate set, achieving state-of-the-art results.
A new test split for DeepSpaceYoloDataset helps push the boundaries of automated astronomical object detection by providing a more diverse and challenging evaluation benchmark.
Forget fully connected relation graphs: CasLayout's sparse relation modeling unlocks enhanced controllability and realism in 3D indoor scene synthesis.
A single self-supervised model trained on millions of unlabeled brain MRI slices can generalize across diverse neuroimaging tasks, rivaling or exceeding specialized models, even with limited labeled data.
Stop retrieving passages in your RAG system: NuggetIndex shows that retrieving and filtering atomic "nuggets" of information yields substantial gains in recall, temporal correctness, and reduced conflicts.
Current image forensics fall flat when faced with the subtle manipulations now possible in 3D Gaussian Splatting scenes, highlighting a critical gap in content authenticity assessment.
Existing synthetic image detectors fail to generalize because they're trained on biased data, but HiMix overcomes this with artifact-aware representations and mixup augmentation, achieving state-of-the-art generalization to unseen generators.
VLMs can get a boost in long-tail performance and train more efficiently by dynamically upsampling underrepresented data clusters each epoch.
Control over physical properties like friction and restitution in generated videos is now possible, paving the way for more realistic and controllable video synthesis.
Forget toy tasks: scaling synthetic computer environments unlocks surprisingly effective training data for agents tackling month-long, real-world productivity workflows.
Diffusion models struggle with multi-object generation not because of imbalanced concept representation, but primarily due to scene complexity and a surprising difficulty in counting, especially when training data is limited.
Half of pedestrian crashes outside intersections happen surprisingly close to them, suggesting intersection design flaws may have a larger impact than previously thought.
Gradient attribution in AI weather models offers a computationally validated, model-informed approach to reward allocation in participatory weather sensing, but beware: adversarial inputs can game the system.
Federated learning can overcome data silos, but struggles when clients have different label relationships; FedHarmony shows how to harmonize these differences, leading to better performance.
Forget individual data points? Child's play. This work lets you surgically remove entire *classes* of data from CNNs without catastrophic forgetting.
Guaranteeing charge balance in generated amorphous materials is now possible without sacrificing accuracy or efficiency, thanks to AMGenC's novel approach.
Feature-level contrastive learning with dynamic masking unlocks superior performance on tabular remote sensing data, even when labels are scarce.
Forget scaling up data volume: repeating a smaller, high-quality German dataset yields superior language models compared to single-pass training on a larger, less filtered corpus.
A carefully crafted synthetic data pipeline and rubric-guided RL let a 4B parameter model nearly match Gemini-3-Flash on wafer defect analysis, suggesting that data quality and targeted training can trump sheer model size.
General American English ASR performance doesn't guarantee similar accuracy across other English accents, as revealed by a new multi-accent call center dataset.
Unlock collaborative AI development in genomics without compromising patient privacy: this framework lets multiple institutions jointly train synthetic data generators on sensitive RNA-seq data using MPC and DP.
Code dataset watermarking gets a stealthy upgrade: PuzzleMark hides watermarks in variable names based on code complexity, making them nearly undetectable while guaranteeing perfect verification.
Ditch the garment masks: a simple human mask is all you need to nail video virtual try-on in the wild.
Seemingly innocuous augmentations like blur can cripple self-supervised learning for fine-grained tasks like plant identification, but domain-aware choices unlock surprisingly strong performance.
Stop wasting compute pre-training on domain-specific datasets; this simple strategy lets you pre-train on ImageNet and still achieve state-of-the-art results on diverse remote sensing segmentation tasks.
Achieve superior CT-MRI cervical spine registration by adaptively fusing Mamba-based global context with Swin Transformer-based local detail.
Nighttime off-road self-driving just got a boost: a new dataset and method robustly handle the dark by fusing infrared and RGB data with a novel memory-attention mechanism.
Forget tedious calibration – DOT-Sim lets you train tactile perception policies in simulation and deploy them directly to real robots with impressive accuracy, thanks to its physically accurate and rapidly calibrated model.
Annotating robot actions just got way faster and more accurate: ATLAS slashes annotation time and error by integrating robot sensor data with video.
Forget static emotion labels – EmoTransCap lets you generate speech captions that actually track how emotions evolve in a conversation.
Building agents that can reliably automate complex, multi-step workflows over local files and tools just got a whole lot easier.
Discover hidden biases in your speech datasets: this toolkit uses non-speech audio to reveal spurious correlations that inflate performance metrics.
Transferring phonetic knowledge from one language to another can dramatically improve automatic phonetic transcription, even enabling the recognition of entirely new phonetic features.
Curriculum learning flips the script on what language structures LMs find "easy," suggesting that training order is a critical factor in shaping their inductive biases.
LLMs still struggle to generate complete, internally structured classes from specifications, with even the best models failing more than half the time on a new benchmark designed to avoid data contamination.
LLMs can generate synthetic mental health records that are clinically coherent, lexically diverse, and privacy-safe, offering a promising solution to data scarcity in mental health research.
Defend against hardware Trojans in LLM-generated RTL code by structurally and semantically verifying training data, without needing to alter the underlying LLM.
Differentially private contrastive learning no longer needs to sacrifice so much accuracy, thanks to a new method that cleverly bounds gradient dependencies.
VideoLLMs leak training data: a novel black-box attack recovers membership with surprisingly high accuracy (AUC=0.68) by probing generation brittleness across temperatures.
A modular workflow achieves competitive, national-scale mapping of linear woody features in Germany from diverse Earth observation data without retraining, demonstrating surprising generalizability.
SkillSynth's skill graph approach lets you explicitly control the diversity of execution trajectories during terminal task synthesis, leading to more effective agent training.
Black-box knowledge distillation can be significantly improved by synthesizing diverse image priors and using contrastive learning to enhance the distinctions between synthetic samples.
Forget blindly chasing correlations – this paper reveals that the features you *think* are most important for model performance might not be the ones where data cleaning yields the biggest gains.
Achieve robust imbalanced classification with scarce minority samples by turning a generative VAE into a discriminative classifier using distribution-aware fine-tuning and statistically sound hypothesis testing.
Accurate landslide prediction is possible with sparse data by injecting geomorphic priors, unlocking geohazard risk assessment in data-scarce mountainous regions.
Ignoring the rank information in maxima nominated samples can lead to substantial performance degradation in fractionally supervised classification, a problem this paper elegantly solves with a new EM algorithm.
Generating realistic landslide datasets from sparse, imbalanced real-world data is now possible, thanks to a tabular foundation model that captures complex feature dependencies.
Real-world tabular data's messiness cripples zero-shot accuracy of powerful Tabular Foundation Models, but a new RL approach can clean up the problem.
Current machine learning models for semiconductor bandgap prediction fall short when faced with the messy reality of experimental data, highlighting a critical need for more robust and generalizable learning strategies.
Dutch NLP researchers, rejoice: a massive, freely available 35B token medical corpus has arrived to jumpstart your models.
Forget generic legal LLMs – LegalMidm shows that focusing on specific Korean legal use cases, with data curated by legal pros, unlocks real-world performance gains.
RAG models struggle to ignore their pre-trained knowledge, even when it contradicts the provided context, but a new dataset can help them learn to be more faithful.
DPO-based post-training can significantly boost the translation quality of pre-trained NMT models like gemma3-1b, even without additional parallel data.
Current cultural bias evaluations of LLMs rely on datasets that lack the nuance to distinguish between genuine cultural understanding and superficial mimicry, but this new dataset changes that.
Twitter strips C2PA provenance data from AI-generated images, making it impossible to cryptographically verify their origin on the platform.
Fragmented medical data hurts MLLM performance: this paper shows how a hierarchical medical knowledge graph can be used to engineer training data that substantially improves MLLM accuracy on complex clinical tasks.
Forget data scale, focus on influence: a new metric reveals that the best instruction tuning data isn't necessarily the most obvious or easiest.
A simple n-gram filter can effectively purge machine-generated content from Wikipedia dumps, yielding higher-quality training corpora.
Stop wrestling with fragmented MGT detection benchmarks: MGTEVAL offers a unified platform to build, attack, train, and evaluate detectors with ease.
Unlock expert developer reasoning: a new dataset distills complex GitHub issue discussions into structured trajectories, revealing the collaborative problem-solving process behind open-source software.
Deepfake detectors can be made far more robust to real-world image corruptions by training on heavily degraded data and ensembling complementary feature streams.
Domain generalization can yield surprisingly compact (3x smaller!), stable, and accurate image representations that transfer across magnifications, without requiring complex architectures or GANs.
Lunar mosaics riddled with radiometric inconsistencies? A deep learning approach can seamlessly blend multi-mission orbital imagery, outperforming traditional methods.
UnIte reveals that incorporating uncertainty into document sampling can lead to substantial improvements in retrieval performance with fewer training samples.
Self-supervised Vision Transformers can handily outperform domain-adapted CNNs when transferring weed detection models from ground-based to drone-based imagery.
Generating realistic 3D skull shapes for rare species is now possible with as few as four examples, thanks to a phylogenetically-informed neural generator that beats diffusion models and even allows for plausible reconstructions of ancestral forms.
MLLMs are better at understanding videos than directly grounding text queries within them, and a self-correction training loop can close the gap.
Achieve state-of-the-art shadow removal in remote sensing images without paired training data by unifying shadow detection and removal into a single framework.
Systematic variations in underwater object detection performance reveal hidden failure modes tied to intrinsic scene factors, challenging existing benchmarks based on synthetic style transfer.
Explicitly modeling fruit maturity as a continuous variable significantly improves robustness against label noise, challenging traditional classification approaches.
Retention models can now harness the power of post-conversion content without risking feature leakage, leading to more accurate predictions of user engagement.
Forget expensive human labeling: BARRED lets you train custom policy guardrails that outperform state-of-the-art LLMs using only synthetic data generated via multi-agent debate.
Forget painstakingly collecting real CAD data – Zero-to-CAD lets you bootstrap CAD program generation from multi-view images using a million-scale dataset synthesized entirely by an LLM agent.
A BiLSTM with a custom slang dictionary rivals AutoML in classifying the sentiment and emotion of messy, real-world Indonesian e-commerce reviews.
Training on semantically equivalent chart renderings in Python, R, and LaTeX unlocks surprisingly effective multi-language chart-to-code generation from a single model.
Forget painstakingly curating datasets – STELLAR-E auto-generates high-quality, domain-specific LLM benchmarks, rivaling real-world data in evaluation quality.
Even the largest language models still struggle to connect information across dispersed code segments, achieving only 74% accuracy on a new benchmark designed to test multi-hop code comprehension.
Finally, a dataset exists to train and benchmark algorithms for automatically detecting airway bifurcations in 3D CT scans, a crucial step towards understanding respiratory diseases.
Low-cost stereo vision can rival LiDAR for real-time windrow detection, paving the way for more accessible autonomous farming solutions.
Robots can now leverage human intuition for manipulation tasks, learning from a massive video dataset to improve motion plausibility and robustness, even when conditions change.
Simulate once, deploy anywhere: SPLIT lets you train tactile perception models on synthetic data and transfer them across different sensors without retraining.
Open-source diffusion models can now achieve state-of-the-art illumination control rivaling closed-source alternatives, thanks to a novel training pipeline and dataset.
LLMs can be systematically debugged and improved by treating training data as code, allowing for targeted "patches" that fix concept-level gaps and reasoning errors.
Unlock the secrets of the deep: OceanPile, a massive, meticulously curated multimodal dataset, finally brings the power of foundation models to the vast and underexplored ocean.