Search papers, labs, and topics across Lattice.
50 papers published across 5 labs.
Training data is not enough: reasoning traces from diverse cultural backgrounds are critical for safe and reliable autonomous driving in rare, long-tail scenarios.
Unlock the potential of full-duplex speech language models with Sommelier, a new open-source pipeline that tackles the messy reality of multi-speaker conversations.
Even with only 5% labeled data, Switch achieves ultrasound segmentation accuracy exceeding fully supervised methods, thanks to its clever multiscale and frequency-domain switching.
A million-sequence, high-quality, open-source motion dataset finally lets text-to-motion models generalize beyond toy benchmarks.
Multi-corpus training can actually *hurt* spoofing detection, unless you strip out dataset-specific biases with this clever domain-invariant feature extraction trick.
Predictive policing algorithms can exhibit extreme racial bias, with one city showing a 157x higher detection rate for one racial group in a single year.
Unlock automated health literacy assessment from clinical notes with HEALIX, the first publicly available dataset of its kind.
Forget random data mixing: MOSAIC uses failure analysis to intelligently curate training data, leading to better safety, less over-refusal, and improved instruction following, all at once.
Automating web data integration for expert querying is now possible: SODIUM-Agent achieves a 2x accuracy boost over existing systems on a new benchmark of 105 real-world tasks.
Object detectors in new visual domains suffer from "astigmatism," but mimicking the human eye's foveal vision can bring them into focus.
Current video object removal methods leave distracting visual artifacts behind, but EffectErase tackles this problem head-on by jointly removing objects and their pesky visual effects.
Descriptor-guided sampling and active learning slash the cost of simulating gas-surface interactions, enabling accurate molecular dynamics at scale.
Encoding realism as a knowledge graph of interpretable traits unlocks zero-shot sim2real image translation that outperforms state-of-the-art diffusion methods.
Diffusion models can now generate rare concepts and execute complex edits with greater fidelity, thanks to a training-free prompt blending technique that leverages statistical properties of the diffusion process itself.
Diffusion models, despite their generative prowess, may not offer the silver-bullet privacy guarantees often assumed when synthesizing tabular data, as demonstrated by novel membership inference attacks.
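The attack family behind this kind of result can be illustrated with the classic loss-threshold membership inference test (the paper's attacks are presumably more sophisticated; this is a minimal generic sketch, with all thresholds and synthetic losses hypothetical):

```python
import numpy as np

def loss_threshold_mia(member_losses, nonmember_losses, threshold):
    """Loss-threshold membership inference: flag a record as a training
    member if the model's loss on it falls below a threshold calibrated
    on known non-members. Overfit models tend to assign lower loss to
    records they were trained on, which is the leakage being measured."""
    tpr = (member_losses < threshold).mean()      # members correctly flagged
    fpr = (nonmember_losses < threshold).mean()   # non-members wrongly flagged
    return tpr, fpr

# Toy synthetic losses: members get systematically lower loss.
rng = np.random.default_rng(0)
member_losses = rng.normal(0.5, 0.2, 1000)
nonmember_losses = rng.normal(1.0, 0.3, 1000)
threshold = np.median(nonmember_losses)  # calibrate on held-out non-members
tpr, fpr = loss_threshold_mia(member_losses, nonmember_losses, threshold)
```

A large gap between true-positive and false-positive rates is exactly the privacy leakage that "synthetic data is anonymous" arguments assume away.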
Stop guessing how much to pretrain vs. specialize your language model – scaling laws can now tell you the optimal compute allocation for maximizing performance on downstream tasks.
LLMs, when used to annotate social media for human values, systematically overestimate "Openness to Change" compared to human experts, revealing a potential bias in automated value detection.
Move over, topic models: this method discovers functional text categories like "courtroom cross-examination" and "lyrical meditation" by learning how text *does*, not just what it's *about*.
Automating linguistically-grounded sign language annotation is now possible, unlocking scalable dataset curation previously limited by manual effort.
Training a DNN to recover a reverberant signal from a *more* reverberant version surprisingly reduces reverberation in the original signal.
Escape the scripted feel of simulated conversations: Interplay trains independent user and recommender LLMs that interact in real time, without pre-defined target items, for more realistic and diverse conversational recommendation data.
Aligning covariates across RCTs and observational studies via calibrated embeddings dramatically improves treatment effect estimation, especially when dealing with nonlinear relationships where traditional imputation struggles.
Federated learning can adapt to asynchronous data drift with up to 83% less retraining cost by using a Mixture-of-Experts architecture to selectively update local parameters.
Forget rephrasing: stitching synthetic text into "megadocs" unlocks surprisingly better pre-training, especially for long-context tasks, and keeps improving as you scale.
LLMs beat traditional metrics at judging PDF table extraction quality, finally offering a way to evaluate semantic correctness, not just structural similarity.
LLMs can be actively trained to master specific knowledge domains with 50% less data and computation by focusing on what they *don't* know, not what they already do.
Teaching LLMs to say "I don't know" is now possible via targeted SFT, slashing hallucination rates without sacrificing performance on other tasks.
Outliers aren't just noise: some are early harbingers of entirely new topics, detectable by tracking document trajectories.
Training on synthetically generated data can significantly boost both the diversity and quality of commonsense reasoning in LLMs, outperforming models trained on scarce human-annotated data.
YouTube's platform defenses are a house of cards: circumventing one control often triggers a cascade of failures, demanding constant architectural adaptation for large-scale content replication.
Unlock faster, more accurate interlinear glossing for low-resource languages by treating morphemes as atomic units, outperforming existing methods and enabling user-guided lexicon expansion without retraining.
Synthetic data and virtual environments are rapidly becoming indispensable for autonomous driving, but realizing their full potential requires tackling challenges like Sim2Real transfer and scalable safety validation.
Forget real-world video datasets: training VLMs on just 7.7K synthetic videos with temporal primitives beats 165K real-world examples, unlocking surprisingly effective transfer learning for video reasoning.
Counterintuitively, the most *unreliable* samples in medical imaging datasets—those with fluctuating confidence and frequent forgetting during training—are the *most* informative for building accurate decision boundaries.
Current CRL benchmarks often fail to provide a holistic view of model performance, hindering progress, but a new aggregate metric could change that.
Optimizing multilingual training? Shapley values reveal the hidden cross-lingual transfer effects that current scaling laws miss, leading to better language mixture ratios.
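The underlying attribution tool is the standard Shapley value from cooperative game theory, sketched here on a toy language-mixture "utility" with a made-up synergy term standing in for cross-lingual transfer (the languages, scores, and bonus are all hypothetical; the paper's utility comes from actual training runs):

```python
from itertools import combinations
from math import factorial

def shapley_values(players, utility):
    """Exact Shapley values: each player's marginal contribution to the
    utility, averaged over all subsets with the usual ordering weights."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (utility(set(subset) | {p}) - utility(set(subset)))
        values[p] = total
    return values

# Hypothetical per-language scores plus a de/nl transfer bonus that
# independent per-language scaling laws would attribute to neither.
base = {"en": 0.50, "de": 0.20, "nl": 0.10}
def utility(mix):
    score = sum(base[lang] for lang in mix)
    if {"de", "nl"} <= mix:
        score += 0.15  # cross-lingual transfer bonus
    return score

phi = shapley_values(list(base), utility)
```

The Shapley values split the transfer bonus between "de" and "nl", so each language's credit exceeds its standalone score, which is the kind of hidden interaction a per-language scaling law cannot see.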
Current AI struggles to understand human values in real-world news events, often missing the who, what, and why – until now.
Pinpointing the training data behind an LLM's behavior is now possible without retraining, opening the door to precise debugging and targeted interventions.
Overcome scarce data and boost material classification accuracy by generating synthetic training data and distilling knowledge from vision-language foundation models.
Automated injection of realistic vulnerabilities and synthesis of PoV exploits finally makes scalable, precisely labeled, repository-level vulnerability datasets a reality.
Current PII detection models are blind to the transaction-level identifiers and partially-filled forms that computer-use agents readily expose, but a new benchmark closes the gap.
Stop benchmarking algorithm discovery on the same old saturated datasets: DiscoGen offers millions of fresh, configurable tasks to truly test your ADA.
Unlock scalable aerial scene understanding with SegFly, a massive RGB-T dataset generated via a novel 2D-3D-2D label propagation technique that requires minimal manual annotation.
A new RGB-T dataset and frequency-aware network expose the surprising limitations of existing UAV detectors when faced with real-world camouflage and complex backgrounds.
Anonymized faces don't have to be expressionless blobs: this method preserves realistic expressions and lighting while scrambling identity.
RIS models struggle with motion-based queries, but a new data augmentation and contrastive learning approach closes the gap without sacrificing performance on appearance-based descriptions.
Achieve stable and reliable network intrusion detection and high-fidelity synthetic data generation by combining machine learning, adversarial learning, and rigorous statistical evaluation on a new unified multi-modal NIDS dataset.
Human-robot teams can get a boost: eye-tracking data alone can predict, with nearly 90% recall, when a human teammate is struggling to understand the robot's situation.
Ditch the data augmentation and decoders: R2-Dreamer's Barlow Twins-inspired objective delivers faster, more versatile MBRL, especially when spotting the small stuff matters.
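For reference, the general Barlow Twins objective that R2-Dreamer builds on pushes the cross-correlation matrix of two embedding views toward the identity: invariance on the diagonal, redundancy reduction off it. A minimal NumPy sketch of that published formulation (how R2-Dreamer adapts it for MBRL is the paper's contribution, not shown here):

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Barlow Twins objective on two batches of embeddings (n x d):
    standardize each dimension, form the d x d cross-correlation matrix,
    then penalize its distance from the identity."""
    n, _ = z_a.shape
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-8)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-8)
    c = (z_a.T @ z_b) / n                       # cross-correlation matrix
    on_diag = ((np.diagonal(c) - 1.0) ** 2).sum()   # invariance term
    off_diag = (c ** 2).sum() - (np.diagonal(c) ** 2).sum()  # redundancy term
    return on_diag + lam * off_diag

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 32))
loss_same = barlow_twins_loss(z, z)                           # identical views
loss_rand = barlow_twins_loss(z, rng.normal(size=(256, 32)))  # unrelated views
```

Identical views drive the loss near zero while unrelated views do not, which is what makes the objective usable without data augmentation pipelines or reconstruction decoders.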
Ditch LiDAR: 3D Gaussian Splatting, combined with semantic segmentation and stereo depth, enables real-time lunar mapping with centimeter-level accuracy.