34 papers from Tsinghua AI on Data Curation & Synthetic Data
Tabular data synthesis no longer needs to sacrifice privacy for quality: pretraining on diverse datasets lets models generalize from limited context, breaking the traditional tradeoff.
Federated learning struggles when data quality varies across clients; FedQual addresses this by calibrating low-quality clients while preserving the autonomy of high-quality ones.
Forget fully connected relation graphs: CasLayout's sparse relation modeling unlocks enhanced controllability and realism in 3D indoor scene synthesis.
Federated learning can overcome data silos, but struggles when clients have different label relationships; FedHarmony shows how to harmonize these differences, leading to better performance.
Code dataset watermarking gets a stealthy upgrade: PuzzleMark hides watermarks in variable names based on code complexity, making them nearly undetectable while guaranteeing perfect verification.
MLLMs are better at understanding videos than directly grounding text queries within them, and a self-correction training loop can close the gap.
By unifying generative and discriminative approaches, UniGenDet achieves superior image generation and detection, suggesting that these tasks benefit from a symbiotic relationship previously hindered by architectural divergence.
Training-free diffusion models can now harmonize satellite imagery across diverse domains, enabling scalable remote-sensing synthesis without retraining.
GitHub abuse is more widespread and varied than previously thought, demanding a unified detection approach to safeguard software supply chains.
Synthesizing realistic anomaly images for industrial assembly is now possible thanks to a diffusion model that respects component pose and assembly relationships.
Extracting agricultural parcels from satellite imagery gets a whole lot harder (and more realistic) with a new dataset focused on the complex, irregular, and heterogeneous terrain of terraced farms.
By unifying contrastive and reconstructive learning with targeted augmentations, CoRe-ECG extracts more robust and physiologically meaningful representations from unlabeled ECG data than existing self-supervised methods.
Current Chinese AI-generated text detection benchmarks are too homogeneous; C-ReD fixes this with real-world prompts and diverse LLMs, enabling better generalization.
See how ideas like "democracy" or "freedom" have subtly shifted their meaning across different news sources and time periods, all within a single, comparable framework.
Forget human-annotated datasets: MathAgent synthesizes mathematical reasoning data so effectively that models trained on just 1K generated examples outperform those trained on existing datasets.
Current memory systems, despite their complexity, are surprisingly worse than naive RAG when applied to continuous lifelogging scenarios, revealing a critical need for better context preservation.
You can now train your capacitance extraction models on a diverse, multi-PDK dataset of open-source designs, but be ready to trade accuracy for speed when choosing between CNNs and GNNs.
Forget complex disentanglement architectures or low-quality synthetic targets: MimicLM achieves superior voice imitation by cleverly using synthetic speech as the *source* and real speech as the *target* in a pseudo-parallel training setup.
Unlock zero-shot generalization in robot manipulation by generating diverse, affordance-aware training data with 3D generative models and Vision Foundation Models.
LLMs can achieve competitive performance simply by treating data mixing as a graph-constrained optimization problem.
Turns out, you can cut critical errors in VLM-generated image editing instructions in half with a clever two-stage training pipeline, leading to SOTA editing performance.
Synthesizing realistic anomalies for industrial inspection is now possible with just a few examples, thanks to spatially-grounded diffusion that outperforms existing inpainting techniques.
Generating coordinated bimanual grasps on diverse objects is now possible thanks to a dataset of nearly 10 million grasps and a model that adapts to object geometry and size.
Humans are still way better than LLMs at trial-and-error problem solving, and this new dataset of human problem-solving trajectories shows us why.
Synthesizing realistic human mobility in data-scarce regions is now possible thanks to a dual-LLM-agent framework that learns physical constraints via reinforcement learning.
Stop training your image restoration models to mimic flawed ground truth; instead, explicitly optimize for perceptual quality using a plug-and-play module guided by No-Reference Image Quality Assessment.
Multi-hop data synthesis using HopChain boosts VLM performance across a wide range of tasks, with gains of over 50 points in accuracy for ultra-long-context reasoning.
MLLMs can now reliably interpret electromagnetic signals even in noisy environments, thanks to a new training framework and benchmark designed specifically for this challenging domain.
Domain-specific knowledge hypergraphs can now be extracted with significantly improved quality by dynamically learning and applying extraction skills, outperforming static few-shot learning.
Training a robot foundation model on 30,000 hours of heterogeneous embodied data lets it outperform prior methods by up to 48% on complex manipulation tasks and even benefit from low-quality data.
Forget expensive human annotation: this dual-loop method automatically cleans remote sensing image-text datasets, boosting T2I model performance by over 35%.
LLMs still struggle to learn effectively from user feedback during service, as revealed by a new benchmark spanning multiple domains and languages.
High-quality data is all it takes: Bee-8B, trained on the new Honey-Data-15M dataset, leapfrogs existing fully open MLLMs to rival semi-open models.
LLMs still struggle to synthesize coherent scientific surveys, as evidenced by a new benchmark revealing significant performance gaps even with advanced agentic frameworks.