Tsinghua AI

Data Curation & Synthetic Data

34 papers from Tsinghua AI on Data Curation & Synthetic Data

May 6, 2026

Tsinghua AI·2w ago·also SEU, Siemens AI

Breaking the Quality-Privacy Tradeoff in Tabular Data Generation via In-Context Learning

Tabular data synthesis no longer needs to sacrifice privacy for quality: pretraining on diverse datasets lets models generalize from limited context, breaking the traditional tradeoff.

Xinyan Han, Yan Lu, Xiaoyu Lin +5

Data Curation & Synthetic Data Natural Language Processing

Tsinghua AI·2w ago

Trustworthy Federated Label Distribution Learning under Annotation Quality Disparity

Federated learning struggles when data quality varies across clients; FedQual addresses this by calibrating low-quality clients while preserving the autonomy of high-quality ones.

Junxiang Wu, Zhi Kou, Hongwei Zeng +8

Data Curation & Synthetic Data Distributed Systems & Hardware Training Efficiency & Optimization

Apr 30, 2026

Tsinghua AI·3w ago·also Microsoft Research

CasLayout: Cascaded 3D Layout Diffusion for Indoor Scene Synthesis with Implicit Relation Modeling

Forget fully connected relation graphs: CasLayout's sparse relation modeling unlocks enhanced controllability and realism in 3D indoor scene synthesis.

Yingrui Wu, Youkang Kong, Mingyang Zhao +5

Architecture Design (Transformers, SSMs, MoE) Computer Vision Data Curation & Synthetic Data

Tsinghua AI·3w ago·also SEU

FedHarmony: Harmonizing Heterogeneous Label Correlations in Federated Multi-Label Learning

Federated learning can overcome data silos, but it struggles when clients have different label relationships; FedHarmony harmonizes these differences, leading to better performance.

Zhi Kou, Zhiqiang Kou, Junxiang Wu +11

Data Curation & Synthetic Data Distributed Systems & Hardware Natural Language Processing

3w ago·also Tsinghua AI, CAS, NJU, NTU

PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models

Code dataset watermarking gets a stealthy upgrade: PuzzleMark hides watermarks in variable names based on code complexity, making them nearly undetectable while guaranteeing perfect verification.

Haocheng Huang, Yuchen Chen, Weisong Sun +6

Code Generation & Program Synthesis Data Curation & Synthetic Data

Apr 28, 2026

3w ago·also Tsinghua AI, Huawei

OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

MLLMs are better at understanding videos than directly grounding text queries within them, and a self-correction training loop can close the gap.

Minghang Zheng, Zihao Yin, Yi Yang +3

Data Curation & Synthetic Data Multimodal Models Reasoning & Chain-of-Thought

Apr 23, 2026

Tsinghua AI·Apr 23, 2026

UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

By unifying generative and discriminative approaches, UniGenDet achieves superior image generation and detection, suggesting that these tasks benefit from a symbiotic relationship previously hindered by architectural divergence.

Yanran Zhang, Wenzhao Zheng, Yifei Li +5

Architecture Design (Transformers, SSMs, MoE) Computer Vision Data Curation & Synthetic Data

Apr 21, 2026

Tsinghua AI·Apr 21, 2026·also Sheffield

HarmoniDiff-RS: Training-Free Diffusion Harmonization for Satellite Image Composition

Training-free diffusion models can now harmonize satellite imagery across diverse domains, enabling scalable remote-sensing synthesis without retraining.

Xiaoqi Zhuang, Jefersson A. Dos Santos

Computer Vision Data Curation & Synthetic Data

Apr 20, 2026

Apr 20, 2026·also Tsinghua AI, NTU, SMU, University of Massachusetts

Weaponizing the Commons: A Taxonomy and Detection Framework of Abuse on GitHub

GitHub abuse is more widespread and varied than previously thought, demanding a unified detection approach to safeguard software supply chains.

Yuli Cheng, Xiaoyu Zhang, Jiongchi Yu +3

Code Generation & Program Synthesis Data Curation & Synthetic Data Open-Source Models & Weights

Apr 15, 2026

Apr 15, 2026·also Tsinghua AI, Li Auto, PolyU

PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios

Synthesizing realistic anomaly images for industrial assembly is now possible thanks to a diffusion model that respects component pose and assembly relationships.

Zebei Tong, Hongchang Chen, Yujie Lei +4

Computer Vision Data Curation & Synthetic Data

Apr 14, 2026

Apr 14, 2026·also Tsinghua AI, CAU, Northeastern, Southwest Jiaotong University

GTPBD-MM: A Global Terraced Parcel and Boundary Dataset with Multi-Modality

Extracting agricultural parcels from satellite imagery gets a whole lot harder (and more realistic) with a new dataset focused on the complex, irregular, and heterogeneous terrain of terraced farms.

Zhiwei Zhang, Xingyuan Zeng, Xinkai Kong +6

Computer Vision Data Curation & Synthetic Data Multimodal Models

Apr 13, 2026

Tsinghua AI·Apr 13, 2026·also Chinese Academy, Department of Anesthesiology, Department of Cardiology, Fuzhou University Affiliated Provincial +5

CoRe-ECG: Advancing Self-Supervised Representation Learning for 12-Lead ECG via Contrastive and Reconstructive Synergy

By unifying contrastive and reconstructive learning with targeted augmentations, CoRe-ECG extracts more robust and physiologically meaningful representations from unlabeled ECG data than existing self-supervised methods.

Zehao Qin, Xiaojian Lin, Hongliang Wu +4

Data Curation & Synthetic Data Speech & Audio Training Efficiency & Optimization

Tsinghua AI·Apr 13, 2026·also HIT, Nankai University

C-ReD: A Comprehensive Chinese Benchmark for AI-Generated Text Detection Derived from Real-World Prompts

Current Chinese AI-generated text detection benchmarks are too homogeneous; C-ReD fixes this with real-world prompts and diverse LLMs, enabling better generalization.

Chenxi Qing, Junxi Wu, Yixiang Qiu +3

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Natural Language Processing

Apr 13, 2026·also Tsinghua AI, NJU

HistLens: Mapping Idea Change across Concepts and Corpora

See how ideas like "democracy" or "freedom" have subtly shifted their meaning across different news sources and time periods, all within a single, comparable framework.

Yi Jing, Weiyun Qiu, Yihang Peng +1

Data Curation & Synthetic Data Natural Language Processing

Tsinghua AI·Apr 13, 2026·also HIT, Huawei

MathAgent: Adversarial Evolution of Constraint Graphs for Mathematical Reasoning Data Synthesis

Forget human-annotated datasets: MathAgent synthesizes mathematical reasoning data so effectively that models trained on just 1K generated examples outperform those trained on existing datasets.

Zixiong Yu, Jun Rao, Guhan Chen +3

Data Curation & Synthetic Data Reasoning & Chain-of-Thought Red-Teaming & Adversarial Robustness

Apr 13, 2026·also Tsinghua AI, RayNeo.AI, Shenzhen University, SJTU

Evaluating Memory Capability in Continuous Lifelog Scenario

Current memory systems, despite their complexity, are surprisingly worse than naive RAG when applied to continuous lifelogging scenarios, revealing a critical need for better context preservation.

Jianjie Zheng, Zhichen Liu, Zhanyu Shen +4

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Speech & Audio

Tsinghua AI·Apr 13, 2026

CapBench: A Multi-PDK Dataset for Machine-Learning-Based Post-Layout Capacitance Extraction

You can now train your capacitance extraction models on a diverse, multi-PDK dataset of open-source designs, but be ready to trade accuracy for speed when choosing between CNNs and GNNs.

Hector R. Rodriguez, H. R. Rodriguez, Jiechen Huang +1

Data Curation & Synthetic Data Open-Source Models & Weights

Tsinghua AI·Apr 13, 2026

MimicLM: Zero-Shot Voice Imitation through Autoregressive Modeling of Pseudo-Parallel Speech Corpora

Forget complex disentanglement architectures or low-quality synthetic targets: MimicLM achieves superior voice imitation by cleverly using synthetic speech as the *source* and real speech as the *target* in a pseudo-parallel training setup.

Tao Feng, Yuancheng Wang, Xueyao Zhang +4

Data Curation & Synthetic Data Natural Language Processing Speech & Audio

Apr 12, 2026

Tsinghua AI·Apr 12, 2026·also BAIR, Fudan, Shanghai Qi Zhi Institute

AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Afford Correspondence

Unlock zero-shot generalization in robot manipulation by generating diverse, affordance-aware training data with 3D generative models and Vision Foundation Models.

Kaizhe Hu, Yingqian Huang, Yuanchen Ju +2

Computer Vision Data Curation & Synthetic Data Robotics & Embodied AI

Apr 9, 2026

Tsinghua AI·Apr 9, 2026·also HKUST

Rethinking Data Mixing from the Perspective of Large Language Models

LLMs can achieve competitive performance simply by treating data mixing as a graph-constrained optimization problem.

Yuanjian Xu, Tianze Sun, Chang Xu +10

Data Curation & Synthetic Data Scaling Laws & Emergent Abilities Training Efficiency & Optimization

Tsinghua AI·Apr 9, 2026

EditCaption: Human-Aligned Instruction Synthesis for Image Editing via Supervised Fine-Tuning and Direct Preference Optimization

Turns out, you can cut critical errors in VLM-generated image editing instructions in half with a clever two-stage training pipeline, leading to SOTA editing performance.

Xiangyuan Wang, Honghao Cai, Yunhao Bai +6

Computer Vision Data Curation & Synthetic Data Multimodal Models

Apr 9, 2026·also Tsinghua AI, BJTU

GroundingAnomaly: Spatially-Grounded Diffusion for Few-Shot Anomaly Synthesis

Synthesizing realistic anomalies for industrial inspection is now possible with just a few examples, thanks to spatially-grounded diffusion that outperforms existing inpainting techniques.

Yishen Liu, Yisheng Liu, Hongcang Chen +8

Computer Vision Data Curation & Synthetic Data

Apr 8, 2026

Apr 8, 2026·also Tsinghua AI, PKU

BiDexGrasp: Coordinated Bimanual Dexterous Grasps across Object Geometries and Sizes

Generating coordinated bimanual grasps on diverse objects is now possible thanks to a dataset of nearly 10 million grasps and a model that adapts to object geometry and size.

Mu Lin, Yi-Lin Wei, Jiaxuan Chen +6

Data Curation & Synthetic Data Robotics & Embodied AI

Apr 8, 2026·also Tsinghua AI

TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving

Humans are still way better than LLMs at trial-and-error problem solving, and this new dataset of human problem-solving trajectories shows us why.

Xinkai Zhang, Jingtao Zhan, Jingtao Zhan +1

Data Curation & Synthetic Data Reasoning & Chain-of-Thought Tool Use & Agents

Apr 7, 2026

Tsinghua AI·Apr 7, 2026·also HKUST, SJTU, UESTC

ActivityEditor: Learning to Synthesize Physically Valid Human Mobility

Synthesizing realistic human mobility in data-scarce regions is now possible thanks to a dual-LLM-agent framework that learns physical constraints via reinforcement learning.

Chenjie Yang, Yutian Jiang, Anqi Liang +5

Data Curation & Synthetic Data Natural Language Processing Tool Use & Agents

Mar 31, 2026

Tsinghua AI·Mar 31, 2026·also Duke, EPFL

Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration

Stop training your image restoration models to mimic flawed ground truth; instead, explicitly optimize for perceptual quality using a plug-and-play module guided by No-Reference Image Quality Assessment.

Fengyang Xiao, Peng Hu, Lei Xu +7

Computer Vision Data Curation & Synthetic Data

Mar 17, 2026

Tsinghua AI·Mar 17, 2026·also DAMO, Fudan

HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

Multi-hop data synthesis using HopChain boosts VLM performance across a wide range of tasks, with gains of over 50 points in accuracy for ultra-long-context reasoning.

Shenzhi Wang, Shixuan Liu, Jing Zhou +7

Data Curation & Synthetic Data Multimodal Models Reasoning & Chain-of-Thought

Mar 9, 2026

Tsinghua AI·Mar 9, 2026·also Artificial Intelligence Institute of China, Beihang, Beijing Information Science and Technology, BUPT +4

MERLIN: Building Low-SNR Robust Multimodal LLMs for Electromagnetic Signals

MLLMs can now reliably interpret electromagnetic signals even in noisy environments, thanks to a new training framework and benchmark designed specifically for this challenging domain.

Junyu Shen, Zhendong She, Chenghanyu Zhang +11

Data Curation & Synthetic Data Multimodal Models Speech & Audio

Feb 23, 2026

Feb 23, 2026·also Tsinghua AI, BUPT, Shanghai University, Xiangtan

Hyper-KGGen: A Skill-Driven Knowledge Extractor for High-Quality Knowledge Hypergraph Generation

Domain-specific knowledge hypergraphs can now be extracted with significantly improved quality by dynamically learning and applying extraction skills, outperforming static few-shot learning.

Rizhuo Huang, Yifan Feng, Yifan Feng +9

Data Curation & Synthetic Data Natural Language Processing Reasoning & Chain-of-Thought +1

Feb 12, 2026

Tsinghua AI·Feb 12, 2026·also NVIDIA, CAS, Galbot +2

LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion

Training a robot foundation model on 30,000 hours of heterogeneous embodied data lets it outperform prior methods by up to 48% on complex manipulation tasks and even benefit from low-quality data.

Jiangran Lyu, Xuheng Zhang, Yusen Feng +9

Data Curation & Synthetic Data Robotics & Embodied AI World Models & Planning

Oct 31, 2025

Tsinghua AI·Oct 31, 2025·also MIT CSAIL

DLDC: A Dual Loop Data Cleaning Method for Fine-Tuning Remote Sensing Image Generative Models

Forget expensive human annotation: this dual-loop method automatically cleans remote sensing image-text datasets, boosting T2I model performance by over 35%.

Tian Xing, Hu Yan, Xinwei Wang +4

Computer Vision Data Curation & Synthetic Data Multimodal Models

Oct 20, 2025

Tsinghua AI·Oct 20, 2025

MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

LLMs still struggle to learn effectively from user feedback during service, as revealed by a new benchmark spanning multiple domains and languages.

Qingyao Ai, Yichen Tang, Changyue Wang +311

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Natural Language Processing

Oct 15, 2025

Oct 15, 2025·also Tsinghua AI

Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs

High-quality data is all it takes: Bee-8B, trained on the new Honey-Data-15M dataset, leapfrogs existing fully open MLLMs to rival semi-open models.

Yi Zhang, Bolin Ni, Xin-Sheng Chen +75

Data Curation & Synthetic Data Multimodal Models Open-Source Models & Weights

Aug 21, 2025

Tsinghua AI·Aug 21, 2025·also Fudan, Shanghai Key Laboratory of Multimodal

SurGE: A Benchmark and Evaluation Framework for Scientific Survey Generation

LLMs still struggle to synthesize coherent scientific surveys, as evidenced by a new benchmark revealing significant performance gaps even with advanced agentic frameworks.

Weihang Su, Anzhe Xie, Qingyao Ai +5

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Natural Language Processing
