100 papers published across 10 labs.
Automating museum video metadata curation is now possible with a locally deployable video language model, unlocking previously inaccessible audiovisual archives.
Even the best LLMs struggle with multi-turn medical dialogues, with error rates tripling by the third turn and a single wrong answer significantly increasing the probability of subsequent errors.
Forget retraining from scratch: incremental federated learning can keep your IoT intrusion detection models sharp against evolving threats, but the right update strategy is crucial for balancing accuracy and speed.
Reading Activity Traces (RATs) reveal the hidden creative work lost when algorithms automate interpretation, offering a path to design AI that preserves human insight.
LLMs can be made better software engineers by pre-training them to reconstruct the messy, iterative development process that led to the final, clean code in repositories.
A massive, bilingual, authority-grounded dataset could finally make AI-assisted cataloging a reality.
A new, large-scale diachronic corpus for Sinhala, SiDiaC-v.2.0, offers a crucial resource for NLP research on this low-resource language, enabling studies of linguistic change and historical text analysis.
Multilingual math reasoning just got a serious upgrade: mAceReason-Math offers a meticulously translated and cleaned dataset of challenging problems across 14 languages, purpose-built for RLVR training.
Luxembourgish news reveals a surge in code-switching and morphologically adapted borrowings, primarily from French, challenging simple document-level mixing indices.
Forget expensive LLM inference for MTQE: train a COMET model on GPT-4o-generated annotations and get competitive performance.
Train web-navigating agents in safe, scalable, and verifiable synthetic environments automatically cloned from real websites, sidestepping the risks and limitations of real-world interaction.
Overcoming the data scarcity bottleneck in robotic arm-hand coordination, FAR-Dex achieves over 80% real-world success in fine-grained dexterous manipulation tasks.
Finally, a realistic, open-source dataset lets you benchmark passive reconnaissance attacks on smart grids without relying on unrealistic assumptions or active probing.
State-of-the-art skeleton-based action recognition is now possible through a game-theoretic contrastive learning framework that maximizes action-relevant information while minimizing encoding redundancy.
Skip the expensive proxy model training: this training-free method boosts VLLM performance by up to 4.8% using only 10-15% of the data, simply by measuring how much the question *changes* the model's view of the answer.
Forget laboriously sifting through layers or datasets for PEFT: GAST co-optimizes both, adaptively picking the most impactful data for each layer based on gradient alignment.
Modern speech enhancement algorithms may not improve ASR performance in realistic noisy environments, challenging assumptions about their effectiveness in real-world applications.
Get 6x the RLHF alignment gains for your LLM with a new active learning pipeline that focuses annotation on the most informative response pairs.
Achieve real-time super-resolution ultrasound without labeled data using CycleULM, a CycleGAN-based framework that boosts image contrast by 15.3 dB and localization precision by 46%.
Despite ChatGPT's known flaws, it can generate surprisingly realistic synthetic system requirement specifications that fool experts more often than you'd expect.
A new large-scale dataset could jumpstart Vietnamese VQA research by providing a crucial resource for training and evaluating multimodal models in a low-resource language.
VLMs can now self-evolve from *zero* data, thanks to a multi-agent RL framework that synthesizes its own visual concepts and reasoning tasks.
Bridge the gap between sparse core samples and continuous wellbore data with a cGAN that synthesizes realistic subsurface images conditioned on well log porosity.
Rényi differential privacy unlocks tighter privacy guarantees in partition selection, but releasing partition frequencies comes at a cost.
Forget expensive fine-tuning: FoodOntoRAG links food entities with near-SOTA accuracy while adapting to evolving ontologies using a clever RAG architecture with retrieval, selection, scoring, and synonym generation agents.
Forget expensive human annotations: LLMs can reliably generate synthetic data to validate NLP evaluation metrics, even outperforming human agreement in some multilingual tasks.
Text prompts might be inflating your SLLM's performance: spoken prompts reveal a significant performance gap, especially in low-resource languages.
Achieve up to 23% better prediction accuracy in manufacturing surrogate modeling by jointly modeling inter-task similarity and data fidelity using a hierarchical Bayesian approach.
Evaluating classification models on biased data can mask true performance and fairness, but this work provides a framework to create unbiased test sets that reveal the real impact of different biases and mitigation strategies.
Forget generic fine-tuning data — Bloom's Taxonomy-based data generation can boost LLM performance in complex engineering domains like space situational awareness by up to 176%.
Correcting systematic errors in aggregate data is now possible by using proxy variables to disentangle true signals from biases via a VAE-based framework.
Even when paraphrasing content that explicitly contradicts a teacher's preferences, language models can still subliminally learn those preferences, raising serious concerns about bias propagation in self-training scenarios.
Domain-specific prompts can significantly boost document layout analysis, achieving state-of-the-art results by explicitly guiding models with dataset-aware cues.
A meticulously curated, bidirectional English-German corpus of parliamentary proceedings now offers researchers a goldmine for dissecting the nuances of translation, interpreting, and language variation through an information-theoretic lens.
LLMs can generate spatial relation labels that align with human judgments, offering a scalable path to richer, multilingual spatial datasets.
A new OCR pipeline slashes error rates on noisy, polytonic Greek texts, opening up a vast historical corpus for NLP research and LLM training.
Current methods struggle to understand human behavior in industrial settings, as evidenced by the challenging ENIGMA-360 dataset of synchronized ego-exo videos.
Stop generating superficial reviews: RbtAct leverages rebuttals to train LLMs to provide actionable feedback, leading to concrete revisions and improved author uptake.
Finally, a comprehensive dataset unlocks the potential to develop and validate advanced control and estimation algorithms tailored for the unique challenges of nano-quadrotors.
Bridging the gap between CT and scarce CBCT data, a novel UDA framework achieves state-of-the-art liver segmentation by reformulating Margin Disparity Discrepancy.
Dataset condensation, previously limited to neural networks, can now democratize access to clinical data by enabling privacy-preserving training of classical models like decision trees and Cox regression.
Synthetic data, when grounded in vision-language models for evaluation, demonstrably boosts performance in remote sensing tasks like segmentation and captioning, outperforming models trained solely on real-world data.
By integrating physical constraints with adaptive representation learning, TAM-RL substantially enhances the accuracy of global carbon flux estimates, outperforming existing methods.
Imperfect code from LLMs can still teach AI to understand circuit structure, unlocking a scalable path to netlist representation learning without expensive, clean datasets.
Forget manual labeling: influence functions can automatically surface high-quality robot demonstrations, boosting policy performance by intelligently curating training data.
Achieve near-perfect privacy against clustering and inversion attacks in split learning without sacrificing model accuracy by using differential privacy and secret label obfuscation.
State-of-the-art semi-supervised domain generalization (SSDG) methods crumble when faced with the real-world challenge of long-tailed class distributions, but IMaX offers a simple, effective fix.
MLLMs can generate surprisingly effective synthetic training data for defect classification, boosting performance by 20% even with very limited real data.
A modular statistical transformation pipeline boosts audio deepfake detection accuracy by 10.7% in cross-domain scenarios, without needing labeled target data.
Forget more data: pre-training on just 164M tokens of synthetic data from Neural Cellular Automata can outperform pre-training on 1.6B tokens of natural language for downstream LLM tasks.
Emirati Arabic finally gets a dedicated, sociolinguistically rich speech corpus, opening doors for better ASR/TTS in this low-resource language.
Stop wasting time on manual LLM domain adaptation: AutoAdapt automates the process and boosts accuracy by 25% over existing AutoML methods.
Finally, a dataset that tackles the virtual try-on problem head-on with paired, multi-view fashion data, realistic garment dynamics, and rich annotations.
Decentralized z-anonymity is now practical: deZent achieves comparable performance to centralized approaches while minimizing reliance on a trusted central entity.
Robots can now learn manipulation skills from ordinary human videos, thanks to a 3D point tracking method that bridges the embodiment gap and requires only 20 robot demonstrations.
By dynamically adjusting contrastive learning temperatures based on data density, MM-TS achieves state-of-the-art results on multimodal long-tail datasets.
FedLECC slashes communication overhead in federated learning by 50% while boosting accuracy by 12%, all by cleverly picking clients based on data similarity and loss.
MLLMs can now reliably interpret electromagnetic signals even in noisy environments, thanks to a new training framework and benchmark designed specifically for this challenging domain.
Unlock AV speech recognition for any language, even with zero labeled video data, by training on synthetically generated talking-head videos.
LLMs often prefer awkward, literal translations over natural-sounding alternatives, even when the original source text is removed.
Federated differentially private data synthesis can now achieve utility comparable to centralized approaches, even with heterogeneous data distributions, thanks to a novel framework that smartly handles noise and redundancy.
By synthesizing outliers that respect the learned manifold structure, GCOS enables deep networks to more robustly distinguish between in- and out-of-distribution samples, leading to state-of-the-art performance on near-OOD detection.
Noisy issue descriptions holding back your software agent? SWE-Fuse unlocks 60% higher solve rates by fusing issue-guided and issue-free training trajectories.
Instead of discarding noisy pseudo-labels in image restoration, QualiTeacher leverages them by teaching the model to understand and even surpass the quality levels they represent.
Scale qualitative analysis of educational discourse data without sacrificing rigor using a mixed-initiative system that orchestrates LLMs and human expertise.
Achieve 40% better fraud detection by ditching standard gradient descent for a fractional calculus optimizer that remembers the past.
A Shapley-incentivized blockchain boosts federated learning accuracy by 14% and thwarts 90% of malicious attacks in high-speed rail data sharing.
Reported successes in reconstructing PII from sanitized documents may be overstated due to data leakage, leaving the true vulnerability of PII removal techniques uncertain.
A 3B model can match the performance of models more than twice its size in mobile GUI automation by distilling visual history into concise natural language summaries.
A million-scale dataset of globally diverse, cross-modal geo-location pairs, coupled with a novel physical-law-aware network, leapfrogs existing CMGL benchmarks and opens the door to truly universal positioning systems.
A unified framework and comprehensive evaluation reveal the surprisingly nuanced performance of diffusion-based data augmentation, showing where it shines and where it falls short in low-data image classification.
State-of-the-art SLAM algorithms can fail to re-localize in changing seasons, as highlighted by a new multi-modal, year-long boreal forest dataset.
Forget expensive data collection: Seed2Scale leverages a small-model/large-model synergy to self-generate high-quality embodied AI training data, starting from just four seed demonstrations.
A new multilingual benchmark dataset with over 2,500 annotations of personal information enables privacy-preserving machine learning across ten languages, sidestepping the need for sensitive patient data.
LLMs can bootstrap high-quality legal argument mining datasets at scale, but only with careful human-in-the-loop refinement to correct ~20% of initial errors.
Current machine unlearning methods for recommender systems struggle with robustness and sequential deletions, especially in attention-based and recurrent models, highlighting a critical gap ERASE helps to expose.
Forget generic robot demos – this work introduces a complete pipeline and dataset for AI-powered massage robots that can understand language and identify acupoints.
A new 30B open-weight LLM trained on 34 European languages achieves state-of-the-art performance on low-resource languages with significantly less compute, proving that clever training beats brute force.
Forget massive datasets – targeted training on a smaller, carefully curated dataset of challenging competitive programming problems yields 3x faster gains in code generation performance.
Reddit's political echo chambers aren't just a vibe: they're a quantifiable force field that hardens opinions through self-selection, with exposure to opposing views doing little to soften them.
LLMs still struggle with complex legal reasoning, as evidenced by their difficulty in solving Islamic inheritance cases, even with a new dataset designed to support step-by-step reasoning.
Forget expensive, noisy recordings: this procedural engine sound dataset offers 19 hours of clean, annotated audio for training better automotive sound AI.
Tired of fragmented datasets? SeDa unifies 7.6M+ datasets from 200+ platforms with semantic annotation and provenance tracking, making cross-domain data discovery a breeze.
Forget massive multilingual models: fine-tuning on just 5 hours of speech data from a related language slashes ASR error rates for an endangered language, rivaling the performance of Whisper-Small.
Forget hand-annotated 3D datasets: a new automated pipeline generates massive, high-quality 3D spatial intelligence from raw video, unlocking better VLM reasoning.
Forget scaling laws: targeted data engineering, specifically multi-stage distillation and difficulty-aware sampling, allows an 8B model to outperform larger open-source financial LLMs.
Forget re-prompting or inversion: MedSteer lets you surgically edit endoscopic images by steering diffusion model activations, creating perfectly matched counterfactuals with 95% concept flip rates.
Achieve sharper, more accurate infrared super-resolution in real-world conditions by disentangling thermal and structural degradations with a novel autoregressive framework.
Open-set corrective assistance, requiring models to inspect lengthy user behavior and provide corrective actions or language-based feedback, remains a significant challenge even with fine-tuning on diverse interactive data.
Weak LLMs, when strategically leveraged via confidence-based sample weighting, can not only drastically cut preference alignment costs but also surpass the performance of models trained on full human-labeled datasets.
Replaying generic pre-training data during fine-tuning boosts target task performance by up to 2x, challenging the common practice of minimizing its use.
Forget hand-crafted curricula: TSE-Datamap leverages training dynamics to automatically surface optimal learning schedules for target speaker extraction.
Omnidirectional imagery + language unlocks robust multi-object tracking that overcomes the field-of-view limitations plaguing conventional video datasets.
A new cross-linguistic phoneme recognition system, BabAR, finally unlocks scalable analysis of early childhood speech development.
A new Transformer architecture, IAENet, predicts multiple interdependent surgical complications more accurately by explicitly modeling event co-occurrence and handling data heterogeneity.
Forget local geometry – this dynamic data selection method uses a sparse autoencoder to prioritize samples covering frequent feature factors, leading to 2x training acceleration.
Unlock scalable CAD generation from unannotated 3D meshes with DreamCAD, a framework that directly produces editable BREPs from point-level supervision, outperforming existing methods and achieving over 75% user preference.
Curriculum reinforcement learning closes the distributional gap between pre-trained MLLMs and KB-VQA, yielding SOTA results by strategically generating and sampling training data.
A new, meticulously cleaned corpus of Sinhala legal texts opens the door for NLP research in an under-resourced language.
Human annotation errors in cross-cultural micro-expression datasets can be significantly reduced by dynamically re-selecting keyframes, leading to more accurate recognition.