Data Curation & Synthetic Data

Infrastructure

Training data quality, synthetic data generation, data filtering, deduplication, and dataset construction.

Keywords

data curationsynthetic datadata qualitydata filteringdeduplicationdataset constructiondata augmentationpretraining data

Recent Papers

Feb 12, 2026

Earth and Environmental2d ago

CAAL: Confidence-Aware Active Learning for Heteroscedastic Atmospheric Regression

The paper addresses the problem of efficiently estimating atmospheric particle properties from routine observations in a heteroscedastic regression setting, where noise varies with input. They introduce Confidence-Aware Active Learning (CAAL), which decouples the optimization of predictive mean and noise level during training and uses a confidence-aware acquisition function to weight epistemic uncertainty by predicted aleatoric uncertainty. Experiments on simulations and real data demonstrate that CAAL outperforms standard active learning baselines in expanding atmospheric particle property databases.

Introduces a confidence-aware active learning framework (CAAL) that dynamically weights epistemic uncertainty with predicted aleatoric uncertainty to improve sample selection in heteroscedastic regression problems.

Fei Jiang, Mingfei Sun, Hugh Coe +32602.11825

Scientific Discovery & Drug DesignData Curation & Synthetic Data

2d ago

Pedagogically-Inspired Data Synthesis for Language Model Knowledge Distillation

The paper introduces a pedagogically-inspired knowledge distillation framework (IOA) for transferring knowledge from large language models (LLMs) to smaller student models. The framework incorporates Bloom's Mastery Learning Principles and Vygotsky's Zone of Proximal Development to dynamically identify knowledge deficiencies, organize knowledge delivery through progressive curricula, and adapt representations. Experiments using LLaMA and Qwen models demonstrate that IOA significantly outperforms baseline distillation methods, achieving higher performance on DollyEval, MATH, and HumanEval benchmarks while using significantly fewer parameters.

Introduces a novel three-stage knowledge distillation framework (IOA) that incorporates pedagogical principles to systematically improve student model performance by identifying knowledge gaps, organizing knowledge delivery, and adapting representations.

Yankai Chen, Xiaokun Zhang2602.12172

Training Efficiency & OptimizationInference & QuantizationData Curation & Synthetic Data

2d ago

DHPLT: large-scale multilingual diachronic corpora and word representations for semantic change modelling

The paper introduces DHPLT, a large-scale multilingual diachronic corpus comprising web-crawled data from 41 languages across three time periods (2011-2015, 2020-2021, 2024-present). The authors leverage web crawl timestamps as a proxy for document creation time, providing 1 million documents per time period per language. They also provide pre-computed word embeddings and lexical substitutions to facilitate semantic change modeling research, addressing the scarcity of such resources for many languages.

Introduces DHPLT, a novel multilingual diachronic corpus with pre-computed embeddings and lexical substitutions, designed to facilitate research in semantic change modeling across 41 languages.

Mariia Fedorova, Andrey Kutuzov, Khonzoda Umarova2602.11968

Data Curation & Synthetic DataNatural Language ProcessingOpen-Source Models & Weights

2d ago

A Subword Embedding Approach for Variation Detection in Luxembourgish User Comments

This paper introduces a subword embedding approach to detect lexical and orthographic variation in user-generated text, specifically addressing the challenges of "noisy" and low-resource settings without relying on normalization or predefined variant lists. The method trains subword embeddings on raw Luxembourgish user comments and clusters related forms using a combination of cosine similarity and n-gram similarity. The results demonstrate the effectiveness of distributional modeling in uncovering meaningful patterns of variation, aligning with existing dialectal and sociolinguistic research.

Introduces a novel subword embedding method that automatically discovers and clusters lexical variations in user-generated text, even in low-resource languages, without requiring prior normalization or predefined variant lists.

Anne-Marie Lutgen, Alistair Plum, Christoph Purschke2602.11795

Natural Language ProcessingData Curation & Synthetic DataRecommendation & Information Retrieval

2d ago

It's TIME: Towards the Next Generation of Time Series Forecasting Benchmarks

The paper introduces TIME, a new time series forecasting benchmark designed to address limitations in existing benchmarks related to data composition, integrity, task formulation, and analysis perspectives. TIME comprises 50 fresh datasets and 98 forecasting tasks constructed using a human-in-the-loop pipeline to ensure data integrity and real-world alignment. The benchmark also introduces a pattern-level evaluation perspective based on structural time series features to provide generalizable insights into model capabilities, and the authors evaluate 12 TSFMs on TIME.

Introduces TIME, a novel task-centric time series forecasting benchmark with enhanced data integrity, real-world task formulations, and pattern-level evaluation.

Zhongzheng Qiao, Viktoriya Zhukova, Qingsong Wen +22602.12147

Eval Frameworks & BenchmarksData Curation & Synthetic Data

2d ago

Data-Driven Trajectory Imputation for Vessel Mobility Analysis

The paper introduces HABIT, a data-driven framework for imputing missing segments in vessel trajectories using historical Automatic Identification System (AIS) data. HABIT leverages H3 geospatial indexing to aggregate and analyze vessel motion patterns, enabling the imputation of missing trajectory segments based on learned historical behaviors. Empirical evaluation demonstrates that HABIT achieves comparable accuracy to existing methods while offering improved latency and better accounting for vessel characteristics.

Introduces HABIT, a novel H3 Aggregation-Based Imputation framework, to impute missing vessel trajectories by learning and leveraging historical vessel motion patterns.

G. Spiliopoulos, Alexandros Troupiotis-Kapeliaris, Kostas Patroumpas +42602.11890

Data Curation & Synthetic DataNatural Language Processing

2d ago

How to Sample High Quality 3D Fractals for Action Recognition Pre-Training?

The paper investigates the use of 3D fractals generated via Iterated Function Systems (IFS) as a synthetic pre-training dataset for action recognition models. It identifies limitations in standard fractal generation methods, including slow speed and degenerate fractal structures, and finds that overly restrictive filtering hurts downstream performance. The authors introduce Targeted Smart Filtering, a novel method that significantly accelerates fractal generation (100x speedup) while maintaining fractal diversity, leading to improved action recognition performance after pre-training.

Introduces Targeted Smart Filtering, a novel method for generating high-quality 3D fractals for action recognition pre-training that balances generation speed and fractal diversity.

Marko Putak, T. Moeslund, J. B. Haurum2602.11810

Data Curation & Synthetic DataComputer Vision

American University of Armenia2d ago

PLESS: Pseudo-Label Enhancement with Spreading Scribbles for Weakly Supervised Segmentation

The paper introduces PLESS, a pseudo-label enhancement strategy for weakly supervised segmentation using scribble annotations, addressing the limitations of noisy and incomplete supervision. PLESS leverages a hierarchical partitioning of the image into spatially coherent regions to propagate scribble information and refine pseudo-labels within these regions. Experiments on cardiac MRI datasets demonstrate that PLESS consistently improves segmentation accuracy across different scribble-supervised algorithms.

Introduces a novel pseudo-label enhancement strategy, PLESS, that leverages hierarchical image partitioning to improve the reliability and spatial consistency of pseudo-labels in weakly supervised segmentation.

Yeva Gabrielyan, Varduhi Yeghiazaryan, Irina Voiculescu Akian College of Science +82602.11628

Computer VisionTraining Efficiency & OptimizationData Curation & Synthetic Data

2d ago

Affordance-Graphed Task Worlds: Self-Evolving Task Generation for Scalable Embodied Learning

The paper introduces Affordance-Graphed Task Worlds (AGT-World), a framework that automatically generates interactive simulated environments and robot task policies from real-world observations by formalizing the task space as a structured graph. This graph-based approach allows for hierarchical decomposition of complex goals into atomic primitives, addressing the limitations of random proposal or static replication methods. The authors further incorporate a self-evolution mechanism with hybrid feedback, combining Vision-Language Model reasoning and geometric verification, to refine policies.

Introduces a self-evolving framework for generating simulated task environments and robot policies by structuring the task space as an affordance graph and using hybrid feedback for policy refinement.

Xiang Liu, Guocai Yao2602.12065

World Models & PlanningRobotics & Embodied AIData Curation & Synthetic Data

2d ago

Fourier Transformers for Latent Crystallographic Diffusion and Generative Modeling

This paper introduces a reciprocal-space generative pipeline for crystalline materials, representing crystals via a truncated Fourier transform of the species-resolved unit-cell density. This Fourier representation inherently handles periodic boundary conditions and crystallographic symmetries, while also supporting variable atomic multiplicities. The pipeline is instantiated using a transformer variational autoencoder and a latent diffusion model, demonstrating effective reconstruction and unconditional generation of crystal structures.

Introduces a novel reciprocal-space generative pipeline using Fourier transforms to represent and generate crystalline materials, inherently addressing periodicity, symmetry, and variable atomic multiplicities.

J. Duersch, Elohan Veillon, Astrid Klipfel +22602.12045

Scientific Discovery & Drug DesignArchitecture Design (Transformers, SSMs, MoE)Data Curation & Synthetic Data

2d ago

CitiLink-Minutes: A Multilayer Annotated Dataset of Municipal Meeting Minutes

The paper introduces CitiLink-Minutes, a novel multilayer dataset of 120 European Portuguese municipal meeting minutes from six municipalities, designed to address the lack of annotated datasets for NLP and IR research in this domain. The dataset features over one million tokens with de-identified personal information and includes manual annotations across metadata, subjects of discussion, and voting outcomes. Experiments demonstrate the dataset's utility for downstream tasks like metadata extraction, topic classification, and vote labeling, facilitating transparent access to municipal decisions.

Contributes CitiLink-Minutes, a unique multilayer annotated dataset of municipal meeting minutes, enabling NLP and IR research on local governance.

Ricardo Campos, Ana Filipa Pacheco, Ana Lu´ısa Fernandes +112602.12137

Data Curation & Synthetic DataNatural Language ProcessingRecommendation & Information Retrieval

2d ago

BlackCATT: Black-box Collusion Aware Traitor Tracing in Federated Learning

The paper introduces BlackCATT, a novel black-box traitor tracing method for federated learning that is resilient to collusion attacks. BlackCATT employs a collusion-aware embedding loss and iteratively optimizes trigger sets for watermark embedding, improving convergence and tracing performance. The authors also propose BlackCATT+FR, which incorporates functional regularization at the aggregator to address update incompatibility issues in models with batch normalization, maintaining tracing performance.

Introduces a collusion-resistant black-box traitor tracing method (BlackCATT) for federated learning that uses a novel collusion-aware embedding loss and iteratively optimized triggers.

Elena Rodr'iguez-Lois, Fabio Brau, Maura Pintor +22602.12138

Constitutional AI & AI EthicsRed-Teaming & Adversarial RobustnessData Curation & Synthetic Data

2d ago

Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

The paper introduces Composition-RL, a method to improve reinforcement learning of LLMs by composing multiple verifiable prompts into a single, more complex prompt, addressing the issue of diminishing returns from easy (pass-rate-1) prompts as training progresses. This approach aims to better utilize limited verifiable prompts by creating new training examples that maintain a high pass rate while increasing complexity. Experiments on models ranging from 4B to 30B parameters demonstrate that Composition-RL enhances reasoning capabilities and enables more effective cross-domain RL when combined with a curriculum learning strategy that gradually increases compositional depth.

Introduces Composition-RL, a novel method that composes multiple verifiable prompts to create more complex training examples for reinforcement learning of LLMs, thereby improving reasoning capabilities and cross-domain generalization.

Clive Bai, Weijie Liu, Yang Wang +22602.12036

RLHF & Preference LearningTraining Efficiency & OptimizationData Curation & Synthetic Data

2d ago

Semantically Conditioned Diffusion Models for Cerebral DSA Synthesis

This paper introduces a semantically conditioned latent diffusion model (LDM) for synthesizing arterial-phase cerebral digital subtraction angiography (DSA) images, addressing the scarcity of DSA data due to its invasive nature. The LDM is conditioned on text embeddings representing anatomical circulation (anterior/posterior) and C-arm positions, enabling explicit control over the synthesis process. Evaluation by medical experts showed high clinical realism with Likert scores of 3.1-3.3 and a low Fréchet inception distance (FID) of 15.27, demonstrating the potential for generating realistic synthetic DSAs for research and training.

Demonstrates semantically controlled synthesis of realistic cerebral DSA images using a latent diffusion model conditioned on anatomical and geometric parameters.

Qiwen Xu, David Rugamer, H. Wenz +42602.11703

Computer VisionData Curation & Synthetic DataScientific Discovery & Drug Design

2d ago

Vascular anatomy-aware self-supervised pre-training for X-ray angiogram analysis

The paper introduces VasoMIM, a vascular anatomy-aware masked image modeling framework for self-supervised learning on X-ray angiograms, addressing the scarcity of annotated data in this domain. VasoMIM uses an anatomy-guided masking strategy and an anatomical consistency loss to improve the learning of vascular semantics and structural consistency. The framework is pre-trained on XA-170K, a newly curated large-scale X-ray angiogram dataset, and achieves state-of-the-art performance on four downstream tasks across six datasets, demonstrating its transferability.

Introduces VasoMIM, a novel self-supervised learning framework incorporating anatomy-guided masking and anatomical consistency loss, specifically designed for X-ray angiogram analysis.

De-Xing Huang, Xiaohu Zhou, Tian-Yu Xiang +72602.11536

Data Curation & Synthetic DataComputer VisionScientific Discovery & Drug Design

Amap2d ago

IntTravel: A Real-World Dataset and Generative Framework for Integrated Multi-Task Travel Recommendation

The authors introduce IntTravel, a large-scale dataset with 4.1 billion interactions for integrated travel recommendation, addressing the limitations of existing datasets that focus solely on next POI recommendation. To leverage this dataset, they propose a decoder-only generative framework that balances task collaboration and differentiation through information preservation, selection, and factorization. Experiments demonstrate state-of-the-art performance on IntTravel and another benchmark dataset, with a successful deployment on Amap resulting in a 1.09% CTR increase.

Introduces a large-scale dataset, IntTravel, and a novel generative framework for integrated multi-task travel recommendation, demonstrating improved performance and real-world impact.

Longfei Xu, Zheng Liu, Xiangxiang Chu2602.11664

Data Curation & Synthetic DataRecommendation & Information Retrieval

2d ago

LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion

The paper introduces LDA-1B, a robot foundation model that scales to 1B parameters by learning dynamics, policy, and visual forecasting from a new 30k-hour embodied interaction dataset (EI-30k) comprising diverse human and robot trajectories. LDA-1B leverages a structured DINO latent space for dynamics prediction to avoid pixel-space modeling and employs a multi-modal diffusion transformer to handle asynchronous vision and action streams. Experimental results demonstrate that LDA-1B outperforms existing methods on contact-rich, dexterous, and long-horizon tasks, while also enabling data-efficient fine-tuning by effectively utilizing low-quality trajectories.

Introduces a scalable robot foundation model, LDA-1B, capable of learning from diverse embodied data by predicting in a structured latent space and employing a multi-modal diffusion transformer.

Jiangran Lyu, Xuheng Zhang, Yusen Feng +92602.12215

Robotics & Embodied AIWorld Models & PlanningData Curation & Synthetic Data

2d ago

VIRENA: Virtual Arena for Research, Education, and Democratic Innovation

The paper introduces VIRENA, a virtual platform designed for controlled experimentation within realistic social media environments, addressing limitations in data access and ethical constraints in studying online dynamics. VIRENA allows researchers to simulate feed-based platforms and messaging apps, enabling interactions between human participants and LLM-powered AI agents with configurable personas. The platform's no-code interface facilitates manipulation of content moderation, scheduling of stimuli, and execution of experiments, making it accessible for studying human-AI interaction, moderation interventions, and group deliberation.

Introduces VIRENA, a novel virtual platform enabling controlled social media experiments with human and AI participants, featuring a no-code interface and realistic platform simulations.

Emma Hoes, K. J. Klueser, Fabrizio Gilardi2602.12207

Data Curation & Synthetic DataOpen-Source Models & WeightsConstitutional AI & AI Ethics

2d ago

Towards Personalized Bangla Book Recommendation: A Large-Scale Multi-Entity Book Graph Dataset

The authors introduce RokomariBG, a large-scale, multi-entity heterogeneous book graph dataset for personalized Bangla book recommendation, addressing the lack of resources in this low-resource language setting. They construct a knowledge graph comprising books, users, authors, categories, publishers, and reviews connected through eight relation types. Through benchmarking experiments on Top-N recommendation using collaborative filtering, matrix factorization, content-based methods, graph neural networks, and neural retrieval models, they demonstrate the dataset's utility and the importance of leveraging multi-relational structure and textual side information, achieving an NDCG@10 of 0.204 with neural retrieval models.

Introduces RokomariBG, a novel large-scale, multi-entity heterogeneous graph dataset for Bangla book recommendation, complete with benchmarking experiments.

Rahin Arefin Ahmed, Sakil Ahmed Sheikh Reza, Devnil Bhattacharjee +22602.12129

Data Curation & Synthetic DataRecommendation & Information RetrievalNatural Language Processing

2d ago

DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels

The paper introduces DICE, a diffusion large language model (dLLM) specifically designed for CUDA kernel generation, addressing the limitations of autoregressive models and the scarcity of training data. They construct CuKe, a supervised fine-tuning dataset optimized for CUDA kernels, and propose a bi-phase curated reinforcement learning (BiC-RL) framework for training. Experiments on KernelBench show that DICE models (1.7B, 4B, and 8B parameters) outperform existing autoregressive and diffusion LLMs, achieving state-of-the-art results in CUDA kernel generation.

Introduces DICE, a novel diffusion-based LLM architecture and training methodology, that significantly improves CUDA kernel generation performance compared to existing autoregressive and diffusion models.

Haolei Bai, Jianmian Wang, Zhiqiang Tao2602.11715

Code Generation & Program SynthesisArchitecture Design (Transformers, SSMs, MoE)Data Curation & Synthetic Data

2d ago

UPDA: Unsupervised Progressive Domain Adaptation for No-Reference Point Cloud Quality Assessment

This paper introduces UPDA, an unsupervised progressive domain adaptation framework for no-reference point cloud quality assessment (NR-PCQA) to address performance degradation caused by distribution shifts between training and testing data. UPDA employs a two-stage coarse-to-fine alignment strategy, using a quality-discrepancy-aware hybrid loss for coarse alignment and a perception fusion approach with a conditional discriminator for fine-grained alignment. Experiments demonstrate that UPDA effectively improves NR-PCQA performance in cross-domain scenarios.

Introduces an unsupervised progressive domain adaptation (UPDA) framework featuring a novel two-stage coarse-to-fine alignment strategy to mitigate domain shifts in no-reference point cloud quality assessment.

Jincan Wu, Yonghui Liu, Weiqing Li2602.11969

Computer VisionData Curation & Synthetic Data

2d ago

DiffPlace: Street View Generation via Place-Controllable Diffusion Model Enhancing Place Recognition

The paper introduces DiffPlace, a diffusion-based framework for generating place-controllable street view images from text, BEV maps, and object bounding boxes, specifically addressing the challenge of generating background-consistent urban scenes. DiffPlace employs a place-ID controller, using linear projection, a perceiver transformer, and contrastive learning to map place-ID embeddings into a CLIP space, enabling control over background consistency while allowing foreground variations. Experiments demonstrate that DiffPlace achieves superior generation quality and improves visual place recognition performance when used for data augmentation compared to existing methods.

Introduces a place-ID controller within a multi-view diffusion model to enable place-controllable street view generation, enhancing background consistency and foreground flexibility.

Ji Li, Zhiwei Li, Shihao Li +12602.11875

Multimodal ModelsComputer VisionData Curation & Synthetic Data

2d ago

ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning

The paper introduces ScalSelect, a training-free multimodal data selection method for visual instruction tuning (VIT) that addresses the computational expense and redundancy of large-scale datasets. ScalSelect constructs sample representations by extracting visual features most attended by instruction tokens in the target VLM and then identifies samples whose representations best approximate the dominant subspace of the full dataset. Experiments demonstrate that ScalSelect achieves comparable or superior performance to full-data training using significantly less data (e.g., 16%).

Introduces ScalSelect, a scalable training-free multimodal data selection method that achieves high performance in visual instruction tuning while significantly reducing computational costs.

Shijie Lian, Xiaopeng Lin, Cong Huang +22602.11636

Multimodal ModelsTraining Efficiency & OptimizationData Curation & Synthetic Data

2d ago

A critical assessment of bonding descriptors for predicting materials properties

This paper investigates the impact of incorporating quantum-chemical bonding descriptors into machine learning models for predicting materials properties. They leverage an extended Quantum-Chemical Bonding Database for Solid-State Materials, encompassing approximately 13,000 materials, to derive a new set of bonding descriptors. Their systematic assessment demonstrates that including these descriptors enhances the predictive performance of models for elastic, vibrational, and thermodynamic properties and facilitates the discovery of intuitive expressions for properties like the projected force constant and lattice thermal conductivity through symbolic regression.

Demonstrates the utility of quantum-chemical bonding descriptors in improving the performance and interpretability of machine learning models for predicting materials properties.

A. Naik, Nidal Dhamrait, Katharina Ueltzen +42602.12109

Scientific Discovery & Drug DesignData Curation & Synthetic Data

Department of Biomedical Engineering and Physics2d ago

Synthesis of Late Gadolinium Enhancement Images via Implicit Neural Representations for Cardiac Scar Segmentation

This paper introduces a novel data augmentation framework for cardiac scar segmentation using implicit neural representations (INRs) and denoising diffusion models to synthesize late gadolinium enhancement (LGE) images and corresponding segmentation masks. INRs are trained to capture continuous spatial representations of LGE data and masks, compressed into latent embeddings, and then used by a diffusion model to generate new representations that are decoded into synthetic LGE images with anatomically consistent segmentation masks. Experiments demonstrate that augmenting training data with synthetic volumes improves fibrosis segmentation performance, increasing the Dice score from 0.509 to 0.524.

Introduces a novel annotation-free data augmentation method for cardiac scar segmentation by synthesizing LGE images and segmentation masks using INRs and diffusion models.

Soufiane Ben Haddou, Laura Alvarez-Florez, Erik J. Bekkers +32602.11942

Data Curation & Synthetic DataComputer VisionScientific Discovery & Drug Design

Allen Institute for Artificial2d ago·affiliated labs: Allen Institute for AI (AI2), Stanford HAI

Olmix: A Framework for Data Mixing Throughout LM Development

The paper introduces Olmix, a framework designed to address challenges in data mixing for language model training, specifically focusing on understanding the configuration space of mixing methods and efficiently adapting to evolving domain sets. Through an empirical study, the authors identify key design choices for effective mixing methods and propose "mixture reuse," a technique that leverages past mixture ratios to efficiently recompute mixtures after domain set updates. Experiments show that mixture reuse achieves comparable performance to full recomputation with significantly reduced compute (74% less) and outperforms training without mixing by 11.6% on downstream tasks.

Introduces and validates "mixture reuse," a novel technique for efficiently adapting data mixtures in language model training when the domain set evolves.

Tyler C. Murray, David Heineman, Matt Jordan +42602.12237

Data Curation & Synthetic DataTraining Efficiency & OptimizationNatural Language Processing

2d ago

SynthRAR: Ring Artifacts Reduction in CT with Unrolled Network and Synthetic Data Training

This paper addresses ring and streak artifacts in CT images caused by defective detectors by reformulating ring artifact reduction (RAR) as an inverse problem solved with an unrolled network that incorporates both non-ideal detector responses and CT geometry. The method leverages synthetic data generated from natural images to capture the correlation of ring artifacts between sinogram and image domains, eliminating the need for real clinical training data. Results demonstrate that the proposed SynthRAR method, trained solely on synthetic data, outperforms existing state-of-the-art RAR techniques across various scanning geometries and anatomical regions.

Introduces SynthRAR, an unrolled network trained on synthetic data, to effectively reduce ring artifacts in CT images by modeling non-ideal detector responses and leveraging correlations between sinogram and image domains.

Hongxu Yang, Levente Lippenszky, Edina Timko +12602.11880

Data Curation & Synthetic DataComputer VisionTraining Efficiency & Optimization

2d ago

Efficient Crawling for Scalable Web Data Acquisition (Extended Version)

The paper introduces a reinforcement learning-based web crawling algorithm, SB-CLASSIFIER, designed to efficiently acquire statistical datasets (SDs) from websites. The algorithm addresses the challenge of inefficient or impossible SD retrieval at scale by learning which hyperlinks lead to pages that link to many targets, based on the paths leading to the links in their enclosing webpages. Experiments on large websites demonstrate that SB-CLASSIFIER can retrieve a high fraction of a site's targets while crawling only a small part of the website.

Introduces a novel reinforcement learning-based web crawler, SB-CLASSIFIER, that leverages sleeping bandits to efficiently identify and extract statistical datasets from large websites.

Antoine Gauquier, I. Manolescu, Pierre Senellart2602.11874

Data Curation & Synthetic DataRecommendation & Information RetrievalNatural Language Processing

2d ago

KuaiSearch: A Large-Scale E-Commerce Search Dataset for Recall, Ranking, and Relevance

The authors introduce KuaiSearch, a large-scale e-commerce search dataset derived from Kuaishou user interactions, designed to address limitations in existing datasets such as anonymization and single-stage coverage. KuaiSearch includes authentic user queries, natural product texts, and covers cold-start users/long-tail products across recall, ranking, and relevance stages of the search pipeline. Through comprehensive analysis and benchmark experiments, the authors demonstrate KuaiSearch's value for advancing research in real-world e-commerce search, particularly for LLM-based approaches.

Introduces KuaiSearch, a novel large-scale e-commerce search dataset built from real-world Kuaishou user interactions spanning recall, ranking, and relevance stages.

Yupeng Li, Ben Chen, Zhiding Liu +32602.11518

Eval Frameworks & BenchmarksRecommendation & Information RetrievalData Curation & Synthetic Data

Feb 10, 2026

Lehigh University4d ago

LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation

The authors introduce LiveMedBench, a dynamically updated medical benchmark designed to address data contamination and temporal misalignment in LLM evaluation by continuously harvesting real-world clinical cases from online medical communities. They employ a Multi-Agent Clinical Curation Framework to filter noise and validate clinical integrity, and an Automated Rubric-based Evaluation Framework for granular, case-specific assessment. Evaluation of 38 LLMs on LiveMedBench reveals significant performance degradation on post-cutoff cases and identifies contextual application as a major bottleneck, highlighting the limitations of current LLMs in clinical reasoning.

Introduces LiveMedBench, a novel, continuously updated medical benchmark with automated rubric evaluation, to mitigate data contamination and improve the reliability of LLM evaluation in clinical settings.

Zhiling Yan, Zhe Fang, Xiang Li +22602.10367

Eval Frameworks & BenchmarksData Curation & Synthetic DataScientific Discovery & Drug Design

4d ago

Fine-T2I: An Open, Large-Scale, and Diverse Dataset for High-Quality T2I Fine-Tuning

The authors introduce Fine-T2I, a large-scale (6M image-text pairs, 2TB), high-quality, and openly licensed dataset for text-to-image fine-tuning, addressing limitations in existing datasets regarding resolution, alignment, and diversity. Fine-T2I combines synthetically generated images with curated real images, rigorously filtered for quality. Fine-tuning various pretrained diffusion and autoregressive models on Fine-T2I demonstrates consistent improvements in generation quality and instruction adherence, as validated through human evaluation and automatic metrics.

Introduces Fine-T2I, a meticulously curated and openly licensed dataset of 6 million image-text pairs, designed to overcome limitations in existing T2I fine-tuning datasets and improve model performance.

Xu Ma, Yitian Zhang, Qihua Dong +12602.09439

Multimodal ModelsData Curation & Synthetic DataComputer Vision

Feb 5, 2026

1w ago

GreekMMLU: A Native-Sourced Multitask Benchmark for Evaluating Language Models in Greek

The paper introduces GreekMMLU, a new native-sourced benchmark for evaluating LLMs in Greek, comprising 21,805 multiple-choice questions across 45 subjects with difficulty levels spanning primary to professional examinations. The benchmark addresses the lack of authentic Greek evaluation datasets by sourcing questions directly from academic, professional, and governmental exams in Greek. Evaluations of over 80 LLMs using GreekMMLU reveal performance gaps between frontier and open-weight models, and between Greek-adapted and general multilingual models, providing insights for improving LLM capabilities in Greek.

Introduces GreekMMLU, a novel native-sourced benchmark for evaluating multitask language understanding in Greek, designed to overcome limitations of machine-translated datasets.

Yang Zhang, Mersin Konomi, Christos Xypolopoulos +62602.05150

Eval Frameworks & BenchmarksNatural Language ProcessingData Curation & Synthetic Data

1w ago

Contour Refinement using Discrete Diffusion in Low Data Regime

This paper introduces a discrete diffusion contour refinement pipeline for boundary detection, specifically designed for low-data regimes common in medical imaging and environmental monitoring. The method employs a CNN with self-attention to iteratively denoise sparse contour representations conditioned on segmentation masks. By simplifying the diffusion process, customizing the model architecture, and minimizing post-processing, the approach achieves state-of-the-art or competitive performance on KVASIR, HAM10K, and a custom Smoke dataset, while also improving inference speed.

Introduces a lightweight discrete diffusion contour refinement pipeline tailored for robust boundary detection with limited training data.

F. Guan, Ian Keefe, Sophie Wilkinson +22602.05880

Computer VisionTraining Efficiency & OptimizationData Curation & Synthetic Data

Feb 4, 2026

1w ago

A Few-Shot Semantic Meta-Learning Framework with CRF for Skill Entity Recognition in Open Innovation Ecosystems

This paper introduces a Few-Shot Semantic Meta-Learning framework with CRF (FSM-CRF) for Indonesian skill entity recognition to address the scarcity of annotated data and the rapid evolution of skill expressions in open innovation ecosystems. The FSM-CRF model integrates semantic span representations, episodic meta-learning, and BIO-constrained CRF decoding to improve prototype stability and entity-boundary precision. Evaluated on the NERSkill.id dataset under a 3-way, 10-shot episodic setting, the model achieves a micro-F1 of 73.84%, outperforming traditional supervised and existing few-shot baselines, demonstrating the effectiveness of semantic meta-learning for skill-intelligence infrastructures.

Introduces a novel Few-Shot Semantic Meta-Learning framework with CRF (FSM-CRF) that leverages semantic span representations, episodic meta-learning, and BIO-constrained CRF decoding for improved Indonesian skill entity recognition in low-resource settings.

Nurchim, Muljono, E. Noersasongko +2

Natural Language ProcessingData Curation & Synthetic Data

Jan 31, 2026

2w ago

DuoGen: Towards General Purpose Interleaved Multimodal Generation

The paper introduces DuoGen, a general-purpose interleaved multimodal generation framework designed to improve the quality of models generating interleaved image and text sequences under general instructions. DuoGen constructs a large-scale instruction-tuning dataset from curated websites and synthetic examples and employs a two-stage decoupled training strategy using a pretrained multimodal LLM and a diffusion transformer (DiT). Experiments demonstrate that DuoGen outperforms existing open-source models in text quality, image fidelity, and image-context alignment, achieving state-of-the-art performance in text-to-image generation and image editing.

Introduces a two-stage decoupled training strategy for interleaved multimodal generation that combines a pretrained multimodal LLM for instruction understanding with a diffusion transformer (DiT) for image generation.

Min Shi, Xiaohui Zeng, Jiannan Huang +132602.00508

Multimodal ModelsData Curation & Synthetic DataArchitecture Design (Transformers, SSMs, MoE)

Jan 29, 2026

2w ago

Cross-Lingual Transfer Learning: Applications in Low-Resource Languages

This paper reviews cross-lingual transfer learning techniques for low-resource languages, focusing on applications in machine translation, text classification, and named entity recognition. It addresses the problem of data scarcity in low-resource languages by leveraging knowledge from high-resource languages. The review synthesizes technical approaches, identifies challenges, and outlines future directions, providing practical insights for researchers.

Systematically reviews and synthesizes cross-lingual transfer learning techniques applied to machine translation, text classification, and named entity recognition in low-resource languages.

Junfu Zhu

Natural Language ProcessingData Curation & Synthetic Data

Jan 15, 2026

BYOL: Bring Your Own Language Into LLMs

The paper introduces Bring Your Own Language (BYOL), a framework for developing language-aware LLMs tailored to languages' digital resource availability. BYOL classifies languages into resource tiers and applies different integration pathways: a data refinement and expansion pipeline for low-resource languages (demonstrated on Chichewa and Maori), and a translation-mediated approach for extreme-low-resource languages (demonstrated on Inuktitut). Experiments show that BYOL improves performance on low-resource languages by approximately 12% compared to multilingual baselines, while maintaining English and multilingual capabilities, and enables high-accuracy LLM access for extreme-low-resource languages via improved translation.

Introduces a tiered framework, BYOL, for language-aware LLM development that tailors integration pathways based on a language's digital resource availability.

Syed Waqas Zamir, W. Hamidouche, B. Amor +32601.10804

Data Curation & Synthetic DataNatural Language Processing

Jan 7, 2026

Muse: Towards Reproducible Long-Form Song Generation with Fine-Grained Style Control

The authors introduce Muse, an open-source system for long-form song generation with fine-grained style conditioning, addressing the lack of reproducibility in academic research due to unavailable training data. They release a dataset of 116k fully licensed synthetic songs with lyrics and style descriptions paired with SunoV5-synthesized audio. Muse, a Qwen-based language model finetuned with discrete audio tokens, achieves competitive performance in phoneme error rate, text-music style similarity, and audio aesthetic quality, demonstrating controllable segment-level generation.

Releases Muse, a fully open-source system for long-form song generation, along with a licensed synthetic dataset and training/evaluation pipelines, to enable reproducible research.

Changhao Jiang, Jiahao Chen, Zhenghao Xiang +142601.03973

Speech & AudioOpen-Source Models & WeightsData Curation & Synthetic Data

Jan 6, 2026

Ant GroupJan 6, 2026

MedDialogRubrics: A Comprehensive Benchmark and Evaluation Framework for Multi-turn Medical Consultations in Large Language Models

The authors introduce MedDialogRubrics, a new benchmark for evaluating multi-turn diagnostic capabilities of LLMs in medical consultations, consisting of 5,200 synthetic patient cases and 60,000 fine-grained evaluation rubrics. They use a multi-agent system with a Patient Agent augmented with a dynamic guidance mechanism to generate realistic patient records while mitigating privacy concerns. Evaluation of state-of-the-art models on MedDialogRubrics reveals significant challenges, suggesting that improvements in medical dialogue require advances in dialogue management architectures beyond simple model tuning.

Introduces MedDialogRubrics, a comprehensive benchmark and evaluation framework with synthetically generated patient cases and expert-refined rubrics, to rigorously assess multi-turn diagnostic capabilities of LLMs in medical consultations.

Lecheng Gong, Weimin Fang, Ting Yang +92601.03023

Eval Frameworks & BenchmarksNatural Language ProcessingData Curation & Synthetic Data

Jan 6, 2026

AfriEconQA: A Benchmark Dataset for African Economic Analysis based on World Bank Reports

The paper introduces AfriEconQA, a new benchmark dataset for African economic analysis constructed from 236 World Bank reports, designed to evaluate numerical reasoning and temporal disambiguation capabilities of models. The dataset comprises 8,937 question-answer pairs, filtered from a larger synthetic pool to ensure high-quality evidence-answer alignment and temporal provenance. Benchmarking experiments using GPT-5 Mini, GPT-4o, and Qwen 32B in zero-shot and RAG configurations reveal a significant performance gap, highlighting the dataset's challenge for current LLMs and the need for domain-specific IR and RAG advancements.

Introduces AfriEconQA, a novel benchmark dataset specifically designed to evaluate the performance of information retrieval and question answering systems on African economic analysis using World Bank reports.

Edward Ajayi2601.15297

Eval Frameworks & BenchmarksData Curation & Synthetic DataNatural Language Processing

Dec 31, 2025

A Study on Performance Analysis of Electric Vehicle Fire and Smoke Object Detection Based on Different Labeling Methods Using the YOLOv11 Model

This paper compares the performance of YOLOv11 models for detecting EV fires and smoke using bounding box and instance segmentation annotations. The study trained all YOLOv11 variants on a dataset of 3,000 images and evaluated them based on speed and accuracy metrics like mAP50 and FPS. Results showed that bounding box models offer faster inference speeds, while segmentation models achieve higher accuracy, particularly for detecting irregularly shaped smoke plumes, making them more suitable for reliable early fire detection.

Demonstrates that instance segmentation labeling with YOLOv11 yields superior accuracy in detecting EV fires and smoke, especially for irregular smoke boundaries, compared to bounding box labeling, despite a computational cost.

Jiwon Choi, Eunji Yoon, Haiyoung Jung

Computer VisionTraining Efficiency & OptimizationData Curation & Synthetic Data

Dec 29, 2025

Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing

The paper introduces InfTool, a multi-agent framework comprising a User Simulator, Tool-Calling Assistant, and MCP Server, designed to autonomously generate tool-use trajectories from raw API specifications. InfTool closes the loop by training a model using Group Relative Policy Optimization (GRPO) with gated rewards on the synthesized data, iteratively improving the model's ability to generate higher-quality training data. Experiments on the Berkeley Function-Calling Leaderboard (BFCL) show that InfTool significantly improves a 32B model's accuracy from 19.8% to 70.9%, surpassing larger models and rivaling Claude-Opus, using only synthetic data.

Introduces a fully autonomous, self-evolving multi-agent framework, InfTool, for synthesizing diverse and verified tool-use trajectories, eliminating the need for human annotation and enabling significant performance gains in tool-calling accuracy.

Yuwen Li, Wei Zhang, Ze-Jun Huang +82512.23611

Tool Use & AgentsData Curation & Synthetic DataRLHF & Preference Learning

Dec 26, 2025

NewsScope: Schema-Grounded Cross-Domain News Claim Extraction with Open Models

The paper introduces NewsScope, a new dataset and benchmark for schema-grounded news claim extraction across diverse domains, comprising 455 articles with in-domain and out-of-source splits. They fine-tuned LLaMA 3.1 8B using LoRA on a subset of the dataset, achieving 89.4% human-evaluated accuracy, comparable to GPT-4o-mini, and outperforming it on political claims. A numeric grounding filter further enhances accuracy, and the open-weight model allows for cost-effective deployment.

Introduces NewsScope, a novel dataset and benchmark for schema-grounded news claim extraction, and demonstrates effective fine-tuning of an open-weight model for this task.

Nidhi Pandya2601.08852

Eval Frameworks & BenchmarksData Curation & Synthetic DataNatural Language Processing

Dec 19, 2025

RadarGen: Automotive Radar Point Cloud Generation from Cameras

RadarGen, a diffusion model, generates realistic automotive radar point clouds from multi-view camera imagery by adapting image-latent diffusion to the radar domain. It represents radar measurements in bird's-eye-view form, encoding spatial structure, RCS, and Doppler attributes, and uses a lightweight recovery step to reconstruct point clouds. By conditioning on BEV-aligned depth, semantic, and motion cues extracted from pretrained foundation models, RadarGen aligns generation with the visual scene, leading to physically plausible radar patterns and improved performance on perception tasks.

Introduces RadarGen, a diffusion-based generative model that synthesizes realistic automotive radar point clouds conditioned on multi-view camera imagery and BEV-aligned cues.

Tomer Borreda, Fangqiang Ding, Sanja Fidler +22512.17897

Multimodal ModelsComputer VisionData Curation & Synthetic Data

Dec 17, 2025

Clarkson UniversityDec 17, 2025

An Open-Source Framework for Quality-Assured Smartphone-Based Visible Light Iris Recognition

The paper introduces CUVIRIS, a new dataset of ISO/IEC 29794-6 compliant visible light iris images captured via a custom Android application with real-time quality assessment, and benchmarks two iris recognition systems on this dataset. They also present LightIrisNet, a MobileNetV3-based segmentation model for on-device deployment, and adapt IrisFormer, a transformer-based matcher, to the visible light domain. Experiments demonstrate that the open-source OSIRIS system achieves a TAR of 97.9% at FAR = 0.01 on CUVIRIS, and IrisFormer, trained only on UBIRIS.v2, achieves an EER of 0.057%, indicating the feasibility of accurate smartphone-based iris recognition under controlled conditions.

Provides an open-source framework including a quality-assured VIS iris image dataset, a lightweight segmentation model, and a VIS-adapted transformer-based matcher to advance smartphone-based iris recognition.

Naveenkumar G. Venkataswamy, Yu Liu, Soumyabrata Dey +22512.15548

Computer VisionData Curation & Synthetic DataOpen-Source Models & Weights

Dec 11, 2025

AI-Assisted Exploration, Curation, and Extension of Biodiversity Data Using iChatBio

The authors introduce iChatBio, an agentic system designed to enhance biodiversity data interaction by incorporating expert knowledge and ensuring traceability. iChatBio employs a multi-agent architecture where a chat agent decomposes user requests, and specialized expert agents retrieve and process information from sources like iDigBio, GBIF, and iNaturalist. This system enables exploration, curation, and extension of biodiversity data through a natural language interface, leveraging APIs without requiring user expertise.

Introduces iChatBio, a distributed, multi-agent system that facilitates AI-assisted exploration, curation, and extension of biodiversity data by integrating domain expertise and providing traceable data provenance.

Michael Elliott, Manuel Luciano, José A. B. Fortes

Tool Use & AgentsData Curation & Synthetic DataScientific Discovery & Drug Design

Dec 8, 2025

PCMind-2.1-Kaiyuan-2B Technical Report

The paper introduces PCMind-2.1-Kaiyuan-2B, a 2B parameter open-source LLM, designed to improve training efficiency under resource constraints. They employ a Quantile Data Benchmarking method for data mixing, Strategic Selective Repetition for high-quality data leverage, and a Multi-Domain Curriculum Training policy for sample ordering. Kaiyuan-2B achieves competitive performance with state-of-the-art open-source models while using optimized data preprocessing and architectural modifications for FP16 stability.

Introduces a novel training methodology for resource-constrained LLMs, combining quantile data benchmarking, strategic selective repetition, and multi-domain curriculum training.

Kairong Luo, Zhenbo Sun, Xinyu Shi +92512.07612

Training Efficiency & OptimizationOpen-Source Models & WeightsData Curation & Synthetic Data

Dec 4, 2025

Impact of Synthetic Data on Deep Learning Models for Earth Observation: Photovoltaic Panel Detection Case Study

This paper investigates the impact of augmenting real Earth Observation (EO) data with synthetic data, generated both physically (Unity) and generatively (DALL·E 3, Stable Diffusion XL), for training a YOLOv8 model to detect photovoltaic panels. The study finds that combining real and synthetic data generally improves object detection performance, particularly when the total dataset size meets the model's minimum requirements. The best performance gains were observed when combining real data with both physically-based and generative synthetic data, resulting in improvements across precision, recall, and mAP metrics.

Demonstrates that combining real EO data with both physically-based (Unity) and generative (DALL·E 3, Stable Diffusion XL) synthetic data improves the performance of a YOLOv8 model for photovoltaic panel detection, while also highlighting the importance of careful data management to avoid overfitting.

E. Hisam, J. Gimeno, David Miraut +6

Data Curation & Synthetic DataComputer Vision

Dec 2, 2025

ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data

The paper introduces ClimaDrive, a semantics-guided image-to-image translation framework, to generate physically realistic and weather-diverse synthetic data for training anomaly segmentation models. ClimaDrive combines structure-guided multi-weather generation with prompt-driven anomaly inpainting to create visually realistic training data. Experiments on the newly created ClimaOoD benchmark demonstrate that training with this synthetic data significantly improves the performance of state-of-the-art anomaly segmentation methods, as evidenced by improvements in AUROC, AP, and FPR95 metrics.

Introduces ClimaDrive, a novel framework for synthesizing physically plausible and semantically coherent out-of-distribution (OoD) driving data by unifying structure-guided multi-weather generation with prompt-driven anomaly inpainting.

Yuxing Liu, Yong Liu2512.02686

Data Curation & Synthetic DataComputer VisionRobotics & Embodied AI

Dec 1, 2025

Faculty of Electrical Engineering and Informatics of the Technical University of KošiceDec 1, 2025

Training of Large Language Model Mistral on Slovak Language Data

This paper fine-tunes the open-weight Mistral 7B LLM on the Araneum Slovacum VII Maximum corpus (5.3B tokens) to create Mistral-SK-7b, a specialized Slovak language model. The motivation is to address the lack of high-quality, open-source LLMs for low-resource languages like Slovak, where commercial models are proprietary. The resulting Mistral-SK-7b exhibits significantly improved grammatical correctness and contextual coherence in Slovak, eliminating issues like code-switching and repetition loops present in the original Mistral 7B.

Demonstrates the effective adaptation of a state-of-the-art LLM for a low-resource language through fine-tuning on a large, relevant corpus, resulting in a publicly available model with improved performance.

Peter Bednár, Marek Dobeš, R. Garabík

Open-Source Models & WeightsNatural Language ProcessingData Curation & Synthetic Data

Lattice is designed for desktop

Data Curation & Synthetic Data

Keywords

Top Labs in This Topic

Recent Papers