This paper introduces Agentic Data Tailoring, a novel approach to refining unstructured multimodal data streams to enhance AI training and human knowledge acquisition. By employing a two-stage pipeline that utilizes deterministic Factual Anchors for generative semantic synthesis, the authors create a large-scale dataset that spans five domains, which is then used to train the DataClaw_0-9B model. Evaluations demonstrate that this model achieves high-information-density data refinement, significantly improving task adaptation in scenarios with limited training data.
DataClaw_0 can transform chaotic multimodal data into structured, high-quality datasets, enhancing AI's ability to learn from less information.
Massive unstructured multimodal streams suffer from high "data entropy," impeding both efficient human knowledge acquisition and high-quality AI post-training. Existing passive annotation paradigms, heavily reliant on heuristic rules or general VLMs, are costly, monotonous, and fail to unlock the deep procedural logic embedded in raw data. We elevate data processing to a learnable capability, proposing a paradigm shift towards Agentic Data Tailoring, which actively refines and structures data to align with diverse user and downstream intents. To overcome the data scarcity bottleneck in training such high-order capabilities, we design a two-stage pipeline that grounds generative semantic synthesis in deterministic Factual Anchors, yielding a large-scale dataset spanning five core physical and digital domains. Building upon this, the DataClaw_0-9B model synergizes Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), achieving robust alignment with complex refinement and tailoring intents. To systematically quantify this capability, we construct DataClaw_0-val, the first benchmark dedicated to data refinement. Crucially, we adopt downstream post-training as the ultimate validation touchstone. Evaluations on video generation, real-world VQA, and GUI navigation confirm that DataClaw_0 delivers high-information-density tailored data, facilitating efficient model adaptation to new tasks under limited-training-data regimes. Project page: https://czjdsg.github.io/MakeAnyData
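For readers unfamiliar with GRPO, the group-relative step it is named for can be sketched as follows. This is a minimal illustration of the general technique, not the authors' training code: the function name `group_relative_advantages`, the example reward values, the `eps` constant, and the choice of population standard deviation are all assumptions made here for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt.

    Each response is scored relative to the group mean and scaled by the group
    standard deviation, so no separate learned value model is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four candidate refinements of one raw sample.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
```

Responses scoring above the group mean receive positive advantages and are reinforced; those below receive negative ones. How DataClaw_0 defines its rewards is not stated in the abstract.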