The paper introduces BD-FDG, a framework for generating high-quality supervised fine-tuning (SFT) datasets for adapting LLMs to complex engineering domains, specifically Space Situational Awareness (SSA). BD-FDG employs structured knowledge organization, cognitively layered question modeling based on Bloom's Taxonomy, and automated quality control to address limitations in existing SFT data construction. Fine-tuning Qwen3-8B with the SSA-SFT dataset generated by BD-FDG yields SSA-LLM-8B, which significantly outperforms baselines on domain-specific tasks while maintaining general performance.
Forget generic fine-tuning data — Bloom's Taxonomy-based data generation can boost LLM performance in complex engineering domains like space situational awareness by up to 176%.
Large language models (LLMs) demonstrate exceptional performance on general-purpose tasks. However, transferring them to complex engineering domains such as space situational awareness (SSA) remains challenging owing to insufficient structural alignment with mission chains, the absence of higher-order cognitive supervision, and poor correspondence between data quality criteria and engineering specifications. The core bottleneck is the construction of high-quality supervised fine-tuning (SFT) datasets. To this end, we propose BD-FDG (Bloom's Taxonomy-based Domain-specific Fine-tuning Data Generation), a framework that addresses incomplete knowledge coverage, shallow cognitive depth, and limited quality controllability through three mechanisms: structured knowledge organization, cognitively layered question modeling, and automated quality control. The framework uses a knowledge tree to ensure structured corpus coverage, designs a question generation scheme spanning nine categories and six cognitive levels from Remember to Create to produce samples with a continuous difficulty gradient, and applies a multidimensional scoring pipeline to enforce domain rigor and consistency. Using BD-FDG, we construct SSA-SFT, a domain dataset of approximately 230K samples, and fine-tune Qwen3-8B to obtain SSA-LLM-8B. Experiments show that SSA-LLM-8B achieves relative BLEU-1 improvements of 144% (no-think) and 176% (think) on the domain test set and a win rate of 82.21% over the baseline in arena comparisons, while largely preserving general benchmark performance (MMLU-Pro, MATH-500). These results validate SFT data construction driven by cognitive layering as an effective paradigm for complex engineering domains and provide a transferable framework for domain-specific LLM adaptation.
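The three mechanisms the abstract names — knowledge-tree coverage, Bloom's-level question generation, and multidimensional score filtering — can be sketched as a pipeline. This is a minimal illustrative sketch, not the paper's implementation: the data structures, the `make_question` and `score_fn` callables, and the 0.8 threshold are all hypothetical stand-ins.

```python
# Hypothetical sketch of the three BD-FDG stages described in the abstract:
# knowledge-tree traversal for coverage, one question per Bloom cognitive
# level per node for a difficulty gradient, and multidimensional score
# filtering for quality control. Names are illustrative, not from the paper.
from dataclasses import dataclass, field

# The six cognitive levels of Bloom's (revised) Taxonomy, low to high.
BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]

@dataclass
class KnowledgeNode:
    """A node in the structured knowledge tree built from the domain corpus."""
    topic: str
    content: str
    children: list = field(default_factory=list)

def generate_questions(node, make_question):
    """Walk the knowledge tree, emitting one question per Bloom level per node,
    so every topic is covered across the full difficulty gradient."""
    samples = [
        {"topic": node.topic, "level": level, "question": make_question(node, level)}
        for level in BLOOM_LEVELS
    ]
    for child in node.children:
        samples.extend(generate_questions(child, make_question))
    return samples

def quality_filter(samples, score_fn, threshold=0.8):
    """Keep samples whose multidimensional scores (e.g. domain rigor,
    consistency) average at or above the threshold."""
    kept = []
    for sample in samples:
        scores = score_fn(sample)  # dict of dimension name -> score in [0, 1]
        if sum(scores.values()) / len(scores) >= threshold:
            kept.append(sample)
    return kept
```

In practice `make_question` and `score_fn` would wrap LLM calls (a generator prompted with the node's content and target level, and a judge returning per-dimension scores); here they are left as plain callables so the control flow of the pipeline stays visible.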