The paper introduces Knowledge-Aware Active Learning (KA2L), a framework that uses latent space analysis to assess an LLM's knowledge mastery and to generate questions the model cannot yet answer, which are then used for targeted fine-tuning. KA2L probes the hidden states of Transformer layers to identify the distribution of known and unknown knowledge, then decodes the latent knowledge space into natural language questions focused on areas where the model is deficient. Experiments across nine open-source LLMs show that KA2L reduces annotation and computation costs by 50% while achieving better performance on open-domain and vertical-domain datasets.
LLMs can be actively trained to master specific knowledge domains with 50% less data and computation by focusing on what they *don't* know, not what they already do.
Fine-tuning large language models (LLMs) with high-quality knowledge has been shown to enhance their performance effectively. However, there is little research on how deeply LLMs comprehend domain-specific knowledge, or on applying targeted active learning to improve their expertise. To address this gap, we introduce the Knowledge-Aware Active Learning (KA2L) framework. The framework uses latent space analysis to assess an LLM's mastery of specific knowledge points, which in turn aids in constructing questions that the model cannot answer or does not know. This active learning strategy improves training efficiency by focusing on knowledge the model has yet to master, thereby minimizing redundant learning of information it has already acquired. We employ a knowledge distribution probing technique that examines the hidden states of specific Transformer layers to identify the distribution of known and unknown knowledge within the LLM. Additionally, we propose a hidden-state decoding method that generates large numbers of unknown questions in natural language from the latent knowledge space. In our experiments, we selected nine open-source LLMs to validate the effectiveness of the proposed framework. Results indicate that KA2L not only reduces annotation and computation costs by 50% across two open-domain datasets and one vertical-domain dataset but also achieves better performance, offering valuable insights into active learning strategies for LLMs. The code is available at https://anonymous.4open.science/r/KA2L-F15C.
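The probing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the probe architecture, which layers are read, or how hidden states are decoded into questions. The sketch assumes a simple logistic-regression probe and uses synthetic vectors in place of real Transformer hidden states. The names `train_probe` and `select_unknown` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_probe(hidden, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression probe mapping a hidden state to P(known).

    hidden: (n, d) array of per-question hidden-state vectors.
    labels: (n,) array, 1.0 if the model answered correctly, else 0.0.
    """
    n, d = hidden.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        probs = sigmoid(hidden @ w + b)
        grad = probs - labels              # gradient of the log loss w.r.t. logits
        w -= lr * (hidden.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def select_unknown(hidden, w, b, threshold=0.5):
    """Return indices of questions the probe predicts the model does not know."""
    return np.flatnonzero(sigmoid(hidden @ w + b) < threshold)


# Synthetic stand-ins for hidden states (assumption: "known" and "unknown"
# questions occupy separable regions of the latent space).
rng = np.random.default_rng(0)
dim = 16
known = rng.normal(loc=1.0, scale=1.0, size=(200, dim))
unknown = rng.normal(loc=-1.0, scale=1.0, size=(200, dim))
hidden = np.vstack([known, unknown])
labels = np.concatenate([np.ones(200), np.zeros(200)])

w, b = train_probe(hidden, labels)
picked = select_unknown(hidden, w, b)      # candidates to annotate and fine-tune on
print(f"{len(picked)} of {len(hidden)} questions flagged as unknown")
```

In an active learning loop of this kind, only the flagged questions would be sent for annotation and fine-tuning, which is where the savings in labeling and compute would come from.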