The paper introduces Primus, a suite of open-source datasets designed to improve cybersecurity LLMs across pretraining, instruction fine-tuning, and reasoning distillation. The authors address the lack of high-quality cybersecurity pretraining data by curating these datasets and demonstrating their effectiveness through ablation studies: continual pre-training with Primus yields a 15.9% improvement on aggregate cybersecurity benchmarks, while reasoning distillation improves security certification (CISSP) scores by 15.8%.
Cybersecurity LLMs get a major open-source boost with the release of Primus, a comprehensive dataset suite that demonstrably improves performance across pretraining, fine-tuning, and reasoning tasks.
Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine. In cybersecurity, however, we have noticed a shortage of open-source datasets, particularly high-quality cybersecurity pretraining corpora, even though much research indicates that LLMs acquire their knowledge during pretraining. To address this, we present a comprehensive suite of datasets covering all major training stages, including pretraining, instruction fine-tuning, and reasoning distillation with cybersecurity-specific self-reflection data. Extensive ablation studies demonstrate their effectiveness on public cybersecurity benchmarks. In particular, continual pre-training on our dataset yields a 15.9% improvement in the aggregate score, while reasoning distillation leads to a 15.8% gain on a security certification benchmark (CISSP). We will release all datasets and trained cybersecurity LLMs under the ODC-BY and MIT licenses to encourage further research in the community. For access to all datasets and model weights, please refer to https://huggingface.co/collections/trendmicro-ailab/primus-67b1fd27052b802b4af9d243.