This paper investigates methods to improve the Linux privilege escalation capabilities of local, open-weight LLMs, addressing the security and privacy concerns associated with cloud-based models. The authors analyze failure modes and implement five interventions (CoT, RAG, structured prompts, history compression, and reflective analysis) within the hackingBuddyGPT framework. Results show that Llama3.1 70B, with these treatments enabled, matches or exceeds GPT-4o, exploiting 83% of tested vulnerabilities, and that reflection-based treatments contribute most.
Local LLMs can now rival cloud-based giants like GPT-4o in Linux privilege escalation tasks, thanks to targeted system-level and prompting interventions.
Recent research has demonstrated the potential of Large Language Models (LLMs) for autonomous penetration testing, particularly when using cloud-based restricted-weight models. However, reliance on such models introduces security, privacy, and sovereignty concerns, motivating the use of locally hosted open-weight alternatives. Prior work shows that small open-weight models perform poorly on automated Linux privilege escalation, limiting their practical applicability. In this paper, we present a systematic empirical study of whether targeted system-level and prompting interventions can bridge this performance gap. We analyze failure modes of open-weight models in autonomous privilege escalation, map them to established enhancement techniques, and evaluate five concrete interventions (chain-of-thought prompting, retrieval-augmented generation, structured prompts, history compression, and reflective analysis) implemented as extensions to hackingBuddyGPT. Our results show that open-weight models can match or outperform cloud-based baselines such as GPT-4o. With our treatments enabled, Llama3.1 70B exploits 83% of tested vulnerabilities, while smaller models including Llama3.1 8B and Qwen2.5 7B achieve 67% when using guidance. A full-factorial ablation study over all treatment combinations reveals that reflection-based treatments contribute most, while also identifying vulnerability discovery as a remaining bottleneck for local models.
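To make the size of the full-factorial ablation concrete, the following minimal sketch enumerates every on/off combination of the five interventions named in the abstract. It assumes each treatment is an independent binary switch, which the abstract does not state explicitly; the identifier names and the `full_factorial` helper are hypothetical and are not taken from hackingBuddyGPT.

```python
from itertools import product

# Treatment names follow the abstract. Modelling each one as an independent
# on/off switch is an assumption made for illustration only.
TREATMENTS = (
    "chain_of_thought",
    "rag",
    "structured_prompts",
    "history_compression",
    "reflective_analysis",
)


def full_factorial(treatments):
    """Return every on/off combination of the treatments as a list of dicts."""
    return [
        dict(zip(treatments, flags))
        for flags in product((False, True), repeat=len(treatments))
    ]


configs = full_factorial(TREATMENTS)

# Five binary factors give 2**5 = 32 configurations, including the
# no-treatment baseline (first) and the all-treatments run (last).
print(len(configs))
```

Under this assumption, each configuration would be evaluated against the same set of vulnerabilities, which lets the contribution of a single treatment be compared across all settings of the other four.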