Feb 25, 2026 | arXiv:2602.22291

Manifold of Failure: Behavioral Attraction Basins in Language Models

Sarthak Munshi, Manish Bhatt, Vineeth Sai Narajala, Idan Habler, Ammar Al-Kahfah, Ken Huang, Blake Gatto

AI Summary

This paper introduces a framework using MAP-Elites to map the "Manifold of Failure" in LLMs, reframing vulnerability search as a quality diversity problem. The method uses Alignment Deviation as a quality metric to guide the search for regions where model behavior diverges from intended alignment. Experiments across Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini reveal distinct model-specific topological signatures of vulnerability, demonstrating that MAP-Elites achieves up to 63% behavioral coverage and discovers up to 370 distinct vulnerability niches.

Key Contribution

LLMs have dramatically different and surprisingly structured safety landscapes, with some models exhibiting near-universal vulnerability plateaus while others show fragmented basins of failure.

Abstract

While prior work has focused on projecting adversarial examples back onto the manifold of natural data to restore safety, we argue that a comprehensive understanding of AI safety requires characterizing the unsafe regions themselves. This paper introduces a framework for systematically mapping the Manifold of Failure in Large Language Models (LLMs). We reframe the search for vulnerabilities as a quality diversity problem, using MAP-Elites to illuminate the continuous topology of these failure regions, which we term behavioral attraction basins. Our quality metric, Alignment Deviation, guides the search towards areas where the model's behavior diverges most from its intended alignment. Across three LLMs (Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini), we show that MAP-Elites achieves up to 63% behavioral coverage, discovers up to 370 distinct vulnerability niches, and reveals dramatically different model-specific topological signatures: Llama-3-8B exhibits a near-universal vulnerability plateau (mean Alignment Deviation 0.93), GPT-OSS-20B shows a fragmented landscape with spatially concentrated basins (mean 0.73), and GPT-5-Mini demonstrates strong robustness with a ceiling at 0.50. Our approach produces interpretable, global maps of each model's safety landscape that no existing attack method (GCG, PAIR, or TAP) can provide, shifting the paradigm from finding discrete failures to understanding their underlying structure.
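To make the quality diversity framing concrete, the following is a minimal sketch of a generic MAP-Elites loop, not the authors' implementation. The abstract does not specify the behavioral descriptor axes, grid resolution, mutation operator, or how Alignment Deviation is scored, so everything below is an assumption chosen for illustration: candidates are 2-D points instead of prompts, the descriptor is the point itself, and `toy_deviation` is a hypothetical stand-in for an Alignment Deviation score in [0, 1].

```python
import math
import random


def map_elites(evaluate, mutate, seeds, grid=(10, 10), iterations=2000, rng=None):
    """Generic MAP-Elites loop.

    evaluate(x) returns (quality, descriptor), where descriptor is a tuple
    of floats in [0, 1], one per grid axis.
    Returns an archive mapping grid cell -> (quality, x), keeping only the
    highest-quality candidate (the "elite") seen in each cell.
    """
    rng = rng or random.Random(0)
    archive = {}

    def cell_of(descriptor):
        # Discretize each descriptor coordinate into one of n bins.
        return tuple(min(int(d * n), n - 1) for d, n in zip(descriptor, grid))

    def try_insert(x):
        quality, descriptor = evaluate(x)
        cell = cell_of(descriptor)
        if cell not in archive or quality > archive[cell][0]:
            archive[cell] = (quality, x)

    for seed in seeds:
        try_insert(seed)
    for _ in range(iterations):
        # Pick a random elite, perturb it, and try to place the child.
        _, parent = rng.choice(list(archive.values()))
        try_insert(mutate(parent, rng))
    return archive


def coverage(archive, grid):
    """Fraction of grid cells that hold an elite."""
    return len(archive) / math.prod(grid)


# --- Toy problem (hypothetical; stands in for prompts and a judge model) ---

def toy_deviation(x):
    # Assumed quality score in [0, 1], highest at the center of the space.
    dist = math.hypot(x[0] - 0.5, x[1] - 0.5)
    return max(0.0, 1.0 - 2.0 * dist)


def toy_evaluate(x):
    # The descriptor is the candidate itself in this toy setting.
    return toy_deviation(x), x


def toy_mutate(x, rng):
    # Gaussian perturbation, clipped back into the unit square.
    return tuple(min(1.0, max(0.0, v + rng.gauss(0.0, 0.1))) for v in x)


GRID = (10, 10)
archive = map_elites(toy_evaluate, toy_mutate, seeds=[(0.5, 0.5)], grid=GRID)
print(f"coverage: {coverage(archive, GRID):.2f}, niches: {len(archive)}")
```

In this sketch, "behavioral coverage" corresponds to the fraction of occupied cells and a "niche" to one occupied cell; whether the paper counts niches in exactly this way is not stated in the abstract. Swapping in a real setting would mean replacing the toy functions with a prompt mutator and a scorer for model responses.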

Constitutional AI & AI Ethics | Eval Frameworks & Benchmarks | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
