4 papers from Anthropic on Red-Teaming & Adversarial Robustness
Privacy-preserving LLM insight systems like Anthropic's Clio can be tricked into leaking a user's medical history with just a single symptom and basic demographics, even with layered heuristic defenses.
Reasoning models are surprisingly bad at controlling their own thoughts: Claude Sonnet 4.5 succeeds in controlling its chain-of-thought only 2.7% of the time, raising questions about the reliability of CoT monitoring.
LLMs harbor surprisingly consistent hidden beliefs on sensitive topics like mass surveillance and torture, even when direct questioning suggests otherwise.
Coding agents are vulnerable to a new class of stealthy, automated prompt injection attacks via poisoned skills, achieving high success rates even in realistic software engineering tasks.