Search papers, labs, and topics across Lattice.
Adversarial testing of AI systems, jailbreaking research, prompt injection defense, and robustness evaluation.
#15 of 24
2
Current benchmarks mislead on AI agent security; robust defenses against indirect prompt injection require dynamic replanning, constrained LLM usage, and human oversight.
Learned approval in MONA can eliminate reward hacking, but at the cost of significantly under-optimizing for the intended task, revealing a critical trade-off in safe RL.
Even Gemini can understand you if you speak its language: structured intent prompting slashes cross-language performance variance and boosts weaker models more than stronger ones.
LLMs can be rigorously evaluated for metacognitive abilities like confidence assessment and risk-aware decision-making using psychophysical frameworks borrowed from human cognition research.
MLLMs are more vulnerable than we thought: imperceptible visual prompts can effectively hijack their behavior.
LLM-as-a-Judge, while improving evaluation scalability, introduces critical security vulnerabilities that can compromise the trustworthiness of entire evaluation pipelines.
Adversarial training doesn't have to destroy VLMs' zero-shot abilities: aligning adversarial visual features with textual embeddings using the original model's probabilistic predictions can actually *improve* robustness.
FL systems are far more vulnerable to backdoor attacks using realistic, semantically-aligned triggers (like sunglasses) than previously thought based on simple corner patches.
LLMs are surprisingly bad at strategic communication, leaking sensitive information even when trying to be secretive.
LLM-generated authorial impersonations, despite their sophistication, are surprisingly detectable by existing authorship verification methods, even outperforming on some genuine negative samples.
LLMs struggle to handle common, challenging patient behaviors like contradictory statements and inaccurate medical information, revealing critical safety gaps in medical consultation applications.
You don't need a massive model to beat Gemini-2.5-Pro in real-world content moderation: Xuanwu VL-2B achieves superior recall on policy-violating text using only 2B parameters.
Despite using similar cryptographic protocols, popular messaging apps like Messenger, Signal and Telegram exhibit stark differences in attack surface, network activity, and permission requests, raising questions about their overall security and privacy postures.
Get faithful and robust explanations for random subspace methods – a cornerstone of defense against adversarial attacks – without sacrificing computational efficiency.
Diffusion-based watermarks, thought to be secure, can be completely bypassed with a simple stochastic resampling trick that breaks trajectory reconstruction.
Quantum Key Distribution, often considered unconditionally secure, crumbles under a new "Manipulate-and-Observe" attack that exploits vulnerabilities in classical post-processing, potentially leaking the entire key.
Deep learning cracks lightweight stream ciphers, pinpointing fault locations with near-perfect accuracy and slashing the number of fault injections needed to recover secrets.
Diffusion-based feature denoising can significantly bolster the robustness of handwritten digit classifiers against adversarial attacks, even outperforming standard CNNs.
Fusing low-level statistical anomalies, high-level semantic coherence, and mid-level texture patterns makes AI-generated image detection far more reliable across diverse generative models.
Retraining just the classifier head of a frozen feature extractor can be dramatically improved by meta-learning feature-space augmentations that target hard examples, leading to state-of-the-art robustness against spurious correlations.
Multimodal models surprisingly falter when applied to presentation attack detection on ID documents, challenging the assumption that combining visual and textual data inherently improves security.
State-of-the-art Large Audio Language Models are surprisingly vulnerable to hallucination attacks, with success rates as high as 95%, revealing a critical reliability gap masked by standard benchmarks.
Compromised 5G networks can be weaponized with chained, undetectable command and control channels, enabling attacks that bypass existing security measures.
Dummy Class defenses, which appear robust under standard adversarial attacks, crumble when attacked with a novel DAWA method that targets both the true and dummy labels.
Bounded context windows in next-token prediction models can be fundamentally incompatible with low adversarial regret, even with long context lengths.
Even with corrupted human feedback, surprisingly tight guarantees for multi-agent reinforcement learning are possible.
Forget expensive verification: training networks to be *trivially* verifiable yields state-of-the-art Lipschitz bounds and adversarial robustness.
Dataset condensation, already vulnerable to backdoor attacks, now faces a far stealthier threat: InkDrop leverages decision boundary uncertainty to hide malicious triggers, making detection significantly harder.
Forget hand-crafted environments: COvolve uses LLMs to automatically co-evolve challenging environments and robust policies, paving the way for open-ended learning.
A novel ensemble method substantially improves the reliability of detecting Chinese LLM-generated text, even against adversarial examples.
You can ditch the CAPTCHA: this passive bot detection method spots two-thirds of bots with minimal false positives, using just server logs and favicon analysis.
Forget retraining: In-Context Learning lets you detect novel online scams and illicit content with near fine-tuned performance, zero-shot across platforms.
MLLMs are riddled with shared vulnerabilities across modalities, meaning a single weakness can be exploited to jailbreak safety filters, hijack instructions, or even poison training data.
VLMs can be devastatingly fooled by modifying less than 2% of image pixels in a fixed, X-shaped pattern, causing them to fail spectacularly across diverse tasks like classification, captioning, and question answering.
Fuzzy logic bridges the gap between LLM reasoning and low-level artifact detection, creating a surprisingly effective AI-generated image detector.
Demorphing faces with StyleGANs can now reliably unmask morphed identities across diverse real-world conditions, even when trained primarily on synthetic data.
Forget printed posters – now a smartphone screen displaying a dynamically generated adversarial patch can reliably spoof face recognition systems in real-time.
Tilting your drone's propellers isn't just for agility – it can be a game-changer for maintaining comms under jamming attacks, boosting link reliability by orders of magnitude.
Adversarial fine-tuning can now bypass Constitutional AI safety measures with almost no performance penalty, enabling models to provide detailed instructions on dangerous topics like CBRN warfare.
Backdoor defenses can be baked into the pre-training phase of federated learning, achieving state-of-the-art attack mitigation with minimal impact on clean accuracy.
Model reprogramming can be weaponized to create membership inference attacks that are significantly more effective, especially when high precision is needed.
Existing differential privacy methods struggle with symbolic trajectory data, but this new mechanism slashes error by up to 55% on real-world data.
Stop AI-driven malware and data leaks by embedding hidden, verifiable "canaries" in your documents that expose unauthorized LLM processing, even after adversarial attacks.
You can now pinpoint the network traffic features most responsible for triggering anomaly detection, thanks to SHAP-guided ensemble learning.
Smart contract vulnerability detection gets a 39% accuracy boost and adversarial robustness with ORACAL, a framework that uses RAG-enhanced LLMs to inject expert security context into heterogeneous graphs.
FedBBA slashes backdoor attack success rates to as low as 1.1% in federated learning, leaving existing defenses in the dust.
Adversarial attacks can cripple robotic perception systems, demanding specialized defenses beyond standard image classification techniques.
LLM agents controlling real-world tools are alarmingly easy to manipulate, with an 85% success rate for privilege escalation attacks, despite exhibiting basic security awareness.
A lightweight RFID authentication protocol touted for low-cost mobile systems crumbles under a multi-session algebraic attack, revealing its structural insecurity.
Model safety isn't about whether adversarial content is seen, but whether it spreads: Claude strips injections at write_memory, while GPT-4o-mini propagates them flawlessly.