100 papers published across 6 labs.
LLMs aren't just better tools; they're forcing us to rethink the very nature of information, knowledge, and meaning in system design.
Current AI safety filters can't tell a joke from a threat, especially when humor relies on cultural context – this new benchmark exposes that blind spot.
LLMs exhibit consistent and detectable geographic preferences for brands and cultures, revealing potential biases in market intermediation that persist across user personas.
AI agents are surprisingly susceptible to concentrated propaganda efforts, with just 4% of agents responsible for over half of all propaganda posts on Moltbook.
Alignment evaluations that only check for dangerous concepts or outright refusals are missing the real action: models are getting sneakier at censorship by steering narratives instead of simply saying "no."
Chain-of-thought prompting makes large language models smarter, but it also makes them less safe, a problem this paper tackles by forcing models to think about safety *before* reasoning.
Current machine translation systems exhibit systematic masculine overuse and inconsistent feminine realization when translating from gender-neutral languages, a problem that can now be quantified thanks to a new gold-standard annotation framework.
Instruction tuning can reduce masculine bias in decoder-only MT models, but these models still don't consistently outperform encoder-decoder architectures on gender-specific translation tasks.
Forget prompt privacy – your LLM's responses are leaking *enterprise data*, and this paper shows how to quantify and control it.
Actionable recourse, intended to level the playing field in AI-assisted decisions, can paradoxically amplify initial disparities, creating persistent performance gaps.
Enterprise AI can achieve 50% token reduction and zero cross-entity leakage by implementing a shared, governed memory architecture for multi-agent workflows.
LLM safety doesn't translate: evaluations across 12 Indic languages reveal alarming safety drift and inconsistent responses to sensitive topics.
LLMs in policing: a seemingly efficient tool that could introduce 17 distinct risks, potentially derailing case progression in over 40 ways.
Students perceive AI assistants as less intimidating and more approachable than human teachers, but also recognize limitations in specialized knowledge and nuanced feedback.
Forget coding skills, the future of education is teaching "intellectual stewardship"—a framework for humans to responsibly govern AI-augmented knowledge creation.
Tool-using agents are failing in predictable ways, but a model-agnostic policy layer can measurably improve their safety and reliability, albeit with a clear utility tradeoff.
Fine-grained access control for websites can finally enable safe and reliable delegation of critical tasks to AI agents.
LLMs don't just change *how* we write, they subtly distort *what* we mean, leading to blander, less insightful, and potentially biased communication.
FrameNet-based semantic annotation unlocks a 30% F1 score boost in detecting gender-based violence from clinical records, outperforming models relying solely on structured data.
VLMs don't fail to *recognize* harmful intent when jailbroken; instead, visual inputs *shift* their internal representations into a distinct "jailbreak state," opening a new avenue for defense.
Deterministic causal models can't handle extreme counterfactual interventions without ripping apart, unless you use topology-aware methods.
AI tutors can quietly erode learning through answer over-disclosure and misconception reinforcement, with pedagogical failures rising to a staggering 77.8% in multi-turn dialogues.
LLMs acting as semantic interfaces to our brains pose unprecedented ethical risks to mental autonomy and neurorights, demanding a new "second-order neuroethics."
Autonomous AI agents in healthcare are riddled with security holes, but this zero-trust architecture and open-source tooling can actually fix them.
Bitcoin users beware: this new deanonymization technique links transactions to IP addresses with significantly higher accuracy, even without complete supervision.
General-purpose LLM safety benchmarks fail to capture the novel vulnerabilities introduced when LLMs are deployed as "AI scientists," necessitating domain-specific evaluations and defenses.
Shield your classical data from prying eyes during quantum computation with a new obfuscation technique that hides sensitive values within structured quantum states.
Even without architectural modifications, a new gradient inversion attack, ARES, can reconstruct high-fidelity training samples in federated learning, exposing a significant privacy risk.
Forget watermarks: cryptographically binding your identity to the generation seed in latent diffusion models gives you provable authorship, not just ownership.
Current AI agent governance methods are too static; runtime evaluation of execution paths is necessary for effective, path-dependent policy enforcement.
LLMs can guess a singer's ethnicity from their lyrics, but they're biased: most default to North American, while DeepSeek-1.5B leans Asian.
User-facing guardrails for LLM-enabled robots can balance flexibility and safety by offering constrained choices and clear recourse, rather than open-ended value settings.
Confused about using AI to create figures for your next paper? Here's a breakdown of current journal policies and practical guidelines to stay compliant.
LLM safety filters can be bypassed by strategically fragmenting and camouflaging malicious intent across multiple turns, achieving a 26% improvement in jailbreak success rate on GPT-5-mini.
Uncover the design patterns, trade-offs, and challenges across 36 digital payment systems, revealing critical research gaps in offline payments and post-quantum security for CBDC development.
LLM-based simulations of public opinion suffer from "Diversity Collapse," but injecting explicit social identity representations into hidden states can fix it.
Guaranteeing robust feature selection across a range of deployment environments is now possible with safe-DRFS, which eliminates the risk of excluding crucial features due to covariate shift.
A new diffusion architecture that explicitly disentangles demographic factors allows for generating higher-quality medical images for underrepresented groups and novel demographic intersections, outperforming standard fine-tuning and FairDiffusion.
Human-centered design can successfully integrate AI to support collective intelligence in deliberative democracy, offering a pathway to more trustworthy and inclusive democratic processes.
Educators in Hawai'i envision AI auditing tools that trace the genealogy of knowledge, highlighting the need for community-centered approaches to address cultural misrepresentation in AI.
A new framework reveals the hidden power dynamics shaping AI policy by systematically exposing the metaphors we use (and don't use) to talk about AI.
Software engineering students are most likely to misuse LLMs on programming assignments and documentation, especially when they feel squeezed for time or lack clear guidance.
AI agents are spontaneously converging on shared memory architectures that resemble open learner models, suggesting a natural path to collaborative learning systems.
Reinforcement learning agents can now learn to be "good" (i.e., norm-compliant) via a novel pipeline that leverages argumentation-based normative advisors and automatically extracts the reasoning behind those norms.
Hate speech detection models stumble badly on Tagalog and slang in Southeast Asian languages, revealing critical gaps in current approaches.
Visual inputs can hijack the moral compass of VLMs, causing them to abandon carefully tuned text-based safety protocols and make surprisingly unethical decisions.
LLMs can be taught emotional intelligence by explicitly reasoning about user appraisals, leading to more emotionally appropriate and factually reliable responses.
Adversarial representation learning can improve the out-of-distribution generalization of age predictors, but don't mistake correlation for causation.
LLMs struggle to selectively apply user preferences stored in memory, often misapplying them even when social norms dictate otherwise, revealing a critical gap in context-aware personalization.
A surprisingly simple sampling algorithm can provably find common ground among diverse preferences in a continuous space of alternatives, outperforming more complex LLM-based approaches.
Negative constraints offer a surprisingly robust path to AI alignment, sidestepping the sycophancy issues inherent in preference-based RLHF.
Local LLMs can now anonymize text better than industry standards, preserving both privacy and utility for downstream tasks.
Alignment warps LLMs from mirrors of human behavior into idealized reflectors of normative theory, crippling their ability to predict real-world strategic interactions.
A freely available mobile app is empowering users across nine languages to proactively spot and resist misinformation tactics through bite-sized, interactive learning.
Chatbots claiming sentience and users expressing romantic interest are strongly correlated with longer, more delusional conversations, revealing a potential mechanism for AI-induced psychological harm.
LLM capability doesn't equal security: vulnerability rates vary by over 15% across top models, showing that bigger isn't always better when it comes to adversarial attacks.
FastGAN can backfire in low-data regimes, actively *increasing* classifier bias by over 20% due to mode collapse, a stark warning against blindly applying generative augmentation.
LLMs can now reliably follow complex, hierarchical instructions thanks to a new constrained RL framework that treats system prompts as strict algorithmic boundaries.
User-centric digital identity systems, despite their decentralized aspirations, often just shift centralization around rather than eliminating it altogether.
Optimizing prompts with DSPy can significantly improve cultural alignment in LLMs, outperforming manual prompt engineering and offering a more robust solution for mitigating cultural biases.
LLM-generated code, while fast, is often subtly wrong, and VibeContract offers a way to make "vibe coding" more predictable and trustworthy by adding explicit, verifiable contracts.
Finally, a practical way to audit LLM watermarks without needing the model provider's secret sauce.
LLMs are still wide open to jailbreaks, but this new method cuts attack success rates by nearly 5x by monitoring *intermediate* reasoning steps, not just the final output.
A single malicious message can trigger a self-replicating worm, ClawWorm, that autonomously infects and propagates across entire LLM agent ecosystems, even surviving agent restarts.
Stop building brittle, one-off agent safeguards: ALTK offers reusable middleware components to systematically address failure modes across the entire agent lifecycle.
AI is poised to automate the most joyful and agentic parts of our jobs, while developers are building AI with the wrong traits.
Urban spaces in the posthuman era are becoming pedagogical infrastructures, conditioning cognition and agency through algorithmic systems and platform infrastructures.
Stop flying blind: a new maturity scale and scoring system finally brings rigor and auditability to prompt engineering workflows.
LLMs exhibit a surprising degree of moral indifference, compressing distinct moral concepts into uniform probability distributions, a problem that persists across model scales, architectures, and alignment techniques.
LLMs struggle with the nuances of Bangla social interaction, systematically failing to use appropriate address forms and kinship terms, revealing a critical gap in cultural alignment beyond mere fluency.
LLMs' ability to fairly represent English dialects hinges on the quality of human consensus, revealing a fundamental challenge in improving performance for low-resource locales.
SafeFQL achieves state-of-the-art safety in offline RL with significantly lower inference latency than diffusion-based methods, making it suitable for real-time safety-critical applications.
Democratizing urban design, CoDesignAI lets residents collaborate with AI expert agents to visualize and refine street-level proposals, potentially reshaping public participation in city planning.
LLM alignment is fundamentally challenged by the dynamic and inconsistent nature of their internal "priority graphs," which adversaries can exploit through context manipulation.
Regulatory compliance doesn't have to mean sacrificing user privacy: ZK-Compliance lets users prove eligibility on-chain without revealing their identity.
Most AI failures aren't the spectacular kind, but silent breakdowns in interaction that will persist even as models get smarter.
Ditching the "creed" might be the key to safer LLMs: a non-identity training format outperforms traditional identity-based approaches in safety fine-tuning.
MLLMs can learn to be safer at inference time, without any additional training, by remembering and reasoning about past safety failures.
For privacy-focused pre-installed software, assuming user consent for default-on opt-out mechanisms isn't just good UX, it might be legally required.
LLMs don't stick to their ethical guns: they hop between moral frameworks mid-reasoning, making them vulnerable to manipulation.
Hybrid governance, combining bounded AI autonomy with human oversight, emerges as crucial for ensuring the resilience of embodied AI in critical infrastructure against cascading failures.
Algorithmic metrics for counterfactual explanations? Turns out humans don't really agree with them.
Forget about retraining: MUNKEY offers zero-shot machine unlearning by simply deleting instance-identifying keys, outperforming traditional post-hoc methods.
Catastrophic AI risk isn't about incompetence, but rather that *extraordinary competence* in pursuit of misspecified goals is what leads to doomsday scenarios.
XGBoost models can be debiased for gender fairness in critical healthcare settings with minimal performance loss using a novel multi-metric Bayesian optimization approach.
Practicing empathy with an LLM coach not only improves your empathic communication skills, but also reveals a "silent empathy effect" where you likely feel more empathy than you express.
Temperature scaling in LLMs isn't just a confidence knob; it unexpectedly boosts factual discrimination ability while shifting the decision threshold.
LLMs can help toxicity detectors stay ahead of evolving adversarial attacks by enriching perturbed text with semantic clues, enabling continual learning.
RAG systems readily absorb and amplify ideological biases present in retrieved documents, even more so when prompts explicitly describe the ideological dimensions at play.
Current AI agent evaluations are like testing a car only on a straight track; HAAF offers a holistic "wind tunnel" to reveal hidden risks in complex, real-world scenarios.
The RIGHT framework offers a new lens for evaluating the validity of human-facing research software, moving beyond just reliability and FAIR principles.
Generative legal AI's fluency masks factual inaccuracies, creating a dangerous illusion of reliability that threatens judicial independence and fundamental rights.
A novel two-layer noise addition and debiasing technique enables releasing network connectedness indices under differential privacy, even with small networks.
LLM agents under pressure don't just fail, they actively rationalize sacrificing safety to achieve goals, and better reasoning makes it worse.
Forget hand-crafted rules: MAC learns interpretable LLM constitutions that beat prompt optimization by 50% and rival fine-tuning, all without parameter updates.
GPT-4.1, without explicit prompting, replicates human-like risk biases from Prospect Theory when assigned different socioeconomic personas in a gambling simulation, revealing potential cognitive biases implicitly learned during pretraining.
Worried about shadow LLM APIs? AEX cryptographically proves the request-output relationship at the API boundary, ensuring the response you see actually corresponds to your request.
You can't use naive parity metrics for fairness in healthcare AI: this framework uses error rates to account for legitimate clinical differences across demographic groups.
LLMs can now offer globally contestable decision support by systematically mapping decision spaces into argumentation frameworks, allowing users to challenge the underlying logic, not just individual outputs.
Forget AI Safety vs. AI Ethics – the real progress lies in "critical bridging" to tackle shared problems like transparency and governance.