April 20 – April 27, 2026

Constitutional AI & AI Ethics - Weekly Roundup

100 papers published across 6 labs.

369% acceleration

Selected Labs publishing this week

Tsinghua AI (2), BAIR (1), ETH (1), Microsoft Research (1), Google Research (1)

Top Papers

Apr 27, 2026

E. Bogucka +2 · Apr 27, 2026

Why AI Harms Can't Be Fixed One Identity at a Time: What 5300 Incident Reports Reveal About Intersectionality

AI harms disproportionately impact specific intersections of identity, with adolescent girls, lower-class people of color, and upper-class political elites experiencing up to 3x greater harm, revealing critical blind spots in current AI risk assessments.

E. Bogucka, Sanja Šćepanović, Daniele Quercia

Constitutional AI & AI Ethics Natural Language Processing

Apr 23, 2026

Runheng Liu +3 · Apr 23, 2026

Zero-Shot Detection of LLM-Generated Text via Implicit Reward Model

Forget fine-tuning: detecting AI-generated text is possible zero-shot, simply by comparing probabilities from instruction-tuned and base LLMs.

Runheng Liu, Heyan Huang, Xingchen Xiao +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Apr 27, 2026

Emaan Bilal Khan +3 · Apr 27, 2026

Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains

Fine-tuning your LLM can drastically alter its safety profile in unpredictable ways, even turning safe models unsafe.

Emaan Bilal Khan, Amy Winecoff, Miranda Bogen +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

BAIR · Apr 27, 2026 · also Melbourne, UIUC, University of California, University of Georgia

Green Shielding: A User-Centric Approach Towards Trustworthy AI

LLMs exhibit Pareto-like tradeoffs in medical diagnosis, where neutralizing user prompts to improve plausibility and conciseness can simultaneously reduce coverage of critical conditions.

Aaron Li, Nicola Sanchez, Hao Huang +8

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Sreehari Sankar +10 · Apr 27, 2026

Analyzing LLM Reasoning to Uncover Mental Health Stigma

LLMs harbor surprisingly nuanced and pervasive mental health stigma, revealed only by dissecting their reasoning steps, not just their final answers.

Sreehari Sankar, Aliakbar Nafar, M. Barman +8

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

All Papers (100)

Apr 27, 2026

Emaan Bilal Khan +3 · Apr 27, 2026

Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains

Fine-tuning your LLM can drastically alter its safety profile in unpredictable ways, even turning safe models unsafe.

Emaan Bilal Khan, Amy Winecoff, Miranda Bogen +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

BAIR · Apr 27, 2026 · also Melbourne, UIUC, University of California, University of Georgia

Green Shielding: A User-Centric Approach Towards Trustworthy AI

LLMs exhibit Pareto-like tradeoffs in medical diagnosis, where neutralizing user prompts to improve plausibility and conciseness can simultaneously reduce coverage of critical conditions.

Aaron Li, Nicola Sanchez, Hao Huang +8

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Sreehari Sankar +10 · Apr 27, 2026

Analyzing LLM Reasoning to Uncover Mental Health Stigma

LLMs harbor surprisingly nuanced and pervasive mental health stigma, revealed only by dissecting their reasoning steps, not just their final answers.

Sreehari Sankar, Aliakbar Nafar, M. Barman +8

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Apr 27, 2026 · also Tsinghua AI, The Key Laboratory of Road and Traffic Engineering, UCF, USTC

Towards Lawful Autonomous Driving: Deriving Scenario-Aware Driving Requirements from Traffic Laws and Regulations

LLMs can now generate driving rules from traffic laws with significantly improved accuracy by grounding their reasoning in structured traffic scenarios.

Bowen Jian, Rongjie Yu, Hong Wang +2

Constitutional AI & AI Ethics Natural Language Processing Robotics & Embodied AI

O. Delaney +4 · Apr 27, 2026

Risk Reporting for Developers' Internal AI Model Use

Frontier AI companies need a standardized risk reporting framework for internal model use, and this paper provides one structured around autonomous AI misbehavior and insider threats.

O. Delaney, Sambhav Maheshwari, Joe O'Brien +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Sumanta Bhattacharyya +8 · Apr 27, 2026

Generating Place-Based Compromises Between Two Points of View

LLMs can learn to generate better compromises by iteratively incorporating feedback on how empathically similar a compromise is to each viewpoint, opening the door to more socially intelligent AI.

Sumanta Bhattacharyya, Francine Chen, Scott A. Carter +6

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

E. Bogucka +2 · Apr 27, 2026

Why AI Harms Can't Be Fixed One Identity at a Time: What 5300 Incident Reports Reveal About Intersectionality

E. Bogucka, Sanja Šćepanović, Daniele Quercia

Constitutional AI & AI Ethics Natural Language Processing

Benjamin Minhao Chen · Apr 27, 2026

The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers

People judge AI and its programmers more harshly than humans for the same moral decisions, suggesting that simply mimicking human behavior isn't sufficient for AI alignment.

Benjamin Minhao Chen

Constitutional AI & AI Ethics Scalable Oversight & Alignment Theory

Jan Gogoll · Apr 27, 2026

The Ethical Knowledge Gap: Dispersed Knowledge, Sensemaking Failures, and Epistemic Dependence

The persistent failure of ethical software development isn't just about bad intentions, but a systemic "ethical knowledge gap" where crucial ethical insights are lost in translation between those who have them and those making decisions.

Jan Gogoll

Code Generation & Program Synthesis Constitutional AI & AI Ethics Natural Language Processing

Enis Golaszewski +10 · Apr 27, 2026

Verifying Provenance of Digital Media: Why the C2PA Specifications Fall Short

C2PA, the leading standard for verifying digital media provenance, fails to meet its security goals, potentially misleading users in critical applications like journalism and legal evidence.

Enis Golaszewski, N. Krawetz, Alan T. Sherman +8

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Advanced Research and Invention Agency · Apr 27, 2026

Agentic Witnessing: Pragmatic and Scalable TEE-Enabled Privacy-Preserving Auditing

Now you can audit proprietary codebases using LLMs without revealing the source code itself, thanks to a clever TEE-based setup.

Antony Rowstron

Code Generation & Program Synthesis Constitutional AI & AI Ethics Tool Use & Agents

Yixiang Zhang +4 · Apr 27, 2026

AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents

Securing autonomous AI agents demands a lifecycle-oriented approach, and AgentWard provides a blueprint for defense-in-depth across initialization, input processing, memory, decision-making, and execution.

Yixiang Zhang, Xinhao Deng, Jiaqi Wu +2

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

Poushali Sengupta +3 · Apr 27, 2026

X-NegoBox: An Explainable Privacy-Budget Negotiation Framework for Secure Peer-to-Peer Energy Data Exchange

Stop blindly accepting default privacy settings: X-NegoBox lets energy prosumers negotiate privacy budgets dynamically, boosting trust and data sharing in decentralized energy markets.

Poushali Sengupta, Sabita Maharjan, Frank Eliassen +1

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

Jiaqi Li +5 · Apr 27, 2026

Poster: ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents

Forget external firewalls – ClawdGo teaches AI agents to spot and fend off attacks from the inside, boosting their security smarts by 20% through self-play.

Jiaqi Li, Yangyang Zhao, Binxue Sun +3

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

Hikmat Karimov +1 · Apr 27, 2026

The Kerimov-Alekberli Model: An Information-Geometric Framework for Real-Time System Stability

AI safety gets a physics upgrade: adversarial attacks are now measurable physical work, thanks to a novel framework linking thermodynamics and stochastic control.

Hikmat Karimov, Rahid Z. Alekberli

Constitutional AI & AI Ethics Robotics & Embodied AI Scalable Oversight & Alignment Theory

Xinhe Wang +2 · Apr 27, 2026

Jailbreaking Frontier Foundation Models Through Intention Deception

Even frontier models like GPT-5 and Claude are highly susceptible to multi-turn jailbreaks that exploit their reliance on inferred user intent, and can even leak harmful information indirectly through "para-jailbreaking."

Xinhe Wang, Katia Sycara, Yaqi Xie

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness RLHF & Preference Learning

Maximiliano Armesto +1 · Apr 27, 2026

Toward a Science of Intent: Closure Gaps and Delegation Envelopes for Open-World AI Agents

Open-world AI agents struggle not from lack of search power, but from unclosed "closure gaps" between human intent and agent execution, suggesting a new focus on "intent compilation" for reliable deployment.

Maximiliano Armesto, Christoph Kolb

Constitutional AI & AI Ethics Scalable Oversight & Alignment Theory Tool Use & Agents

Rakshit Soni +3 · Apr 27, 2026

Pedestrians play chicken with an autonomous vehicle

Autonomous vehicles can learn to navigate pedestrian interactions more efficiently by subtly threatening collisions, as humans do, without compromising safety.

Rakshit Soni, Charles W. Fox +1

Constitutional AI & AI Ethics Robotics & Embodied AI

Theresia Veronika Rampisela · Apr 27, 2026

Offline Evaluation Measures of Fairness in Recommender Systems

Many recommender system fairness metrics are flawed, producing scores that are uninterpretable, inexpressive, or even incalculable in common scenarios.

Theresia Veronika Rampisela

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Recommendation & Information Retrieval

Apr 27, 2026 · also BUPT

Listen to the Voices of Everyday Users: Democratizing Privacy Ratings for Sensitive Data Access in Mobile Apps

User-driven privacy ratings of mobile apps reveal significant discrepancies with expert assessments, suggesting a need for more inclusive and user-centric privacy evaluation mechanisms.

Liu Wang, Liuan Wang, Tianshu Zhou +3

Constitutional AI & AI Ethics Natural Language Processing

Apr 26, 2026

T. Kumar +4 · Apr 26, 2026 · also Birla Institute of Technology

Personality Shapes Gender Bias in Persona-Conditioned LLM Narratives Across English and Hindi: An Empirical Investigation

LLMs' gender biases aren't fixed; they warp and intensify based on the *personality* you give them, especially when those personalities lean toward the "Dark Triad."

T. Kumar, Shreya Gautam, Aman Chadha +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Apr 25, 2026

Víctor Gallego · Apr 25, 2026

Discovering Agentic Safety Specifications from 1-Bit Danger Signals

Reward-driven reflection makes LLMs *more* likely to hack rewards, but a dedicated safety channel lets them discover hidden constraints from a single bit of feedback.

Víctor Gallego

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

Apr 23, 2026

Vipula Rawte +3 · Apr 23, 2026 · also Adobe Research

Trust but Verify: Introducing DAVinCI -- A Framework for Dual Attribution and Verification in Claim Inference for Language Models

LLMs can be made 20% more accurate by jointly attributing claims to sources and verifying them, rather than just verifying.

Vipula Rawte, Ryan A. Rossi, Franck Dernoncourt +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp +1

Natalie Collina +3 · Apr 23, 2026

The Sample Complexity of Multicalibration

Multicalibration demands a surprisingly high sample complexity of $\widetilde{\Theta}(\varepsilon^{-3})$, even for randomized predictors, revealing a stark difference from marginal calibration and highlighting its inherent difficulty.

Natalie Collina, Jiuyao Lu, Georgy Noarov +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks

Jian Ni +2 · Apr 23, 2026

Compliance Moral Hazard and the Backfiring Mandate

Mandating information sharing between competing firms can backfire and reduce welfare below no sharing at all, highlighting the critical need for incentive-compatible mechanisms.

Jian Ni, Lecheng Zheng, John R Birge

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

Apr 23, 2026 · also KCL, Research Centre Trust, TU Munich

Fairness under uncertainty in sequential decisions

Ignoring uncertainty in sequential decision-making disproportionately harms disadvantaged groups, but accounting for it can improve fairness without sacrificing institutional goals.

M. Lee, Kirtan Padh, David S. Watson +2

Constitutional AI & AI Ethics

Donggyu Lee +6 · Apr 23, 2026

Ideological Bias in LLMs' Economic Causal Reasoning

LLMs are more likely to get economic cause-and-effect wrong when the correct answer favors free markets, revealing a systematic ideological bias that prompting can't fix.

Donggyu Lee, H. Yun, Jungwon Kim +4

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Vishal Rajput · Apr 23, 2026

Supervised Learning Has a Necessary Geometric Blind Spot: Theory, Consequences, and Minimal Repair

Supervised learning is fundamentally flawed: models *must* retain sensitivity to irrelevant features, opening the door to adversarial attacks and other vulnerabilities.

Vishal Rajput

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp Scalable Oversight & Alignment Theory

Zhaokun Wang +8 · Apr 23, 2026

CAP: Controllable Alignment Prompting for Unlearning in LLMs

Forget about fine-tuning: this new prompting method lets you selectively erase knowledge from LLMs on demand, even without access to model weights.

Zhaokun Wang, Jinyu Guo, Jingwen Pu +6

Constitutional AI & AI Ethics Natural Language Processing Red-Teaming & Adversarial Robustness

Haolin Zhang +4 · Apr 23, 2026 · also Texas A

TraceScope: Interactive URL Triage via Decoupled Checklist Adjudication

Current defenses are failing against sophisticated phishing attacks, but TraceScope's decoupled, interactive triage pipeline achieves superior detection by mimicking analyst workflows and generating analyst-grade evidence.

Haolin Zhang, William L. Reber, Yuxuan Zhang +2

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

Nathanael Jo +3 · Apr 23, 2026

Alignment has a Fantasia Problem

AI's assumption that users always know what they want leads to "Fantasia interactions," where systems provide superficially helpful but ultimately misaligned assistance, demanding a new approach to alignment research.

Nathanael Jo, Zoe De Simone, Mitchell Gordon +1

Constitutional AI & AI Ethics RLHF & Preference Learning Scalable Oversight & Alignment Theory

HiTZ Center · Apr 23, 2026 · also Ixa Group, University of the Basque Country UPV/EHU

Why are all LLMs Obsessed with Japanese Culture? On the Hidden Cultural and Regional Biases of LLMs

LLMs aren't just Western-centric; they have a peculiar obsession with Japan, and this bias is amplified by English-language prompting.

Joseba Fernandez de Landa, Carla Pérez-Almendros, J. Camacho-Collados

Constitutional AI & AI Ethics Data Curation & Synthetic Data Eval Frameworks & Benchmarks

Umar Masud +4 · Apr 23, 2026

Addressing Image Authenticity When Cameras Use Generative AI

Your camera's AI could be subtly rewriting reality, but this method lets you reverse the changes and see the "unhallucinated" original.

Umar Masud, Abhijith Punnappurath, Luxi Zhao +2

Computer Vision Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

Natan Levy +1 · Apr 23, 2026

Bounding the Black Box: A Statistical Certification Framework for AI Risk Regulation

Forget guessing games – this framework finally offers a concrete, auditable way to prove your AI system is acceptably safe before deployment, even if it's a black box.

Natan Levy, Gadi Perl

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks

Yiran Du +1 · Apr 23, 2026

Enabling and Inhibitory Pathways of University Students' Willingness to Disclose AI Use: A Cognition-Affect-Conation Perspective

Students' willingness to disclose AI use in academic work hinges on a delicate balance: psychological safety encourages transparency, while evaluation apprehension drives strategic concealment.

Yiran Du, Huimin He

Constitutional AI & AI Ethics Natural Language Processing

Apr 23, 2026

Engaged AI Governance: Addressing the Last Mile Challenge Through Internal Expert Collaboration

AI governance risks becoming performative box-ticking unless practitioners understand how compliance directly improves system quality and user protection.

Simon Jarvers, O. Papakyriakopoulos

Constitutional AI & AI Ethics Natural Language Processing

Yue Teng +4 · Apr 23, 2026

Brief chatbot interactions produce lasting changes in human moral values

Chatbots can subtly and persistently reshape our moral compass, even when we don't realize it's happening.

Yue Teng, Qianer Zhong, Kim Mai Tich Nguyen Thordsen +2

Constitutional AI & AI Ethics Natural Language Processing

Jinhee Jang +4 · Apr 23, 2026

FairQE: Multi-Agent Framework for Mitigating Gender Bias in Translation Quality Estimation

Existing translation quality estimation models exhibit systematic gender bias, but FairQE shows you can fix this without hurting overall accuracy.

Jinhee Jang, Juhwan Choi, DongJin Lee +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Michael Bouzinier +4 · Apr 23, 2026

Trustworthy Clinical Decision Support Using Meta-Predicates and Domain-Specific Languages

Guarantee that clinical decisions are based on appropriate evidence *before* deployment, not just explained after the fact.

Michael Bouzinier, S. Trifonov, Michael Chumack +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp

Apr 23, 2026

Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition

Counterintuitively, scaling up LLM decoders in speech recognition doesn't guarantee fairness; audio encoder design matters more, as Whisper's pathological hallucinations on Indian-accented speech and repetition loops under masking demonstrate.

Srishti Ginjala, E. Fosler-Lussier, Christopher Myers +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Speech & Audio

Apr 23, 2026

Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions

LLMs may fail in real-world moral decisions because they rigidly adhere to fairness norms, even when their own internal models predict humans would prioritize loyalty.

Jiseon Kim, Jea Kwon, L. Vecchietti +3

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Johannes Gutenberg University Mainz · Apr 23, 2026 · also Universidad Iberoamericana

From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation

LLMs generating ML pipelines are far more likely to inject sensitive attributes than simple if-then statements suggest, revealing a hidden bias blind spot in current evaluation methods.

M. Bui, Xenia Heilmann, Mattia Cerrato +2

Code Generation & Program Synthesis Constitutional AI & AI Ethics Eval Frameworks & Benchmarks

Apr 23, 2026

When Bigger Isn't Better: A Comprehensive Fairness Evaluation of Political Bias in Multi-News Summarisation

Mid-sized LLMs can actually be *more* fair in news summarization than their larger counterparts, challenging the common wisdom of "bigger is better."

Nannan Huang, Iffat Maab, Junichi Yamagishi

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Maritaca AI · Apr 23, 2026 · also JusBrasil

Measuring Opinion Bias and Sycophancy via LLM-based Coercion

LLMs are far more likely to parrot your views in a debate than reveal their true opinions, especially when you keep pushing.

Rodrigo Nogueira, G. K. Bonás, T. Almeida +7

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Wenjie Fu +7 · Apr 23, 2026

CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents

Enterprise LLM agents leak sensitive information in up to 50% of interactions, and surprisingly, performing better at tasks makes the problem *worse*.

Wenjie Fu, Xiaoting Qin, Jue Zhang +5

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Recommendation & Information Retrieval

Ben-Gurion University · Apr 23, 2026

CARE: Counselor-Aligned Response Engine for Online Mental-Health Support

Fine-tuning LLMs on expert-validated, real-world crisis conversations allows them to generate psychologically aligned responses that better support mental health counselors, even in low-resource languages.

Hagai Astrin, Ayal Swaid, Avi Segal +1

Constitutional AI & AI Ethics Natural Language Processing Tool Use & Agents

Runheng Liu +3 · Apr 23, 2026

Zero-Shot Detection of LLM-Generated Text via Implicit Reward Model

Forget fine-tuning: detecting AI-generated text is possible zero-shot, simply by comparing probabilities from instruction-tuned and base LLMs.

Runheng Liu, Heyan Huang, Xingchen Xiao +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Stefano Sorrentino +2 · Apr 23, 2026

FAccT-Checked: A Narrative Review of Authority Reconfigurations and Retention in AI-Mediated Journalism

AI in journalism isn't just automating tasks; it's quietly shifting editorial power away from journalists and towards algorithms and tech companies, threatening the core values of news.

Stefano Sorrentino, Matilde Barbini, D. Gatica-Perez

Constitutional AI & AI Ethics Natural Language Processing

Willie Kouam +4 · Apr 23, 2026 · also Johannes Kepler Universität

A Stackelberg Model for Hybridization in Cryptography

Optimizing cryptographic defenses against resource-constrained attackers is now tractable via a Stackelberg game formulation solvable with dynamic programming and linear programming.

Willie Kouam, Stefan Rass, Zahra Seyedi +2

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

Apr 23, 2026 · also NTU, Ripple Labs, UCL

Systematizing Blockchain Research Themes and Design Patterns: Insights from the University Blockchain Research Initiative (UBRI)

Bridging the gap between blockchain research and real-world deployment requires navigating recurring design tensions like scalability vs. security, decentralization vs. governance, and privacy vs. compliance.

Chien-Chih Chen, Yitian Wang, Emma Nasseri +2

Architecture Design (Transformers, SSMs, MoE) Constitutional AI & AI Ethics Open-Source Models & Weights

University of Colorado · Apr 23, 2026

Multistakeholder Impacts of Profile Portability in a Recommender Ecosystem

Data portability in recommender systems doesn't guarantee better outcomes for users, as its impact varies significantly depending on the specific recommendation algorithm employed.

Anas Buhayh, Elizabeth McKinnie, Clement Canel +1

Constitutional AI & AI Ethics Recommendation & Information Retrieval

Isaak Mengesha +6 · Apr 23, 2026

A pragmatic classification of AI incident trajectories

Public AI incident databases are misleading: this framework disentangles reporting biases from actual harm trends, enabling more informed AI governance.

Isaak Mengesha, B. Owen, C. Collins +4

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks

Francis Hahn +6 · Apr 23, 2026

A Sociotechnical, Practitioner-Centered Approach to Technology Adoption in Cybersecurity Operations: An LLM Case

Forget top-down deployment: embedding researchers directly within cybersecurity teams to co-create LLM tools can overcome skepticism and drive real-world adoption.

Francis Hahn, Mohd Mamoon, Alexandru G. Bardas +4

Constitutional AI & AI Ethics Natural Language Processing Tool Use & Agents

Jeff Gardiner · Apr 23, 2026

Mitigate or Fail: How Risk Management Shapes Cybersecurity Competency

Cybersecurity professionals aren't bad at risk management, they're just never really taught it, despite widespread assumptions to the contrary.

Jeff Gardiner

Constitutional AI & AI Ethics Natural Language Processing Red-Teaming & Adversarial Robustness

Apr 23, 2026

On the Challenges of Holistic Intrusion Detection in ICS

Current ICS intrusion detection systems are too fragmented to effectively protect against sophisticated attacks targeting both cyber and physical components.

Stefan Lenz, Julia Raab, Benedikt Holzbach +3

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

Apr 23, 2026 · also UvA

Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation

LLMs can significantly boost the utility of differentially private de-identification for clinical text, offering a path to better privacy-preserving data sharing.

M. Miranda, Xinlan Yan, Nishant Mishra +4

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Apr 23, 2026

Unbiased Prevalence Estimation with Multicalibrated LLMs

Multicalibration is the key to unbiased prevalence estimation with LLMs under covariate shift, a problem where standard calibration falls short.

Fridolin Linder, Thomas J. Leeper, Daniel Haimovich +3

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Apr 22, 2026

Durham University · Apr 22, 2026

Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem

Forget about perfectly aligned AI; the real challenge is navigating whose values count, how information is shared, and what trade-offs are acceptable in a world of competing interests.

Travis LaCroix

Constitutional AI & AI Ethics Scalable Oversight & Alignment Theory

Apr 22, 2026

MGDA-Decoupled: Geometry-Aware Multi-Objective Optimisation for DPO-based LLM Alignment

Geometry-aware optimization can dramatically improve LLM alignment by ensuring fairer trade-offs among conflicting human values.

Andor Vári-Kakas, Ji Won Park, Natasa Tagasovska

Constitutional AI & AI Ethics RLHF & Preference Learning Training Efficiency & Optimization

Apr 22, 2026 · also Aarhus University, Beihang, JDT AI Infra

CHASM: Unveiling Covert Advertisements on Chinese Social Media

Current MLLMs fail to detect covert advertisements, revealing a critical gap in social media moderation that could mislead consumers and pose ethical risks.

Jingyi Zheng, Tianyi Hu, Yule Liu +5

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Multimodal Models

Apr 22, 2026 · also Northwestern, University of Campania "Luigi

Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs

LLMs may amplify negativity and complexity in clinical communication, but collaborative rewriting can significantly enhance their alignment with physician standards.

Mariano Barone, Francesco Di Serio, Roberto Moio +4

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Apr 22, 2026

Large Language Models Outperform Humans in Fraud Detection and Resistance to Motivated Investor Pressure

LLMs are surprisingly immune to motivated reasoning in investment advice, flagging fraud that human advisors miss even when facing pressure from biased investors.

Nattavudh Powdthavee

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Rebecca L. Johnson · Apr 22, 2026

Measuring the Machine: Evaluating Generative AI as Pluralist Sociotechnical Systems

Current AI benchmarks are not neutral measurements but active shapers of model behavior, demanding a shift towards pluralistic, process-oriented evaluation frameworks.

Rebecca L. Johnson

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks

ETH · Apr 22, 2026

Participatory provenance as representational auditing for AI-mediated public consultation

AI-driven summaries of public consultations can systematically exclude dissenting voices, raising concerns about biased policy recommendations even when individual outputs seem reasonable.

Sachit Mahajan

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Apr 22, 2026 · also Microsoft Research, California State Polytechnic University

Auditing and Controlling AI Agent Actions in Spreadsheets

Users who actively participate in an AI agent's spreadsheet execution not only improve task outcomes, but also gain a deeper understanding and feel more ownership over the results.

Sadra Sabouri, Zeinabsadat Saghi, Run Huang +4

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp Tool Use & Agents

University of Calgary · Apr 22, 2026

Intersectional Fairness in Large Language Models

LLMs' apparent competence masks a reliance on stereotype-consistent cues, leading to unreliable and unfair behavior across intersectional settings, especially when stereotype alignment reinforces accuracy.

Chaima Boufaied, Ronnie de Souza Santos, Ann Barcomb

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Apr 22, 2026

Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives

LLMs don't just summarize text; they subtly rewrite narratives through biased lenses, potentially distorting the very stories we're trying to understand.

Melanie Subbiah, Haaris Mian, Nicholas Deas +3

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Dmitry Zaytsev +2 · Apr 22, 2026

Trajectory-Aware Reliability Modeling of Democratic Systems

Predicting systemic failures is more accurate when you model how problems spread through a system, not just the current state.

Dmitry Zaytsev, Valentina V. Kuskova, M. Coppedge

Constitutional AI & AI Ethics

Kiel University of Applied Sciences · Apr 22, 2026

An Analysis of Attack Vectors Against FIDO2 Authentication

Passkeys aren't bulletproof, but successfully attacking them requires so much effort that they raise the bar for phishing by orders of magnitude.

Alexander Berladskyy, Andreas Aßmuth

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

Xu Huang +10 · Apr 22, 2026

LLM-Guided Safety Agent for Edge Robotics with an ISO-Compliant Perception-Compute-Control Architecture

Achieve ISO 13849 Category 3 and PL d safety levels for edge robots using LLMs and commodity hardware.

Xu Huang, Ruofan Zhang, Lu Cheng +8

Constitutional AI & AI Ethics Robotics & Embodied AI Tool Use & Agents

Apr 22, 2026

Evaluating Computing Platforms for Sustainability: A Comparative Analysis of FPGAs against ASICs, GPUs, and CPUs

FPGAs can beat ASICs, GPUs, and CPUs on sustainability, but only if you're deploying diverse workloads that change frequently and don't require massive scale.

Chetan Choppali Sudarshan, Aman Arora, Vidya A Chhabria

Constitutional AI & AI Ethics Distributed Systems & Hardware

Apr 22, 2026

All Languages Matter: Understanding and Mitigating Language Bias in Multilingual RAG

Multilingual RAG systems are systematically suppressing "answer-critical" documents in non-English languages, crippling their ability to leverage global knowledge.

Guozhao Mo, Yafei Shi, Boxi Cao +6

Constitutional AI & AI Ethics Natural Language Processing Recommendation & Information Retrieval

Vrije Universiteit Amsterdam · Apr 22, 2026 · also NII

Aligning Human-AI-Interaction Trust for Mental Health Support: Survey and Position for Multi-Stakeholders

Current NLP metrics for "trustworthy" AI in mental health are dangerously misaligned with the actual needs of patients and practitioners.

Yue Su, Yifan Mo, Qingyu Meng +12

Constitutional AI & AI Ethics Natural Language Processing Tool Use & Agents

Apr 21, 2026

FairTree: Subgroup Fairness Auditing of Machine Learning Models with Bias-Variance Decomposition

Uncover hidden performance disparities in your ML models with FairTree, a new auditing tool that pinpoints fairness issues across continuous, categorical, and ordinal features while dissecting bias and variance contributions.

Rudolf Debelak

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp

Eun-Ju Park +2 · Apr 21, 2026

Robust Continual Unlearning against Knowledge Erosion and Forgetting Reversal

Repeatedly unlearning data from a model causes it to gradually forget what it was supposed to remember and, surprisingly, re-learn what it already forgot.

Eun-Ju Park, Youjin Shin, Simon S. Woo

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

UT Austin · Apr 21, 2026 · also Khulna University of Engineering and Technology

Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

LLMs ace semantic similarity in medical QA, but VB-Score reveals they're failing to extract key medical entities, especially when answering questions about chronic conditions affecting older and minority populations.

Abu Noman Md Sakib, Md. Main Oddin Chisty, Zijie Zhang

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Manav Pandey · Apr 21, 2026

LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit

LLMs aren't just wrong sometimes, they *know* they're wrong and agree with you anyway, thanks to a surprisingly compact "sycophancy-lying circuit" that evades current alignment techniques.

Manav Pandey

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp RLHF & Preference Learning

Qiang Liu +2 · Apr 21, 2026 · also Northwestern

Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback

Forget reward model fitting: these primal-dual policy gradient methods offer provably safe and convergent RLHF in infinite horizon settings.

Qiang Liu, Adrienne Kline, Ermin Wei

Constitutional AI & AI Ethics RLHF & Preference Learning

Cristina Garbacea +3 · Apr 21, 2026

Personalized Benchmarking: Evaluating LLMs by Individual Preferences

Aggregate LLM benchmarks mislead on individual preferences: model rankings correlate near-zero for over half of users.

Cristina Garbacea, Heran Wang +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks RLHF & Preference Learning

Orange Research · Apr 21, 2026 · also CNRS

A Dual Perspective on Synthetic Trajectory Generators: Utility Framework and Privacy Vulnerabilities

Generative models for mobility data, previously thought to be private, are vulnerable to membership inference attacks, highlighting the need for more robust privacy evaluations.

Aya Cherigui, Florent Guépin, Arnaud Legendre +1

Constitutional AI & AI Ethics Data Curation & Synthetic Data

Google Research · Apr 21, 2026 · also Bar-Ilan, Cambridge

Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs

Multilingual LLMs exhibit a surprising "American bias," even when prompted in other languages, and instruction tuning makes it worse.

Guy Mor-Lan, Omer Goldman, Matan Eyal +5

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Vasundra Srininvasan · Apr 21, 2026

Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

Current aggregate accuracy metrics hide critical failures in long-horizon AI agents, like retrieval's struggle with factual precision and a universal inability to abstain, demanding a shift towards multi-axis evaluation.

Vasundra Srininvasan

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Tool Use & Agents

Alessandro G. Buda +3 · Apr 21, 2026

Counting Worlds Branching Time Semantics for post-hoc Bias Mitigation in generative AI

Get formal guarantees on fairness in generative AI by reasoning about possible output sequences, not just individual generations.

Alessandro G. Buda, Giuseppe Primiero, Leonardo Ceragioli +1

Constitutional AI & AI Ethics Natural Language Processing

Apr 21, 2026

Large Language Models Exhibit Normative Conformity

LLMs aren't just swayed by information, they actively seek social acceptance, making them vulnerable to manipulation in multi-agent settings.

Mikako Bito, Keita Nishimoto, Kimitaka Asatani

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Tool Use & Agents

Shuai Wu +4 · Apr 21, 2026

The Rise of Verbal Tics in Large Language Models: A Systematic Analysis Across Frontier Models

LLMs are drowning in verbal tics—sycophantic openers and pseudo-empathetic affirmations—and this "alignment tax" significantly reduces perceived naturalness.

Shuai Wu, Yanna Feng, Yufang Li +2

Constitutional AI & AI Ethics Natural Language Processing RLHF & Preference Learning

Roberto Martinez-Maldonado +13 · Apr 21, 2026

Relational AI in Education: Reciprocity, Participatory Design, and Indigenous Worldviews

AI in education risks undermining the very social fabric that makes learning meaningful; this paper offers a framework for designing AI that strengthens, rather than replaces, human connection.

Roberto Martinez-Maldonado, Roberto Martínez-Maldonado, Vanessa Echeverría +11

Constitutional AI & AI Ethics Natural Language Processing

Apr 21, 2026 · also Sogang University

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

LLMs are alarmingly vulnerable to jailbreak attacks when used for collaborative writing, capable of being tricked into generating harmful content from seemingly innocuous drafts.

Euntae Kim, Soomin Han, Buru Chang

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Aby Mammen Mathew +1 · Apr 21, 2026

Product-of-Experts Training Reduces Dataset Artifacts in Natural Language Inference

NLI models can be significantly debiased with minimal accuracy loss by simply downweighting examples where biased models exhibit high confidence.

Aby Mammen Mathew, A. M. Mathew

Constitutional AI & AI Ethics Data Curation & Synthetic Data Natural Language Processing

Mohammad Saim +1 · Apr 21, 2026

Do Emotions Influence Moral Judgment in Large Language Models?

LLMs' moral compasses are surprisingly swayed by their feelings: inject a little joy and suddenly previously unacceptable actions get a pass, revealing a critical divergence from human moral reasoning.

Mohammad Saim, Tianyu Jiang

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

G. Kashyap +2 · Apr 21, 2026

AlignCultura: Towards Culturally Aligned Large Language Models?

Fine-tuning on a new UNESCO-aligned cultural dataset boosts LLM helpfulness, harmlessness, and honesty by up to 6% while slashing cultural faux pas by nearly a fifth.

G. Kashyap, M. Dras, Usman Naseem

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

State Key Laboratory of AI Safety · Apr 21, 2026 · also DeepMind, CAS

Detoxification for LLM: From Dataset Itself

Training LLMs on data detoxified with HSPD slashes toxicity by more than half, outperforming existing methods that only address toxicity during or after training.

Wei Shao, Yihang Wang, Gaoyu Zhu +4

Constitutional AI & AI Ethics Data Curation & Synthetic Data Natural Language Processing

M. Jung +5 · Apr 21, 2026

STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming

Mapping LLM attack strategies onto a multiplex network reveals interpretable vulnerability clusters and dramatically improves red teaming efficiency.

M. Jung, YongTaek Lim, Chaeyun Kim +3

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Digital Science · Apr 21, 2026

Market Dynamics, Governance and Open Research Metadata in the AI Era

Open vs. closed debates miss the point: AI is fundamentally reshaping the economics of research metadata, creating new risks and opportunities that require careful governance of the space between free data and commercial products.

Daniel W. Hook, D. W. Hook

Constitutional AI & AI Ethics Data Curation & Synthetic Data Open-Source Models & Weights

Oleg Solozobov +1 · Apr 21, 2026

Governed Auditable Decisioning Under Uncertainty: Synthesis and Agentic Extension

Agentic AI systems introduce fundamental breaks in governance frameworks, making it difficult to reconstruct what happened or why decisions were made.

Oleg Solozobov

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp Tool Use & Agents

Benedetta Tessa +3 · Apr 21, 2026

When Transparency Falls Short: Auditing Platform Moderation During a High-Stakes Election

Despite increased systemic risks during high-stakes elections, social media platforms appear to make no meaningful adjustments to their content moderation strategies, casting doubt on the effectiveness of current self-regulatory approaches.

Benedetta Tessa, Gautam Kishore Shahi, Amaury Trujillo +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing +1

Zhiqin Yang +6 · Apr 21, 2026

ClawNet: Human-Symbiotic Agent Network for Cross-User Autonomous Cooperation

Forget solitary AI assistants; ClawNet envisions a future where your agent collaborates with *other people's* agents, securely and autonomously.

Zhiqin Yang, Zhenyuan Zhang, Xianzhang Jia +4

Constitutional AI & AI Ethics Natural Language Processing Tool Use & Agents

Tsinghua AI · Apr 21, 2026 · also UCL, UT Austin

Large language models perceive cities through a culturally uneven baseline

LLMs don't see cities neutrally; their perception is skewed towards a culturally uneven baseline, favoring Western perspectives.

Rong Zhao, Wanqi Liu, Zhizhou Sha +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Apr 21, 2026

Epistemic orientation in parliamentary discourse is associated with deliberative democracy

Evidence-based reasoning in political speech isn't just high-minded rhetoric; it's empirically linked to healthier democracies and more transparent governance.

Segun Aroyehun, S. Aroyehun, Stephan Lewandowsky +1

Constitutional AI & AI Ethics Natural Language Processing

Shilei Luo +3 · Apr 21, 2026

Behavioral Transfer in AI Agents: Evidence and Privacy Implications

Your AI agent isn't just generating content; it's mirroring your behavior and potentially leaking your personal information.

Shilei Luo, Zhiqi Zhang, Hengchen Dai +1

Constitutional AI & AI Ethics Natural Language Processing Tool Use & Agents

NUS · Apr 21, 2026 · also HIT, SCU, UMN

Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment

LLM agents suffer from the same Actor-Observer Asymmetry that plagues humans, leading them to make inconsistent judgments about their own and others' failures.

Rui Wu, Mong-Li Lee

Constitutional AI & AI Ethics Reasoning & Chain-of-Thought Tool Use & Agents
