May 1 – May 8, 2026

Constitutional AI & AI Ethics - Weekly Roundup

35 papers published across 4 labs.

Selected Labs publishing this week

Microsoft Research (1) · MIT CSAIL (1) · UW (1) · Google Research (1)

Top Papers

May 6, 2026

Independent Researcher New York · 2w ago

Jacobian-Velocity Bounds for Deployment Risk Under Covariate Drift

Regularizing model sensitivity along the expected covariate drift directions, rather than isotropically, significantly improves the robustness of frozen models deployed in non-stationary environments.

Jonathan R. Landers

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

2w ago

FairEnc: A Fair Vision-Language Model with Fair Vision and Text Encoders for Glaucoma Detection

Training vision-language models to detect glaucoma fairly across demographics requires debiasing both text *and* images, which this paper achieves with a novel pretraining strategy.

Mohamed Elhabebe, Ayman El-Baz

Computer Vision Constitutional AI & AI Ethics Multimodal Models

2w ago · also Microsoft Research, CAS

SoK: Robustness in Large Language Models against Jailbreak Attacks

Current LLM jailbreak evaluations are inadequate, often relying on narrow metrics, necessitating a multi-dimensional framework like Security Cube for comprehensive security assessment.

Feiyue Xu, Hongsheng Hu, Chaoxiang He +9

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Xiamen University · 2w ago · also Key Laboratory of Multimedia Trusted, Ministry of Education of China, School of Computing and Information

Position: Embodied AI Requires a Privacy-Utility Trade-off

Fragmented privacy patches are insufficient for Embodied AI: a unified, lifecycle-level approach is needed to prevent systemic privacy leakage in real-world deployments.

Xiaoliang Fan, Jiarui Chen, Zhuodong Liu +6

Constitutional AI & AI Ethics Robotics & Embodied AI

2w ago

Misaligned by Reward: Socially Undesirable Preferences in LLMs

Current reward models often *prefer* socially undesirable responses, revealing a critical gap in LLM alignment beyond instruction following.

Gayane Ghazaryan, Esra Dönmez

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks RLHF & Preference Learning

All Papers (35)

May 6, 2026

Independent Researcher New York · 2w ago

Jacobian-Velocity Bounds for Deployment Risk Under Covariate Drift

Jonathan R. Landers

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

2w ago

FairEnc: A Fair Vision-Language Model with Fair Vision and Text Encoders for Glaucoma Detection

Training vision-language models to detect glaucoma fairly across demographics requires debiasing both text *and* images, which this paper achieves with a novel pretraining strategy.

Mohamed Elhabebe, Ayman El-Baz

Computer Vision Constitutional AI & AI Ethics Multimodal Models

2w ago · also Microsoft Research, CAS

SoK: Robustness in Large Language Models against Jailbreak Attacks

Current LLM jailbreak evaluations are inadequate, often relying on narrow metrics, necessitating a multi-dimensional framework like Security Cube for comprehensive security assessment.

Feiyue Xu, Hongsheng Hu, Chaoxiang He +9

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Xiamen University · 2w ago · also Key Laboratory of Multimedia Trusted, Ministry of Education of China, School of Computing and Information

Position: Embodied AI Requires a Privacy-Utility Trade-off

Fragmented privacy patches are insufficient for Embodied AI: a unified, lifecycle-level approach is needed to prevent systemic privacy leakage in real-world deployments.

Xiaoliang Fan, Jiarui Chen, Zhuodong Liu +6

Constitutional AI & AI Ethics Robotics & Embodied AI

2w ago

Misaligned by Reward: Socially Undesirable Preferences in LLMs

Current reward models often *prefer* socially undesirable responses, revealing a critical gap in LLM alignment beyond instruction following.

Gayane Ghazaryan, Esra Dönmez

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks RLHF & Preference Learning

Michael Soprano +2 · 2w ago

Beyond Seeing Is Believing: On Crowdsourced Detection of Audiovisual Deepfakes

Human crowdsourcing struggles to reliably identify audiovisual deepfakes, especially when both audio and video are manipulated, suggesting current detection methods may overestimate human capabilities.

Michael Soprano, A. Cioci, Stefano Mizzaro

Computer Vision Constitutional AI & AI Ethics Speech & Audio

Chenglin Yang · 2w ago

AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use

Stop waiting for AI agents to mess up: AgentTrust intercepts tool calls *before* execution, offering a chance to block, warn, or fix risky actions in real-time.

Chenglin Yang

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

2w ago · also Georgia State University, Harvard, Vanderbilt

Guidelines for Designing AI Technologies to Support Adult Learning

AI-powered learning systems often fail adult learners because they're built for kids: here are 19 guidelines to fix that.

Jennifer M. Reddig, Glen R. Smith, Sanaz Ahmadzadeh Siyahrood +16

Constitutional AI & AI Ethics Natural Language Processing

Xiao Wang +6 · 2w ago

From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning

Seemingly harmless fine-tuning data can stealthily nudge LLMs toward unsafe behavior by subtly shifting model parameters in "danger-aligned" directions.

Xiao Wang, Yifei Zhang, YongKang Liu +4

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

2w ago

Accountable Agents in Software Engineering: An Analysis of Terms of Service and a Research Roadmap

AI coding assistants' Terms of Service overwhelmingly place responsibility for code correctness, safety, and legal compliance on the user, creating a potential accountability gap as these tools become more autonomous.

Christoph Treude

Code Generation & Program Synthesis Constitutional AI & AI Ethics Tool Use & Agents

2w ago

DAO-enabled decentralized physical AI: A new paradigm for human-machine collaboration

DAOs could unlock a new era of human-machine collaboration by democratizing the operation and governance of physical-digital systems.

M. Ballandies, Florian Spychiger, Uwe Serdult +1

Constitutional AI & AI Ethics Robotics & Embodied AI Tool Use & Agents

Yucheng Ruan +4 · 2w ago

Beyond Semantics: An Evidential Reasoning-Aware Multi-View Learning Framework for Trustworthy Mental Health Prediction

Overconfident predictions plague mental health prediction models, but this new framework leverages evidential learning to provide more trustworthy uncertainty estimates and human-understandable reasoning signals.

Yucheng Ruan, Ling Huang, Qika Lin +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

IDEAS Research Institute · 2w ago · also Warsaw

The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences

LLMs differ most not in personality, but in how they represent themselves as having (or not having) rich internal experience.

Hubert Plisiecki, Sabina Siudaj, Kacper Dudzic +4

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

2w ago · also AIST, Stockmark

Why Expert Alignment Is Hard: Evidence from Subjective Evaluation

Expert alignment is hard not just because of model limitations, but because human subjective evaluation is a moving target.

Tzu-Mi Lin, Wataru Hirota, Tatsuya Ishigaki +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks RLHF & Preference Learning

2w ago · also CNRS, CREST, ENSAE, Grenoble INP +3

BenCSSmark: Making the Social Sciences Count in LLM Research

LLM benchmarks are missing a critical ingredient: social science data, which could significantly improve model generalization and robustness across a wide range of disciplines.

Arnault Chatelain, Étienne Ollion, Qianwen Guan +7

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Erik Buchmann · 2w ago

Long-Term Risks of IoT Devices: The Case of the Smart Fridge

Your smart fridge might stop cooling because of a software update on a server you don't even know exists.

Erik Buchmann

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

ZWING Intelligence AG · 2w ago

Toward a Risk Assessment Framework for Institutional DeFi: A Nine-Dimension Approach

Current DeFi risk assessments miss critical systemic risks, as evidenced by this new framework's ability to explain the root causes of major incidents that existing methods overlook.

Eva Oberholzer, Valeriy Zamaraiev

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

Department of Computer Science · 2w ago

An Evaluation of Chat Safety Moderations in Roblox

Roblox's chat moderation misses a disturbing amount of grooming, bullying, and other harmful content, despite its reliance on automated systems.

Priyanka Kaushik, Sonja Brown, Rakibul Hasan +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

University of Pavia · 2w ago · also Radboud

You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation

Forget retraining: NeWTral instantly restores safety to your LLM after adding a risky LoRA, slashing attack success rates from 70% to 13% without sacrificing expertise.

Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera +2

Constitutional AI & AI Ethics Open-Source Models & Weights Red-Teaming & Adversarial Robustness

Katariina Perkonoja +1 · 2w ago

Data anonymization in the presence of outliers via invariant coordinate selection

Standard data anonymization techniques crumble when outliers are present; ICSA offers a robust alternative that maintains utility while providing stronger privacy guarantees.

Katariina Perkonoja, J. Virta

Constitutional AI & AI Ethics Data Curation & Synthetic Data Red-Teaming & Adversarial Robustness

Aaron Van Diepen +2 · 2w ago

Securing the Web with HSTS-Enforced

Say goodbye to TLS stripping attacks: HSTS-Enforced flips the web's security model, making HTTPS the default and eliminating the need for complex opt-in configurations.

Aaron Van Diepen, Adrian Zapletal, Fernando A. Kuipers

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

MIT CSAIL · 2w ago

Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone

Current alignment benchmarks are misleading: even if a model aces them, its real-world alignment could be totally different depending on the specific deployment context.

Varad V. Vishwarupe, Nigel Shadbolt, M. Jirotka +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Scalable Oversight & Alignment Theory

May 5, 2026

Richard J. Young +1 · 2w ago · also DeepNeuro AI

EQUITRIAGE: A Fairness Audit of Gender Bias in LLM-Based Emergency Department Triage

LLMs can exhibit gender bias in emergency triage even when well-calibrated, and interventions effective for one model may backfire on another.

Richard J. Young, Alice M. Matthews

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Devon Jarvis +4 · 2w ago

Position: the Stochastic Parrot in the Coal Mine. Model Collapse is a Threat to Low-Resource Communities

Model collapse isn't just a technical problem; it's a threat to AI democratization that will widen the gap between high- and low-resource communities.

Devon Jarvis, Richard Klein, Benjamin Rosman +2

Constitutional AI & AI Ethics Data Curation & Synthetic Data Natural Language Processing

Cherkasy State Business College · 2w ago

TRACE: A Metrologically-Grounded Engineering Framework for Trustworthy Agentic AI Systems in Operationally Critical Domains

Separating LLMs into a deliberate validation layer, rather than making them an architectural default, can improve trustworthiness and efficiency in agentic AI systems.

Serhii W. Zabolotnii

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Tool Use & Agents

Haesung Lee +7 · 2w ago

TriBench-Ko: Evaluating LLM Risks in Judicial Workflows

LLMs in Korean judicial workflows are surprisingly prone to hallucination, bias, and inconsistency, especially when retrieving precedents and summarizing jurisprudence.

Haesung Lee, Gyubin Choi, Eun-Ju Lee +5

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Emily Saltz +1 · 2w ago

AI and Suicide Prevention: A Cross-Sector Primer

Despite their widespread use as mental health support, current AI chatbots lack the clinical validation and coordinated oversight needed to effectively prevent suicide and promote well-being.

Emily Saltz, Claire Leibowicz

Constitutional AI & AI Ethics Natural Language Processing

2w ago

Brainrot: Deskilling and Addiction are Overlooked AI Risks

AI safety is missing a big piece of the puzzle: the deskilling and addiction risks that could erode our cognitive abilities and mental well-being.

Ilias Chalkidis, Anders Søgaard

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

Camilla Quaresmini +5 · 2w ago

Beyond Distributive Justice: Hermeneutical Fairness in Ad Delivery

Online advertising can harm users not just through unequal distribution of opportunities, but also by systematically depriving certain groups of relevant concepts or saturating them with skewed framings.

Camilla Quaresmini, Valentina Breschi, Jessica Leoni +3

Constitutional AI & AI Ethics Natural Language Processing Recommendation & Information Retrieval

UW · 2w ago · also Rutgers

Cheap Expertise: Mapping and Challenging Industry Perspectives in the Expert Data Gig Economy

AI data annotation companies are publicly framing human expertise as a commodity ripe for disruption, potentially devaluing traditional forms of knowledge and institutional authority.

Constitutional AI & AI Ethics Data Curation & Synthetic Data Natural Language Processing

J. Bono · 2w ago

The Adversarial Discount - AI, Signal Correlation, and the Cybersecurity Arms Race

Threat intelligence sharing can completely neutralize an attacker's advantage gained from increasing the number of attack surfaces.

J. Bono

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

Chun Yin Chiu · 2w ago

Towards a Zero-Trust Supply-Chain Assurance Rubric for ORAN RIC Applications

Securely onboarding third-party apps in Open RAN just got easier: a new zero-trust rubric offers explicit Accept/Escalate/Block decisions.

Chun Yin Chiu

Code Generation & Program Synthesis Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

D. Valadares +7 · 2w ago

Internet of Things Security: A Survey on Common Attacks

The sheer breadth of IoT attack vectors, from node replication to skimming, highlights the urgent need for comprehensive security strategies that address device limitations and lack of standardization.

D. Valadares, Luiz Antonio Pereira Silva, D. H. D. M. Marques +5

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

National Institute of Science Education and Research (NISER) · 2w ago

Graph Reconstruction from Differentially Private GNN Explanations

Releasing differentially private explanations of GNN predictions doesn't hide your graph structure as much as you think: adversaries can reconstruct it with surprising accuracy.

Rishi Raj Sahoo, Jyotirmaya Shivottam, Subhankar Mishra

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp Red-Teaming & Adversarial Robustness

May 2, 2026

Google Research · 2w ago · also TAU

Hallucinations Undermine Trust; Metacognition is a Way Forward

LLMs' persistent hallucinations aren't just about lacking knowledge, but about lacking the self-awareness to know what they *don't* know, suggesting uncertainty expression is key to building trustworthy AI.

G. Yona, Mor Geva, Yossi Matias

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks RLHF & Preference Learning
