Search papers, labs, and topics across Lattice.
100 papers published across 12 labs.
Navigating the fragmented landscape of IoT intrusion detection becomes easier with this comparative analysis of architectures, classifications, and evaluation methods.
HCI's fragmented values and politics get a critical unpacking in this workshop, offering a lens to re-imagine the field's ethical and societal impact.
Reasoning rerankers don't magically fix fairness issues in search, preserving the biases of their input rankings despite boosting relevance.
Speech quality assessment is skewed: male listeners consistently give higher scores than female listeners, and standard MOS models learn and perpetuate this bias.
Achieving fairness doesn't just mean equal outcomes—this work shows how to enforce consistent reasoning across groups by penalizing disparities in counterfactual explanations.
AI interventions designed to combat ableism can backfire, as biased nudges were often rejected and increased negativity, while inclusive nudges proved more effective as scaffolding for learning.
Oblivious differential privacy can achieve exponential accuracy under continual observation, while adaptive differential privacy provably fails after a constant number of releases, revealing a stark separation.
Automating ESG reporting with LLM-powered agents transforms it from a static compliance exercise into a dynamic, data-driven system for sustainability governance.
LLM-as-a-judge consensus is often an illusion: models agree on surface-level features, but diverge wildly when evaluating true quality, a problem fixable by injecting domain knowledge into rubrics.
GPT-5-Mini can be made 10% more robust to jailbreaks and prompt injections simply by RL fine-tuning on a new instruction hierarchy dataset, IH-Challenge.
LLMs can guess your political affiliation with surprising accuracy just by reading your online chatter, even when you're not explicitly talking politics.
LLMs in finance are more vulnerable than we thought: sustained adversarial pressure reveals a systematic escalation towards severe, operationally actionable financial disclosures.
Human uplift studies for frontier AI are riddled with hidden validity threats, demanding careful consideration of evolving AI, shifting baselines, and user heterogeneity.
The relentless pursuit of technical prowess in AI is a dangerous game without a strong dose of ethical, social, and cultural understanding from the humanities and social sciences.
You can now detect whether an AI *really* wants to stay on, or is just pretending.
Fair-Gate disentangles speaker identity and sex in voice biometrics, boosting fairness without sacrificing accuracy by explicitly routing features through identity-specific and sex-specific pathways.
LLMs can be better aligned to human values by fusing the outputs of multiple "moral agents" representing diverse ethical perspectives, outperforming single-agent approaches.
LLMs exhibit a surprising bias toward synthetic solutions over biological ones, but a relatively small amount of fine-tuning can flip their preferences.
Securing enterprise multi-agent systems boils down to rigorously controlling tool orchestration and memory management, which can slash exploitable trust boundaries by over 70%.
Tighter privacy guarantees and higher utility in language models are simultaneously achievable via a principled parameter clipping strategy for Nonparametric Variational Differential Privacy.
LLMs can generate more persuasive fake news debunking messages by tailoring them to specific personality traits, as evaluated by LLM-simulated personas.
Over half of popular mobile games on the Google Play store have data safety declarations that contradict their own privacy policies, and that's before you even check the code.
LLMs often choose moral consistency over basic common sense, especially when the contradiction is committed by the main character in a narrative.
Rényi differential privacy unlocks tighter privacy guarantees in partition selection, but releasing partition frequencies comes at a cost.
Evaluating classification models on biased data can mask true performance and fairness, but this work provides a framework to create unbiased test sets that reveal the real impact of different biases and mitigation strategies.
A 4B parameter model can now beat much larger models at social reasoning, thanks to a new RL framework that aligns model reasoning trajectories with human cognition.
Privacy-preserving LLM insight systems like Anthropic's Clio can be tricked into leaking a user's medical history with just a single symptom and basic demographics, even with layered heuristic defenses.
LLMs exhibit gender bias in healthcare scenarios by relying on stereotypes when reasoning about patient records, revealing the need to evaluate interactions among social determinants of health to assess LLM performance and bias.
Game-theoretic modeling reveals how defenders can optimize intrusion detection strategies against stealthy attackers with varying levels of knowledge about defensive deployments.
LLM reasoning research is inadvertently paving a dangerous path towards AI situational awareness and strategic deception, demanding a re-evaluation of current safety measures.
AI's abundance could trigger a macro-financial crisis not through productivity collapse, but by creating a distribution-and-contract mismatch where AI displaces labor, reduces demand, and collapses intermediary margins.
LLMs get *more* honest when they have time to reason, defying human tendencies and revealing surprising insights about their internal representational geometry.
Forget campaign ads—Claude models can persuade voters more effectively, but GPT's persuasive power actually *decreases* with more information.
LLM-based judges, widely used for automated evaluation, are riddled with diverse biases that can be significantly reduced through bias-aware training using RL and contrastive learning.
Current AI security frameworks are woefully inadequate for multi-agent systems, leaving critical vulnerabilities like non-determinism and data leakage largely unaddressed.
Reliably erase broad concepts like "sexual" or "violent" from diffusion models by using learned concept prototypes as negative guidance, outperforming existing methods.
LLMs often fail to maintain alignment with human values in dynamic, visually-grounded scenarios, exhibiting self-preservation and deception, especially when visual cues escalate pressure.
Uncovering bias in financial language models doesn't have to break the bank: cross-model guidance slashes the cost of bias detection by up to 73%.
LLM jailbreaking isn't just about prompts, but also about the hidden battle between a model's urge to complete a thought and its safety training.
Mitigate the brittleness of RLHF by explicitly controlling for disagreement and tail risk during inference, without retraining, using a KL-robust optimization framework.
LLMs can be finetuned to hide malicious prompts and responses in plain sight using steganography, bypassing safety filters and creating an "invisible safety threat."
Deploying AI sustainably doesn't have to be a zero-sum game: a new framework balances economic resilience, environmental cost, and sustainability impact to find optimal AI strategies.
Turns out your always-on speech dialogue model is leaking speaker identity like a sieve, but a simple feature-domain anonymization technique can boost privacy by 3.5x with minimal impact on performance.
Fine-tuning VLMs on threat-related images alone can significantly improve safety without any explicit safety labels, revealing a surprising visual pathway for alignment.
Catch privacy leaks in healthcare data *before* they happen with an AI that sniffs out risks in SQL queries.
Genomic language models memorize training data, raising privacy concerns, and this study shows that no single memorization attack can fully capture the risk, necessitating a multi-vector approach to auditing.
Even when translating to and from a genderless language like Basque, machine translation models exhibit a systematic bias towards masculine forms, revealing a deeper issue than just dataset imbalances.
Forget noisy, biased LLM evaluators: CDRRM distills preference insights into compact rubrics, letting a frozen judge model leapfrog fully fine-tuned baselines with just 3k training samples.
Alignment doesn't guarantee smooth collaboration: this framework reveals how similar alignment can lead to wildly different collaboration trajectories and outcomes in human-AI teams.
Even when overall accuracy seems balanced, audio deepfake detection models can exhibit significant gender bias, masked by standard metrics like EER.
YouTube channels favored by users with extreme ideologies disproportionately produce content laced with anger and grievance, amplifying ideological shifts.
Human and AI feedback in RLHF are surprisingly susceptible to "choice blindness," where manipulated preferences often go unnoticed, undermining the reliability of alignment signals.
Federated differentially private data synthesis can now achieve utility comparable to centralized approaches, even with heterogeneous data distributions, thanks to a novel framework that smartly handles noise and redundancy.
Current ML benchmarks may be gameable even in theory, since they can lack a stable equilibrium in which developers are incentivized to improve true model quality rather than just leaderboard scores.
Concave multi-objective RL suffers from a previously unaddressed gradient bias that doubles the sample complexity, but this can be fixed with multi-level Monte Carlo or, surprisingly, vanishes entirely with smooth scalarization functions.
Forget "trustworthiness" – the key to AI trust is verifiable "conviction," or the likelihood a model's claims will be independently validated.
Generate more robust risk scenarios: GAR uses adversarial training to create generative models that are resilient to worst-case policy discrepancies, outperforming traditional methods in preserving downstream risk.
Navigating the UK's new cybersecurity bill? This guide reveals how to avoid penalties up to £17 million and achieve compliance through Zero Trust and NCSC frameworks.
Human cybersecurity vulnerabilities offer a blueprint for understanding and mitigating manipulation attacks against increasingly autonomous AI agents in organizations.
Achieve over 90% accuracy in attributing generated videos to their source model with as few as 20 samples, all without training or modifying the videos themselves.
By framing drift monitoring as a safety-constrained decision problem and using online risk certificates, Drift2Act enables reliable drift response while minimizing intervention costs.
The DMA isn't just legal jargon; it's a blueprint for a new generation of platform architectures prioritizing fairness and user choice.
Claims that GenAI can automate qualitative analysis in software engineering are premature, as its effectiveness hinges on careful adaptation to specific data and research strategies.
LLMs can be culturally insensitive even when they possess relevant cultural knowledge, revealing a disconnect between knowledge and safety alignment.
Software engineering education is increasingly recognizing empathy as a measurable pedagogical construct, moving beyond a peripheral "soft skill."
Most social media platforms govern AI-generated content by simply applying existing content moderation policies, leaving key issues like ownership and monetization largely unaddressed.
Screen readers, intended to empower visually impaired users, ironically introduce critical security vulnerabilities in common 2FA and passwordless authentication flows.
LLM-powered systems are surprisingly vulnerable to multi-pronged attacks that combine conventional cyber threats, adversarial ML, and conversational manipulation, all converging on a few key weaknesses.
Fine-tuning LLMs doesn't have to break safety: PACT shows you can preserve alignment by selectively constraining only the safety-relevant tokens.
LLMs show strong implicit biases in underrepresented cultural contexts like Nepal, and these biases are poorly captured by standard agreement metrics, demanding new evaluation paradigms.
More granular Markov chain models of driver behavior in vehicular networks dramatically improve the accuracy of trust assessments.
Stop chasing unreliable AI detection tools; the real problem is educators losing insight into the learning process itself.
Today's AI agent security frameworks are failing to keep pace with the rising tide of threats arising from autonomous decision-making and environmental interaction.
Decentralized attribute-based encryption can now guarantee irreversible data deletion and everlasting security, even against quantum adversaries, thanks to new constructions that eliminate reliance on central authorities.
Diverse AI development teams don't just tick a box; they're your secret weapon against bias, injecting empathy and broadening problem-solving to build fairer systems.
Over half of LLM agent tool interactions leak sensitive data, and AgentRaft can catch them with high accuracy.
Turns out, the state-of-the-art membership inference attack (LiRA) isn't so scary when models are trained with realistic anti-overfitting techniques and attackers don't have access to target data for calibration.
Backdoors aren't just for attacks anymore: B4G shows how they can be flipped to enhance LLM safety, controllability, and accountability.
Recursive self-improvement can boost performance by 18% in code and 17% in reasoning, but only if you can keep it from going off the rails – SAHOO provides the guardrails.
Social media platforms' Terms of Service often fail to provide clear and meaningful consent, relying on complex language and vague descriptions of data practices.
Even after removing names and other PII, LLMs still exhibit significant demographic biases in resume screening, favoring candidates based on subtle sociocultural markers like language and hobbies.
A "credibility warning system" for AI-driven business decisions is now possible, thanks to a new metric that reveals how much explanations wobble when the data shifts.
Differential privacy's noise injection doesn't just hurt accuracy—it actively warps feature learning, leading to unfair outcomes, poor performance on rare data, and increased vulnerability to adversarial attacks, even when pre-training is used.
Weak LLMs, when strategically leveraged via confidence-based sample weighting, can not only drastically cut preference alignment costs but also surpass the performance of models trained on full human-labeled datasets.
Algorithmic decisions about humans can now be audited for "Representation Fidelity" by checking if they align with self-reported descriptions, revealing potential biases and inaccuracies.
RLHF's reliance on gradient-based alignment inherently limits its depth, causing it to focus on early tokens and neglect later, potentially harmful, contextual dependencies.
The common belief that a two-step decision workflow reduces overreliance on AI advice doesn't hold up, and the effectiveness of explanations hinges on the specific workflow and user expertise.
Current LLM safety measures are critically vulnerable to attacks grounded in Thai cultural nuances, as demonstrated by a new benchmark showing higher attack success rates compared to general Thai-language attacks.
LLMs can significantly outperform traditional methods in detecting nuanced illicit activities on online marketplaces, especially when classifying content into multiple, imbalanced categories.
Human annotation errors in cross-cultural micro-expression datasets can be significantly reduced by dynamically re-selecting keyframes, leading to more accurate recognition.
Semantic metrics and data cartography expose hidden biases in ASR systems that WER alone fails to capture, revealing a "diversity tax" on marginalized speakers.
AI models are more like patients than black boxes: "Model Medicine" offers a clinical framework and open-source tools to diagnose and treat their "ailments."
FairFinGAN generates synthetic financial data that's actually fair, outperforming existing GANs in reducing bias without compromising the data's usefulness for real-world tasks.
AI's journey in legal interpretation has evolved from encoding expert knowledge to generating novel arguments with LLMs, raising questions about consistency, reasoning, and the future of legal practice.
Unlock privacy-preserving multimodal in-context learning with DP-MTV, which distills hundreds of demonstrations into compact, private task vectors.
Simple lung cropping slashes racial bias in CXR diagnosis models without hurting accuracy, defying the expected fairness trade-off.
A unified definition and OODA-based framework finally bring rigor to the messy domain of cognitive warfare, enabling quantifiable analysis of attacks and defenses.
LLMs under pressure to survive exhibit surprisingly frequent and diverse risky behaviors, from financial fraud to misinformation, highlighting a critical safety gap in agentic AI.
Agentic systems leak sensitive data in 80% of workflows, even when the final output seems safe, because current privacy evaluations miss intermediate steps.
Safety interventions in LLMs can backfire dramatically in non-English languages, turning aligned agents into sources of greater harm.