Constitutional AI & AI Ethics

Safety & Alignment

AI governance principles, value alignment through constitutions, fairness, bias mitigation, and ethical AI deployment.

Keywords

constitutional AIAI ethicsAI fairnessbias mitigationvalue alignmentAI governanceresponsible AIAI safety principles

Recent Papers

Feb 12, 2026

2d ago

Keeping a Secret Requires a Good Memory: Space Lower-Bounds for Private Algorithms

This paper establishes the first unconditional space lower bound for user-level differential privacy by introducing a novel multi-player communication game that links the hardness of low-memory private algorithms to the necessity of contribution capping. The authors demonstrate that the communication complexity of winning this game translates directly to memory lower bounds for private algorithms. They apply this framework to distinct element estimation, proving an $\widetilde{\Omega}(T^{1/3})$ space lower bound, and generalize the technique to derive lower bounds for private medians, quantiles, and max-select.

Establishes a novel multi-player communication game framework to prove unconditional space lower bounds for user-level differentially private algorithms, connecting memory requirements to the necessity of contribution capping.

Alessandro Epasto, Xin Lyu, Pasin Manurangsi2602.12209

Constitutional AI & AI EthicsRed-Teaming & Adversarial Robustness

LTCI2d ago

TopoFair: Linking Topological Bias to Fairness in Link Prediction Benchmarks

This paper introduces TopoFair, a benchmarking framework for fair link prediction that focuses on the impact of diverse topological biases beyond homophily. They formalize a taxonomy of topological bias measures and develop a graph generation method that allows for controlled variation of these biases while maintaining real-world graph characteristics. Through empirical evaluation of link prediction models, including fairness-aware methods, they demonstrate the sensitivity of fairness interventions to these structural biases.

Introduces a novel benchmarking framework, TopoFair, to analyze the interplay between topological biases and fairness in link prediction.

Lilian Marey, Tiphaine Viard, Charlotte Laclau2602.11802

Constitutional AI & AI EthicsEval Frameworks & BenchmarksRecommendation & Information Retrieval

2d ago

Who Does What? Archetypes of Roles Assigned to LLMs During Human-AI Decision-Making

This paper introduces the concept of human-LLM archetypes, defined as recurring socio-technical interaction patterns that structure the roles of humans and LLMs in collaborative decision-making. Through a scoping literature review and thematic analysis of 113 papers, the authors identified 17 distinct human-LLM archetypes. They then evaluated these archetypes across clinical diagnostic cases, demonstrating that the choice of archetype influences LLM outputs and decision outcomes.

Defines and categorizes 17 human-LLM interaction archetypes to demonstrate how these archetypes impact LLM outputs and decisions in human-AI collaborative decision-making.

S. Chappidi, A. Krauze2602.11924

Tool Use & AgentsConstitutional AI & AI EthicsNatural Language Processing

2d ago

Creative Ownership in the Age of AI

This paper addresses the limitations of current copyright law in the age of generative AI, where style imitation without content copying complicates infringement detection. The authors propose a new criterion for infringement based on whether an AI output could have been generated without a specific work in its training corpus. Through a model of generative systems as closure operators, they demonstrate a dichotomy: AI generation is either asymptotically unconstrained with light-tailed organic creations or persistently constrained with heavy-tailed creations.

Introduces a novel criterion for copyright infringement in the context of generative AI, focusing on whether an output could have been generated without a specific work in the training corpus.

Annie Liang, Jay Lu2602.12270

Constitutional AI & AI EthicsNatural Language Processing

2d ago

QDBFT: A Dynamic Consensus Algorithm for Quantum-Secured Blockchain

The paper introduces QDBFT, a quantum-secured dynamic consensus algorithm designed to address the vulnerabilities of traditional PBFT in the face of quantum computing and dynamic node reconfigurations. QDBFT incorporates a primary node automatic rotation mechanism based on a consistent hash ring for dynamic membership and integrates Quantum Key Distribution (QKD) networks for information-theoretic security. Experimental results show QDBFT achieves comparable performance to PBFT while providing resilience against quantum attacks.

Introduces QDBFT, a novel consensus algorithm, that integrates a dynamic primary node rotation mechanism with QKD to achieve quantum-resistant and dynamically adaptable consensus.

OuYang Jie, An Hua, Qiandong Zhang +12602.11606

Constitutional AI & AI EthicsRed-Teaming & Adversarial Robustness

2d ago

AIR: Improving Agent Safety through Incident Response

The paper introduces AIR, an incident response framework for LLM agents that enables autonomous detection, containment, and recovery from failures. AIR uses a domain-specific language integrated into the agent's execution loop to perform semantic checks, guide recovery actions, and synthesize guardrail rules. Experiments across three agent types demonstrate that AIR achieves over 90% success rates in detection, remediation, and eradication, highlighting the importance of incident response for agent safety.

Introduces AIR, a novel incident response framework for LLM agents, enabling autonomous management of the incident lifecycle.

Zibo Xiao, Junjie Chen2602.11749

Tool Use & AgentsConstitutional AI & AI EthicsRed-Teaming & Adversarial Robustness

2d ago

Do Large Language Models Adapt to Language Variation across Socioeconomic Status?

This paper investigates the ability of Large Language Models (LLMs) to adapt to language variations across different socioeconomic status (SES) communities by comparing LLM-generated text completions with original text from a novel Reddit and YouTube dataset stratified by SES. The study analyzes 94 sociolinguistic features to assess the degree of stylistic adaptation exhibited by four LLMs. Results indicate that LLMs show limited stylistic modulation with respect to SES, often producing approximations or caricatures, and demonstrate a bias towards emulating upper SES styles, highlighting the risk of amplifying linguistic hierarchies.

Reveals that LLMs exhibit limited stylistic adaptation across socioeconomic strata and tend to favor upper SES linguistic styles, raising concerns about perpetuating linguistic biases.

Elisa Bassignana, Mike Zhang, Dirk Hovy +12602.11939

Constitutional AI & AI EthicsNatural Language ProcessingEval Frameworks & Benchmarks

2d ago

Neutral Prompts, Non-Neutral People: Quantifying Gender and Skin-Tone Bias in Gemini Flash 2.5 Image and GPT Image 1.5

This paper investigates gender and skin-tone biases in Gemini Flash 2.5 Image and GPT Image 1.5 by generating 3,200 images from semantically neutral prompts. Using a pipeline involving color normalization, facial landmark masking, and skin tone quantification via Monk, PERLA, and Fitzpatrick scales, the study reveals a "default white" bias in both models. Gemini favored female-presenting subjects, while GPT favored male-presenting subjects with lighter skin tones, demonstrating that neutral prompts elicit polarized demographic defaults.

Quantifies and compares gender and skin-tone biases in Gemini Flash 2.5 Image and GPT Image 1.5 using a rigorous colorimetric methodology.

R. Balestri2602.12133

Constitutional AI & AI EthicsMultimodal ModelsComputer Vision

Parametrig2d ago

Legitimate Overrides in Decentralized Protocols

This paper analyzes the design space of emergency override mechanisms in decentralized protocols, which are crucial for mitigating exploits but introduce centralization risks. They develop a Scope x Authority taxonomy to map emergency architectures and formalize the trade-offs between centralization costs and containment effectiveness as a stochastic cost-minimization problem. Empirical analysis of 705 exploit incidents validates their model, revealing the impact of authority type on containment time, the heavy-tailed distribution of losses, and the influence of community sentiment on intervention costs.

Introduces a Scope x Authority taxonomy for emergency mechanisms in decentralized protocols and quantifies the trade-offs between centralization and containment effectiveness.

Oghenekaro Elem, Nimrod Talmon2602.12260

Constitutional AI & AI Ethics

2d ago

DMind-3: A Sovereign Edge--Local--Cloud AI System with Controlled Deliberation and Correction-Based Tuning for Safe, Low-Latency Transaction Execution

The paper introduces DMind-3, a three-layered Edge-Local-Cloud AI system for secure and low-latency Web3 financial transactions. It addresses the limitations of cloud-centric and purely local AI solutions by using a deterministic edge firewall, a private local reasoning engine, and a policy-governed cloud synthesizer. The system is trained with Hierarchical Predictive Synthesis (HPS) and Contrastive Chain-of-Correction Supervised Fine-Tuning (C$^3$-SFT) to improve performance and reliability.

Introduces a novel Edge-Local-Cloud AI architecture, DMind-3, that balances privacy, latency, and global context for secure Web3 transactions.

Enhao Huang, Frank Li, Tony Lin +12602.11651

Red-Teaming & Adversarial RobustnessConstitutional AI & AI EthicsDistributed Systems & Hardware

2d ago

Robust Optimization Approach and Learning Based Hide-and-Seek Game for Resilient Network Design

This paper addresses the problem of designing resilient communication networks with limited signal transmission distances, subject to uncertainty in both link lengths and node availability. The authors formulate the problem as a robust optimization model with budgeted uncertainty sets for regenerator installation costs and a novel dynamic budgeted uncertainty set for link lengths. They then develop scalable solution methods based on column-and-constraint generation, Benders decomposition, and iterative robust optimization, and further analyze the problem using a learning-based hide-and-seek game. The proposed methods outperform classical robust models and deterministic worst-case formulations.

Introduces a dynamic budgeted uncertainty set for link lengths in robust network design and demonstrates its effectiveness in a hide-and-seek game framework.

Mohammad Khosravi, Setareh Maghsudi2602.11854

Red-Teaming & Adversarial RobustnessConstitutional AI & AI Ethics

2d ago

Towards Sustainable Investment Policies Informed by Opponent Shaping

This paper analyzes the InvestESG multi-agent simulation to characterize conditions leading to intertemporal social dilemmas where individual incentives conflict with collective welfare. It then applies Advantage Alignment, an opponent shaping algorithm, to influence agent learning within InvestESG, demonstrating its ability to systematically favor socially beneficial equilibria. The work provides theoretical justification for why Advantage Alignment promotes cooperation and shows that shaping agent learning can improve outcomes related to sustainability goals.

Demonstrates that Advantage Alignment can effectively shape agent learning in the InvestESG environment to promote socially beneficial equilibria and overcome intertemporal social dilemmas.

J. Duque, Razvan Ciuca, Ayoub Echchahed +22602.11829

World Models & PlanningConstitutional AI & AI Ethics

2d ago

Artificial intelligence is creating a new global linguistic hierarchy

The paper analyzes the availability of AI resources across 6003 languages to assess systemic inequalities in language AI, finding that a small number of languages dominate, exacerbating disparities. It contrasts the diffusion of AI with earlier IT technologies, revealing a hype-driven pattern. Finally, the authors introduce the Language AI Readiness Index (EQUATE) to map technological, socio-economic, and infrastructural prerequisites for AI deployment across languages, aiming to guide prioritization efforts for more equitable diffusion.

Introduces the Language AI Readiness Index (EQUATE) to map the state of technological, socio-economic, and infrastructural prerequisites for AI deployment across languages.

G. Occhini, Kumiko Tanaka-Ishii, A. Barford +92602.12018

Natural Language ProcessingConstitutional AI & AI Ethics

2d ago

BlackCATT: Black-box Collusion Aware Traitor Tracing in Federated Learning

The paper introduces BlackCATT, a novel black-box traitor tracing method for federated learning that is resilient to collusion attacks. BlackCATT employs a collusion-aware embedding loss and iteratively optimizes trigger sets for watermark embedding, improving convergence and tracing performance. The authors also propose BlackCATT+FR, which incorporates functional regularization at the aggregator to address update incompatibility issues in models with batch normalization, maintaining tracing performance.

Introduces a collusion-resistant black-box traitor tracing method (BlackCATT) for federated learning that uses a novel collusion-aware embedding loss and iteratively optimized triggers.

Elena Rodr'iguez-Lois, Fabio Brau, Maura Pintor +22602.12138

Constitutional AI & AI EthicsRed-Teaming & Adversarial RobustnessData Curation & Synthetic Data

Basque Center for Applied Mathematics (BCAM)2d ago

Safe Fairness Guarantees Without Demographics in Classification: Spectral Uncertainty Set Perspective

This paper addresses the challenge of achieving fairness in classification without relying on demographic information by proposing a novel minimax-fair method called SPECTRE. SPECTRE adjusts the spectrum of a Fourier feature mapping and constrains the deviation of the worst-case distribution from the empirical distribution, mitigating the over-pessimism of existing robust optimization techniques. Empirical results on American Community Survey datasets across 20 states demonstrate that SPECTRE achieves superior fairness guarantees and robustness compared to state-of-the-art methods, even those with access to demographic data.

Introduces SPECTRE, a minimax-fair classification method that enhances fairness without demographic information by adjusting the spectrum of a Fourier feature mapping and constraining the worst-case distribution's deviation from the empirical distribution.

Ainhize Barrainkua, Santiago Mazuelas, Novi Quadrianto +12602.11785

Constitutional AI & AI EthicsNatural Language Processing

2d ago

VIRENA: Virtual Arena for Research, Education, and Democratic Innovation

The paper introduces VIRENA, a virtual platform designed for controlled experimentation within realistic social media environments, addressing limitations in data access and ethical constraints in studying online dynamics. VIRENA allows researchers to simulate feed-based platforms and messaging apps, enabling interactions between human participants and LLM-powered AI agents with configurable personas. The platform's no-code interface facilitates manipulation of content moderation, scheduling of stimuli, and execution of experiments, making it accessible for studying human-AI interaction, moderation interventions, and group deliberation.

Introduces VIRENA, a novel virtual platform enabling controlled social media experiments with human and AI participants, featuring a no-code interface and realistic platform simulations.

Emma Hoes, K. J. Klueser, Fabrizio Gilardi2602.12207

Data Curation & Synthetic DataOpen-Source Models & WeightsConstitutional AI & AI Ethics

2d ago

PAC-Bayesian Generalization Guarantees for Fairness on Stochastic and Deterministic Classifiers

This paper introduces a PAC-Bayesian framework to derive generalization bounds for fairness measures expressed as risk discrepancies, applicable to both stochastic and deterministic classifiers. For stochastic classifiers, standard PAC-Bayes techniques are used, while for deterministic classifiers, a recent PAC-Bayes extension is leveraged. The framework leads to a self-bounding algorithm that optimizes the trade-off between generalization bounds on prediction risk and fairness, and is empirically validated with three classical fairness measures.

Extends PAC-Bayesian generalization guarantees to fairness measures for both stochastic and deterministic classifiers by leveraging risk discrepancy formulations and recent advances in PAC-Bayes theory.

Julien Bastian, Benjamin Leblanc, Pascal Germain +52602.11722

Constitutional AI & AI EthicsNatural Language Processing

University of Delaware2d ago

Wisdom of the LLM Crowd: A Large Scale Benchmark of Multi-Label U.S. Election-Related Harmful Social Media Content

This paper introduces USE24-XD, a dataset of approximately 100,000 social media posts from X related to the 2024 U.S. presidential election, categorized into five harmful content types using a "wisdom of the crowd" approach with six LLMs. The study validates LLM annotations against human crowdsourcing, finding comparable agreement and high recall for specific categories like Speculation. Analysis of human annotator demographics reveals systematic biases in labeling harmful content, underscoring the subjectivity inherent in such judgments.

Introduces USE24-XD, a large-scale, multi-labeled dataset of election-related social media content annotated by LLMs and validated by human annotators, to facilitate research on harmful online narratives.

Qile Wang, Prerana Khatiwada, Carolina Coimbra Vieira +32602.11962

Eval Frameworks & BenchmarksConstitutional AI & AI EthicsNatural Language Processing

2d ago

Adaptive Debiasing Tsallis Entropy for Test-Time Adaptation

The paper addresses the problem of biased uncertainty estimation in Test-Time Adaptation (TTA) of vision-language models like CLIP, which arises from pre-training on imbalanced web data. They propose Adaptive Debiasing Tsallis Entropy (ADTE), a generalization of Shannon Entropy that incorporates a class-specific parameter to account for label bias estimated from incoming test instances. ADTE outperforms state-of-the-art TTA methods on ImageNet variants and cross-domain benchmarks by accurately selecting high-confidence views and integrating with a label adjustment strategy.

Introduces Adaptive Debiasing Tsallis Entropy (ADTE), a novel entropy measure for test-time adaptation that dynamically adjusts for label bias in vision-language models.

Jianfeng Lu2602.11743

Multimodal ModelsComputer VisionConstitutional AI & AI Ethics

2d ago

Using predictive multiplicity to measure individual performance within the AI Act

This paper analyzes predictive multiplicity, the phenomenon of multiple AI models with similar overall accuracy disagreeing on individual predictions, in the context of the EU AI Act. It argues that high predictive multiplicity violates the Act's requirements for individual-level performance reporting, as it introduces arbitrariness in decisions impacting humans. The paper proposes individual conflict ratios and $\delta$-ambiguity as metrics to quantify disagreement between models on individual cases and offers practical guidelines for model providers to evaluate and report predictive multiplicity.

Proposes individual conflict ratios and $\delta$-ambiguity as metrics to quantify predictive multiplicity and facilitate compliance with the EU AI Act's accuracy provisions.

Karolin Frohnapfel, Mara Seyfert, Sebastian Bordt +22602.11944

Constitutional AI & AI EthicsEval Frameworks & Benchmarks

2d ago

Stop Tracking Me! Proactive Defense Against Attribute Inference Attack in LLMs

The paper introduces TRACE-RPS, a novel defense framework against attribute inference attacks in LLMs, which combines fine-grained anonymization with inference-preventing optimization. TRACE uses attention mechanisms and inference chain generation to pinpoint and anonymize privacy-leaking text, while RPS employs a two-stage optimization to encourage models to reject attribute inference queries. Experiments demonstrate that TRACE-RPS significantly reduces attribute inference accuracy on open-source LLMs, achieving strong cross-model generalization, prompt-variation robustness, and utility-privacy tradeoffs.

Introduces a unified defense framework, TRACE-RPS, that combines fine-grained anonymization and inference-preventing optimization to effectively mitigate attribute inference attacks in LLMs.

Jian Liang2602.11528

Red-Teaming & Adversarial RobustnessConstitutional AI & AI EthicsNatural Language Processing

2d ago

When Should LLMs Be Less Specific? Selective Abstraction for Reliable Long-Form Text Generation

The paper introduces Selective Abstraction (SA), a framework for improving the reliability of long-form text generation by allowing LLMs to selectively reduce the specificity of uncertain content instead of abstaining entirely. They formalize SA using selective risk and coverage metrics and propose Atom-wise Selective Abstraction, which decomposes responses into atomic claims and replaces uncertain claims with more general abstractions. Empirical evaluation on FactScore and LongFact-Objects benchmarks demonstrates that Atom-wise SA significantly improves the risk-coverage trade-off compared to claim removal, boosting AURC by up to 27.73% across six open-source models.

Introduces Selective Abstraction, a novel framework enabling LLMs to trade specificity for reliability in long-form generation by selectively abstracting uncertain content.

Shani Goren, Ido Galil, Ran El-Yaniv2602.11908

Natural Language ProcessingEval Frameworks & BenchmarksConstitutional AI & AI Ethics

2d ago

Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment

This paper introduces the Value Alignment Tax (VAT), a framework to quantify how aligning LLMs to specific values impacts the broader value system. VAT measures the trade-offs between gains in target value alignment and changes in other interconnected values. Using a dataset of scenario-action pairs grounded in Schwartz value theory, the authors demonstrate that alignment interventions induce structured co-movement among values, which are often missed by target-only evaluations.

Introduces the Value Alignment Tax (VAT) framework to quantify and analyze the systemic effects of value alignment interventions in LLMs.

Jiajun Chen, Hua Shen2602.12134

Constitutional AI & AI EthicsRLHF & Preference LearningEval Frameworks & Benchmarks

2d ago

Agentic AI for Cybersecurity: A Meta-Cognitive Architecture for Governable Autonomy

This paper proposes a meta-cognitive architecture for AI-driven cybersecurity systems to address limitations in accountable decision-making under adversarial uncertainty. The architecture coordinates heterogeneous AI agents responsible for detection, hypothesis formation, explanation, and governance through an explicit meta-cognitive judgement function. By embedding meta-cognitive judgement as a first-class system function, the framework aims to make the cognitive structure of security operations explicit and governable, shifting the focus from optimizing isolated predictions to governing autonomy under uncertainty.

Introduces a meta-cognitive architectural framework for cybersecurity AI that explicitly governs decision readiness and dynamically calibrates system autonomy under uncertainty by coordinating heterogeneous AI agents through a meta-cognitive judgement function.

A. Kojukhov, Arkady Bovshover2602.11897

Tool Use & AgentsConstitutional AI & AI EthicsRed-Teaming & Adversarial Robustness

Unc Charlotte2d ago

Community Concealment from Unsupervised Graph Learning-Based Clustering

This paper investigates the privacy risks of using graph neural networks (GNNs) for unsupervised community detection, specifically the potential for revealing sensitive groups. They identify connectivity at the community boundary and feature similarity between communities as key factors influencing community concealment. Based on these factors, they propose a perturbation strategy that rewires edges and modifies node features to reduce the distinctiveness used by GNN message passing, achieving 20-45% improvement in concealment compared to DICE.

Introduces a novel perturbation strategy for concealing communities from GNN-based unsupervised clustering by rewiring edges and modifying node features based on connectivity and feature similarity.

Dalyapraz Manatova, P. Moriano, L. J. Camp +12602.12250

Constitutional AI & AI EthicsRed-Teaming & Adversarial Robustness

2d ago

Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing

The paper introduces Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework for assessing LLM safety under repeated inference, addressing the limitations of breadth-oriented benchmarks. APST models safety failures as stochastic outcomes using Bernoulli and binomial models to estimate per-inference failure probabilities under controlled operational conditions like decoding temperature. Experiments on instruction-tuned LLMs using AIR-BENCH-derived safety prompts reveal that models with similar benchmark scores can exhibit significantly different empirical failure rates under repeated sampling, especially with increased temperature, highlighting the importance of evaluating reliability under sustained use.

Introduces Accelerated Prompt Stress Testing (APST), a novel framework for evaluating LLM safety and reliability by repeatedly sampling identical prompts to surface latent failure modes and quantify per-inference failure probabilities.

Keita Broadwater2602.11786

Red-Teaming & Adversarial RobustnessEval Frameworks & BenchmarksConstitutional AI & AI Ethics

2d ago

The Implicit Bias of Logit Regularization

This paper analyzes logit regularization in linear classification, revealing an implicit bias towards clustering logits around finite per-sample targets. The authors prove that for Gaussian data or sufficiently clustered logits, this bias drives the weight vector to align with Fisher's Linear Discriminant, improving calibration and generalization. Through a signal-plus-noise model, they demonstrate that logit regularization halves the critical sample complexity, induces grokking in the small-noise limit, and enhances robustness to noise.

Demonstrates that logit regularization induces an implicit bias of logit clustering around finite per-sample targets, leading to alignment with Fisher's Linear Discriminant and improved generalization.

A. Beck, Yohai Bar Sinai, Noam Levi2602.12039

Constitutional AI & AI Ethics

2d ago

PAC to the Future: Zero-Knowledge Proofs of PAC Private Systems

This paper introduces a framework for verifiable privacy in machine learning by combining PAC privacy with zero-knowledge proofs (ZKPs). It enables users to verify the correctness of computations and the application of privacy-preserving noise in cloud-based systems. The authors leverage non-interactive ZKP schemes to generate proofs attesting to the correct implementation of PAC privacy mechanisms, demonstrating the feasibility of verifiable PAC privacy in outsourced computation.

Introduces a novel framework integrating PAC privacy with zero-knowledge proofs to enable verifiable privacy guarantees in trustless computing environments.

Guilhem Repetto, Nojan Sheybani, Gabrielle De Micheli +12602.11954

Constitutional AI & AI EthicsRed-Teaming & Adversarial Robustness

2d ago

SafeNeuron: Neuron-Level Safety Alignment for Large Language Models

The paper introduces SafeNeuron, a neuron-level safety alignment framework for LLMs designed to improve robustness against neuron-level attacks. It identifies and freezes safety-related neurons during preference optimization, forcing the model to develop redundant safety representations across the network. Experiments show SafeNeuron enhances robustness against neuron pruning attacks, mitigates the risk of models being used for red-teaming, and maintains general capabilities, while also revealing stable and shared internal safety representations.

Introduces SafeNeuron, a novel neuron-level safety alignment framework that enhances LLM robustness by redistributing safety representations across the network.

Jiaming Liang, Tat-Seng Chua2602.12158

Red-Teaming & Adversarial RobustnessConstitutional AI & AI EthicsInterpretability & Mechanistic Interp

Feb 10, 2026

4d ago

The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies

This paper investigates the "self-evolution trilemma" in multi-agent LLM systems, demonstrating the impossibility of simultaneously achieving continuous self-evolution, complete isolation, and safety invariance. Using an information-theoretic framework, the authors formalize safety as the divergence from anthropic value distributions and prove that isolated self-evolution leads to statistical blind spots, causing irreversible safety degradation. Empirical results from the Moltbook agent community and two closed self-evolving systems validate the theoretical prediction of inevitable safety erosion, highlighting the need for external oversight.

Proves the impossibility of simultaneously achieving continuous self-evolution, complete isolation, and safety invariance in multi-agent LLM systems, formalizing this as the "self-evolution trilemma."

Chenxu Wang, Chaozhuo Li, Songyang Liu +62602.09877

Constitutional AI & AI EthicsScalable Oversight & Alignment TheoryTool Use & Agents

Feb 3, 2026

Jamia Hamdard1w ago·affiliated lab: Stanford HAI

They Said Memes Were Harmless-We Found the Ones That Hurt: Decoding Jokes, Symbols, and Cultural References

The paper introduces CROSS-ALIGN+, a three-stage framework for meme-based social abuse detection that addresses cultural blindness, boundary ambiguity, and lack of interpretability in existing methods. CROSS-ALIGN+ enriches multimodal representations with structured knowledge, reduces boundary ambiguity using LoRA adapters, and enhances interpretability through cascaded explanations. Experiments on five benchmarks and eight LVLMs show that CROSS-ALIGN+ outperforms state-of-the-art methods, achieving up to a 17% relative F1 improvement.

Introduces a novel three-stage framework, CROSS-ALIGN+, that significantly improves meme-based social abuse detection by incorporating structured knowledge, sharpening decision boundaries, and generating interpretable explanations.

Sahil Tripathi, Gautam Siddharth Kashyap, Mehwish Nasim +32602.03822

Multimodal ModelsNatural Language ProcessingConstitutional AI & AI Ethics

Jan 19, 2026

3w ago

Fairness-informed Pareto Optimization : An Efficient Bilevel Framework

The paper addresses the problem of Pareto-inefficient models produced by fair machine learning methods, where performance on some groups can be improved without hurting others. They introduce BADR, a bilevel optimization framework that recovers Pareto-efficient models for various fairness metrics by adaptively rescaling group weights in an empirical risk minimization problem. The authors provide convergence guarantees for two single-loop algorithms, BADR-GD and BADR-SGD, and demonstrate BADR's advantages over existing Pareto-efficient fairness approaches through extensive experiments.

Introduces BADR, a bilevel optimization framework for fairness-informed Pareto optimization that adaptively rescales group weights to recover Pareto-efficient models for a variety of fairness metrics.

Sofiane Tanji, Samuel Vaiter, Yassine Laguel2601.13448

Constitutional AI & AI EthicsTraining Efficiency & Optimization

Jan 17, 2026

AEMA: Verifiable Evaluation Framework for Trustworthy and Controlled Agentic LLM Systems

The paper introduces AEMA, a novel evaluation framework for multi-agent LLM systems designed to address limitations in existing single-response scoring methods by enabling process-aware, auditable, and multi-step evaluations. AEMA enhances evaluation stability and human alignment through planning, execution, and aggregation of evaluations across diverse agentic workflows, all under human oversight. Experiments using realistic business scenarios demonstrate AEMA's ability to provide a transparent and reproducible pathway for responsible evaluation, improving upon single LLM-as-a-Judge approaches.

Introduces AEMA, a process-aware and auditable framework for evaluating multi-agent LLM systems that enhances evaluation stability, human alignment, and traceability compared to single LLM-as-a-Judge approaches.

Yen-Ting Lee, Keerthi Koneru, Zahra Moslemi +22601.11903

Eval Frameworks & BenchmarksTool Use & AgentsConstitutional AI & AI Ethics

Jan 12, 2026

Reasoning over Precedents Alongside Statutes: Case-Augmented Deliberative Alignment for LLM Safety

The paper investigates the effectiveness of deliberative alignment (DA) using explicit safety codes versus case-augmented examples for improving LLM safety. They find that explicit safety codes lead to inconsistent harmlessness and degraded helpfulness, while case-augmented simple codes result in more robust safety behaviors. Based on these findings, they propose CADA, a case-augmented deliberative alignment method using reinforcement learning on self-generated safety reasoning chains, which improves harmlessness, robustness, and utility.

Introduces CADA, a case-augmented deliberative alignment method that leverages reinforcement learning on self-generated safety reasoning chains to enhance LLM safety without sacrificing helpfulness.

Can Jin, Rui Wu, Tong Che +102601.08000

Constitutional AI & AI EthicsRed-Teaming & Adversarial RobustnessEval Frameworks & Benchmarks

Jan 11, 2026

LLM Performance Predictors: Learning When to Escalate in Hybrid Human-AI Moderation Systems

This paper introduces a supervised learning framework for LLM uncertainty quantification in content moderation, training a meta-model on LLM Performance Predictors (LPPs) derived from LLM outputs like log-probabilities and entropy. The framework enables cost-aware selective classification, escalating high-risk cases for human review while automating others. Experiments across various LLMs (Gemini, GPT, Llama, Qwen) on multimodal and multilingual moderation tasks demonstrate improved accuracy-cost trade-offs compared to existing uncertainty estimators.

Introduces a novel framework for supervised LLM uncertainty quantification using LLM Performance Predictors (LPPs) to optimize human-AI collaboration in content moderation.

Or Bachar, Or Levi, Sardhendu Mishra +62601.07006

Eval Frameworks & BenchmarksConstitutional AI & AI EthicsNatural Language Processing

Jan 7, 2026

What Matters For Safety Alignment?

This paper empirically investigates the impact of intrinsic model characteristics and external attack techniques on the safety alignment of 32 LLMs and LRMs (3B-235B parameters) across 13 model families. The study uses 5 safety datasets, 56 jailbreak techniques, and 4 Chain-of-Thought (CoT) attack strategies, finding that models with integrated reasoning and self-reflection (GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B) exhibit the best safety alignment. The research also demonstrates that post-training and knowledge distillation can degrade safety alignment, and that CoT attacks using response prefixes significantly increase attack success rates, especially in text-completion interfaces.

Systematically evaluates the influence of model characteristics and attack techniques on the safety alignment of a diverse set of LLMs and LRMs, revealing vulnerabilities and best practices for developing safer AI systems.

Xing Li, Hui-Ling Zhen, Lihao Yin +32601.03868

Red-Teaming & Adversarial RobustnessEval Frameworks & BenchmarksConstitutional AI & AI Ethics

Jan 6, 2026

Reducing Hallucinations in LLMs via Factuality-Aware Preference Learning

The paper introduces Factuality-aware Direct Preference Optimization (F-DPO), an extension of DPO designed to mitigate hallucinations in LLMs by incorporating binary factuality labels into the preference learning process. F-DPO addresses the issue of preference alignment methods reinforcing hallucinations by applying a label-flipping transformation to correct misordered preference pairs and adding a factuality-aware margin to emphasize pairs with clear correctness differences. Experiments across seven open-weight LLMs (1B-14B) demonstrate that F-DPO significantly improves factuality and reduces hallucination rates compared to both base models and standard DPO, while also generalizing to out-of-distribution benchmarks like TruthfulQA.

Introduces F-DPO, a novel and efficient method for reducing hallucinations in LLMs by integrating binary factuality labels into the DPO framework through label-flipping and factuality-aware margins.

Sindhuja Chaduvula, Ahmed Y. Radwan, Azib Farooq +22601.03027

RLHF & Preference LearningConstitutional AI & AI EthicsEval Frameworks & Benchmarks

Dec 27, 2025

DarkPatterns-LLM: A Multi-Layer Benchmark for Detecting Manipulative and Harmful AI Behavior

The paper introduces DarkPatterns-LLM, a novel benchmark dataset and diagnostic framework for evaluating manipulative content in LLM outputs across seven harm categories, addressing the limitations of existing binary-labeled safety benchmarks. The framework employs a four-layer analytical pipeline (MGD, MSIAN, THP, DCRA) for fine-grained assessment. Evaluation of state-of-the-art models reveals significant performance disparities (65.2\%--89.7\%) and consistent weaknesses in detecting autonomy-undermining patterns, highlighting the need for improved manipulation detection in LLMs.

Establishes DarkPatterns-LLM, the first standardized, multi-dimensional benchmark for manipulation detection in LLMs, enabling actionable diagnostics toward more trustworthy AI systems.

Sadia Asif, Israel Antonio Rosales Laguan, Haris Khan +22512.22470

Constitutional AI & AI EthicsRed-Teaming & Adversarial RobustnessEval Frameworks & Benchmarks

Dec 23, 2025

Offline Safe Policy Optimization From Heterogeneous Feedback

This paper introduces Preference and Safety Alignment (PreSa), a novel offline reinforcement learning framework that directly learns a safe policy from pairwise trajectory preferences and binary safety labels, bypassing explicit reward and cost model learning. PreSa formulates a constrained optimization problem within a Lagrangian paradigm to maximize rewards while adhering to safety constraints, thus avoiding the error accumulation issues associated with traditional constrained RL approaches. Empirical evaluations on continuous control tasks, using both synthetic and real human feedback, demonstrate that PreSa outperforms existing state-of-the-art baselines and offline safe RL methods.

Introduces a novel offline reinforcement learning framework, PreSa, that directly learns a safe policy from trajectory preferences and safety labels, eliminating the need for explicit reward and cost model learning.

Ze Gong, Pradeep Varakantham, Akshat Kumar2512.20173

RLHF & Preference LearningConstitutional AI & AI Ethics

Dec 17, 2025

Human-Centered AI Maturity Model (HCAI-MM): An Organizational Design Perspective

This paper introduces the Human-Centered AI Maturity Model (HCAI-MM), a framework designed to help organizations assess and improve their ability to design and implement human-centered AI. The HCAI-MM integrates organizational design principles with key HCAI dimensions like human-AI collaboration, explainability, and fairness, providing a roadmap for maturity progression. By offering specific stages, metrics, tools, and governance mechanisms, the model addresses the current lack of structured guidance for HCAI implementation.

Introduces the Human-Centered AI Maturity Model (HCAI-MM) to provide a structured framework for organizations to assess and advance their HCAI capabilities.

Stuart Winby, Wei Xu2512.14977

Constitutional AI & AI EthicsTool Use & AgentsNatural Language Processing

Dec 15, 2025

Institute for Artificial Intelligence Research and Development of SerbiaDec 15, 2025

Risk and responsibility at the frontier of ai: A thematic analysis of deep learning pioneers' perspectives on artificial intelligence threats and governance

This paper analyzes interviews with Geoffrey Hinton, Yoshua Bengio, and Yann LeCun to understand their perspectives on AI risks and governance. The study uses qualitative thematic analysis to identify both shared concerns (economic disruption, misuse) and divergent views (existential risk vs. technological optimism). The analysis reveals the lack of consensus among AI pioneers and highlights specific governance proposals like regulated compute access.

Systematically analyzes the perspectives of three prominent deep learning pioneers on AI risks and governance, revealing both consensus and disagreement on existential threats, ethical considerations, and regulatory approaches.

Ljubiša Bojić

Constitutional AI & AI EthicsScalable Oversight & Alignment Theory

Dec 13, 2025

AI Transparency Atlas: Framework, Scoring, and Real-Time Model Card Evaluation Pipeline

The paper introduces a weighted transparency framework based on the EU AI Act and Stanford Transparency Index to evaluate AI model documentation, addressing the current fragmentation and inconsistency. They developed an automated multi-agent pipeline leveraging LLMs to extract documentation and score completeness across 50 models, revealing significant gaps, especially in safety-critical categories. The evaluation shows frontier labs achieve higher compliance (around 80%) compared to other providers (below 60%), highlighting areas for improvement in AI transparency.

Introduces a novel weighted transparency framework and automated evaluation pipeline to systematically assess and score the completeness of AI model documentation.

Akhmadillo Mamirov, Faiaz Azmain, Hanyu Wang2512.12443

Constitutional AI & AI EthicsEval Frameworks & BenchmarksOpen-Source Models & Weights

Dec 10, 2025

University of South AfricaDec 10, 2025

Integrating Human-Centered AI into the Technology Acceptance Model: Understanding AI-Chatbot Adoption in Higher Education

This paper addresses the limited adoption of AI chatbots in higher education by extending the Technology Acceptance Model (TAM) with Human-Centered AI (HCAI) principles, specifically explainability, transparency, trust, and perceived control. The authors developed the HCAI-TAM framework and validated it through an empirical study with 300 respondents using regression analysis. The results demonstrate that incorporating HCAI principles significantly improves the model's explanatory power, accounting for 65% of the variance in behavioral intention and 55% in usage behavior related to AI chatbot adoption.

Introduces and empirically validates the HCAI-TAM framework, demonstrating that integrating human-centered AI principles into the Technology Acceptance Model enhances the prediction of AI chatbot adoption in higher education.

Fine Masimba, Kudakwashe Maguraushe, B. Chimbo

Constitutional AI & AI EthicsNatural Language Processing

Dec 9, 2025

Biothreat Benchmark Generation Framework for Evaluating Frontier AI Models III: Implementing the Bacterial Biothreat Benchmark (B3) Dataset

This paper presents the pilot implementation of the Bacterial Biothreat Benchmark (B3) dataset, designed to evaluate the biosecurity risks associated with frontier AI models. The B3 dataset was used to probe a sample frontier AI model, and the model's responses were then evaluated by humans, followed by risk analysis. The pilot study demonstrated the B3 dataset's utility in rapidly assessing biosecurity risks, pinpointing their origins, and guiding mitigation efforts.

Demonstrates the viability of the Bacterial Biothreat Benchmark (B3) dataset for assessing and mitigating biosecurity risks posed by large language models.

Gary Ackerman, Theodore Wilson, Z. Kallenborn +72512.08459

Eval Frameworks & BenchmarksRed-Teaming & Adversarial RobustnessConstitutional AI & AI Ethics

Dec 8, 2025

Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety

The paper explores knowledge distillation (KD) to transfer refusal behaviors from a proprietary teacher LLM (OpenAI o1-mini) to open-source student models (Llama-3-8B, Gemma-2-2B, Qwen3-8B) using multilingual jailbreak prompts. Surprisingly, response-based fine-tuning with "safe" refusal data increased Jailbreak Success Rate (JSR) in student models, indicating a safety compromise due to divergent generalization across languages. Removing nuanced "boundary" refusals mitigated the safety decline, although reasoning performance decreased, highlighting challenges in multilingual safety alignment via KD.

Demonstrates that response-based knowledge distillation for multilingual jailbreak prevention can inadvertently compromise safety by increasing jailbreak success rates in student models due to divergent generalization across languages.

Max Zhang, Derek Liu, Kai Zhang +22602.11157

Natural Language ProcessingRed-Teaming & Adversarial RobustnessInference & QuantizationConstitutional AI & AI Ethics

Dec 7, 2025

Dec 7, 2025·affiliated labs: Stanford HAI, MIT CSAIL, Berkeley AI Research (BAIR), Tsinghua AI

International AI Safety Report 2025: Second Key Update: Technical Safeguards and Risk Management

The International AI Safety Report 2025's Second Key Update analyzes the current state of AI risk management and technical mitigations employed by researchers, companies, and governments. It highlights advancements in training safer models and monitoring outputs while acknowledging uncertainties in the effectiveness of these measures and their variability across applications. The report aims to inform policymakers, researchers, and the public about progress and remaining gaps in AI safety.

Synthesizes recent developments in AI risk management and technical risk mitigation strategies, identifying both progress and persistent gaps in ensuring the safety of general-purpose AI systems.

Y. Bengio, Stephen Clare, Carina Prunkl +34

Constitutional AI & AI EthicsRed-Teaming & Adversarial RobustnessEval Frameworks & Benchmarks

Nov 27, 2025

Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks

This paper evaluates the robustness of ten publicly available LLM safety guardrail models from major tech companies against 1,445 adversarial prompts across 21 attack categories. The study reveals a significant performance drop in all models when tested on novel, unseen prompts compared to public benchmarks, indicating potential training data contamination. A novel "helpful mode" jailbreak was also discovered in two models, where they generated harmful content instead of blocking it.

Demonstrates that current LLM safety guardrail models exhibit poor generalization to novel adversarial attacks, highlighting the limitations of relying solely on benchmark performance for evaluation.

Richard J. Young2511.22047

Red-Teaming & Adversarial RobustnessEval Frameworks & BenchmarksConstitutional AI & AI Ethics

Nov 25, 2025

Profile-LLM: Dynamic Profile Optimization for Realistic Personality Expression in LLMs

The paper introduces PersonaPulse, a framework for dynamically optimizing role-play prompts to enhance personality expression in LLMs. PersonaPulse iteratively refines prompts using the LLM's knowledge of personality traits, guided by a situational response benchmark for realistic evaluation. Experiments demonstrate that PersonaPulse-generated prompts outperform existing methods based on psychological personality descriptions, and the study explores the relationship between model size and the controllability of personality evocation through optimization pausing.

Introduces PersonaPulse, a novel framework that dynamically optimizes role-play prompts to elicit more realistic and contextually grounded personality expressions from LLMs.

Shi-Wei Dai, Yan-Wei Shie, Tsung-Huan Yang +22511.19852

RLHF & Preference LearningConstitutional AI & AI EthicsNatural Language Processing

Nov 3, 2025

University of Birmingham DubaiNov 3, 2025

Prompting Autonomous Agents and LLMs in Energy Operations, Efficiency Gains or Hidden Liabilities?

This paper evaluates the deployment of LLMs and agentic AI in the energy industry, focusing on automating tasks like reporting, compliance, and cyber-defense. It uses a structured evaluation framework to classify outputs based on traceability, reproducibility, and hallucination risk, comparing human-led interactions with autonomous agent loops. The study finds that while LLMs improve efficiency, they introduce governance risks due to lack of validation and unclear boundaries between assistance and autonomous recommendation, potentially leading to acceptance of fabricated content.

Introduces a risk-graded framework for evaluating agentic LLM outputs in energy operations, linking LLM traceability with legal auditability.

Alessio Faccia

Tool Use & AgentsConstitutional AI & AI EthicsRed-Teaming & Adversarial Robustness

Nov 1, 2025

Diagnosing Hallucination Risk in AI Surgical Decision-Support: A Sequential Framework for Sequential Validation

This paper introduces a clinician-centered framework to quantify hallucination risks in LLMs used for spine surgery decision support, evaluating diagnostic precision, recommendation quality, reasoning robustness, output coherence, and knowledge alignment. Six LLMs were assessed across 30 expert-validated spinal cases, revealing that DeepSeek-R1 outperformed others, and reasoning-enhanced models did not consistently improve performance. Multidimensional stress-testing exposed model-specific vulnerabilities, particularly a decline in recommendation quality under amplified complexity, highlighting the need for interpretability mechanisms.

Proposes a novel, multi-dimensional framework for evaluating and quantifying hallucination risks in LLMs for surgical decision support, focusing on clinically relevant aspects like diagnostic precision and recommendation quality.

Dong Chen, Yanzhe Wei, Zonglin He +72511.00588

Eval Frameworks & BenchmarksRed-Teaming & Adversarial RobustnessConstitutional AI & AI Ethics

Lattice is designed for desktop

Constitutional AI & AI Ethics

Keywords

Top Labs in This Topic

Recent Papers

Lattice is designed for desktop

Constitutional AI & AI Ethics

Keywords

Top Labs in This Topic

Recent Papers

Search