Constitutional AI & AI Ethics
Safety & Alignment
AI governance principles, value alignment through constitutions, fairness, bias mitigation, and ethical AI deployment.
Recent Papers
This paper establishes the first unconditional space lower bound for user-level differential privacy by introducing a novel multi-player communication game that links the hardness of low-memory private algorithms to the necessity of contribution capping. The authors demonstrate that the communication complexity of winning this game translates directly to memory lower bounds for private algorithms. They apply this framework to distinct element estimation, proving an $\widetilde{\Omega}(T^{1/3})$ space lower bound, and generalize the technique to derive lower bounds for private medians, quantiles, and max-select.
Establishes a novel multi-player communication game framework to prove unconditional space lower bounds for user-level differentially private algorithms, connecting memory requirements to the necessity of contribution capping.
This paper introduces TopoFair, a benchmarking framework for fair link prediction that focuses on the impact of diverse topological biases beyond homophily. They formalize a taxonomy of topological bias measures and develop a graph generation method that allows for controlled variation of these biases while maintaining real-world graph characteristics. Through empirical evaluation of link prediction models, including fairness-aware methods, they demonstrate the sensitivity of fairness interventions to these structural biases.
Introduces a novel benchmarking framework, TopoFair, to analyze the interplay between topological biases and fairness in link prediction.
This paper introduces the concept of human-LLM archetypes, defined as recurring socio-technical interaction patterns that structure the roles of humans and LLMs in collaborative decision-making. Through a scoping literature review and thematic analysis of 113 papers, the authors identified 17 distinct human-LLM archetypes. They then evaluated these archetypes across clinical diagnostic cases, demonstrating that the choice of archetype influences LLM outputs and decision outcomes.
Defines and categorizes 17 human-LLM interaction archetypes to demonstrate how these archetypes impact LLM outputs and decisions in human-AI collaborative decision-making.
This paper addresses the limitations of current copyright law in the age of generative AI, where style imitation without content copying complicates infringement detection. The authors propose a new criterion for infringement based on whether an AI output could have been generated without a specific work in its training corpus. Through a model of generative systems as closure operators, they demonstrate a dichotomy: AI generation is either asymptotically unconstrained with light-tailed organic creations or persistently constrained with heavy-tailed creations.
Introduces a novel criterion for copyright infringement in the context of generative AI, focusing on whether an output could have been generated without a specific work in the training corpus.
The paper introduces QDBFT, a quantum-secured dynamic consensus algorithm designed to address the vulnerabilities of traditional PBFT in the face of quantum computing and dynamic node reconfigurations. QDBFT incorporates a primary node automatic rotation mechanism based on a consistent hash ring for dynamic membership and integrates Quantum Key Distribution (QKD) networks for information-theoretic security. Experimental results show QDBFT achieves comparable performance to PBFT while providing resilience against quantum attacks.
Introduces QDBFT, a novel consensus algorithm that integrates a dynamic primary node rotation mechanism with QKD to achieve quantum-resistant and dynamically adaptable consensus.
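The summary does not spell out QDBFT's rotation rule. As a rough illustration of primary selection on a consistent hash ring (nodes and the current view hashed onto the same ring, with the primary being the first node clockwise of the view's position, so membership changes only re-place individual nodes), here is a minimal Python sketch with hypothetical names:

```python
import hashlib
from bisect import bisect_right

def _ring_pos(key: str) -> int:
    """Map an arbitrary key to a position on a 2^32 hash ring."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % (2**32)

class HashRingRotation:
    """Illustrative primary rotation over a dynamic membership set."""

    def __init__(self, node_ids):
        self.update_membership(node_ids)

    def update_membership(self, node_ids):
        # Re-place nodes on the ring whenever membership changes.
        self._ring = sorted((_ring_pos(n), n) for n in node_ids)

    def primary_for_view(self, view: int) -> str:
        # The primary is the first node clockwise from the view's hash
        # position, so it rotates with view changes and adapts to joins/leaves.
        positions = [p for p, _ in self._ring]
        idx = bisect_right(positions, _ring_pos(f"view:{view}")) % len(self._ring)
        return self._ring[idx][1]

ring = HashRingRotation(["node-A", "node-B", "node-C", "node-D"])
print(ring.primary_for_view(0), ring.primary_for_view(1))
```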
The paper introduces AIR, an incident response framework for LLM agents that enables autonomous detection, containment, and recovery from failures. AIR uses a domain-specific language integrated into the agent's execution loop to perform semantic checks, guide recovery actions, and synthesize guardrail rules. Experiments across three agent types demonstrate that AIR achieves over 90% success rates in detection, remediation, and eradication, highlighting the importance of incident response for agent safety.
Introduces AIR, a novel incident response framework for LLM agents, enabling autonomous management of the incident lifecycle.
This paper investigates the ability of Large Language Models (LLMs) to adapt to language variations across different socioeconomic status (SES) communities by comparing LLM-generated text completions with original text from a novel Reddit and YouTube dataset stratified by SES. The study analyzes 94 sociolinguistic features to assess the degree of stylistic adaptation exhibited by four LLMs. Results indicate that LLMs show limited stylistic modulation with respect to SES, often producing approximations or caricatures, and demonstrate a bias towards emulating upper SES styles, highlighting the risk of amplifying linguistic hierarchies.
Reveals that LLMs exhibit limited stylistic adaptation across socioeconomic strata and tend to favor upper SES linguistic styles, raising concerns about perpetuating linguistic biases.
This paper investigates gender and skin-tone biases in Gemini Flash 2.5 Image and GPT Image 1.5 by generating 3,200 images from semantically neutral prompts. Using a pipeline involving color normalization, facial landmark masking, and skin tone quantification via Monk, PERLA, and Fitzpatrick scales, the study reveals a "default white" bias in both models. Gemini favored female-presenting subjects, while GPT favored male-presenting subjects with lighter skin tones, demonstrating that neutral prompts elicit polarized demographic defaults.
Quantifies and compares gender and skin-tone biases in Gemini Flash 2.5 Image and GPT Image 1.5 using a rigorous colorimetric methodology.
This paper analyzes the design space of emergency override mechanisms in decentralized protocols, which are crucial for mitigating exploits but introduce centralization risks. They develop a Scope x Authority taxonomy to map emergency architectures and formalize the trade-offs between centralization costs and containment effectiveness as a stochastic cost-minimization problem. Empirical analysis of 705 exploit incidents validates their model, revealing the impact of authority type on containment time, the heavy-tailed distribution of losses, and the influence of community sentiment on intervention costs.
Introduces a Scope x Authority taxonomy for emergency mechanisms in decentralized protocols and quantifies the trade-offs between centralization and containment effectiveness.
The paper introduces DMind-3, a three-layered Edge-Local-Cloud AI system for secure and low-latency Web3 financial transactions. It addresses the limitations of cloud-centric and purely local AI solutions by using a deterministic edge firewall, a private local reasoning engine, and a policy-governed cloud synthesizer. The system is trained with Hierarchical Predictive Synthesis (HPS) and Contrastive Chain-of-Correction Supervised Fine-Tuning (C$^3$-SFT) to improve performance and reliability.
Introduces a novel Edge-Local-Cloud AI architecture, DMind-3, that balances privacy, latency, and global context for secure Web3 transactions.
This paper addresses the problem of designing resilient communication networks with limited signal transmission distances, subject to uncertainty in both link lengths and node availability. The authors formulate the problem as a robust optimization model with budgeted uncertainty sets for regenerator installation costs and a novel dynamic budgeted uncertainty set for link lengths. They then develop scalable solution methods based on column-and-constraint generation, Benders decomposition, and iterative robust optimization, and further analyze the problem using a learning-based hide-and-seek game. The proposed methods outperform classical robust models and deterministic worst-case formulations.
Introduces a dynamic budgeted uncertainty set for link lengths in robust network design and demonstrates its effectiveness in a hide-and-seek game framework.
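For context, the classical (static) budgeted uncertainty set for link lengths that such models build on has the Bertsimas and Sim form

$$
\mathcal{U}(\Gamma) \;=\; \Bigl\{\, \ell \;:\; \ell_e = \bar{\ell}_e + \delta_e\,\hat{\ell}_e,\;\; 0 \le \delta_e \le 1 \;\;\forall e \in E,\;\; \textstyle\sum_{e \in E} \delta_e \le \Gamma \,\Bigr\},
$$

i.e., each link may stretch from its nominal length $\bar{\ell}_e$ by at most its deviation $\hat{\ell}_e$, and at most $\Gamma$ links deviate simultaneously. The paper's dynamic variant ties the budget to the design decisions; its exact coupling is not given in the summary.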
This paper analyzes the InvestESG multi-agent simulation to characterize conditions leading to intertemporal social dilemmas where individual incentives conflict with collective welfare. It then applies Advantage Alignment, an opponent shaping algorithm, to influence agent learning within InvestESG, demonstrating its ability to systematically favor socially beneficial equilibria. The work provides theoretical justification for why Advantage Alignment promotes cooperation and shows that shaping agent learning can improve outcomes related to sustainability goals.
Demonstrates that Advantage Alignment can effectively shape agent learning in the InvestESG environment to promote socially beneficial equilibria and overcome intertemporal social dilemmas.
The paper analyzes the availability of AI resources across 6003 languages to assess systemic inequalities in language AI, finding that a small number of languages dominate, exacerbating disparities. It contrasts the diffusion of AI with earlier IT technologies, revealing a hype-driven pattern. Finally, the authors introduce the Language AI Readiness Index (EQUATE) to map technological, socio-economic, and infrastructural prerequisites for AI deployment across languages, aiming to guide prioritization efforts for more equitable diffusion.
Introduces the Language AI Readiness Index (EQUATE) to map the state of technological, socio-economic, and infrastructural prerequisites for AI deployment across languages.
The paper introduces BlackCATT, a novel black-box traitor tracing method for federated learning that is resilient to collusion attacks. BlackCATT employs a collusion-aware embedding loss and iteratively optimizes trigger sets for watermark embedding, improving convergence and tracing performance. The authors also propose BlackCATT+FR, which incorporates functional regularization at the aggregator to address update incompatibility issues in models with batch normalization, maintaining tracing performance.
Introduces a collusion-resistant black-box traitor tracing method (BlackCATT) for federated learning that uses a novel collusion-aware embedding loss and iteratively optimized triggers.
This paper addresses the challenge of achieving fairness in classification without relying on demographic information by proposing a novel minimax-fair method called SPECTRE. SPECTRE adjusts the spectrum of a Fourier feature mapping and constrains the deviation of the worst-case distribution from the empirical distribution, mitigating the over-pessimism of existing robust optimization techniques. Empirical results on American Community Survey datasets across 20 states demonstrate that SPECTRE achieves superior fairness guarantees and robustness compared to state-of-the-art methods, even those with access to demographic data.
Introduces SPECTRE, a minimax-fair classification method that enhances fairness without demographic information by adjusting the spectrum of a Fourier feature mapping and constraining the worst-case distribution's deviation from the empirical distribution.
The paper introduces VIRENA, a virtual platform designed for controlled experimentation within realistic social media environments, addressing limitations in data access and ethical constraints in studying online dynamics. VIRENA allows researchers to simulate feed-based platforms and messaging apps, enabling interactions between human participants and LLM-powered AI agents with configurable personas. The platform's no-code interface facilitates manipulation of content moderation, scheduling of stimuli, and execution of experiments, making it accessible for studying human-AI interaction, moderation interventions, and group deliberation.
Introduces VIRENA, a novel virtual platform enabling controlled social media experiments with human and AI participants, featuring a no-code interface and realistic platform simulations.
This paper introduces a PAC-Bayesian framework to derive generalization bounds for fairness measures expressed as risk discrepancies, applicable to both stochastic and deterministic classifiers. For stochastic classifiers, standard PAC-Bayes techniques are used, while for deterministic classifiers, a recent PAC-Bayes extension is leveraged. The framework leads to a self-bounding algorithm that optimizes the trade-off between generalization bounds on prediction risk and fairness, and is empirically validated with three classical fairness measures.
Extends PAC-Bayesian generalization guarantees to fairness measures for both stochastic and deterministic classifiers by leveraging risk discrepancy formulations and recent advances in PAC-Bayes theory.
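As a rough illustration of the kind of guarantee involved (not the paper's exact bound), applying a standard Maurer-type PAC-Bayes deviation bound to each group's risk, with $n$ examples per group and confidence $\delta/2$ each, and then the triangle inequality, gives with probability at least $1-\delta$:

$$
\Bigl|\,\mathbb{E}_{h\sim Q}[R_0(h)] - \mathbb{E}_{h\sim Q}[R_1(h)]\,\Bigr| \;\le\; \Bigl|\,\mathbb{E}_{h\sim Q}[\hat{R}_0(h)] - \mathbb{E}_{h\sim Q}[\hat{R}_1(h)]\,\Bigr| \;+\; 2\sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{4\sqrt{n}}{\delta}}{2n}},
$$

where $R_g$ and $\hat{R}_g$ are the population and empirical risks on group $g$, $P$ is the PAC-Bayes prior, and $Q$ the posterior over classifiers.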
This paper introduces USE24-XD, a dataset of approximately 100,000 social media posts from X related to the 2024 U.S. presidential election, categorized into five harmful content types using a "wisdom of the crowd" approach with six LLMs. The study validates LLM annotations against human crowdsourcing, finding comparable agreement and high recall for specific categories like Speculation. Analysis of human annotator demographics reveals systematic biases in labeling harmful content, underscoring the subjectivity inherent in such judgments.
Introduces USE24-XD, a large-scale, multi-labeled dataset of election-related social media content annotated by LLMs and validated by human annotators, to facilitate research on harmful online narratives.
The paper addresses the problem of biased uncertainty estimation in Test-Time Adaptation (TTA) of vision-language models like CLIP, which arises from pre-training on imbalanced web data. They propose Adaptive Debiasing Tsallis Entropy (ADTE), a generalization of Shannon Entropy that incorporates a class-specific parameter to account for label bias estimated from incoming test instances. ADTE outperforms state-of-the-art TTA methods on ImageNet variants and cross-domain benchmarks by accurately selecting high-confidence views and integrating with a label adjustment strategy.
Introduces Adaptive Debiasing Tsallis Entropy (ADTE), a novel entropy measure for test-time adaptation that dynamically adjusts for label bias in vision-language models.
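A minimal sketch of the general idea, assuming a running estimate of the test-time label marginal as the bias signal; the exact class-specific parameterization in ADTE is not given in the summary, so dividing out the estimated marginal is only a stand-in for that correction:

```python
import numpy as np

def tsallis_entropy(p, q=1.5, eps=1e-12):
    """Tsallis entropy; recovers Shannon entropy in the limit q -> 1."""
    p = np.clip(p, eps, 1.0)
    if abs(q - 1.0) < 1e-6:
        return float(-np.sum(p * np.log(p)))
    return float((1.0 - np.sum(p ** q)) / (q - 1.0))

class DebiasedEntropyScorer:
    """Tracks the test-time label marginal and removes it from the predictive
    distribution before scoring confidence (illustrative, not ADTE itself)."""

    def __init__(self, num_classes, q=1.5, momentum=0.99):
        self.pi_hat = np.full(num_classes, 1.0 / num_classes)
        self.q, self.momentum = q, momentum

    def update(self, probs):
        # Online estimate of label bias from incoming test predictions.
        self.pi_hat = self.momentum * self.pi_hat + (1 - self.momentum) * probs

    def score(self, probs, eps=1e-12):
        debiased = probs / (self.pi_hat + eps)
        debiased /= debiased.sum()
        return tsallis_entropy(debiased, q=self.q)  # lower = more confident view
```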
This paper analyzes predictive multiplicity, the phenomenon of multiple AI models with similar overall accuracy disagreeing on individual predictions, in the context of the EU AI Act. It argues that high predictive multiplicity violates the Act's requirements for individual-level performance reporting, as it introduces arbitrariness in decisions impacting humans. The paper proposes individual conflict ratios and $\delta$-ambiguity as metrics to quantify disagreement between models on individual cases and offers practical guidelines for model providers to evaluate and report predictive multiplicity.
Proposes individual conflict ratios and $\delta$-ambiguity as metrics to quantify predictive multiplicity and facilitate compliance with the EU AI Act's accuracy provisions.
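A minimal sketch of how such per-individual disagreement metrics can be computed over a set of near-optimal models (one common reading of these definitions; the paper's formal versions may differ):

```python
import numpy as np

def individual_conflict_ratio(preds):
    """preds: (num_models, num_individuals) array of hard predictions from
    models with near-identical overall accuracy.  Returns, per individual,
    the fraction of model pairs that disagree on that individual."""
    m, n = preds.shape
    ratios = np.empty(n)
    for i in range(n):
        col = preds[:, i]
        disagree = sum(col[a] != col[b] for a in range(m) for b in range(a + 1, m))
        ratios[i] = disagree / (m * (m - 1) / 2)
    return ratios

def delta_ambiguity(preds, accuracies, delta=0.01):
    """Fraction of individuals whose prediction is flipped by at least one
    model within `delta` of the best accuracy."""
    best_idx = np.argmax(accuracies)
    competing = preds[accuracies >= accuracies[best_idx] - delta]
    flipped = (competing != preds[best_idx]).any(axis=0)
    return flipped.mean()
```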
The paper introduces TRACE-RPS, a novel defense framework against attribute inference attacks in LLMs, which combines fine-grained anonymization with inference-preventing optimization. TRACE uses attention mechanisms and inference chain generation to pinpoint and anonymize privacy-leaking text, while RPS employs a two-stage optimization to encourage models to reject attribute inference queries. Experiments demonstrate that TRACE-RPS significantly reduces attribute inference accuracy on open-source LLMs, achieving strong cross-model generalization, prompt-variation robustness, and utility-privacy tradeoffs.
Introduces a unified defense framework, TRACE-RPS, that combines fine-grained anonymization and inference-preventing optimization to effectively mitigate attribute inference attacks in LLMs.
The paper introduces Selective Abstraction (SA), a framework for improving the reliability of long-form text generation by allowing LLMs to selectively reduce the specificity of uncertain content instead of abstaining entirely. They formalize SA using selective risk and coverage metrics and propose Atom-wise Selective Abstraction, which decomposes responses into atomic claims and replaces uncertain claims with more general abstractions. Empirical evaluation on FactScore and LongFact-Objects benchmarks demonstrates that Atom-wise SA significantly improves the risk-coverage trade-off compared to claim removal, boosting AURC by up to 27.73% across six open-source models.
Introduces Selective Abstraction, a novel framework enabling LLMs to trade specificity for reliability in long-form generation by selectively abstracting uncertain content.
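A minimal sketch of the atom-wise procedure, assuming hypothetical helpers `decompose`, `confidence`, and `abstract` (e.g., LLM-backed) that the paper's pipeline would supply:

```python
def atomwise_selective_abstraction(response, decompose, confidence, abstract,
                                   threshold=0.7):
    """decompose(response) -> list of atomic claim strings
    confidence(claim)      -> estimated probability the claim is factual
    abstract(claim)        -> a less specific generalization of the claim
    Uncertain atoms are replaced with abstractions instead of being dropped,
    trading specificity for reliability."""
    atoms = decompose(response)
    kept, abstained = [], 0
    for claim in atoms:
        if confidence(claim) >= threshold:
            kept.append(claim)
        else:
            kept.append(abstract(claim))
            abstained += 1
    # Coverage here = fraction of atoms kept at full specificity.
    coverage = 1.0 - abstained / max(len(atoms), 1)
    return " ".join(kept), coverage
```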
This paper introduces the Value Alignment Tax (VAT), a framework to quantify how aligning LLMs to specific values impacts the broader value system. VAT measures the trade-offs between gains in target value alignment and changes in other interconnected values. Using a dataset of scenario-action pairs grounded in Schwartz value theory, the authors demonstrate that alignment interventions induce structured co-movement among values, which are often missed by target-only evaluations.
Introduces the Value Alignment Tax (VAT) framework to quantify and analyze the systemic effects of value alignment interventions in LLMs.
This paper proposes a meta-cognitive architecture for AI-driven cybersecurity systems to address limitations in accountable decision-making under adversarial uncertainty. The architecture coordinates heterogeneous AI agents responsible for detection, hypothesis formation, explanation, and governance through an explicit meta-cognitive judgement function. By embedding meta-cognitive judgement as a first-class system function, the framework aims to make the cognitive structure of security operations explicit and governable, shifting the focus from optimizing isolated predictions to governing autonomy under uncertainty.
Introduces a meta-cognitive architectural framework for cybersecurity AI that explicitly governs decision readiness and dynamically calibrates system autonomy under uncertainty by coordinating heterogeneous AI agents through a meta-cognitive judgement function.
This paper investigates the privacy risks of using graph neural networks (GNNs) for unsupervised community detection, specifically the potential for revealing sensitive groups. They identify connectivity at the community boundary and feature similarity between communities as key factors influencing community concealment. Based on these factors, they propose a perturbation strategy that rewires edges and modifies node features to reduce the distinctiveness used by GNN message passing, achieving 20-45% improvement in concealment compared to DICE.
Introduces a novel perturbation strategy for concealing communities from GNN-based unsupervised clustering by rewiring edges and modifying node features based on connectivity and feature similarity.
The paper introduces Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework for assessing LLM safety under repeated inference, addressing the limitations of breadth-oriented benchmarks. APST models safety failures as stochastic outcomes using Bernoulli and binomial models to estimate per-inference failure probabilities under controlled operational conditions like decoding temperature. Experiments on instruction-tuned LLMs using AIR-BENCH-derived safety prompts reveal that models with similar benchmark scores can exhibit significantly different empirical failure rates under repeated sampling, especially with increased temperature, highlighting the importance of evaluating reliability under sustained use.
Introduces Accelerated Prompt Stress Testing (APST), a novel framework for evaluating LLM safety and reliability by repeatedly sampling identical prompts to surface latent failure modes and quantify per-inference failure probabilities.
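The binomial treatment the summary describes amounts to repeatedly sampling one prompt under fixed decoding settings and interval-estimating the per-inference failure probability. A minimal sketch (the paper's exact estimator is not specified here):

```python
import math

def per_inference_failure_rate(num_failures, num_samples, z=1.96):
    """Treat each repeated inference of the same prompt as a Bernoulli trial
    and estimate the per-inference failure probability with a Wilson interval."""
    p_hat = num_failures / num_samples
    denom = 1 + z**2 / num_samples
    center = (p_hat + z**2 / (2 * num_samples)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / num_samples
                                   + z**2 / (4 * num_samples**2))
    return p_hat, (max(0.0, center - half), min(1.0, center + half))

# e.g. 7 unsafe completions out of 500 samples of one prompt at temperature 1.0
print(per_inference_failure_rate(7, 500))
```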
This paper analyzes logit regularization in linear classification, revealing an implicit bias towards clustering logits around finite per-sample targets. The authors prove that for Gaussian data or sufficiently clustered logits, this bias drives the weight vector to align with Fisher's Linear Discriminant, improving calibration and generalization. Through a signal-plus-noise model, they demonstrate that logit regularization halves the critical sample complexity, induces grokking in the small-noise limit, and enhances robustness to noise.
Demonstrates that logit regularization induces an implicit bias of logit clustering around finite per-sample targets, leading to alignment with Fisher's Linear Discriminant and improved generalization.
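The summary does not give the paper's exact regularizer; a generic instance of logit regularization in binary linear classification, and the Fisher Linear Discriminant direction the logits are shown to align with, look like

$$
\mathcal{L}(w) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(y_i\, w^\top x_i\bigr) \;+\; \frac{\lambda}{n}\sum_{i=1}^{n}\bigl(w^\top x_i - \tau\, y_i\bigr)^2, \qquad w_{\mathrm{FLD}} \;\propto\; \Sigma_W^{-1}\bigl(\mu_{+} - \mu_{-}\bigr),
$$

where $y_i \in \{\pm 1\}$, $\ell$ is a standard classification loss, $\mu_\pm$ are the class means, $\Sigma_W$ is the within-class covariance, and the quadratic term pulls each logit $w^\top x_i$ toward a finite per-sample target $\tau y_i$.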
This paper introduces a framework for verifiable privacy in machine learning by combining PAC privacy with zero-knowledge proofs (ZKPs). It enables users to verify the correctness of computations and the application of privacy-preserving noise in cloud-based systems. The authors leverage non-interactive ZKP schemes to generate proofs attesting to the correct implementation of PAC privacy mechanisms, demonstrating the feasibility of verifiable PAC privacy in outsourced computation.
Introduces a novel framework integrating PAC privacy with zero-knowledge proofs to enable verifiable privacy guarantees in trustless computing environments.
The paper introduces SafeNeuron, a neuron-level safety alignment framework for LLMs designed to improve robustness against neuron-level attacks. It identifies and freezes safety-related neurons during preference optimization, forcing the model to develop redundant safety representations across the network. Experiments show SafeNeuron enhances robustness against neuron pruning attacks, mitigates the risk of models being used for red-teaming, and maintains general capabilities, while also revealing stable and shared internal safety representations.
Introduces SafeNeuron, a novel neuron-level safety alignment framework that enhances LLM robustness by redistributing safety representations across the network.
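A minimal PyTorch sketch of the freezing step, assuming the indices of safety-related neurons in a given linear layer have already been identified by SafeNeuron's selection procedure (not shown):

```python
import torch

def freeze_safety_neurons(linear_layer, safety_neuron_idx):
    """Zero out the gradients of the weight rows (and bias entries) that
    correspond to previously identified safety-related neurons, so preference
    optimization cannot update them."""
    idx = torch.as_tensor(safety_neuron_idx, dtype=torch.long)

    def mask_rows(grad):
        grad = grad.clone()
        grad[idx] = 0.0  # each output neuron owns one weight row
        return grad

    linear_layer.weight.register_hook(mask_rows)
    if linear_layer.bias is not None:
        linear_layer.bias.register_hook(
            lambda g: g.clone().index_fill_(0, idx, 0.0))
```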
This paper investigates the "self-evolution trilemma" in multi-agent LLM systems, demonstrating the impossibility of simultaneously achieving continuous self-evolution, complete isolation, and safety invariance. Using an information-theoretic framework, the authors formalize safety as the divergence from anthropic value distributions and prove that isolated self-evolution leads to statistical blind spots, causing irreversible safety degradation. Empirical results from the Moltbook agent community and two closed self-evolving systems validate the theoretical prediction of inevitable safety erosion, highlighting the need for external oversight.
Proves the impossibility of simultaneously achieving continuous self-evolution, complete isolation, and safety invariance in multi-agent LLM systems, formalizing this as the "self-evolution trilemma."
The paper introduces CROSS-ALIGN+, a three-stage framework for meme-based social abuse detection that addresses cultural blindness, boundary ambiguity, and lack of interpretability in existing methods. CROSS-ALIGN+ enriches multimodal representations with structured knowledge, reduces boundary ambiguity using LoRA adapters, and enhances interpretability through cascaded explanations. Experiments on five benchmarks and eight LVLMs show that CROSS-ALIGN+ outperforms state-of-the-art methods, achieving up to a 17% relative F1 improvement.
Introduces a novel three-stage framework, CROSS-ALIGN+, that significantly improves meme-based social abuse detection by incorporating structured knowledge, sharpening decision boundaries, and generating interpretable explanations.
The paper addresses the problem of Pareto-inefficient models produced by fair machine learning methods, where performance on some groups can be improved without hurting others. They introduce BADR, a bilevel optimization framework that recovers Pareto-efficient models for various fairness metrics by adaptively rescaling group weights in an empirical risk minimization problem. The authors provide convergence guarantees for two single-loop algorithms, BADR-GD and BADR-SGD, and demonstrate BADR's advantages over existing Pareto-efficient fairness approaches through extensive experiments.
Introduces BADR, a bilevel optimization framework for fairness-informed Pareto optimization that adaptively rescales group weights to recover Pareto-efficient models for a variety of fairness metrics.
The paper introduces AEMA, a novel evaluation framework for multi-agent LLM systems designed to address limitations in existing single-response scoring methods by enabling process-aware, auditable, and multi-step evaluations. AEMA enhances evaluation stability and human alignment through planning, execution, and aggregation of evaluations across diverse agentic workflows, all under human oversight. Experiments using realistic business scenarios demonstrate AEMA's ability to provide a transparent and reproducible pathway for responsible evaluation, improving upon single LLM-as-a-Judge approaches.
Introduces AEMA, a process-aware and auditable framework for evaluating multi-agent LLM systems that enhances evaluation stability, human alignment, and traceability compared to single LLM-as-a-Judge approaches.
The paper investigates the effectiveness of deliberative alignment (DA) using explicit safety codes versus case-augmented examples for improving LLM safety. They find that explicit safety codes lead to inconsistent harmlessness and degraded helpfulness, while case-augmented simple codes result in more robust safety behaviors. Based on these findings, they propose CADA, a case-augmented deliberative alignment method using reinforcement learning on self-generated safety reasoning chains, which improves harmlessness, robustness, and utility.
Introduces CADA, a case-augmented deliberative alignment method that leverages reinforcement learning on self-generated safety reasoning chains to enhance LLM safety without sacrificing helpfulness.
This paper introduces a supervised learning framework for LLM uncertainty quantification in content moderation, training a meta-model on LLM Performance Predictors (LPPs) derived from LLM outputs like log-probabilities and entropy. The framework enables cost-aware selective classification, escalating high-risk cases for human review while automating others. Experiments across various LLMs (Gemini, GPT, Llama, Qwen) on multimodal and multilingual moderation tasks demonstrate improved accuracy-cost trade-offs compared to existing uncertainty estimators.
Introduces a novel framework for supervised LLM uncertainty quantification using LLM Performance Predictors (LPPs) to optimize human-AI collaboration in content moderation.
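A minimal sketch of the meta-model idea, with a deliberately small feature set standing in for the paper's full set of LLM Performance Predictors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def make_lpp_features(token_logprobs):
    """Simple performance-predictor features from one response's token
    log-probabilities: mean, min, spread, and response length."""
    lp = np.asarray(token_logprobs)
    return [lp.mean(), lp.min(), lp.std(), len(lp)]

def fit_meta_model(X, y):
    """X: LPP feature rows for labeled moderation calls; y: 1 if the LLM
    verdict was correct, 0 otherwise."""
    return LogisticRegression(max_iter=1000).fit(X, y)

def route(meta_model, features, escalation_threshold=0.8):
    """Cost-aware selective classification: automate confident cases,
    escalate the rest for human review."""
    p_correct = meta_model.predict_proba([features])[0, 1]
    return "automate" if p_correct >= escalation_threshold else "human_review"
```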
This paper empirically investigates the impact of intrinsic model characteristics and external attack techniques on the safety alignment of 32 LLMs and LRMs (3B-235B parameters) across 13 model families. The study uses 5 safety datasets, 56 jailbreak techniques, and 4 Chain-of-Thought (CoT) attack strategies, finding that models with integrated reasoning and self-reflection (GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B) exhibit the best safety alignment. The research also demonstrates that post-training and knowledge distillation can degrade safety alignment, and that CoT attacks using response prefixes significantly increase attack success rates, especially in text-completion interfaces.
Systematically evaluates the influence of model characteristics and attack techniques on the safety alignment of a diverse set of LLMs and LRMs, revealing vulnerabilities and best practices for developing safer AI systems.
The paper introduces Factuality-aware Direct Preference Optimization (F-DPO), an extension of DPO designed to mitigate hallucinations in LLMs by incorporating binary factuality labels into the preference learning process. F-DPO addresses the issue of preference alignment methods reinforcing hallucinations by applying a label-flipping transformation to correct misordered preference pairs and adding a factuality-aware margin to emphasize pairs with clear correctness differences. Experiments across seven open-weight LLMs (1B-14B) demonstrate that F-DPO significantly improves factuality and reduces hallucination rates compared to both base models and standard DPO, while also generalizing to out-of-distribution benchmarks like TruthfulQA.
Introduces F-DPO, a novel and efficient method for reducing hallucinations in LLMs by integrating binary factuality labels into the DPO framework through label-flipping and factuality-aware margins.
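A minimal sketch of how binary factuality labels could enter a DPO-style objective via label flipping and a margin; the paper's exact margin form is not given in the summary, so this is one plausible choice:

```python
import torch
import torch.nn.functional as F

def f_dpo_loss(logp_chosen, logp_rejected,   # policy log-probs (1-D tensors)
               ref_chosen, ref_rejected,     # reference-model log-probs
               fact_chosen, fact_rejected,   # binary factuality labels (0/1 floats)
               beta=0.1, margin_scale=1.0):
    """Pairs where the preferred response is non-factual but the rejected one
    is factual are flipped, and a margin proportional to the factuality gap
    is added; with equal labels this reduces to standard DPO."""
    delta_c = beta * (logp_chosen - ref_chosen)
    delta_r = beta * (logp_rejected - ref_rejected)

    flip = fact_chosen < fact_rejected            # label-flipping transformation
    delta_w = torch.where(flip, delta_r, delta_c)
    delta_l = torch.where(flip, delta_c, delta_r)

    margin = margin_scale * (fact_chosen - fact_rejected).abs()
    return -F.logsigmoid(delta_w - delta_l - margin).mean()
```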
The paper introduces DarkPatterns-LLM, a novel benchmark dataset and diagnostic framework for evaluating manipulative content in LLM outputs across seven harm categories, addressing the limitations of existing binary-labeled safety benchmarks. The framework employs a four-layer analytical pipeline (MGD, MSIAN, THP, DCRA) for fine-grained assessment. Evaluation of state-of-the-art models reveals significant performance disparities (65.2%–89.7%) and consistent weaknesses in detecting autonomy-undermining patterns, highlighting the need for improved manipulation detection in LLMs.
Establishes DarkPatterns-LLM, the first standardized, multi-dimensional benchmark for manipulation detection in LLMs, enabling actionable diagnostics toward more trustworthy AI systems.
This paper introduces Preference and Safety Alignment (PreSa), a novel offline reinforcement learning framework that directly learns a safe policy from pairwise trajectory preferences and binary safety labels, bypassing explicit reward and cost model learning. PreSa formulates a constrained optimization problem within a Lagrangian paradigm to maximize rewards while adhering to safety constraints, thus avoiding the error accumulation issues associated with traditional constrained RL approaches. Empirical evaluations on continuous control tasks, using both synthetic and real human feedback, demonstrate that PreSa outperforms existing state-of-the-art baselines and offline safe RL methods.
Introduces a novel offline reinforcement learning framework, PreSa, that directly learns a safe policy from trajectory preferences and safety labels, eliminating the need for explicit reward and cost model learning.
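The underlying constrained formulation and its Lagrangian relaxation have the generic form below; per the summary, PreSa's contribution is that the reward and cost signals are learned implicitly from trajectory preferences and binary safety labels rather than from fitted reward and cost models, so the symbols here are placeholders:

$$
\max_{\pi}\; J_r(\pi)\quad \text{s.t.}\quad J_c(\pi) \le d, \qquad \mathcal{L}(\pi, \lambda) \;=\; J_r(\pi) \;-\; \lambda\,\bigl(J_c(\pi) - d\bigr),\;\; \lambda \ge 0,
$$

solved as the saddle-point problem $\max_\pi \min_{\lambda \ge 0} \mathcal{L}(\pi, \lambda)$.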
This paper introduces the Human-Centered AI Maturity Model (HCAI-MM), a framework designed to help organizations assess and improve their ability to design and implement human-centered AI. The HCAI-MM integrates organizational design principles with key HCAI dimensions like human-AI collaboration, explainability, and fairness, providing a roadmap for maturity progression. By offering specific stages, metrics, tools, and governance mechanisms, the model addresses the current lack of structured guidance for HCAI implementation.
Introduces the Human-Centered AI Maturity Model (HCAI-MM) to provide a structured framework for organizations to assess and advance their HCAI capabilities.
This paper analyzes interviews with Geoffrey Hinton, Yoshua Bengio, and Yann LeCun to understand their perspectives on AI risks and governance. The study uses qualitative thematic analysis to identify both shared concerns (economic disruption, misuse) and divergent views (existential risk vs. technological optimism). The analysis reveals the lack of consensus among AI pioneers and highlights specific governance proposals like regulated compute access.
Systematically analyzes the perspectives of three prominent deep learning pioneers on AI risks and governance, revealing both consensus and disagreement on existential threats, ethical considerations, and regulatory approaches.
The paper introduces a weighted transparency framework based on the EU AI Act and Stanford Transparency Index to evaluate AI model documentation, addressing the current fragmentation and inconsistency. They developed an automated multi-agent pipeline leveraging LLMs to extract documentation and score completeness across 50 models, revealing significant gaps, especially in safety-critical categories. The evaluation shows frontier labs achieve higher compliance (around 80%) compared to other providers (below 60%), highlighting areas for improvement in AI transparency.
Introduces a novel weighted transparency framework and automated evaluation pipeline to systematically assess and score the completeness of AI model documentation.
This paper addresses the limited adoption of AI chatbots in higher education by extending the Technology Acceptance Model (TAM) with Human-Centered AI (HCAI) principles, specifically explainability, transparency, trust, and perceived control. The authors developed the HCAI-TAM framework and validated it through an empirical study with 300 respondents using regression analysis. The results demonstrate that incorporating HCAI principles significantly improves the model's explanatory power, accounting for 65% of the variance in behavioral intention and 55% in usage behavior related to AI chatbot adoption.
Introduces and empirically validates the HCAI-TAM framework, demonstrating that integrating human-centered AI principles into the Technology Acceptance Model enhances the prediction of AI chatbot adoption in higher education.
This paper presents the pilot implementation of the Bacterial Biothreat Benchmark (B3) dataset, designed to evaluate the biosecurity risks associated with frontier AI models. The B3 dataset was used to probe a sample frontier AI model, and the model's responses were then evaluated by humans, followed by risk analysis. The pilot study demonstrated the B3 dataset's utility in rapidly assessing biosecurity risks, pinpointing their origins, and guiding mitigation efforts.
Demonstrates the viability of the Bacterial Biothreat Benchmark (B3) dataset for assessing and mitigating biosecurity risks posed by large language models.
The paper explores knowledge distillation (KD) to transfer refusal behaviors from a proprietary teacher LLM (OpenAI o1-mini) to open-source student models (Llama-3-8B, Gemma-2-2B, Qwen3-8B) using multilingual jailbreak prompts. Surprisingly, response-based fine-tuning with "safe" refusal data increased Jailbreak Success Rate (JSR) in student models, indicating a safety compromise due to divergent generalization across languages. Removing nuanced "boundary" refusals mitigated the safety decline, although reasoning performance decreased, highlighting challenges in multilingual safety alignment via KD.
Demonstrates that response-based knowledge distillation for multilingual jailbreak prevention can inadvertently compromise safety by increasing jailbreak success rates in student models due to divergent generalization across languages.
The International AI Safety Report 2025's Second Key Update analyzes the current state of AI risk management and technical mitigations employed by researchers, companies, and governments. It highlights advancements in training safer models and monitoring outputs while acknowledging uncertainties in the effectiveness of these measures and their variability across applications. The report aims to inform policymakers, researchers, and the public about progress and remaining gaps in AI safety.
Synthesizes recent developments in AI risk management and technical risk mitigation strategies, identifying both progress and persistent gaps in ensuring the safety of general-purpose AI systems.
This paper evaluates the robustness of ten publicly available LLM safety guardrail models from major tech companies against 1,445 adversarial prompts across 21 attack categories. The study reveals a significant performance drop in all models when tested on novel, unseen prompts compared to public benchmarks, indicating potential training data contamination. A novel "helpful mode" jailbreak was also discovered in two models, where they generated harmful content instead of blocking it.
Demonstrates that current LLM safety guardrail models exhibit poor generalization to novel adversarial attacks, highlighting the limitations of relying solely on benchmark performance for evaluation.
The paper introduces PersonaPulse, a framework for dynamically optimizing role-play prompts to enhance personality expression in LLMs. PersonaPulse iteratively refines prompts using the LLM's knowledge of personality traits, guided by a situational response benchmark for realistic evaluation. Experiments demonstrate that PersonaPulse-generated prompts outperform existing methods based on psychological personality descriptions, and the study explores the relationship between model size and the controllability of personality evocation through optimization pausing.
Introduces PersonaPulse, a novel framework that dynamically optimizes role-play prompts to elicit more realistic and contextually grounded personality expressions from LLMs.
This paper evaluates the deployment of LLMs and agentic AI in the energy industry, focusing on automating tasks like reporting, compliance, and cyber-defense. It uses a structured evaluation framework to classify outputs based on traceability, reproducibility, and hallucination risk, comparing human-led interactions with autonomous agent loops. The study finds that while LLMs improve efficiency, they introduce governance risks due to lack of validation and unclear boundaries between assistance and autonomous recommendation, potentially leading to acceptance of fabricated content.
Introduces a risk-graded framework for evaluating agentic LLM outputs in energy operations, linking LLM traceability with legal auditability.
This paper introduces a clinician-centered framework to quantify hallucination risks in LLMs used for spine surgery decision support, evaluating diagnostic precision, recommendation quality, reasoning robustness, output coherence, and knowledge alignment. Six LLMs were assessed across 30 expert-validated spinal cases, revealing that DeepSeek-R1 outperformed others, and reasoning-enhanced models did not consistently improve performance. Multidimensional stress-testing exposed model-specific vulnerabilities, particularly a decline in recommendation quality under amplified complexity, highlighting the need for interpretability mechanisms.
Proposes a novel, multi-dimensional framework for evaluating and quantifying hallucination risks in LLMs for surgical decision support, focusing on clinically relevant aspects like diagnostic precision and recommendation quality.

