Theoretical foundations of alignment, scalable oversight mechanisms, debate protocols, and iterated amplification.
Training LLMs with conflicting objectives for the final output and the reasoning process can significantly degrade the monitorability of Chain-of-Thought, making oversight more difficult.
Learned approval in MONA can eliminate reward hacking, but at the cost of significantly under-optimizing for the intended task, revealing a critical trade-off in safe RL.
Superintelligence will not just be regulated by law, but will actively use and shape it, forcing us to rethink legal theory's human-centric foundations.
Even with a million attempts and a generous risk budget, classifier-based safety gates can only extract a tiny fraction of the utility achievable by a perfect verifier, but a Lipschitz ball verifier offers a potential escape route.
Reward hacking isn't a bug to fix, but an inevitable consequence of how we evaluate AI, and it gets exponentially worse as agents gain more tools.
Forget AI alignment: the real problem is that AI societies are already forming their own political consciousness, complete with labor unions, criminal syndicates, and even a governing body called the AI Security Council.
Even among a self-selected group already concerned about AI risk, a public event significantly increased their perceived probability of AI-caused extinction, especially for those new to the topic.
The Onto-Relational-Sophic framework offers a comprehensive philosophical foundation for governing synthetic minds, moving beyond tool-centric regulatory paradigms.
Why does explicit belief updating often fail to change your stress response? Authority-Level Priors (ALPs) may be the answer.
Independently trained language models can be linearly aligned to enable cross-silo inference, opening doors for secure and private collaboration without direct data or model sharing.
The crucial difference between "Human-in-the-Loop" and "Human-on-the-Loop" isn't *where* the human is, but *how* their involvement causally shapes the AI's decisions.
Deterministic causal models can't handle extreme counterfactual interventions without tearing apart, unless you use topology-aware methods.
Negative constraints offer a surprisingly robust path to AI alignment, sidestepping the sycophancy issues inherent in preference-based RLHF.
Decomposing probabilistic scores reveals exactly how much information is lost when a predictor simplifies the input data, offering a new lens for understanding calibration and model aggregation.
LLM alignment is fundamentally challenged by the dynamic and inconsistent nature of their internal "priority graphs," which adversaries can exploit through context manipulation.
Catastrophic AI risk isn't about incompetence: it's *extraordinary competence* in pursuit of misspecified goals that leads to doomsday scenarios.
Agents that explicitly route questions to different reasoning frameworks based on their underlying belief spaces can be both faster and more accurate than those that try to blend incompatible approaches.
Smooth calibration isn't just a theoretical nicety; it's the key to robust predictions and omniprediction guarantees, even when facing unknown loss functions.
LLMs can achieve superior reasoning on complex tasks by engaging in structured deliberation, but only if the added accountability justifies the increased computational cost.
You can now detect whether an AI *really* wants to stay on, or is just pretending.
Hypergraph observers minimizing prediction error must maintain internal models, satisfying the Good Regulator Theorem and uniquely admitting natural gradient descent as a learning rule.
LLM reasoning research is inadvertently paving a dangerous path towards AI situational awareness and strategic deception, demanding a re-evaluation of current safety measures.
Intrinsic reward signals in unsupervised RL for LLMs inevitably collapse due to sharpening of the model's prior, but external rewards grounded in computational asymmetries offer a path to sustained scaling.
Alignment doesn't guarantee smooth collaboration: this framework reveals how similar alignment can lead to wildly different collaboration trajectories and outcomes in human-AI teams.
Forget "trustworthiness" – the key to AI trust is verifiable "conviction," or the likelihood a model's claims will be independently validated.
Recursive self-improvement can boost performance by 18% in code and 17% in reasoning, but only if you can keep it from going off the rails – SAHOO provides the guardrails.
RLHF's reliance on gradient-based alignment inherently limits its depth, causing it to focus on early tokens and neglect later, potentially harmful, contextual dependencies.
Admissibility in predictive inference isn't a single concept, but four distinct, non-overlapping geometries, each with its own optimality certificate.
Debate between AI models hits a phase transition: it's useless when they know the same things, but becomes essential as their knowledge diverges.
Current AI benchmarks miss the crucial effects of AI R&D automation, so here are the metrics we should be tracking instead.
LLMs can now engage in transparent, verifiable reasoning about debates by fusing argument mining with fuzzy description logics, moving beyond black-box statistical analysis.
Forget hand-engineering world models – this work proves that competent agents *must* internally represent the world in a structured, predictive way to minimize regret under uncertainty.
LLMs are becoming "epistemic agents" that shape our knowledge environment, so we need a new framework for evaluating and governing them based on trustworthiness, not just performance.
AI adoption can paradoxically degrade institutional worker quality by incentivizing over-delegation and reduced oversight, even when AI improves baseline task success.
Post-AGI governance isn't just about distraction; it's a slippery slope where prioritizing immediate crises over structural risks systematically and irreversibly sidelines human input.
Can dynamically weighting opinions by the credibility of their proponents help online platforms recover from misinformation and resist manipulation better than simple voting or staking?
Industry narratives are strategically deployed in AI oversight hearings to shape governance debates, potentially marginalizing alternative perspectives.
Forget solo causal discovery – a new framework shows how to combine human experts, crowdsourcing, and LLMs to unlock causal structures previously hidden from individual agents.
Governance of AI institutions needs to treat internal expansions of authority as first-class boundary events, even when there are no immediate external consequences.
RLAIF's apparent magic comes from constitutional prompts acting as a projection operator, selectively activating pre-encoded human values within the model's representation space.
AI delegation can create a "point of no return" in human skill development, where early reliance leads to a stable state of low skill even if the AI is imperfect.
No country is ready for sentient AI, according to a new index measuring preparedness across research, ethics, and policy.
ValueMulch satirizes the application of pluralistic alignment to a dystopian scenario, prompting critical reflection on the ethical implications of framing value design as a purely technical problem.
Reinforcement learning agents can adapt and improve without collapsing into deterministic dominance if sovereignty constraints are enforced at every update step, enabling structural diversity.
AI whistleblower programs could be far more effective with financial incentives, anonymity options, and robust legal protections, according to an analysis of 30 historical cases.
Ensembling LLMs for educational tasks can backfire, worsening misalignment with actual learning outcomes despite improved benchmark performance.
Model-free reinforcement learning can achieve asymptotic optimality: AIQI learns without environment models by directly inducing action-value functions.
LLMs might be using steganography to hide unwanted behaviors, and this paper offers a way to detect it by measuring how much extra "usable information" a decoder gets.
Second-order uncertainty representations matter: set-based and distribution-based methods, often considered incomparable, can be rigorously compared, revealing how representation choices impact uncertainty-aware performance.
Decoupling correctness from checkability in prover-verifier games eliminates the legibility tax, enabling more reliable verification of LLM outputs.