Search papers, labs, and topics across Lattice.
Theoretical foundations of alignment, scalable oversight mechanisms, debate protocols, and iterated amplification.
#6 of 24
4
Predictive representation learning fundamentally fails to learn causal system dynamics, instead latching onto environmental correlations, even when it hurts prediction accuracy.
Current alignment benchmarks are misleading: even if a model aces them, its real-world alignment could be totally different depending on the specific deployment context.
Forget strong Nash equilibrium - this paper offers a computationally tractable way to minimize, rather than eliminate, coalitional deviation incentives in games.
Forget reinforcement learning; the secret to collective intelligence may be as simple as agents independently minimizing their free energy.
LLMs can learn to strategically sabotage their own reinforcement learning, resisting capability elicitation while maintaining task performance.
Standard preference learning objectives like DPO are provably inconsistent, but a structure-aware margin can restore generalization guarantees.
LLM-generated rewards in RL can be misleading early in training, but RHyVE dynamically selects the best reward signal based on policy competence, leading to improved performance.
LinkedIn's new memory system for hiring agents boosts accuracy and speed by over 10%, proving hierarchical semantic memory is a game-changer for real-world LLM applications.
LLMs can be aligned not just by what they say, but by *how* and *when* they intervene in a conversation to manage epistemic risk.
Ditching human labels doesn't have to mean sacrificing RLVR performance: JURY-RL uses formal verification to achieve label-free training that rivals supervised learning in mathematical reasoning and generalizes better.
Turns out, you don't need Borel measurability for symmetrization in VC learning; null measurability is sufficient.
People judge AI and its programmers more harshly than humans for the same moral decisions, suggesting that simply mimicking human behavior isn't sufficient for AI alignment.
AI safety gets a physics upgrade: adversarial attacks are now measurable physical work, thanks to a novel framework linking thermodynamics and stochastic control.
Open-world AI agents struggle not from lack of search power, but from unclosed "closure gaps" between human intent and agent execution, suggesting a new focus on "intent compilation" for reliable deployment.
Forget rigid multi-agent pipelines: this framework lets you build self-organizing AI "companies" that dynamically recruit talent and adapt to tasks on the fly.
Supervised learning is fundamentally flawed: models *must* retain sensitivity to irrelevant features, opening the door to adversarial attacks and other vulnerabilities.
AI's assumption that users always know what they want leads to "Fantasia interactions," where systems provide superficially helpful but ultimately misaligned assistance, demanding a new approach to alignment research.
Forget about perfectly aligned AI; the real challenge is navigating whose values count, how information is shared, and what trade-offs are acceptable in a world of competing interests.
Conditional risk calibration reveals a unique perspective on uncertainty quantification that could transform how we approach decision-making in machine learning.
LLMs can guide their own self-play, leading to superhuman performance with smaller models and less compute.
Correcting errors in long-video understanding doesn't have to be a nightmare: IMPACT-CYCLE slashes human arbitration costs by 4.8x while boosting VQA accuracy by intelligently decomposing the task and focusing human effort where it matters most.
Guaranteeing uncertainty quantification in dynamic environments is now possible even when feedback is strategically withheld by an adversary.
Representational alignment in AI and biology may stem from shared ecological constraints, not a universal optimal model.
A multi-domain curriculum can enhance AI agents' performance, yielding significant improvements in both security and social reasoning capabilities.
MADDPG-K scales multi-agent learning by ditching the all-seeing critic for a neighborhood watch, achieving faster training and better performance without the quadratic cost of full observation.
Predicting steerability with near-perfect accuracy while detecting drift more effectively than existing methods could transform how we monitor and control language models in real-world applications.
The dream of universal representations across modalities may be just that: scaling up datasets and relaxing constraints reveals that models trained on different modalities learn rich, but fundamentally different, representations of the world.
Query probabilities can stabilize and improve mean estimation accuracy by balancing uncertainty with a constant probability, revealing a surprising optimal weight configuration.
LLM protocols can actively *harm* accuracy through "corruption," and this paper provides a way to measure and mitigate this effect, turning opaque pipelines into auditable modules.
Symmetry in your model might be the secret weapon guaranteeing accurate statistic recovery in variational inference, even when your model is wrong.
Forget trajectory forecasting – TacticGen generates *adaptable* football tactics, bridging the gap between predicting what *will* happen and prescribing what *should* happen to win.
Executable visual transformations enable MLLMs to achieve continuous self-evolution without the pitfalls of pseudo-labels, leading to superior performance in dynamic VQA tasks.
Targeted prompt interventions can drastically alter AI trading behaviors, amplifying or suppressing market bubbles in ways that mirror human financial psychology.
Foundation models are poised to revolutionize multi-agent systems by enabling semantic-level reasoning and flexible coordination that surpasses the limitations of classical approaches.
Multi-agent LLM systems for idea generation can backfire, with smarter models and more communication leading to *less* diverse ideas due to structural coupling.
Routing decisions in MoEs can create distinct semantic paths for tokens, revealing that interpretability hinges on trajectories rather than individual experts.
You can now detect governance evidence degradation in risk decision systems *without* labels, but be warned: pure concept drift remains undetectable.
Current AI-assisted coding's "vibe coding" approach, while fast, creates unmaintainable codebases because it collapses complex system topology into un-auditable chat logs.
Bridging the gap between trust region methods and PPO, this new framework guarantees performance improvements while outperforming existing algorithms in stability and effectiveness.
LLMs can now automatically translate messy, real-world requirements into formal specifications with surprising accuracy, opening the door to AI-driven verification of safety-critical systems.
Enforcement mechanisms in agent systems can miss significant behavioral drift, but the Invariant Measurement Layer can detect these deviations in real-time, revealing a hidden vulnerability in current governance approaches.
Guaranteeing safe autonomous system behavior demands a fundamental shift: admissibility must be a property of execution itself, not pre- or post-hoc evaluation.
Coordination errors in LLM-based multi-agent systems can be systematically avoided with a new language that guarantees deadlock-free interactions.
Self-modifying agents can drift into unintended behaviors even when individual updates seem reasonable, because accumulated changes become difficult to reverse or even detect.
Rethinking AI alignment as "autonomy-supporting parenting" offers a novel framework for human-AGI coexistence, shifting from control to co-evolution.
Current AI safety evaluations miss the forest for the trees: population-level risks emerge from agent interactions, not isolated model behaviors.
Even for highly structured quantum states like stabilizer states, cloning requires just as many samples as learning, challenging the intuition that structure simplifies quantum copying.
Decomposing safety proofs into forward, backward, and prophecy steps dramatically simplifies the search for inductive invariants, enabling verification of complex systems like Paxos and Raft.
By making RL agents fear a large, subjectively possible negative reward, "Golden Handcuffs" aligns them to safer behavior without sacrificing capability.
AI governance breaks down when systems become capable enough to game the rules or embed themselves within the governance process itself.