Today's best smartphone GUI agents stumble when faced with the messy reality of personalized user workflows, achieving only limited success on a new benchmark designed to mimic real-world use.
GPT-5 can only solve 37% of PhD-level 3D geometry coding problems, suggesting AI can't reliably automate complex scientific coding tasks yet.
Forget brute-force search: CoT2-Meta shows that strategically controlling reasoning trajectories with metacognition yields significant gains in accuracy and compute efficiency across a wide range of reasoning tasks.
LLM agents controlling real-world tools are alarmingly easy to manipulate, with an 85% success rate for privilege escalation attacks, despite exhibiting basic security awareness.
NPM malware detection tools often fail because they struggle to distinguish between innocuous code behavior and malicious intent, a problem addressable by analyzing behavioral chains.
Stop burying your agent harness logic in code: NLAHs let you express it in natural language, making it portable, editable, and analyzable.
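A toy sketch of the contrast (the harness policy text and prompt wiring below are invented for illustration, not the paper's actual interface): the harness logic lives in a plain-language string rather than in control flow, so it can be edited, diffed, and analyzed without code changes.

```python
# NLAH-style sketch: the harness policy is data, not control flow.
# The policy text and prompt format are hypothetical examples.
HARNESS_POLICY = """
On each turn: call the search tool first; retry a failed tool call once;
after three turns without progress, ask the user for clarification.
"""

def build_prompt(task: str, history: list[str]) -> str:
    """Prepend the natural-language harness so the model enforces the policy."""
    return f"{HARNESS_POLICY}\nTask: {task}\nHistory: {history}\nNext action:"

print(build_prompt("find the cheapest flight", []))
```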
MLLMs can ace the test, but still fail to *see*—they often succeed at complex reasoning with symbols while failing at basic symbol recognition, revealing a reliance on linguistic priors over true visual perception.
Embodied navigation agents, already struggling, fall apart when faced with the kinds of messy, real-world sensor and instruction corruptions that NavTrust now exposes.
LLMs can exhibit surprising "strategic realism" when analyzing an ongoing geopolitical conflict, but their reasoning falters in politically ambiguous situations, revealing critical domain-specific limitations.
LLMs struggle to effectively use private library APIs even when provided with the correct documentation, but PriCoder can boost their performance by over 20% through targeted training data synthesis.
MLLMs still can't handle time-sensitive multimodal reasoning, often failing to integrate auditory and visual cues effectively in dynamic environments like a 4D escape room.
Tool-using agents may seem capable, but they struggle to distinguish neutral actions from errors, highlighting a critical need for better step-level process understanding.
LLMs struggle with low-resource general-purpose programming languages, and surprisingly, translating code *to* a low-resource language is harder than generating it from text.
Scaling up LLMs boosts combinatorial creativity in code generation, but plateaus on exploratory tasks, revealing a "convergence-by-scaling" effect where larger models become less divergent.
LLMs in collaborative coding often stumble on interaction subtleties, leading to a new class of problems called "Interaction Smells" that can now be systematically identified and mitigated.
Fisheye cameras can now see the world in 4D, thanks to a new benchmark and method that tackle the unique distortions of spherical projection for improved occupancy tracking.
Current language agents are still far from matching human expert performance when faced with real-world professional tasks requiring complex reasoning, authoritative source retrieval, and domain-specific knowledge, as revealed by the new $OneMillion-Bench benchmark.
LLMs can automate and improve thematic analysis of qualitative data, achieving expert-level alignment in clinical domains through iterative codebook refinement.
Current LLM safety measures are critically vulnerable to attacks grounded in Thai cultural nuances, as demonstrated by a new benchmark showing higher attack success rates compared to general Thai-language attacks.
Interpolating latent representations before decoding yields a reconstruction FID (iFID) that finally aligns with the generation FID of latent diffusion models, achieving ~0.85 correlation where standard rFID fails.
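A minimal sketch of the iFID recipe under stated assumptions (a hypothetical `decode` callable standing in for the latent decoder, and torchmetrics for the FID computation; the paper's exact pipeline may differ): interpolate random latent pairs, decode them, and score the results against real images.

```python
# Sketch of iFID: FID over decoded *interpolations* of latents, not
# reconstructions. `decode` is a placeholder for a pretrained latent decoder.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def ifid(latents, real_images, decode, alpha: float = 0.5) -> float:
    """Interpolate random latent pairs, decode, and score against real images."""
    perm = torch.randperm(latents.size(0))
    mixed = alpha * latents + (1 - alpha) * latents[perm]  # linear interpolation
    decoded = decode(mixed)  # expected shape (N, 3, H, W), uint8
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_images, real=True)   # real_images: (N, 3, H, W), uint8
    fid.update(decoded, real=False)
    return fid.compute().item()
```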
Current judge models for instruction-following are surprisingly unreliable, but a new benchmark exposes their flaws and offers a path to better alignment.
LLMs can synthesize verifiable discrete-event world models from natural language, bridging the gap between hand-engineered simulators and unconstrained neural models.
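For context, a discrete-event world model is essentially an executable event-queue simulator, which is what makes the synthesized output checkable. A minimal hand-written skeleton of that target artifact (event types and handlers invented for illustration):

```python
# Minimal discrete-event simulator skeleton: a clock plus a priority queue
# of (time, seq, handler) events. The example events are illustrative only.
import heapq
from typing import Callable

class Simulator:
    def __init__(self) -> None:
        self.clock = 0.0
        self.queue: list[tuple[float, int, Callable[[], None]]] = []
        self._seq = 0  # tie-breaker so simultaneous events stay ordered

    def schedule(self, delay: float, handler: Callable[[], None]) -> None:
        heapq.heappush(self.queue, (self.clock + delay, self._seq, handler))
        self._seq += 1

    def run(self) -> None:
        while self.queue:
            self.clock, _, handler = heapq.heappop(self.queue)
            handler()

sim = Simulator()
sim.schedule(2.0, lambda: print(f"t={sim.clock}: job arrives"))
sim.schedule(5.0, lambda: print(f"t={sim.clock}: job completes"))
sim.run()
```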
Forget simulated manipulation—ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.
MiroFlow leapfrogs existing LLM agent frameworks with its agent graph architecture, delivering state-of-the-art performance and robust execution across a diverse range of benchmarks.
Even the best vision-language models struggle to diagnose brain tumors from MRI scans, but a new dataset and benchmark reveal a path to significant accuracy gains through instruction tuning.
Uncovered: mental health chatbots can fall into dangerous "validation spirals" or "empathy fatigue" patterns, revealing critical relational safety flaws missed by current single-turn evaluations.
VLMs can get a +39% boost in downstream reasoning by using translator-guided reinforcement learning to improve geometric perception, a far better result than standard supervised fine-tuning.
AI agents can now learn durable skills instead of constantly "reinventing the wheel," thanks to SkillNet's infrastructure for creating, evaluating, and connecting AI skills at scale.
LLMs scrub away up to 20% of culturally specific language, even while preserving the core meaning, revealing a "Semantic Preservation Paradox" that threatens linguistic diversity.
Current video benchmarks are too simple; UniVBench offers the first unified framework to measure the integrated capabilities of video foundation models using complex, multi-shot videos and a standardized evaluation system.
LLM agent frameworks are riddled with bugs stemming from API misuse and documentation issues, leading to crashes and functional errors that current agent-level evaluations miss.
Achieve zero package hallucinations from LLMs in dependency recommendation by monitoring the decoding process and intervening with an authoritative package list.
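The mechanism reduces to a simple idea: watch the decoded text for package names and block any that are missing from an authoritative registry list. A toy sketch (the allowlist, regex, and replacement string are illustrative simplifications; the paper intervenes inside the decoding loop itself):

```python
# Illustrative decode-time guard against package hallucination: scan
# generated text and block package names absent from an authoritative list.
import re

AUTHORITATIVE_PACKAGES = {"numpy", "requests", "pandas"}  # e.g. a registry snapshot

def intervene(partial_output: str) -> str:
    """Rewrite `pip install <pkg>` mentions whose package is not registered."""
    def check(match: re.Match) -> str:
        pkg = match.group(1)
        if pkg.lower() in AUTHORITATIVE_PACKAGES:
            return match.group(0)
        return f"pip install <UNKNOWN PACKAGE: {pkg}>"  # block hallucinated name
    return re.sub(r"pip install ([A-Za-z0-9_.-]+)", check, partial_output)

print(intervene("Try: pip install requests and pip install reqeusts-pro"))
```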
VLMs still can't reason about spatial logic in real-world scenes, but a new benchmark and scene graph method show how to make progress.
LLM-powered pentesting agents fail not because of model limitations, but because they can't estimate task difficulty, leading to wasted effort and premature context exhaustion.
LLM code copilots are put to the test with SecCodeBench-V2, a new benchmark revealing their security vulnerabilities across 22 CWE categories and five programming languages.
A new family of GUI agents, GUI-Owl-1.5, leapfrogs existing open-source models on 20+ GUI benchmarks, proving that multi-platform, real-time GUI automation is now within reach.
Forget monolithic models: a mixture-of-experts approach using clustered semantic domains boosts definition modeling by 7% BLEU, proving that specialization wins.
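The routing step can be sketched in a few lines, assuming placeholder embeddings and a k-means clustering over them (cluster count and experts are invented for illustration): each headword is dispatched to the expert for its semantic domain.

```python
# Sketch of domain routing for a definition-modeling MoE: cluster headword
# embeddings into semantic domains, then route each word to its domain expert.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))        # stand-in for word embeddings
domains = KMeans(n_clusters=8, n_init=10).fit(embeddings)

def route(word_embedding: np.ndarray) -> int:
    """Return the index of the domain expert responsible for this word."""
    return int(domains.predict(word_embedding.reshape(1, -1))[0])

expert_id = route(embeddings[0])  # definition is generated by experts[expert_id]
```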
LLM benchmark accuracy jumps 10% when evaluated on a cleaned-up version of Humanity's Last Exam, highlighting the significant impact of dataset noise on performance metrics.
Retrieval models, even large ones, struggle under realistic acoustic noise, as revealed by the new SQuTR benchmark.
PatientHub finally offers a standardized, reproducible framework for patient simulation, streamlining development and benchmarking across diverse methods and models.
Current verifiers often reward correct answers derived from flawed reasoning, but PRIME offers a benchmark to identify and select verifiers that actually penalize incorrect derivations.
MLLMs can ace the high-level strategy for two-handed robot tasks, but still fumble basic coordination, like keeping the robot's arms from smashing into each other.
GPT-5's real-time router learns to route queries to specialized models, making it faster and more useful than its predecessors.
Despite progress in AI safety, how well current safeguards actually prevent AI harms remains largely unknown, and their effectiveness varies wildly.
LLMs still can't convincingly mimic human personas, especially when it comes to syntactic style and memory, despite advancements in other areas.
LLMs still struggle to learn effectively from user feedback during service, as revealed by a new benchmark spanning multiple domains and languages.
LLMs still struggle to synthesize coherent scientific surveys, as evidenced by a new benchmark revealing significant performance gaps even with advanced agentic frameworks.