Existing zero-shot multimodal information extraction models struggle with real-world scenarios containing both seen and unseen categories; this work addresses the gap by modeling hierarchical semantic relationships in hyperbolic space and aligning semantic similarity distributions.
A lightweight VLA with deep state space models lets robots outperform larger models at language-guided manipulation while running 3x faster.
GPT-5-Mini can be made 10% more robust to jailbreaks and prompt injections simply by RL fine-tuning on a new instruction hierarchy dataset, IH-Challenge.
By pinpointing the causal origins of tool use, AttriGuard neutralizes indirect prompt injection attacks that can hijack LLM agents, even when faced with adversarial optimization.
Nail design retrieval gets a major upgrade: NaiLIA leverages dense intent descriptions and palette queries to outperform standard methods, opening the door to more nuanced and personalized image search.
Forget difficulty-based heuristics: InSight leverages weighted mutual information to select RL training data, boosting LLM reasoning and alignment with up to 2.2x speedup.
LLMs harbor surprisingly consistent hidden beliefs on sensitive topics like mass surveillance and torture, even when direct questioning suggests otherwise.
LLMs struggle to understand nuanced values across languages, with accuracy dropping below 77% and varying by over 20% between languages, as revealed by the new X-Value benchmark.
Cutting LLMs' reasoning token budget can backfire spectacularly, tanking performance even below that of models with *no* reasoning at all.
Achieve competitive image-text fact checking at just $0.013 per check by combining RAG with reverse image search, using a surprisingly simple and reproducible architecture.
Fine-tuning LLMs on datasets filtered at the token level, rather than the sentence level, can boost performance by up to 13.7%.
Speech recognition models stumble badly on real-world street names, especially for non-English speakers, but a simple synthetic data boost can dramatically improve accuracy.
Finally, a single 3D medical vision-language model that nails both high-level reasoning (report generation, VQA) and fine-grained segmentation from language, point, or box prompts.
Claude 2 can match the performance of top medical specialists on pulmonary thromboembolism knowledge assessments, suggesting AI's potential for clinical decision support.
Despite their promise, even the best multimodal LLM (GPT-4o) achieves only 26% accuracy in grading knee osteoarthritis from radiographs, revealing a significant gap in clinical reliability.
LLMs still struggle to reliably produce accurate Islamic content and citations, despite relatively strong performance, revealing a critical gap in faith-sensitive AI writing.
AI-generated feedback on student portfolios from GPT-4o and Claude-Sonnet-4 shows promise for high-stakes clinical assessments, but careful evaluation is needed to ensure accuracy and educational value.
Open-weight reasoning models now rival proprietary systems in agentic capabilities and benchmark performance, thanks to gpt-oss-120b and gpt-oss-20b.
Current LLMs fall far short of supporting holistic human well-being: even the best models score no higher than 72/100 on a new Flourishing AI Benchmark, performing worst in areas like Faith and Spirituality.
An LLM-powered smart tutor isn't just another homework helper; it's a real-time feedback loop for instructors, revealing student struggles and enabling more effective teaching.
ChatGPT-4 slashes data extraction time in scoping reviews by 66%, but don't ditch the human reviewers just yet.
LLMs can generate plain language summaries of scientific research that are as good as human-written ones, but easier to read.