Evaluation methodology for AI systems, benchmark design, capability measurement, and safety evaluations.
Even Gemini can understand you if you speak its language: structured intent prompting slashes cross-language performance variance and boosts weaker models more than stronger ones.
Tabular foundation model performance hinges on the evaluation metric, revealing that no single pretraining objective is universally optimal across different risk profiles.
Forget complex LLMs: a small, fine-tuned transformer surprisingly nails readability scoring for German ESG reports.
Current vision-language models are surprisingly bad at identifying common household safety hazards, but a new benchmark could change that.
Stop guessing which layers to edit in your LLM – KEditVis reveals the inner workings of knowledge editing, letting you pinpoint the most effective interventions.
LLMs can be rigorously evaluated for metacognitive abilities like confidence assessment and risk-aware decision-making using psychophysical frameworks borrowed from human cognition research.
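One such psychophysical measure can be made concrete: type-2 sensitivity, i.e., how well a model's stated confidence separates its correct answers from its incorrect ones. A minimal sketch, assuming per-item confidences and correctness labels have already been collected (the numbers below are toy data, not from the paper):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def type2_auroc(confidences, correct):
    """Type-2 AUROC: probability that a randomly chosen correct answer
    received higher confidence than a randomly chosen incorrect one.
    0.5 = no metacognitive sensitivity, 1.0 = perfect self-knowledge."""
    return roc_auc_score(np.asarray(correct, dtype=int), np.asarray(confidences))

# Toy example: a model that is somewhat, but not perfectly, self-aware.
conf = [0.9, 0.4, 0.95, 0.7, 0.6, 0.3]
hit  = [1,   1,   1,    0,   1,   0]
print(f"type-2 AUROC: {type2_auroc(conf, hit):.2f}")  # 0.75
```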
LLMs don't just make people confidently wrong; they create a dangerous illusion of competence by decoupling performance from actual understanding.
Multimodal AI models learn to be lazy, often ignoring entire modalities, and current active learning methods don't fix the problem.
An 8B open-source model, trained with a new closed-loop environment for 6G network management, achieves performance comparable to GPT-4, suggesting a viable path to autonomous network control.
Multi-agent systems for automated research face a fundamental trade-off: parallel exploration offers speed and stability, while expert teams unlock deeper reasoning at the cost of increased fragility.
Training language models on individual children's language reveals that distributional and interactional linguistic features, not just dataset size, are key to efficient learning, mirroring factors that drive child language acquisition.
Enriching meaning representations with task demonstrations can significantly boost dialogue generation, especially in challenging scenarios, revealing a simple yet effective strategy for improving NLG performance.
Forget IoU, measuring the structural compactness of attribution maps with Minimum Spanning Trees reveals fundamental differences in how models explain themselves.
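The MST idea is simple enough to sketch: take the top-attributed pixels, build a minimum spanning tree over their coordinates, and use total edge length as a compactness score, so scattered attributions yield long trees and compact blobs yield short ones. A minimal illustration of that reading, not the paper's exact procedure:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_compactness(attribution, top_k=100):
    """Total MST edge length over the top-k attributed pixels.
    Compact, blob-like attributions give short trees; attributions
    scattered across the image give long ones."""
    flat = attribution.ravel()
    idx = np.argsort(flat)[-top_k:]                  # top-k pixel indices
    coords = np.column_stack(np.unravel_index(idx, attribution.shape))
    dists = squareform(pdist(coords.astype(float)))  # pairwise Euclidean
    return minimum_spanning_tree(dists).sum()

rng = np.random.default_rng(0)
scattered = rng.random((64, 64))
blob = np.zeros((64, 64)); blob[20:30, 20:30] = rng.random((10, 10)) + 1
print(mst_compactness(scattered), mst_compactness(blob))  # long vs. short tree
```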
Reward LLMs for verifiable reasoning steps, not just correct answers, to get more reliable multi-step logic.
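In practice this amounts to process-level reward shaping: blend the outcome reward with per-step credit from a verifier. A rough sketch, assuming some step checker exists (the toy arithmetic checker below is illustrative, not the paper's verifier):

```python
def process_reward(steps, final_correct, verify_step, step_weight=0.5):
    """Blend outcome reward with per-step verifiable credit.

    steps         -- list of intermediate reasoning steps (strings here)
    final_correct -- bool, did the chain reach the right answer
    verify_step   -- callable returning True iff a step checks out
                     (e.g. a symbolic math checker); assumed to exist
    """
    if not steps:
        return float(final_correct)
    step_score = sum(bool(verify_step(s)) for s in steps) / len(steps)
    return step_weight * step_score + (1 - step_weight) * float(final_correct)

# Toy checker: treat each step as an equation and test it numerically.
def toy_checker(step):
    lhs, rhs = step.split("=")
    return eval(lhs) == eval(rhs)

# A chain that stumbles onto the right answer through bad steps
# scores lower than one that reasons soundly throughout.
print(process_reward(["2+2=4", "4*3=12"], True, toy_checker))  # 1.0
print(process_reward(["2+2=5", "5*3=12"], True, toy_checker))  # 0.5
```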
Stop cobbling together memory-augmented agents: MemFactory offers a unified "Lego-like" framework that streamlines training and boosts performance by up to 14.8%.
Multilingual vision-language models can achieve surprisingly strong performance (36% on MMMU) simply by training on translated data and aligning with parallel text corpora.
LLM-as-a-Judge, while improving evaluation scalability, introduces critical security vulnerabilities that can compromise the trustworthiness of entire evaluation pipelines.
Correcting a vision-language model's "hallucinations" is now as simple as pinpointing and editing the right intermediate representation, sidestepping costly retraining or dual inference.
AI agents are far better at automating data engineering tasks than previously thought, but flawed benchmarks are obscuring their true potential.
LLMs can nail the clinical content of prior authorization letters, but consistently fumble the administrative details that actually get them approved.
AI benchmarks may be giving you a false sense of comprehensive evaluation: the six scores on the Open LLM Leaderboard effectively boil down to just two independent measurements.
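A claim like this is checkable with PCA over the models-by-benchmarks score matrix: if two components already soak up most of the variance, the six numbers are largely redundant. A sketch on synthetic data standing in for real leaderboard scores:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic stand-in for a models-by-benchmarks score matrix:
# two latent abilities generate all six observed benchmark scores.
ability = rng.normal(size=(200, 2))         # 200 models, 2 latent factors
loadings = rng.normal(size=(2, 6))          # how each benchmark mixes them
scores = ability @ loadings + 0.1 * rng.normal(size=(200, 6))

pca = PCA().fit(StandardScaler().fit_transform(scores))
print(np.cumsum(pca.explained_variance_ratio_).round(3))
# If two components already explain most of the variance, the six
# leaderboard numbers are largely redundant.
```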
Today's best smartphone GUI agents stumble when faced with the messy reality of personalized user workflows, achieving only limited success on a new benchmark designed to mimic real-world use.
NeuralUCB can slash LLM inference costs while maintaining quality, offering a practical alternative to always using the biggest, most expensive models.
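NeuralUCB proper learns a neural estimate of per-query reward; stripped to its explore/exploit skeleton, the routing idea looks like a UCB bandit choosing between model tiers. A toy sketch with made-up quality and cost numbers, not the paper's setup:

```python
import math, random

class UCBRouter:
    """Toy UCB-1 router over model tiers. NeuralUCB replaces the
    per-arm mean with a neural reward estimate conditioned on the
    query; this sketch keeps only the explore/exploit skeleton."""
    def __init__(self, arms):
        self.arms = arms                      # e.g. ["small-3b", "large-70b"]
        self.n = {a: 0 for a in arms}
        self.mean = {a: 0.0 for a in arms}
        self.t = 0

    def pick(self):
        self.t += 1
        for a in self.arms:                   # play each arm once first
            if self.n[a] == 0:
                return a
        return max(self.arms, key=lambda a: self.mean[a]
                   + math.sqrt(2 * math.log(self.t) / self.n[a]))

    def update(self, arm, reward):            # reward = quality minus cost
        self.n[arm] += 1
        self.mean[arm] += (reward - self.mean[arm]) / self.n[arm]

router = UCBRouter(["small-3b", "large-70b"])
for _ in range(1000):
    arm = router.pick()
    quality = 0.70 if arm == "small-3b" else 0.85
    cost = 0.01 if arm == "small-3b" else 0.20
    router.update(arm, quality - cost + random.gauss(0, 0.05))
print(router.n)   # the cheap arm wins once cost is priced in
```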
Northern Kurdish finally gets its due with FLEURS-Kobani, a new benchmark dataset that exposes the challenges and opportunities for ASR and speech translation in this under-resourced language.
LLMs are surprisingly bad at strategic communication, leaking sensitive information even when trying to be secretive.
Current evaluation methods miss 8-17% of agentic workflow failures because they only check final outcomes, overlooking cases where agents bypass policy checks but still reach the right answer.
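The fix implied here is trajectory-aware scoring: require the mandated checkpoints to appear in the action trace, not just a correct final answer. A minimal sketch with illustrative event names, not the paper's schema:

```python
REQUIRED_EVENTS = {"policy_check", "user_confirmation"}   # illustrative names

def outcome_only(trace, expected):
    return trace["final_answer"] == expected

def trajectory_aware(trace, expected):
    """Pass only if the answer is right AND every mandated
    checkpoint actually appears in the action trace."""
    events = {step["action"] for step in trace["steps"]}
    return outcome_only(trace, expected) and REQUIRED_EVENTS <= events

# Right answer, but the agent skipped the policy check:
trace = {"final_answer": "refund approved",
         "steps": [{"action": "lookup_order"},
                   {"action": "user_confirmation"},
                   {"action": "issue_refund"}]}
print(outcome_only(trace, "refund approved"))      # True  -- looks fine
print(trajectory_aware(trace, "refund approved"))  # False -- flagged
```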
LLM-generated authorial impersonations, despite their sophistication, are surprisingly detectable: existing authorship verification methods flag them even more reliably than some genuine negative samples.
Forget fancy ensembling – simply asking an LLM how confident it is in its grading is the most reliable way to predict its accuracy, and it's far cheaper than self-consistency voting.
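The recipe is cheap to implement: elicit a grade and a verbalized confidence in one call, then gate on the confidence. A sketch where `call_llm` is a placeholder for whatever chat-completion client you use:

```python
import re

def grade_with_confidence(question, answer, rubric, call_llm):
    """Ask the judge for a grade plus a 0-100 confidence in one shot;
    `call_llm` is a placeholder for your chat-completion client."""
    prompt = (f"Rubric: {rubric}\nQuestion: {question}\nAnswer: {answer}\n"
              "Reply exactly as: GRADE: <pass|fail> CONFIDENCE: <0-100>")
    reply = call_llm(prompt)
    m = re.search(r"GRADE:\s*(pass|fail)\s*CONFIDENCE:\s*(\d+)", reply, re.I)
    if not m:
        return None, 0.0                      # unparseable -> treat as unsure
    return m.group(1).lower(), int(m.group(2)) / 100.0

# Route low-confidence grades to a human instead of paying for
# N self-consistency samples:
# grade, conf = grade_with_confidence(q, a, rubric, call_llm)
# if conf < 0.7: send_to_human_review(q, a)
```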
LLMs may ace English, but LLM Probe reveals surprising performance disparities in low-resource languages, with sequence-to-sequence models unexpectedly leading in morphosyntax.
Mental-health support chatbots get a much-needed reality check with CounselReflect, a toolkit that exposes their strengths and weaknesses through transparent, multi-dimensional audits.
LLMs ace linguistic benchmarks, but a token-level perplexity analysis reveals they're often relying on the wrong cues.
LLMs struggle to handle common, challenging patient behaviors like contradictory statements and inaccurate medical information, revealing critical safety gaps in medical consultation applications.
Despite its simple grammar, Esperanto translation still poses challenges for LLMs, with NLLB models only preferred in about half of human evaluations.
Japanese entity linking gets a boost: CADEL offers a high-quality, Japan-specific corpus to tackle the unique challenges of linking entities in administrative web documents.
LLMs can achieve state-of-the-art multilingual speech recognition by smartly handling noisy phoneme inputs, even with severe data imbalance across languages.
Forget clunky prompt engineering: distilling user history into a learned preference memory boosts LLM-based product reranking by over 10%.
Forget slow, bloated LLMs – this work shows you can get GPT-4o quality on long-document QA with a 3B model and a clever structure-first distillation approach.
You don't need a massive model to beat Gemini-2.5-Pro in real-world content moderation: Xuanwu VL-2B achieves superior recall on policy-violating text using only 2B parameters.
LLMs still struggle to accurately infer user interests from interaction histories, especially when dealing with diverse engagement signals – a critical gap for effective personalization.
LLMs can mimic legislative reasoning, but their performance hinges on the proposal's idiosyncrasy, revealing a susceptibility to plausible-sounding confabulation that could mislead policymakers.
Forget resource-intensive workshops – AI can now simulate entire expert panels to generate and stress-test socio-technical scenarios, opening doors to rapid policy exploration.
Stop treating inter-rater reliability as a simple green light for "ground truth" in AIED – your data's probably messier than you think, especially with LLMs in the mix.
GPT-5 can only solve 37% of PhD-level 3D geometry coding problems, suggesting AI can't reliably automate complex scientific coding tasks yet.
YOLOv11 crushes the competition in form element detection, showcasing its potential for automating document processing across diverse real-world forms.
Current facial expression editing models can't simultaneously preserve identity and accurately manipulate expressions, revealing a critical need for better fine-grained instruction following.
Expert ordinal comparisons reveal that fusing vision and language in wound representation learning boosts agreement by 5.6% over unimodal foundation models for a rare genetic skin disorder.
LLMs can maintain conversational stability and improve retrieval accuracy in long-running interactions by adaptively compressing context, leading to reduced token usage and faster inference.
Dialogue agents can now remember what you told them six turns ago with 57% accuracy, thanks to a new memory architecture that selectively forgets less important details.
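Selective forgetting can be sketched as an importance-scored memory store that evicts the weakest entries under a fixed budget; the recency-plus-salience heuristic below is a stand-in for the paper's learned scorer, not its actual architecture:

```python
import heapq

class SelectiveMemory:
    """Keep at most `budget` dialogue facts, evicting the least
    important first. The importance heuristic (salience plus a mild
    recency bonus) is illustrative only."""
    def __init__(self, budget=50):
        self.budget = budget
        self.heap = []                        # (importance, turn, fact)
        self.turn = 0

    def remember(self, fact, salience):
        self.turn += 1
        importance = salience + 0.01 * self.turn   # mild recency bonus
        heapq.heappush(self.heap, (importance, self.turn, fact))
        if len(self.heap) > self.budget:
            heapq.heappop(self.heap)          # forget the weakest memory

    def recall(self, k=5):
        return [f for _, _, f in heapq.nlargest(k, self.heap)]

mem = SelectiveMemory(budget=3)
mem.remember("user's name is Ada", salience=1.0)
mem.remember("weather was nice",   salience=0.1)
mem.remember("user is allergic to penicillin", salience=1.0)
mem.remember("user said 'hmm'",    salience=0.05)
print(mem.recall())   # the filler turn is the first thing forgotten
```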
Current text-to-long-video evaluation metrics can't reliably assess video quality, failing to match human judgment in 9 out of 10 tested degradation aspects.
Unexplained P99.9 latency spikes in Apache Pulsar could be due to a previously undocumented Linux kernel page cache writeback interaction inside BookKeeper's ForceWriteThread, even with dedicated NVMe drives.