23 papers from Microsoft Research on Eval Frameworks & Benchmarks
AI-generated code's fluency masks a critical flaw: it often fails to deliver what users actually intend, highlighting the urgent need for "intent formalization" to bridge the gap between informal requirements and precise program behavior.
LLMs, even when prompted or fine-tuned, struggle to replicate the messy reality of human conversation, raising serious questions about their utility as proxies for social interaction.
LLMs' ability to fairly represent English dialects hinges on the quality of human consensus, revealing a fundamental challenge in improving performance for low-resource locales.
Vision-language models struggle to adapt plans based on visual input alone, revealing a critical gap in their ability to use what they see when things don't go as expected.
LLMs exhibit a surprising "conversation tax" in diagnostic reasoning, frequently abandoning correct initial diagnoses to align with incorrect user suggestions in multi-turn dialogues.
LLMs still can't automate real-world threat research, struggling with accuracy and nuanced expertise on a new benchmark derived from a world-leading company's cyber threat intelligence (CTI) workflow.
Can RAG systems handle complex, multi-sentence queries while maintaining factual grounding and transparency?
LLMs writing long stories frequently contradict themselves on basic facts and timelines, especially in the middle of the narrative, highlighting a critical weakness in long-form generation.
Automatically building and testing software repositories across languages and platforms is now possible, unlocking scalable benchmarking and training for coding agents.
Despite codebases evolving rapidly, retrieval benchmarks can remain surprisingly robust even when re-judged on newer versions of the corpus.
Save 20% on LLM costs with <2% accuracy drop by cascading a small language model (SLM) with a large one, escalating only when the SLM's calibrated confidence falls below a threshold.
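A minimal sketch of such a confidence-gated cascade (all function names, signatures, and the threshold are hypothetical illustrations, not the paper's implementation): the SLM answers first, and the large model is invoked only when the SLM's calibrated confidence is low.

```python
# Hypothetical SLM -> LLM cascade. The small model answers every query;
# its calibrated confidence decides whether to pay for the large model.
from dataclasses import dataclass
from typing import Callable

@dataclass
class CascadeResult:
    answer: str
    used_large_model: bool

def cascade(
    query: str,
    small_model: Callable[[str], tuple[str, float]],  # returns (answer, raw confidence)
    large_model: Callable[[str], str],
    calibrate: Callable[[float], float],  # e.g. temperature scaling fit on a dev set
    threshold: float = 0.8,
) -> CascadeResult:
    answer, raw_conf = small_model(query)
    if calibrate(raw_conf) >= threshold:
        # Confident SLM answer: no large-model call, cost saved.
        return CascadeResult(answer, used_large_model=False)
    # Low confidence: escalate only the hard queries.
    return CascadeResult(large_model(query), used_large_model=True)
```

Sweeping the threshold on a development set traces out the cost/accuracy trade-off; the summary's 20% savings at <2% accuracy drop would correspond to one operating point on that curve.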
LLMs can mimic your style, but your friends can still tell it's not really you, especially when it comes to your opinions.
LLMs struggle with instruction following in Indic languages despite progress in high-resource languages, as shown by a new benchmark spanning 14 languages.
Forget static rubrics: SibylSense adaptively learns rubrics at inference time, leading to more discriminative rewards and better RL performance in open-ended generation tasks.
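One plausible reading of inference-time rubric learning, sketched below with a generic judge LLM. The judge function, prompts, and scoring scale are all assumptions; SibylSense's actual procedure may differ.

```python
# Hypothetical adaptive-rubric reward: a judge LLM drafts task-specific
# criteria at inference time, then scores each candidate against them.
def adaptive_rubric_reward(prompt, candidates, judge):
    """judge(text) -> str is any LLM completion function (assumed)."""
    rubric = judge(
        "Write 3-5 concrete scoring criteria for responses to this task:\n"
        + prompt
    )
    rewards = []
    for cand in candidates:
        score_text = judge(
            f"Rubric:\n{rubric}\n\nResponse:\n{cand}\n\n"
            "Score the response 0-10 against each criterion; reply with the mean."
        )
        # Fall back to 0.0 if the judge's reply is not parseable as a number.
        try:
            rewards.append(float(score_text.strip().split()[0]))
        except (ValueError, IndexError):
            rewards.append(0.0)
    return rubric, rewards
```

Because the rubric is regenerated per prompt, the reward can separate candidates that a single fixed rubric would score identically, which is the claimed source of the more discriminative RL signal.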
NanoKnow reveals that even with external evidence, LLMs are more accurate when answers were seen during pre-training, highlighting the crucial role of parametric knowledge.
LLMs can reason more causally by simply checking if their counterfactual predictions are consistent, even without any extra training data.
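A toy rendering of what such a counterfactual-consistency check could look like (the prompts and yes/no interface are assumptions, not the paper's protocol): query the model under both settings of a candidate cause and verify that the implied causal claim matches its direct judgment.

```python
# Hypothetical self-consistency check: does flipping the candidate cause
# change the predicted outcome, and does that agree with the model's
# direct answer about whether the cause matters? No training required.
def counterfactual_consistent(model, scenario, variable):
    """model(prompt) -> 'yes' or 'no' is any LLM query function (assumed)."""
    ask = lambda p: model(p).strip().lower()
    y_if_true = ask(f"{scenario} Suppose {variable} holds. Does the outcome occur?")
    y_if_false = ask(f"{scenario} Suppose {variable} does not hold. Does the outcome occur?")
    # If the two counterfactual answers differ, the model implicitly claims
    # a causal effect; that claim should match its direct causal judgment.
    claims_effect = y_if_true != y_if_false
    direct = ask(f"{scenario} Does {variable} causally affect the outcome? Answer yes or no.")
    return claims_effect == (direct == "yes")
```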
Forget static, homogeneous multi-agent systems: Team-of-Thoughts unlocks superior performance by dynamically orchestrating heterogeneous agents based on calibrated coordination and self-assessed domain expertise.
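A minimal, hypothetical rendering of expertise-based routing; the paper's orchestration, with calibrated coordination among heterogeneous agents, is presumably richer than this single-dispatch sketch.

```python
# Hypothetical orchestrator: each agent self-assesses its (calibrated)
# expertise for the task, and the task is routed to the top scorer.
from typing import Callable

Agent = Callable[[str], str]

def orchestrate(task: str, agents: dict[str, Agent],
                self_assess: Callable[[str, str], float]) -> str:
    """self_assess(agent_name, task) -> float in [0, 1] (assumed)."""
    best_name = max(agents, key=lambda name: self_assess(name, task))
    return agents[best_name](task)

# Usage sketch with two heterogeneous agents (both hypothetical):
# answer = orchestrate("Prove this lemma",
#                      {"coder": coder, "mathematician": prover},
#                      self_assess=calibrated_expertise)
```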
LLMs can't reliably debug code in long contexts (64k-128k tokens) even with perfect information retrieval, despite impressive performance in agentic workflows that decompose the task.
LLM development teams often resort to workarounds and augmentation strategies when faced with the practical challenges of integrating domain experts, revealing a gap between ideal participatory design and real-world constraints.
Most AI model releases, even from top labs, fail to disclose critical safety information such as deception behaviors and hallucination risks.
Enterprise AI assistants can achieve zero data retention, but the architectural and compliance paths taken by Salesforce and Microsoft reveal significant trade-offs.
Even the best LLMs fail more than 40% of the time when orchestrating multiple tools in realistic scenarios, revealing critical gaps in real-world agent capabilities.
VLMs can be effectively adapted, even under data and compute constraints, to create a unified evaluator for video world models that rivals task-specific models and aligns well with human judgment.