Elias Stengel-Eskin

Papers on Lattice

Total citations

Topics

h-index

Publication activity (papers/week, last 8 weeks)

Research focus

Reasoning & Chain-of-Thought (6), Eval Frameworks & Benchmarks (5), Multimodal Models (4), Tool Use & Agents (3)

Frequent co-authors

Mohit Bansal (4), Justin Chih-Yao Chen (2), Hyunji Lee (2), David Wan (2)

Papers (11)

Jun 24, 2026

Atin Pothiraj +4 · 2w ago · also AI2

Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation

PQSG reveals that state-of-the-art video generation models often misrepresent physical laws, with significant implications for the realism of generated content.

Atin Pothiraj, Jaemin Cho, Yue Zhang +2

Computer Vision, Eval Frameworks & Benchmarks, Multimodal Models

Jun 19, 2026

3w ago

CalVerT: Augmenting Agents with Calibrated Verifier Telemetry Improves Action and Learning in Knowledge-Intensive Tasks

CalVerT boosts QA performance by equipping agents with calibrated self-confidence and grounding scores, reducing both erroneous confident answers and unnecessary information retrieval.

Ashwin Vinod, Ying Ding, Elias Stengel-Eskin

Reasoning & Chain-of-Thought, Tool Use & Agents

Jun 17, 2026

3w ago

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

Counterfactual reasoning yields a 5.5% accuracy boost in pragmatic language understanding for LLMs, challenging their traditional reliance on literal interpretations.

Jihyung Park, Minchao Huang, Leqi Liu +1

Natural Language Processing, Reasoning & Chain-of-Thought

Jun 9, 2026

A History-Aware Visually Grounded Critic for Computer Use Agents

HiViG outperforms existing critics by integrating historical context and visual grounding, achieving up to 9% higher success rates in complex GUI tasks.

Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen +6

Multimodal Models, Tool Use & Agents

Apr 30, 2026

Saeid Asgari Taghanaki +15 · also Microsoft Research

Diagnosing Capability Gaps in Fine-Tuning Data

Stop wasting compute on fine-tuning datasets with hidden capability gaps: GoalCover lets you diagnose and fix them *before* training.

Saeid Asgari Taghanaki, Raksha Agarwal, Rakshanda Agarwal +13

Data Curation & Synthetic Data, Eval Frameworks & Benchmarks, Natural Language Processing

Apr 26, 2026

also UNC

A Novel Approach to Evaluating the Effectiveness of Large Language Models for Multimodal Analysis of Embodied Learning in Classrooms

Findings position LLMs as effective late-fusion mechanisms for multimodal learning analytics and demonstrate the viability of LLM-as-a-Judge for scalable, human-in-the-loop evaluation.

Joyce Horn Fonteles, Nithin Sivakumaran, Clayton Cohn +6

Apr 15, 2026

David Wan +6 · also UNC

MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments

Even the best search-augmented agents, like Gemini Deep Research, are easily distracted by noisy web content, leading to surprisingly poor performance (40% accuracy) on a new multimodal reasoning benchmark.

David Wan, Hyunji Lee, Mikaela Cankosyan +4

Eval Frameworks & Benchmarks, Multimodal Models, Reasoning & Chain-of-Thought

Apr 13, 2026

Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

Even with ToM prompting, today's LLMs can be easily fooled in simple privacy games, but RL-trained "double agents" learn to effectively mislead attackers by modeling their beliefs.

Hanqi Xiao, Vaidehi Patil, Zaid Khan +3

Constitutional AI & AI Ethics, Reasoning & Chain-of-Thought, Red-Teaming & Adversarial Robustness

Apr 6, 2026

also IIT Bombay

Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems

LLMs can learn to solve previously intractable reasoning problems by training on adaptively-reformulated, cognitively simpler versions of the same tasks.

J. Chen, Justin Chih-Yao Chen, Archiki Prasad +5

Reasoning & Chain-of-Thought, RLHF & Preference Learning, Training Efficiency & Optimization

Feb 26, 2026

Evaluating Stochasticity in Deep Research Agents

DRA outputs are surprisingly variable, with inference and early-stage decisions being the biggest culprits, but structured outputs and ensemble querying can significantly reduce this stochasticity.

Haotian Zhai, Elias Stengel-Eskin +4

Eval Frameworks & Benchmarks, Scientific Discovery & Drug Design, Tool Use & Agents

Feb 12, 2026

also UT Austin

Multimodal Fact-Level Attribution for Verifiable Reasoning

Even state-of-the-art multimodal LLMs struggle to accurately cite their sources when reasoning across video, audio, and text, often hallucinating citations despite generating correct answers.

David Wan, Ziyang Wang, Elias Stengel-Eskin +2