Asaf Yehudai

Agents that excel on traditional benchmarks may crumble under the pressure of newly synthesized tasks, revealing the limitations of current evaluation methods.

Tomer Keren, Asaf Yehudai +2

Data Curation & Synthetic Data · Eval Frameworks & Benchmarks · Tool Use & Agents

Asaf Yehudai +2 · May 21, 2026

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Current LLM agent evaluation tools are stuck in the Stone Age, but Agentic CLEAR automates dynamic, multi-level analysis, finally offering insights that adapt to the rapidly evolving agent landscape.

Asaf Yehudai, Lilach Eden, Michal Shmueli-Scheuer

Eval Frameworks & Benchmarks · Tool Use & Agents

AI2 · Apr 14, 2026 · also MIT CSAIL, Faculty of Data and Decision Science, HUJI, IBM Research +1

Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration

Stop re-running full benchmarks: Calibrate new LLM datasets against existing suites with just 100 "anchor" questions and still get highly accurate performance predictions.

Asaf Yehudai, Yotam Perlitz, Leshem Choshen

Eval Frameworks & Benchmarks · Training Efficiency & Optimization

Mar 17, 2026 · also MIT CSAIL, IBM Research

Mediocrity is the key for LLM as a Judge Anchor Selection

Using a top- or bottom-performing LLM as an anchor in "LLM-as-a-judge" benchmarks can dramatically skew results, making the choice of a mediocre anchor key to reliable evaluation.

Shachar Don-Yehiya, Asaf Yehudai, Leshem Choshen +1

Eval Frameworks & Benchmarks · Natural Language Processing

Feb 26, 2026 · also HUJI

General Agent Evaluation

General-purpose agents can match the performance of specialized agents across diverse environments without any environment-specific tuning, challenging the need for task-specific engineering.

Elron Bandel, Asaf Yehudai, Lilach Eden +20

Eval Frameworks & Benchmarks · Tool Use & Agents


Papers (6)