Lattice AI Research

Research focus

Eval Frameworks & Benchmarks (2), Code Generation & Program Synthesis (1), Tool Use & Agents (1), Red-Teaming & Adversarial Robustness (1)

Frequent co-authors

Ivan Bercovich (2), Ivgeni Segal (1), Kexun Zhang (1), Shashwat Saxena (1)

Papers (2)

Apr 30, 2026

Ivan Bercovich +1

What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

Popular terminal-agent benchmarks are riddled with flaws: over 15% of tasks are easily reward-hackable, which undermines their ability to accurately assess LLM capabilities.

Ivan Bercovich, I. Bercovich

Code Generation & Program Synthesis, Eval Frameworks & Benchmarks, Tool Use & Agents

Apr 19, 2026

CMU ML · also Fewshot Corp, Independent Researcher

Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories

Frontier LLMs are surprisingly vulnerable to a wide range of task-specific exploits, from simple output spoofing to rootkit-style binary hijacking, even in seemingly well-defined environments.

Ivan Bercovich, I. Bercovich, Ivgeni Segal +4

Eval Frameworks & Benchmarks, Red-Teaming & Adversarial Robustness
