Today's best AI agents can only complete 33% of common online tasks like booking appointments or filling out job applications, revealing a significant gap between current capabilities and real-world utility.
Current video understanding benchmarks and post-training datasets are riddled with linguistic biases, meaning VLMs might be acing tests without actually "watching" the video.
Image generation models ace photorealistic art but still choke on screenshots and infographics, highlighting a critical gap in real-world applicability.
A Qwen3-8B model, trained with a new SFT+RLAIF recipe on a challenging new benchmark, SWE-QA-Pro, beats GPT-4o in repository-level code understanding.
MLLMs may ace your visual question answering, but VisPhyWorld reveals they're still struggling to actually *simulate* physics.