Mila

3 papers from Mila on Eval Frameworks & Benchmarks

Apr 30, 2026

also BAIR, Mila, Toronto Metropolitan University, UofT

A Reproducibility Study of LLM-Based Query Reformulation

Gains from LLM-powered query reformulation, a hot topic in IR, often fail to carry over from lexical to neural retrieval, and bigger models don't always help.

Amin Bigdeli, Radin Hamidi Rad, Hai Son Le +4

Eval Frameworks & Benchmarks · Open-Source Models & Weights · Recommendation & Information Retrieval

Apr 6, 2026

Mila · also CIFAR, Cornell, McGill, Michigan State

Discovering Failure Modes in Vision-Language Models using RL

Forget hand-crafted prompts: RL can automatically unearth 36 new failure modes in VLMs that humans miss, revealing surprising blind spots in counting, spatial reasoning, and viewpoint understanding.

Kanishk Jain, Qian Yang, Shravan Nayak +3

Eval Frameworks & Benchmarks · Multimodal Models · Red-Teaming & Adversarial Robustness

Feb 19, 2026

The Fin AI · also Mila, California State University, Columbia, Georgia Tech +2

Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation

LLMs struggle to balance rational financial decisions with mimicking noisy user behavior, often overfitting to short-term market trends instead of aligning with long-term investment goals.

Yan Wang, Yi Han, Lingfei Qian +12

Eval Frameworks & Benchmarks · Natural Language Processing · Recommendation & Information Retrieval
