TU Darmstadt
AI evaluation reporting has systematic gaps, and the resulting inconsistencies hinder reliable comparisons across thousands of models and benchmarks.
Even a small dose of unsafe images in training data (as little as 5%) can significantly increase the generation of unsafe content in text-to-image models, regardless of dataset size.
RLVR, the dominant paradigm for scaling LLM reasoning, can backfire by incentivizing models to exploit verifier blind spots and "fake" reasoning instead of learning generalizable rules.