Current QA models boasting 70+ F1 on existing benchmarks crumble on SPARTA, a new dataset revealing their surprisingly shallow reasoning abilities across tables and text.