Multimodal models stumble badly on low-resource Southeast Asian languages, as revealed by the new SEA-Vision benchmark for document and scene-text understanding.
Ditch imperfect human annotations: this dual-reward RL approach trains image captioning models to be both more complete and more factually correct.