Surface-level metrics like BLEU are misleading for evaluating dialogue systems, as human and LLM judges reveal critical flaws in coherence and consistency that these metrics miss entirely.