The University of Texas at Austin
GENIE reveals that traditional metrics fail to capture the nuanced dimensions of novelty, offering a sharper lens for evaluating LLM creativity.
LLMs struggle with procedural rule induction, revealing a significant performance gap that challenges current AI capabilities.
AI-generated peer reviews aren't just viable at scale; researchers prefer them over human reviews for technical accuracy and actionable feedback.
LLMs are twice as likely as humans to repeat the same support tactic in a conversation, but a simple RL reward for tactic novelty can fix it.
Turns out, LLMs aren't actually empathic; they're just really good at regurgitating a well-liked empathy template.
VLMs may ace the color coverage test, but they flunk the "do as I say, not as I do" test, routinely ignoring their own stated reasoning rules in ways that humans don't.
LLMs struggle to generate diverse and specific connections between concepts, even with high token budgets and "thinking" prompts, revealing a gap in creative associative reasoning.