This study introduces FOXGLOVE, a dataset of 696 feedback comments written by trained writing instructors and 1,644 comments generated by four large language models (LLMs) on 69 twelfth-grade argumentative essays. The analysis finds that instructors and LLMs distribute feedback similarly across writing goals and essay positions, but diverge on the specific sentences they target; LLMs also write more complex comments and ask fewer questions. Instructors rate LLM feedback higher on most quality dimensions, although much of this advantage appears to be attributable to comment length.
LLM feedback may outscore instructor feedback in quality ratings, but the two diverge on which sentences they target, and much of the rating advantage appears to track comment length.
While large language models (LLMs) are increasingly used to generate writing feedback, there remains no systematic comparison of LLM and expert feedback on the dimensions that writing research identifies as central to revision: goal-orientation, anchoring to specific sentences, and prioritization. We introduce FOXGLOVE, a dataset of 696 feedback comments written by trained writing instructors on 69 twelfth-grade argumentative essays, paired with 1,644 comments generated by four frontier LLMs under a shared protocol, totaling 2,340 comments. We provide expert quality ratings on a subset of both instructor and LLM comments. We find that instructors and LLMs distribute feedback similarly across goals and essay positions, yet they diverge on the specific sentences they select for feedback. Additionally, we find that models tend to write more complex feedback and use fewer questions than instructors. LLM feedback also receives higher ratings from instructors on most dimensions of quality, but much of this advantage appears to be attributable to lengthier comments. FOXGLOVE enables systematic comparison of where human and LLM feedback align and where they diverge.