Mar 16, 2026 · arXiv:2603.14891

Decision-Level Ordinal Modeling for Multimodal Essay Scoring with Large Language Models

AI Summary

This paper introduces Decision-Level Ordinal Modeling (DLOM) for automated essay scoring (AES) with LLMs. Instead of generating a score as text and parsing it, DLOM treats scoring as an explicit ordinal decision: it reuses the language model head to extract score-wise logits on predefined score tokens. For multimodal AES, DLOM-GF adds a gated fusion module that adaptively combines textual and multimodal score logits; for text-only AES, DLOM-DA adds a distance-aware regularization term. Experiments on the multimodal EssayJudge dataset and the text-only ASAP/ASAP++ benchmarks show that DLOM and its variants outperform a generation-based SFT baseline and other representative methods.
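As an illustration of what "extracting score-wise logits" could look like, the sketch below restricts a vector of next-token logits to a set of score tokens and normalizes over that subset only. It is a minimal sketch under stated assumptions, not the paper's implementation: the toy vocabulary, the token ids in `score_token_ids`, and the function names are all hypothetical.

```python
import math


def score_distribution(vocab_logits, score_token_ids):
    """Softmax over the logits of the score tokens only.

    vocab_logits: next-token logits over the whole vocabulary (list of floats).
    score_token_ids: vocabulary ids of the score tokens, in ascending score
        order (hypothetical ids, chosen here for illustration).
    """
    score_logits = [vocab_logits[t] for t in score_token_ids]
    m = max(score_logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in score_logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_score(probs, scores):
    """Return the score whose probability is highest."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return scores[best]


# Toy example: an 8-token vocabulary in which ids 1, 3, 4, 6 are assumed
# to be the tokens for the scores 0, 1, 2, 3.
vocab_logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, 1.0, -2.0]
score_token_ids = [1, 3, 4, 6]
scores = [0, 1, 2, 3]

probs = score_distribution(vocab_logits, score_token_ids)
pred = predict_score(probs, scores)
```

Because the distribution is defined over the score set rather than the full vocabulary, it can be trained and inspected directly, with no decoding or parsing step.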

Key Contribution

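Optimizing score-wise logits directly, rather than decoding and parsing generated text, improves essay-scoring performance and opens the scoring decision to analysis in the score space, with further gains from gated fusion when the relevance of visual inputs varies across essays and traits.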

Abstract

Automated essay scoring (AES) predicts multiple rubric-defined trait scores for each essay, where each trait follows an ordered discrete rating scale. Most LLM-based AES methods cast scoring as autoregressive token generation and obtain the final score via decoding and parsing, making the decision implicit. This formulation is particularly sensitive in multimodal AES, where the usefulness of visual inputs varies across essays and traits. To address these limitations, we propose Decision-Level Ordinal Modeling (DLOM), which makes scoring an explicit ordinal decision by reusing the language model head to extract score-wise logits on predefined score tokens, enabling direct optimization and analysis in the score space. For multimodal AES, DLOM-GF introduces a gated fusion module that adaptively combines textual and multimodal score logits. For text-only AES, DLOM-DA adds a distance-aware regularization term to better reflect ordinal distances. Experiments on the multimodal EssayJudge dataset show that DLOM improves over a generation-based SFT baseline across scoring traits, and DLOM-GF yields further gains when modality relevance is heterogeneous. On the text-only ASAP/ASAP++ benchmarks, DLOM remains effective without visual inputs, and DLOM-DA further improves performance and outperforms strong representative baselines.
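The abstract does not give the exact form of the gated fusion module or of the distance-aware term, so the following is only a sketch of one plausible reading. It assumes a single scalar gate passed through a sigmoid, a convex combination of the two logit vectors, and a regularizer equal to the expected absolute distance between the predicted and gold score index. The names `gated_fusion`, `distance_aware_loss`, and the weight `lam` are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(text_logits, mm_logits, gate_logit):
    """Mix two score-logit vectors with a scalar gate g in (0, 1).

    Assumed form: fused = g * multimodal + (1 - g) * text. In a trained
    model gate_logit would come from a learned module; here it is an input.
    """
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [g * m + (1.0 - g) * t for t, m in zip(text_logits, mm_logits)]


def distance_aware_loss(score_logits, gold_index, lam=0.5):
    """Cross-entropy plus an assumed ordinal penalty.

    The penalty is the expected absolute distance to the gold score index,
    so probability mass placed far from the gold score costs more.
    """
    probs = softmax(score_logits)
    ce = -math.log(probs[gold_index])
    dist = sum(p * abs(k - gold_index) for k, p in enumerate(probs))
    return ce + lam * dist


# A gate logit of 0 gives g = 0.5, i.e. a plain average of the two views.
fused = gated_fusion([1.0, 2.0], [3.0, 0.0], gate_logit=0.0)

# Same probability on the gold score (index 0), different placement of the
# remaining mass: 'near' puts it on the adjacent score, 'far' two steps away.
near = distance_aware_loss([2.0, 1.0, 0.0], gold_index=0)
far = distance_aware_loss([2.0, 0.0, 1.0], gold_index=0)
```

In this toy case the two logit vectors have identical cross-entropy, so the difference between `near` and `far` comes only from the ordinal penalty. That is the behaviour a distance-aware term is meant to add on top of a plain classification loss.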

Eval Frameworks & Benchmarks, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
