May 6, 2026

arXiv:2605.04727

Ensuring Reliability in Programming Knowledge Tracing: A Re-evaluation of Attention-augmented Models and Experimental Protocols

AI Summary

This paper re-evaluates attention-augmented Programming Knowledge Tracing (PKT) models and demonstrates that previously reported performance gains are sensitive to model configuration and sequence construction. Specifically, the authors show that attention dimension settings and improper ordering of student attempts can inflate performance estimates. Under a stricter evaluation protocol, in which hyperparameters are fixed across cross-validation folds and attempt sequences respect temporal causality, the performance gap between attention-enhanced models and standard DKT is significantly reduced.
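The fold protocol summarized above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the helper names `make_folds` and `select_then_fix`, the contiguous split, and the `score_fn` callback are all hypothetical, and the sketch assumes that "guided by a single designated fold" means scoring each candidate on that fold's validation split.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Params = Dict[str, object]
Fold = Tuple[List[int], List[int]]  # (train indices, validation indices)


def make_folds(n_items: int, k: int) -> List[Fold]:
    """Split indices 0..n_items-1 into k contiguous folds (hypothetical helper).

    In a knowledge tracing setting the indices would typically identify
    students, so that no student appears in both train and validation.
    """
    bounds = [i * n_items // k for i in range(k + 1)]
    folds: List[Fold] = []
    for i in range(k):
        val = list(range(bounds[i], bounds[i + 1]))
        train = [j for j in range(n_items) if j < bounds[i] or j >= bounds[i + 1]]
        folds.append((train, val))
    return folds


def select_then_fix(
    grid: Dict[str, Sequence[object]],
    folds: List[Fold],
    score_fn: Callable[[Params, List[int], List[int]], float],
    designated_fold: int = 0,
) -> Tuple[Params, List[float]]:
    """Grid-search on one designated fold, then reuse the winner on every fold."""
    names = sorted(grid)
    candidates = [
        dict(zip(names, values)) for values in product(*(grid[n] for n in names))
    ]

    # Assumption: selection uses only the designated fold's validation score.
    train, val = folds[designated_fold]
    best = max(candidates, key=lambda params: score_fn(params, train, val))

    # The selected configuration is held fixed; no per-fold re-tuning.
    scores = [score_fn(best, tr, va) for tr, va in folds]
    return best, scores
```

Here `score_fn` stands in for training a model with the given hyperparameters and returning a validation metric such as AUC. Because the configuration is never re-tuned per fold, differences between folds reflect the data split rather than fold-specific tuning.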

Key Contribution

Under a controlled and consistent evaluation protocol, the gains that attention-based programming knowledge tracing models show over standard DKT shrink considerably, which suggests that previously reported improvements were overstated.

Abstract

Programming Knowledge Tracing (PKT) has recently advanced through hybrid approaches that integrate attention-based feature modeling for code representation with RNN-based sequential prediction. While these models report strong empirical performance, their reliability can be sensitive to subtle implementation and experimental design choices. This study revisits representative PKT models and shows that reported gains can be substantially influenced by model configuration and sequence construction practices. We identify issues in attention dimension settings that affect performance estimates, and demonstrate that improper ordering of student attempts, such as ignoring ServerTimestamp, can violate temporal causality and lead to overly optimistic results. To ensure consistent evaluation, hyperparameters are selected via grid search guided by a single designated fold and then fixed uniformly across all folds during cross-validation. We further analyze the role of assignment-wise characteristics and systematically explore the impact of maximum sequence length. Using this protocol, we re-evaluate PKT models on the CodeWorkout dataset. Our results show that, under controlled and consistent settings, the performance gap between attention-enhanced models and standard DKT is significantly reduced, and increased architectural complexity does not consistently translate into superior performance. Beyond individual model comparisons, this work provides practical guidance for reliable and comparable evaluation in programming knowledge tracing.
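To make the sequence construction issue concrete, the following minimal sketch orders each student's attempts by `ServerTimestamp` before applying a maximum sequence length. Only `ServerTimestamp` is named in the abstract. The `SubjectID` and `ProblemID` field names, the ISO-8601 timestamp format, the function name `build_sequences`, and the choice to keep the most recent attempts when truncating are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List


def build_sequences(
    rows: Iterable[Dict[str, object]], max_len: int
) -> Dict[str, List[Dict[str, object]]]:
    """Group attempts per student, sort them chronologically, then truncate."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")

    by_student: Dict[str, List[Dict[str, object]]] = defaultdict(list)
    for row in rows:
        # Assumed column name for the student identifier.
        by_student[str(row["SubjectID"])].append(row)

    sequences: Dict[str, List[Dict[str, object]]] = {}
    for student, attempts in by_student.items():
        # Sorting by ServerTimestamp keeps each prediction conditioned only on
        # earlier attempts; relying on file order may leak future attempts.
        attempts.sort(key=lambda r: datetime.fromisoformat(str(r["ServerTimestamp"])))
        # Assumption: keep the most recent max_len attempts.
        sequences[student] = attempts[-max_len:]
    return sequences


if __name__ == "__main__":
    rows = [
        {"SubjectID": "s1", "ProblemID": "p2", "ServerTimestamp": "2019-02-01T10:05:00"},
        {"SubjectID": "s1", "ProblemID": "p1", "ServerTimestamp": "2019-02-01T10:00:00"},
        {"SubjectID": "s1", "ProblemID": "p3", "ServerTimestamp": "2019-02-01T10:10:00"},
    ]
    for student, seq in build_sequences(rows, max_len=2).items():
        print(student, [r["ProblemID"] for r in seq])
```

Varying `max_len` over a range of values and re-running the evaluation is one way to explore the sensitivity to maximum sequence length that the abstract mentions.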

Code Generation & Program Synthesis

Eval Frameworks & Benchmarks

Citation Metrics

Citations: 0

Influential citations: 0

References: 17

Year: 2026

Venue: N/A
