This paper explores the feasibility of generating personalized coding problems using open-source small language models in a local microservice system, addressing cost and data privacy concerns associated with cloud-based proprietary models. The authors compare a single-model baseline with a ChatGPT-4 benchmark and a multi-agent refinement loop (CHASE) involving five models to iteratively increase problem difficulty. Results indicate that while CHASE improved topic adherence, it was significantly slower, and the single open-source model outperformed GPT-4 in clarity, demonstrating the potential of local models for coding problem generation.
Forget finetuning: a single open-source SLM can generate clearer coding problems than GPT-4, challenging the assumption that larger, proprietary models are always superior in educational applications.
LLMs and their use cases within computer science education have been the subject of much discussion. However, the reliance on cloud-based services when using proprietary models like GPT-4 faces barriers such as cost and data privacy compliance. This work presents an end-to-end local microservice system that generates programming problems with open-source small language models that can run on consumer devices. In addition to a baseline that uses a single model directly, we evaluate two further generation pipelines: a ChatGPT-4 benchmark and a multi-agent refinement loop inspired by the CHASE paradigm. In the latter, five models operate in a feedback loop that iteratively deepens a problem until a target difficulty is reached or exceeded. We generated 150 problems in total across the three methods, which were blindly scored by a computer science educator on metrics such as clarity, difficulty, and overall quality. The results show that CHASE achieved better topic adherence but was 18x slower than single-model generation. Chaining small models may not fix the deficiencies of a single model; rather, the deficiencies compound. However, the single-model end-to-end method using open-source models was reasonably fast and outperformed GPT-4 on clarity metrics. This work demonstrates the feasibility of using local models to create meaningful coding problems, but chained-pipeline approaches may need either more robust handling of user preferences and problem settings or simply larger models.
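The multi-agent refinement loop described above can be sketched as follows. This is a minimal illustrative skeleton, not the authors' implementation: the function names (`generate`, `estimate_difficulty`, `refine`) and the depth-based difficulty proxy are assumptions standing in for the five model calls in the actual pipeline.

```python
# Hypothetical sketch of a CHASE-style refinement loop: generate an initial
# problem, then repeatedly refine it until a target difficulty is met.
# All model calls are stubbed with simple placeholders.

def generate(topic):
    # Stand-in for the generator SLM producing an initial problem draft.
    return {"topic": topic, "text": f"Write a function about {topic}.", "depth": 1}

def estimate_difficulty(problem):
    # Stand-in for a judge model scoring difficulty (here: depth as a proxy).
    return problem["depth"]

def refine(problem):
    # Stand-in for refiner models adding constraints to deepen the problem.
    return {**problem,
            "depth": problem["depth"] + 1,
            "text": problem["text"] + " Add an extra constraint."}

def chase_loop(topic, target_difficulty, max_rounds=10):
    # Feedback loop: refine until the target difficulty is reached or exceeded,
    # with a round cap to guarantee termination.
    problem = generate(topic)
    for _ in range(max_rounds):
        if estimate_difficulty(problem) >= target_difficulty:
            break
        problem = refine(problem)
    return problem

result = chase_loop("recursion", target_difficulty=3)
print(estimate_difficulty(result))  # prints 3 with these stubs
```

The round cap matters in practice: since each refinement step invokes several models, an unbounded loop would make the reported 18x slowdown even worse when a small model fails to converge on the target difficulty.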