This paper introduces MUCOCO, an automated consistency testing method for code LLMs that leverages semantic-preserving mutation analysis to expose inconsistent behaviors. MUCOCO generates semantically equivalent program mutants from a coding query and flags inconsistencies based on differing outputs or test failures between the original program and its mutants. Experiments across four coding tasks and seven LLMs demonstrate that MUCOCO effectively exposes inconsistencies, outperforming the TURBULENCE baseline and revealing that 15% of generated inputs expose inconsistencies.
Code LLMs fail consistency checks on about 15% of MUCOCO-generated inputs, revealing a reliability gap that existing benchmarks miss.
Code LLMs often exhibit inconsistent program behaviors. Developers typically rely on benchmarks to assess Code LLMs, but most benchmarks are hand-crafted, static, and do not target the consistency property. In this work, we pose the scientific question: how can we automatically discover inconsistent program behaviors in Code LLMs? To address this challenge, we propose an automated consistency testing method, called MUCOCO, which employs semantic-preserving mutation analysis to expose inconsistent behaviors in Code LLMs. Given a coding query, MUCOCO automatically transforms its program into semantically equivalent programs (aka mutants) and detects inconsistencies between the mutants and the original program (e.g., different output or test failure). We evaluate MUCOCO on four coding tasks and seven LLMs. Results show that MUCOCO is effective in exposing inconsistencies and outperforms the closest baseline (TURBULENCE): about one in seven (15%) of the inputs generated by MUCOCO exposed inconsistencies. Our work motivates the need to test Code LLMs for the consistency property.
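To make the mutate-and-compare idea concrete, below is a minimal Python sketch, not the paper's implementation: the helper names (rename_variable, passes_tests, check_consistency), the LLM callable, and the prompt builder are illustrative assumptions, and identifier renaming stands in for only one of the semantic-preserving mutations a tool like MUCOCO would apply.

```python
# Minimal sketch of consistency testing via semantic-preserving mutation,
# assuming a Python target language and a code LLM exposed as a callable
# `llm(prompt) -> program`. All helper names here are hypothetical.
import ast


def rename_variable(source: str, old: str, new: str) -> str:
    """Return a semantically equivalent program with one identifier renamed."""
    class Renamer(ast.NodeTransformer):
        def visit_Name(self, node: ast.Name) -> ast.Name:
            if node.id == old:
                node.id = new
            return node

        def visit_arg(self, node: ast.arg) -> ast.arg:
            if node.arg == old:
                node.arg = new
            return node

    return ast.unparse(Renamer().visit(ast.parse(source)))


def passes_tests(program: str, tests: str) -> bool:
    """Run an LLM-produced program against the task's test suite."""
    namespace: dict = {}
    try:
        exec(program, namespace)   # define the candidate solution
        exec(tests, namespace)     # tests raise (e.g. AssertionError) on failure
        return True
    except Exception:
        return False


def check_consistency(program: str, tests: str, llm, make_prompt) -> bool:
    """Query the LLM on the original program and on an equivalent mutant;
    a differing test outcome signals an inconsistency."""
    mutant = rename_variable(program, "result", "res")  # example mutation
    original_ok = passes_tests(llm(make_prompt(program)), tests)
    mutant_ok = passes_tests(llm(make_prompt(mutant)), tests)
    return original_ok == mutant_ok  # False => inconsistent behavior exposed
```

In this sketch, make_prompt would embed the (possibly mutated) program into the same coding query, e.g., a completion or bug-fixing instruction, so any divergence in test outcomes can be attributed to the semantically equivalent rewrite rather than to the task itself.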