This paper introduces CGFuse, a novel framework that deeply integrates graph-derived representations of code structure into pre-trained language models at the token level. CGFuse combines a GNN with a language model, infusing learned graph features directly into the intermediate layers of the PLM to explicitly preserve fine-grained structural information from code graphs like ASTs and data-flow graphs. Experiments across multiple LLMs demonstrate that CGFuse achieves significant improvements in code generation performance, with BLEU scores increasing by 10-16% and CodeBLEU scores by 6-11%.
Injecting graph representations of code directly into LLM internals yields BLEU gains of up to 16% in code generation, suggesting that structural awareness is key to next-generation code models.
Pre-trained Language Models (PLMs) have the potential to transform software development tasks. However, despite significant advances, current PLMs struggle to capture the structured and relational attributes of code, such as control flow and data dependencies. This limitation is rooted in an architectural mismatch: whereas code structure is best represented by graphs, transformer-based LLMs process input as sequential token patterns and therefore lack explicit structural awareness. While recent research has explored integrating graph-based code representations using techniques like graph feature extraction, retrieval-augmented generation, and prompt engineering, existing approaches suffer from information loss during dense feature extraction or prompt encoding; notably, the potential of deep, token-level fusion of graph features within model internals has not been systematically explored. In this paper, we initiate such an exploration by introducing CGFuse, a novel framework that enables token-level integration of graph-derived representations by infusing learned graph features directly into the intermediate layers of pre-trained language models. CGFuse combines a graph neural network (GNN) with a language model to explicitly preserve and exploit fine-grained structural information from code graphs, including abstract syntax trees and data-flow graphs. We systematically evaluate CGFuse across multiple LLMs, demonstrating BLEU improvements of 10-16% and CodeBLEU improvements of 6-11% in code generation. These results highlight the potential of deep graph-PLM integration to advance the field toward more robust, capable AI-driven software development.
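To make the idea of token-level fusion concrete, here is a minimal sketch in plain Python. It is not the CGFuse architecture: the abstract does not specify the GNN variant, the fusion operator, or how tokens are aligned to graph nodes, so the mean-aggregation step, the additive gate, and the names `message_pass`, `fuse`, `token_to_node`, and `gate` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def message_pass(node_feats: List[Vector], edges: List[Tuple[int, int]]) -> List[Vector]:
    """One round of mean-aggregation message passing (a stand-in for a GNN layer)."""
    n = len(node_feats)
    dim = len(node_feats[0])
    # Each node aggregates over itself (self-loop) and its neighbors.
    neighbors: List[List[int]] = [[i] for i in range(n)]
    for src, dst in edges:
        neighbors[dst].append(src)
        neighbors[src].append(dst)  # assumption: edges are treated as undirected
    return [
        [sum(node_feats[j][d] for j in group) / len(group) for d in range(dim)]
        for group in neighbors
    ]


def fuse(
    hidden: List[Vector],
    node_feats: List[Vector],
    token_to_node: Dict[int, int],
    gate: float = 0.5,
) -> List[Vector]:
    """Add a gated graph feature to the hidden state of each aligned token.

    Tokens without an aligned graph node keep their hidden state unchanged.
    """
    fused = [list(h) for h in hidden]
    for tok, node in token_to_node.items():
        fused[tok] = [h + gate * g for h, g in zip(hidden[tok], node_feats[node])]
    return fused


# Toy input: the statement `x = y + 1`, tokenized as ["x", "=", "y", "+", "1"].
# Hypothetical data-flow graph: node 0 is `x`, node 1 is `y`, with an edge y -> x.
node_feats = [[1.0, 0.0], [0.0, 1.0]]
edges = [(1, 0)]
token_to_node = {0: 0, 2: 1}  # token index -> graph node index

# Stand-in for the hidden states at one intermediate layer of the language model.
hidden = [[0.0, 0.0] for _ in range(5)]

graph_feats = message_pass(node_feats, edges)
fused = fuse(hidden, graph_feats, token_to_node)
```

In this toy case only the tokens `x` and `y` receive graph information; the operator and literal tokens pass through untouched. The point is that structure is attached to individual tokens rather than compressed into a single vector or serialized into the prompt. A real system would use learned GNN weights and a learned projection into the model's hidden dimension rather than a fixed gate.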