This paper introduces an LLM-assisted pipeline for interpretable Chinese metaphor identification, operationalizing four metaphor identification protocols (MIP/MIPVU, CMDAG, emotion-based, simile-oriented) as executable rule scripts. The pipeline combines deterministic steps with controlled LLM calls to produce structured rationales for each classification. Evaluation on seven Chinese metaphor datasets reveals substantial divergence between protocols, with protocol choice emerging as the largest source of variation, while the pipeline remains competitive in performance and fully transparent.
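The rule-script design described above can be illustrated with a minimal sketch. All names here (`Verdict`, `run_mip_protocol`, the lexicon entries, and the stub LLM) are hypothetical, not the paper's actual implementation; the sketch only shows the architectural idea of chaining deterministic steps with a controlled LLM call while accumulating a human-auditable rationale.

```python
from dataclasses import dataclass, field

@dataclass
class Verdict:
    """Classification plus the step-by-step rationale that produced it."""
    metaphorical: bool
    rationale: list = field(default_factory=list)

def basic_meaning_lookup(token, lexicon):
    # Deterministic step: fetch the token's basic (most concrete) sense.
    return lexicon.get(token, "unknown")

def contrast_check(token, sentence, basic_sense, llm):
    # Controlled LLM call: does the contextual meaning contrast with the
    # basic sense? The prompt and yes/no parsing keep the call auditable.
    prompt = (f"In '{sentence}', does '{token}' depart from its basic sense "
              f"'{basic_sense}'? Answer yes or no.")
    return llm(prompt).strip().lower().startswith("yes")

def run_mip_protocol(token, sentence, lexicon, llm):
    # MIP-style chain: look up the basic sense, then test for contrast.
    verdict = Verdict(metaphorical=False)
    basic = basic_meaning_lookup(token, lexicon)
    verdict.rationale.append(f"basic sense of '{token}': {basic}")
    if basic != "unknown" and contrast_check(token, sentence, basic, llm):
        verdict.metaphorical = True
        verdict.rationale.append("contextual meaning contrasts with basic sense")
    else:
        verdict.rationale.append("no contrast detected; judged literal")
    return verdict

# Stub LLM so the sketch runs without a model backend.
demo_llm = lambda prompt: "yes" if "吞噬" in prompt else "no"
v = run_mip_protocol("吞噬", "黑暗吞噬了整座城市。",
                     {"吞噬": "to swallow/devour (physical ingestion)"}, demo_llm)
print(v.metaphorical, v.rationale)
```

Because every step either is deterministic or parses a constrained LLM answer, rerunning the chain on the same inputs reproduces the same verdict and rationale, which is the property behind the reproducibility and editability audit.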
Chinese metaphor identification is highly sensitive to protocol choice, whose effect dwarfs model-level variation, yet it can be tackled with fully transparent, LLM-assisted rule scripts.
Metaphor identification is a foundational task in figurative language processing, yet most computational approaches operate as opaque classifiers offering no insight into why an expression is judged metaphorical. This interpretability gap is especially acute for Chinese, where rich figurative traditions, absent morphological cues, and limited annotated resources compound the challenge. We present an LLM-assisted pipeline that operationalizes four metaphor identification protocols (MIP/MIPVU lexical analysis, CMDAG conceptual-mapping annotation, emotion-based detection, and simile-oriented identification) as executable, human-auditable rule scripts. Each protocol is a modular chain of deterministic steps interleaved with controlled LLM calls, producing structured rationales alongside every classification decision. We evaluate on seven Chinese metaphor datasets spanning token-, sentence-, and span-level annotation, establishing the first cross-protocol comparison for Chinese metaphor identification. Within-protocol evaluation shows Protocol A (MIP) achieves an F1 of 0.472 on token-level identification, while cross-protocol analysis reveals striking divergence: pairwise Cohen's kappa between Protocols A and D is merely 0.001, whereas Protocols B and C exhibit near-perfect agreement (kappa = 0.986). An interpretability audit shows all protocols achieve 100% deterministic reproducibility, with rationale correctness from 0.40 to 0.87 and editability from 0.80 to 1.00. Error analysis identifies conceptual-domain mismatch and register sensitivity as dominant failure modes. Our results demonstrate that protocol choice is the single largest source of variation in metaphor identification, exceeding model-level variation, and that rule-script architectures achieve competitive performance while maintaining full transparency.
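The cross-protocol agreement figures (kappa of 0.001 between Protocols A and D versus 0.986 between B and C) come from Cohen's kappa over paired decisions. A minimal, hand-rolled version makes the statistic concrete; the label sequences below are made up for illustration, not drawn from the paper's data.

```python
# Cohen's kappa: chance-corrected agreement between two raters (here,
# two protocols labelling the same items as metaphorical=1 / literal=0).

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    # Observed agreement: fraction of items the two protocols label identically.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each protocol's marginal label frequencies.
    labels = set(a) | set(b)
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Toy example: decisions from two hypothetical protocols on five tokens.
proto_a = [1, 0, 1, 1, 0]
proto_d = [1, 0, 0, 1, 0]
print(round(cohens_kappa(proto_a, proto_d), 3))  # 0.615
```

Kappa near 0 (as between Protocols A and D) means agreement no better than chance given each protocol's label frequencies, while values near 1 (as between B and C) indicate the protocols make essentially the same decisions.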