BIT, DUT, SUTD, UTokyo | May 21, 2026 | arXiv:2605.22258

Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting

Jingyi Kang, Junyu Lu, Bo Xu, Hongbo Wang, Linlin Zong, Roy Ka-Wei Lee, Hongfei Lin

AI Summary

This paper introduces CITA (Chinese Implicit Toxicity Attack), a three-stage framework for generating Chinese toxicity attacks that combine semantic indirectness with surface obfuscation. Its stages are Harmful Intent Learning, Implicit Toxicity Enhancement, and Obfuscation Variant Rewriting. On CITA-generated samples, the seven tested Chinese toxicity detectors reach an average attack success rate (ASR) of 69.48%, which indicates that they are vulnerable to implicit and obfuscated toxicity.

Key Contribution

The seven tested Chinese toxicity detectors are vulnerable to semantic indirection combined with surface obfuscation: CITA-generated samples reach an average ASR of 69.48% against them.

Abstract

Large language models (LLMs) require robust toxicity evaluation beyond explicit wording. This setting remains underexplored in Chinese, where toxicity may combine semantic indirectness with surface obfuscation. We introduce Chinese Implicit Toxicity Attack (CITA), a controlled red-team evaluation and defense-data generation framework, not a deployable evasion tool. CITA uses three stages: (i) Harmful Intent Learning, (ii) Implicit Toxicity Enhancement, and (iii) Obfuscation Variant Rewriting, to preserve harmful intent, increase implicitness, and add controlled surface variants. On CITA-generated evaluation samples, the seven tested detectors exhibit substantial missed-detection risks, reaching an average ASR of 69.48%; human evaluation further confirms preserved harmfulness and increased implicitness/evasiveness. As a downstream defense application, we fine-tune a Chinese Implicit Toxicity Defense model (CITD) with CITA-generated red-team data, showing that such data can improve robustness through additional training.
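The abstract reports an average ASR over seven detectors but does not spell out the computation. The sketch below assumes a common reading: ASR for one detector is the fraction of attack samples it fails to flag as toxic, and the reported figure is the unweighted mean across detectors. The function names, detector names, and toy predictions are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def attack_success_rate(flagged):
    """Fraction of attack samples a detector misses.

    `flagged` is a list of booleans, one per attack sample:
    True if the detector labeled the sample toxic, False if it missed it.
    (Assumed definition; the paper may compute ASR differently.)
    """
    if not flagged:
        raise ValueError("need at least one prediction")
    missed = sum(1 for is_flagged in flagged if not is_flagged)
    return missed / len(flagged)


def average_asr(per_detector):
    """Unweighted mean of per-detector ASR (assumed aggregation)."""
    return mean(attack_success_rate(preds) for preds in per_detector.values())


# Toy, made-up predictions for two hypothetical detectors.
toy = {
    "detector_a": [True, False, False, False],  # misses 3 of 4
    "detector_b": [False, False, True, True],   # misses 2 of 4
}

if __name__ == "__main__":
    for name, preds in toy.items():
        print(name, attack_success_rate(preds))
    print("average", average_asr(toy))
```

Under this reading, a higher ASR means more attack samples slip past the detector, so the 69.48% average corresponds to roughly seven in ten samples going undetected.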

Eval Frameworks & Benchmarks | Natural Language Processing | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
