Infection-Reasoner, a 4B-parameter vision-language model, was developed for chronic wound infection classification by distilling knowledge from GPT-5.1's chain-of-thought rationales on unlabeled wound images into a smaller student model (Qwen3-VL-4B-Thinking) and further refining it with reinforcement learning on a labeled dataset. This approach addresses the scarcity of expert-labeled wound images and the need for interpretable, evidence-grounded explanations. Infection-Reasoner achieves state-of-the-art performance (86.8% accuracy) on a heterogeneous wound dataset and generates high-quality rationales as evaluated by both MLLM judges and wound experts.
A 4B-parameter model can outperform GPT-5.1 on wound infection classification when it is first trained on reasoning distilled from GPT-5.1 and then fine-tuned with reinforcement learning, offering a path to more efficient and interpretable medical image analysis.
Assessing chronic wound infection from photographs is challenging because visual appearance varies across wound etiologies, anatomical locations, and imaging conditions. Prior image-based deep learning methods have mainly focused on classification with limited interpretability, despite the need for evidence-grounded explanations to support point-of-care decision making. We present Infection-Reasoner, a compact 4B-parameter reasoning vision-language model for chronic wound infection classification and rationale generation. To address the scarcity of expert-labeled wound images with reasoning annotations, Infection-Reasoner is trained using a two-stage pipeline: (1) reasoning distillation, in which GPT-5.1 generates chain-of-thought rationales for unlabeled wound images to initialize wound-specific reasoning in a smaller student model (Qwen3-VL-4B-Thinking), and (2) reinforcement learning post-training with Group Relative Policy Optimization on a small labeled infection dataset to refine classification reasoning. On a held-out heterogeneous wound dataset, Infection-Reasoner achieved 86.8% accuracy, 86.4% sensitivity, and 87.1% specificity, outperforming several strong baselines, including GPT-5.1. Rationale quality was further evaluated using both multimodal large language model (MLLM) judges and wound expert review. Across four MLLM judges, visual-support agreement scores ranged from 0.722 to 0.903, while expert review rated 61.8% of rationales as Correct and 32.4% as Partially Correct.
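To make the second training stage more concrete, here is a minimal sketch of the group-relative advantage computation that gives Group Relative Policy Optimization its name. It is not the authors' implementation. The binary correctness reward, the function names `binary_reward` and `group_relative_advantages`, the group size of four, and the use of the population standard deviation are all assumptions for illustration; the abstract does not specify the reward design.

```python
from statistics import mean, pstdev


def binary_reward(predicted: str, label: str) -> float:
    """Hypothetical reward: 1.0 if the predicted class matches the label, else 0.0.

    The actual reward used for Infection-Reasoner is not given in the abstract.
    """
    return 1.0 if predicted.strip().lower() == label.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the other responses sampled for the same input.

    Responses that score above the group mean get a positive advantage,
    responses below it get a negative one. If every response earns the same
    reward, all advantages are zero and the group carries no learning signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed choice for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# One labeled wound image, four sampled model answers (illustrative values only).
label = "infected"
predictions = ["infected", "not infected", "infected", "infected"]

rewards = [binary_reward(p, label) for p in predictions]
advantages = group_relative_advantages(rewards)

for pred, adv in zip(predictions, advantages):
    print(f"{pred:>13s}  advantage={adv:+.3f}")
```

Because the advantage is computed relative to sibling responses for the same image, no separate value network is needed, which is one reason this style of post-training is practical on a small labeled dataset.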