Feb 24, 2026 · arXiv:2602.21165

PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data

Samah Fodeh, Linhai Ma, Yan Wang, Srivani Talakokkul, Ganesh Puthiaraju, Afshan Khan, Ashley Hagaman, Sarah Lowe, Aimee Roundtree

AI Summary

The paper introduces PVminer, a domain-adapted NLP framework for detecting the patient voice (PV) in patient-generated text. It addresses a limitation of existing methods, which often treat patient-centered communication and social determinants of health as separate tasks. PVminer formulates PV detection as a multi-label, multi-class prediction task, using patient-specific BERT encoders (PV-BERT-base and PV-BERT-large) and unsupervised topic modeling for thematic augmentation. It outperforms biomedical and clinical pre-trained baselines, with F1 scores of 82.25% (Code), 80.14% (Subcode), and up to 77.87% (Combo). An ablation study shows that author identity and topic-based augmentation each contribute gains.

Key Contribution

PVminer structures the patient voice in secure patient-provider messages, jointly covering communicative behaviors and social determinants of health, and outperforms biomedical and clinical pre-trained baselines on Code, Subcode, and Combo-level labels.

Abstract

Patient-generated text such as secure messages, surveys, and interviews contains rich expressions of the patient voice (PV), reflecting communicative behaviors and social determinants of health (SDoH). Traditional qualitative coding frameworks are labor intensive and do not scale to large volumes of patient-authored messages across health systems. Existing machine learning (ML) and natural language processing (NLP) approaches provide partial solutions but often treat patient-centered communication (PCC) and SDoH as separate tasks or rely on models not well suited to patient-facing language. We introduce PVminer, a domain-adapted NLP framework for structuring patient voice in secure patient-provider communication. PVminer formulates PV detection as a multi-label, multi-class prediction task integrating patient-specific BERT encoders (PV-BERT-base and PV-BERT-large), unsupervised topic modeling for thematic augmentation (PV-Topic-BERT), and fine-tuned classifiers for Code, Subcode, and Combo-level labels. Topic representations are incorporated during fine-tuning and inference to enrich semantic inputs. PVminer achieves strong performance across hierarchical tasks and outperforms biomedical and clinical pre-trained baselines, achieving F1 scores of 82.25% (Code), 80.14% (Subcode), and up to 77.87% (Combo). An ablation study further shows that author identity and topic-based augmentation each contribute meaningful gains. Pre-trained models, source code, and documentation will be publicly released, with annotated datasets available upon request for research use.
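The multi-label formulation with author and topic augmentation described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the abstract does not say how author identity or topic representations are attached to the input, so the `[SEP]`-joined string, the function names `build_augmented_input` and `decode_multilabel`, the placeholder label names, and the 0.5 threshold are all hypothetical.

```python
import math
from typing import Dict, List


def build_augmented_input(author: str, topic_words: List[str], message: str) -> str:
    """Join author role, topic keywords, and message text into one input string.

    Assumption: plain concatenation with [SEP] markers; the paper may use a
    different mechanism to inject author identity and topic information.
    """
    topic_part = " ".join(topic_words)
    return f"{author} [SEP] {topic_part} [SEP] {message}"


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_multilabel(logits: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Multi-label decoding: each label is scored independently, so a single
    message can receive zero, one, or several labels."""
    return sorted(label for label, z in logits.items() if sigmoid(z) >= threshold)


# Hypothetical message and topic keywords.
text = build_augmented_input("patient", ["medication", "refill"], "Can I get a refill?")

# Hypothetical per-label logits, standing in for a fine-tuned classifier's output.
labels = decode_multilabel({"LabelA": 2.0, "LabelB": -1.0, "LabelC": 0.3})
```

Scoring each label with its own sigmoid, instead of a softmax over all labels, is what allows one message to carry several codes at once. In a real system the logits would come from a fine-tuned encoder applied to the augmented text.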

Data Curation & Synthetic Data, Natural Language Processing, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 75

Year: 2026

Venue: N/A
