Introduction
Fine-Tuning a ClinicalBERT-Based Model
Enhancing PHI De-Identification for Diverse Patient Populations
Protecting patient privacy is at the heart of ethical healthcare. Regulations like HIPAA set the bar high, especially when handling Protected Health Information (PHI) in electronic health records and patient notes. As we've discussed before, clinical notes are a gold mine of valuable information. To tap into this treasure trove, we must accurately de-identify them to enable data sharing, research, and the development of machine learning models—all without compromising patient confidentiality.
De-identifying PHI in clinical text is essentially a token classification or Named Entity Recognition (NER) task. NLP has made significant strides here, with models like fine-tuned ClinicalBERT, a transformer-based model pre-trained on clinical text and fine-tuned on the I2B2-DEID dataset. However, when we tested this model on our clients' data, we noticed it was missing names, departments, and other critical identifiers specific to their patient population. Upon digging deeper, we realized that the I2B2-DEID dataset from 2014 was not reflective of our clients' diverse demographics, leading to these gaps in performance. Even an impressive F1-score of 94% is not sufficient when patient privacy is on the line.
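To make the task concrete, here is a minimal sketch of running a de-identification model as a token classification pipeline with Hugging Face transformers. The checkpoint path is a placeholder for whichever ClinicalBERT-based de-id model is being evaluated, and the note text is invented for illustration.

```python
from transformers import pipeline

# Placeholder checkpoint; substitute the ClinicalBERT-based de-id model
# under evaluation.
deid = pipeline(
    "token-classification",
    model="path/to/clinicalbert-deid",
    aggregation_strategy="simple",  # merge sub-word pieces into entity spans
)

note = "Patient Jane Doe was seen at Mercy General on 03/14/2014."
for entity in deid(note):
    # Each span carries a PHI label (e.g., NAME, LOCATION, DATE) plus
    # character offsets that can be used to mask the original text.
    print(entity["entity_group"], entity["word"], entity["start"], entity["end"])
```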
This mismatch highlighted a key issue with one-size-fits-all models in healthcare. What works well in one setting might fall short in another, especially when dealing with diverse patient populations. One straightforward solution would be to curate our own annotated dataset—a time-consuming and resource-intensive task. Instead, we chose a more innovative approach, one that can be easily replicated by other hospitals facing similar challenges.
Adapting Existing Datasets to Our Patient Demographics
To tackle this challenge without the hefty workload of creating a new dataset from scratch, we took the same I2B2-DEID dataset used for fine-tuning ClinicalBERT and customized it to better reflect our clients' patient demographics. Specifically, we replaced PHI indicators (names, organizations, and locations) with synthetic examples representative of their diverse population.
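As an illustration, here is a minimal sketch of that substitution step, assuming BIO-tagged token sequences and a label schema with tags like B-NAME/I-NAME; the synthetic name pool is a tiny hypothetical stand-in for the curated demographic lists we actually used.

```python
import random

# Tiny hypothetical pool; in practice this was a much larger curated list
# of names representative of the target patient demographics.
SYNTHETIC_NAMES = ["Nguyen Tran", "Amara Okafor", "Leilani Kahale"]

def substitute_phi(tokens, tags, label, pool):
    """Replace each B-/I- span of `label` with a sampled synthetic value,
    re-emitting BIO tags so the annotation schema stays intact."""
    out_tokens, out_tags, i = [], [], 0
    while i < len(tokens):
        if tags[i] == f"B-{label}":
            i += 1
            while i < len(tokens) and tags[i] == f"I-{label}":
                i += 1  # skip the rest of the original span
            replacement = random.choice(pool).split()
            out_tokens.extend(replacement)
            out_tags.extend([f"B-{label}"] + [f"I-{label}"] * (len(replacement) - 1))
        else:
            out_tokens.append(tokens[i])
            out_tags.append(tags[i])
            i += 1
    return out_tokens, out_tags

tokens = ["Mr.", "John", "Smith", "was", "admitted", "."]
tags = ["O", "B-NAME", "I-NAME", "O", "O", "O"]
print(substitute_phi(tokens, tags, "NAME", SYNTHETIC_NAMES))
```

The same routine applies to organization and location spans, each with its own synthetic pool.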
This approach offered several advantages:
- Efficiency: We avoided the time-consuming process of manual annotation.
- Data Privacy: There was no need to use actual patient notes, ensuring compliance with privacy regulations.
- Enhanced Model Training: By incorporating PHI tokens that mirror our clients' demographics, we improved the model's ability to recognize and de-identify PHI specific to their patient population.
Fine-Tuning the Model with Our Customized Dataset
With our tailored dataset in hand, we further fine-tuned the ClinicalBERT-based model that had already been fine-tuned on the I2B2-DEID dataset. You can check out the model we built upon here: EHR De-identification GitHub repository.
Here's How We Did It:
- Data Preparation: We replaced PHI in the I2B2-DEID dataset with synthetic PHI tokens reflective of our clients' patient demographics, ensuring the original data structure and annotation schema remained intact.
- Training Setup: Leveraged tools and scripts similar to those in the EHR De-identification GitHub repository to set up the fine-tuning process.
- Parameter Tweaks: Adjusted learning rates and batch sizes to optimize performance without overfitting.
- Validation: Tested the model on a held-out set to confirm it was learning effectively (a sketch of the training and evaluation setup follows this list).
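Below is a rough sketch of what such a setup can look like with the Hugging Face Trainer. The checkpoint path and hyperparameter values are illustrative assumptions rather than our exact configuration, and `train_ds`/`val_ds` stand in for the customized I2B2-DEID splits (assumed already tokenized, with BIO labels aligned to sub-word tokens).

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical checkpoint path for the already fine-tuned de-id model;
# substitute the model from the EHR De-identification repository.
model_name = "path/to/clinicalbert-deid"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

args = TrainingArguments(
    output_dir="deid-demographic-finetune",
    learning_rate=2e-5,              # small LR to preserve prior clinical knowledge
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# train_ds / val_ds: customized I2B2-DEID splits, assumed prepared upstream.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=tokenizer,
)
trainer.train()
print(trainer.evaluate())  # held-out metrics after fine-tuning
```

Keeping the learning rate small is the main lever here: the goal is to nudge the existing model toward the new PHI vocabulary, not to retrain it from scratch.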
The Results? Pretty Impressive.
After fine-tuning, we saw significant improvements. The model accurately identified and masked names and terms specific to our patients. The recall score on our test set improved from 89% to 98%, pushing the F1 score up to 96%. For example, before fine-tuning, culturally specific names common in our clients’ patient population were often missed. After our tweaks, these names were correctly identified and de-identified.
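For context on how such scores are computed, entity-level precision, recall, and F1 for BIO-tagged output can be obtained with the seqeval library; the tag sequences below are invented for illustration and are not our evaluation data.

```python
from seqeval.metrics import classification_report

# Illustrative gold vs. predicted tag sequences. Entity-level scoring is
# the right granularity for PHI: a partially masked name still leaks.
y_true = [["O", "B-NAME", "I-NAME", "O"], ["B-LOCATION", "O"]]
y_pred = [["O", "B-NAME", "I-NAME", "O"], ["O", "O"]]

print(classification_report(y_true, y_pred))
```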
What we accomplished was more than just fine-tuning an already fine-tuned model: we aligned it with our specific patient demographic without disrupting its existing understanding of medical language and context. By swapping our own synthetic PHI tokens into the I2B2-DEID dataset, we exposed the model to names, organizations, and terms it hadn't encountered before, all while preserving the valuable knowledge it had already learned from clinical texts.
Our experience highlights a significant phenomenon: sometimes, the key to unlocking a model's full potential isn't starting over but rather strategically adapting what's already there. By carefully augmenting the model's training data to include our specific needs, we achieved higher accuracy and better performance without sacrificing the foundational knowledge the model possessed.
Cognome is building out collaboratives with health systems that are working on model development and optimization. If you are interested in collaborating with other organizations to share best practices and learnings, reach out to me on LinkedIn or fill out the contact us form below.