BioClinical Treatment Information Detector
Model Description
This model is a specialized token classification system designed to detect treatment-related information in addiction medicine clinical notes. It is fine-tuned from thomas-sounack/BioClinical-ModernBERT-large to identify current and future treatment plans, medication decisions, and therapeutic interventions while preserving patient privacy.
Key Purpose: Prevent information leakage about locus of care and medication decisions when training clinical decision support systems in addiction medicine. This model enables researchers to mask sensitive treatment information before using clinical data for machine learning applications.
Intended Use
Primary Use Case
- Privacy-preserving clinical AI: Mask treatment-related information from clinical notes before training decision support systems
- Research data preparation: Identify and redact sensitive treatment details while preserving other clinical information
- Compliance support: Help maintain patient confidentiality when sharing clinical datasets for research
What It Detects
- Current medication prescriptions and dosages
- Treatment plans and recommendations
- Therapeutic interventions and procedures
- Follow-up care instructions
- Clinical advice and care coordination
What It Does NOT Detect
- Past treatment history (focuses only on current/future treatments)
- Personally Identifiable Information (PII) such as names, addresses, and phone numbers
- General medical conditions or diagnoses
- Demographics or personal details
Model Details
- Model Type: Token Classification (NER)
- Base Model: thomas-sounack/BioClinical-ModernBERT-large
- Language: English
- Domain: Clinical text (addiction medicine)
- Training Data: Single-center clinical notes from addiction medicine department
- Labels:
  - O: Outside treatment information
  - B-TREATMENT: Beginning of a treatment entity
  - I-TREATMENT: Inside a treatment entity
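For illustration, here is a hypothetical word-level tagging of a phrase from the Quick Start example below. This is only a sketch of the BIO scheme: the model's tokenizer is subword-based, so actual token boundaries and alignments will differ.

```python
# Hypothetical word-level BIO tagging; the real tokenizer splits into subwords
tokens = ["Start", "Tablet", "Buprenorphine", "8mg", "twice", "daily"]
labels = ["B-TREATMENT", "I-TREATMENT", "I-TREATMENT",
          "I-TREATMENT", "I-TREATMENT", "I-TREATMENT"]
```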
Performance
The model achieves strong performance on treatment detection:
- Treatment F1-Score: 0.892
- Treatment Precision: 0.885
- Treatment Recall: 0.899
These metrics reflect the model's ability to accurately identify treatment-related spans while minimizing false positives and negatives.
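As a quick consistency check, F1 is the harmonic mean of precision and recall, and the reported values agree:

```python
# F1 = 2PR / (P + R): harmonic mean of precision and recall
p, r = 0.885, 0.899
f1 = 2 * p * r / (p + r)
print(round(f1, 3))  # 0.892, matching the reported F1-score
```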
Limitations and Bias
Domain Specificity
- Single-center training: Model is trained exclusively on data from one addiction medicine center
- Specialty focus: Optimized for addiction medicine; may not generalize well to other medical specialties
- Language limitation: English-only model
Temporal Focus
- Current/future treatments only: Does not detect historical treatment information
- Context dependency: Performance may vary with different clinical note structures
Ethical Considerations
- This model is designed for defensive security purposes only
- Should be used to protect patient privacy, not to extract sensitive information
- Users must ensure compliance with healthcare privacy regulations (HIPAA, GDPR, etc.)
Quick Start
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "Lekhansh/bioclinical-treatment-detector"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

# Example clinical text
text = """
Treatment Plan:
1. Start Tablet Buprenorphine 8mg twice daily
2. Continue counseling sessions weekly
3. Follow up in outpatient clinic after 2 weeks
"""

# Tokenize and predict (offsets let us map tokens back to character spans)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512,
                   return_offsets_mapping=True)
with torch.no_grad():
    outputs = model(**{k: v for k, v in inputs.items() if k != "offset_mapping"})

# Get per-token label predictions
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_labels = torch.argmax(predictions, dim=-1)[0]

# Map predictions to character spans in the original text
id2label = {0: "O", 1: "B-TREATMENT", 2: "I-TREATMENT"}
offset_mapping = inputs["offset_mapping"][0]

treatment_spans = []
current_span = None
for label_id, (start, end) in zip(predicted_labels, offset_mapping):
    if start == 0 and end == 0:  # Skip special tokens
        continue
    label = id2label[label_id.item()]
    if label == "B-TREATMENT":
        if current_span:
            treatment_spans.append(current_span)
        current_span = {"start": start.item(), "end": end.item()}
    elif label == "I-TREATMENT" and current_span:
        current_span["end"] = end.item()
    else:
        if current_span:
            treatment_spans.append(current_span)
        current_span = None
if current_span:
    treatment_spans.append(current_span)

# Extract treatment text
for span in treatment_spans:
    treatment_text = text[span["start"]:span["end"]]
    print(f"Treatment detected: '{treatment_text}'")
```
Advanced Usage
For more sophisticated inference with confidence scores and batch processing, see the complete example:
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification


class TreatmentDetector:
    def __init__(self, model_name):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(model_name)
        self.model.eval()
        self.id2label = {0: "O", 1: "B-TREATMENT", 2: "I-TREATMENT"}

    def detect_treatments(self, text, confidence_threshold=0.5):
        encoding = self.tokenizer(
            text, return_tensors="pt", truncation=True, max_length=8192,
            return_offsets_mapping=True, padding=True
        )
        with torch.no_grad():
            outputs = self.model(
                **{k: v for k, v in encoding.items() if k != "offset_mapping"}
            )
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_labels = torch.argmax(predictions, dim=-1)[0]
        confidence_scores = torch.max(predictions, dim=-1)[0][0]

        treatment_spans = []
        current_span = None
        for label_id, confidence, (start, end) in zip(
            predicted_labels, confidence_scores, encoding["offset_mapping"][0]
        ):
            if start == 0 and end == 0:  # Skip special tokens
                continue
            label = self.id2label[label_id.item()]
            conf = confidence.item()
            if label == "B-TREATMENT" and conf > confidence_threshold:
                if current_span:
                    treatment_spans.append(current_span)
                current_span = {
                    "start": start.item(), "end": end.item(),
                    "confidence": conf
                }
            elif label == "I-TREATMENT" and current_span and conf > confidence_threshold:
                current_span["end"] = end.item()
                # Running average of token confidences across the span
                current_span["confidence"] = (current_span["confidence"] + conf) / 2
            else:
                if current_span:
                    treatment_spans.append(current_span)
                current_span = None
        if current_span:
            treatment_spans.append(current_span)

        # Attach the surface text of each span
        for span in treatment_spans:
            span["text"] = text[span["start"]:span["end"]]
        return treatment_spans


# Usage
detector = TreatmentDetector("Lekhansh/bioclinical-treatment-detector")
clinical_text = "Treatment Plan: Start Tablet Buprenorphine 8mg twice daily."
treatments = detector.detect_treatments(clinical_text)
```
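The class above processes one note at a time; for batch processing, a simple loop over notes is often sufficient for offline redaction jobs. A sketch, assuming short example notes:

```python
# Sketch: apply the detector to a list of notes one at a time
notes = [
    "Start Tablet Buprenorphine 8mg twice daily.",
    "Continue weekly counseling sessions; follow up in 2 weeks.",
]
for note in notes:
    for span in detector.detect_treatments(note):
        print(f"{span['text']!r} (confidence {span['confidence']:.2f})")
```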
Training Details
Training Data
- Source: Single addiction medicine center clinical notes
- Annotation: Manual annotation of treatment-related text spans
- Composition: Balanced dataset with both positive and negative examples
- Preprocessing: Text segmentation with sliding windows for long documents
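The card does not publish the segmentation parameters; the sketch below shows one common way to build overlapping windows from a tokenizer's offset mapping. The window and stride sizes are assumptions.

```python
def sliding_windows(text, tokenizer, max_tokens=512, stride=128):
    """Split a long note into overlapping character chunks (hypothetical parameters)."""
    enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
    offsets = enc["offset_mapping"]
    chunks = []
    step = max_tokens - stride  # consecutive windows overlap by `stride` tokens
    for i in range(0, len(offsets), step):
        window = offsets[i:i + max_tokens]
        chunks.append(text[window[0][0]:window[-1][1]])
        if i + max_tokens >= len(offsets):
            break
    return chunks
```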
Training Configuration
- Base Model: thomas-sounack/BioClinical-ModernBERT-large
- Training Epochs: 3
- Batch Size: 8 (with gradient accumulation)
- Learning Rate: 5e-5
- Optimizer: AdamW with weight decay
- Hardware: Single GPU training
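The training script itself is not published; the sketch below shows a `Trainer` setup consistent with the configuration above. The output directory, gradient-accumulation steps, and exact weight-decay value are assumptions, and the labeled dataset is not shown.

```python
from transformers import (
    AutoModelForTokenClassification, AutoTokenizer,
    DataCollatorForTokenClassification, Trainer, TrainingArguments,
)

base = "thomas-sounack/BioClinical-ModernBERT-large"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForTokenClassification.from_pretrained(
    base, num_labels=3,
    id2label={0: "O", 1: "B-TREATMENT", 2: "I-TREATMENT"},
    label2id={"O": 0, "B-TREATMENT": 1, "I-TREATMENT": 2},
)

args = TrainingArguments(
    output_dir="bioclinical-treatment-detector",  # assumption
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,                # assumption: exact value not stated
    learning_rate=5e-5,
    weight_decay=0.01,                            # assumption: decay value not stated
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # tokenized, BIO-labeled dataset (not shown)
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```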
Citation
If you use this model in your research, please cite:
```bibtex
@misc{bioclinical-treatment-detector,
  title={Addiction Medicine Treatment Information Detector for Clinical AI},
  author={Lekhansh S and Prakrithi SN},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Lekhansh/bioclinical-treatment-detector}}
}
```
Contact
For questions about this model or its applications in privacy-preserving clinical AI, please contact drlekhansh@gmail.com.
License
This model is released under the Apache 2.0 License. Please ensure compliance with all applicable healthcare privacy regulations when using this model.