DNABERT-2 — ClinVar Variant Pathogenicity Classifier
Fine-tuned from zhihan1996/DNABERT-2-117M on ClinVar high-confidence (≥2 review stars) single nucleotide variants (SNVs) for binary pathogenicity classification.
| Label | ID | Meaning |
|---|---|---|
| Benign | 0 | Benign / Likely benign |
| Pathogenic | 1 | Pathogenic / Likely pathogenic |
Test-set metrics
Gene-level 80/10/10 split on GRCh38 SNVs (no gene appears in both train and test). The test set is 16.3% pathogenic, so a class-weighted loss was used during training.
| Metric | Value |
|---|---|
| AUROC | 0.8343 |
| AUPRC | 0.6145 |
| F1 (pathogenic class) | 0.5413 |
| MCC | 0.4495 |
| Accuracy | 0.8467 |
Cancer gene panel (BRCA1/2, TP53, MLH1/2, MSH2/6, ATM, CHEK2, PALB2, APC, PTEN, RAD51C): AUROC 0.819 · AUPRC 0.733 · F1 0.629
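F1, MCC, and accuracy all follow from the confusion matrix at a chosen cutoff, while AUROC and AUPRC are threshold-free. The sketch below is not the evaluation script used for this model: it is a pure-Python illustration on toy data, and the function names and the 0.5 cutoff are assumptions. It also shows why accuracy needs a baseline here. With 16.3% pathogenic variants, a classifier that always predicts Benign would reach about 83.7% accuracy while scoring 0 on F1 and MCC.

```python
import math


def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TP, FP, TN, FN, treating label 1 (Pathogenic) as the positive class."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def thresholded_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, F1 (positive class) and MCC from the confusion matrix."""
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Toy data only (not ClinVar): five benign variants, one pathogenic.
toy_labels = [0, 0, 0, 0, 0, 1]
always_benign = [0.0] * len(toy_labels)
print(thresholded_metrics(toy_labels, always_benign))
```

On imbalanced data like this, MCC and pathogenic-class F1 are the more informative thresholded numbers, which is why they are reported alongside accuracy.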
Input format
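A 512 bp DNA string (characters A/C/G/T only) centered on the variant position, with the ALT allele placed at index 255 (the 0-based center of the window). The window is extracted from the GRCh38 reference genome; with the variant at index 255, it spans 255 bp upstream and 256 bp downstream, so the nominal ±256 bp window is not exactly symmetric.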
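One way to build such a window is sketched below. This is an assumption-laden illustration, not the preprocessing code used for training: `build_window`, its arguments, and the error handling are hypothetical, and the reference is assumed to be available as a plain string for the relevant chromosome. Strand handling and soft-masked or `N` bases are treated in the simplest possible way (uppercase, then reject anything outside A/C/G/T).

```python
WINDOW = 512     # window length in bp, as stated in the card
ALT_INDEX = 255  # 0-based index of the variant within the window


def build_window(reference, pos0, ref, alt):
    """Return a 512 bp window with `alt` substituted at index 255.

    reference: chromosome sequence as a string (assumed input format)
    pos0:      0-based position of the SNV on that chromosome
    ref, alt:  single-base REF and ALT alleles
    """
    start = pos0 - ALT_INDEX
    end = start + WINDOW
    if start < 0 or end > len(reference):
        raise ValueError("window runs past the end of the reference sequence")

    window = reference[start:end].upper()
    if window[ALT_INDEX] != ref.upper():
        raise ValueError("REF allele does not match the reference at this position")

    # Substitute the ALT allele at the centre of the window.
    seq = window[:ALT_INDEX] + alt.upper() + window[ALT_INDEX + 1:]
    if set(seq) - set("ACGT"):
        raise ValueError("window contains characters other than A/C/G/T")
    return seq


# Toy reference, not real genome sequence.
toy_reference = "ACGT" * 300
sequence = build_window(toy_reference, pos0=600, ref="A", alt="G")
print(len(sequence), sequence[ALT_INDEX])
```

VCF and ClinVar positions are 1-based, so `pos0` would be `POS - 1` when reading coordinates from those files.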
Usage
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "whitedevil0089devil/dnabert2-clinvar-pathogenicity"

# DNABERT-2 relies on custom model code, hence trust_remote_code=True
tok = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

# sequence = 512 bp window from GRCh38, ALT allele injected at index 255 (0-based)
sequence = "ACGT" * 128  # placeholder; replace with your real sequence
inputs = tok(sequence, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits

# Class index 1 is Pathogenic (see the label table above)
prob_pathogenic = torch.softmax(logits, dim=-1)[0, 1].item()
label = "Pathogenic" if prob_pathogenic >= 0.5 else "Benign"
print(f"Pathogenicity probability: {prob_pathogenic:.4f} → {label}")
Confidence tiers (used in the demo app)
| Score | Tier | Suggested action |
|---|---|---|
| 0.00–0.20 | Likely Benign | Routine monitoring |
| 0.20–0.40 | Possibly Benign | Note in record |
| 0.40–0.60 | Uncertain | Flag for expert review |
| 0.60–0.80 | Possibly Pathogenic | Prioritize functional validation |
| 0.80–1.00 | Likely Pathogenic | Urgent — genetic counseling |
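The table lists each boundary value (0.20, 0.40, 0.60, 0.80) in two adjacent tiers, so any implementation has to pick a side. The sketch below assumes half-open intervals, where a boundary score falls into the higher tier and 1.00 belongs to the top tier. The demo app may resolve boundaries differently, and `confidence_tier` is a hypothetical helper, not part of the released code.

```python
# (exclusive upper bound, tier name), in ascending order; names follow the table.
TIERS = [
    (0.20, "Likely Benign"),
    (0.40, "Possibly Benign"),
    (0.60, "Uncertain"),
    (0.80, "Possibly Pathogenic"),
]
TOP_TIER = "Likely Pathogenic"  # covers 0.80 up to and including 1.00


def confidence_tier(score):
    """Map a pathogenicity probability in [0, 1] to a tier name."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability between 0 and 1")
    for upper, name in TIERS:
        if score < upper:
            return name
    return TOP_TIER


for example_score in (0.05, 0.20, 0.55, 0.93):
    print(example_score, confidence_tier(example_score))
```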
Data and training
- Source: ClinVar variant_summary.txt (GRCh38, SNVs, ≥2 review stars, 2026)
- Labels: Pathogenic + Likely pathogenic → 1 · Benign + Likely benign → 0 · VUS dropped
- Split: by GeneSymbol (prevents gene-level leakage)
- Context: 512 bp window from GRCh38 reference, ALT allele injected
- Imbalance: class-weighted CrossEntropyLoss (16.3% pathogenic)
- Optimizer: AdamW · LR 3e-5 · warmup 6% · cosine decay · 5 epochs · batch 64
- Hardware: NVIDIA RTX PRO 6000 Blackwell (96 GB) · fp16
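The gene-level split and the class weighting can be sketched as follows. Both functions are illustrative assumptions rather than the training code. The card does not say whether the 80/10/10 ratio applies to genes or to variants (genes are assumed here), nor which weighting formula was used (the common inverse-frequency heuristic `n / (k * n_c)` is assumed). The names `split_by_gene` and `inverse_frequency_weights` and the record layout are hypothetical.

```python
import random
from collections import Counter


def split_by_gene(records, seed=0, fractions=(0.8, 0.1, 0.1)):
    """Assign whole genes to train/val/test so that no gene spans two splits.

    records: list of dicts with at least a "gene" key (assumed layout).
    """
    genes = sorted({r["gene"] for r in records})
    random.Random(seed).shuffle(genes)

    n_train = int(len(genes) * fractions[0])
    n_val = int(len(genes) * fractions[1])
    assignment = {}
    for i, gene in enumerate(genes):
        if i < n_train:
            assignment[gene] = "train"
        elif i < n_train + n_val:
            assignment[gene] = "val"
        else:
            assignment[gene] = "test"

    splits = {"train": [], "val": [], "test": []}
    for r in records:
        splits[assignment[r["gene"]]].append(r)
    return splits


def inverse_frequency_weights(labels, num_classes=2):
    """Per-class weights n / (k * n_c); rarer classes get larger weights."""
    counts = Counter(labels)
    total = len(labels)
    return [total / (num_classes * counts[c]) for c in range(num_classes)]


# Toy records, not ClinVar data: 10 hypothetical genes with 2 variants each.
toy = [{"gene": f"GENE{i}", "label": i % 2} for i in range(10) for _ in range(2)]
parts = split_by_gene(toy)
print({name: len(rows) for name, rows in parts.items()})
print(inverse_frequency_weights([r["label"] for r in parts["train"]]))
```

Under this heuristic, a set with 16.3% pathogenic variants would give weights of roughly 0.60 for Benign and 3.07 for Pathogenic. The resulting list could be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`.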
Limitations
- SNVs only — insertions, deletions, structural variants not supported
- Trained on germline ClinVar variants — not validated on somatic mutations
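- Genes absent from ClinVar are never seen during training; the gene-level split only approximates how the model performs on such unseen genes
- An AUROC of 0.83 indicates useful ranking ability, but with an F1 of 0.54 on the pathogenic class the model should not be used as the sole basis for clinical decisions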
Citation
@inproceedings{zhou2024dnabert2,
  title={DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome},
  author={Zhou, Zhihan and Ji, Yanrong and Li, Weijian and Dutta, Pratik and
          Davuluri, Ramana and Liu, Han},
  booktitle={ICLR},
  year={2024}
}
ClinVar: Landrum et al., Nucleic Acids Research 2020.