DNABERT-2 — ClinVar Variant Pathogenicity Classifier

Fine-tuned from zhihan1996/DNABERT-2-117M on ClinVar high-confidence (≥2 review stars) single nucleotide variants (SNVs) for binary pathogenicity classification.

| Label      | ID | Meaning                        |
|------------|----|--------------------------------|
| Benign     | 0  | Benign / Likely benign         |
| Pathogenic | 1  | Pathogenic / Likely pathogenic |

Test-set metrics

Variants are GRCh38 SNVs split 80/10/10 at the gene level, so no gene appears in both the training and test sets. The test set is 16.3% pathogenic; a class-weighted loss was used during training to offset this imbalance.
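A gene-level split can be sketched as follows. This is a minimal illustration and not the author's actual pipeline: the record schema (a dict with a `gene` key), the function name `split_by_gene`, and the choice to apply the 80/10/10 fractions to the number of genes rather than the number of variants are all assumptions.

```python
import random
from collections import defaultdict

def split_by_gene(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole genes to train/val/test so no gene spans two splits.

    `records` is an iterable of dicts with a "gene" key (hypothetical schema).
    The fractions apply to the number of genes, not the number of variants.
    """
    by_gene = defaultdict(list)
    for rec in records:
        by_gene[rec["gene"]].append(rec)

    genes = sorted(by_gene)              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(genes)

    n_train = int(fractions[0] * len(genes))
    n_val = int(fractions[1] * len(genes))
    buckets = {
        "train": genes[:n_train],
        "val": genes[n_train:n_train + n_val],
        "test": genes[n_train + n_val:],
    }
    # Expand each bucket of genes back into its variant records.
    return {name: [r for g in gs for r in by_gene[g]] for name, gs in buckets.items()}
```

Because genes differ widely in how many variants they carry, a split like this only approximates 80/10/10 at the variant level, and the pathogenic fraction can differ between splits.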

| Metric                | Value  |
|-----------------------|--------|
| AUROC                 | 0.8343 |
| AUPRC                 | 0.6145 |
| F1 (pathogenic class) | 0.5413 |
| MCC                   | 0.4495 |
| Accuracy              | 0.8467 |
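Accuracy is hard to interpret at this class balance: a predictor that always answers "Benign" on a test set that is 16.3% pathogenic already scores about 0.837. The sketch below computes accuracy, F1, and MCC from confusion-matrix counts using their standard definitions and applies them to that majority-class baseline. The counts are hypothetical (1000 variants at the stated class ratio), not the model's actual confusion matrix, and `binary_metrics` is an illustrative name.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Accuracy, F1 (positive class), and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": (tp + tn) / total, "f1": f1, "mcc": mcc}

# Hypothetical majority-class baseline: 1000 variants, 16.3% pathogenic,
# every variant predicted "Benign".
baseline = binary_metrics(tp=0, fp=0, tn=837, fn=163)
print(baseline)  # accuracy is 0.837 while F1 and MCC are both 0.0
```

F1 and MCC drop to zero for this baseline, which is why they are reported alongside accuracy.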

Cancer gene panel (BRCA1/2, TP53, MLH1/2, MSH2/6, ATM, CHEK2, PALB2, APC, PTEN, RAD51C): AUROC 0.819 · AUPRC 0.733 · F1 0.629

Input format

A 512 bp DNA string (characters A/C/G/T only) taken from the GRCh38 reference genome around the variant, with the ALT allele substituted at index 255 (0-based). Because the window length is even there is no exact centre: index 255 leaves 255 reference bases upstream of the variant and 256 downstream.
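The window construction can be sketched in plain Python. The function name `build_window`, the in-memory chromosome string, and the choice to raise an error near chromosome ends or on non-ACGT bases are assumptions made for illustration; the model card does not say how the author handled those cases.

```python
WINDOW = 512
CENTER = 255  # 0-based index of the variant inside the window

def build_window(chrom_seq, pos0, ref, alt):
    """Return a 512 bp window with `alt` substituted at index 255.

    chrom_seq: full chromosome sequence as a string (hypothetical input)
    pos0:      0-based variant position in chrom_seq (VCF POS minus 1)
    """
    start = pos0 - CENTER
    end = start + WINDOW
    if start < 0 or end > len(chrom_seq):
        raise ValueError("variant too close to a chromosome end for a full window")

    window = chrom_seq[start:end].upper()
    # Sanity check: the reference base in the genome must match the stated REF.
    if window[CENTER] != ref.upper():
        raise ValueError(f"REF mismatch: expected {ref}, found {window[CENTER]}")

    window = window[:CENTER] + alt.upper() + window[CENTER + 1:]
    if set(window) - set("ACGT"):
        raise ValueError("window contains non-ACGT characters (e.g. N)")
    return window

# Toy example on a synthetic 1200 bp "chromosome".
chrom = "ACGT" * 300
seq = build_window(chrom, pos0=600, ref="A", alt="G")
```

The REF check catches off-by-one errors between 1-based VCF coordinates and 0-based string indices before they silently shift the variant.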

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tok = AutoTokenizer.from_pretrained("whitedevil0089devil/dnabert2-clinvar-pathogenicity", trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "whitedevil0089devil/dnabert2-clinvar-pathogenicity", trust_remote_code=True)
model.eval()

# sequence = 512 bp window from GRCh38, ALT allele substituted at index 255 (0-based)
sequence = "ACGT" * 128   # 512 bp placeholder; replace with your real sequence

inputs = tok(sequence, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

prob_pathogenic = torch.softmax(logits, dim=-1)[0, 1].item()
label = "Pathogenic" if prob_pathogenic >= 0.5 else "Benign"
print(f"Pathogenicity probability: {prob_pathogenic:.4f}  →  {label}")

Confidence tiers (used in the demo app)

| Score     | Tier                | Suggested action                 |
|-----------|---------------------|----------------------------------|
| 0.00–0.20 | Likely Benign       | Routine monitoring               |
| 0.20–0.40 | Possibly Benign     | Note in record                   |
| 0.40–0.60 | Uncertain           | Flag for expert review           |
| 0.60–0.80 | Possibly Pathogenic | Prioritize functional validation |
| 0.80–1.00 | Likely Pathogenic   | Urgent: genetic counseling       |
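Mapping a probability to one of these tiers might look like the sketch below. The tier ranges share their endpoints (0.20 appears in two rows), and the card does not say which tier a boundary score belongs to; this sketch assumes the boundary goes to the higher tier. The name `score_to_tier` is illustrative and is not taken from the demo app.

```python
import bisect

TIER_EDGES = [0.20, 0.40, 0.60, 0.80]
TIERS = [
    "Likely Benign",
    "Possibly Benign",
    "Uncertain",
    "Possibly Pathogenic",
    "Likely Pathogenic",
]

def score_to_tier(prob_pathogenic):
    """Map a pathogenicity probability in [0, 1] to a confidence tier.

    Assumption: a score exactly on an edge (e.g. 0.20) falls in the higher tier.
    """
    if not 0.0 <= prob_pathogenic <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return TIERS[bisect.bisect_right(TIER_EDGES, prob_pathogenic)]

print(score_to_tier(0.47))  # Uncertain
```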

Data and training

  • Source: ClinVar variant_summary.txt (GRCh38, SNVs, ≥2 review stars, 2026)
  • Labels: Pathogenic + Likely pathogenic → 1 · Benign + Likely benign → 0 · VUS dropped
  • Split: by GeneSymbol (prevents gene-level leakage)
  • Context: 512 bp window from GRCh38 reference, ALT allele injected
  • Imbalance: class-weighted CrossEntropyLoss (16.3% pathogenic)
  • Optimizer: AdamW · LR 3e-5 · warmup 6% · cosine decay · 5 epochs · batch 64
  • Hardware: NVIDIA RTX PRO 6000 Blackwell (96 GB) · fp16
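Two of the settings above can be made concrete with a short, framework-free sketch: inverse-frequency class weights and a warmup-plus-cosine learning-rate schedule. The exact weighting formula, linear warmup, and decay to zero are assumptions (the card only states "class-weighted", "warmup 6%", and "cosine decay"), and the 16.3% figure is used as a stand-in for the training-set class ratio.

```python
import math

def balanced_class_weights(pos_fraction, n_classes=2):
    """Inverse-frequency weights w_c = 1 / (n_classes * freq_c).

    Returns [weight_benign, weight_pathogenic].
    """
    freqs = [1.0 - pos_fraction, pos_fraction]
    return [1.0 / (n_classes * f) for f in freqs]

def lr_at_step(step, total_steps, peak_lr=3e-5, warmup_frac=0.06):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

weights = balanced_class_weights(0.163)
print([round(w, 3) for w in weights])  # [0.597, 3.067]
```

In PyTorch, weights like these could be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`, so that each pathogenic example contributes roughly five times as much to the loss as a benign one.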

Limitations

  • SNVs only — insertions, deletions, structural variants not supported
  • Trained on germline ClinVar variants — not validated on somatic mutations
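  • The gene-level split tests generalization to held-out genes, but genes with no ClinVar variants meeting the filters were never seen in training or evaluation
  • AUROC of 0.83 makes the score useful for ranking variants, but an F1 of 0.54 on the pathogenic class shows that hard pathogenic/benign calls are often wrong; the model should not be used as a sole clinical decision-making tool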

Citation

@inproceedings{zhou2024dnabert2,
  title={DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome},
  author={Zhou, Zhihan and Ji, Yanrong and Li, Weijian and Dutta, Pratik and
           Davuluri, Ramana and Liu, Han},
  booktitle={ICLR},
  year={2024}
}

ClinVar: Landrum et al., Nucleic Acids Research 2020.
