DNABERT-2 — ClinVar Variant Pathogenicity Classifier

Fine-tuned from zhihan1996/DNABERT-2-117M on ClinVar high-confidence (≥2 review stars) single nucleotide variants (SNVs) for binary pathogenicity classification.

Label	ID	Meaning
Benign	0	Benign / Likely benign
Pathogenic	1	Pathogenic / Likely pathogenic

Test-set metrics

Gene-level 80/10/10 split on GRCh38 SNVs (no gene appears in both train and test). The test set is 16.3% pathogenic, so a class-weighted loss was used during training to offset the imbalance.
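A gene-level split can be sketched as follows. This is a minimal illustration, not the training code: the record schema (a dict with a "gene" key), the function name split_by_gene, and the choice to apply the 80/10/10 fractions to the number of genes rather than the number of variants are all assumptions.

```python
import random
from collections import defaultdict


def split_by_gene(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole genes to train/val/test so no gene spans two splits.

    `records` is an iterable of dicts with a "gene" key (hypothetical schema).
    Fractions are applied to the gene count, not the variant count (assumption).
    """
    by_gene = defaultdict(list)
    for rec in records:
        by_gene[rec["gene"]].append(rec)

    genes = sorted(by_gene)              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(genes)

    n_train = round(len(genes) * fractions[0])
    n_val = round(len(genes) * fractions[1])
    groups = {
        "train": genes[:n_train],
        "val": genes[n_train:n_train + n_val],
        "test": genes[n_train + n_val:],
    }
    # Expand each group of genes back into its variant records.
    return {name: [r for g in gs for r in by_gene[g]] for name, gs in groups.items()}


# Toy data: 20 hypothetical genes with 3 variants each.
records = [{"gene": f"GENE{i}", "label": i % 2} for i in range(20) for _ in range(3)]
splits = split_by_gene(records)
train_genes = {r["gene"] for r in splits["train"]}
val_genes = {r["gene"] for r in splits["val"]}
test_genes = {r["gene"] for r in splits["test"]}
```

Because genes differ widely in how many variants they carry, splitting by gene count will generally not give exactly 80/10/10 at the variant level.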

Metric	Value
AUROC	0.8343
AUPRC	0.6145
F1 (pathogenic class)	0.5413
MCC	0.4495
Accuracy	0.8467
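For readers comparing these numbers, note that on a test set that is 16.3% pathogenic, a classifier that labels every variant benign would already reach about 83.7% accuracy, which is why F1 and MCC are reported alongside it. The sketch below shows how the threshold-dependent metrics follow from confusion-matrix counts; the function name binary_metrics and the toy labels are illustrative and are not the evaluation code used for this model.

```python
import math


def binary_metrics(y_true, y_pred):
    """Accuracy, F1 for the positive class, and MCC from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Toy example: 2 pathogenic and 4 benign variants, one error in each class.
metrics = binary_metrics([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])

# An "always benign" baseline on the same toy labels.
baseline = binary_metrics([1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0])
```

In the toy example the baseline still scores 4/6 on accuracy while its F1 and MCC are both 0, which mirrors the effect of class imbalance described above.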

Cancer gene panel (BRCA1/2, TP53, MLH1/2, MSH2/6, ATM, CHEK2, PALB2, APC, PTEN, RAD51C): AUROC 0.819 · AUPRC 0.733 · F1 0.629

Input format

A 512 bp DNA string (characters A/C/G/T only) extracted from the GRCh38 reference genome around the variant, with the ALT allele substituted at 0-based index 255. Because the window length is even there is no exact centre: the variant has 255 reference bases upstream and 256 downstream, so the ±256 bp window is approximate.
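One way to build such a window is sketched below. It assumes the chromosome sequence is already loaded as a Python string and that the variant position is 0-based (ClinVar and VCF coordinates are 1-based, so subtract 1 first). The function name build_window, the REF check, and the choice to reject windows that run off the contig or contain non-ACGT characters are assumptions; the card does not say how the author handled those cases.

```python
WINDOW = 512          # window length in bp, as stated in the card
VARIANT_INDEX = 255   # 0-based index of the variant inside the window


def build_window(chrom_seq, pos0, ref, alt):
    """Return a 512 bp window with `alt` substituted at index 255.

    chrom_seq: full chromosome sequence as a string (hypothetical input).
    pos0:      0-based position of the SNV on that chromosome.
    """
    start = pos0 - VARIANT_INDEX
    end = start + WINDOW
    if start < 0 or end > len(chrom_seq):
        raise ValueError("variant too close to the contig edge for a full window")

    window = chrom_seq[start:end].upper()
    if window[VARIANT_INDEX] != ref.upper():
        raise ValueError("REF allele does not match the reference sequence")

    # Substitute the single ALT base; the window length is unchanged for an SNV.
    mutated = window[:VARIANT_INDEX] + alt.upper() + window[VARIANT_INDEX + 1:]
    if set(mutated) - set("ACGT"):
        raise ValueError("window contains characters other than A/C/G/T")
    return mutated


# Toy "chromosome": position 600 holds an A because 600 % 4 == 0.
chrom = "ACGT" * 300
seq = build_window(chrom, pos0=600, ref="A", alt="G")
```

Checking the REF allele against the extracted base is a cheap guard against off-by-one coordinate errors and mismatched genome builds.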

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tok = AutoTokenizer.from_pretrained("whitedevil0089devil/dnabert2-clinvar-pathogenicity", trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "whitedevil0089devil/dnabert2-clinvar-pathogenicity", trust_remote_code=True)
model.eval()

# sequence = 512 bp window from GRCh38, ALT allele substituted at 0-based index 255
sequence = "ACGT" * 128   # replace with your real sequence

inputs = tok(sequence, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

prob_pathogenic = torch.softmax(logits, dim=-1)[0, 1].item()
label = "Pathogenic" if prob_pathogenic >= 0.5 else "Benign"
print(f"Pathogenicity probability: {prob_pathogenic:.4f}  →  {label}")

Confidence tiers (used in the demo app)

Score	Tier	Suggested action
0.00–0.20	Likely Benign	Routine monitoring
0.20–0.40	Possibly Benign	Note in record
0.40–0.60	Uncertain	Flag for expert review
0.60–0.80	Possibly Pathogenic	Prioritize functional validation
0.80–1.00	Likely Pathogenic	Urgent — genetic counseling
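The tier lookup can be sketched as a small function. The table's ranges share their endpoints (0.20 appears in two rows), so this sketch assumes each tier includes its lower bound and excludes its upper bound, with 1.00 falling in the top tier. That boundary convention and the function name confidence_tier are assumptions, not taken from the demo app.

```python
# (exclusive upper bound, tier name), in ascending order of score.
TIERS = [
    (0.20, "Likely Benign"),
    (0.40, "Possibly Benign"),
    (0.60, "Uncertain"),
    (0.80, "Possibly Pathogenic"),
    (1.00, "Likely Pathogenic"),
]


def confidence_tier(score):
    """Map a pathogenicity probability in [0, 1] to a tier name."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability between 0 and 1")
    for upper, tier in TIERS:
        if score < upper:
            return tier
    return TIERS[-1][1]   # score == 1.0 belongs to the top tier


tier_low = confidence_tier(0.05)
tier_edge = confidence_tier(0.20)   # boundary value goes to the higher tier
tier_mid = confidence_tier(0.55)
tier_top = confidence_tier(1.0)
```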

Data and training

Source: ClinVar variant_summary.txt (GRCh38, SNVs, ≥2 review stars, 2026)
Labels: Pathogenic + Likely pathogenic → 1 · Benign + Likely benign → 0 · VUS dropped
Split: by GeneSymbol (prevents gene-level leakage)
Context: 512 bp window from GRCh38 reference, ALT allele injected
Imbalance: class-weighted CrossEntropyLoss (16.3% pathogenic)
Optimizer: AdamW · LR 3e-5 · warmup 6% · cosine decay · 5 epochs · batch 64
Hardware: NVIDIA RTX PRO 6000 Blackwell (96 GB) · fp16
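Two of the settings above can be made concrete with a short sketch: the class weights and the learning-rate schedule. Both functions are illustrative assumptions. The card does not state the weighting formula, so inverse-frequency weights are used here, and the warmup is assumed to be linear and measured as a fraction of total optimizer steps.

```python
import math


def inverse_frequency_weights(pos_fraction):
    """Weights w_c = 1 / (n_classes * freq_c); a balanced set gives 1.0 each.

    Returned in label-ID order: [benign (0), pathogenic (1)].
    """
    freqs = [1.0 - pos_fraction, pos_fraction]
    return [1.0 / (len(freqs) * f) for f in freqs]


def lr_at_step(step, total_steps, peak_lr=3e-5, warmup_frac=0.06):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# 16.3% pathogenic, as reported in the card.
weights = inverse_frequency_weights(0.163)

# Sample the schedule over a hypothetical 1000-step run.
lr_start = lr_at_step(0, 1000)
lr_peak = lr_at_step(60, 1000)
lr_end = lr_at_step(1000, 1000)
```

With these weights the pathogenic class counts roughly five times as much as the benign class in the loss. In PyTorch, a list like this could be converted to a tensor and passed as the weight argument of torch.nn.CrossEntropyLoss.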

Limitations

SNVs only — insertions, deletions, structural variants not supported
Trained on germline ClinVar variants — not validated on somatic mutations
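The gene-level split measures generalization to genes held out from training, but only among genes that ClinVar covers; genes absent from ClinVar are never seen and remain unevaluated
An AUROC of 0.83 indicates useful ranking ability, but the pathogenic-class F1 of 0.54 shows that thresholded calls are much less reliable; the model should not be used as the sole basis for clinical decisions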

Citation

@inproceedings{zhou2024dnabert2,
  title={DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome},
  author={Zhou, Zhihan and Ji, Yanrong and Li, Weijian and Dutta, Pratik and
           Davuluri, Ramana and Liu, Han},
  booktitle={ICLR},
  year={2024}
}

ClinVar: Landrum et al., Nucleic Acids Research 2020.
