DNABERT-4mer
Weights and tokenizer for DNABERT (Ji et al., Bioinformatics 2021), 4-mer variant, loaded with the shared BERT implementation from Taykhoom/BERT-updated.
DNABERT is a BERT model pre-trained on the human reference genome using overlapping 4-mer tokenization.
This repo contains only weights and tokenizer files. The model code is loaded
automatically from Taykhoom/BERT-updated via trust_remote_code=True.
Architecture
Standard BERT-base with a 4-mer DNA vocabulary.
| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Vocabulary size | 261 (5 special + 256 DNA 4-mers) |
| Positional encoding | Learned absolute |
| Max sequence length | 512 tokens |
| Parameters | ~88M |
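The vocabulary size in the table follows from counting every 4-letter string over A, C, G, T and adding the special tokens. The sketch below reproduces that arithmetic. The special-token names are an assumption (the standard BERT set); the tokenizer files in this repo define the actual entries and their order.

```python
from itertools import product

K = 4
NUCLEOTIDES = "ACGT"

# Assumed special tokens (standard BERT set); the real names and order
# come from the tokenizer files in this repo.
ASSUMED_SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

# Every possible DNA 4-mer: 4**4 = 256 strings.
kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=K)]

vocab_size = len(ASSUMED_SPECIAL_TOKENS) + len(kmers)
print(len(kmers), vocab_size)  # 256 261
```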
Tokenization
Input sequences must be pre-split into overlapping 4-mers (stride 1) with spaces between tokens before calling the tokenizer. For example:
ATCGATG -> ATCG TCGA CGAT GATG
```python
def seq_to_kmers(seq, k=4):
    return " ".join(seq[i:i+k] for i in range(len(seq) - k + 1))
```
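Because the model accepts at most 512 tokens, longer sequences have to be split before tokenization. The sketch below shows one possible way to do that. `split_for_model` and its parameters are hypothetical and not part of this repo, and it assumes the tokenizer adds two special tokens ([CLS] and [SEP]) per input, as standard BERT tokenizers do.

```python
def seq_to_kmers(seq, k=4):
    """Split a DNA string into space-separated overlapping k-mers (stride 1)."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


def split_for_model(seq, k=4, max_tokens=512, num_special=2):
    """Cut `seq` into non-overlapping windows that each fit the token limit.

    Hypothetical helper. Assumes the tokenizer adds `num_special` special
    tokens per input, so each window may hold max_tokens - num_special k-mers.
    """
    max_kmers = max_tokens - num_special   # 510 k-mers per window
    window_nt = max_kmers + k - 1          # 513 nucleotides per window
    windows = [seq[i:i + window_nt] for i in range(0, len(seq), window_nt)]
    # A trailing window shorter than k would produce no k-mers; drop it.
    return [w for w in windows if len(w) >= k]


long_seq = "ACGT" * 300                    # 1200 nucleotides
windows = split_for_model(long_seq)
kmer_counts = [len(seq_to_kmers(w).split()) for w in windows]
print([len(w) for w in windows])           # [513, 513, 174]
print(kmer_counts)                         # [510, 510, 171]
```

Non-overlapping windows lose the k-mers that span a window boundary. If those matter for a given task, stepping the window start by less than `window_nt` is one way to keep them.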
Pretraining
- Objective: Masked Language Modeling
- Data: Human reference genome (GRCh38)
- Source checkpoint: pytorch_model.bin from zhihan1996/DNA_bert_4
Parity Verification
Hidden-state representations match the source implementation to within a maximum
absolute difference of 1.5e-4 at all 13 representation levels (embedding output +
12 transformer layers). The small differences come from float32 accumulation in
two independent implementations of identical mathematics; the source
dnabert_layer.BertModel is a direct subclass of transformers.BertModel with no
modifications.
Verified on GPU with PyTorch 2.7 / CUDA 12.9.
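A check of this kind can be sketched as follows. This is not the script used for the verification above: `max_abs_diff_per_layer` is a hypothetical helper, and the random tensors stand in for the `hidden_states` tuples that two models would return with `output_hidden_states=True`.

```python
import torch


def max_abs_diff_per_layer(hidden_a, hidden_b):
    """Return the max absolute difference for each pair of hidden states."""
    if len(hidden_a) != len(hidden_b):
        raise ValueError("both models must return the same number of layers")
    return [(a - b).abs().max().item() for a, b in zip(hidden_a, hidden_b)]


# Stand-in data: 13 levels (embeddings + 12 layers), shape (batch, seq, dim).
torch.manual_seed(0)
reference = tuple(torch.randn(2, 11, 768) for _ in range(13))
# Simulate a port whose outputs differ by a small, bounded perturbation.
port = tuple(h + 1e-5 * torch.rand_like(h) for h in reference)

diffs = max_abs_diff_per_layer(reference, port)
assert len(diffs) == 13
assert max(diffs) < 1.5e-4   # the tolerance quoted above
```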
Related Models
See the full DNABERT collection.
| Model | Architecture | Notes |
|---|---|---|
| DNABERT-3mer | BERT + k-mer | k=3 |
| DNABERT-4mer | BERT + k-mer | k=4 |
| DNABERT-5mer | BERT + k-mer | k=5 |
| DNABERT-6mer | BERT + k-mer | k=6 |
| DNABERT-2 | MosaicBERT + BPE + ALiBi | Multi-species pre-trained |
| DNABERT-S | MosaicBERT + BPE + ALiBi | Species-aware |
Usage
Embedding generation
```python
import torch
from transformers import AutoTokenizer, AutoModel

def seq_to_kmers(seq, k=4):
    return " ".join(seq[i:i+k] for i in range(len(seq) - k + 1))

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT-4mer", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/DNABERT-4mer", trust_remote_code=True)
model.eval()

sequences = ["ATCGATCGATCG", "GCTAGCTAGCTA"]
kmer_seqs = [seq_to_kmers(s) for s in sequences]
enc = tokenizer(kmer_seqs, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 768)
token_emb = out.last_hidden_state          # (batch, seq_len, 768)

# Intermediate layers (index 0 is the embedding output, 1-12 are the layers)
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer6_emb = out_all.hidden_states[6]
```
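The [CLS] vector is one way to get a sequence-level embedding. Mean pooling over non-padding tokens is a common alternative when batches are padded. The sketch below illustrates that alternative and is not something this model card prescribes: `masked_mean_pool` is a hypothetical helper, and the dummy tensors stand in for `out.last_hidden_state` and `enc["attention_mask"]`.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring positions where the mask is 0."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (B, T, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (B, D)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # (B, 1)
    return summed / counts


# Dummy batch: 2 sequences, 4 token positions, 768-dim embeddings.
hidden = torch.ones(2, 4, 768)
hidden[1, 2:, :] = 100.0                      # values at padded positions
mask = torch.tensor([[1, 1, 1, 1],
                     [1, 1, 0, 0]])           # second sequence is padded

pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)                           # torch.Size([2, 768])
```

With a real tokenizer output the attention mask is also 1 at the special-token positions, so this average includes them unless they are masked out separately.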
Attention implementation
```python
import torch
from transformers import AutoModel

# SDPA (default on PyTorch >= 2.0)
model = AutoModel.from_pretrained("Taykhoom/DNABERT-4mer", trust_remote_code=True,
                                  attn_implementation="sdpa")

# Flash Attention 2 (requires the flash-attn package and a supported CUDA GPU)
model = AutoModel.from_pretrained("Taykhoom/DNABERT-4mer", trust_remote_code=True,
                                  attn_implementation="flash_attention_2",
                                  torch_dtype=torch.bfloat16)
```
Implementation Notes
The original DNABERT codebase defines BertModel as a thin subclass of
transformers.BertModel with no modifications. This HF port uses
Taykhoom/BERT-updated, which adds support for
attn_implementation="sdpa" and attn_implementation="flash_attention_2".
Neither option was part of the original codebase.
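One way to choose between the two attention backends at load time is sketched below. `pick_attn_implementation` is a hypothetical helper, not part of this repo or of Transformers. It only checks whether the `flash_attn` package is importable and whether the caller reports a CUDA device, so it is a heuristic, not a guarantee that Flash Attention 2 will run.

```python
import importlib.util


def pick_attn_implementation(has_cuda, has_flash_attn=None):
    """Return "flash_attention_2" when it looks usable, else "sdpa".

    Hypothetical helper. `has_flash_attn` defaults to checking whether the
    `flash_attn` package can be imported.
    """
    if has_flash_attn is None:
        has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if has_cuda and has_flash_attn:
        return "flash_attention_2"
    return "sdpa"


print(pick_attn_implementation(has_cuda=False))                       # sdpa
print(pick_attn_implementation(has_cuda=True, has_flash_attn=True))   # flash_attention_2
```

In practice `has_cuda` would come from `torch.cuda.is_available()`, and the returned string would be passed as `attn_implementation=` to `from_pretrained`. Flash Attention 2 also needs a half-precision dtype such as `torch.bfloat16`.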
Citation
```bibtex
@article{ji2021_dnabert,
  title   = {{DNABERT}: pre-trained Bidirectional Encoder Representations from Transformers model for {DNA}-language in genome},
  author  = {Ji, Yanrong and Zhou, Zhihan and Liu, Han and Davuluri, Ramana V},
  journal = {Bioinformatics},
  volume  = {37},
  number  = {15},
  pages   = {2112--2120},
  year    = {2021},
  doi     = {10.1093/bioinformatics/btab083}
}
```
Credits
Original DNABERT model and code by Ji et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.
License
MIT, following the original repository.