DNABERT-3mer

Weights and tokenizer for DNABERT (Ji et al., Bioinformatics 2021), 3-mer variant, loaded with the shared BERT implementation from Taykhoom/BERT-updated.

DNABERT is a BERT model pre-trained on the human reference genome using overlapping 3-mer tokenization.

This repo contains only weights and tokenizer files. The model code is loaded automatically from Taykhoom/BERT-updated via trust_remote_code=True.

Architecture

Standard BERT-base with a 3-mer DNA vocabulary.

| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Vocabulary size | 69 (5 special + 64 DNA 3-mers) |
| Positional encoding | Learned absolute |
| Max sequence length | 512 tokens |
| Parameters | ~86M |
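
The vocabulary size and parameter count above can be sanity-checked with a little arithmetic. The sketch below is only illustrative: the special-token names, the feed-forward size of 3072, the two token-type embeddings, and the inclusion of the pooler are assumptions based on the standard BERT-base configuration, not values read from this repo's config.

```python
from itertools import product

# All 4**3 = 64 DNA 3-mers over the alphabet A, C, G, T
kmers = ["".join(p) for p in product("ACGT", repeat=3)]

# Assumed special tokens (standard BERT set; the exact names and order in
# this repo's vocab file are not confirmed here)
special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
vocab_size = len(special_tokens) + len(kmers)

# Values from the table above
hidden, layers, max_pos = 768, 12, 512
# Assumed standard BERT-base values not listed in the table
intermediate, type_vocab = 3072, 2

def linear(n_in, n_out):
    """Parameter count of a dense layer with bias."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale + shift
embeddings = (vocab_size + max_pos + type_vocab) * hidden + layer_norm
per_layer = (
    4 * linear(hidden, hidden) + layer_norm  # Q, K, V, output projection + LayerNorm
    + linear(hidden, intermediate)           # feed-forward up-projection
    + linear(intermediate, hidden)           # feed-forward down-projection
    + layer_norm
)
pooler = linear(hidden, hidden)
total = embeddings + layers * per_layer + pooler

print(vocab_size, f"{total / 1e6:.1f}M")  # 69 86.1M
```

Under these assumptions the encoder comes to roughly 86.1M parameters, in line with the ~86M figure in the table.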

Tokenization

Input sequences must be pre-split into overlapping 3-mers (stride 1) with spaces between tokens before calling the tokenizer. For example:

ATCGATG  ->  ATC TCG CGA GAT ATG

A minimal helper that performs this split:

def seq_to_kmers(seq, k=3):
    return " ".join(seq[i:i+k] for i in range(len(seq) - k + 1))
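
A sequence of length L yields L - k + 1 overlapping k-mers, so long inputs can exceed the 512-token limit. The sketch below shows one possible way to split a long sequence into windows before tokenizing. The helper name `chunk_sequence` is hypothetical, and the budget of 2 special tokens per input assumes the tokenizer adds one [CLS] and one [SEP], as standard BERT tokenizers do.

```python
def seq_to_kmers(seq, k=3):
    return " ".join(seq[i:i+k] for i in range(len(seq) - k + 1))

def chunk_sequence(seq, k=3, max_tokens=512, n_special=2):
    """Split seq into nucleotide windows that each yield at most
    max_tokens - n_special k-mers (hypothetical helper).

    Consecutive windows overlap by k - 1 nucleotides, so every k-mer of
    the full sequence appears in exactly one window.
    """
    n_kmers = len(seq) - k + 1
    if n_kmers <= 0:
        return []                       # sequence shorter than one k-mer
    max_kmers = max_tokens - n_special  # 510 k-mers per window by default
    window = max_kmers + k - 1          # nucleotides needed for max_kmers k-mers
    return [seq[start:start + window] for start in range(0, n_kmers, max_kmers)]

long_seq = "ACGT" * 300                 # 1200 nt -> 1198 3-mers
chunks = chunk_sequence(long_seq)
kmer_chunks = [seq_to_kmers(c) for c in chunks]
print([len(c.split()) for c in kmer_chunks])  # [510, 510, 178]
```

Each entry of `kmer_chunks` can then be passed to the tokenizer as a separate input. How the per-window embeddings are combined afterwards (for example, averaging) is a design choice left to the application.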

Pretraining

  • Objective: Masked Language Modeling
  • Data: Human reference genome (GRCh38)
  • Source checkpoint: pytorch_model.bin from zhihan1996/DNA_bert_3

Parity Verification

Hidden-state representations match the source implementation to within a maximum absolute difference of 1.5e-4 at all 13 representation levels (embedding output + 12 transformer layers). The residual differences are consistent with float32 accumulation-order effects between two independent implementations of the same mathematics; the source dnabert_layer.BertModel is a direct subclass of transformers.BertModel with no modifications. Verified on GPU with PyTorch 2.7 / CUDA 12.9.
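
A check of this kind can be sketched as follows. This is not the verification script used for this repo; the function name `max_abs_diff_per_level` is hypothetical, and random tensors stand in for the two models' `hidden_states` tuples so the snippet runs without downloading any weights.

```python
import torch

def max_abs_diff_per_level(hidden_a, hidden_b):
    """Return the max absolute difference at each representation level."""
    if len(hidden_a) != len(hidden_b):
        raise ValueError("both models must return the same number of levels")
    return [(a.float() - b.float()).abs().max().item()
            for a, b in zip(hidden_a, hidden_b)]

# Stand-in data: 13 levels of shape (batch, seq_len, hidden); the second
# tuple is a slightly perturbed copy that mimics small numerical noise.
torch.manual_seed(0)
ref = tuple(torch.randn(2, 10, 768) for _ in range(13))
port = tuple(h + 1e-5 * torch.randn_like(h) for h in ref)

diffs = max_abs_diff_per_level(ref, port)
print(len(diffs), all(d < 1.5e-4 for d in diffs))
```

In a real comparison, `ref` and `port` would be the `hidden_states` returned by the source model and by this port for the same tokenized batch, with `output_hidden_states=True`.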

Related Models

See the full DNABERT collection.

| Model | Architecture | Notes |
|---|---|---|
| DNABERT-3mer | BERT + k-mer | k=3 |
| DNABERT-4mer | BERT + k-mer | k=4 |
| DNABERT-5mer | BERT + k-mer | k=5 |
| DNABERT-6mer | BERT + k-mer | k=6 |
| DNABERT-2 | MosaicBERT + BPE + ALiBi | Multi-species pre-trained |
| DNABERT-S | MosaicBERT + BPE + ALiBi | Species-aware |

Usage

Embedding generation

import torch
from transformers import AutoTokenizer, AutoModel

def seq_to_kmers(seq, k=3):
    return " ".join(seq[i:i+k] for i in range(len(seq) - k + 1))

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT-3mer", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/DNABERT-3mer", trust_remote_code=True)
model.eval()

sequences = ["ATCGATCGATCG", "GCTAGCTAGCTA"]
kmer_seqs = [seq_to_kmers(s) for s in sequences]
enc = tokenizer(kmer_seqs, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

cls_emb   = out.last_hidden_state[:, 0, :]   # (batch, 768)
token_emb = out.last_hidden_state             # (batch, seq_len, 768)

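# Intermediate layers: hidden_states[0] is the embedding output,
# hidden_states[1..12] are the outputs of the 12 transformer layers
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer6_emb = out_all.hidden_states[6]         # output of transformer layer 6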
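
Because the batch is padded, averaging token embeddings into one vector per sequence should skip padding positions. The sketch below shows one common way to do that with the attention mask. The helper name `masked_mean` is hypothetical, and a small hand-made tensor stands in for the model output so the snippet runs on its own.

```python
import torch

def masked_mean(last_hidden_state, attention_mask):
    """Mean over the sequence axis, counting only positions where mask == 1."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts

# Stand-in for a model output: 1 sequence, 3 positions, hidden size 2.
# The last position is padding and must not affect the mean.
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = masked_mean(hidden, mask)
print(pooled)  # tensor([[2., 2.]])
```

With the objects from the usage example, the call would be `masked_mean(out.last_hidden_state, enc["attention_mask"])`. Note that this average still includes the special-token positions; whether to exclude them is a modeling choice.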

Attention implementation

import torch
from transformers import AutoModel

# SDPA (default on PyTorch >= 2.0)
model = AutoModel.from_pretrained("Taykhoom/DNABERT-3mer", trust_remote_code=True,
                                   attn_implementation="sdpa")

# Flash Attention 2: requires the flash-attn package and a CUDA GPU. It only
# supports fp16/bf16, so pass torch_dtype (this also suppresses the dtype warning).
model = AutoModel.from_pretrained("Taykhoom/DNABERT-3mer", trust_remote_code=True,
                                   attn_implementation="flash_attention_2",
                                   torch_dtype=torch.bfloat16)
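
When the same script has to run on machines with and without Flash Attention, the backend can be chosen at runtime. The following is a minimal sketch, not part of this repo: `pick_attn_implementation` is a hypothetical helper, and it assumes that a CUDA device plus an importable `flash_attn` module is enough to use Flash Attention 2 (in practice the GPU must also support the chosen half-precision dtype).

```python
import importlib.util

def pick_attn_implementation(cuda_available, flash_attn_installed):
    """Return (attn_implementation, dtype name) for the given environment."""
    if cuda_available and flash_attn_installed:
        return "flash_attention_2", "bfloat16"
    return "sdpa", "float32"

# Detect the environment without failing when optional packages are missing
flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
try:
    import torch
    cuda_available = torch.cuda.is_available()
except ImportError:
    cuda_available = False

attn, dtype_name = pick_attn_implementation(cuda_available, flash_attn_installed)
print(attn, dtype_name)

# The result can then be passed to from_pretrained, for example:
# model = AutoModel.from_pretrained("Taykhoom/DNABERT-3mer", trust_remote_code=True,
#                                   attn_implementation=attn,
#                                   torch_dtype=getattr(torch, dtype_name))
```

Falling back to SDPA in float32 keeps the numerics close to the parity-verified configuration, while Flash Attention 2 trades some precision for speed and memory.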

Implementation Notes

In the original DNABERT codebase, BertModel is a thin subclass of transformers.BertModel with no modifications. This HF port uses Taykhoom/BERT-updated, which adds support for attn_implementation="sdpa" and attn_implementation="flash_attention_2"; neither option was part of the original codebase.

Citation

@article{ji2021_dnabert,
  title   = {{DNABERT}: pre-trained Bidirectional Encoder Representations from Transformers model for {DNA}-language in genome},
  author  = {Ji, Yanrong and Zhou, Zhihan and Liu, Han and Davuluri, Ramana V},
  journal = {Bioinformatics},
  volume  = {37},
  number  = {15},
  pages   = {2112--2120},
  year    = {2021},
  doi     = {10.1093/bioinformatics/btab083}
}

Credits

Original DNABERT model and code by Ji et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

MIT, following the original repository.
