UTRBERT-6mer

A BERT-base language model pre-trained on human 3' UTR sequences using 6-mer tokenization. Part of the 3UTRBERT model family introduced in Yang et al. (2024).

Architecture

Parameter	Value
Layers	12
Attention heads	12
Embedding dimension	768
Intermediate size	3072
Vocabulary size	4101 (5 special tokens + RNA 6-mers)
Positional encoding	Learned absolute (BERT-style)
Architecture	BERT-base
Max sequence length	512 tokens including [CLS] and [SEP] (510 6-mers, i.e. up to 515 nucleotides)

Tokenization: raw RNA (or DNA) sequences are first converted T->U, then split into overlapping 6-mers (stride 1), so a sequence of length L yields L-5 tokens. The tokenizer then prepends a [CLS] token and appends a [SEP] token.
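The tokenization rule can be sketched in plain Python. This is an illustrative helper, not the tokenizer shipped with the model: the function name `kmer_tokenize` is hypothetical, and the upper-casing step and the handling of sequences shorter than k are assumptions.

```python
def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Split a nucleotide sequence into overlapping k-mers (stride 1).

    Hypothetical helper for illustration; the released tokenizer may
    differ in details such as case handling or unknown characters.
    """
    rna = seq.upper().replace("T", "U")  # DNA -> RNA alphabet (T -> U)
    # A sequence of length L yields L - k + 1 k-mers (L - 5 for k = 6).
    return [rna[i:i + k] for i in range(len(rna) - k + 1)]


tokens = kmer_tokenize("ATGCATGCAT")
print(tokens)       # ['AUGCAU', 'UGCAUG', 'GCAUGC', 'CAUGCA', 'AUGCAU']
print(len(tokens))  # 5 == 10 - 5

# 515 nucleotides give 510 6-mers; adding [CLS] and [SEP] reaches 512 tokens.
print(len(kmer_tokenize("A" * 515)) + 2)  # 512
```

Special tokens are left out on purpose here, since the tokenizer adds them itself.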

Pretraining

Objective: Masked Language Modeling (MLM) on 6-mer tokens
Data: Human 3' UTR sequences
Source checkpoint: 6-new-12w-0/pytorch_model.bin from figshare article 22847354 (direct download)

Checkpoint selection

The only publicly released pre-trained checkpoint for the 6-mer variant is 6-new-12w-0.

Parity Verification

Hidden-state representations were verified against the original BertForMaskedLM implementation at all 13 representation levels (embedding output plus 12 transformer layers). The maximum absolute difference is below 2.5e-5 for both the eager and SDPA backends, which is consistent with float32 rounding error accumulating across 12 layers. Verification was run on GPU with PyTorch 2.7 / transformers 4.57.6.
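A check of this kind can be sketched as follows. This is not the verification script used for the port: the helper `max_abs_diff_per_level` is hypothetical and simply compares two sequences of hidden states, such as those returned with output_hidden_states=True, level by level. The random tensors only stand in for real model outputs.

```python
import torch


def max_abs_diff_per_level(hidden_a, hidden_b):
    """Return the max absolute difference at each representation level.

    hidden_a / hidden_b: sequences of tensors with matching shapes, e.g.
    the `hidden_states` tuples of two models (embedding output + layers).
    """
    if len(hidden_a) != len(hidden_b):
        raise ValueError("both models must expose the same number of levels")
    return [(a.float() - b.float()).abs().max().item()
            for a, b in zip(hidden_a, hidden_b)]


# Stand-in data: 13 levels of shape (batch, seq_len, 768) with tiny noise added.
torch.manual_seed(0)
reference = [torch.randn(2, 17, 768) for _ in range(13)]
ported = [h + 1e-6 * torch.randn_like(h) for h in reference]

diffs = max_abs_diff_per_level(reference, ported)
print(len(diffs))           # 13
print(max(diffs) < 2.5e-5)  # compare against the tolerance reported above
```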

Related Models

See the full UTRBERT collection.

Model	k-mer	Vocab size
UTRBERT-3mer	3	69
UTRBERT-4mer	4	261
UTRBERT-5mer	5	1029
UTRBERT-6mer	6	4101
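
The vocabulary sizes in the table follow from counting all 4^k k-mers over the RNA alphabet and adding the 5 special tokens. A quick check, where the helper name `vocab_size` is illustrative:

```python
from itertools import product

NUM_SPECIAL_TOKENS = 5  # as stated in the Architecture table


def vocab_size(k: int, alphabet: str = "ACGU") -> int:
    """Number of k-mers over the alphabet plus the special tokens."""
    return len(alphabet) ** k + NUM_SPECIAL_TOKENS


for k in (3, 4, 5, 6):
    print(k, vocab_size(k))  # vocab sizes: 69, 261, 1029, 4101

# Enumerating the 6-mers explicitly gives the same count.
six_mers = ["".join(p) for p in product("ACGU", repeat=6)]
print(len(six_mers) + NUM_SPECIAL_TOKENS)  # 4101
```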

Usage

Embedding generation

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTRBERT-6mer", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/UTRBERT-6mer", trust_remote_code=True)
model.eval()

sequences = ["AUGCAUGCAUGCAUGCAUGC", "GCGCGCGCGCGCGCGCGCGC"]
enc = tokenizer(sequences, return_tensors="pt", padding=True, truncation=True, max_length=512)

with torch.no_grad():
    out = model(**enc)

cls_emb   = out.last_hidden_state[:, 0, :]   # (batch, 768) -- CLS token
token_emb = out.last_hidden_state             # (batch, seq_len, 768)

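# Intermediate layers: hidden_states[0] is the embedding output,
# hidden_states[1..12] are the outputs of the 12 transformer layers
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer6_emb = out_all.hidden_states[6]         # (batch, seq_len, 768) -- output of layer 6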

MLM logits

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTRBERT-6mer", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/UTRBERT-6mer", trust_remote_code=True)
model.eval()

enc = tokenizer(["AUG[MASK]AUG"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits   # (1, seq_len, 4101)
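
To read predictions off the logits, one option is to take the top-scoring vocabulary entries at each masked position. The sketch below is self-contained and uses random logits and a made-up mask id (`MASK_ID = 4`) in place of real model output. With the actual model, the id would come from `tokenizer.mask_token_id`, and the predicted ids could be mapped back to 6-mers with `tokenizer.convert_ids_to_tokens`, assuming the custom tokenizer follows the standard tokenizer interface.

```python
import torch


def top_k_at_mask(logits, input_ids, mask_token_id, k=5):
    """Top-k token ids and probabilities at every masked position.

    logits: (batch, seq_len, vocab); input_ids: (batch, seq_len).
    Returns two tensors of shape (num_masked, k).
    """
    mask_positions = input_ids == mask_token_id   # boolean (batch, seq_len)
    masked_logits = logits[mask_positions]        # (num_masked, vocab)
    probs = torch.softmax(masked_logits, dim=-1)
    top = torch.topk(probs, k=k, dim=-1)
    return top.indices, top.values


# Stand-in inputs; MASK_ID is hypothetical, not the model's real mask id.
MASK_ID = 4
torch.manual_seed(0)
input_ids = torch.tensor([[2, 10, MASK_ID, 11, 3]])
logits = torch.randn(1, 5, 4101)

ids, probs = top_k_at_mask(logits, input_ids, MASK_ID, k=5)
print(ids.shape, probs.shape)  # torch.Size([1, 5]) torch.Size([1, 5])
```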

Fine-tuning

For sequence-level tasks, use the CLS token embedding as input to a prediction head.

import torch.nn as nn
from transformers import AutoModel

model = AutoModel.from_pretrained("Taykhoom/UTRBERT-6mer", trust_remote_code=True)

class UTRClassifier(nn.Module):
    def __init__(self, base, num_labels):
        super().__init__()
        self.base = base
        self.head = nn.Linear(768, num_labels)

    def forward(self, input_ids, attention_mask):
        cls = self.base(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.head(cls)
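
A single optimization step with this classifier might look as follows. To keep the sketch runnable without downloading weights, a small stand-in module (`DummyBase`, hypothetical) replaces the pre-trained encoder and mimics only its `last_hidden_state` output. The learning rate, batch, and labels are arbitrary illustrative values, not settings from the paper.

```python
from types import SimpleNamespace

import torch
import torch.nn as nn


class DummyBase(nn.Module):
    """Stand-in for the pre-trained encoder (hypothetical, for illustration)."""

    def __init__(self, vocab_size=4101, hidden=768):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)

    def forward(self, input_ids, attention_mask=None):
        return SimpleNamespace(last_hidden_state=self.emb(input_ids))


class UTRClassifier(nn.Module):
    def __init__(self, base, num_labels):
        super().__init__()
        self.base = base
        self.head = nn.Linear(768, num_labels)

    def forward(self, input_ids, attention_mask):
        cls = self.base(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.head(cls)


torch.manual_seed(0)
clf = UTRClassifier(DummyBase(), num_labels=2)
optimizer = torch.optim.AdamW(clf.parameters(), lr=2e-5)  # illustrative value
loss_fn = nn.CrossEntropyLoss()

input_ids = torch.randint(0, 4101, (4, 20))  # fake batch of token ids
attention_mask = torch.ones_like(input_ids)
labels = torch.tensor([0, 1, 1, 0])

clf.train()
optimizer.zero_grad()
logits = clf(input_ids, attention_mask)      # (4, 2)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
print(logits.shape, loss.item())
```

With the real model, `DummyBase()` would be replaced by the `model` loaded above, and the batch would come from the tokenizer.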

Implementation Notes

This port uses a standalone UTRBertModel (custom PreTrainedModel subclass, model_type: "utrbert"). trust_remote_code=True is required for both the tokenizer and the model.

The original implementation uses standard scaled dot-product attention (post-LN BERT-base). This HF port adds attn_implementation="sdpa" and attn_implementation="flash_attention_2" support, which were not part of the original codebase.

# Faster inference with SDPA (default on modern PyTorch)
model = AutoModel.from_pretrained("Taykhoom/UTRBERT-6mer", trust_remote_code=True,
                                   attn_implementation="sdpa")

# Flash Attention 2 (requires flash-attn installed)
model = AutoModel.from_pretrained("Taykhoom/UTRBERT-6mer", trust_remote_code=True,
                                   attn_implementation="flash_attention_2")
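
When the same script must run on machines with and without flash-attn, the backend can be chosen at runtime. The helper below is a hypothetical convenience, not part of the port. It only checks whether the `flash_attn` package is importable and falls back to SDPA otherwise. Flash Attention 2 additionally needs a supported GPU and half-precision weights (float16 or bfloat16), which the caller signals through the illustrative `use_gpu_half_precision` flag.

```python
import importlib.util


def pick_attn_implementation(use_gpu_half_precision: bool) -> str:
    """Choose a value for `attn_implementation` (illustrative helper)."""
    has_flash = importlib.util.find_spec("flash_attn") is not None
    if use_gpu_half_precision and has_flash:
        return "flash_attention_2"
    return "sdpa"


attn = pick_attn_implementation(use_gpu_half_precision=False)
print(attn)  # sdpa
```

The returned string can be passed as attn_implementation to AutoModel.from_pretrained, as in the calls above.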

Citation

@article{yang2024_utrbert,
  title   = {Deciphering 3'{UTR} Mediated Gene Regulation Using Interpretable Deep Representation Learning},
  author  = {Yang, Yuning and Li, Gen and Pang, Kuan and Cao, Wuxinhao and Zhang, Zhaolei and Li, Xiangtao},
  journal = {Advanced Science},
  volume  = {11},
  number  = {39},
  pages   = {e2407013},
  year    = {2024},
  doi     = {10.1002/advs.202407013}
}

Credits

Original model and code by Yang et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

MIT, following the original repository.
