UTRBERT-6mer

A BERT-base language model pre-trained on human 3' UTR sequences using 6-mer tokenization. Part of the 3UTRBERT model family introduced in Yang et al. (2024).

Architecture

| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Intermediate size | 3072 |
| Vocabulary size | 4101 (5 special tokens + RNA 6-mers) |
| Positional encoding | Learned absolute (BERT-style) |
| Architecture | BERT-base |
| Max sequence length | 512 tokens (~515 nucleotides for 6-mer) |

Tokenization: raw RNA (or DNA) sequences are first converted T->U, then split into overlapping 6-mers (stride 1), so a sequence of length L produces L-5 tokens. The tokenizer then prepends a [CLS] token and appends a [SEP] token.
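The tokenization rule can be sketched in plain Python. This is an illustrative re-implementation, not the released tokenizer: the function name `kmer_tokenize` is hypothetical, and upper-casing the input is an assumption.

```python
def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split a nucleotide sequence into overlapping k-mers (stride 1).

    Hypothetical helper for illustration only; the released tokenizer may
    differ in details such as case handling or unknown characters.
    """
    # Assumption: the input is upper-cased before the T->U conversion.
    rna = sequence.upper().replace("T", "U")
    # A sequence of length L yields L - k + 1 k-mers (L - 5 for k = 6).
    kmers = [rna[i:i + k] for i in range(len(rna) - k + 1)]
    # [CLS] and [SEP] are placed around the k-mer tokens.
    return ["[CLS]", *kmers, "[SEP]"]


tokens = kmer_tokenize("ATGCATGCAT")  # DNA input, 10 nucleotides
print(tokens)
```

A 10-nucleotide input gives 5 six-mers plus the two special tokens. By the same arithmetic, a 515-nucleotide input gives 510 six-mers plus 2 special tokens, which is exactly the 512-token limit.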

Pretraining

Checkpoint selection

The only publicly released pre-trained checkpoint for the 6-mer variant is 6-new-12w-0.

Parity Verification

Hidden-state representations were verified against the original BertForMaskedLM implementation at all 13 representation levels (embedding output + 12 transformer layers). The maximum absolute difference is below 2.5e-5 for both the eager and SDPA backends, which is consistent with float32 rounding accumulated across 12 layers. Verification was run on GPU with PyTorch 2.7 and transformers 4.57.6.
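A per-layer comparison of this kind could look like the sketch below. It is not the verification script used for this port: `max_abs_diff_per_layer` is a hypothetical helper, and random tensors stand in for the `hidden_states` tuples that the original and ported models would return for the same input with `output_hidden_states=True`.

```python
import torch


def max_abs_diff_per_layer(ref_states, port_states):
    """Return the max absolute difference for each pair of hidden states."""
    if len(ref_states) != len(port_states):
        raise ValueError("both models must return the same number of levels")
    return [
        (r.float() - p.float()).abs().max().item()
        for r, p in zip(ref_states, port_states)
    ]


# Stand-in data: 13 levels of shape (batch, seq_len, hidden).
# In a real check these would come from the two model implementations.
torch.manual_seed(0)
ref = [torch.randn(2, 16, 768) for _ in range(13)]
port = [h + 1e-6 * torch.randn_like(h) for h in ref]  # simulated tiny drift

diffs = max_abs_diff_per_layer(ref, port)
print(len(diffs), max(diffs) < 2.5e-5)
```

Reporting the difference per level, rather than only for the final layer, makes it easier to see where a divergence first appears.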

Related Models

See the full UTRBERT collection.

| Model | k-mer | Vocab size | Notes |
|---|---|---|---|
| UTRBERT-3mer | 3 | 69 | |
| UTRBERT-4mer | 4 | 261 | |
| UTRBERT-5mer | 5 | 1029 | |
| UTRBERT-6mer | 6 | 4101 | |
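The vocabulary sizes in this table follow from simple counting: 4^k possible RNA k-mers plus the special tokens. The short check below assumes that every variant uses the same 5 special tokens stated for the 6-mer model, which is consistent with the listed sizes.

```python
NUM_SPECIAL_TOKENS = 5  # stated for the 6-mer model; assumed for the others
ALPHABET_SIZE = 4       # A, C, G, U


def vocab_size(k: int) -> int:
    """Number of tokens for a k-mer vocabulary over the RNA alphabet."""
    return NUM_SPECIAL_TOKENS + ALPHABET_SIZE ** k


sizes = {k: vocab_size(k) for k in (3, 4, 5, 6)}
print(sizes)  # {3: 69, 4: 261, 5: 1029, 6: 4101}
```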

Usage

Embedding generation

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTRBERT-6mer", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/UTRBERT-6mer", trust_remote_code=True)
model.eval()

sequences = ["AUGCAUGCAUGCAUGCAUGC", "GCGCGCGCGCGCGCGCGCGC"]
enc = tokenizer(sequences, return_tensors="pt", padding=True, truncation=True, max_length=512)

with torch.no_grad():
    out = model(**enc)

cls_emb   = out.last_hidden_state[:, 0, :]   # (batch, 768) -- CLS token
token_emb = out.last_hidden_state             # (batch, seq_len, 768)

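# Intermediate layers: hidden_states[0] is the embedding output,
# hidden_states[1..12] are the 12 transformer layers
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer6_emb = out_all.hidden_states[6]         # (batch, seq_len, 768)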
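With `truncation=True, max_length=512`, anything beyond the first 515 nucleotides is dropped. One possible workaround is to split long sequences into overlapping windows before tokenizing. The sketch below is a plain-Python illustration: `window_sequence` is a hypothetical helper, the 50-nt overlap is an arbitrary choice, and how to combine the per-window embeddings afterwards is left open.

```python
def window_sequence(seq: str, max_nt: int = 515, overlap: int = 50) -> list[str]:
    """Split seq into windows of at most max_nt nucleotides.

    Hypothetical helper; the default overlap is illustrative and not a
    recommendation from the original authors.
    """
    if not 0 <= overlap < max_nt:
        raise ValueError("overlap must be in [0, max_nt)")
    if len(seq) <= max_nt:
        return [seq]
    step = max_nt - overlap
    windows = []
    for start in range(0, len(seq), step):
        windows.append(seq[start:start + max_nt])
        if start + max_nt >= len(seq):
            break  # this window already reaches the end of the sequence
    return windows


windows = window_sequence("AUGC" * 300)  # 1200 nucleotides
print([len(w) for w in windows])  # [515, 515, 270]
```

Each window can then be passed to the tokenizer as a separate sequence. For inputs longer than `max_nt`, an overlap of at least 5 guarantees that the final window is at least 6 nucleotides long, so it still contains a full 6-mer.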

MLM logits

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTRBERT-6mer", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/UTRBERT-6mer", trust_remote_code=True)
model.eval()

enc = tokenizer(["AUG[MASK]AUG"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits   # (1, seq_len, 4101)

Fine-tuning

For sequence-level tasks, use the CLS token embedding as input to a prediction head.

import torch.nn as nn
from transformers import AutoModel

model = AutoModel.from_pretrained("Taykhoom/UTRBERT-6mer", trust_remote_code=True)

class UTRClassifier(nn.Module):
    def __init__(self, base, num_labels):
        super().__init__()
        self.base = base
        self.head = nn.Linear(768, num_labels)

    def forward(self, input_ids, attention_mask):
        cls = self.base(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.head(cls)

Implementation Notes

This port uses a standalone UTRBertModel (custom PreTrainedModel subclass, model_type: "utrbert"). trust_remote_code=True is required for both the tokenizer and the model.

The original implementation uses standard scaled dot-product attention (post-LN BERT-base). This HF port adds attn_implementation="sdpa" and attn_implementation="flash_attention_2" support, which were not part of the original codebase.

# Faster inference with SDPA (default on modern PyTorch)
model = AutoModel.from_pretrained("Taykhoom/UTRBERT-6mer", trust_remote_code=True,
                                   attn_implementation="sdpa")

# Flash Attention 2 (requires flash-attn installed)
model = AutoModel.from_pretrained("Taykhoom/UTRBERT-6mer", trust_remote_code=True,
                                   attn_implementation="flash_attention_2")

Citation

@article{yang2024_utrbert,
  title   = {Deciphering 3'{UTR} Mediated Gene Regulation Using Interpretable Deep Representation Learning},
  author  = {Yang, Yuning and Li, Gen and Pang, Kuan and Cao, Wuxinhao and Zhang, Zhaolei and Li, Xiangtao},
  journal = {Advanced Science},
  volume  = {11},
  number  = {39},
  pages   = {e2407013},
  year    = {2024},
  doi     = {10.1002/advs.202407013}
}

Credits

Original model and code by Yang et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

MIT, following the original repository.
