UTRBERT-6mer
A BERT-base language model pre-trained on human 3' UTR sequences using 6-mer tokenization. Part of the 3UTRBERT model family introduced in Yang et al. (2024).
Architecture
| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Intermediate size | 3072 |
| Vocabulary size | 4101 (5 special tokens + RNA 6-mers) |
| Positional encoding | Learned absolute (BERT-style) |
| Architecture | BERT-base |
| Max sequence length | 512 tokens (~515 nucleotides for 6-mer) |
Tokenization: raw RNA (or DNA) sequences are converted T->U, then split into overlapping 6-mers (stride 1). A sequence of length L produces L-5 tokens. The tokenizer then prepends a [CLS] token and appends a [SEP] token.
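The tokenization rule above can be sketched in plain Python. This is an illustration only, not the tokenizer shipped with the model; the function name `kmer_tokenize` is hypothetical, and special tokens are left out.

```python
def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Split a sequence into overlapping k-mers with stride 1 (illustrative only)."""
    rna = seq.replace("T", "U")  # DNA -> RNA alphabet, as described above
    # A sequence of length L yields L - k + 1 k-mers (L - 5 for k = 6).
    return [rna[i:i + k] for i in range(len(rna) - k + 1)]


seq = "ATGCATGCAT"  # 10 nucleotides
kmers = kmer_tokenize(seq)
print(kmers)       # ['AUGCAU', 'UGCAUG', 'GCAUGC', 'CAUGCA', 'AUGCAU']
print(len(kmers))  # 5 == len(seq) - 5

# With [CLS] and [SEP] taking 2 of the 512 positions, 510 k-mer tokens remain,
# which corresponds to 510 + 5 = 515 nucleotides.
max_tokens = 512
max_nucleotides = (max_tokens - 2) + (6 - 1)
print(max_nucleotides)  # 515
```

This also shows where the ~515-nucleotide limit in the Architecture table comes from.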
Pretraining
- Objective: Masked Language Modeling (MLM) on 6-mer tokens
- Data: Human 3' UTR sequences
- Source checkpoint: 6-new-12w-0/pytorch_model.bin from figshare article 22847354 (direct download)
Checkpoint selection
The only publicly released pre-trained checkpoint for the 6-mer variant is 6-new-12w-0.
Parity Verification
Hidden-state representations were verified against the original BertForMaskedLM implementation at all 13 representation levels (embedding output + 12 transformer layers). The maximum absolute difference is below 2.5e-5 for both the eager and SDPA backends, consistent with float32 rounding accumulated across 12 layers. Verified on GPU with PyTorch 2.7 / transformers 4.57.6.
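A parity check of this kind can be sketched as follows. This is not the verification script used for this port: the helper name `max_abs_diff_per_layer` is hypothetical, and random tensors stand in for the hidden states of the two implementations so the snippet runs without downloading weights.

```python
import torch


def max_abs_diff_per_layer(hidden_a, hidden_b):
    """Return the max absolute difference for each pair of hidden-state tensors."""
    if len(hidden_a) != len(hidden_b):
        raise ValueError("both models must expose the same number of levels")
    return [(a - b).abs().max().item() for a, b in zip(hidden_a, hidden_b)]


torch.manual_seed(0)
# 13 levels: embedding output + 12 transformer layers, shape (batch, seq_len, 768).
reference = [torch.randn(2, 16, 768) for _ in range(13)]
# Stand-in for the ported model: the reference plus tiny float32-scale noise.
ported = [h + 1e-6 * torch.randn_like(h) for h in reference]

diffs = max_abs_diff_per_layer(reference, ported)
tolerance = 2.5e-5  # the bound reported above
print(all(d < tolerance for d in diffs))
```

With real models, the two lists would be the `hidden_states` tuples returned when calling each model with `output_hidden_states=True` on the same input.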
Related Models
See the full UTRBERT collection.
| Model | k-mer | Vocab size | Notes |
|---|---|---|---|
| UTRBERT-3mer | 3 | 69 | |
| UTRBERT-4mer | 4 | 261 | |
| UTRBERT-5mer | 5 | 1029 | |
| UTRBERT-6mer | 6 | 4101 | |
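The vocabulary sizes in the table match the number of possible k-mers over the four-letter RNA alphabet plus 5 special tokens. A quick check, assuming the same 5 special tokens for every variant (the Architecture table states this only for the 6-mer model):

```python
NUM_SPECIAL_TOKENS = 5  # stated for the 6-mer model; assumed here for the other variants


def vocab_size(k: int, alphabet_size: int = 4) -> int:
    """Number of k-mers over the alphabet plus the special tokens."""
    return alphabet_size ** k + NUM_SPECIAL_TOKENS


for k in (3, 4, 5, 6):
    print(k, vocab_size(k))  # 3 69, 4 261, 5 1029, 6 4101
```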
Usage
Embedding generation
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTRBERT-6mer", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/UTRBERT-6mer", trust_remote_code=True)
model.eval()
sequences = ["AUGCAUGCAUGCAUGCAUGC", "GCGCGCGCGCGCGCGCGCGC"]
enc = tokenizer(sequences, return_tensors="pt", padding=True, truncation=True, max_length=512)
with torch.no_grad():
    out = model(**enc)
cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 768) -- CLS token
token_emb = out.last_hidden_state         # (batch, seq_len, 768)
# Intermediate layers (hidden_states[0] is the embedding output)
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer6_emb = out_all.hidden_states[6]  # (batch, seq_len, 768)
MLM logits
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTRBERT-6mer", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/UTRBERT-6mer", trust_remote_code=True)
model.eval()
enc = tokenizer(["AUG[MASK]AUG"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, 4101)
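To turn the logits into predictions, one option is to take the top-scoring vocabulary ids at each masked position. Below is a minimal sketch with a hypothetical helper `top_k_at_mask` and synthetic tensors in place of real model output. With the real model, `mask_token_id` would come from `tokenizer.mask_token_id`, and the ids could be mapped back to 6-mers with `tokenizer.convert_ids_to_tokens`, assuming the custom tokenizer implements those standard methods.

```python
import torch


def top_k_at_mask(logits, input_ids, mask_token_id, k=5):
    """Return (batch_idx, position, top-k token ids) for every masked position."""
    results = []
    # nonzero() yields one (batch, position) row per masked token.
    for b, pos in (input_ids == mask_token_id).nonzero().tolist():
        top_ids = logits[b, pos].topk(k).indices.tolist()
        results.append((b, pos, top_ids))
    return results


# Synthetic example: vocabulary of 4101 entries, mask id 4 chosen arbitrarily.
vocab_size, mask_id = 4101, 4
input_ids = torch.tensor([[2, 10, mask_id, 12, 3]])
logits = torch.zeros(1, 5, vocab_size)
logits[0, 2, 1234] = 5.0  # make id 1234 the clear winner at the masked position
logits[0, 2, 99] = 3.0    # runner-up

print(top_k_at_mask(logits, input_ids, mask_id, k=2))  # [(0, 2, [1234, 99])]
```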
Fine-tuning
For sequence-level tasks, use the CLS token embedding as input to a prediction head.
import torch.nn as nn
from transformers import AutoModel
model = AutoModel.from_pretrained("Taykhoom/UTRBERT-6mer", trust_remote_code=True)
class UTRClassifier(nn.Module):
    def __init__(self, base, num_labels):
        super().__init__()
        self.base = base
        self.head = nn.Linear(768, num_labels)

    def forward(self, input_ids, attention_mask):
        cls = self.base(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.head(cls)
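A single training step with a classifier of this shape might look like the following. To keep the snippet runnable offline, a tiny stand-in encoder (`DummyBase`, hypothetical) replaces the pretrained model. The optimizer, learning rate, and loss are illustrative choices, not settings from the original paper.

```python
from types import SimpleNamespace

import torch
import torch.nn as nn


class DummyBase(nn.Module):
    """Stand-in encoder returning an object with `last_hidden_state`, like AutoModel."""

    def __init__(self, vocab_size=4101, hidden=768):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)

    def forward(self, input_ids, attention_mask=None):
        return SimpleNamespace(last_hidden_state=self.emb(input_ids))


class UTRClassifier(nn.Module):
    def __init__(self, base, num_labels):
        super().__init__()
        self.base = base
        self.head = nn.Linear(768, num_labels)

    def forward(self, input_ids, attention_mask):
        cls = self.base(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.head(cls)


torch.manual_seed(0)
clf = UTRClassifier(DummyBase(), num_labels=2)
optimizer = torch.optim.AdamW(clf.parameters(), lr=2e-5)  # illustrative hyperparameters
loss_fn = nn.CrossEntropyLoss()

# Fake batch: 4 sequences of 12 token ids each, with binary labels.
input_ids = torch.randint(0, 4101, (4, 12))
attention_mask = torch.ones_like(input_ids)
labels = torch.tensor([0, 1, 0, 1])

clf.train()
logits = clf(input_ids, attention_mask)  # (4, 2)
loss = loss_fn(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(logits.shape, loss.item())
```

In practice `DummyBase()` would be replaced by the `AutoModel` instance loaded above, and the batch by tokenizer output.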
Implementation Notes
This port uses a standalone UTRBertModel (custom PreTrainedModel subclass, model_type: "utrbert").
trust_remote_code=True is required for both the tokenizer and the model.
The original implementation uses standard scaled dot-product attention (post-LN BERT-base).
This HF port adds attn_implementation="sdpa" and attn_implementation="flash_attention_2"
support, which were not part of the original codebase.
# Faster inference with SDPA (default on modern PyTorch)
model = AutoModel.from_pretrained("Taykhoom/UTRBERT-6mer", trust_remote_code=True,
                                  attn_implementation="sdpa")
# Flash Attention 2 (requires flash-attn installed)
model = AutoModel.from_pretrained("Taykhoom/UTRBERT-6mer", trust_remote_code=True,
                                  attn_implementation="flash_attention_2")
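The eager and SDPA backends compute the same attention function and differ only in how the computation is fused. The sketch below compares a hand-written attention with PyTorch's `torch.nn.functional.scaled_dot_product_attention` (available in PyTorch 2.0 and later) on random tensors. It illustrates the general idea and is not code from this port.

```python
import math

import torch
import torch.nn.functional as F


def eager_attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V, written out explicitly."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v


torch.manual_seed(0)
# (batch, heads, seq_len, head_dim): 12 heads of size 64 give the 768-dim hidden state.
q, k, v = (torch.randn(2, 12, 16, 64) for _ in range(3))

out_eager = eager_attention(q, k, v)
out_sdpa = F.scaled_dot_product_attention(q, k, v)

# Any difference is float32 rounding, not a change in the math.
print((out_eager - out_sdpa).abs().max().item())
```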
Citation
@article{yang2024_utrbert,
title = {Deciphering 3'{UTR} Mediated Gene Regulation Using Interpretable Deep Representation Learning},
author = {Yang, Yuning and Li, Gen and Pang, Kuan and Cao, Wuxinhao and Zhang, Zhaolei and Li, Xiangtao},
journal = {Advanced Science},
volume = {11},
number = {39},
pages = {e2407013},
year = {2024},
doi = {10.1002/advs.202407013}
}
Credits
Original model and code by Yang et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.
License
MIT, following the original repository.