SpliceBERT-510nt

SpliceBERT is a BERT-based RNA language model pre-trained on over 2 million vertebrate primary RNA sequences using a masked language modeling (MLM) objective. The 510nt vertebrate variant is trained exclusively on fixed-length 510 nt fragments.

WARNING: This model requires exactly 510 nt of input (excluding [CLS] and [SEP]). Sequences shorter or longer than 510 nt may produce incorrect outputs without fine-tuning. For general-purpose RNA embedding, use SpliceBERT-1024nt instead.

Architecture

| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention heads | 16 |
| Embedding dimension | 512 |
| Intermediate dimension | 2048 |
| Vocabulary size | 10 |
| Positional encoding | Learned absolute |
| Architecture | BERT encoder |
| Max sequence length | 510 (fixed-length training) |
| Parameters | ~19.2M |
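
As a rough sanity check on the model size, the parameter count can be estimated from the dimensions in the table. The sketch below assumes a standard BERT encoder layout (a bias on every linear layer, two LayerNorms per transformer layer, a 2-entry token-type embedding), 512 learned positions (510 nt plus [CLS] and [SEP]), and no pooler or MLM head. These details are assumptions, not values read from the checkpoint's config.

```python
def bert_encoder_params(hidden=512, intermediate=2048, layers=6,
                        vocab=10, positions=512, token_types=2):
    """Estimate the parameters of a plain BERT encoder (no pooler, no MLM head)."""
    layer_norm = 2 * hidden  # weight + bias
    # Word, position and token-type embeddings, followed by one LayerNorm.
    embeddings = (vocab + positions + token_types) * hidden + layer_norm
    # Q, K, V and output projections (with biases), followed by one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two linear layers (hidden -> intermediate -> hidden), followed by one LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    return embeddings + layers * (attention + feed_forward)

total = bert_encoder_params()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate comes to about 19.2M parameters, nearly all of them in the six transformer layers; the 10-token vocabulary contributes only a few thousand.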

Vocabulary: [PAD]=0, [UNK]=1, [CLS]=2, [SEP]=3, [MASK]=4, N=5, A=6, C=7, G=8, T/U=9
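
To make the vocabulary concrete, here is a minimal sketch of how a sequence would map to input IDs under the listing above. The function name `encode` is hypothetical, and sending unexpected characters to [UNK] is an assumption; the tokenizer shipped with the model remains the authority on actual behavior.

```python
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
         "N": 5, "A": 6, "C": 7, "G": 8, "T": 9}

def encode(seq: str) -> list[int]:
    """Map a nucleotide string to IDs, wrapped in [CLS] ... [SEP]."""
    # U and T share one ID, so RNA and DNA spellings encode identically.
    bases = seq.upper().replace("U", "T")
    # Assumption: any symbol outside N/A/C/G/T falls back to [UNK].
    ids = [VOCAB.get(base, VOCAB["[UNK]"]) for base in bases]
    return [VOCAB["[CLS]"], *ids, VOCAB["[SEP]"]]

print(encode("AUGCN"))  # [2, 6, 9, 8, 7, 5, 3]
```

A 510 nt input therefore becomes 512 IDs once the two special tokens are added.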

Pretraining

  • Objective: Masked language modeling (MLM)
  • Data: >2 million vertebrate primary RNA sequences from 72 species
  • Sequence format: Single-nucleotide tokenization with spaces; U converted to T; fixed 510 nt fragments
  • Source checkpoint: SpliceBERT.510nt/pytorch_model.bin (from zenodo:7995778)

Checkpoint selection

The 510nt vertebrate variant is intended for splice site prediction tasks where exact 510 nt windows are used (e.g., centered on a splice site). For variable-length sequences use SpliceBERT-1024nt.
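
A window of exactly 510 nt around a site of interest can be cut as sketched below. The helper name `center_window`, the 0-based `site` index, the choice to place the site at offset 255, and padding with N near sequence ends are all assumptions made for illustration; the original preprocessing may differ.

```python
WINDOW = 510  # fixed input length expected by this checkpoint

def center_window(seq: str, site: int, window: int = WINDOW) -> str:
    """Return a `window`-nt slice of `seq` with `seq[site]` at offset window // 2."""
    if not 0 <= site < len(seq):
        raise ValueError(f"site {site} is outside a sequence of length {len(seq)}")
    start = site - window // 2
    end = start + window
    # Assumption: positions beyond either end of the sequence are filled with N.
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(seq))
    out = left_pad + seq[max(0, start):min(len(seq), end)] + right_pad
    assert len(out) == window
    return out
```

For example, `center_window(transcript, 500)` returns a 510-character string whose character at index 255 is `transcript[500]`.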

Parity Verification

Hidden-state representations were verified against the original checkpoint (max abs diff < 1e-5) at all 7 representation levels (embedding output plus 6 transformer layers), for both the eager and sdpa attention backends. Verification was run on GPU with PyTorch 2.7 / CUDA 11.8.
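
A check of this kind can be sketched as follows. The function names and the assumption that both models expose `hidden_states` as equal-length sequences of tensors are illustrative; this is not the verification script used for the port.

```python
import torch

def max_abs_diff_per_level(reference, ported):
    """Largest absolute element-wise difference at each representation level.

    `reference` and `ported` are sequences of same-shaped tensors, e.g. the
    `hidden_states` tuples of two models run on the same input (embedding
    output followed by one tensor per transformer layer).
    """
    if len(reference) != len(ported):
        raise ValueError("models returned a different number of levels")
    return [(a.float() - b.float()).abs().max().item()
            for a, b in zip(reference, ported)]

def passes_parity(reference, ported, tol=1e-5):
    """True if every level differs by less than `tol`."""
    return all(d < tol for d in max_abs_diff_per_level(reference, ported))
```

With `output_hidden_states=True`, a 6-layer encoder yields 7 tensors per model, so the list returned by `max_abs_diff_per_level` has 7 entries.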

Related Models

See the full SpliceBERT collection.

| Model | Context | Training data | Notes |
|---|---|---|---|
| SpliceBERT-1024nt | 1024 nt | 72 vertebrates | Variable-length; general purpose |
| SpliceBERT-510nt | 510 nt (fixed) | 72 vertebrates | This model |
| SpliceBERT-human-510nt | 510 nt (fixed) | Human only | Human-specific |

Usage

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-510nt", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-510nt", trust_remote_code=True)
model.eval()

# Sequence must be exactly 510 nt; tokenizer handles U->T automatically
seq = ("ATCGATCG" * 64)[:510]  # exactly 510 nt
enc = tokenizer(seq, return_tensors="pt")

with torch.no_grad():
    out = model(**enc, output_hidden_states=True)

hidden = out.last_hidden_state[0]  # (512, 512)
token_emb = hidden[1:-1]           # strip [CLS] and [SEP] -> (510, 512)
mean_emb = token_emb.mean(dim=0)   # (512,)

Fine-tuning

Fine-tuning follows standard Hugging Face conventions. For splice site prediction, the typical setup is token-level classification over all 510 nucleotide positions (excluding special tokens).
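
As an illustration of that setup, the sketch below places a linear classification head on per-token embeddings. `TokenClassifierHead`, the three-class label scheme (for instance none / donor / acceptor) and the dropout rate are assumptions chosen for the example, not settings from the original work. The input is assumed to be the encoder's `last_hidden_state` with shape (batch, 512, 512); random tensors stand in for it here.

```python
import torch
from torch import nn

class TokenClassifierHead(nn.Module):
    """Per-nucleotide classifier over encoder hidden states."""

    def __init__(self, hidden_size: int = 512, num_labels: int = 3,
                 dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # Drop [CLS] (first) and [SEP] (last) so logits align with the 510 nt.
        tokens = last_hidden_state[:, 1:-1, :]
        return self.classifier(self.dropout(tokens))  # (batch, 510, num_labels)

head = TokenClassifierHead()
logits = head(torch.randn(2, 512, 512))   # stand-in for encoder output
labels = torch.randint(0, 3, (2, 510))    # one hypothetical label per nucleotide
loss = nn.functional.cross_entropy(logits.reshape(-1, 3), labels.reshape(-1))
```

In an actual fine-tuning run the random tensor would be replaced by the model's output, and the encoder and head would be optimized jointly on `loss`.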

Implementation Notes

The original checkpoint was saved as BertForMaskedLM with transformers==4.20.1. This port uses BERT-updated, which adds attn_implementation="sdpa" and attn_implementation="flash_attention_2" support not present in the original codebase.

Citation

@article{chen2024_splicebert,
  title   = {Self-supervised learning on millions of primary {RNA} sequences from 72 vertebrates improves sequence-based {RNA} splicing prediction},
  author  = {Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},
  journal = {Briefings in Bioinformatics},
  volume  = {25},
  number  = {3},
  pages   = {bbae163},
  year    = {2024},
  doi     = {10.1093/bib/bbae163}
}

Credits

Original model and code by Chen et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

MIT, following the original repository.
