hal / README.md
ZhiyuanChen's picture
Upload folder using huggingface_hub
8d9a34e verified
metadata
library_name: multimolecule
license: agpl-3.0
pipeline: splice-variant-effect
pipeline_tag: other
tags:
  - Biology
  - RNA
  - Splicing
  - rna
widget:
  - example_title: microRNA 21
    pipeline_tag: splice-variant-effect
    sequence_type: ncRNA
    task: splice-variant-effect
    text: UAGCUUAUCAGACUGAUGUUGA
  - example_title: microRNA 146a
    pipeline_tag: splice-variant-effect
    sequence_type: ncRNA
    task: splice-variant-effect
    text: UGAGAACUGAAUUCCAUGGGUU
  - example_title: microRNA 155
    pipeline_tag: splice-variant-effect
    sequence_type: ncRNA
    task: splice-variant-effect
    text: UUAAUGCUAAUCGUGAUAGGGGUU
  - example_title: RNA component of mitochondrial RNA processing endoribonuclease
    pipeline_tag: splice-variant-effect
    sequence_type: ncRNA
    task: splice-variant-effect
    text: >-
      GGUUCGUGCUGAAGGCCUGUAUCCUAGGCUACACACUGAGGACUCUGUUCCUCCCCUUUCCGCCUAGGGGAAAGUCCCCGGACCUCGGGCAGAGAGUGCCACGUGCAUACGCACGUAGACAUUCCCCGCUUCCCACUCCAAAGUCCGCCAAGAAGCGUAUCCCGCUGAGCGGCGUGGCGCGGGGGCGUCAUCCGUCAGCUCCCUCUAGUUACGCAGGCAGUGCGUGUCCGCGCACCAACCACACGGGGCUCAUUCUCAGCGCGGCUGUAAAAAAAAA
  - example_title: 7SK small nuclear RNA
    pipeline_tag: splice-variant-effect
    sequence_type: ncRNA
    task: splice-variant-effect
    text: >-
      GGAUGUGAGGGCGAUCUGGCUGCGACAUCUGUCACCCCAUUGAUCGCCAGGGUUGAUUCGGCUGAUCUGGCUGGCUAGGCGGGUGUCCCCUUCCUCCCUCACCGCUCCAUGUGCGUCCCUCCCGAAGCUGCGCGCUCGGUCGAAGAGGACGACCAUCCCCGAUAGAGGAGGACCGGUCUUCGGUCAAGGGUAUACGAGUAGCUGCGCUCCCCUGCUAGAACCUCCAAACAAGCUCUCAAGGUCCAUUUGUAGGAGAACGUAGGGUAGUCAAGCUUCCAAGACUCCAGACACAUCCAAAUGAGGCGCUGCAUGUGGCAGUCUGCCUUUCUUUU
  - example_title: telomerase RNA component
    pipeline_tag: splice-variant-effect
    sequence_type: ncRNA
    task: splice-variant-effect
    text: >-
      GGGUUGCGGAGGGUGGGCCUGGGAGGGGUGGUGGCCAUUUUUUGUCUAACCCUAACUGAGAAGGGCGUAGGCGCCGUGCUUUUGCUCCCCGCGCGCUGUUUUUCUCGCUGACUUUCAGCGGGCGGAAAAGCCUCGGCCUGCCGCCUUCCACCGUUCAUUCUAGAGCAAACAAAAAAUGUCAGCUGCUGGCCCGUUCGCCCCUCCCGGGGACCUGCGGCGGGUCGCCUGCCCAGCCCCCGAACCCCGCCUGGAGGCCGCGGUCGGCCCGGGGCUUCUCCGGAGGCACCCACUGCCACCGCGAAGAGUUGGGCUCUGUCAGCCGCGGGUCUCUCGGGGGCGAGGGCGAGGUUCAGGCCUUUCAGGCCGCAGGAAGAGGAACGGAGCGAGUCCCCGCGCGCGGCGCGAUUCCCUGAGCUGUGGGACGUGCACCCAGGACUCGGCUCACACAUGC
  - example_title: vault RNA 2-1
    pipeline_tag: splice-variant-effect
    sequence_type: ncRNA
    task: splice-variant-effect
    text: >-
      CGGGUCGGAGUUAGCUCAAGCGGUUACCUCCUCAUGCCGGACUUUCUAUCUGUCCAUCUCUGUGCUGGGGUUCGAGACCCGCGGGUGCUUACUGACCCUUUUAUGCAA
  - example_title: brain cytoplasmic RNA 1
    pipeline_tag: splice-variant-effect
    sequence_type: ncRNA
    task: splice-variant-effect
    text: >-
      GGCCGGGCGCGGUGGCUCACGCCUGUAAUCCCAGCUCUCAGGGAGGCUAAGAGGCGGGAGGAUAGCUUGAGCCCAGGAGUUCGAGACCUGCCUGGGCAAUAUAGCGAGACCCCGUUCUCCAGAAAAAGGAAAAAAAAAAACAAAAGACAAAAAAAAAAUAAGCGUAACUUCCCUCAAAGCAACAACCCCCCCCCCCCUUU
  - example_title: HIV-1 TAR-WT
    pipeline_tag: splice-variant-effect
    sequence_type: ncRNA
    task: splice-variant-effect
    text: GGUCUCUCUGGUUAGACCAGAUCUGAGCCUGGGAGCUCUCUGGCUAACUAGGGAACC
  - example_title: prion protein (Kanno blood group)
    pipeline_tag: splice-variant-effect
    sequence_type: mRNA
    task: splice-variant-effect
    text: AUGGCGAACCUUGGCUGCUGGAUGCUGGUUCUCUUUGUGGCCACAUGGAGUGACCUGGGCCUCUGC
  - example_title: interleukin 10
    pipeline_tag: splice-variant-effect
    sequence_type: mRNA
    task: splice-variant-effect
    text: AUGCACAGCUCAGCACUGCUCUGUUGCCUGGUCCUCCUGACUGGGGUGAGGGCC
  - example_title: Zaire ebolavirus
    pipeline_tag: splice-variant-effect
    sequence_type: mRNA
    task: splice-variant-effect
    text: >-
      AAUGUUCAAACACUUUGUGAAGCUCUGUUAGCUGAUGGUCUUGCUAAAGCAUUUCCUAGCAAUAUGAUGGUAGUCACAGAGCGUGAGCAAAAAGAAAGCUUAUUGCAUCAAGCAUCAUGGCACCACACAAGUGAUGAUUUUGGUGAGCAUGCCACAGUUAGAGGGAGUAGCUUUGUAACUGAUUUAGAGAAAUACAAUCUUGCAUUUAGAUAUGAGUUUACAGCACCUUUUAUAGAAUAUUGUAACCGUUGCUAUGGUGUUAAGAAUGUUUUUAAUUGGAUGCAUUAUACAAUCCCACAGUGUUAU
  - example_title: SARS coronavirus
    pipeline_tag: splice-variant-effect
    sequence_type: mRNA
    task: splice-variant-effect
    text: >-
      AUGUUUAUUUUCUUAUUAUUUCUUACUCUCACUAGUGGUAGUGACCUUGACCGGUGCACCACUUUUGAUGAUGUUCAAGCUCCUAAUUACACUCAACAUACUUCAUCUAUGAGGGGGGUUUACUAUCCUGAUGAAAUUUUUAGAUCAGACACUCUUUAUUUAACUCAGGAUUUAUUUCUUCCAUUUUAUUCUAAUGUUACAGGGUUUCAUACUAUUAAUCAUACGUUUGACAACCCUGUCAUACCUUUUAAGGAUGGUAUUUAUUUUGCUGCCACAGAGAAAUCAAAUGUUGUCCGUGGUUGGGUUUUUGGUUCUACCAUGAACAACAAGUCACAGUCGGUGAUUAUUAUUAACAAUUCUACUAAUGUUGUUAUACGAGCAUGUAACUUUGAAUUGUGUGACAACCCUUUCUUUGCUGUUUCUAAACCCAUGGGUACACAGACACAUACUAUGAUAUUCGAUAAUGCAUUUAAAUGCACUUUCGAGUACAUAUCU
  - example_title: insulin
    pipeline_tag: splice-variant-effect
    sequence_type: mRNA
    task: splice-variant-effect
    text: >-
      AUGGCCCUGUGGAUGCGCCUCCUGCCCCUGCUGGCGCUGCUGGCCCUCUGGGGACCUGACCCAGCCGCAGCCUUUGUGAACCAACACCUGUGCGGCUCACACCUGGUGGAAGCUCUCUACCUAGUGUGCGGGGAACGAGGCUUCUUCUACACACCCAAGACCCGCCGGGAGGCAGAGGACCUGCAGGUGGGGCAGGUGGAGCUGGGCGGGGGCCCUGGUGCAGGCAGCCUGCAGCCCUUGGCCCUGGAGGGGUCCCUGCAGAAGCGUGGCAUUGUGGAACAAUGCUGUACCAGCAUCUGCUCCCUCUACCAGCUGGAGAACUACUGCAACUAG
  - example_title: cyclin dependent kinase inhibitor 2A
    pipeline_tag: splice-variant-effect
    sequence_type: mRNA
    task: splice-variant-effect
    text: >-
      AUGGAGCCGGCGGCGGGGAGCAGCAUGGAGCCUUCGGCUGACUGGCUGGCCACGGCCGCGGCCCGGGGUCGGGUAGAGGAGGUGCGGGCGCUGCUGGAGGCGGGGGCGCUGCCCAACGCACCGAAUAGUUACGGUCGGAGGCCGAUCCAGGUCAUGAUGAUGGGCAGCGCCCGAGUGGCGGAGCUGCUGCUGCUCCACGGCGCGGAGCCCAACUGCGCCGACCCCGCCACUCUCACCCGACCCGUGCACGACGCUGCCCGGGAGGGCUUCCUGGACACGCUGGUGGUGCUGCACCGGGCCGGGGCGCGGCUGGACGUGCGCGAUGCCUGGGGCCGUCUGCCCGUGGACCUGGCUGAGGAGCUGGGCCAUCGCGAUGUCGCACGGUACCUGCGCGCGGCUGCGGGGGGCACCAGAGGCAGUAACCAUGCCCGCAUAGAUGCCGCGGAAGGUCCCUCAGACAUCCCCGAUUGA
  - example_title: human papillomavirus type 16 E6
    pipeline_tag: splice-variant-effect
    sequence_type: mRNA
    task: splice-variant-effect
    text: >-
      AUGCACCAAAAGAGAACUGCAAUGUUUCAGGACCCACAGGAGCGACCCAGAAAGUUACCACAGUUAUGCACAGAGCUGCAAACAACUAUACAUGAUAUAAUAUUAGAAUGUGUGUACUGCAAGCAACAGUUACUGCGACGUGAGGUAUAUGACUUUGCUUUUCGGGAUUUAUGCAUAGUAUAUAGAGAUGGGAAUCCAUAUGCUGUAUGUGAUAAAUGUUUAAAGUUUUAUUCUAAAAUUAGUGAGUAUAGACAUUAUUGUUAUAGUUUGUAUGGAACAACAUUAGAACAGCAAUACAACAAACCGUUGUGUGAUUUGUUAAUUAGGUGUAUUAACUGUCAAAAGCCACUGUGUCCUGAAGAAAAGCAAAGACAUCUGGACAAAAAGCAAAGAUUCCAUAAUAUAAGGGGUCGGUGGACCGGUCGAUGUAUGUCUUGUUGCAGAUCAUCAAGAACACGUAGAGAAACCCAGCUGUAA
  - example_title: NRAS proto-oncogene
    pipeline_tag: splice-variant-effect
    sequence_type: 5' UTR
    task: splice-variant-effect
    text: >-
      GGGGCCGGAAGUGCCGCUCCUUGGUGGGGGCUGUUCAUGGCGGUUCCGGGGUCUCCAACAUUUUUCCCGGCUGUGGUCCUAAAUCUGUCCAAAGCAGAGGCAGUGGAGCUUGAGGUUCUUGCUGGUGUGAA
  - example_title: amyloid beta precursor protein
    pipeline_tag: splice-variant-effect
    sequence_type: 5' UTR
    task: splice-variant-effect
    text: >-
      GUCAGUUUCCUCGGCAGCGGUAGGCGAGAGCACGCGGAGGAGCGUGCGCGGGGGCCCCGGGAGACGGCGGCGGUGGCGGCGCGGGCAGAGCAAGGACGCGGCGGAUCCCACUCGCACAGCAGCGCACUCGGUGCCCCGCGCAGGGUCGCG
  - example_title: RUNX family transcription factor 1
    pipeline_tag: splice-variant-effect
    sequence_type: 5' UTR
    task: splice-variant-effect
    text: >-
      ACUUCUUUGGGCCUCAUAAACAACCACAGAACCACAAGUUGGGUAGCCUGGCAGUGUCAGAAGUCUGAACCCAGCAUAGUGGUCAGCAGGCAGGACGAAUCACACUGAAUGCAAACCACAGGGUUUCGCAGCGUGGUAAAAGAAAUCAUUGAGUCCCCCGCCUUCAGAAGAGGGUGCAUUUUCAGGAGGAAGCG
  - example_title: fragile X messenger ribonucleoprotein 1
    pipeline_tag: splice-variant-effect
    sequence_type: 5' UTR
    task: splice-variant-effect
    text: >-
      CUCAGUCAGGCGCUCAGCUCCGUUUCGGUUUCACUUCCGGUGGAGGGCCGCCUCUGAGCGGGCGGCGGGCCGACGGCGAGCGCGGGCGGCGGCGGUGACGGAGGCGCCGCUGCCAGGGGGCGUGCGGCAGCGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGAGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCUGGGCCUCGAGCGCCCGCAGCCCACCUCUCGGGGGCGGGCUCCCGGCGCUAGCAGGGCUGAAGAGAAG
  - example_title: MYC proto-oncogene
    pipeline_tag: splice-variant-effect
    sequence_type: 5' UTR
    task: splice-variant-effect
    text: >-
      AACUCGCUGUAGUAAUUCCAGCGAGAGGCAGAGGGAGCGAGCGGGCGGCCGGCUAGGGUGGAAGAGCCGGGCGAGCAGAGCUGCGCUGCGGGCGUCCUGGGAAGGGAGAUCCGGAGCGAAUAGGGGGCUUCGCCUCUGGCCCAGCCCUCCCGCUGAUCCCCCAGCCAGCGGUCCGCAACCCUUGCCGCAUCCACGAAACUUUGCCCAUAGCAGCGGGCGGGCACUUUGCACUGGAACUUACAACACCCGAGCAAGGACGCGACUCUCCCGACGCGGGGAGGCUAUUCUGCCCAUUUGGGGACACUUCCCCGCCGCUGCCAGGACCCGCUUCUCUGAAAGGCUCUCCUUGCAGCUGCUUAGACG
  - example_title: activating transcription factor 4
    pipeline_tag: splice-variant-effect
    sequence_type: 5' UTR
    task: splice-variant-effect
    text: >-
      CAUUUCUACUUUGCCCGCCCACAGAUGUAGUUUUCUCUGCGCGUGUGCGUUUUCCCUCCUCCCCGCCCUCAGGGUCCACGGCCACCAUGGCGUAUUAGGGGCAGCAGUGCCUGCGGCAGCAUUGGCCUUUGCAGCGGCGGCAGCAGCACCAGGCUCUGCAGCGGCAACCCCCAGCGGCUUAAGCCAUGGCGCUUCUCACGGCAUUCAGCAGCAGCGUUGCUGUAACCGACAAAGACACCUUCGAAUUAAGCACAUUCCUCGAUUCCAGCAAAGCACCGCAAC
  - example_title: Human GPI protein p137
    pipeline_tag: splice-variant-effect
    sequence_type: 3' UTR
    task: splice-variant-effect
    text: >-
      UUUUUAAAAGGAAAAGAUACCAAAUGCCUGCUGCUACCACCCUUUUCAAUUGCUAUGUUUUGAAAGGCACCAGUAUGUGUUUUAGAUUGAUUUAAAUGUUUCAUUUAAAUCACGGACAGUAGUUUCAGUUCUGAUGGUAUAAGCAAAACAAAUAAAACGUUUAUAAAAGUUGUAUCUUGAAACACUGGUGUUCAACAGCUAGCAGCUUAUGUGAUUCACCCCAUGCCACGUUAGUGUCACAAAUUUUAUGGUUUAUCUCCAGCAACAUUUCUCUAGUACUUGCACUUAUUAUCUGAAUUC
  - example_title: nucleophosmin 1
    pipeline_tag: splice-variant-effect
    sequence_type: 3' UTR
    task: splice-variant-effect
    text: >-
      GAAAAUAGUUUAAACAAUUUGUUAAAAAAUUUUCCGUCUUAUUUCAUUUCUGUAACAGUUGAUAUCUGGCUGUCCUUUUUAUAAUGCAGAGUGAGAACUUUCCCUACCGUGUUUGAUAAAUGUUGUCCAGGUUCUAUUGCCAAGAAUGUGUUGUCCAAAAUGCCUGUUUAGUUUUUAAAGAUGGAACUCCACCCUUUGCUUGGUUUUAAGUAUGUAUGGAAUGUUAUGAUAGGACAUAGUAGUAGCGGUGGUCAGACAUGGAAAUGGUGGGGAGACAAAAAUAUACAUGUGAAAUAAAACUCAGUAUUUUAAUAAAGUAGCACGGUUUCUAUUGA
  - example_title: superoxide dismutase 1
    pipeline_tag: splice-variant-effect
    sequence_type: 3' UTR
    task: splice-variant-effect
    text: >-
      ACAUUCCCUUGGAUGUAGUCUGAGGCCCCUUAACUCAUCUGUUAUCCUGCUAGCUGUAGAAAUGUAUCCUGAUAAACAUUAAACACUGUAAUCUUAAAAGUGUAAUUGUGUGACUUUUUCAGAGUUGCUUUAAAGUACCUGUAGUGAGAAACUGAUUUAUGAUCACUUGGAAGAUUUGUAUAGUUUUAUAAAACUCAGUUAAAAUGUCUGUUUCAAUGACCUGUAUUUUGCCAGACUUAAAUCACAGAUGGGUAUUAAACUUGUCAGAAUUUCUUUGUCAUUCAAGCCUGUGAAUAAAAACCCUGUAUGGCACUUAUUAUGAGGCUAUUAAAAGAAUCCAAAUUCAAACUAAA
  - example_title: hemoglobin subunit alpha 2
    pipeline_tag: splice-variant-effect
    sequence_type: 3' UTR
    task: splice-variant-effect
    text: >-
      CUGGAGCCUCGGUAGCCGUUCCUCCUGCCCGCUGGGCCUCCCAACGGGCCCUCCUCCCCUCCUUGCACCGGCCCUUCCUGGUCUUUGAAUAAAGUCUGAGUGGGCAGCA
  - example_title: BRAF proto-oncogene
    pipeline_tag: splice-variant-effect
    sequence_type: 3' UTR
    task: splice-variant-effect
    text: >-
      AACAAAUGAGUGAGAGAGUUCAGGAGAGUAGCAACAAAAGGAAAAUAAAUGAACAUAUGUUUGCUUAUAUGUUAAAUUGAAUAAAAUACUCUCUUUUUUUUUAAGGUGAACCAAAGAACACUUGUGUGGUUAAAGACUAGAUAUAAUUUUUCCCCAAACUAAAAUUUAUACUUAACAUUGGAUUUUUAACAUCCAAGGGUUAAAAUACAUAGACAUUGCUAAAAAUUGGCAGAGCCUCUUCUAGAGGCUUUACUUUCUGUUCCGGGUUUGUAUCAUUCACUUGGUUAUUUUAAGUAGUAAACUUCAGUUUCUCAUGCAACUUUUGUUGCCAGCUAUCACAUGUCCACUAGGGACUCCAGAAGAAGACCCUACCUAUGCCUGUGUUUGCAGGUGAGAAGUUGGCAGUCGGUUAGCCUGGG
  - example_title: H3 clustered histone 1
    pipeline_tag: splice-variant-effect
    sequence_type: 3' UTR
    task: splice-variant-effect
    text: UUACUGUGGUCUCUCUGACGGUCCAAGCAAAGGCUCUUUUCAGAGCCACCACCUUUUC

HAL

Hexamer Additive Linear model for predicting alternative splicing from sequence.

Disclaimer

This is an UNOFFICIAL implementation of Learning the Sequence Determinants of Alternative Splicing from Millions of Random Sequences by Alexander B. Rosenberg, et al.

The OFFICIAL repository of HAL is at Alex-Rosenberg/cell-2015.

The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.

The team releasing HAL did not write this model card for this model so this model card has been written by the MultiMolecule team.

Model Details

HAL is a linear (additive) model that scores alternative 5' splice-site usage from normalized hexamer (6-mer) frequencies across a 160-nucleotide donor-region window. It was learned from massively parallel reporter assays measuring splicing of millions of random synthetic sequences. The published coefficient table contains a (4096, 8) matrix of hexamer effects; the model averages the eight coefficient columns into one effect per hexamer and applies those effects to normalized hexamer frequencies.

Model Specification

Window Published Coefficient Columns Hexamer Features Num Parameters (M) FLOPs MACs
160 nt 8 averaged 4,096 0.004 8,192 4,096

Links

Usage

The model file depends on the multimolecule library. You can install it using pip:

pip install multimolecule

Direct Use

Alternative Splicing Prediction

You can use this model directly to predict a splicing score for a 160-nucleotide RNA sequence window:

>>> import torch
>>> from multimolecule import RnaTokenizer, HalForSequencePrediction

>>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/hal")
>>> model = HalForSequencePrediction.from_pretrained("multimolecule/hal")
>>> sequence = "ACGU" * 40
>>> input = tokenizer(sequence, add_special_tokens=False, return_tensors="pt")
>>> output = model(**input)

>>> output.logits.shape
torch.Size([1, 1])

Interface

  • Input length: 160 nt fixed donor-region window
  • Alphabet: ACGU only; any hexamer spanning an unknown / N token is ignored
  • Special tokens: do not add (add_special_tokens=False)
  • Output: single scalar splicing score per window
  • Variant effect: subtract two window scores and apply sigmoid externally for paired donor comparisons

Training Details

HAL was learned from massively parallel splicing reporter assays in which millions of random synthetic sequences were inserted into an alternatively spliced reporter minigene. Splicing outcomes were measured by high-throughput sequencing of the resulting mRNA isoforms.

Training Data

The model was trained on the splicing measurements of millions of degenerate (random) sequences from the reporter library described in the HAL paper. Hexamer coefficients were estimated by regressing the measured splicing index against the hexamer composition of each sequence.

Training Procedure

Pre-training

HAL is a linear regression model. The published hexamer coefficient table is fit to the measured splicing index, and the model prediction is the linear combination of normalized hexamer frequencies with the averaged hexamer effects.

The HAL model uses the published HAL_mer_scores.npz hexamer coefficient table from Rosenberg et al. The table stores 4,096 hexamer rows and eight coefficient columns; the eight columns are averaged into the single per-hexamer effect used by the HAL formula.

Citation

@article{rosenberg2015learning,
  author    = {Rosenberg, Alexander B. and Patwardhan, Rupali P. and Shendure, Jay and Seelig, Georg},
  journal   = {Cell},
  number    = 3,
  pages     = {698--711},
  publisher = {Elsevier BV},
  title     = {Learning the Sequence Determinants of Alternative Splicing from Millions of Random Sequences},
  volume    = 163,
  year      = 2015,
  doi       = {10.1016/j.cell.2015.09.054}
}

The artifacts distributed in this repository are part of the MultiMolecule project. If MultiMolecule supports your research, please cite the MultiMolecule project as follows:

@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}

Contact

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the HAL paper for questions or comments on the paper/model.

License

This model implementation is licensed under the GNU Affero General Public License.

For additional terms and clarifications, please refer to our License FAQ.

SPDX-License-Identifier: AGPL-3.0-or-later