---
datasets:
- multimolecule/oas
library_name: multimolecule
license: agpl-3.0
mask_token: <mask>
pipeline_tag: fill-mask
tags:
- Biology
- Protein
- Antibody
- protein
widget:
- example_title: prion protein (Kanno blood group)
  mask_index: 13
  mask_index_1based: 14
  masked_char: A
  output:
  - label: L
    score: 0.240365
  - label: A
    score: 0.162092
  - label: S
    score: 0.10155
  - label: V
    score: 0.049911
  - label: G
    score: 0.045028
  pipeline_tag: fill-mask
  sequence_type: Protein
  task: fill-mask
  text: MANLGCWMLVLFV<mask>TWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPVILLISFLIFLIVG
- example_title: interleukin 10
  mask_index: 17
  mask_index_1based: 18
  masked_char: A
  output:
  - label: S
    score: 0.239462
  - label: P
    score: 0.119321
  - label: L
    score: 0.05651
  - label: C
    score: 0.053079
  - label: T
    score: 0.047578
  pipeline_tag: fill-mask
  sequence_type: Protein
  task: fill-mask
  text: MHSSALLCCLVLLTGVR<mask>SPGQGTQSENSCTHFPGNLPNMLRDLRDAFSRVKTFFQMKDQLDNLLLKESLLEDFKGYLGCQALSEMIQFYLEEVMPQAENQDPDIKAHVNSLGENLKTLRLRLRRCHRFLPCENKSKAVEQVKNAFNKLQEKGIYKAMSEFDIFINYIEAYMTMKIRN
- example_title: Zaire ebolavirus
  mask_index: 10
  mask_index_1based: 11
  masked_char: A
  output:
  - label: P
    score: 0.299027
  - label: L
    score: 0.081528
  - label: Q
    score: 0.078362
  - label: J
    score: 0.07693
  - label: I
    score: 0.072591
  pipeline_tag: fill-mask
  sequence_type: Protein
  task: fill-mask
  text: NVQTLCEALL<mask>DGLAKAFPSNMMVVTEREQKESLLHQASWHHTSDDFGEHATVRGSSFVTDLEKYNLAFRYEFTAPFIEYCNRCYGVKNVFNWMHYTIPQCY
- example_title: SARS coronavirus
  mask_index: 26
  mask_index_1based: 27
  masked_char: A
  output:
  - label: T
    score: 0.103118
  - label: M
    score: 0.093444
  - label: K
    score: 0.082981
  - label: I
    score: 0.075711
  - label: N
    score: 0.074848
  pipeline_tag: fill-mask
  sequence_type: Protein
  task: fill-mask
  text: MFIFLLFLTLTSGSDLDRCTTFDDVQ<mask>PNYTQHTSSMRGVYYPDEIFRSDTLYLTQDLFLPFYSNVTGFHTINHTFDNPVIPFKDGIYFAATEKSNVVRGWVFGSTMNNKSQSVIIINNSTNVVIRACNFELCDNPFFAVSKPMGTQTHTMIFDNAFKCTFEYIS
- example_title: insulin
  mask_index: 11
  mask_index_1based: 12
  masked_char: A
  output:
  - label: S
    score: 0.207179
  - label: A
    score: 0.130214
  - label: P
    score: 0.089813
  - label: T
    score: 0.076863
  - label: V
    score: 0.058957
  pipeline_tag: fill-mask
  sequence_type: Protein
  task: fill-mask
  text: MALWMRLLPLL<mask>LLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
- example_title: cyclin dependent kinase inhibitor 2A
  mask_index: 12
  mask_index_1based: 13
  masked_char: A
  output:
  - label: L
    score: 0.121965
  - label: W
    score: 0.100387
  - label: G
    score: 0.085488
  - label: T
    score: 0.067139
  - label: R
    score: 0.067001
  pipeline_tag: fill-mask
  sequence_type: Protein
  task: fill-mask
  text: MEPAAGSSMEPS<mask>DWLATAAARGRVEEVRALLEAGALPNAPNSYGRRPIQVMMMGSARVAELLLLHGAEPNCADPATLTRPVHDAAREGFLDTLVVLHRAGARLDVRDAWGRLPVDLAEELGHRDVARYLRAAAGGTRGSNHARIDAAEGPSDIPD
- example_title: human papillomavirus type 16 E6
  mask_index: 52
  mask_index_1based: 53
  masked_char: A
  output:
  - label: T
    score: 0.260283
  - label: S
    score: 0.067951
  - label: G
    score: 0.057361
  - label: K
    score: 0.047576
  - label: P
    score: 0.04267
  pipeline_tag: fill-mask
  sequence_type: Protein
  task: fill-mask
  text: MHQKRTAMFQDPQERPRKLPQLCTELQTTIHDIILECVYCKQQLLRREVYDF<mask>FRDLCIVYRDGNPYAVCDKCLKFYSKISEYRHYCYSVYGTTLEQQYNKPLCDLLIRCINCQKPLCPEEKQRHLDKKQRFHNIRGRWTGRCMSCCRSSRTRRETQL
---

# AbLang2

A model pre-trained on paired and unpaired antibody sequences with a modified masked language modeling (MLM) objective.

## Disclaimer

This is an UNOFFICIAL implementation of [Addressing the antibody germline bias and its effect on language models for improved antibody design](https://doi.org/10.1093/bioinformatics/btae618) by Tobias H. Olsen, et al.

The OFFICIAL repository of AbLang2 is at [oxpig/AbLang2](https://github.com/oxpig/AbLang2).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

**The team releasing AbLang2 did not write a model card for this model, so this model card was written by the MultiMolecule team.**

## Model Details

AbLang2 is an antibody-specific encoder-only protein language model trained to reduce antibody germline bias in masked residue prediction. It uses multi-head self-attention with rotary position embeddings and SwiGLU feed-forward blocks. The released paired model is trained on paired and unpaired antibody sequence data and is optimized for non-germline residue prediction.

### Model Specification

| Num Layers | Hidden Size | Num Heads | Intermediate Size | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
| ---------- | ----------- | --------- | ----------------- | ------------------ | --------- | -------- | -------------- |
| 12         | 480         | 20        | 1920              | 44.82              | 24.48     | 12.20    | 256            |

> [!NOTE]
> `Max Num Tokens` reflects the training sequence length of the released checkpoint. AbLang2 uses rotary position
> embeddings and has no `max_position_embeddings` field, so the architecture itself does not impose a hard length limit.
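As a rough sanity check on the specification table, the sketch below estimates the weight-matrix count from the listed sizes. It assumes four `hidden x hidden` attention projections and a three-matrix SwiGLU feed-forward block per layer, and it ignores biases, normalization layers, embeddings, and prediction heads, so it is expected to land slightly below the reported 44.82 M. The constant names are illustrative, not part of the `multimolecule` API.

```python
# Values copied from the Model Specification table.
NUM_LAYERS = 12
HIDDEN_SIZE = 480
NUM_HEADS = 20
INTERMEDIATE_SIZE = 1920

# Each attention head works on an equal slice of the hidden dimension.
head_dim = HIDDEN_SIZE // NUM_HEADS

# Assumption: query, key, value, and output projections, each hidden x hidden.
attention_weights = 4 * HIDDEN_SIZE * HIDDEN_SIZE

# Assumption: SwiGLU uses gate and value projections (hidden -> intermediate)
# plus one output projection (intermediate -> hidden).
ffn_weights = 3 * HIDDEN_SIZE * INTERMEDIATE_SIZE

per_layer = attention_weights + ffn_weights
total = NUM_LAYERS * per_layer

print(head_dim)               # 24
print(per_layer)              # 3686400
print(f"{total / 1e6:.2f}")   # 44.24
```

The estimate of about 44.24 M weights is consistent with the 44.82 M in the table once the ignored components are taken into account.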

### Links

- **Code**: [multimolecule.ablang2](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/ablang2)
- **Data**: [Observed Antibody Space](https://opig.stats.ox.ac.uk/webapps/oas/)
- **Paper**: [Addressing the antibody germline bias and its effect on language models for improved antibody design](https://doi.org/10.1093/bioinformatics/btae618)
- **Developed by**: Tobias H. Olsen, Iain H. Moal, Charlotte M. Deane
- **Model type**: Encoder-only antibody language model with rotary position embeddings and SwiGLU feed-forward blocks
- **Original Repository**: [oxpig/AbLang2](https://github.com/oxpig/AbLang2)

## Usage

This model depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

#### Masked Language Modeling

You can use this model directly with a pipeline for masked language modeling:

```python
import multimolecule  # you must import multimolecule to register models
from transformers import pipeline

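predictor = pipeline("fill-mask", model="multimolecule/ablang2")
# The pipeline returns a list of candidates, each a dict with keys such as `token_str` and `score`.
output = predictor("EVQLVESGGGLVQPGGSLRLSCAAS<mask>FTFSSYAMSWVRQAPGKGLEWV")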
```
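To build a masked input from a complete sequence, a small helper along these lines may be convenient. It is a minimal sketch: `mask_residue` is a hypothetical helper (not part of `multimolecule`), and the index is 0-based, like the `mask_index` field in this card's widget examples.

```python
from typing import Tuple

MASK_TOKEN = "<mask>"  # matches `mask_token` in this card's metadata


def mask_residue(sequence: str, index: int, mask_token: str = MASK_TOKEN) -> Tuple[str, str]:
    """Replace the residue at a 0-based index with the mask token.

    Returns the masked sequence and the residue that was hidden.
    """
    if not 0 <= index < len(sequence):
        raise IndexError(f"index {index} is outside a sequence of length {len(sequence)}")
    masked = sequence[:index] + mask_token + sequence[index + 1:]
    return masked, sequence[index]


heavy = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWV"
masked, hidden = mask_residue(heavy, 25)
print(masked)  # EVQLVESGGGLVQPGGSLRLSCAAS<mask>FTFSSYAMSWVRQAPGKGLEWV
print(hidden)  # G
```

The masked string can be passed to the `predictor` above, and the hidden residue compared against the `token_str` of the top-scoring candidates.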

### Downstream Use

#### Extract Features

Here is how to use this model to get the features of a given antibody sequence in PyTorch:

```python
from multimolecule import ProteinTokenizer, AbLang2Model


tokenizer = ProteinTokenizer.from_pretrained("multimolecule/ablang2")
model = AbLang2Model.from_pretrained("multimolecule/ablang2")

text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWV"
# Tokenize the sequence into PyTorch tensors.
inputs = tokenizer(text, return_tensors="pt")

# Forward pass through the encoder.
output = model(**inputs)
```
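If a single vector per sequence is needed, one common option is to average the per-token features while ignoring padding. The sketch below shows that pooling step on synthetic tensors. It assumes the model output exposes per-token features as `last_hidden_state` and that the tokenizer returns an `attention_mask`, as is usual for `transformers`-style models; `masked_mean_pool` is a hypothetical helper, not part of `multimolecule`.

```python
import torch


def masked_mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token features over positions where attention_mask is 1.

    hidden: (batch, tokens, features); attention_mask: (batch, tokens).
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (batch, tokens, 1)
    summed = (hidden * mask).sum(dim=1)                   # (batch, features)
    counts = mask.sum(dim=1).clamp(min=1.0)               # avoid division by zero
    return summed / counts


# Synthetic stand-ins for model features and the tokenizer's attention mask.
demo_hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
demo_mask = torch.tensor([[1, 1, 0]])  # the last position is padding
demo_pooled = masked_mean_pool(demo_hidden, demo_mask)
print(demo_pooled)  # tensor([[2., 3.]])
```

Under the assumptions above, the same function could be applied to the features and attention mask from the example. Any special tokens the tokenizer adds are included in the average unless they are masked out first.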

#### Sequence Classification / Regression

> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:

```python
import torch
from multimolecule import ProteinTokenizer, AbLang2ForSequencePrediction


tokenizer = ProteinTokenizer.from_pretrained("multimolecule/ablang2")
model = AbLang2ForSequencePrediction.from_pretrained("multimolecule/ablang2")

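text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWV"
# Tokenize the sequence into PyTorch tensors.
inputs = tokenizer(text, return_tensors="pt")
# One label per sequence; shape (batch,).
label = torch.tensor([1])

# Passing labels makes the model compute a loss alongside its predictions.
output = model(**inputs, labels=label)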
```

#### Token Classification / Regression

> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for token classification or regression.

Here is how to use this model as a backbone to fine-tune for a residue-level task in PyTorch:

```python
import torch
from multimolecule import ProteinTokenizer, AbLang2ForTokenPrediction


tokenizer = ProteinTokenizer.from_pretrained("multimolecule/ablang2")
model = AbLang2ForTokenPrediction.from_pretrained("multimolecule/ablang2")

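text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWV"
# Tokenize the sequence into PyTorch tensors.
inputs = tokenizer(text, return_tensors="pt")
# Random binary placeholder labels, one per residue; shape (batch, len(text)).
label = torch.randint(2, (1, len(text)))

# Passing labels makes the model compute a loss alongside its predictions.
output = model(**inputs, labels=label)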
```
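The example above uses random labels as placeholders. In practice, residue-level labels often come from annotated regions. The sketch below turns half-open `[start, end)` spans into one binary label per residue. `spans_to_labels` is a hypothetical helper and the span is made up for illustration; it is not a real annotation of this sequence.

```python
from typing import List, Tuple


def spans_to_labels(length: int, spans: List[Tuple[int, int]]) -> List[int]:
    """Mark residues inside any half-open [start, end) span with 1, all others with 0."""
    labels = [0] * length
    for start, end in spans:
        if not 0 <= start <= end <= length:
            raise ValueError(f"span ({start}, {end}) does not fit a sequence of length {length}")
        for position in range(start, end):
            labels[position] = 1
    return labels


text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWV"
example_spans = [(25, 33)]  # hypothetical region, for illustration only
residue_labels = spans_to_labels(len(text), example_spans)
print(sum(residue_labels))  # 8
```

Wrapping the result as `torch.tensor([residue_labels])` gives a tensor of shape `(1, len(text))`, the same shape as the random labels in the example above.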

#### Contact Classification / Regression

> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:

```python
import torch
from multimolecule import ProteinTokenizer, AbLang2ForContactPrediction


tokenizer = ProteinTokenizer.from_pretrained("multimolecule/ablang2")
model = AbLang2ForContactPrediction.from_pretrained("multimolecule/ablang2")

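text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWV"
# Tokenize the sequence into PyTorch tensors.
inputs = tokenizer(text, return_tensors="pt")
# Random binary placeholder labels, one per residue pair; shape (batch, len(text), len(text)).
label = torch.randint(2, (1, len(text), len(text)))

# Passing labels makes the model compute a loss alongside its predictions.
output = model(**inputs, labels=label)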
```
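The random labels above are placeholders. A real contact map is symmetric: if residue `i` contacts residue `j`, then `j` contacts `i`. The sketch below builds such a map from a list of residue pairs. `pairs_to_contact_map` is a hypothetical helper and the pairs are invented for illustration.

```python
from typing import List, Tuple


def pairs_to_contact_map(length: int, pairs: List[Tuple[int, int]]) -> List[List[int]]:
    """Build a symmetric length x length binary matrix from 0-based residue pairs."""
    contact_map = [[0] * length for _ in range(length)]
    for i, j in pairs:
        if not (0 <= i < length and 0 <= j < length):
            raise ValueError(f"pair ({i}, {j}) does not fit a sequence of length {length}")
        contact_map[i][j] = 1
        contact_map[j][i] = 1  # contacts are undirected
    return contact_map


example_pairs = [(0, 3), (1, 4)]  # hypothetical contacts, for illustration only
contact_map = pairs_to_contact_map(5, example_pairs)
for row in contact_map:
    print(row)
```

Wrapping the result as `torch.tensor([contact_map])` gives a tensor of shape `(1, length, length)`, the same layout as the random labels in the example above when `length` equals `len(text)`.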

## Training Details

AbLang2 was trained with masked language modeling as the pre-training objective. The model is bidirectional, so each masked position attends to surrounding residues on both sides.

### Training Data

The final released AbLang2 model was trained on sequences derived from the Observed Antibody Space (OAS), including 35.6 million unpaired heavy/light-chain sequences and 1.26 million paired antibody sequences.

### Training Procedure

The AbLang2 paper focuses on reducing antibody germline bias in residue prediction and model-guided antibody design.
Please refer to the original paper for details on the training setup.

## Citation

```bibtex
@article{olsen2024ablang2,
  title   = {Addressing the antibody germline bias and its effect on language models for improved antibody design},
  author  = {Olsen, Tobias H. and Moal, Iain H. and Deane, Charlotte M.},
  year    = {2024},
  journal = {Bioinformatics},
  volume  = {40},
  number  = {11},
  pages   = {btae618},
  doi     = {10.1093/bioinformatics/btae618},
  url     = {https://doi.org/10.1093/bioinformatics/btae618},
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If MultiMolecule supports your research, please cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use the [MultiMolecule GitHub issues](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [AbLang2 paper](https://doi.org/10.1093/bioinformatics/btae618) for questions or comments on the paper/model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```