---
datasets:
  - multimolecule/oas
library_name: multimolecule
license: agpl-3.0
mask_token: "<mask>"
pipeline_tag: fill-mask
tags:
  - Biology
  - Protein
  - Antibody
  - protein
widget:
  - example_title: prion protein (Kanno blood group)
    mask_index: 13
    mask_index_1based: 14
    masked_char: A
    output:
      - label: L
        score: 0.240365
      - label: A
        score: 0.162092
      - label: S
        score: 0.10155
      - label: V
        score: 0.049911
      - label: G
        score: 0.045028
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MANLGCWMLVLFV<mask>TWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPVILLISFLIFLIVG
  - example_title: interleukin 10
    mask_index: 17
    mask_index_1based: 18
    masked_char: A
    output:
      - label: S
        score: 0.239462
      - label: P
        score: 0.119321
      - label: L
        score: 0.05651
      - label: C
        score: 0.053079
      - label: T
        score: 0.047578
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MHSSALLCCLVLLTGVR<mask>SPGQGTQSENSCTHFPGNLPNMLRDLRDAFSRVKTFFQMKDQLDNLLLKESLLEDFKGYLGCQALSEMIQFYLEEVMPQAENQDPDIKAHVNSLGENLKTLRLRLRRCHRFLPCENKSKAVEQVKNAFNKLQEKGIYKAMSEFDIFINYIEAYMTMKIRN
  - example_title: Zaire ebolavirus
    mask_index: 10
    mask_index_1based: 11
    masked_char: A
    output:
      - label: P
        score: 0.299027
      - label: L
        score: 0.081528
      - label: Q
        score: 0.078362
      - label: J
        score: 0.07693
      - label: I
        score: 0.072591
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: NVQTLCEALL<mask>DGLAKAFPSNMMVVTEREQKESLLHQASWHHTSDDFGEHATVRGSSFVTDLEKYNLAFRYEFTAPFIEYCNRCYGVKNVFNWMHYTIPQCY
  - example_title: SARS coronavirus
    mask_index: 26
    mask_index_1based: 27
    masked_char: A
    output:
      - label: T
        score: 0.103118
      - label: M
        score: 0.093444
      - label: K
        score: 0.082981
      - label: I
        score: 0.075711
      - label: N
        score: 0.074848
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MFIFLLFLTLTSGSDLDRCTTFDDVQ<mask>PNYTQHTSSMRGVYYPDEIFRSDTLYLTQDLFLPFYSNVTGFHTINHTFDNPVIPFKDGIYFAATEKSNVVRGWVFGSTMNNKSQSVIIINNSTNVVIRACNFELCDNPFFAVSKPMGTQTHTMIFDNAFKCTFEYIS
  - example_title: insulin
    mask_index: 11
    mask_index_1based: 12
    masked_char: A
    output:
      - label: S
        score: 0.207179
      - label: A
        score: 0.130214
      - label: P
        score: 0.089813
      - label: T
        score: 0.076863
      - label: V
        score: 0.058957
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MALWMRLLPLL<mask>LLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
  - example_title: cyclin dependent kinase inhibitor 2A
    mask_index: 12
    mask_index_1based: 13
    masked_char: A
    output:
      - label: L
        score: 0.121965
      - label: W
        score: 0.100387
      - label: G
        score: 0.085488
      - label: T
        score: 0.067139
      - label: R
        score: 0.067001
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MEPAAGSSMEPS<mask>DWLATAAARGRVEEVRALLEAGALPNAPNSYGRRPIQVMMMGSARVAELLLLHGAEPNCADPATLTRPVHDAAREGFLDTLVVLHRAGARLDVRDAWGRLPVDLAEELGHRDVARYLRAAAGGTRGSNHARIDAAEGPSDIPD
  - example_title: human papillomavirus type 16 E6
    mask_index: 52
    mask_index_1based: 53
    masked_char: A
    output:
      - label: T
        score: 0.260283
      - label: S
        score: 0.067951
      - label: G
        score: 0.057361
      - label: K
        score: 0.047576
      - label: P
        score: 0.04267
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MHQKRTAMFQDPQERPRKLPQLCTELQTTIHDIILECVYCKQQLLRREVYDF<mask>FRDLCIVYRDGNPYAVCDKCLKFYSKISEYRHYCYSVYGTTLEQQYNKPLCDLLIRCINCQKPLCPEEKQRHLDKKQRFHNIRGRWTGRCMSCCRSSRTRRETQL
---

# AbLang2

Pre-trained model on paired and unpaired antibody sequences using a modified masked language modeling objective.

## Disclaimer

This is an UNOFFICIAL implementation of [Addressing the antibody germline bias and its effect on language models for improved antibody design](https://doi.org/10.1093/bioinformatics/btae618) by Tobias H. Olsen, et al.

The OFFICIAL repository of AbLang2 is at [oxpig/AbLang2](https://github.com/oxpig/AbLang2).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

**The team releasing AbLang2 did not write a model card for this model, so this model card has been written by the MultiMolecule team.**

## Model Details

AbLang2 is an antibody-specific encoder-only protein language model trained to reduce antibody germline bias in masked residue prediction.
It uses multi-head self-attention with rotary position embeddings and SwiGLU feed-forward blocks.
The released paired model is trained on paired and unpaired antibody sequence data and is optimized for non-germline residue prediction.

### Model Specification

| Num Layers | Hidden Size | Num Heads | Intermediate Size | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
| ---------- | ----------- | --------- | ----------------- | ------------------ | --------- | -------- | -------------- |
| 12         | 480         | 20        | 1920              | 44.82              | 24.48     | 12.20    | 256            |

> [!NOTE]
> `Max Num Tokens` reflects the training sequence length of the released checkpoint. AbLang2 uses rotary position
> embeddings and has no `max_position_embeddings` field, so the architecture itself does not impose a hard length limit.

### Links

- **Code**: [multimolecule.ablang2](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/ablang2)
- **Data**: [Observed Antibody Space](https://opig.stats.ox.ac.uk/webapps/oas/)
- **Paper**: [Addressing the antibody germline bias and its effect on language models for improved antibody design](https://doi.org/10.1093/bioinformatics/btae618)
- **Developed by**: Tobias H. Olsen, Iain H. Moal, Charlotte M. Deane
- **Model type**: Encoder-only antibody language model with rotary position embeddings and SwiGLU feed-forward blocks
- **Original Repository**: [oxpig/AbLang2](https://github.com/oxpig/AbLang2)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

#### Masked Language Modeling

You can use this model directly with a pipeline for masked language modeling:

```python
import multimolecule  # you must import multimolecule to register models
from transformers import pipeline

predictor = pipeline("fill-mask", model="multimolecule/ablang2")
# The <mask> token marks the residue to predict.
output = predictor("EVQLVESGGGLVQPGGSLRLSCAAS<mask>FTFSSYAMSWVRQAPGKGLEWV")
```

### Downstream Use

#### Extract Features

Here is how to use this model to get the features of a given antibody sequence in PyTorch:

```python
from multimolecule import ProteinTokenizer, AbLang2Model

tokenizer = ProteinTokenizer.from_pretrained("multimolecule/ablang2")
model = AbLang2Model.from_pretrained("multimolecule/ablang2")

text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWV"
input = tokenizer(text, return_tensors="pt")

output = model(**input)
```

#### Sequence Classification / Regression

> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

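A sequence-level task needs one vector per sequence, while the encoder returns one vector per token. One common reduction is mean pooling over non-padding positions; the sketch below shows that arithmetic on plain Python lists. It is an illustration only: `masked_mean_pool` is a hypothetical helper, not part of `multimolecule`, and this card does not state how `AbLang2ForSequencePrediction` pools its features.

```python
def masked_mean_pool(features, attention_mask):
    """Average token vectors at positions where attention_mask is 1.

    features: list of token vectors (equal-length lists of floats)
    attention_mask: list of 0/1 flags, one per token (0 marks padding)
    """
    if len(features) != len(attention_mask):
        raise ValueError("features and attention_mask must have the same length")
    # Keep only the vectors of real (non-padding) tokens.
    kept = [vec for vec, keep in zip(features, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    # Average each feature dimension over the kept tokens.
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Three tokens with 2-dimensional features; the last token is padding.
features = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
pooled = masked_mean_pool(features, [1, 1, 0])
```
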
Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:

```python
import torch
from multimolecule import ProteinTokenizer, AbLang2ForSequencePrediction

tokenizer = ProteinTokenizer.from_pretrained("multimolecule/ablang2")
model = AbLang2ForSequencePrediction.from_pretrained("multimolecule/ablang2")

text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWV"
input = tokenizer(text, return_tensors="pt")
label = torch.tensor([1])  # one label per sequence

output = model(**input, labels=label)
```

#### Token Classification / Regression

> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for token classification or regression.

Here is how to use this model as a backbone to fine-tune for a residue-level task in PyTorch:

```python
import torch
from multimolecule import ProteinTokenizer, AbLang2ForTokenPrediction

tokenizer = ProteinTokenizer.from_pretrained("multimolecule/ablang2")
model = AbLang2ForTokenPrediction.from_pretrained("multimolecule/ablang2")

text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWV"
input = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (1, len(text)))  # random placeholder: one label per residue

output = model(**input, labels=label)
```

#### Contact Classification / Regression

> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

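The contact-level example in this section uses random labels. In practice, a contact label is typically a symmetric binary matrix with one row and one column per residue. A minimal sketch, assuming contacts are supplied as 0-based residue index pairs; the `contact_map` helper and the example pairs are hypothetical and not part of `multimolecule`:

```python
def contact_map(length, pairs):
    """Build a symmetric binary contact matrix as nested lists.

    length: number of residues in the sequence
    pairs: iterable of 0-based (i, j) residue indices that are in contact
    """
    matrix = [[0] * length for _ in range(length)]
    for i, j in pairs:
        if not (0 <= i < length and 0 <= j < length):
            raise IndexError(f"pair ({i}, {j}) is outside a sequence of length {length}")
        matrix[i][j] = 1
        matrix[j][i] = 1  # contacts are undirected, so mirror the entry
    return matrix


text = "EVQLVESGGG"  # short illustrative fragment, not a full antibody chain
labels = contact_map(len(text), [(0, 5), (2, 9)])
# Wrapping this as torch.tensor([labels]) would give shape
# (1, len(text), len(text)), the label shape used for contact-level fine-tuning.
```
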
Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:

```python
import torch
from multimolecule import ProteinTokenizer, AbLang2ForContactPrediction

tokenizer = ProteinTokenizer.from_pretrained("multimolecule/ablang2")
model = AbLang2ForContactPrediction.from_pretrained("multimolecule/ablang2")

text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWV"
input = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (1, len(text), len(text)))  # random placeholder: one label per residue pair

output = model(**input, labels=label)
```

## Training Details

AbLang2 was trained with masked language modeling as the pre-training objective. The model is bidirectional, so each masked position attends to surrounding residues on both sides.

### Training Data

AbLang2 is trained on sequences derived from the Observed Antibody Space (OAS), including 35.6 million unpaired heavy/light-chain sequences and 1.26 million paired antibody sequences for the final released model.

### Training Procedure

The AbLang2 paper focuses on reducing antibody germline bias in residue prediction and model-guided antibody design. Please refer to the original paper for details on the training setup.

## Citation

```bibtex
@article{olsen2024ablang2,
  title   = {Addressing the antibody germline bias and its effect on language models for improved antibody design},
  author  = {Olsen, Tobias H. and Moal, Iain H. and Deane, Charlotte M.},
  year    = {2024},
  journal = {Bioinformatics},
  volume  = {40},
  number  = {11},
  pages   = {btae618},
  doi     = {10.1093/bioinformatics/btae618},
  url     = {https://doi.org/10.1093/bioinformatics/btae618},
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If MultiMolecule supports your research, please cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [AbLang2 paper](https://doi.org/10.1093/bioinformatics/btae618) for questions or comments on the paper/model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```