Commit 2e6bb66 (verified) by dipayan26 · 1 parent: 409035c

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +165 -0
  2. config.json +32 -0
  3. model.safetensors +3 -0
  4. tokenizer_config.json +53 -0
  5. vocab.txt +33 -0
README.md ADDED
@@ -0,0 +1,165 @@
+ ---
+ language:
+ - en
+ license: mit
+ tags:
+ - biology
+ - protein
+ - esm2
+ - plant
+ - viridiplantae
+ - masked-language-modeling
+ - domain-adaptation
+ base_model: facebook/esm2_t6_8M_UR50D
+ datasets:
+ - uniprot-trembl-viridiplantae
+ pipeline_tag: fill-mask
+ ---
+
+ # PlantPLM-8M
+
+ **ESM-2 (8M parameters) with continued pretraining on 19.9 million Viridiplantae (plant) protein sequences.**
+
+ This is a domain-adapted version of [`facebook/esm2_t6_8M_UR50D`](https://huggingface.co/facebook/esm2_t6_8M_UR50D), further pretrained with the masked-language-modeling objective on a taxonomy-filtered subset of UniProt TrEMBL that contains only plant-kingdom proteins. On the linear-probe tasks reported in this card, the adapted model scores higher than the general-purpose ESM-2 baseline.
+
+ Part of the **[Plant-Protein-BERT collection](https://huggingface.co/collections/dipayan26/plant-protein-bert)**, which contains ESM-2 models at 8M, 35M, 150M, and 650M parameters, each adapted on the same plant protein corpus.
+
+ ---
+
+ ## Model Description
+
+ | Property | Value |
+ |---|---|
+ | Base model | `facebook/esm2_t6_8M_UR50D` |
+ | Architecture | ESM-2 · 6 layers · hidden=320 · heads=20 · FFN=1280 |
+ | Position embeddings | Rotary (RoPE) |
+ | Vocabulary | 33 tokens (20 standard amino acids, 5 rare or ambiguous codes `X B U Z O`, gap symbols `.` and `-`, and 6 special tokens) |
+ | Parameters | ≈7.5M, all updated during continued pretraining |
+ | Training objective | Masked Language Modeling (MLM, 15% masking) |
+
+ ---
+
+ ## Training Data
+
+ | Property | Value |
+ |---|---|
+ | Source | UniProt TrEMBL, Viridiplantae (plant kingdom) subset |
+ | Taxonomy filter | Viridiplantae only, selected by walking the NCBI TaxID tree; this removes oomycetes and dinoflagellates that UniProt's keyword-based plant subset misclassifies as plants |
+ | Sequences | **19,938,415** protein sequences |
+ | Sequence length | mean 339 AA · median 291 AA |
+ | Estimated total tokens | **~6.76 billion** amino acid tokens |
+ | Tokens seen during training | **800 million** (≈ 0.12 passes over the full dataset) |
+
+ ---
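
Note on the Training Data table: the corpus size and the "0.12 passes" figure can be cross-checked with a few lines of arithmetic. This is a minimal sketch, assuming one token per amino acid and ignoring the `<cls>`/`<eos>` tokens added to each sequence. The variable names are illustrative and do not come from the training code.

```python
# Figures copied from the Training Data table; names are illustrative only.
num_sequences = 19_938_415
mean_length_aa = 339
tokens_seen = 800_000_000

# Assumption: one token per residue, special tokens not counted.
total_tokens = num_sequences * mean_length_aa
passes = tokens_seen / total_tokens

print(f"estimated corpus size: {total_tokens / 1e9:.2f}B tokens")
print(f"fraction of the corpus seen: {passes:.2f} passes")
```

Under these assumptions the estimate agrees with the reported ~6.76 billion tokens and ≈ 0.12 passes.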
+
+ ## Training Details
+
+ | Hyperparameter | Value |
+ |---|---|
+ | Token budget | 800M tokens (training stopped when the budget was reached, not at the end of an epoch) |
+ | Steps completed | 41,036 of 55,000 max |
+ | Batch size | 64 sequences |
+ | Max sequence length | 514 tokens (512 AA + `<cls>` + `<eos>`) |
+ | Optimizer | AdamW · β=(0.9, 0.98) · ε=1e-8 · weight_decay=0.01 |
+ | Learning rate | 2e-5 (20× lower than the 4e-4 used for ESM-2 from-scratch pretraining, to limit catastrophic forgetting) |
+ | LR schedule | Linear warmup (500 steps) → linear decay |
+ | Gradient clipping | 1.0 |
+ | Precision | bf16 mixed precision (bf16 activations, fp32 optimizer states) |
+ | Hardware | NVIDIA RTX 3060 12 GB |
+ | Training time | ~14.9 hours |
+
+ **Final metrics (validation set, 5% holdout):**
+
+ | Metric | Value |
+ |---|---|
+ | `val/mlm_loss` | 2.292 |
+ | `val/perplexity` | 9.92 |
+ | `val/masked_token_acc` | 31.0% |
+
+ ---
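
Note on the Training Details tables: perplexity is normally the exponential of the mean cross-entropy loss, and the step count together with the batch size implies an average number of tokens per sequence. The sketch below works through both. It assumes the loss is reported in nats and that no gradient accumulation was used (the README states neither), and the variable names are illustrative.

```python
import math

# Figures copied from the tables above; names are illustrative only.
val_mlm_loss = 2.292          # reported val/mlm_loss (assumed to be in nats)
tokens_seen = 800_000_000
steps = 41_036
batch_size = 64               # sequences per step; assumes no gradient accumulation

# Perplexity implied by the mean loss.
perplexity = math.exp(val_mlm_loss)

# Reference point: loss of a uniform guess over the 20 standard amino acids.
uniform_loss = math.log(20)

# Average tokens consumed per step and per sequence.
tokens_per_step = tokens_seen / steps
tokens_per_sequence = tokens_per_step / batch_size

print(f"perplexity from loss: {perplexity:.2f}")
print(f"uniform-guess loss:   {uniform_loss:.3f}")
print(f"tokens per sequence:  {tokens_per_sequence:.1f}")
```

The implied perplexity (about 9.89) is close to the reported 9.92. One possible reason for the small gap is that the logged perplexity was averaged per batch rather than computed from the mean loss, but the README does not say. The implied average of roughly 305 tokens per sequence is below the corpus mean of 339 AA, which is what truncation at 512 residues would be expected to produce.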
+
+ ## Downstream Task Performance (Linear Probe)
+
+ Frozen `<cls>` embeddings were evaluated on 2,000 reviewed *Arabidopsis thaliana* proteins from UniProtKB/Swiss-Prot using a logistic-regression linear probe, and compared against the vanilla `facebook/esm2_t6_8M_UR50D` baseline.
+
+ | Task | Vanilla ESM-2 8M | PlantPLM-8M | Δ (percentage points) |
+ |---|---|---|---|
+ | Subcellular localization (9-class accuracy) | 91.6% | **93.7%** | +2.1 |
+ | GO-term prediction (macro-AUROC, top-50 terms) | 94.7% | **95.0%** | +0.3 |
+
+ ---
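
Note on the linear-probe table: because both baselines are already above 90%, the absolute gains look small. A complementary view is the relative reduction of the remaining error (100% minus the score). The helper below is a hypothetical illustration of that calculation, not part of the evaluation code.

```python
def error_reduction(baseline_pct, adapted_pct):
    """Return (gain in percentage points, relative reduction of the remaining error)."""
    gain_pp = adapted_pct - baseline_pct
    baseline_error = 100.0 - baseline_pct
    return gain_pp, gain_pp / baseline_error


# Scores copied from the table above: (vanilla ESM-2 8M, PlantPLM-8M).
results = {
    "Subcellular localization (accuracy)": (91.6, 93.7),
    "GO-term prediction (macro-AUROC)": (94.7, 95.0),
}

for task, (baseline, adapted) in results.items():
    gain_pp, relative = error_reduction(baseline, adapted)
    print(f"{task}: +{gain_pp:.1f} pp, {relative:.1%} of the remaining error removed")
```

Read this way, the localization gain removes about a quarter of the baseline's errors, while the GO-term gain removes under 6% of the remaining AUROC gap.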
+
+ ## Usage
+
+ ```python
+ from transformers import EsmForMaskedLM, EsmTokenizer
+ import torch
+
+ model = EsmForMaskedLM.from_pretrained("dipayan26/PlantPLM-8M")
+ tokenizer = EsmTokenizer.from_pretrained("dipayan26/PlantPLM-8M")
+
+ # --- Masked token prediction ---
+ sequence = "MSPQTETKASVGFKAGVKDYKLTYYTPEYETK"
+ inputs = tokenizer(sequence, return_tensors="pt")
+
+ # Mask one position. Index 0 is <cls>, so index 5 is the fifth residue (T).
+ inputs["input_ids"][0, 5] = tokenizer.mask_token_id
+
+ with torch.no_grad():
+     logits = model(**inputs).logits
+
+ masked_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
+ top5 = logits[0, masked_pos].topk(5)
+ print(tokenizer.convert_ids_to_tokens(top5.indices.tolist()))
+
+ # --- Sequence embedding (<cls> token) ---
+ inputs = tokenizer(sequence, return_tensors="pt")
+ with torch.no_grad():
+     hidden = model.esm(**inputs).last_hidden_state
+ cls_embedding = hidden[0, 0, :]  # shape: [320]
+ print("Embedding shape:", cls_embedding.shape)
+ ```
+
+ ---
+
+ ## Intended Use
+
+ - **Plant protein function prediction**: GO term annotation, subcellular localization, signal peptide detection
+ - **Plant-specific protein embeddings**: clustering, retrieval, similarity search
+ - **Transfer learning starting point**: fine-tune on small labeled plant protein datasets
+ - **Baseline comparison**: benchmark against the larger PlantPLM-35M / 150M / 650M variants
+
+ ## Out-of-scope Use
+
+ - Non-plant organisms: the model has been shifted toward Viridiplantae statistics; use the original `facebook/esm2_t6_8M_UR50D` for general protein tasks
+ - Structure prediction: the model was not trained to predict structure; use ESMFold instead
+
+ ---
+
+ ## Limitations
+
+ - Trained for only 0.12 passes over the plant corpus (800M / 6.76B tokens); larger models in this collection see more of the data
+ - The 8M-parameter capacity limits representation richness; the 35M and 150M variants are recommended for downstream fine-tuning
+ - The taxonomy filter removes ~15.7% contamination from UniProt's keyword-based plant subset, but a small fraction of misclassified non-plant sequences may remain in TrEMBL
+
+ ---
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{sarkar2026plantplm,
+   author = {Sarkar, Dipayan},
+   title = {PlantPLM: Domain-Adaptive Pretraining of ESM-2 on Viridiplantae Proteins},
+   year = {2026},
+   publisher = {Hugging Face},
+   howpublished = {\url{https://huggingface.co/dipayan26/PlantPLM-8M}},
+ }
+ ```
+
+ ---
+
+ ## Training Code
+
+ [github.com/Dipayan26/Plant-Protein-BERT](https://github.com/Dipayan26/Plant-Protein-BERT)
config.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "add_cross_attention": false,
+   "architectures": [
+     "EsmForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.0,
+   "classifier_dropout": null,
+   "dtype": "float32",
+   "emb_layer_norm_before": false,
+   "esmfold_config": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.0,
+   "hidden_size": 320,
+   "initializer_range": 0.02,
+   "intermediate_size": 1280,
+   "is_decoder": false,
+   "is_folding_model": false,
+   "layer_norm_eps": 1e-05,
+   "mask_token_id": 32,
+   "max_position_embeddings": 1026,
+   "model_type": "esm",
+   "num_attention_heads": 20,
+   "num_hidden_layers": 6,
+   "pad_token_id": 1,
+   "position_embedding_type": "rotary",
+   "tie_word_embeddings": true,
+   "token_dropout": true,
+   "transformers_version": "5.1.0",
+   "use_cache": true,
+   "vocab_list": null,
+   "vocab_size": 33
+ }
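
Note on config.json: the README's "7.5M parameters" can be roughly reproduced from these fields. The sketch below is a back-of-the-envelope estimate, assuming a standard transformer encoder layer (four attention projections and a two-layer feed-forward block, all with biases, plus two LayerNorms) and tied input/output embeddings. It deliberately omits the final layer norm and the language-model head, so it slightly undercounts. The function names are hypothetical.

```python
# Values copied from config.json above.
config = {
    "hidden_size": 320,
    "intermediate_size": 1280,
    "num_attention_heads": 20,
    "num_hidden_layers": 6,
    "vocab_size": 33,
}


def linear(n_in, n_out):
    """Parameter count of a dense layer with bias."""
    return n_in * n_out + n_out


def estimate_parameters(cfg):
    """Rough encoder parameter count; omits the final layer norm and LM head."""
    h, f = cfg["hidden_size"], cfg["intermediate_size"]
    attention = 4 * linear(h, h)                 # query, key, value, output projections
    feed_forward = linear(h, f) + linear(f, h)
    layer_norms = 2 * 2 * h                      # two LayerNorms, each with weight and bias
    per_layer = attention + feed_forward + layer_norms
    embeddings = cfg["vocab_size"] * h           # rotary positions add no learned weights
    return embeddings + cfg["num_hidden_layers"] * per_layer


head_dim = config["hidden_size"] // config["num_attention_heads"]
total = estimate_parameters(config)
print(f"head dimension: {head_dim}")
print(f"estimated parameters: {total / 1e6:.2f}M")
```

The estimate lands at about 7.4M, which is in line with the ≈7.5M stated in the README once the omitted components are taken into account.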
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:035f4112003d7c47a40fe93df7b61a3a7dc8e103be122d7588328f363b2dbd0c
+ size 30062528
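
Note on model.safetensors: the diff shows a Git LFS pointer, not the weights themselves. The pointer's `size` field still gives a quick sanity check on the parameter count, since config.json declares `float32` (4 bytes per value). The parser below is a minimal sketch for this three-line pointer format; `parse_lfs_pointer` is a hypothetical helper name.

```python
# Pointer content copied from the diff above.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:035f4112003d7c47a40fe93df7b61a3a7dc8e103be122d7588328f363b2dbd0c
size 30062528
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = parse_lfs_pointer(pointer_text)
size_bytes = int(pointer["size"])

# Upper bound on the number of float32 values: a safetensors file also
# carries a JSON header, so the true parameter count is slightly lower.
max_float32_values = size_bytes // 4
print(f"file size: {size_bytes} bytes")
print(f"at most {max_float32_values / 1e6:.2f}M float32 values")
```

The bound of about 7.52M values is consistent with the ≈7.5M parameters stated in the README.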
tokenizer_config.json ADDED
@@ -0,0 +1,53 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<cls>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "<eos>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32": {
+       "content": "<mask>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "backend": "custom",
+   "cls_token": "<cls>",
+   "eos_token": "<eos>",
+   "is_local": false,
+   "mask_token": "<mask>",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "<pad>",
+   "tokenizer_class": "EsmTokenizer",
+   "unk_token": "<unk>"
+ }
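
Note on tokenizer_config.json: the `model_max_length` value is exactly what Python returns for `int(1e30)`, which appears to be the placeholder `transformers` writes when no maximum length has been configured. In practice this means the tokenizer will not truncate on its own, even though the README reports training on at most 512 residues per sequence. The sketch below checks the sentinel and shows one way to clip inputs before tokenizing; `clip_sequence` is a hypothetical helper, not part of the repository.

```python
MODEL_MAX_LENGTH = 1000000000000000019884624838656  # copied from tokenizer_config.json
TRAIN_MAX_RESIDUES = 512  # README: 514 tokens = 512 AA + <cls> + <eos>

# The stored value matches the float 1e30 converted to an int.
is_unset_sentinel = MODEL_MAX_LENGTH == int(1e30)


def clip_sequence(sequence, max_residues=TRAIN_MAX_RESIDUES):
    """Hypothetical helper: keep only the first max_residues residues."""
    return sequence[:max_residues]


long_protein = "M" * 600
print(is_unset_sentinel, len(clip_sequence(long_protein)))
```

An alternative that needs no helper is to pass `truncation=True, max_length=514` in the tokenizer call. Whether inputs longer than the training length work well is not stated in the README, so clipping to the training length is only a cautious default.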
vocab.txt ADDED
@@ -0,0 +1,33 @@
+ <cls>
+ <pad>
+ <eos>
+ <unk>
+ L
+ A
+ G
+ V
+ S
+ E
+ R
+ T
+ I
+ D
+ P
+ K
+ Q
+ N
+ F
+ Y
+ M
+ H
+ W
+ C
+ X
+ B
+ U
+ Z
+ O
+ .
+ -
+ <null_1>
+ <mask>
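
Note on vocab.txt: token IDs follow the zero-based line order of this file, which is why `<pad>` is 1 and `<mask>` is 32 in config.json. The sketch below rebuilds that mapping and encodes a sequence one residue per token, wrapped in `<cls>` and `<eos>`. It is an illustrative re-implementation under those assumptions; the real `EsmTokenizer` may handle edge cases differently.

```python
# Tokens in the same order as vocab.txt above.
VOCAB = [
    "<cls>", "<pad>", "<eos>", "<unk>",
    "L", "A", "G", "V", "S", "E", "R", "T", "I", "D", "P", "K",
    "Q", "N", "F", "Y", "M", "H", "W", "C",
    "X", "B", "U", "Z", "O", ".", "-",
    "<null_1>", "<mask>",
]
token_to_id = {token: index for index, token in enumerate(VOCAB)}


def encode(sequence):
    """Map each residue to its ID and wrap the result in <cls> ... <eos>."""
    unk = token_to_id["<unk>"]
    body = [token_to_id.get(residue, unk) for residue in sequence]
    return [token_to_id["<cls>"]] + body + [token_to_id["<eos>"]]


print(encode("MSP"))
```

Characters outside the vocabulary, such as `J`, fall back to `<unk>` in this sketch.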