gbyuvd committed
Commit 5f0a90f · verified · 1 Parent(s): 6cbfeb0

Update README.md

Files changed (1)
  1. README.md +53 -25
README.md CHANGED
@@ -27,7 +27,7 @@ The Barlow Twins objective explicitly minimizes redundancy between embedding dim

  > Note: This is an experimental prototype.
  > Feel free to experiment with and edit the training script as you wish!
- > Correcting my mistake(s), tweaking augmentations, loss weights, optimizer settings, or network architecture could lead to even better representations.
  ---

  ## Model Details
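For context on the objective named in the hunk header above: the Barlow Twins loss drives the cross-correlation matrix between two augmented views toward the identity, with the off-diagonal (redundancy) term weighted by λ. The sketch below is a generic PyTorch illustration using the λ = 5.0 that the README reports for its custom `BarlowTwinsLoss`; it is not the repository's actual implementation.

```python
# Generic Barlow Twins loss sketch (illustrative, not the repo's BarlowTwinsLoss).
import torch

def barlow_twins_loss(z_a: torch.Tensor, z_b: torch.Tensor, lambd: float = 5.0) -> torch.Tensor:
    """z_a, z_b: (batch, dim) embeddings of two augmented SMILES views of the same molecules."""
    n = z_a.shape[0]
    # Standardize each embedding dimension across the batch
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    c = (z_a.T @ z_b) / n                                        # (dim, dim) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # invariance: diagonal -> 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy: off-diagonal -> 0
    return on_diag + lambd * off_diag
```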
@@ -77,27 +77,26 @@ pip install -U sentence-transformers rdkit-pypi
  ```python
  from sentence_transformers import SentenceTransformer

- # Load from Hugging Face Hub
  model = SentenceTransformer("gbyuvd/miniChembed-prototype")
-
- # Encode SMILES
  sentences = [
- 'O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3', # Cytisine
- "n1c2cc3c(cc2ncc1)[C@@H]4CNC[C@H]3C4", # Varenicline
- "c1ncccc1[C@@H]2CCCN2C", # Nicotine
- 'Nc1nc2cncc-2co1', # CID: 162789184
  ]
-
  embeddings = model.encode(sentences)
- print(embeddings.shape) # (4, 320)

- # Compute pairwise cosine similarities
  similarities = model.similarity(embeddings, embeddings)
  print(similarities)
- # tensor([[1.0000, 0.4342, 0.5141, 0.2582],
- #         [0.4342, 1.0000, 0.8779, 0.8886],
- #         [0.5141, 0.8779, 1.0000, 0.9551],
- #         [0.2582, 0.8886, 0.9551, 1.0000]])
  ```

  High cosine similarity suggests structural or topological relatedness learned purely from SMILES variation and not from explicit chemical knowledge/labeling.
@@ -107,7 +106,7 @@ High cosine similarity suggests structural or topological relatedness learned pu
  ---

  ## Comparison to Traditional Fingerprints
-
  | Feature | ECFP4 / MACCS | miniChembed-prototype |
  |--------|----------------|------------------------|
  | **Representation** | Hand-crafted binary fingerprint | Learned dense embedding |
@@ -115,6 +114,18 @@ High cosine similarity suggests structural or topological relatedness learned pu
  | **Global semantics** | Captures only local substructures | Learns global invariances via augmentation |
  | **Redundancy control** | Not applicable | Explicitly minimized (Barlow objective) |

  ---

  ## Training Summary
@@ -123,8 +134,13 @@ High cosine similarity suggests structural or topological relatedness learned pu
  - **Key metric**: Barlow Health Score = `mean(same-molecule cosine) – mean(cross-molecule cosine)`
    → Higher = better separation between intra- and inter-molecular similarity.
  - **Validation**: Evaluated every 25% of training; best checkpoint selected by health score.
- - **Final health**: , indicating strong disentanglement.

  ---

  ## Limitations
@@ -138,15 +154,14 @@ High cosine similarity suggests structural or topological relatedness learned pu

  ## Reproducibility

- This model was trained using a custom script based on **Sentence Transformers v5.1.0**, with the following environment:

- - **Python**: 3.13.0
- - **sentence-transformers**: 5.1.0
- - **PyTorch**: 2.6.0
- - **RDKit**: 2023.09.3
- - **Optimizer**: Ranger21 (with epoch-aware warmup/warmdown)
- - **Loss**: Custom `BarlowTwinsLoss` (λ = 5.0)
- - **Augmentation**: RDKit-based stochastic SMILES

  Training code, config, and evaluation are available on this repo under `train_barlow.py` and `config.yaml`
 
@@ -173,6 +188,7 @@ Do note that the method used here doesn't use a target network, rather, using RD
  If you use this model, please cite:

  ```bibtex
  @inproceedings{reimers-2019-sentence-bert,
      title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
      author = "Reimers, Nils and Gurevych, Iryna",
@@ -181,6 +197,18 @@ If you use this model, please cite:
      url = "https://arxiv.org/abs/1908.10084"
  }

  @article{sorokina2021coconut,
      title={COCONUT online: Collection of Open Natural Products database},
      author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
 
  > Note: This is an experimental prototype.
  > Feel free to experiment with and edit the training script as you wish!
+ > Correcting my mistakes, tweaking augmentations, loss weights, optimizer settings, or network architecture could lead to even better representations.
  ---

  ## Model Details
 
  ```python
  from sentence_transformers import SentenceTransformer

+ # Download from the 🤗 Hub
  model = SentenceTransformer("gbyuvd/miniChembed-prototype")
+ # Run inference
  sentences = [
+ 'O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3', # Cytisine
+ "n1c2cc3c(cc2ncc1)[C@@H]4CNC[C@H]3C4", # Varenicline
+ "c1ncccc1[C@@H]2CCCN2C", # Nicotine
+ 'Nc1nc2cncc-2co1', # CID: 162789184
  ]
  embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # (4, 320)

+ # Get the similarity scores for the embeddings
  similarities = model.similarity(embeddings, embeddings)
  print(similarities)
+ # tensor([[ 1.0000,  0.2279, -0.1979, -0.3754],
+ #         [ 0.2279,  1.0000,  0.7371,  0.6745],
+ #         [-0.1979,  0.7371,  1.0000,  0.9803],
+ #         [-0.3754,  0.6745,  0.9803,  1.0000]])
  ```

  High cosine similarity suggests structural or topological relatedness learned purely from SMILES variation and not from explicit chemical knowledge/labeling.
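A common way to use these similarity scores is nearest-neighbour lookup of a query molecule against a SMILES corpus. The sketch below reuses the same model and `model.similarity` API; the query and corpus are just the illustrative molecules from the example above, not a curated retrieval benchmark.

```python
# Rank a small SMILES corpus by cosine similarity to a query molecule (illustrative).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("gbyuvd/miniChembed-prototype")
corpus = [
    'O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3',  # Cytisine
    "n1c2cc3c(cc2ncc1)[C@@H]4CNC[C@H]3C4",   # Varenicline
    'Nc1nc2cncc-2co1',                       # CID: 162789184
]
query = "c1ncccc1[C@@H]2CCCN2C"  # Nicotine

corpus_emb = model.encode(corpus)
query_emb = model.encode([query])
scores = model.similarity(query_emb, corpus_emb)[0]  # cosine scores, one per corpus entry
ranked = sorted(zip(corpus, scores.tolist()), key=lambda pair: pair[1], reverse=True)
for smi, score in ranked:
    print(f"{score:.3f}  {smi}")
```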
 
  ---

  ## Comparison to Traditional Fingerprints
+ ### Overview
  | Feature | ECFP4 / MACCS | miniChembed-prototype |
  |--------|----------------|------------------------|
  | **Representation** | Hand-crafted binary fingerprint | Learned dense embedding |
  | **Global semantics** | Captures only local substructures | Learns global invariances via augmentation |
  | **Redundancy control** | Not applicable | Explicitly minimized (Barlow objective) |

+ ### Clustering
+
+ Preliminary clustering evaluation vs. ECFP4 on 64 molecules with 4 classes (a sketch of how such a comparison can be run follows the metrics below):
+
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/SNH7u0tegdzmYGFbJ9F-0.png)
+
+ - ARI (embeddings): 0.084
+ - ARI (ECFP4): 0.024
+ - Silhouette (embeddings): 0.398
+ - Silhouette (ECFP4): 0.025
+ - Top-5 retrieval accuracy (embeddings): 0.341
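A rough sketch of how such a comparison can be reproduced: cluster the learned embeddings and ECFP4 fingerprints with k-means, then report ARI and silhouette. It assumes scikit-learn and RDKit are installed; the SMILES and class labels are placeholders, not the actual 64-molecule benchmark set.

```python
# Illustrative embeddings-vs-ECFP4 clustering comparison (placeholder data).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

smiles = [
    'O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3',  # Cytisine
    "n1c2cc3c(cc2ncc1)[C@@H]4CNC[C@H]3C4",   # Varenicline
    "c1ncccc1[C@@H]2CCCN2C",                 # Nicotine
    'Nc1nc2cncc-2co1',                       # CID: 162789184
]
labels = [0, 0, 0, 1]  # placeholder class labels; substitute the real benchmark labels
n_classes = len(set(labels))

# Learned embeddings
model = SentenceTransformer("gbyuvd/miniChembed-prototype")
emb = model.encode(smiles)

# ECFP4 baseline (Morgan fingerprints, radius 2, 2048 bits)
fps = np.array(
    [list(AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048))
     for s in smiles],
    dtype=float,
)

for name, X in [("Embeddings", emb), ("ECFP4", fps)]:
    pred = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit_predict(X)
    print(f"{name:10s} ARI={adjusted_rand_score(labels, pred):.3f} "
          f"Silhouette={silhouette_score(X, pred):.3f}")
```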

  ---

  ## Training Summary

  - **Key metric**: Barlow Health Score = `mean(same-molecule cosine) – mean(cross-molecule cosine)`
    → Higher = better separation between intra- and inter-molecular similarity.
  - **Validation**: Evaluated every 25% of training; best checkpoint selected by health score.
+ - **Final health**: 0.891 at step 1885, indicating strong disentanglement.

+ ```
+ Step 1885 | Alignment=0.017 | Uniformity=-1.338
+ Same-mol cos: 0.983±0.032 | Pairwise: 0.093±0.518
+ Barlow Health: 0.891
+ ```
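The Barlow Health Score in the log above is simply the gap between same-molecule and cross-molecule cosine similarity. A minimal sketch of that computation, assuming two numpy arrays of embeddings where row i of each array comes from a different SMILES writing of molecule i; this is not the repository's exact evaluation code.

```python
# Sketch: Barlow Health Score from two batches of embeddings of the same molecules,
# one batch per augmented SMILES view (illustrative, not the repo's evaluator).
import numpy as np

def barlow_health(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """emb_a[i] and emb_b[i] embed two SMILES variants of molecule i; shape (n, dim)."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cos = a @ b.T                                          # all pairwise cosines between views
    same_mol = np.diag(cos).mean()                         # same-molecule cosine (diagonal)
    cross_mol = cos[~np.eye(len(cos), dtype=bool)].mean()  # cross-molecule cosine (off-diagonal)
    return float(same_mol - cross_mol)
```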
  ---

  ## Limitations
 
  ## Reproducibility

+ This model was trained using a custom script based on Sentence Transformers v5.1.0, with the following environment:

+ - Python: 3.13.0
+ - Transformers: 4.56.2
+ - PyTorch: 2.6.0+cu126
+ - Accelerate: 1.10.1
+ - Datasets: 4.0.0
+ - Tokenizers: 0.22.0

  Training code, config, and evaluation are available on this repo under `train_barlow.py` and `config.yaml`
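As noted elsewhere in the README, the method does not use a target network; positive pairs come from RDKit-based stochastic SMILES augmentation. A minimal sketch of that kind of augmentation, using RDKit's non-canonical random SMILES writer; the helper name is illustrative and not taken from `train_barlow.py`.

```python
# Sketch of RDKit-based stochastic SMILES augmentation: two random writings per molecule.
from rdkit import Chem

def random_smiles_pair(smiles: str) -> tuple[str, str]:
    """Return two randomized (non-canonical) SMILES strings of the same molecule."""
    mol = Chem.MolFromSmiles(smiles)
    view_a = Chem.MolToSmiles(mol, canonical=False, doRandom=True)
    view_b = Chem.MolToSmiles(mol, canonical=False, doRandom=True)
    return view_a, view_b

print(random_smiles_pair("c1ncccc1[C@@H]2CCCN2C"))  # two SMILES variants of nicotine
```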
 
 
  If you use this model, please cite:

  ```bibtex
+ SBERT:
  @inproceedings{reimers-2019-sentence-bert,
      title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
      author = "Reimers, Nils and Gurevych, Iryna",
      url = "https://arxiv.org/abs/1908.10084"
  }

+ Tokenizer:
+ @misc{chithrananda2020chembertalargescaleselfsupervisedpretraining,
+     title={ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction},
+     author={Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
+     year={2020},
+     eprint={2010.09885},
+     archivePrefix={arXiv},
+     primaryClass={cs.LG},
+     url={https://arxiv.org/abs/2010.09885},
+ }
+
+ Data:
  @article{sorokina2021coconut,
      title={COCONUT online: Collection of Open Natural Products database},
      author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},