Update README.md
> Note: This is an experimental prototype.
> Feel free to experiment with and edit the training script as you wish!
> Correcting my mistakes, tweaking augmentations, loss weights, optimizer settings, or network architecture could lead to even better representations.

---

## Model Details

[...]

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("gbyuvd/miniChembed-prototype")

# Run inference
sentences = [
    'O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3',  # Cytisine
    "n1c2cc3c(cc2ncc1)[C@@H]4CNC[C@H]3C4",   # Varenicline
    "c1ncccc1[C@@H]2CCCN2C",                 # Nicotine
    'Nc1nc2cncc-2co1',                       # CID: 162789184
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (4, 320)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.0000,  0.2279, -0.1979, -0.3754],
#         [ 0.2279,  1.0000,  0.7371,  0.6745],
#         [-0.1979,  0.7371,  1.0000,  0.9803],
#         [-0.3754,  0.6745,  0.9803,  1.0000]])
```

High cosine similarity suggests structural or topological relatedness learned purely from SMILES variation, not from explicit chemical knowledge or labeling.
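
Beyond pairwise scores, the same embeddings can drive a simple nearest-neighbour search over a set of molecules. Below is a minimal sketch using `sentence_transformers.util.semantic_search`; the corpus and query are illustrative placeholders, not part of the model's training or evaluation data:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("gbyuvd/miniChembed-prototype")

# Illustrative corpus of SMILES strings
corpus = [
    "c1ncccc1[C@@H]2CCCN2C",                 # nicotine
    "O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3",  # cytisine
    "CCO",                                   # ethanol
    "c1ccccc1",                              # benzene
]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

query = "n1c2cc3c(cc2ncc1)[C@@H]4CNC[C@H]3C4"  # varenicline
query_emb = model.encode(query, convert_to_tensor=True)

# Rank corpus molecules by cosine similarity to the query
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```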

---

## Comparison to Traditional Fingerprints

### Overview

| Feature | ECFP4 / MACCS | miniChembed-prototype |
|---------|---------------|-----------------------|
| **Representation** | Hand-crafted binary fingerprint | Learned dense embedding |
| **Global semantics** | Captures only local substructures | Learns global invariances via augmentation |
| **Redundancy control** | Not applicable | Explicitly minimized (Barlow objective) |
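
To make the "Representation" row concrete, the sketch below scores one molecule pair both ways: an ECFP4 (Morgan, radius 2) fingerprint compared with Tanimoto similarity via RDKit, and a dense embedding compared with this model's cosine similarity. The pair is illustrative and the printed numbers are not claims about either method:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("gbyuvd/miniChembed-prototype")

smi_a = "c1ncccc1[C@@H]2CCCN2C"                 # nicotine
smi_b = "O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3"  # cytisine

# Hand-crafted binary fingerprint + Tanimoto similarity
fps = [
    AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
    for s in (smi_a, smi_b)
]
print("Tanimoto (ECFP4)  :", DataStructs.TanimotoSimilarity(fps[0], fps[1]))

# Learned dense embedding + cosine similarity
emb = model.encode([smi_a, smi_b])
print("Cosine (embedding):", float(model.similarity(emb[0:1], emb[1:2])[0][0]))
```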

### Clustering

Preliminary clustering evaluation vs. ECFP4 on 64 molecules with 4 classes:



```
ARI (Embeddings)                       : 0.084
ARI (ECFP4)                            : 0.024
Silhouette (Embeddings)                : 0.398
Silhouette (ECFP4)                     : 0.025
Top-5 retrieval accuracy of embeddings : 0.341
```
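
The script behind these numbers is not reproduced here; a rough sketch of how ARI, silhouette, and a top-k retrieval accuracy could be computed with scikit-learn is shown below. The `smiles` list, `labels`, and the retrieval definition are placeholder assumptions, not the actual 64-molecule benchmark:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

model = SentenceTransformer("gbyuvd/miniChembed-prototype")

# Placeholder evaluation set: SMILES strings with integer class labels
smiles = ["c1ncccc1[C@@H]2CCCN2C", "CCO", "CCCO", "c1ccccc1", "c1ccncc1"]
labels = np.array([0, 1, 1, 2, 0])

emb = model.encode(smiles)
n_classes = len(set(labels.tolist()))

# Cluster the embeddings and compare the clusters to the known classes
pred = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit_predict(emb)
print("ARI        :", adjusted_rand_score(labels, pred))
print("Silhouette :", silhouette_score(emb, labels))

# Top-k retrieval accuracy: does any of a query's k nearest neighbours share its class?
sims = np.asarray(model.similarity(emb, emb))
np.fill_diagonal(sims, -np.inf)              # exclude self-matches
k = min(5, len(smiles) - 1)
topk = np.argsort(-sims, axis=1)[:, :k]
hits = (labels[topk] == labels[:, None]).any(axis=1)
print(f"Top-{k} retrieval accuracy :", hits.mean())
```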

---

## Training Summary

- **Key metric**: Barlow Health Score = `mean(same-molecule cosine) - mean(cross-molecule cosine)` (see the sketch below)
  → Higher = better separation between intra- and inter-molecular similarity.
- **Validation**: Evaluated every 25% of training; best checkpoint selected by health score.
- **Final health**: 0.891 at step 1885, indicating strong disentanglement.

```
Step 1885 | Alignment=0.017 | Uniformity=-1.338
Same-mol cos: 0.983±0.032 | Pairwise: 0.093±0.518
Barlow Health: 0.891
```
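
The health score itself is simple to recompute from two encodings of the same molecules. A minimal sketch, with function and variable names that are illustrative rather than taken from `train_barlow.py`:

```python
import numpy as np

def barlow_health(view_a: np.ndarray, view_b: np.ndarray) -> float:
    """Barlow Health Score = mean(same-molecule cosine) - mean(cross-molecule cosine).

    view_a, view_b: embeddings of the same molecules under two different
    SMILES augmentations, shape (n_molecules, dim), with row i = molecule i.
    """
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    cos = a @ b.T                                  # full cosine matrix
    same = np.diag(cos).mean()                     # same molecule, two views
    cross = cos[~np.eye(len(cos), dtype=bool)]     # different molecules
    return float(same - cross.mean())
```

With the logged values above (same-molecule cosine ≈ 0.983, cross-molecule ≈ 0.093), this works out to roughly 0.983 - 0.093 ≈ 0.89, consistent with the reported health of 0.891.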

---

## Limitations

[...]

## Reproducibility

This model was trained using a custom script based on Sentence Transformers v5.1.0, with the following environment:

- Python: 3.13.0
- Transformers: 4.56.2
- PyTorch: 2.6.0+cu126
- Accelerate: 1.10.1
- Datasets: 4.0.0
- Tokenizers: 0.22.0

Training code, config, and evaluation are available in this repo under `train_barlow.py` and `config.yaml`.
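
The augmentation itself lives in `train_barlow.py`; conceptually it relies on RDKit's stochastic SMILES enumeration, which writes the same molecule as several different strings so the model learns to map them to the same point. A minimal sketch of that idea, with an illustrative helper name rather than the actual code:

```python
from rdkit import Chem

def random_smiles_pair(smiles: str) -> tuple[str, str]:
    """Return two randomized SMILES writings of the same molecule.

    Both strings describe an identical structure, so their embeddings
    should agree; this is what the same-molecule cosine measures.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    return tuple(
        Chem.MolToSmiles(mol, canonical=False, doRandom=True) for _ in range(2)
    )

print(random_smiles_pair("c1ncccc1[C@@H]2CCCN2C"))  # two different spellings of nicotine
```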

If you use this model, please cite:

```bibtex
SBERT:
@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  year = "2019",
  url = "https://arxiv.org/abs/1908.10084"
}

Tokenizer:
@misc{chithrananda2020chembertalargescaleselfsupervisedpretraining,
  title={ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction},
  author={Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
  year={2020},
  eprint={2010.09885},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2010.09885},
}

Data:
@article{sorokina2021coconut,
  title={COCONUT online: Collection of Open Natural Products database},
  author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},