Update README.md
> Note: This is an experimental prototype.
> Feel free to experiment with and edit the training script as you wish!
> Correcting my mistakes, tweaking augmentations, loss weights, optimizer settings, or network architecture could lead to even better representations.

---

## Model Details

[...]

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("gbyuvd/miniChembed-prototype")

# Run inference
sentences = [
    'O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3',  # Cytisine
    "n1c2cc3c(cc2ncc1)[C@@H]4CNC[C@H]3C4",   # Varenicline
    "c1ncccc1[C@@H]2CCCN2C",                 # Nicotine
    'Nc1nc2cncc-2co1',                       # CID: 162789184
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (4, 320)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.0000,  0.2279, -0.1979, -0.3754],
#         [ 0.2279,  1.0000,  0.7371,  0.6745],
#         [-0.1979,  0.7371,  1.0000,  0.9803],
#         [-0.3754,  0.6745,  0.9803,  1.0000]])
```

High cosine similarity suggests structural or topological relatedness learned purely from SMILES variation, not from explicit chemical knowledge or labeling.
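
Beyond pairwise scores, the same embeddings can drive a simple nearest-neighbour search over a set of molecules. Below is a minimal sketch using `sentence_transformers.util.semantic_search`; the corpus and query are illustrative placeholders, not part of the model's training or evaluation data:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("gbyuvd/miniChembed-prototype")

# Illustrative corpus of SMILES strings
corpus = [
    "c1ncccc1[C@@H]2CCCN2C",                 # nicotine
    "O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3",  # cytisine
    "CCO",                                   # ethanol
    "c1ccccc1",                              # benzene
]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

query = "n1c2cc3c(cc2ncc1)[C@@H]4CNC[C@H]3C4"  # varenicline
query_emb = model.encode(query, convert_to_tensor=True)

# Rank corpus molecules by cosine similarity to the query
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```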

---

## Comparison to Traditional Fingerprints

### Overview

| Feature | ECFP4 / MACCS | miniChembed-prototype |
|---------|---------------|-----------------------|
| **Representation** | Hand-crafted binary fingerprint | Learned dense embedding |
| **Global semantics** | Captures only local substructures | Learns global invariances via augmentation |
| **Redundancy control** | Not applicable | Explicitly minimized (Barlow objective) |
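
To make the "Representation" row concrete, the sketch below scores one molecule pair both ways: an ECFP4 (Morgan, radius 2) fingerprint compared with Tanimoto similarity via RDKit, and a dense embedding compared with this model's cosine similarity. The pair is illustrative and the printed numbers are not claims about either method:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("gbyuvd/miniChembed-prototype")

smi_a = "c1ncccc1[C@@H]2CCCN2C"                 # nicotine
smi_b = "O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3"  # cytisine

# Hand-crafted binary fingerprint + Tanimoto similarity
fps = [
    AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
    for s in (smi_a, smi_b)
]
print("Tanimoto (ECFP4)  :", DataStructs.TanimotoSimilarity(fps[0], fps[1]))

# Learned dense embedding + cosine similarity
emb = model.encode([smi_a, smi_b])
print("Cosine (embedding):", float(model.similarity(emb[0:1], emb[1:2])[0][0]))
```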

### Clustering

Preliminary clustering evaluation vs. ECFP4 on 64 molecules with 4 classes:



```
ARI (Embeddings)                       : 0.084
ARI (ECFP4)                            : 0.024
Silhouette (Embeddings)                : 0.398
Silhouette (ECFP4)                     : 0.025
Top-5 retrieval accuracy of embeddings : 0.341
```
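
The script behind these numbers is not reproduced here; a rough sketch of how ARI, silhouette, and a top-k retrieval accuracy could be computed with scikit-learn is shown below. The `smiles` list, `labels`, and the retrieval definition are placeholder assumptions, not the actual 64-molecule benchmark:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

model = SentenceTransformer("gbyuvd/miniChembed-prototype")

# Placeholder evaluation set: SMILES strings with integer class labels
smiles = ["c1ncccc1[C@@H]2CCCN2C", "CCO", "CCCO", "c1ccccc1", "c1ccncc1"]
labels = np.array([0, 1, 1, 2, 0])

emb = model.encode(smiles)
n_classes = len(set(labels.tolist()))

# Cluster the embeddings and compare the clusters to the known classes
pred = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit_predict(emb)
print("ARI        :", adjusted_rand_score(labels, pred))
print("Silhouette :", silhouette_score(emb, labels))

# Top-k retrieval accuracy: does any of a query's k nearest neighbours share its class?
sims = np.asarray(model.similarity(emb, emb))
np.fill_diagonal(sims, -np.inf)              # exclude self-matches
k = min(5, len(smiles) - 1)
topk = np.argsort(-sims, axis=1)[:, :k]
hits = (labels[topk] == labels[:, None]).any(axis=1)
print(f"Top-{k} retrieval accuracy :", hits.mean())
```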

---

## Training Summary

- **Key metric**: Barlow Health Score = `mean(same-molecule cosine) - mean(cross-molecule cosine)` (see the sketch below)
  → Higher = better separation between intra- and inter-molecular similarity.
- **Validation**: Evaluated every 25% of training; best checkpoint selected by health score.
- **Final health**: 0.891 at step 1885, indicating strong disentanglement.

```
Step 1885 | Alignment=0.017 | Uniformity=-1.338
Same-mol cos: 0.983±0.032 | Pairwise: 0.093±0.518
Barlow Health: 0.891
```
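
The health score itself is simple to recompute from two encodings of the same molecules. A minimal sketch, with function and variable names that are illustrative rather than taken from `train_barlow.py`:

```python
import numpy as np

def barlow_health(view_a: np.ndarray, view_b: np.ndarray) -> float:
    """Barlow Health Score = mean(same-molecule cosine) - mean(cross-molecule cosine).

    view_a, view_b: embeddings of the same molecules under two different
    SMILES augmentations, shape (n_molecules, dim), with row i = molecule i.
    """
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    cos = a @ b.T                                  # full cosine matrix
    same = np.diag(cos).mean()                     # same molecule, two views
    cross = cos[~np.eye(len(cos), dtype=bool)]     # different molecules
    return float(same - cross.mean())
```

With the logged values above (same-molecule cosine ≈ 0.983, cross-molecule ≈ 0.093), this works out to roughly 0.983 - 0.093 ≈ 0.89, consistent with the reported health of 0.891.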

---

## Limitations

[...]

## Reproducibility

This model was trained using a custom script based on Sentence Transformers v5.1.0, with the following environment:

- Python: 3.13.0
- Transformers: 4.56.2
- PyTorch: 2.6.0+cu126
- Accelerate: 1.10.1
- Datasets: 4.0.0
- Tokenizers: 0.22.0

Training code, config, and evaluation are available in this repo under `train_barlow.py` and `config.yaml`.
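
The augmentation itself lives in `train_barlow.py`; conceptually it relies on RDKit's stochastic SMILES enumeration, which writes the same molecule as several different strings so the model learns to map them to the same point. A minimal sketch of that idea, with an illustrative helper name rather than the actual code:

```python
from rdkit import Chem

def random_smiles_pair(smiles: str) -> tuple[str, str]:
    """Return two randomized SMILES writings of the same molecule.

    Both strings describe an identical structure, so their embeddings
    should agree; this is what the same-molecule cosine measures.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    return tuple(
        Chem.MolToSmiles(mol, canonical=False, doRandom=True) for _ in range(2)
    )

print(random_smiles_pair("c1ncccc1[C@@H]2CCCN2C"))  # two different spellings of nicotine
```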

If you use this model, please cite:

```bibtex
SBERT:
@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  year = "2019",
  url = "https://arxiv.org/abs/1908.10084"
}

Tokenizer:
@misc{chithrananda2020chembertalargescaleselfsupervisedpretraining,
  title={ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction},
  author={Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
  year={2020},
  eprint={2010.09885},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2010.09885},
}

Data:
@article{sorokina2021coconut,
  title={COCONUT online: Collection of Open Natural Products database},
  author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},