Update README.md
README.md
CHANGED
@@ -5,7 +5,7 @@ tags:
 - chemistry
 - molecular-similarity
 - cheminformatics
--
+- ssl
 - smiles
 - feature-extraction
 pipeline_tag: sentence-similarity
@@ -63,7 +63,6 @@ SentenceTransformer(
 ```
 
 > Note: The model was not initialized from a language model, it is trained from scratch on SMILES using only the Barlow Twins objective.
-
 ---
 
 ## Usage
@@ -73,7 +72,7 @@ SentenceTransformer(
 pip install -U sentence-transformers rdkit-pypi
 ```
 
-###
+### Direct Usage (Sentence Transformers)
 ```python
 from sentence_transformers import SentenceTransformer
 
@@ -101,9 +100,20 @@ print(similarities)
 
 High cosine similarity suggests structural or topological relatedness learned purely from SMILES variation and not from explicit chemical knowledge/labeling.
 
+### Testing Similarity Search
 > Tip: For large-scale similarity search, integrate embeddings with Meta's FAISS.
 
-
+Cytisine as query, on 24K embedded index:
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/F6qXMBvi2Qwc4GyA4PPfY.png)
+
+```
+Rank 1: SMILES = O=C1OC2C(O)CC1C1C2N(Cc2ccc(F)cc2)C(=S)N1CC1CCCCC1, Cosine Similarity = 0.9944
+Rank 2: SMILES = CN1C(CCC(=O)N2CCC(O)CC2)CNC(=O)C2C1CCN2Cc1ncc[nH]1, Cosine Similarity = 0.9940
+Rank 3: SMILES = CC1C(=O)OC2C1CCC1(C)Cc3sc(NC(=O)Nc4cccc(F)c4)nc3C(C)C21, Cosine Similarity = 0.9938
+Rank 4: SMILES = Cc1ccc(NC(=O)Nc2nc3c(s2)CC2(C)CCC4C(C)C(=O)OC4C2C3C)cc1, Cosine Similarity = 0.9938
+Rank 5: SMILES = O=C(CC1CC2OC(CNC3Cc4ccccc4C3)C(O)C2O1)N1CCC(F)(F)C1, Cosine Similarity = 0.9929
+```
+
 
 ## Comparison to Traditional Fingerprints
 ### Overview
@@ -120,11 +130,12 @@ Preliminary clustering evaluation vs. ECFP4 on 64 molecules with 4 classes:
 
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/g5sROVRlz6F-_AWjUfZCd.png)
 
+```
 ARI (Embeddings) : 0.084
 ARI (ECFP4) : 0.024
 Silhouette (Embeddings) : 0.398
 Silhouette (ECFP4) : 0.025
-
+```
 
 ---
 
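
The "Rank N ... Cosine Similarity" listing added in this change is the kind of output a brute-force cosine search produces. The following is a minimal sketch of that ranking step in plain Python, not the author's actual script. The names `cosine_similarity`, `top_k`, `toy_index`, and `toy_query` are hypothetical, and the 3-dimensional vectors are toy stand-ins for real model embeddings. All vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k=5):
    """Return the k (key, score) pairs most similar to query, best first."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy vectors; a real index would map SMILES strings
# to the embeddings produced by the model.
toy_index = {
    "mol_a": [1.0, 0.0, 0.0],
    "mol_b": [0.9, 0.1, 0.0],
    "mol_c": [0.0, 1.0, 0.0],
}
toy_query = [1.0, 0.05, 0.0]

for rank, (key, score) in enumerate(top_k(toy_query, toy_index, k=2), start=1):
    print(f"Rank {rank}: {key}, Cosine Similarity = {score:.4f}")
```

A linear scan like this costs one similarity computation per indexed molecule, which is workable for an index of around 24K entries. For much larger collections, the README's tip about FAISS applies, with the scan replaced by an index lookup.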