Commit d4d41d0 (verified) by gbyuvd · 1 parent: 5f0a90f

Update README.md

  1. README.md +16 -5
README.md CHANGED
@@ -5,7 +5,7 @@ tags:
  - chemistry
  - molecular-similarity
  - cheminformatics
- - unsupervised-learning
  - smiles
  - feature-extraction
  pipeline_tag: sentence-similarity
@@ -63,7 +63,6 @@ SentenceTransformer(
  ```
 
  > Note: The model was not initialized from a language model, it is trained from scratch on SMILES using only the Barlow Twins objective.
-
  ---
 
  ## Usage
@@ -73,7 +72,7 @@ SentenceTransformer(
  pip install -U sentence-transformers rdkit-pypi
  ```
 
- ### Encoding Molecules
  ```python
  from sentence_transformers import SentenceTransformer
 
@@ -101,9 +100,20 @@ print(similarities)
 
  High cosine similarity suggests structural or topological relatedness learned purely from SMILES variation and not from explicit chemical knowledge/labeling.
 
  > Tip: For large-scale similarity search, integrate embeddings with Meta's FAISS.
 
- ---
 
  ## Comparison to Traditional Fingerprints
  ### Overview
@@ -120,11 +130,12 @@ Preliminary clustering evaluation vs. ECFP4 on 64 molecules with 4 classes:
 
  ![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/SNH7u0tegdzmYGFbJ9F-0.png)
 
  ARI (Embeddings) : 0.084
  ARI (ECFP4) : 0.024
  Silhouette (Embeddings) : 0.398
  Silhouette (ECFP4) : 0.025
- Top-5 retrieval accuracy of embeddings : 0.341
 
  ---
 

  - chemistry
  - molecular-similarity
  - cheminformatics
+ - ssl
  - smiles
  - feature-extraction
  pipeline_tag: sentence-similarity

  ```
 
  > Note: The model was not initialized from a language model, it is trained from scratch on SMILES using only the Barlow Twins objective.
  ---
 
  ## Usage

  pip install -U sentence-transformers rdkit-pypi
  ```
 
+ ### Direct Usage (Sentence Transformers)
  ```python
  from sentence_transformers import SentenceTransformer
 
 
 
  High cosine similarity suggests structural or topological relatedness learned purely from SMILES variation and not from explicit chemical knowledge/labeling.
 
+ ### Testing Similarity Search
  > Tip: For large-scale similarity search, integrate embeddings with Meta's FAISS.
 
+ Cytisine as query, on 24K embedded index:
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/kZciikiDjFOCXJrCzb1Lh.png)
+
+ ```
+ Rank 1: SMILES = O=C1OC2C(O)CC1C1C2N(Cc2ccc(F)cc2)C(=S)N1CC1CCCCC1, Cosine Similarity = 0.9944
+ Rank 2: SMILES = CN1C(CCC(=O)N2CCC(O)CC2)CNC(=O)C2C1CCN2Cc1ncc[nH]1, Cosine Similarity = 0.9940
+ Rank 3: SMILES = CC1C(=O)OC2C1CCC1(C)Cc3sc(NC(=O)Nc4cccc(F)c4)nc3C(C)C21, Cosine Similarity = 0.9938
+ Rank 4: SMILES = Cc1ccc(NC(=O)Nc2nc3c(s2)CC2(C)CCC4C(C)C(=O)OC4C2C3C)cc1, Cosine Similarity = 0.9938
+ Rank 5: SMILES = O=C(CC1CC2OC(CNC3Cc4ccccc4C3)C(O)C2O1)N1CCC(F)(F)C1, Cosine Similarity = 0.9929
+ ```
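
Editor's sketch (not part of the commit): the ranked list above is the kind of output a top-k cosine search produces. The following is a minimal brute-force version in pure Python, assuming embeddings are plain lists of floats. The names `cosine_similarity` and `top_k` and the toy vectors are hypothetical, chosen only for illustration; they are not the author's code and not real model embeddings. For an index of the size mentioned above, the linear scan here is the step a library such as FAISS would replace.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors: treat as unrelated
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=5):
    """Return the k (smiles, score) pairs most similar to query_vec.

    `index` is a list of (smiles, vector) pairs; this is a linear scan,
    so it is only meant for small collections.
    """
    scored = [(smiles, cosine_similarity(query_vec, vec)) for smiles, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for real embeddings (made up).
toy_index = [
    ("CCO", [1.0, 0.0, 0.0]),
    ("CCN", [0.9, 0.1, 0.0]),
    ("c1ccccc1", [0.0, 1.0, 0.0]),
]
query = [1.0, 0.05, 0.0]

for rank, (smiles, score) in enumerate(top_k(query, toy_index, k=2), start=1):
    print(f"Rank {rank}: SMILES = {smiles}, Cosine Similarity = {score:.4f}")
```

In practice the vectors would come from the model's `encode` call rather than being written by hand, and normalizing them once up front lets an inner-product index return the same ranking.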
+
 
  ## Comparison to Traditional Fingerprints
  ### Overview
 
 
  ![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/SNH7u0tegdzmYGFbJ9F-0.png)
 
+ ```
  ARI (Embeddings) : 0.084
  ARI (ECFP4) : 0.024
  Silhouette (Embeddings) : 0.398
  Silhouette (ECFP4) : 0.025
+ ```
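
Editor's sketch (not part of the commit): the ARI rows above report the adjusted Rand index, which scores how well a clustering agrees with known class labels after correcting for chance agreement. The function below is a minimal standard-library version of the usual pair-counting formula. The name `adjusted_rand_index` and the toy labels are hypothetical and unrelated to the 64-molecule evaluation; the README does not say which implementation produced its numbers.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one item: the labelings trivially agree

    # Pairs of items that share a cell of the contingency table,
    # a true class, or a predicted cluster, respectively.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one group
    return (sum_cells - expected) / (max_index - expected)


# Toy labels (made up): four classes of two items each, one item misassigned.
classes = [0, 0, 1, 1, 2, 2, 3, 3]
clusters = [0, 0, 1, 1, 2, 2, 3, 0]
print(f"ARI (toy example) : {adjusted_rand_index(classes, clusters):.3f}")
```

A value of 1 means the clustering reproduces the classes exactly up to relabeling, and values near 0 mean agreement no better than chance, which is the scale on which the embedding and ECFP4 scores above should be read.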
 
  ---
 