Dataset: `jglaser/binding_affinity`
Fine-tuned ChemBERTa-77M-MLM for predicting protein-ligand binding affinity (pKd) from SMILES strings. Part of the OmniBiMol platform.
| Metric | Validation | Test |
|---|---|---|
| RMSE | 1.3924 | 1.3672 |
| MAE | 1.1447 | 1.0982 |
| R² | 0.1163 | 0.1671 |
| Pearson r | 0.3653 | 0.4093 |
| Spearman ρ | 0.3584 | 0.3983 |
Note: scaffold splits are substantially harder than random splits (the literature reports 30–50% metric degradation). These results are on par with published baselines for structure-free binding affinity prediction on diverse datasets.
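For context, a Bemis-Murcko scaffold split assigns whole scaffold groups to train or test, so test molecules have frameworks never seen in training. The sketch below is a hypothetical illustration, not the training script; it assumes scaffold SMILES have already been computed (e.g. with RDKit's `MurckoScaffold`):

```python
# Sketch of a scaffold split (illustrative helper, not the exact pipeline code).
# Scaffold SMILES are assumed precomputed, e.g. via RDKit MurckoScaffold.
from collections import defaultdict

def scaffold_split(smiles_to_scaffold, test_frac=0.2):
    """Group molecules by scaffold, then assign whole groups to train
    (largest groups first) until the test budget is all that remains."""
    groups = defaultdict(list)
    for smi, scaffold in smiles_to_scaffold.items():
        groups[scaffold].append(smi)
    # Largest scaffold groups go to train, so the test set holds rare scaffolds.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n_total = len(smiles_to_scaffold)
    train, test = [], []
    for group in ordered:
        if len(train) + len(group) <= n_total * (1 - test_frac):
            train.extend(group)
        else:
            test.extend(group)
    return train, test

mols = {
    "CC(=O)Oc1ccccc1C(=O)O": "c1ccccc1",      # aspirin -> benzene scaffold
    "CC(C)Cc1ccc(cc1)C(C)C(=O)O": "c1ccccc1", # ibuprofen -> benzene scaffold
    "c1ccc2ccccc2c1": "c1ccc2ccccc2c1",       # naphthalene
    "C1CCNCC1": "C1CCNCC1",                   # piperidine
    "CCO": "",                                # acyclic: empty scaffold
}
train, test = scaffold_split(mols, test_frac=0.4)
```

Because entire scaffold groups move together, no framework appears on both sides of the split, which is what makes the benchmark harder than a random split.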
| Variant | Test RMSE ↓ | Test MAE ↓ | Test R² ↑ | Pearson r ↑ | Spearman ρ ↑ | Train Time |
|---|---|---|---|---|---|---|
| V1: ECFP4 + GBM (baseline) | 1.3431 | 1.0835 | 0.1962 | 0.4439 | 0.4206 | 3 min |
| V2: ChemBERTa-77M (this model) | 1.3672 | 1.0982 | 0.1671 | 0.4093 | 0.3983 | 43 min |
| V3: ChemBERTa-77M + Hard Mining | 1.3923 | 1.1168 | 0.1362 | 0.4139 | 0.4046 | 72 min |
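The metrics in the tables above are standard regression and rank statistics. As a quick reference, here is a minimal pure-Python sketch of how RMSE, MAE, R², and Pearson r are defined (the actual evaluation presumably used SciPy/scikit-learn; Spearman ρ is just Pearson r computed on ranks):

```python
import math

def regression_metrics(y_true, y_pred):
    """RMSE, MAE, and R^2 as reported in the tables above."""
    n = len(y_true)
    errors = [p - t for p, t in zip(y_pred, y_true)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mean_t = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot  # 1 = perfect, 0 = no better than the mean
    return rmse, mae, r2

def pearson_r(x, y):
    """Linear correlation; Spearman rho applies this to the rank vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

An R² near 0.17 with Pearson r near 0.41 is consistent: the model captures a real but modest linear trend in pKd.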
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "omshrivastava/omnibimol-binding-affinity-chemberta"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Predict binding affinity for a single SMILES string
smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
inputs = tokenizer(smiles, return_tensors="pt")
with torch.no_grad():
    pKd = model(**inputs).logits.item()
print(f"Predicted pKd: {pKd:.2f}")  # higher = stronger binding

# Batch prediction
smiles_list = [
    "CC(=O)Oc1ccccc1C(=O)O",       # aspirin
    "c1ccccc1",                    # benzene
    "CC(C)Cc1ccc(cc1)C(C)C(=O)O",  # ibuprofen
]
inputs = tokenizer(smiles_list, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    pKd_values = model(**inputs).logits.squeeze(-1).tolist()
for s, v in zip(smiles_list, pKd_values):
    print(f"  {s}: pKd = {v:.2f}")
```
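The model's output is pKd, i.e. −log10 of the dissociation constant in molar, so predictions convert back to an absolute Kd with a one-liner (an illustrative helper, not part of the model API):

```python
def pkd_to_kd_nM(pkd):
    """Convert pKd (= -log10 of Kd in mol/L) to a dissociation constant in nM."""
    return 10 ** (-pkd) * 1e9

# pKd 7.0 corresponds to Kd = 100 nM; each +1 in pKd is 10x tighter binding.
print(f"pKd 7.0 -> Kd = {pkd_to_kd_nM(7.0):.0f} nM")
```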
Dataset columns: `seq` (protein sequence), `smiles_can` (canonical SMILES), `neg_log10_affinity_M` (pKd target).

@article{ahmad2022chemberta2,
title={ChemBERTa-2: Towards Chemical Foundation Models},
author={Ahmad, Walid and Simon, Elana and Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath},
journal={arXiv:2209.01712},
year={2022}
}
@article{glaser2020binding,
title={High-throughput virtual screening pipeline and binding affinity calculation},
author={Glaser, Jens and others},
year={2020},
note={ICML Workshop, dataset: jglaser/binding\_affinity}
}
@article{bemis1996murcko,
title={The properties of known drugs. 1. Molecular frameworks},
author={Bemis, Guy W and Murcko, Mark A},
journal={Journal of Medicinal Chemistry},
year={1996}
}
@article{benchmark2025molecular,
title={Benchmarking Pretrained Molecular Embedding Models},
author={Various},
journal={arXiv:2508.06199},
year={2025},
note={ECFP baselines competitive with pretrained models}
}
@article{bilodeau2024bapulm,
title={BAPULM: Binding Affinity Prediction using Language Models},
author={Bilodeau and others},
journal={arXiv:2411.04150},
year={2024}
}
This model is designed for the OmniBiMol binding prediction pipeline: