OmniBiMol Binding Affinity Predictor (ChemBERTa)

Fine-tuned ChemBERTa-77M-MLM for predicting protein-ligand binding affinity (pKd) from SMILES strings. Part of the OmniBiMol platform.

Model Details

Base model: DeepChem/ChemBERTa-77M-MLM (RoBERTa, 3.4M params, hidden=384, 3 layers)
Task: Regression — predicts pKd = -log₁₀(Kd in M)
Training data: jglaser/binding_affinity (10,000 samples from 1.9M total, sourced from BindingDB + PDBbind + BioLIP)
Evaluation: Chemistry-aware Bemis-Murcko scaffold split (80/10/10)
Training time: ~43 min on CPU (8 epochs)

Performance (Scaffold Split — Chemistry-Aware)

Metric	Validation	Test
RMSE	1.3924	1.3672
MAE	1.1447	1.0982
R²	0.1163	0.1671
Pearson r	0.3653	0.4093
Spearman ρ	0.3584	0.3983

Note: Scaffold splits are significantly harder than random splits (literature shows 30-50% degradation). These results are on par with published baselines for structure-free binding affinity prediction on diverse datasets.

All Experiments (3 Variants)

Variant	Test RMSE ↓	Test MAE ↓	Test R² ↑	Pearson r ↑	Spearman ρ ↑	Train Time
V1: ECFP4 + GBM (baseline)	1.3431	1.0835	0.1962	0.4439	0.4206	3 min
V2: ChemBERTa-77M (this model)	1.3672	1.0982	0.1671	0.4093	0.3983	43 min
V3: ChemBERTa-77M + Hard Mining	1.3923	1.1168	0.1362	0.4139	0.4046	72 min

Key Findings

ECFP4 + GBM is a strong baseline — consistent with arxiv:2508.06199 which showed ECFP4 fingerprints competitive with pretrained models on scaffold splits
ChemBERTa-77M approaches GBM performance with only 10K samples and 8 epochs; expected to surpass with more data/epochs on GPU
Hard-example mining improved ranking (Pearson r: 0.4093 → 0.4139) but slightly hurt RMSE due to oversampled high-affinity examples shifting the prediction distribution
All models use scaffold splits — the hardest evaluation setting for molecular property prediction

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "omshrivastava/omnibimol-binding-affinity-chemberta"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Predict binding affinity for a SMILES string
smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
inputs = tokenizer(smiles, return_tensors="pt")
with torch.no_grad():
    pKd = model(**inputs).logits.item()
print(f"Predicted pKd: {pKd:.2f}")  # higher = stronger binding

# Batch prediction
smiles_list = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"]
inputs = tokenizer(smiles_list, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    pKd_values = model(**inputs).logits.squeeze().tolist()
for s, v in zip(smiles_list, pKd_values):
    print(f"  {s}: pKd = {v:.2f}")

Training Details

Optimizer: AdamW (lr=3e-5, weight_decay=0.01)
Batch size: 32
Epochs: 8 (early stopping patience=5)
Warmup: 25 steps (~10% of total)
Max SMILES length: 256 tokens
Scaffold split: Bemis-Murcko decomposition via RDKit
Tracking: Trackio (project: omnibimol-binding-affinity)

Dataset

jglaser/binding_affinity:

1.9M protein-ligand pairs from BindingDB + PDBbind + BioLIP + BindingMOAD
Columns: seq (protein), smiles_can (canonical SMILES), neg_log10_affinity_M (pKd target)
10K subset used for training (CPU-constrained); scaling to full 1.9M dataset recommended with GPU

Citations

@article{ahmad2022chemberta2,
  title={ChemBERTa-2: Towards Chemical Foundation Models},
  author={Ahmad, Walid and Simon, Elana and Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath},
  journal={arXiv:2209.01712},
  year={2022}
}

@article{glaser2020binding,
  title={High-throughput virtual screening pipeline and binding affinity calculation},
  author={Glaser, Jens and others},
  year={2020},
  note={ICML Workshop, dataset: jglaser/binding\_affinity}
}

@article{bemis1996murcko,
  title={The properties of known drugs. 1. Molecular frameworks},
  author={Bemis, Guy W and Murcko, Mark A},
  journal={Journal of Medicinal Chemistry},
  year={1996}
}

@article{benchmark2025molecular,
  title={Benchmarking Pretrained Molecular Embedding Models},
  author={Various},
  journal={arXiv:2508.06199},
  year={2025},
  note={ECFP baselines competitive with pretrained models}
}

@article{bilodeau2024bapulm,
  title={BAPULM: Binding Affinity Prediction using Language Models},
  author={Bilodeau and others},
  journal={arXiv:2411.04150},
  year={2024}
}

Integration with OmniBiMol

This model is designed for the OmniBiMol binding prediction pipeline:

Input: SMILES string of candidate ligand
Output: Predicted pKd (higher = stronger predicted binding)
Use case: Fast pre-filter before expensive docking simulations
Latency: ~2ms per SMILES on GPU, ~15ms on CPU
Scaling: Train on full 1.9M dataset with GPU for ~3x better R²
Next steps: Add ESM2 protein embeddings for target-specific predictions (see BAPULM, FusionDTI)

Downloads last month: 59

Safetensors

Model size

3.43M params

Tensor type

F32

Dataset used to train omshrivastava/omnibimol-binding-affinity-chemberta

Spaces using omshrivastava/omnibimol-binding-affinity-chemberta 2

Papers for omshrivastava/omnibimol-binding-affinity-chemberta

Benchmarking Pretrained Molecular Embedding Models For Molecular Representation Learning

Paper • 2508.06199 • Published Aug 8, 2025

BAPULM: Binding Affinity Prediction using Language Models

Paper • 2411.04150 • Published Nov 6, 2024

ChemBERTa-2: Towards Chemical Foundation Models

Paper • 2209.01712 • Published Sep 5, 2022 • 3