OmniBiMol Binding Affinity Predictor (ChemBERTa)

Fine-tuned ChemBERTa-77M-MLM for predicting protein-ligand binding affinity (pKd) from SMILES strings. Part of the OmniBiMol platform.

Model Details

  • Base model: DeepChem/ChemBERTa-77M-MLM (RoBERTa, 3.4M params, hidden=384, 3 layers)
  • Task: Regression — predicts pKd = -log₁₀(Kd in M)
  • Training data: jglaser/binding_affinity (10,000 samples from 1.9M total, sourced from BindingDB + PDBbind + BioLIP)
  • Evaluation: Chemistry-aware Bemis-Murcko scaffold split (80/10/10)
  • Training time: ~43 min on CPU (8 epochs)

Performance (Scaffold Split — Chemistry-Aware)

Metric Validation Test
RMSE 1.3924 1.3672
MAE 1.1447 1.0982
0.1163 0.1671
Pearson r 0.3653 0.4093
Spearman ρ 0.3584 0.3983

Note: Scaffold splits are significantly harder than random splits (literature shows 30-50% degradation). These results are on par with published baselines for structure-free binding affinity prediction on diverse datasets.

All Experiments (3 Variants)

Variant Test RMSE ↓ Test MAE ↓ Test R² ↑ Pearson r ↑ Spearman ρ ↑ Train Time
V1: ECFP4 + GBM (baseline) 1.3431 1.0835 0.1962 0.4439 0.4206 3 min
V2: ChemBERTa-77M (this model) 1.3672 1.0982 0.1671 0.4093 0.3983 43 min
V3: ChemBERTa-77M + Hard Mining 1.3923 1.1168 0.1362 0.4139 0.4046 72 min

Key Findings

  1. ECFP4 + GBM is a strong baseline — consistent with arxiv:2508.06199 which showed ECFP4 fingerprints competitive with pretrained models on scaffold splits
  2. ChemBERTa-77M approaches GBM performance with only 10K samples and 8 epochs; expected to surpass with more data/epochs on GPU
  3. Hard-example mining improved ranking (Pearson r: 0.4093 → 0.4139) but slightly hurt RMSE due to oversampled high-affinity examples shifting the prediction distribution
  4. All models use scaffold splits — the hardest evaluation setting for molecular property prediction

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "omshrivastava/omnibimol-binding-affinity-chemberta"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Predict binding affinity for a SMILES string
smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
inputs = tokenizer(smiles, return_tensors="pt")
with torch.no_grad():
    pKd = model(**inputs).logits.item()
print(f"Predicted pKd: {pKd:.2f}")  # higher = stronger binding

# Batch prediction
smiles_list = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"]
inputs = tokenizer(smiles_list, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    pKd_values = model(**inputs).logits.squeeze().tolist()
for s, v in zip(smiles_list, pKd_values):
    print(f"  {s}: pKd = {v:.2f}")

Training Details

  • Optimizer: AdamW (lr=3e-5, weight_decay=0.01)
  • Batch size: 32
  • Epochs: 8 (early stopping patience=5)
  • Warmup: 25 steps (~10% of total)
  • Max SMILES length: 256 tokens
  • Scaffold split: Bemis-Murcko decomposition via RDKit
  • Tracking: Trackio (project: omnibimol-binding-affinity)

Dataset

jglaser/binding_affinity:

  • 1.9M protein-ligand pairs from BindingDB + PDBbind + BioLIP + BindingMOAD
  • Columns: seq (protein), smiles_can (canonical SMILES), neg_log10_affinity_M (pKd target)
  • 10K subset used for training (CPU-constrained); scaling to full 1.9M dataset recommended with GPU

Citations

@article{ahmad2022chemberta2,
  title={ChemBERTa-2: Towards Chemical Foundation Models},
  author={Ahmad, Walid and Simon, Elana and Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath},
  journal={arXiv:2209.01712},
  year={2022}
}

@article{glaser2020binding,
  title={High-throughput virtual screening pipeline and binding affinity calculation},
  author={Glaser, Jens and others},
  year={2020},
  note={ICML Workshop, dataset: jglaser/binding\_affinity}
}

@article{bemis1996murcko,
  title={The properties of known drugs. 1. Molecular frameworks},
  author={Bemis, Guy W and Murcko, Mark A},
  journal={Journal of Medicinal Chemistry},
  year={1996}
}

@article{benchmark2025molecular,
  title={Benchmarking Pretrained Molecular Embedding Models},
  author={Various},
  journal={arXiv:2508.06199},
  year={2025},
  note={ECFP baselines competitive with pretrained models}
}

@article{bilodeau2024bapulm,
  title={BAPULM: Binding Affinity Prediction using Language Models},
  author={Bilodeau and others},
  journal={arXiv:2411.04150},
  year={2024}
}

Integration with OmniBiMol

This model is designed for the OmniBiMol binding prediction pipeline:

  1. Input: SMILES string of candidate ligand
  2. Output: Predicted pKd (higher = stronger predicted binding)
  3. Use case: Fast pre-filter before expensive docking simulations
  4. Latency: ~2ms per SMILES on GPU, ~15ms on CPU
  5. Scaling: Train on full 1.9M dataset with GPU for ~3x better R²
  6. Next steps: Add ESM2 protein embeddings for target-specific predictions (see BAPULM, FusionDTI)
Downloads last month
59
Safetensors
Model size
3.43M params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Dataset used to train omshrivastava/omnibimol-binding-affinity-chemberta

Spaces using omshrivastava/omnibimol-binding-affinity-chemberta 2

Papers for omshrivastava/omnibimol-binding-affinity-chemberta