Predicting binding affinity

May 22

I wanted to share a small personal project I have been working on: an end-to-end machine learning pipeline for protein–ligand binding prediction.

The question I wanted to explore was:

Can protein foundation model embeddings, combined with molecular fingerprints, help predict whether a small molecule is likely to bind a protein?

In this project, I used ESM-2 to generate protein sequence embeddings and combined them with ligand representations such as Morgan fingerprints, MACCS keys, atom-pair fingerprints, and RDKit molecular descriptors. These combined protein–ligand features were then used to train binary classifiers for binding prediction based on KIBA scores.
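As a rough illustration of how such pair features can be assembled, here is a minimal sketch, not the project's actual code. It assumes each protein is reduced to one fixed-length vector by mean-pooling per-residue embeddings (a common choice, but an assumption here) and that each ligand already has a fingerprint vector. The tiny vectors and the names `mean_pool`, `build_pair_features`, `protein_emb`, and `ligand_fp` are hypothetical placeholders; real ESM-2 embeddings and Morgan fingerprints are far longer and would come from the ESM-2 model and RDKit.

```python
def mean_pool(residue_vectors):
    """Average per-residue vectors into one fixed-length protein vector."""
    n = len(residue_vectors)
    return [sum(column) / n for column in zip(*residue_vectors)]


def build_pair_features(pairs, protein_emb, ligand_fp):
    """Concatenate the cached protein and ligand vectors for each pair."""
    return [list(protein_emb[prot]) + list(ligand_fp[lig]) for prot, lig in pairs]


# Placeholder per-residue embeddings for one protein (2 residues, 2 dims).
protein_emb = {"P0": mean_pool([[0.5, 1.0], [1.5, 0.0]])}

# Placeholder 4-bit fingerprints for two ligands.
ligand_fp = {"L0": [1, 0, 1, 1], "L1": [0, 0, 1, 0]}

pairs = [("P0", "L0"), ("P0", "L1")]
X = build_pair_features(pairs, protein_emb, ligand_fp)
print(len(X), len(X[0]))  # number of pairs, feature length per pair
```

Computing each protein and ligand vector once and looking it up per pair avoids re-embedding the same protein for every ligand it is paired with.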

One of the biggest lessons I learned was that evaluation strategy matters as much as model choice.

With a random split, the same proteins or closely related ligands can appear in both the training and test sets, so the model benefits from leakage. That can make performance look stronger than it really is.
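To make the leakage concrete, here is a small sketch on toy data (hypothetical protein and ligand IDs, not the KIBA dataset). It randomly splits a dense grid of pairs and reports what fraction of test pairs involve a protein or ligand that also appears in training. The helper name `overlap_report` is my own.

```python
import random


def overlap_report(train_pairs, test_pairs):
    """Fraction of test pairs whose protein / ligand also occurs in training."""
    train_proteins = {prot for prot, _ in train_pairs}
    train_ligands = {lig for _, lig in train_pairs}
    n = len(test_pairs)
    seen_protein = sum(prot in train_proteins for prot, _ in test_pairs)
    seen_ligand = sum(lig in train_ligands for _, lig in test_pairs)
    return {"protein_seen": seen_protein / n, "ligand_seen": seen_ligand / n}


# Toy data: 20 proteins x 50 ligands, every pair present.
pairs = [(f"P{i}", f"L{j}") for i in range(20) for j in range(50)]

rng = random.Random(0)
rng.shuffle(pairs)
cut = int(0.8 * len(pairs))  # plain random 80/20 split over pairs
train_pairs, test_pairs = pairs[:cut], pairs[cut:]

report = overlap_report(train_pairs, test_pairs)
print(report)
```

On a dense toy grid like this, both fractions come out at or very near 1.0: almost every test pair reuses a protein and a ligand the model has already seen.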

To make the evaluation more realistic, I tested several split strategies:

• cold-protein: unseen proteins in test

• cold-ligand: unseen ligands in test

• scaffold split: unseen chemical scaffolds

• cold-both: unseen proteins and ligands
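The four strategies above can be sketched with one idea: hold out whole groups instead of individual pairs. This is a minimal illustration on toy IDs, not the project's implementation. The function names, the 20% test fraction, and the `scaffold_of` mapping are assumptions; in practice the scaffold for each ligand would be computed from its structure (for example with RDKit's Murcko scaffold utilities). In the cold-both sketch, pairs that mix a held-out entity with a training entity are simply dropped, which is one possible convention.

```python
import random


def cold_split(pairs, key, test_frac=0.2, seed=0):
    """Hold out whole groups: all pairs sharing a held-out key go to test."""
    groups = sorted({key(pair) for pair in pairs})
    random.Random(seed).shuffle(groups)
    held_out = set(groups[: max(1, round(test_frac * len(groups)))])
    train = [pair for pair in pairs if key(pair) not in held_out]
    test = [pair for pair in pairs if key(pair) in held_out]
    return train, test


def cold_both_split(pairs, test_frac=0.2, seed=0):
    """Test pairs have an unseen protein AND an unseen ligand; mixed pairs are dropped."""
    rng = random.Random(seed)
    proteins = sorted({prot for prot, _ in pairs})
    ligands = sorted({lig for _, lig in pairs})
    rng.shuffle(proteins)
    rng.shuffle(ligands)
    test_p = set(proteins[: max(1, round(test_frac * len(proteins)))])
    test_l = set(ligands[: max(1, round(test_frac * len(ligands)))])
    train = [(p, l) for p, l in pairs if p not in test_p and l not in test_l]
    test = [(p, l) for p, l in pairs if p in test_p and l in test_l]
    return train, test


# Toy data: 20 proteins x 50 ligands; every 5 ligands share a toy scaffold.
pairs = [(f"P{i}", f"L{j}") for i in range(20) for j in range(50)]
scaffold_of = {f"L{j}": f"S{j // 5}" for j in range(50)}  # hypothetical mapping

cp_train, cp_test = cold_split(pairs, key=lambda pair: pair[0])               # cold-protein
cl_train, cl_test = cold_split(pairs, key=lambda pair: pair[1])               # cold-ligand
sc_train, sc_test = cold_split(pairs, key=lambda pair: scaffold_of[pair[1]])  # scaffold
cb_train, cb_test = cold_both_split(pairs)                                    # cold-both

print(len(cp_test), len(cl_test), len(sc_test), len(cb_test))
```

Note that cold-both shrinks the usable data: pairs with exactly one held-out entity belong to neither set, so the test set is much smaller than in the other splits.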

This changed how I thought about the problem. The key question is not only “Can the model predict similar known pairs?” but also “Can it generalize to new proteins and new molecules?”

I compared several downstream models, including Logistic Regression, Random Forest, XGBoost, LightGBM, and an interaction MLP. Interestingly, with good representations and careful splitting, several model classes performed comparably well.

My main takeaway:

In biology and drug discovery ML, representation quality and evaluation design can matter as much as the final model architecture.

Foundation models can provide powerful protein representations, while molecular fingerprints capture complementary chemical information. Yet without careful dataset construction and leakage-aware testing, it is easy to overestimate model quality.

As foundation models become more widely used in biology, I think this point will become increasingly important: we need to test whether models are learning generalizable biology and chemistry rather than just memorizing patterns in the dataset.

Project link: https://lnkd.in/gh6Ymgu2

#MachineLearning #DrugDiscovery #ComputationalBiology #Bioinformatics #ProteinLanguageModels #FoundationModels #Cheminformatics #AIforScience

Anatoly Buchin
Computational Biology | Machine Learning | Data Science