Project 02 Structural Bioinformatics · Machine Learning

Ligand Binding Site Predictor

A machine learning model that predicts potential ligand binding pockets on protein 3D structures, aiming to accelerate early-stage drug discovery by identifying interaction sites automatically.

Python · PyTorch · BioPython · PyMOL · AlphaFold

Motivation

Identifying where a small molecule (ligand) binds on a protein is one of the first and most critical steps in drug development. Traditionally this relies on expensive wet-lab experiments or computationally heavy docking simulations. This project explores whether a machine learning approach can predict binding pockets directly from protein structure: quickly, cheaply and at scale.

The intersection of structural biology and machine learning is exactly the kind of problem I find most exciting. It requires understanding the biology deeply enough to engineer the right features, then choosing a model architecture that can extract patterns that aren't visible to the naked eye.

Approach

Parse PDB/mmCIF protein structure files with BioPython
Represent protein surface as a 3D point cloud or voxel grid
Train an ML model to classify residues as pocket / non-pocket
Evaluate against known binding sites from the sc-PDB database
Visualise predicted pockets in Chimera
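
As a rough illustration of the labelling and evaluation steps above, here is a minimal sketch in plain Python. It assumes a residue counts as "pocket" if any of its atoms lies within 4 Å of a ligand atom, which is a common convention but not necessarily the cutoff this project uses. The `Atom` tuple, the function names and the toy coordinates are all hypothetical; in practice the coordinates would come from a parsed PDB/mmCIF file.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    """Hypothetical minimal atom record: chain ID, residue number, xyz in Å."""
    chain: str
    resseq: int
    coord: tuple[float, float, float]


def label_pocket_residues(protein_atoms, ligand_coords, cutoff=4.0):
    """Return the set of (chain, resseq) with any atom within `cutoff` Å of the ligand.

    The 4 Å default is an assumed convention, not a value taken from the project.
    """
    pocket = set()
    for atom in protein_atoms:
        key = (atom.chain, atom.resseq)
        if key in pocket:
            continue  # residue already labelled by an earlier atom
        if any(math.dist(atom.coord, lig) <= cutoff for lig in ligand_coords):
            pocket.add(key)
    return pocket


def precision_recall(predicted, actual):
    """Per-residue precision and recall between two sets of residue keys."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Toy data: three residues along the x-axis and a single ligand atom.
protein = [
    Atom("A", 1, (0.0, 0.0, 0.0)),
    Atom("A", 1, (1.5, 0.0, 0.0)),
    Atom("A", 2, (3.0, 0.0, 0.0)),
    Atom("A", 3, (10.0, 0.0, 0.0)),
]
ligand = [(5.0, 0.0, 0.0)]

true_pocket = label_pocket_residues(protein, ligand)
predicted_pocket = {("A", 2), ("A", 3)}  # stand-in for thresholded model output
precision, recall = precision_recall(predicted_pocket, true_pocket)
```

Working at the residue level keeps the labels aligned with the pocket / non-pocket classification target. A real evaluation would likely also report pocket-level metrics, since predicted pockets are usually ranked and compared as whole sites.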

Tech Stack

Language: Python 3.10+
Structure Parsing: BioPython, MDAnalysis
Structures: PDB, AlphaFold2 predictions
Visualisation: Chimera, Matplotlib

Scientific Background

Proteins are large, folded molecules whose surfaces contain cavities and grooves. Some of these are functional binding sites where small molecules (drugs, cofactors, substrates) can bind and modulate the protein's activity. The geometry and chemical environment of these pockets determine binding specificity.

This model takes a 3D protein structure and outputs a probability score for each surface residue indicating how likely it is to belong to a binding pocket. The features fed to the network include local geometry (curvature, solvent accessibility), amino acid physicochemical properties and neighbourhood context encoded through graph convolutions.
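
To make "neighbourhood context" concrete, the sketch below builds a residue contact graph and applies one parameter-free averaging step over it. This is only the aggregation idea behind a graph convolution: a real layer would add learned weights and a non-linearity, and this is not the project's actual architecture. The 8 Å cutoff, the one-point-per-residue representation and the function names are assumptions made for illustration.

```python
import math


def build_contact_graph(coords, cutoff=8.0):
    """Adjacency lists linking residues whose representative points lie within `cutoff` Å.

    `coords` holds one point per residue (for example the C-alpha position);
    the 8 Å default is an assumed, commonly used contact threshold.
    """
    n = len(coords)
    neighbours = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                neighbours[i].append(j)
                neighbours[j].append(i)
    return neighbours


def mean_aggregate(features, neighbours):
    """Average each residue's feature vector with those of its graph neighbours."""
    aggregated = []
    for i, nbrs in enumerate(neighbours):
        group = [features[i]] + [features[j] for j in nbrs]
        dim = len(features[i])
        aggregated.append(
            [sum(vec[d] for vec in group) / len(group) for d in range(dim)]
        )
    return aggregated


# Toy example: four residues on a line, two made-up features per residue.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

graph = build_contact_graph(coords)
smoothed = mean_aggregate(features, graph)
```

After one step each residue's features reflect its immediate spatial neighbours; stacking several such steps widens the context a residue can see, which is why graph-based models suit pocket prediction.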

Status & Next Steps

The project is currently in active development as part of my Master's research at UPF-UB. The data pipeline is complete and baseline models are being trained. Upcoming work includes:

Benchmarking against established tools (fpocket, SiteMap)
Exploring graph neural network architectures for better spatial reasoning
Testing on AlphaFold2-predicted structures to assess generalisation
Building a small web demo for interactive pocket visualisation