AlphaFold and Similar AI Models Go All Atoms, Paving New Roads to Drug Development

Advancing RoseTTAFold All-Atom and the next generation of AlphaFold

Understanding interactions between proteins and small molecules is crucial for advancing fundamental biology and drug discovery. Experiments are reliable but slow and expensive, whereas computational methods are cheap but struggle with accuracy, which is prompting a shift towards AI-driven solutions. Here we present three cutting-edge models for protein-ligand complex prediction, built on the technologies that power AlphaFold 2 and similar software for protein structure prediction. Read on to learn how these models work, how they perform, their pros and cons compared to traditional software, the challenges ahead, and how we at Nexco see them reshaping computational drug design and discovery and accelerating pharma research.

Ibuprofen (magenta carbons) bound to serum albumin as in PDB 2BXG, with two central electrostatic interactions shown as black dashed lines.

A detailed understanding of the interactions between proteins and small molecules is paramount for fundamental biology and critical for pharmaceutical research, drug discovery, and drug development. Developing a new drug is not only intellectually challenging but also demands substantial money and time, often stretching over a decade and costing billions of dollars. Accelerating or automating any part of the process can therefore help to shorten it.

One important part of the workflow that goes from new protein targets, and ideas on how to modulate them, to new small-molecule therapeutics (or simply to understanding new biology, in fundamental research) is getting to know the chemical and physical details of the interactions that the drug, lead, or candidate small molecule establishes with the protein of interest. Traditionally, this is done either by solving experimental structures of the protein bound to the small molecule, or by predicting the three-dimensional structure of the protein-small molecule complex. The former can be challenging and expensive, and is certainly very low-throughput. The latter can be extremely fast but suffers from several problems: low accuracy, unclear effects of the different forces driving binding, reliance on experimental structures or accurate models of the target protein, bias towards starting structures that might be far from their bound forms, and other more technical challenges.

If a computer-based system could assess the binding of a protein-small molecule pair with high confidence and speed, it would unlock a whole new way to do biology, akin to what AlphaFold 2 meant for regular protein structure prediction. In the quest for such computational solutions, leading academic and private laboratories pursuing AI models for chemistry and biology have independently come up with 3 new AI programs. Without going into details, these programs use AI architectures inspired by those that make up AlphaFold 2, but rewired and extended with modules that take as input not only protein sequences but also string-based representations of small molecules, specifically SMILES. Just as the output of AlphaFold 2 or similar programs like RoseTTAFold is a set of atom coordinates describing the 3D structure of the protein, the output of these new models is a set of atom coordinates for the protein and the small molecule, which together describe the 3D structure of the complex formed by the two molecules.

Let’s quickly review these 3 new models.

1. RoseTTAFold All-Atom (RFAA) and its Approach to Comprehensive Modeling of Biological Assemblies

Just over two months ago, a preprint from the Institute for Protein Design led by Prof. David Baker, who developed RoseTTAFold as a response to DeepMind’s AlphaFold 2, introduced a new deep learning network called RoseTTAFold All-Atom (RFAA). RFAA goes beyond protein and protein-small molecule modeling: it can handle biological assemblies that combine proteins with DNA, RNA, small molecules, and metals, and it even considers covalent modifications of amino acids.

For protein-small molecule docking in particular, and like the 2 other new methods described here, RFAA “folds” the protein concurrently with the binding of the small molecule. It thus accounts for flexibility in both the protein and the small molecule, and it is not biased by any starting structure.

RFAA’s core mechanism combines sequence-based descriptions of proteins and nucleic acids with atomic graph representations of small molecules and amino acid modifications. To do this it builds on the 3-track neural network on which RoseTTAFold is based, here modified so that the 1D track can process not only protein sequences but also DNA sequences and SMILES representations of small molecules, ions, and post-translational modifications.

Pushing RFAA even further, the preprint also presents a diffusion-based model that lets the system design proteins that can accommodate small molecules, DNA, metal ions, and more. We will soon cover this in an article dedicated to modern AI tools for protein design.

2. Next Generation AlphaFold — Expanding Beyond Proteins

Just after the preprint describing RFAA was out, DeepMind posted a blog article describing the next generation of AlphaFold that it is developing to tackle the problem of modeling protein-small molecule complexes.

During 2020 DeepMind developed AlphaFold 2, a revolutionary, first-of-its-kind AI program for protein structure prediction that changed biology forever. In the blog post they present an evolved version of AlphaFold 2 that can predict protein structures bound to small molecules. Although no details were provided about exactly how it works, the results shown are quite impressive. They suggest that the technology could soon be good enough to replace, or at least complement, traditional methods for protein-small molecule docking. According to the post it already performs better than many of them, with the added advantage that it doesn’t require any input protein structure, because the protein is predicted together with the binding pose of the small molecule.

3. Umol — Predicting Flexible Protein-Ligand Structures

Another preprint, published just after the two works above, this time from the Bryant and Noé labs (the latter with a position at Microsoft Research), describes Umol.

The Umol (Universal molecular) AI system likewise predicts the fully flexible, all-atom structure of protein-ligand complexes directly from a multiple sequence alignment of the protein and a SMILES string representing the ligand. An easy-to-use Google Colab notebook shows the workflow: the user enters the sequence of the target protein, a SMILES string encoding the small molecule, a set of residues expected to be involved in binding it, and a few other parameters, and then starts the modeling process, which can take a few minutes to run.

According to the preprint presenting Umol, it surpasses RFAA and classical docking methods at specific accuracy thresholds. No comparisons are presented against DeepMind’s new program because it is not available for use.
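To make "accuracy thresholds" concrete, here is a minimal sketch of how docking accuracy is often scored: the root-mean-square deviation (RMSD) between predicted and experimental ligand atom positions, and the fraction of complexes whose RMSD falls at or below a cutoff. The 2 Å cutoff is a common convention in docking benchmarks that we assume here for illustration, not a value taken from this post, and the function names are our own. The sketch also assumes that both coordinate sets list the same atoms in the same order and that the complexes were already superposed on the protein.

```python
import math

def ligand_rmsd(predicted, reference):
    """RMSD in the units of the input coordinates (typically angstroms).

    Both arguments are sequences of (x, y, z) tuples for the same ligand
    atoms in the same order; no superposition is performed here.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (p - r) ** 2
        for atom_p, atom_r in zip(predicted, reference)
        for p, r in zip(atom_p, atom_r)
    )
    return math.sqrt(squared / len(predicted))

def success_rate(rmsds, threshold=2.0):
    """Fraction of predictions with RMSD at or below the threshold.

    The default of 2.0 is an assumed, commonly used cutoff.
    """
    if not rmsds:
        raise ValueError("need at least one RMSD value")
    return sum(r <= threshold for r in rmsds) / len(rmsds)

if __name__ == "__main__":
    # Toy two-atom ligand, shifted by 1 unit along z in the prediction.
    pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    ref = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    print(ligand_rmsd(pred, ref))
    # Made-up RMSD values for four hypothetical complexes.
    print(success_rate([0.5, 1.9, 2.0, 3.5]))
```

Reporting the success rate at several cutoffs, rather than a single one, shows whether a method produces near-native poses or only roughly correct placements.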

Pros and Cons over Traditional Approaches

In all 3 all-atom structure prediction programs presented above, the protein is provided as a sequence or multiple sequence alignment, and the ligand as a SMILES string. The latter is a string-based representation of the atomic composition, connectivity, and chirality of the molecule. It is a compact and practical format, but what matters most for ligand docking and virtual screening is that it imposes no starting conformation. This allows fully unconstrained exploration of the molecule’s flexibility together with the flexibility of the target protein as its structure is predicted. Such unbiased, on-the-fly sampling of conformations for both the ligand and the protein contrasts strongly with regular ligand docking programs, which ask the user to input both a protein structure and a starting structure for the small molecule and then sample conformations from there. In these traditional methods, if the starting structures of the free ligand and free protein are too different from those of the bound form, the docking procedure has a high chance of failing.
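As a small illustration of what a SMILES string does and does not contain, the sketch below tallies the heavy (non-hydrogen) atoms of ibuprofen, the ligand shown in the figure at the top of this post, straight from its SMILES. The tokenizer is our own simplification that covers only a subset of the notation (bracket atoms, Cl, Br, and the common one-letter organic atoms); it is not how any of the three models parse their input, and a real project would use a cheminformatics toolkit instead.

```python
import re
from collections import Counter

# Atom tokens for a small SMILES subset. Bonds, ring-closure digits and
# branch parentheses are skipped, so connectivity is not parsed here.
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# Element symbol inside a bracket atom, after an optional isotope number.
BRACKET_ELEMENT = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")

def heavy_atom_counts(smiles: str) -> Counter:
    """Count non-hydrogen atoms per element in a (simple) SMILES string."""
    counts = Counter()
    for token in ATOM_TOKEN.findall(smiles):
        if token.startswith("["):
            match = BRACKET_ELEMENT.match(token)
            if match is None:
                continue  # e.g. wildcard atoms, ignored in this sketch
            symbol = match.group(1)
        else:
            symbol = token
        symbol = symbol.capitalize()  # aromatic "c" counts as carbon "C"
        if symbol != "H":
            counts[symbol] += 1
    return counts

if __name__ == "__main__":
    ibuprofen = "CC(C)Cc1ccc(cc1)C(C)C(=O)O"
    counts = heavy_atom_counts(ibuprofen)
    print(dict(counts), "total heavy atoms:", sum(counts.values()))
```

Note that nothing in the string is a coordinate: the same 26 characters describe every conformation of ibuprofen, which is why a SMILES input cannot bias a model towards any starting pose.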

On the downside, with the information available at the moment it seems that the new AI models cannot use experimental structures of the target proteins and rely entirely on predicting them. This will likely change as the methods are developed further.

Another downside of the new AI models is that they are far slower than regular docking methods. For virtual screening, where large numbers of molecules are docked on the target protein to fish out promising drug leads, they would need not only to perform as well as or better than traditional approaches but also to run much faster than they do today, probably by orders of magnitude.
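A back-of-the-envelope estimate shows why speed matters so much for screening. In the sketch below every number is a hypothetical assumption chosen for illustration (library size, seconds per complex, number of parallel workers); none is a measured benchmark of any of the three models or of any docking program.

```python
SECONDS_PER_DAY = 86_400

def screening_days(n_ligands: int, seconds_per_complex: float, n_workers: int = 1) -> float:
    """Wall-clock days needed to model every ligand once against one target,
    assuming perfect parallel scaling across n_workers."""
    if n_ligands < 0 or seconds_per_complex < 0 or n_workers < 1:
        raise ValueError("invalid screening parameters")
    return n_ligands * seconds_per_complex / n_workers / SECONDS_PER_DAY

if __name__ == "__main__":
    library_size = 1_000_000  # hypothetical screening library
    workers = 8               # hypothetical number of GPUs or CPU nodes
    # Assumed 3 minutes per complex, in line with "a few minutes" per run.
    print(f"{screening_days(library_size, 180.0, workers):.1f} days")
    # Assumed 1 second per ligand, a purely illustrative faster method.
    print(f"{screening_days(library_size, 1.0, workers):.1f} days")
```

Under these assumptions the first scenario takes the better part of a year while the second finishes in under two days, which is the kind of gap that "orders of magnitude" refers to.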

The Future of AI-based Molecular Docking and Virtual Screening, and how we at Nexco can put it to work for you

The accuracy of these new methods for protein-small molecule docking is not outstanding, but it is slightly above that of freely available traditional docking software, with the advantage of not requiring any input structures. They may already be useful today to expand or complement computational studies of protein-small molecule binding, but they are too slow to replace traditional virtual screening.

Despite these limitations, the relevance of these new works lies in proving that AI architectures can be adapted to the problem. They can parse the information required to tackle it, blend inputs of different formats (small molecule structures and protein sequences) in tractable ways, process them end-to-end to output atomic coordinates of protein-small molecule complexes, and already perform decently compared to traditional alternatives.

Given their promising start and the fierce competition (the three works presented here were all published within less than a month), it looks like we will see these new AI methods in our toolbox sooner rather than later, bringing AI-based drug discovery one step closer to reality. We at Nexco are following all developments closely and integrating open versions of these new tools into our pipelines.

References

Preprint and blog post presenting RFAA:

A more technical summary of how RFAA and RFAA-Diffusion (the latter not described in this post) work:

New AI Method for Protein Structure Prediction Handles All Kinds of Biologically Relevant Molecules

Blog post presenting DeepMind’s advances:

A glimpse of the next generation of AlphaFold

Preprint presenting Umol:

Structure prediction of protein-ligand complexes from sequence information with Umol

Monday, Jan 8, 2024, 12:44 PM

drug-discovery, computational-biology, protein, alphafold
