The Limitations of Protein-Ligand Co-folding with AlphaFold 3, Unveiled

A small molecule docked into a protein, rendered from PDB ID 4OBE.

The arrival of AlphaFold 3 generated considerable excitement in the scientific community, especially because it is a “multimodal” AI system that can predict not only the structures of proteins but also those of nucleic acids, small molecules, ions, and biomolecular assemblies containing all these components. AF3 was extensively evaluated in CASP16, where it was used as a “baseline predictor”; we covered all the main (and very interesting!) results earlier this year here:

Peeking Into CASP16 and the Future of Biomolecular Structure Prediction

Protein-ligand modeling via co-folding

One particularly enticing capability of AlphaFold 3 (and of similar tools like RoseTTAFold All-Atom, Boltz, Chai, etc.) is predicting the structures of proteins and nucleic acids together with small molecules by folding and docking them simultaneously, a.k.a. “co-folding”. This technology has the potential to revolutionize drug discovery by accelerating the identification of small molecules that alter protein function and could thus work as novel therapeutic drugs. The co-folding approach drew considerable attention at the recent CASP16 benchmark, as we covered in the post above and as the official CASP assessments, now practically all out in peer-reviewed form, make clear.

The CASP16 results were especially promising because, at least as presented, co-folding with AlphaFold 3 seemed to outperform many specialized docking tools. However, a series of new studies has dampened the hype by unveiling several limitations of the approach, at least in its current state. In brief, these analyses suggest that while AlphaFold 3 is certainly a technological leap, it may rely too heavily on memorization, with little or no grasp of the fundamental physics of molecular interactions, which places substantial limits on its practical applicability. Most likely the same holds for all other multimodal AI systems for molecular modeling.

A Question of Physics

We will cover here two articles that, at the very least, cast doubt on AlphaFold 3’s capabilities for co-folding. First, a clever study (just out) that put the physical intuition of AlphaFold 3 and other co-folding models to the test using “adversarial examples”: the researchers introduced chemically plausible modifications to protein-ligand systems to see whether the models (not only AF3 but also RoseTTAFold All-Atom, Chai, and Boltz) would react in a physically realistic way.

The scientists computationally mutated key amino acids in a ligand’s binding pocket and tested AlphaFold 3 on the variants. The scenarios included removing a binding site (by mutating its residues to glycine, effectively erasing the specific side-chain interactions crucial for binding while keeping most of the structure the same), packing the binding site with bulky side chains (by mutating residues to phenylalanine, physically blocking the pocket), and inverting the binding site’s chemistry (by swapping side chains for others with opposite chemical properties). The sketch below illustrates the idea.
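
To make the setup concrete, here is a minimal Python sketch of how such adversarial variants could be generated from a wild-type sequence. The sequence, pocket positions, and scan logic here are illustrative inventions of ours, not the paper’s actual protocol.

```python
# Illustrative sketch: building adversarial pocket variants of a protein
# sequence for a co-folding stress test. The sequence and pocket positions
# are made up; the published study defines pockets from crystal structures.

WILD_TYPE = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
POCKET_POSITIONS = [10, 14, 33, 37]  # hypothetical 0-based binding-site indices

def mutate(sequence: str, positions: list[int], new_residue: str) -> str:
    """Return a copy of `sequence` with `new_residue` at each given position."""
    chars = list(sequence)
    for i in positions:
        chars[i] = new_residue
    return "".join(chars)

# Scenario 1: erase side-chain interactions (glycine scan of the pocket).
gly_variant = mutate(WILD_TYPE, POCKET_POSITIONS, "G")

# Scenario 2: sterically block the pocket with bulky phenylalanines.
phe_variant = mutate(WILD_TYPE, POCKET_POSITIONS, "F")

for name, seq in [("glycine scan", gly_variant), ("phenylalanine block", phe_variant)]:
    print(f"{name}: {seq}")
```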

In a physically accurate simulation, these changes should have prevented the ligand from binding in its original spot. However, AlphaFold 3 frequently placed the ligand in the exact same pose as it did with the unmodified protein. In some cases, this led to the prediction of impossible structures with severe steric clashes and overlapping atoms, which are simply unphysical.
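
A simple way to flag the kind of impossible geometry described above is to count protein-ligand heavy-atom pairs that sit unrealistically close. The sketch below uses a crude fixed distance cutoff; this is a common heuristic, and we do not claim it is the exact criterion used in the paper.

```python
import numpy as np

def count_clashes(protein_xyz: np.ndarray, ligand_xyz: np.ndarray,
                  cutoff: float = 2.0) -> int:
    """Count protein-ligand heavy-atom pairs closer than `cutoff` angstroms.

    A 2.0 A cutoff is a rough heuristic for a severe steric clash; careful
    analyses use per-element van der Waals radii instead.
    """
    # Pairwise distance matrix between all protein and ligand atoms.
    diff = protein_xyz[:, None, :] - ligand_xyz[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return int((dist < cutoff).sum())

# Toy example: one ligand atom placed almost on top of a protein atom.
protein = np.array([[0.0, 0.0, 0.0], [3.5, 0.0, 0.0]])
ligand = np.array([[0.5, 0.0, 0.0]])
print(count_clashes(protein, ligand))  # -> 1
```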

These stress tests show quite clearly that AlphaFold 3 did not learn physical principles such as electrostatics or van der Waals forces, nor even the notion of a steric clash. Instead, the model appears to recognize broader patterns in the protein’s sequence or global fold that it associates “by heart” with the cognate ligand, regardless of the specific, crucial interactions at the binding site.

For more about these interesting tests, check out the article reporting them in Nature Communications, but keep reading for our own critique of it:

Investigating whether deep learning models for co-folding learn the physics of protein-ligand interactions - Nature Communications

Memorization vs. Generalization

Another (earlier) critical study investigated whether co-folding methods have moved beyond simply memorizing the data they were trained on. To this end, the authors curated a large benchmark dataset of over 2,600 protein-ligand structures deposited in the Protein Data Bank (PDB) after the AF3 training cutoff.

Strikingly, and worryingly, the study observed a strong correlation between prediction accuracy and the similarity of a test case to an entry in the model’s training data. When a protein-ligand complex was highly similar to a known structure, AlphaFold 3 performed well. However, for genuinely novel systems (i.e. those of most interest in drug design efforts!) its performance dropped significantly.
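
The core analysis can be pictured as a rank correlation between each test case’s similarity to the training set and its pose accuracy, e.g. ligand RMSD. Here is a toy sketch with made-up numbers; the actual benchmark uses its own similarity measures and success criteria.

```python
from scipy.stats import spearmanr

# Hypothetical per-complex values: similarity of each test case to the
# closest training-set entry (0-1) and the predicted ligand RMSD (angstroms).
similarity = [0.95, 0.90, 0.80, 0.55, 0.40, 0.25, 0.15]
ligand_rmsd = [0.8, 1.1, 1.6, 3.2, 4.5, 6.0, 7.4]

# A strong negative rank correlation (high similarity -> low RMSD) is the
# memorization signature the study reports.
rho, pval = spearmanr(similarity, ligand_rmsd)
print(f"Spearman rho = {rho:.2f}, p = {pval:.3f}")
```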

The conclusion is stark: AlphaFold 3, and possibly the other tools capable of protein-ligand co-folding, rely largely on memorized ligand poses from their training sets. This severely limits their reliability for discovering drugs against new targets or with novel chemical scaffolds. The success seen with common molecules like cofactors and nucleotides seems to stem from their prevalence in the training data, not from a generalized understanding of molecular recognition, which is fully consistent with the paper discussed in the previous section. For more about this study, check the preprint here.

A More “Sober” View from CASP16

Even the more optimistic takeaways from the CASP16 assessment come with strong caveats that align with these critical findings. While the official CASP16 analysis showed that AlphaFold 3 outperformed all participants on a set of pharma-relevant targets, the official assessment acknowledges that the challenge is far from solved. More importantly, the assessors themselves warn that co-folding methods must be used with caution.

This paints a more nuanced picture of co-folding: CASP16 showcased its raw potential, but these deeper, targeted investigations have exposed foundational cracks. AlphaFold 3 is an incredibly powerful tool for generating hypotheses, but it is not the infallible oracle some had hoped for. And from what we have learnt, what it lacks to become a more reliable oracle is probably more explicit physics and chemistry.

The Role of Confidence Metrics

Without wishing to defend AlphaFold 3 and DeepMind, it is important to recognize that, despite the model’s lack of deep chemical and physical knowledge, the system does to some extent “realize” its own limitations. It does so by flashing warnings through its various confidence scores. In our view, evaluations and evaluators tend to overlook confidence metrics more than they should, despite their major relevance, which some works do highlight: see for example section 6 of this official CASP16 paper, which covers AlphaFold 3’s confidence scores in great detail.

Of the two articles we discussed throughout this post, the one in Nature Communications does look at scores related to ligand modeling, namely pTM and ipTM, reporting that they are high for the native ligand pocket and remain high after the modifications. However, the paper’s own tables (in the SI material) show substantial drops in confidence metrics for some cases. The other study, meanwhile, does not take the scores of the protein-ligand complexes into consideration at all, analyzing the predictions together as if the AI system were equally confident about them all.
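
In practice, this means that any re-analysis, and indeed any production use, should stratify predictions by confidence before drawing conclusions. Below is a minimal sketch assuming pTM/ipTM values have already been parsed from the model output; the thresholds are common rules of thumb, not official AlphaFold 3 guidance, and the example records are invented.

```python
# Minimal sketch: stratify co-folding predictions by model confidence
# before scoring them. Thresholds are illustrative conventions (ipTM > 0.8
# often read as confident, < 0.6 as unreliable), not official guidance.

predictions = [
    {"target": "native", "iptm": 0.86, "ligand_rmsd": 1.2},
    {"target": "gly_scan", "iptm": 0.58, "ligand_rmsd": 5.9},
    {"target": "phe_block", "iptm": 0.83, "ligand_rmsd": 0.9},  # confident yet clashing
]

def bucket(iptm: float) -> str:
    """Map an ipTM value to a coarse confidence band."""
    if iptm > 0.8:
        return "confident"
    if iptm >= 0.6:
        return "borderline"
    return "unreliable"

for p in predictions:
    print(f"{p['target']}: ipTM={p['iptm']:.2f} -> {bucket(p['iptm'])}, "
          f"ligand RMSD={p['ligand_rmsd']} A")
```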

We are not saying, by any means, that these observations invalidate the core conclusions of the two studies; rather, we try to give our readers the most complete and fair picture. As we have covered throughout this post, this picture includes three components: a view on the quality of the structural predictions themselves, as the two papers assessed; consideration of the confidence metrics, as we just stressed; and, as always with computer-based predictions, the key role of human expertise, which we turn to next.

Nexco’s Perspective: the Key Role of Expertise

The unveiling of AlphaFold 3’s limitations reinforces a core principle at Nexco: cutting-edge tools are only as powerful as the expertise wielding them. These studies clearly show that naively accepting the outputs of AI models without rigorous, physics-based validation can be misleading and costly.

The future of computational structural biology lies not in a blind faith in AI, but in a synergistic interaction between AI systems and human expertise and intuition. The predictions from models like AlphaFold 3 are invaluable starting points, but they must be critically assessed and refined using expert knowledge and validated with more traditional, physics-based methods like molecular dynamics simulations and free energy calculations, as well as with chemical intuition and, of course, with experimentation.

At Nexco, we understand that true innovation comes from integrating the latest AI-driven technologies with a deep fundamental understanding of biology and chemistry. By leveraging the strengths of each approach while being acutely aware of their limitations, we can provide robust, reliable insights to drive your research forward.

Want to harness the power of AI without falling into its traps? Check our bioinformatics services here and contact Nexco today to learn how our expert-guided computational strategies can turn promising predictions into validated success.
And to get the latest news, follow us on LinkedIn: https://www.linkedin.com/company/nexco-analytics
  • Monday, Nov. 17, 2025, 18:16
  • protein-folding, drug-discovery, alphafold, structural-biology