Peeking Into CASP16 and the Future of Biomolecular Structure Prediction
A review of the freshest results compiled from freely available CASP16 presentations, preprints, videos, rankings, and predictor abstracts.

Like many other branches of bioinformatics, biomolecular modeling is evolving at an unprecedented pace driven by Artificial Intelligence. At Nexco we are committed to staying at the forefront of these advancements, so we took the time to peek into CASP16’s first outputs (presentations, abstracts book, videos, preprints, rankings, models, etc.; all freely available online) to review the state of the art of biomolecular structure prediction as of early 2025.
CASP16 showed how cutting-edge deep learning methods, particularly AlphaFold 2 and AlphaFold 3, take the lead and are reshaping the landscape of structural biology, either alone or at the core of other protocols and pipelines. Yet there remain many challenges, open questions and caveats, including cases where AI methods aren’t (yet) the best.
Read this detailed blog post, pretty much a whole review, where we have distilled everything you need: to understand how AI, especially AlphaFold, powers most top performers yet remains capped in some areas; to learn about the use of co-folding AI methods for protein-ligand docking; and to see where the power and limitations of current molecular modeling tools stand. Or, broadly speaking, to stay up to date with the field of biomolecular structure prediction as benchmarked by the most important evaluation in the world and in its history.
Introduction and the Main Takeaways From CASP16
In late 2024 CASP held its biennial conference to present the results of its 16th edition, releasing all the results, models and rankings as well as the presentations, videos and abstracts contributed by the assessors and predictors. In the following weeks, predictors and assessors began to post preprints which, together with the other material, allowed us to peek into the event and advance some results, summarized and compiled here for you.
Let’s first glance at the key takeaways, and then explore them all a bit more deeply.
First, wonderfully but expectedly given the last two CASP editions, CASP16 found that no protein domain was incorrectly predicted, demonstrating just how reliable modern AI-driven methods have become at this level of resolution. In turn, full modeling of large multidomain proteins and assemblies isn’t perfect yet, and is actually quite limited for very complex topologies when no good templates are available. Still, there was some small improvement over CASP15, which in turn had improved over CASP14, when AlphaFold 2 broke through. Probably the most interesting point regarding the modeling of complexes came from a specific sub-experiment of CASP16, which found that knowing the right stoichiometry of a complex upfront helps substantially in the modeling efforts. This makes sense and is kind of expected, but had never been measured this well.
Yet, even with known stoichiometry, modeling of large multicomponent complexes remains quite a hard problem. And when the end goal requires highly accurate structural models, as in virtual drug screening, molecular docking, or computer-assisted drug design, expectations must be kept reasonable. This depends on the target: with the current state of the art, we can probably rely on molecular models for protein-ligand docking only if at least the protein region involved in ligand binding can be modeled with high local accuracy, for example thanks to very high local homology to a template.
Following along the track of docking ligands into proteins in silico, procedures based on “co-folding”, that is, the simultaneous prediction of a protein’s structure and the pose of a ligand bound to it (as enabled by the latest AI models, starting with RoseTTAFold All-Atom and then AlphaFold 3 and similar AI systems), were a very hot topic in CASP16. The main advantage of folding a protein and its ligand(s) together, compared to docking the ligand onto a given structure or model, is that we do not need to explicitly sample the conformations of the protein, which in its ligand-free form might actually be quite far from the bound form.
Notably, CASP16 had a special track dedicated to the problem of protein-ligand docking, with examples kindly provided by a pharma company, tackling not only the problem of molecular docking itself but also that of affinity prediction. These were rather difficult cases and the ligands involved looked like realistic drugs, enabling a quite informative experiment. Several interesting conclusions were drawn, the main one being that while some predictors and tools, including co-folding programs, can in many cases produce quite good structural models of the complexes, this isn’t consistent or fully reliable. In turn, affinity estimation is as of today terrible, to the extent that some intrinsic properties of the ligands, such as molecular weight, correlate with binding affinities better than any ad hoc prediction tool among those tested in CASP16.
Nucleic acids were also a hot topic, with the largest-ever set of targets including DNA and RNA molecules, alone or in complexes with each other as well as with proteins and other molecules. Some interesting conclusions came up along this track, which we can summarize by stating confidently that AI for RNA structure prediction is still not as good as for proteins, and that there is hope of good models only when good templates for homology modeling exist.
Let’s now dive a bit deeper into some CASP16 highlights. We stress that this whole post is based on free material released by the CASP website, predictors, and assessors: mainly the predictors’ abstracts and presentations, the rankings and results at the official website, the assessors’ presentations, and the assessors’ preprints, the latter two being the richest sources of information. You will find links to the most important resources as you read, and you can find all resources at the end of the post under References.
AlphaFold’s Continued Dominance, Especially for Domain Predictions — a “Solved Problem”
Probably the overarching theme of CASP16 is the demonstrated, continued supremacy of AlphaFold-based approaches; mainly of AlphaFold 2, because AlphaFold 3 had only just been released in full when the conference started. Still, we do know about its full impact by the time of writing this post, because the preprints on CASP16 results include it in their analyses. All groups that ranked above the raw AlphaFold 3 server used AlphaFold 2 and/or AlphaFold 3, and not just “off the shelf” but enhanced in different ways. The general winning strategies involved improving multiple sequence alignments (MSAs), either by expanding them and/or by cleaning them up carefully before feeding them to AlphaFold; however, some predictors noticed that this was slightly less critical than in CASP15. All this is backed up quantitatively by the following plots presented by the assessors, which show better modeling when AlphaFold models are used (left) and a slight improvement when MSAs are improved:

Careful selection of templates for domains, subunits and assemblies was also important, but even more so were enhanced sampling of many different models and strategies for scoring large numbers of models. Note that MSA and template optimization were only possible with AlphaFold 2 during the prediction rounds, because AlphaFold 3 was only available as a server back then. In turn, extended sampling and scoring were possible with both AlphaFold systems, although more conformational variability could be achieved with AlphaFold 2, whose parameters predictors could tune in various ways, while with AlphaFold 3 the only option was trying different random seeds. Now that AlphaFold 3 is available for local use and fully customizable, these strategies can work better, and the developers of structure prediction methods will surely exploit this in the future, together with the chance to add any ligand, of course.
With more sampling, AlphaFold works better
Regarding extensive sampling, one special component of CASP16 was the use of MassiveFold as a source of large numbers of models sampled with AlphaFold 2 for each target. CASP15 had shown that increasing the structural diversity of predictions led to quite some improvement, especially for complexes. Therefore, the goal of using this program in CASP16 was to provide the community with large numbers of predictions that they could use for further modeling, or simply to apply their model scoring protocols on. In addition to providing massive sampling data that increased the chances of producing better models, this allowed for a fairer competition, especially for groups with more limited resources; it also avoided having many groups burn hours of GPU work on the same type of computation, and it probably boosted the development of much-needed scoring protocols.
MassiveFold then carried out thousands of predictions with a diversity of parameters: exact neural network version, use of dropout, templates, recycles, etc. The models were used in CASP16 and in CAPRI, a related contest focused on protein-complex predictions. Like in the previous round of CASP, analysis of the MassiveFold models once the target structures were known indicated that the approach does indeed sample models more similar to the target than any other, especially for assemblies, sometimes even coming out on top with quite substantial improvements. However, the analyses unfortunately also revealed that scoring functions still fail to reliably identify such models.
Good doesn’t mean perfect, not even for domains
We finally note that although no domain fold was missed and we can accept that overall fold topology can be confidently predicted, the often-perceived idea that AlphaFold is perfect at the domain level is incorrect: there are many cases where the fold and quite a few details of the topology are correct yet the model is off, and a few cases where it is rather far from perfect:

AlphaFold dominates the scenario, even when predictors perform somewhat “better” than baseline AlphaFold
As we mentioned earlier, all the groups ranking at the top used AlphaFold 2 and/or 3 in one way or another, with enhancements. One very important point must be made about these enhancements, and also about the apparently better performance of these predictors compared to the AlphaFold models. At some levels of resolution, most notably when looking at predictions for domains only, the models produced by AlphaFold 2 or 3 are already so good that there isn’t much space left for improvement. On a first look, the CASP16 rankings for this track might make you think otherwise, because they involve a z-score calculation that amplifies the differences among the systems, while the actual average accuracy of the models isn’t that different. See for example the following plots from the official CASP16 website.
First, here’s the CASP16 standard ranking for Protein Domains, where AlphaFold 3 server ranks 23rd with a sum of z-scores (>0) of 26.3829 against 40.8978 for the top method:

Now, if we look only at GDT_HA (the high-accuracy version of the GDT_TS score, now standard in CASP) with some minimal processing, we find the AlphaFold 3 server in position 17 and, most relevant here, only very slightly away from the top method (slide taken from the assessors’ presentation available here):

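For readers unfamiliar with the metric, GDT-style scores average, over a set of distance cutoffs, the fraction of a model’s Cα atoms that fall within each cutoff of the target; GDT_HA simply uses cutoffs half as generous as GDT_TS. Here is a minimal sketch of that idea on made-up, pre-aligned per-residue deviations (real GDT implementations search over many superpositions, which we skip here):

```python
# Sketch of GDT-style scoring on hypothetical, pre-aligned Calpha deviations.
# Real GDT (e.g. as computed by LGA) optimizes superpositions; this is only
# the averaging step, for illustration.

def gdt(distances, cutoffs):
    """Average, over cutoffs, of the fraction of residues within each cutoff (x100)."""
    n = len(distances)
    return 100.0 * sum(
        sum(d <= c for d in distances) / n for c in cutoffs
    ) / len(cutoffs)

# Hypothetical per-residue Calpha deviations (Angstrom) after superposition.
devs = [0.3, 0.4, 0.6, 0.9, 1.5, 2.5, 3.8, 5.0]

gdt_ts = gdt(devs, (1.0, 2.0, 4.0, 8.0))  # standard GDT_TS cutoffs
gdt_ha = gdt(devs, (0.5, 1.0, 2.0, 4.0))  # high-accuracy (halved) cutoffs
print(gdt_ts, gdt_ha)  # GDT_HA is always <= GDT_TS for the same model
```

Because GDT_HA halves every cutoff, it penalizes small local deviations that GDT_TS forgives, which is why it is the metric of choice now that models are routinely very accurate.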
In fact, using different metrics for ranking results in some reshuffling of the top 50 methods, which are all actually very close to each other. For example, using the final ranking function defined by the assessors, the AlphaFold 3 server ranks fourth, or second if only servers are counted (remember that CASP16 allows both automated servers and humans to participate).
Similar observations hold for much of the rest of CASP16, where fine-tuned metrics rank groups other than AlphaFold 2 or 3 at the top but with actually small differences in prediction accuracy. This is most notable for domain predictions, as just presented, and for multimeric predictions, as we show in the next section; it is less marked, yet present, in the other tracks. Being pragmatic, this all means that, especially if you aren’t an expert in modeling, you’re better off with AlphaFold 3 than with the other tools: it is kept stable (i.e. without continuous updates that might change its performance), runs online easily and fast, and features a very nice GUI.
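To see why sum-of-z-scores rankings can magnify small accuracy gaps, here is a toy sketch of the CASP-style calculation on made-up numbers (hypothetical groups and scores, not real CASP16 data): per target, each group’s score is standardized against the field, negative z-scores are clamped to zero, and the positive values are summed across targets.

```python
# Toy illustration of CASP's "sum of positive z-scores" ranking.
# Groups and GDT-like scores below are invented for illustration only.
from statistics import mean, pstdev

scores = {
    "GroupA": [92.0, 88.0, 90.0, 85.0, 91.0],
    "GroupB": [91.5, 87.5, 89.0, 84.5, 90.5],
    "GroupC": [90.0, 86.0, 88.5, 83.0, 89.5],
}

def sum_positive_z(scores):
    """Per target: z-score each group against the field, clamp at 0, sum over targets."""
    groups = list(scores)
    n_targets = len(next(iter(scores.values())))
    totals = {g: 0.0 for g in groups}
    for t in range(n_targets):
        col = [scores[g][t] for g in groups]
        mu, sigma = mean(col), pstdev(col)
        for g in groups:
            z = (scores[g][t] - mu) / sigma if sigma else 0.0
            totals[g] += max(z, 0.0)
    return totals

z_totals = sum_positive_z(scores)
for g in scores:
    # Mean accuracies differ by only ~1-2 GDT points, but the z-sum spread is large.
    print(g, round(mean(scores[g]), 2), round(z_totals[g], 2))
```

The averages differ by under two points, yet the z-sum ranking makes the leader look far ahead, which is exactly the effect discussed above for the AlphaFold 3 server’s apparent rank.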
Next down the line of increasing complexity, for more control and a wider range of applicability, is probably running AlphaFold 3 locally.
In any case, what’s still a big deal is the user’s own involvement in understanding the full results, for example looking not just at the structural models but also at the accuracy metrics.
Predicting the Structures of Biomolecular Assemblies
One of the biggest challenges in structural biology has been the accurate modeling of complexes between multiple proteins and other molecules, most notably nucleic acids and ligands; CASP16 also looked at ligands and even at structural waters. CASP16 showcased some remarkable progress in this area, but overall the predictions aren’t as good as for protein domains, and indeed some targets were predicted quite poorly, even by the AlphaFold systems. Here are three examples of poorly predicted target structures:

Again, all the top approaches used AlphaFold 2-Multimer and/or AlphaFold 3, optimized MSAs, refined templates, used extended sampling, etc.; they also tested various stoichiometries (more on this later in this section). AlphaFold 3 was leveraged first through its server and then, towards the end of the competition and beyond, through manual local runs that unleash its full potential.
Some predictors ran into issues such as memory limits when attempting to model very large structures, particularly some very large multisubunit assemblies. The most successful approaches relied on splitting the targets into overlapping parts that are predicted separately and then assembled together, but even then this didn’t always work perfectly. Among the very difficult targets that no group or program predicted well were a filament held together by weak interactions and a complex of unusual shape where the same components engaged in multiple different interfaces.
As observed for protein domains, albeit slightly less markedly, the AlphaFold 3 server isn’t too far from the top groups in terms of actual model scores. Notably, though, all of them (including AlphaFold 3) are clearly better than the baseline AlphaFold 2-Multimer run through ColabFold:

Knowing stoichiometry helps
One special twist of CASP16 regarding multimeric assemblies was that target prediction was split into two stages: one where no stoichiometries were given, so they had to be predicted, and another where the stoichiometries were provided, information that in real life one might get from low-resolution techniques such as SEC-MALS or native MS. The conclusion here was solid: knowing the true stoichiometry helps human and automated predictors quite substantially, by all scoring metrics used:

Antibody-antigen complexes
One particularly interesting kind of assembly, not only at the fundamental level but also for tons of practical applications, is that formed by an antibody bound to its target, the latter usually a protein but sometimes other kinds of molecules too, even nucleic acids. Antibody-antigen complexes have been historically hard to predict, for some unclear reason not behaving as “one more case of protein-molecule interaction”. CASP16 still showed limited performance, even by AlphaFold 2 and 3; however, one protocol stood out. Although not much has been revealed about this pipeline yet, it was presented as the ClusPro method for protein-protein docking augmented with two different stages of AlphaFold predictions with enhanced sampling, culminating in ad hoc scoring.
Protein-Small Molecule Complexes and a Look Into the Potential of Co-Folding for Drug Discovery
A particularly exciting track of modeling explored in CASP16 was that of protein-ligand interactions, which is central to the rational development and optimization of small molecules with therapeutic properties, and therefore to pharma and the clinic.
In line with the ultimate goal (or rather dream!) of reliably computing protein-ligand binding, this was probably the deepest-ever attempt to evaluate pharma-relevant protein-ligand docking in CASP. It was possible thanks to contacts with people from pharma companies and a structural genomics consortium, which allowed the organizers to secure a clean, well-defined dataset on which to benchmark the predictors, using unreleased structures of protein-ligand complexes for drug-like molecules as well as measurements of their binding affinities. Importantly, Tanimoto scores for the target ligands showed that they had mid to low similarity to small molecules found in the PDB, making the competition realistic. In addition, they featured functional groups, sizes and other properties typical of small-molecule therapeutics, making this the closest CASP has ever come to a benchmark for drug design. And although only 5 protein targets were involved, three of them were paired with a few to several dozen ligands each, and most of the evaluation seems to have focused on them.
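The Tanimoto score mentioned above is simply the Jaccard coefficient between two molecular fingerprints: the number of shared “on” bits divided by the number of bits set in either molecule. Here is a minimal sketch on hand-made bit sets (in real pipelines the bits come from a SMILES string via a cheminformatics toolkit such as RDKit; the sets below are purely illustrative):

```python
# Minimal Tanimoto (Jaccard) similarity on fingerprint bit sets.
# The bit sets are invented for illustration; real fingerprints are
# derived from molecular structure (e.g. Morgan fingerprints in RDKit).

def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient: |A & B| / |A | B|, in [0, 1]."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

# Hypothetical "on" bits for a target ligand and two PDB ligands.
target_ligand = {1, 4, 7, 9, 12, 15}
pdb_ligand_close = {1, 4, 7, 9, 12, 20}  # shares most bits -> high similarity
pdb_ligand_far = {2, 5, 7, 21, 33}       # shares almost nothing -> low similarity

print(round(tanimoto(target_ligand, pdb_ligand_close), 3))
print(round(tanimoto(target_ligand, pdb_ligand_far), 3))
```

Mid to low Tanimoto scores against the PDB, as reported for the CASP16 ligands, mean the targets could not be solved by simply copying a known complex.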
On these pharma-compatible molecules and their targets, CASP16 offered a chance to evaluate (i) modeling the structures of highly specific complexes between proteins and drug-like ligands and (ii) prediction of the binding affinities of these protein-ligand complexes. In turn, there were two core evaluations: the regular one, on the models and affinities submitted by the predictors; and one carried out by the assessors using baseline models which gave the chance to test AlphaFold 3 and similar methods (which were not available when the predictors worked on their predictions).
Evaluation of protein-ligand docking by the predictors
Modeling of protein-ligand complexes proceeded in two stages. In the first stage, predictors had to model the protein-ligand complexes from the protein sequences and the ligand SMILES strings, and predict the affinities of the pharma ligands from their models. In the second stage, predictors were given the actual structures and were asked to predict the affinities again.
The conclusions were rather positive, or at least promising for the future, regarding the structural modeling of protein-ligand complexes. However, binding affinity estimation turned out terrible, even in stage 2, where the predictors were given the experimental structures of the complexes (this made no difference at all). Let’s now delve a bit deeper.
For structural modeling, the assessment relied on metrics that describe the structure of the protein at the ligand pocket, the ligand pose, and the number and quality of protein-ligand contacts. Notably, two methods from the ClusPro group (highlighted earlier in this post for its top antibody-antigen models) came out on top, followed closely by several others:

Here are a few more important points:
- Although the final ranking scores look good, there’s quite substantial variability per target; see for instance the examples shown by one of the top predictors (MULTICOM, fifth in the ranking, not far from the top) in Figure 4 of their preprint.
- Further analyses showed weak correlations suggesting that small molecules with more atoms and more rotatable bonds are harder to predict, both kind of expected, especially the latter. Besides, (Tanimoto) similarity of the ligands to ligands present in the PDB didn’t help to produce better models (this was analyzed from a different angle in the second part of the evaluation, see the next subsection):

- Another notable finding is that the top predictors tended to perform consistently better across the different targets, meaning that their methods probably do capture certain physics properly and globally, with the caveat that this all relies on very few targets.
- The assessors also observed that the good but limited performance was not a matter of scoring only (which, by the way, could be related to the terrible affinity estimations), because picking the best out of each predictor’s five models didn’t improve much over their top pick.
- Therefore, we conclude that the unsuccessful cases probably combine poor sampling of the protein and ligand conformations and interactions with poor estimation of the interaction strengths themselves.
To conclude this subsection, we have a practical critique of this track of CASP16: as far as we can judge from the abstracts, presentations and talks, none of the major players and expert developers of methods for protein-small molecule docking participated; in particular, nobody behind the most widely used academic and commercial protein-small molecule docking and virtual screening suites (but see below).
Evaluation of baseline methods for protein-ligand docking
In a related sub-track of CASP16’s protein-ligand assessments, a group of assessors explored how well fully automated, “naive” protein-ligand docking methods produce protein-ligand complexes without any manual intervention, for the three pharma-relevant targets paired with several ligands. In essence, these assessors wanted to test how far a non-expert can go in modeling such complexes. For this, they ran structure modeling and affinity prediction “baselines” using programs that are widely available in easy-to-use forms, with default parameters.
The baselines for structural modeling mainly included co-folding with AlphaFold 3, Boltz-1 and RoseTTAFold All-Atom (we dedicate a specific subsection to co-folding below), and naive docking with AutoDock Vina centering the search box at the known ligand site. Affinity baselines used scores from programs like AutoDock Vina and GNINA as well as simple metrics derived from the ligands alone, such as their molecular weights and partition coefficients.
The results showed that the two modern multimodal AI systems tested, AlphaFold 3 through a local installation and Boltz-1, performed best together with ClusPro (the top predictor in the regular assessment; we would argue the runners-up should have been tested too). Among these three, AlphaFold 3 and ClusPro probably perform best, and very similarly, but remember this is just on 3 targets, each with a few tens of ligands crystallized on them, so clearly more extensive tests are needed.
Meanwhile, RoseTTAFold All-Atom, which is also multimodal but based on a different (simpler) architecture, performed slightly worse than AlphaFold 3, Boltz-1 and ClusPro; and naive docking with AutoDock Vina worked about as well (or as poorly) as RoseTTAFold All-Atom, but behaved differently on the different targets:

On binding affinities, none of the tested methods worked well; in fact, ligand descriptors like molecular weight correlated better (or rather, less badly) with the experimental affinities than the scores produced by the docking prediction methods. We note, however, that the scores produced by docking programs are not supposed to quantify affinities and are not meant for the kind of analyses performed here.
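A sanity check of this kind boils down to comparing Pearson correlations: descriptor vs. measured affinity against docking score vs. measured affinity. Here is a small sketch with entirely made-up numbers (hypothetical ligands, weights, pIC50s and docking scores, not CASP16 data), just to show the shape of the analysis:

```python
# Sketch of the assessors' style of sanity check: does a trivial descriptor
# (molecular weight) track affinity better than a docking score?
# All values below are invented for illustration.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical ligand series: molecular weight (Da), measured pIC50,
# and a docking score that tracks affinity poorly.
mol_weight = [310.0, 355.0, 402.0, 388.0, 450.0, 295.0]
pic50      = [5.1,   5.6,   6.4,   6.1,   6.9,   4.8]
dock_score = [-7.2,  -6.1,  -8.4,  -5.9,  -6.6,  -7.9]

r_mw = pearson(mol_weight, pic50)
r_dock = pearson(dock_score, pic50)
print(round(r_mw, 2), round(r_dock, 2))  # descriptor beats docking score here
```

In this toy series the trivial descriptor correlates strongly while the docking score barely does, mirroring (by construction) the discouraging CASP16 finding.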
This said, and despite the good prospects, especially for AlphaFold 3 and ClusPro, we note again that these conclusions were drawn from only 3 targets. Therefore, while it is clear that co-folding methods based on multimodal AI systems have potential, more extensive tests are required.
A final note on protein-ligand docking by co-folding
As we just presented, the evaluation of baseline methods for protein-ligand docking in CASP16 allowed, for the first time ever, testing an emerging methodology: modeling protein-ligand complexes by “co-folding”, as enabled by the most modern multimodal AI tools for biomolecular structure prediction, which handle not only protein components but also nucleic acids, ions, and ligands. And as you saw, according to this limited set of tests, AlphaFold 3 performs at the top, and hence the technology looks promising.
Essentially, the idea behind protein-ligand docking by co-folding rather than by traditional docking-and-scoring is that the ligand and the protein are modeled together from the start, making it possible to predict their structures in a way that accounts for mutual conformational adaptation. This approach contrasts with classical docking methods, where a pre-folded protein structure is used as a rigid or semi-flexible receptor for ligand placement. By co-folding both components, these AI methods can access much broader sampling of binding-competent states, which might be too far (in conformational space) from the structure of the unbound protein that a regular docking protocol would use. We covered this idea briefly in our blog posts dedicated to AlphaFold 3 and other multimodal AI systems for structure prediction such as Chai-1 and Boltz-1:
- AlphaFold 3 Advances the Future AI Technologies for Pharma and Biotech
- AlphaFold and Similar AI Models Go All Atoms, Paving New Roads to Drug Development
We are eager to see what happens next in this exciting and critical area intersecting structural biology, molecular modeling, and pharma, where AI could revolutionize drug design, drug repurposing, and drug optimization. We are also eager to hear the opinion of long-standing experts in protein-small molecule docking and virtual screening, especially those who develop specific methods and software and hence know how to get the most out of them, and who were largely absent from the contest.
Modeling of Complexes Involving Nucleic Acids
CASP16 also shed light on structure prediction of nucleic acids and nucleic acid-containing complexes, all crucial for understanding cellular mechanisms and with applications in RNA-based therapeutics, codon optimization protocols for protein expression, the design of CRISPR-Cas9 systems, etc. Notably, CASP16 had the largest-ever number of nucleic acid and nucleic acid-containing targets, including DNA aptamers, RNA aptamers, DNA/RNA-protein complexes, and nucleic acid-small molecule complexes.
Overall, CASP16 observed that at least the coarse topology was predicted correctly for all nucleic acid-containing targets; however, the “best” models were very far from perfect for several targets, with TM-scores barely reaching around 0.4 for a large fraction of them. The assessors observed that model accuracy depended strongly on the availability of suitable templates in the PDB. Nucleic acid multimers and ribonucleoproteins remain especially difficult to model, for all methods. Discouragingly, there was no big improvement over CASP15, which was probably the first CASP edition with a good number of nucleic acid targets and with predictors investing time and effort in them.
The following figure, taken from the assessors’ talk, compares the target structures to the top models, conveying a global idea about the state of the art of modeling including nucleic acids:

Another important takeaway from this track of CASP16 was that none of the top four predictor teams seemed to have used AI. To repeat ourselves from the previous paragraph, all the best strategies required good templates to produce reasonable or better models. In addition, the predictors’ talks revealed a much greater weight of human intervention in modeling and in scoring models.
Regarding the AlphaFold systems, while AlphaFold 2 cannot handle nucleic acids, AlphaFold 3 can, and it was tested as a baseline. It ranked 10th overall, this time lagging behind the top performers in the model accuracy metrics by a larger margin than the small gaps observed for protein-centered predictions.
Other Frontiers Explored in CASP16
CASP often attempts to explore new frontiers in biomolecular modeling. Just a couple of editions ago, evaluating the state of the art of protein-ligand docking and nucleic acid modeling was more of a wish than a reality, with very small tests materialized through a few specific cases. Since CASP15 and 16 these tracks have been gaining traction, as we have covered.
Other fronts explored in CASP15 and especially in CASP16 included modeling conformational dynamics, predicting the location of protein-bound waters and ions, and the use of integrative modeling approaches that combine regular molecular modeling with low-resolution information from techniques such as SAXS. These still remain experimental, with occasional interesting findings that you can discover in the dedicated presentations from Day 4 of the CASP16 conference; for example, that atomistic molecular dynamics simulations provide the best way to identify tightly bound waters and ions.
Staying Ahead with Nexco
We hope you have found this rich blog post enlightening. At Nexco we are constantly updating ourselves on the various areas of bioinformatics, to leverage the cutting-edge advancements in the field and continuously refine our services. From classical and AI-assisted genomics to AI-driven protein structure prediction and molecular docking, our team integrates the latest breakthroughs to deliver world-class bioinformatics solutions.
Overall, this last edition of CASP and all the analysis we have carried out here reinforce what is already established: AI is revolutionizing all of bioinformatics, including biomolecular modeling as covered here. However, the real power lies in how we refine, optimize, and apply these tools. As the field moves forward, Nexco remains dedicated to driving innovation and empowering researchers with the most advanced computational strategies available.
Note that while Nexco cannot directly use AlphaFold 3 and other similar AI tools for commercial services due to their licensing restrictions, we can still provide valuable support to academic researchers who want to leverage these tools for non-commercial purposes. Our expertise in computational biology and biomolecular modeling allows us to assist academics in optimizing their workflows, refining modeling strategies, and interpreting structural insights effectively. At the same time, dedicated experts can assist you in other fields of bioinformatics with the same depth. That is how, at Nexco, we are committed to empowering the scientific community, academic or private, with cutting-edge bioinformatics expertise along all its fronts.
Want to harness the latest in bioinformatics for your research? Contact Nexco today and let’s push the boundaries of science together.
Key references
Domain, whole structure, and multimer prediction
- Domains and Multimers from CASP assessors perspective
- Multimers from CAPRI assessors perspective
- MassiveFold presentation
- Original MassiveFold paper
Protein-ligand docking, co-folding, and affinity estimation
- Presentation on protein-ligand docking assessment
- Preprint by one of the predictors
- Related preprint from the Schwede group
Nucleic acids and nucleic acid-containing targets
General
- CASP16 landing page (with direct links to all rankings and evaluations)
- All CASP16 presentations
- CASP16 videos of selected talks
- Book with predictors’ abstracts for CASP16