Use case AlphaFold
Background
AlphaFold is a neural-network-based (AI) tool developed by DeepMind (https://www.deepmind.com/research/highlighted-research/alphafold) for protein structure prediction (Jumper et al., 2021). It predicts the 3D coordinates of protein structures with high accuracy from amino acid sequences alone. It first runs sequence alignments for the submitted sequence(s) with state-of-the-art tools; these alignments are then fed to a deep learning algorithm that produces the structure predictions.
Version 2 of the software showed impressive results for single proteins at CASP14 (Critical Assessment of Techniques for Protein Structure Prediction) in 2020 (Kryshtafovych et al., 2021). The ongoing CASP15, coupled with CAPRI (Critical Assessment of PRedicted Interactions), should provide insight into the accuracy of the tool for quaternary structure prediction (Evans et al., 2021). The code of AlphaFold 2.x was released on GitHub at the same time as the article (https://github.com/deepmind/alphafold).
The quality of AlphaFold's predictions has clearly energized the structural bioinformatics community, and other communities have also started to consider integrating these predictions into ongoing or new projects.
Many predictions are already available in the ever-growing AlphaFold database, which covers proteins from more and more species (Varadi et al., 2022). These precomputed structure predictions already meet many needs, but the demand for running AlphaFold remains very high. For example, one may need to model a structure for a sequence that is not (or not yet) in the AlphaFold database, one may want all the predictions produced by AlphaFold (only the best one is available in the AlphaFold database), or one may want to model quaternary structures, which the database does not provide.
In addition, DeepMind provides AlphaFold as a Colab notebook (https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb). It allows predictions to be run through a web interface, but with limitations.
Alongside it, Mirdita et al. recently published ColabFold (Mirdita et al., 2022), a tuned version of AlphaFold2 that uses different, faster software for sequence alignment. It is also available as a dedicated Colab notebook (https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb).
Need
The growing demand for structure predictions requires giving the community wider access to the software. This access needs to be provided:
- on the command line, backed by substantial computing resources, for users who have to run many modeling jobs
- through a simple web interface, for people who only need to run a few modeling jobs and are not familiar with the command line
- for both DeepMind’s AlphaFold and ColabFold
Resources
The tool is based on a deep learning approach that requires heavy computation, part of which can run on GPUs (Graphics Processing Units). The process is divided into three steps:
- sequence alignments (if not directly provided by the user)
- structure modeling
- relaxation (optional)
The sequence alignment step runs on CPU. DeepMind’s AlphaFold integrates JackHMMER and HHsearch/HHblits, which require a ~2.5 TB database. ColabFold integrates MMseqs2, which is faster and requires a ~1 TB database.
Structure modeling can run on either GPU or CPU, but the CPU option is very slow compared to the GPU one. The drawback of the GPU option is that it may require a lot of memory for large sequences. The amount required, however, also seems to depend on the type of GPU used.
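The three steps and their device constraints can be summarized in a small planning sketch. This is only an illustration of the description above, not code from AlphaFold or ColabFold: the names `Step` and `plan_steps` are hypothetical, and the device for relaxation is left unset because this document does not specify it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    name: str
    device: Optional[str]  # "cpu", "gpu", or None when the text does not say


def plan_steps(msa_provided: bool, relax: bool, gpu_available: bool) -> List[Step]:
    """List the steps to run for one prediction (hypothetical helper)."""
    steps: List[Step] = []
    # Alignments are skipped when the user supplies them directly.
    if not msa_provided:
        steps.append(Step("sequence_alignment", "cpu"))
    # Modeling prefers the GPU; the CPU works but is much slower.
    steps.append(Step("structure_modeling", "gpu" if gpu_available else "cpu"))
    # Relaxation is optional; its device is an open choice here.
    if relax:
        steps.append(Step("relaxation", None))
    return steps


if __name__ == "__main__":
    for step in plan_steps(msa_provided=False, relax=True, gpu_available=True):
        print(step.name, step.device)
```

Keeping the alignment step separate from the modeling step in such a plan is what would later allow the two parts to be scheduled on different resources.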
Objectives
The objectives of the use case are to:
- have AlphaFold and ColabFold installed, with the required databases, on HPC/AI infrastructures
- benchmark the tools with diverse sequences on different GPUs with various amounts of memory, used singly or in combination. The benchmark will focus mainly on the GPU computing part.
- create a web interface that lets users easily run modeling jobs with a choice of AlphaFold2 or ColabFold
- eventually split the code to run one part on CPU and another on specific GPU(s), the GPU(s) being chosen according to the sequence length
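The last objective, choosing the GPU(s) according to sequence length, could look like the following sketch. It assumes the benchmark yields, for each GPU setup, the longest sequence that fits in memory. The class `GpuProfile`, the function `choose_gpu`, the profile names and all numbers are hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class GpuProfile:
    name: str          # hypothetical label for a GPU model or GPU combination
    memory_gb: int     # total memory of the device(s)
    max_residues: int  # longest sequence that fit in the benchmark (placeholder)


def choose_gpu(sequence_length: int, profiles: List[GpuProfile]) -> Optional[GpuProfile]:
    """Return the smallest profile able to hold the sequence, or None."""
    fitting = [p for p in profiles if p.max_residues >= sequence_length]
    return min(fitting, key=lambda p: p.memory_gb) if fitting else None


# Placeholder numbers for illustration only, not measurements.
PROFILES = [
    GpuProfile("gpu_small", memory_gb=16, max_residues=1000),
    GpuProfile("gpu_large", memory_gb=80, max_residues=3000),
]

if __name__ == "__main__":
    for length in (300, 2000, 5000):
        chosen = choose_gpu(length, PROFILES)
        print(length, chosen.name if chosen else "no GPU fits, fall back to CPU")
```

Picking the smallest fitting profile keeps the large-memory GPUs free for the sequences that actually need them.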
The sequences for the benchmark will have to be defined.
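Once the sequences are defined, the benchmark amounts to running every tool on every sequence with every GPU setup. A minimal sketch of how that run matrix could be enumerated is given below. The sequence identifiers, lengths and GPU setup labels are invented placeholders, and `BenchmarkRun` and `benchmark_matrix` are hypothetical names.

```python
from itertools import product
from typing import Dict, List, NamedTuple


class BenchmarkRun(NamedTuple):
    tool: str
    sequence_id: str
    sequence_length: int
    gpu_setup: str


def benchmark_matrix(
    tools: List[str], sequences: Dict[str, int], gpu_setups: List[str]
) -> List[BenchmarkRun]:
    """Enumerate every (tool, sequence, GPU setup) combination."""
    return [
        BenchmarkRun(tool, seq_id, length, gpu)
        for tool, (seq_id, length), gpu in product(tools, sequences.items(), gpu_setups)
    ]


TOOLS = ["AlphaFold", "ColabFold"]
# Placeholder sequences (identifier -> length) and GPU setups, to be replaced.
SEQUENCES = {"seq_short": 150, "seq_long": 1500}
GPU_SETUPS = ["gpu_a_x1", "gpu_a_x2", "gpu_b_x1"]

if __name__ == "__main__":
    for run in benchmark_matrix(TOOLS, SEQUENCES, GPU_SETUPS):
        print(run)
```

With these placeholders the matrix has 2 x 2 x 3 = 12 runs, which shows how quickly the benchmark grows as sequences and GPU setups are added.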
Perspectives
Since AlphaFold, many laboratories have been developing new software based on deep learning approaches to predict structures. Like AlphaFold and ColabFold, these tools could be included in this use case to make them accessible to the community and to benchmark them (e.g. https://github.com/HeliXonProtein/OmegaFold, https://github.com/aqlaboratory/openfold).
References
- Evans,R. et al. (2021) Protein complex prediction with AlphaFold-Multimer.
- Jumper,J. et al. (2021) Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589.
- Kryshtafovych,A. et al. (2021) Critical assessment of methods of protein structure prediction (CASP)-Round XIV. Proteins, 89, 1607–1617.
- Mirdita,M. et al. (2022) ColabFold: making protein folding accessible to all. Nat. Methods, 19, 679–682.
- Varadi,M. et al. (2022) AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res., 50, D439–D444.