Artificial intelligence and the challenge of protein design: the prowess and limits of AlphaFold

The pandemic linked to SARS-CoV-2 acutely raises the question of how to design molecules capable of limiting the action of a virus on our cells. This mechanism involves proteins, very large molecules that are difficult to model and that are, moreover, in permanent motion.

Artificial intelligence systems, starting with Google’s AlphaFold2, now predict the configuration of these proteins in an impressive way, which is revolutionizing research in the field. How do these methods work? What are their current limits?

SARS-CoV-2 entry: a high-flying heist

The infection of one of our cells by SARS-CoV-2, the virus responsible for Covid-19, begins with a kind of break-in: the virus, an envelope bristling with proteins that contains its genetic material, behaves like a burglar entering an apartment on the first floor of a building. With a grappling hook (the “receptor binding domain”, or RBD, found on the famous “spike” protein), it clings to the railing (the no less famous “ACE2” protein). Then, using a hammer (the fusion domain, another region of the spike), it breaks the glass (the cell membrane) and injects its genetic material.

This mechanism is dynamic, that is to say that the molecules change conformation (shape) during the breach. On the one hand, the virus only “draws” its hook at the last moment; on the other hand, the “window breakage” uses a kind of telescopic pole whose assembly is complex.

Fusion of cell membrane and virus; source: ClarafiSciViz.

These two protagonists (spike and ACE2) are proteins. Interactions between proteins are the basis of the vast majority of biological functions, and understanding these interactions first requires knowledge of the geometric shape of the partners – we often speak of “key” and “lock” to visualize the fact that the geometry of proteins must be adequate for them to interact.

These molecular conformations have been studied experimentally since the 1950s-60s and stored in an international database, the Protein Data Bank.

In the case of SARS-CoV-2, the Spike protein has been widely presented as such a key, which would fit into the ACE2 “lock”. But the key-lock mechanism is a somewhat simplistic vision, and as we have seen, proteins are endowed with a certain flexibility (they deform), which also allows them to adapt.

Indeed, one way to block infection by SARS-CoV-2 is to prevent the attachment of the grappling hook (the Spike protein), and more specifically of its receptor binding domain (RBD) to the ACE2 target. This is the goal of certain antibodies secreted by our immune system.

Unfortunately, through mutations, the virus constantly tends to escape this control: certain amino acids change, so that the conformation of its spike protein is no longer recognized by antibodies. Because the antibodies no longer have sufficient affinity, the immune system must adapt, which is a challenge when it comes to being effective against a wide range of viral strains.

Affinity between two biomolecules: structure and dynamics

To better understand the attachment of the “grappling hook” (RBD) to the “railing” (ACE2), let’s look at the interaction of two proteins A and B forming a complex C.

At the atomic scale, two phenomena are in competition: forces of attraction between atoms cause molecules to attract each other; but under the effect of thermal agitation (i.e. the random displacements of atoms, which increase with temperature), the molecules deform.

This thermal agitation means that once the complex C has formed, it can dissociate into A and B, the partners then being able to associate again, and so on. This is a chemical equilibrium, and the relative amounts of molecules A and B and of complex C measure the stability of the interaction. The more complex C there is at equilibrium, the higher the affinity of A for B, and therefore the more stable their interaction.
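To make this equilibrium concrete, here is a minimal sketch in Python. It assumes the simplest possible model, a one-to-one binding A + B ⇌ C described by a dissociation constant Kd = [A][B]/[C] (a standard quantity, though the article does not name it). The function name and the concentrations used are illustrative assumptions, not measured values for Spike and ACE2.

```python
import math

def complex_concentration(a_total, b_total, kd):
    """Equilibrium concentration of complex C for A + B <-> C.

    a_total, b_total: total amounts of A and B (free + bound), in mol/L.
    kd: dissociation constant [A][B]/[C], in mol/L (smaller = higher affinity).
    Solves the quadratic (a_total - c) * (b_total - c) = kd * c for c.
    """
    s = a_total + b_total + kd
    return (s - math.sqrt(s * s - 4.0 * a_total * b_total)) / 2.0

# Illustrative numbers only: 1 micromolar of each partner.
a_total = b_total = 1e-6
strong = complex_concentration(a_total, b_total, kd=1e-9)  # high affinity
weak = complex_concentration(a_total, b_total, kd=1e-6)    # lower affinity

print(f"fraction of A bound, high affinity: {strong / a_total:.2f}")
print(f"fraction of A bound, low affinity:  {weak / a_total:.2f}")
```

With the same quantities of A and B, the smaller dissociation constant leaves far more of the material in the complexed form, which is exactly the sense in which "more complex C" means "higher affinity".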

In the case of Spike and ACE2, a high affinity of the “grappling hook” (RBD) for the “railing” (ACE2) increases the infectivity of the virus: the higher its affinity for the railing, the more strongly the hook clings to it.

AlphaFold2: from structure to affinity

Estimating the binding affinity therefore requires taking into account the deformations around an average molecular structure. In the lock-key metaphor, the shape of the latter must be known, at least approximately. Proteins are known to be made up of long chains of different amino acids strung together like a long string of pearls.

Knowing the sequence of the amino acids of a protein (in other words, the order in which they are linked), could we predict the shape it will adopt, by calculating it by computer?

This question saw a major step forward with the development of the AlphaFold2 method, and the software of the same name, by a research group at Google DeepMind. The method clearly outperformed its competitors at the CASP14 competition in 2020, which evaluates the quality of predictions by comparing them to structures that have been solved experimentally but not revealed to the competitors.

Very schematically, given the sequence of amino acids whose conformation must be predicted, AlphaFold2 uses as input a database of homologous sequences (different sequences but for which the changes of amino acids do not alter the function of the protein), as well as some experimental structures from the Protein Data Bank. The method outputs a plausible structure for the protein, as well as a “confidence score” for the position, when the protein is folded, of each amino acid in the calculated conformation, which helps to see which amino acids are exposed and can interact with the outside.
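As an illustration of the per-amino-acid confidence score, the sketch below reads it back from a predicted structure. It relies on one assumption not stated in the article: AlphaFold writes this score (called pLDDT, on a 0 to 100 scale) into the B-factor field of the PDB-format files it outputs. The helper names, the three-residue demo and its scores are made up for the example, and the threshold of 70 is a commonly used convention rather than a rule.

```python
def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z, score):
    """Format one fixed-width PDB ATOM record (hypothetical demo helper)."""
    return (f"ATOM  {serial:5d} {name:<4s} {res_name:>3s} {chain}{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{score:6.2f}")

def per_residue_confidence(pdb_text):
    """Map residue number -> score stored in the B-factor field of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name, 23-26 the residue number,
        # 61-66 the B-factor (used here as the confidence score).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores

# Made-up coordinates and scores, for illustration only.
demo = "\n".join([
    make_atom_line(1, " CA", "MET", "A", 1, 0.0, 0.0, 0.0, 42.5),
    make_atom_line(2, " CA", "GLY", "A", 2, 3.8, 0.0, 0.0, 91.2),
    make_atom_line(3, " CA", "LYS", "A", 3, 7.6, 0.0, 0.0, 95.0),
])

scores = per_residue_confidence(demo)
confident = [res for res, s in scores.items() if s >= 70.0]
print(scores)
print("residues predicted with confidence:", confident)
```

Low-scoring stretches are the ones to treat with caution when reading a predicted structure.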

The method uses two main blocks. The first produces a rough model encoding certain constraints between the amino acids, in particular the distances within each triplet of amino acids, which must respect the triangle inequality. The second, the structure module, explicitly introduces the 3D model by positioning the amino acids relative to each other, thanks to “attention mechanisms”, an algorithmic technique that weighs the possible relationships between amino acids and gives the most weight to those most consistent with the model being developed. Ultimately, the neural network generates a plausible conformation.
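The triangle-inequality constraint can be illustrated with a few lines of Python. This is not AlphaFold2's code, only a sketch of the geometric idea: for any three amino acids, no pairwise distance may exceed the sum of the other two, otherwise the distances cannot be realized by points in space. The function names and the four-point example are assumptions made for the illustration.

```python
import math
from itertools import combinations

def distance_matrix(points):
    """Pairwise Euclidean distances between 3D points."""
    return [[math.dist(p, q) for q in points] for p in points]

def triangle_violations(dist, tol=1e-9):
    """Return the triplets (i, j, k) whose distances break the triangle inequality."""
    bad = []
    for i, j, k in combinations(range(len(dist)), 3):
        a, b, c = dist[i][j], dist[j][k], dist[i][k]
        if a > b + c + tol or b > a + c + tol or c > a + b + tol:
            bad.append((i, j, k))
    return bad

# Four points placed 3.8 units apart on a square (illustrative geometry).
points = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
consistent = distance_matrix(points)

# Corrupt one predicted distance: points 0 and 2 are now "20 units apart",
# which is impossible given their distances to points 1 and 3.
inconsistent = [row[:] for row in consistent]
inconsistent[0][2] = inconsistent[2][0] = 20.0

print(triangle_violations(consistent))
print(triangle_violations(inconsistent))
```

Distances measured on real points never violate the constraint; a set of predicted distances may, which is why a model that reasons on distances needs to keep them mutually consistent.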

To date, the method is particularly effective for well-structured protein domains (the most rigid), but much less so for unstructured parts (the most flexible), or for flexible loops, for which the notion of a single structure does not make sense. Moreover, despite the confidence score mentioned above, the overall result is delivered without any guarantee.

Applying the method to SARS-CoV-2 antibodies

The resounding success of this method has of course aroused interest in affinity prediction, which has been explored very recently to optimize antibodies against the RBD of SARS-CoV-2, so that these antibodies have a high affinity for different viral strains.

The method uses a “mutagenesis” database for this purpose: it provides the structure of a complex, the structure of an analogous complex whose proteins have mutated, and the affinity associated with each of these two complexes. The aim is therefore to learn how mutations influence affinity. From a methodological point of view, the algorithm identifies the amino acids that contribute significantly to the binding affinity.
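One common way to express "how a mutation influences affinity" as a single number is the change in binding free energy between the original and the mutated complex. The sketch below assumes that affinities are given as dissociation constants and uses the standard relation ΔΔG = RT ln(Kd_mutant / Kd_original). Whether the study described here uses exactly this quantity is an assumption, and the records are hypothetical values, not entries from a real database.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol K)

def delta_delta_g(kd_original, kd_mutant, temperature=298.15):
    """Change in binding free energy (kJ/mol) caused by a mutation.

    Positive: the mutation weakens binding. Negative: it strengthens binding.
    Both dissociation constants must share the same unit.
    """
    return R_KJ * temperature * math.log(kd_mutant / kd_original)

# Hypothetical records: (label, Kd of original complex, Kd of mutated complex), in mol/L.
records = [
    ("mutation_1", 1e-9, 1e-8),   # binding ten times weaker
    ("mutation_2", 1e-9, 5e-10),  # binding twice as strong
]

effects = {label: delta_delta_g(kd0, kd1) for label, kd0, kd1 in records}
for label, ddg in effects.items():
    print(f"{label}: ddG = {ddg:+.2f} kJ/mol")
```

A learning algorithm trained on such pairs is asked to predict this signed value from the two structures, which is what lets it rank candidate mutations of an antibody.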

Remarkably, this strategy optimized an effective antibody against Alpha, Beta and Gamma variants of SARS-CoV-2 (but not Delta).

Dynamics prediction remains an open problem

Reliably estimating the binding affinity between large molecules such as proteins requires exploring very high-dimensional spaces (the atoms are numerous and each moves in the three dimensions of space) in order to calculate average properties that account for our macroscopic observations.

Also, in the context of AlphaFold2 and machine learning, data must be available so that the algorithms can learn to link structure and properties. In our case, the static information present in the Protein Data Bank and other databases obviously does not contain all the dynamic information required.

“Predicting is not explaining”

The practical question of effectively blocking a virus like SARS-CoV-2 shows how difficult these molecular design questions are: they do not yet fall within the scope of classical engineering optimization work.

Affinity prediction also illustrates the opposition observed in epistemology between “predictivism” and explanation by laws and models, which make it possible to establish a chain of causality. As the mathematician and epistemologist René Thom said, “Predicting is not explaining”, and machine learning techniques illustrate this dissonance well.

We bet, however, that the accumulation of data, dynamic data in particular, will allow a convergence in which machine learning will be able to match predictions with explanations.
