AlphaFold
1. Nature Article
“It’s a game changer,” says Andrei Lupas, an evolutionary biologist at the Max Planck Institute for Developmental Biology in Tübingen, Germany, who assessed the performance of different teams in CASP. AlphaFold has already helped him find the structure of a protein that has vexed his lab for a decade, and he expects it will alter how he works and the questions he tackles. “This will change medicine. It will change research. It will change bioengineering. It will change everything,” Lupas adds.
Proteins are the building blocks of life, responsible for most of what happens inside cells. How a protein works and what it does is determined by its 3D shape — ‘structure is function’ is an axiom of molecular biology. Proteins tend to adopt their shape without help, guided only by the laws of physics.
The first complete structures of proteins were determined, starting in the 1950s, using a technique in which X-ray beams are fired at crystallized proteins and the diffracted light is translated into a protein’s atomic coordinates.
Early attempts to use computers to predict protein structures in the 1980s and 1990s performed poorly, say researchers.
Moult started CASP to bring more rigour to these efforts. The event challenges teams to predict the structures of proteins that have been solved using experimental methods, but for which the structures have not been made public.
DeepMind’s 2018 performance at CASP13 startled many scientists in the field, which has long been the bastion of small academic groups. But its approach was broadly similar to those of other teams that were applying AI, says Jinbo Xu, a computational biologist at the University of Chicago, Illinois.
The first iteration of AlphaFold applied the AI method known as deep learning to structural and genetic data to predict the distance between pairs of amino acids in a protein. In a second step that does not invoke AI, AlphaFold uses this information to come up with a ‘consensus’ model of what the protein should look like, says John Jumper at DeepMind, who is leading the project.
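The idea of describing a protein by the distances between pairs of amino acids can be illustrated with a small sketch. It computes a pairwise distance matrix from toy coordinates and sorts each distance into a coarse bin, roughly the kind of discrete target a distance-predicting network could be trained against. The coordinates, the bin edges, and the function names are illustrative assumptions, not DeepMind's actual data or code.

```python
import math

def pairwise_distances(coords):
    """Symmetric matrix of Euclidean distances between every pair of residues."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def to_bin(distance, edges):
    """Index of the first bin whose upper edge is >= distance.

    A return value of len(edges) means the distance lies beyond every edge.
    """
    for k, edge in enumerate(edges):
        if distance <= edge:
            return k
    return len(edges)

# Assumed bin edges in angstroms, chosen only for illustration.
BIN_EDGES = (4.0, 8.0, 12.0, 16.0)

# Toy coordinates: one point per residue, spaced about 3.8 angstroms apart.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (7.6, 3.8, 0.0)]

distances = pairwise_distances(toy_coords)
binned = [[to_bin(d, BIN_EDGES) for d in row] for row in distances]

for row in binned:
    print(row)
```

A real system would predict a probability for each bin rather than a single label, but the matrix layout is the same: one row and one column per residue.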
The team tried to build on that approach but eventually hit the wall.
The team then developed an AI network that incorporated additional information about the physical and geometric constraints that determine how a protein folds.
They also set it a more difficult task: instead of predicting relationships between amino acids, the network predicts the final structure of a target protein sequence. “It’s a more complex system by quite a bit,” Jumper says.
Target proteins or portions of proteins called domains, about 100 in total, are released on a regular basis, and teams have several weeks to submit their structure predictions.
Some predictions were better than others, but nearly two-thirds were comparable in quality to experimental structures. In some cases, says Moult, it was not clear whether the discrepancy between AlphaFold’s predictions and the experimental result was a prediction error or an artefact of the experiment.
About half of the teams mentioned ‘deep learning’ in the abstract summarizing their approach.
“I think it’s fair to say this will be very disruptive to the protein-structure-prediction field. I suspect many will leave the field as the core problem has arguably been solved,” he says. “It’s a breakthrough of the first order, certainly one of the most significant scientific results of my lifetime.”
An AlphaFold prediction helped to determine the structure of a bacterial protein that Lupas’s lab has been trying to crack for years. Lupas’s team had previously collected raw X-ray diffraction data, but transforming these Rorschach-like patterns into a structure requires some information about the shape of the protein. Tricks for getting this information, as well as other prediction tools, had failed. “The model from group 427 gave us our structure in half an hour, after we had spent a decade trying everything,” Lupas says.
AlphaFold’s performance also marks a turning point for DeepMind. The company is best known for wielding AI to master games such as Go, but its long-term goal is to develop programs capable of achieving broad, human-like intelligence. Tackling grand scientific challenges, such as protein-structure prediction, is one of the most important applications its AI can make, Hassabis says. “I do think it’s the most significant thing we’ve done, in terms of real-world impact.”
2. DeepMind Blog Post
Proteins are essential to life, supporting practically all its functions. They are large complex molecules, made up of chains of amino acids, and what a protein does largely depends on its unique 3D structure. Figuring out what shapes proteins fold into is known as the “protein folding problem”, and has stood as a grand challenge in biology for the past 50 years.
Professor John Moult, Co-Founder and Chair of CASP, University of Maryland
Experimental techniques such as nuclear magnetic resonance and X-ray crystallography, along with newer methods like cryo-electron microscopy, depend on extensive trial and error, which can take years of painstaking and laborious work per structure and require the use of multi-million dollar specialised equipment.
In his acceptance speech for the 1972 Nobel Prize in Chemistry, Christian Anfinsen famously postulated that, in theory, a protein’s amino acid sequence should fully determine its structure. This hypothesis sparked a five-decade quest to computationally predict a protein’s 3D structure based solely on its 1D amino acid sequence, as a complementary alternative to these expensive and time-consuming experimental methods.
A major challenge, however, is that the number of ways a protein could theoretically fold before settling into its final 3D structure is astronomical.
In 1969 Cyrus Levinthal noted that it would take longer than the age of the known universe to enumerate all possible configurations of a typical protein by brute force calculation: he estimated 10^300 possible conformations for a typical protein. Yet in nature, proteins fold spontaneously, some within milliseconds, a dichotomy sometimes referred to as Levinthal’s paradox.
The main metric used by CASP to measure the accuracy of predictions is the Global Distance Test (GDT) which ranges from 0-100. In simple terms, GDT can be approximately thought of as the percentage of amino acid residues (beads in the protein chain) within a threshold distance from the correct position. According to Professor Moult, a score of around 90 GDT is informally considered to be competitive with results obtained from experimental methods.
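The "percentage of residues within a threshold distance" reading of GDT can be sketched in a few lines. This simplified version averages that percentage over the four cutoffs (1, 2, 4 and 8 angstroms) used by the common GDT_TS variant, and it assumes the predicted and experimental structures are already superimposed. The full CASP procedure also searches over superpositions, which is omitted here. The function names and toy coordinates are hypothetical.

```python
import math

# Cutoffs in angstroms used by the GDT_TS variant of the score.
GDT_TS_CUTOFFS = (1.0, 2.0, 4.0, 8.0)

def fraction_within(pred, ref, cutoff):
    """Fraction of residues whose predicted position is within `cutoff` of the reference."""
    hits = sum(1 for p, r in zip(pred, ref) if math.dist(p, r) <= cutoff)
    return hits / len(ref)

def gdt_ts(pred, ref):
    """Simplified GDT_TS (0-100) for two already-superimposed coordinate lists."""
    if len(pred) != len(ref) or not ref:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(fraction_within(pred, ref, c) for c in GDT_TS_CUTOFFS)
    return 100.0 * total / len(GDT_TS_CUTOFFS)

# Toy example: four residues displaced by 0.5, 1.5, 3.0 and 10.0 angstroms along x.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.5, 0.0, 0.0), (5.3, 0.0, 0.0), (10.6, 0.0, 0.0), (21.4, 0.0, 0.0)]

print(gdt_ts(predicted, reference))
```

In the toy example one badly misplaced residue out of four fails every cutoff, which is enough to pull the score far below the roughly 90 GDT that Moult describes as competitive with experiment.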
Our methods may prove especially helpful for important classes of proteins, such as membrane proteins, that are very difficult to crystallise and therefore challenging to experimentally determine.
This computational work represents a stunning advance on the protein-folding problem, a 50-year-old grand challenge in biology. It has occurred decades before many people in the field would have predicted. It will be exciting to see the many ways in which it will fundamentally change biological research.
Professor Venki Ramakrishnan, Nobel Laureate and President of the Royal Society
A folded protein can be thought of as a “spatial graph”, where residues are the nodes and edges connect the residues in close proximity. This graph is important for understanding the physical interactions within proteins, as well as their evolutionary history. For the latest version of AlphaFold, used at CASP14, we created an attention-based neural network system, trained end-to-end, that attempts to interpret the structure of this graph, while reasoning over the implicit graph that it’s building. It uses evolutionarily related sequences, multiple sequence alignment (MSA), and a representation of amino acid residue pairs to refine this graph.
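The "spatial graph" view can be made concrete with a minimal sketch: residues become nodes, and an edge joins any two residues whose coordinates lie within a distance cutoff. The 8 angstrom cutoff, the toy hairpin coordinates, and the function name are assumptions chosen for illustration. This is not AlphaFold's internal representation, which is learned by the network rather than computed from known coordinates.

```python
import math
from itertools import combinations

def spatial_graph(coords, cutoff=8.0):
    """Adjacency sets: residue index -> indices of residues within `cutoff` angstroms."""
    graph = {i: set() for i in range(len(coords))}
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= cutoff:
            graph[i].add(j)
            graph[j].add(i)
    return graph

# Toy hairpin: the chain runs out along x, then turns and runs back 5 angstroms away,
# so residues far apart in sequence end up close together in space.
toy_coords = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0), (15.2, 0.0, 0.0),
    (15.2, 5.0, 0.0), (11.4, 5.0, 0.0), (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0),
]

graph = spatial_graph(toy_coords)
print(sorted(graph[0]))
```

Residue 0 ends up linked both to its sequence neighbours and to residues 8 and 9 at the far end of the chain. Long-range contacts of this kind are what make the graph informative about the fold.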
Additionally, AlphaFold can predict which parts of each predicted protein structure are reliable using an internal confidence measure.
It uses approximately 128 TPUv3 cores (roughly equivalent to ~100-200 GPUs) run over a few weeks, which is a relatively modest amount of compute in the context of most large state-of-the-art models used in machine learning today.
An overview of the main neural network model architecture. The model operates over evolutionarily related protein sequences as well as amino acid residue pairs, iteratively passing information between both representations to generate a structure.
AlphaFold is a once in a generation advance, predicting protein structures with incredible speed and precision. This leap forward demonstrates how computational methods are poised to transform research in biology and hold much promise for accelerating the drug discovery process.
Arthur D. Levinson, PhD, Founder & CEO, Calico; Former Chairman & CEO, Genentech
As well as accelerating understanding of known diseases, we’re excited about the potential for these techniques to explore the hundreds of millions of proteins we don’t currently have models for – a vast terrain of unknown biology. Since DNA specifies the amino acid sequences that comprise protein structures, the genomics revolution has made it possible to read protein sequences from the natural world at massive scale – with 180 million protein sequences and counting in the Universal Protein database (UniProt). In contrast, given the experimental work needed to go from sequence to structure, only around 170,000 protein structures are in the Protein Data Bank (PDB).
Just as 50 years ago Anfinsen laid out a challenge far beyond science’s reach at the time, there are many aspects of our universe that remain unknown. The progress announced today gives us further confidence that AI will become one of humanity’s most useful tools in expanding the frontiers of scientific knowledge, and we’re looking forward to the many years of hard work and discovery ahead!