MIT pioneers technique that creates fitter proteins to study the brain

18 Apr 2024 --- Researchers from Massachusetts Institute of Technology (MIT), US, have created a new computational technique that makes it easier to engineer optimized proteins and to manipulate more complicated proteins, using smaller datasets than previously possible. The researchers plan to engineer proteins to measure electrical activity in the brain.

“Protein design is a hard problem because the mapping from DNA sequence to protein structure and function is really complex. There might be a great protein, ten changes away in the sequence, but each intermediate change might correspond to a totally nonfunctional protein,” says Ila Fiete, a professor of brain and cognitive sciences at MIT and one of the senior authors of the study.

“It’s like trying to find your way to the river basin in a mountain range, when there are craggy peaks along the way that block your view. The current work tries to make the riverbed easier to find.”

Predicting protein mutations
When engineering proteins with useful functions, researchers typically begin with a natural protein that has a desirable function, such as emitting fluorescent light. They then put it through many rounds of random mutation to arrive at an optimized version of the protein.

The process has produced optimized versions of many proteins, including green fluorescent protein (GFP). For many other proteins, however, an optimized version has proved harder to generate. MIT’s researchers developed a computational approach that makes it easier to predict mutations that will lead to better proteins, using a relatively small amount of data.

They began by training a model known as a convolutional neural network (CNN) on experimental data consisting of GFP sequences and their brightness, the feature they wanted to optimize. Using a small amount of experimental data from roughly 1,000 variants of GFP, the model created a “fitness landscape,” a 3D map showing the fitness of a given protein and how much it differs from the original sequence.

“Once we have this landscape that represents what the model thinks is nearby, we smooth it out and then we retrain the model on the smoother version of the landscape. Now there is a smooth path from your starting point to the top, which the model is now able to reach by iteratively making small improvements. The same is often impossible for unsmoothed landscapes,” says Andrew Kirjner, a graduate student at MIT and one of the study’s lead authors.

Creating fitter proteins
The computational model creates landscapes that contain peaks representing fitter proteins and valleys showing less fit proteins. The researchers used an existing computational technique to “smooth” the fitness landscape because predicting the path a protein needs to follow to reach the peaks of fitness can be difficult. Often, a protein must undergo a mutation that makes it less fit before it reaches a nearby peak of higher fitness.
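The effect of smoothing can be illustrated with a toy example. The sketch below is not the researchers' method: it assumes a made-up one-dimensional landscape, a simple moving average as the smoothing step, and a greedy search that only ever moves to a fitter neighbor. All names (`smooth`, `hill_climb`, `rugged`) are hypothetical and chosen for illustration only.

```python
def smooth(values, radius=1):
    """Moving-average smoothing (an assumed stand-in for the real technique):
    each point becomes the mean of itself and its neighbors within `radius`."""
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def hill_climb(landscape, start):
    """Greedy search: move to the fitter adjacent position until none improves."""
    position = start
    while True:
        neighbors = [p for p in (position - 1, position + 1)
                     if 0 <= p < len(landscape)]
        best = max(neighbors, key=lambda p: landscape[p])
        if landscape[best] <= landscape[position]:
            return position  # no neighbor is strictly fitter: a (local) peak
        position = best


# Toy landscape: fitness rises overall, but every second step is a small dip.
rugged = [2 * i + (3 if i % 2 == 0 else 0) for i in range(11)]

stuck_at = hill_climb(rugged, start=0)          # blocked by the first dip
reached = hill_climb(smooth(rugged), start=0)   # follows the smoothed slope

print(stuck_at, reached)  # prints: 0 10
```

On the rugged version the search never leaves its starting point, because the first step is downhill. On the smoothed version the small dips disappear and the same greedy search walks to the position that is also the highest point of the original landscape.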

Once the small bumps in the landscape were smoothed, they retrained the CNN model and found that it could reach greater fitness peaks more easily. The new model could predict optimized GFP sequences that differed by as many as seven amino acids from the protein sequence they started with. The best of these proteins were estimated to be about 2.5 times fitter than the original.

Using this computational approach, the researchers generated proteins with mutations that were predicted to lead to improved versions of GFP and of a protein from an adeno-associated virus, which is used to deliver DNA for gene therapy. They hope the approach can be used to develop additional tools for neuroscience research.

An open-access paper documenting the technique will be presented at the International Conference on Learning Representations, which runs 7–11 May.

Measuring neuron activity
Initially, the researchers were interested in developing proteins that can be used as voltage indicators in living cells. These proteins are produced by certain bacteria or algae and emit fluorescent light when an electric potential is detected. When engineered for use in mammalian cells, they allow scientists to measure neuron activity without electrodes.

Researchers use computational modeling to predict which proteins will work best because swapping in different amino acids at each position within a sequence yields an astronomically large number of possible variants of any given protein.
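A rough count shows why exhaustive testing is out of reach. The sketch below assumes the standard alphabet of 20 amino acids and, purely for illustration, a protein 238 residues long (the commonly cited length of wild-type GFP; the article does not state a length). The function names are hypothetical.

```python
from math import comb

AMINO_ACIDS = 20  # standard amino acid alphabet (assumption for this sketch)


def total_sequences(length):
    """Number of distinct sequences of the given length."""
    return AMINO_ACIDS ** length


def variants_within(length, max_mutations):
    """Number of sequences that differ from a fixed starting sequence
    at between 1 and `max_mutations` positions."""
    return sum(
        comb(length, k) * (AMINO_ACIDS - 1) ** k
        for k in range(1, max_mutations + 1)
    )


protein_length = 238  # illustrative value, not taken from the article

single_mutants = variants_within(protein_length, 1)
up_to_seven = variants_within(protein_length, 7)
everything = total_sequences(protein_length)

print(f"single mutants:        {single_mutants}")
print(f"up to seven mutations: {up_to_seven:.2e}")
print(f"all sequences:         a {len(str(everything))}-digit number")
```

Even when the search is limited to a handful of changes from the starting sequence, the number of candidates grows far beyond what a laboratory could test one by one, which is why a model trained on roughly 1,000 measured variants is attractive.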

“This work exemplifies the human serendipity that characterizes so much science discovery. We learned that some of our interests and tools in modeling how brains learn and optimize could be applied in the totally different domain of protein design, as being practiced in the Boyden lab,” says Fiete.

The researchers also showed that their approach works well in identifying new sequences for the viral capsid of adeno-associated virus (AAV), a viral vector commonly used to deliver DNA. In that case, they optimized the capsid for its ability to package a DNA payload.

“We used GFP and AAV as a proof-of-concept to show that this is a method that works on data sets that are very well-characterized, and because of that, it should be applicable to other protein engineering problems,” says Shahar Bracha, a postdoctoral researcher at MIT and co-author of the study.

The researchers plan to use this computational technique on data they have generated on voltage indicator proteins.

“Dozens of labs have been working on that for two decades, and still there isn’t anything better. The hope is that now, with the generation of a smaller data set, we can train a model in silico and make predictions that could be better than the past two decades of manual testing,” says Bracha.

By Inga de Jong