July 12th, 2024

Ex-Meta scientists debut gigantic AI protein design model

EvolutionaryScale introduces ESM3, a powerful AI protein design model trained on billions of sequences. The company has secured $142 million in funding for drug development, has addressed concerns about AI-designed proteins, and researchers anticipate the model's impact.

Read original article

EvolutionaryScale, founded by ex-Meta scientists, has introduced ESM3, a large AI protein design model. Trained on over 2.7 billion protein sequences and structures, the model can generate new proteins to user specifications. The company secured $142 million in funding to expand its applications in drug development and sustainability. By leveraging AI, EvolutionaryScale aims to make biology programmable, and it has already demonstrated the design of new fluorescent proteins; the model could reshape medicine by designing entirely new proteins. To address concerns about the potential weaponization of AI-designed proteins, EvolutionaryScale has taken mitigation measures, including excluding certain sequences from the training data. Researchers are excited about the possibilities ESM3 offers and anticipate its impact on fields such as drug development and sustainability. An open-source version allows for collaboration and experimentation, although the largest version requires significant computing resources to replicate independently.

Related

Are AlphaFold's new results a miracle?

AlphaFold 3 by DeepMind excels in predicting molecule-protein binding, surpassing AutoDock Vina. Concerns about data redundancy, generalization, and molecular interaction understanding prompt scrutiny for drug discovery reliability.

ESM3, EsmGFP, and EvolutionaryScale

EvolutionaryScale introduces ESM3, a language model simulating 500 million years of evolution. ESM3 designs proteins with atomic precision, including esmGFP, a novel fluorescent protein, showcasing its potential for innovative protein engineering.

How AI Revolutionized Protein Science, but Didn't End It

Artificial intelligence, exemplified by AlphaFold2 and AlphaFold3, revolutionized protein science by accurately predicting protein structures. Despite advancements, AI complements rather than replaces biological experiments, highlighting the complexity of simulating protein dynamics.

New technique opens the door to large-scale DNA editing to cure diseases

Researchers have described a new genetic editing mechanism using jumping genes to insert DNA sequences accurately. This system shows promise in overcoming CRISPR limitations, with 94% accuracy and 60% efficiency in bacteria. Optimizations are needed for mammalian cell use.

13 comments
By @zacharyvoase - 7 months
Reading this it sounds like 'AI' is when you build a heuristic model (which we've had for a while now) but pass some threshold of cost in terms of input data, GPUs, energy, and training.

The classical approach was to understand how genes transcribe to mRNA, and how mRNA translates to polypeptides; how those are cleaved by the cell, and fold in 3D space; and how those 3D shapes result in actual biological function. It required real-world measurement, experiment, and modeling in silico using biophysical models. Those are all hard research efforts. And it seems like the mindset now is: we've done enough hard research, let's feed what we know into a model, hope we've chosen the right hyperparameters, and see what we get. Hidden in the weights and biases of the model will be that deeper map of the real world that we have not yet fully grasped through research.
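The second step of that classical pipeline, mRNA translating to polypeptides, can be sketched in a few lines. This is an illustrative toy, assuming only a handful of codons from the standard genetic code:

```python
# Toy sketch of mRNA -> polypeptide translation via the standard codon
# table (only a few codons included; the real table has 64 entries).
CODON_TABLE = {
    "AUG": "M",               # start codon -> methionine
    "UUU": "F", "UUC": "F",   # phenylalanine
    "GGU": "G", "GGC": "G",   # glycine
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}

def translate(mrna: str) -> str:
    """Translate an mRNA string into a one-letter amino-acid sequence."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE.get(mrna[i:i + 3], "X")  # "X" marks unknown codons
        if aa == "*":  # a stop codon terminates translation
            break
        peptide.append(aa)
    return "".join(peptide)

print(translate("AUGUUUGGUUAA"))  # -> "MFG"
```

The later steps (cleavage, folding, function) are exactly the ones with no such simple lookup table, which is the commenter's point.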

But the AI cannot provide a 'why'. Its network of weights and biases are as unintelligible to us as the underlying scientific principles of the real world we gave up trying to understand along the way. When AI produces a result that is surprising, we still have to validate it in the real world, and work backwards through the hard research to understand why we are surprised.

If AI is just a tool for a shotgun approach to discovery, that may be fine. However, I fear it is sucking a lot of air out of the room from the classical approaches. When 'AI' produces incorrect, misleading, or underwhelming results? Well, throw more GPUs at it; more tokens; more joules; more parameters. We have blind faith it'll work itself out.

But because the AI can never provide a guarantee of correctness, it is only useful to those with the infrastructure to carry out those real-world validations on its output, so it's not really going to create a paradigm shift. It can provide only a marginal improvement at the top of the funnel for existing discovery pipelines. And because AI is very expensive and getting more so, there's a pretty hard cap on how valuable it would be to a drugmaker.

I know I'm not the only one worried about a bubble here.

By @trott - 7 months
> matching less than 60% of the sequence of the most closely related fluorescent protein

> When the researchers made around 100 of the resulting designs, several were as bright as natural GFPs, which are still vastly dimmer than lab-engineered variants.

So they didn't come up with better functionality, unlike what some commentators imply. They basically introduced a bunch of mutations while preserving the overall function.

Relevant: https://en.wikipedia.org/wiki/Conservative_replacement

By @Spacecosmonaut - 7 months
Very nice work. We need brighter fluorescent protein tags that are more compact, in particular in the far red spectrum. The size of current fluorescent protein coding DNA sequences is out of reach of prime editing and still relies on less efficient gene editing technology.
By @throwaway24124 - 7 months
Are there any good resources for understanding models like this? Specifically a "protein language model". I have a basic grasp on how LLMs tokenize and encode natural language, but what does a protein language actually look like? An LLM can produce results that look correct but are actually incorrect, how are proteins produced by this model validated? Are the outputs run through some other software to determine whether the proteins are valid?
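To make the question concrete: in published protein language models (the ESM family included), the "language" is typically just the twenty amino-acid letters plus a few special tokens, with each residue treated as one token. A minimal sketch of such a tokenizer, with an invented vocabulary that is not ESM3's actual one:

```python
# Illustrative sketch of a protein-language-model tokenizer.
# The vocabulary layout here is hypothetical, not ESM3's real one.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
SPECIAL = ["<cls>", "<eos>", "<mask>", "<unk>"]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL + list(AMINO_ACIDS))}

def tokenize(seq: str) -> list[int]:
    """Map a protein sequence to token ids, one id per residue."""
    ids = [VOCAB["<cls>"]]
    ids += [VOCAB.get(aa, VOCAB["<unk>"]) for aa in seq]
    ids.append(VOCAB["<eos>"])
    return ids

# The start of GFP's sequence becomes a short list of integers:
print(tokenize("MSKGEE"))
```

As for validation: candidate sequences are generally screened in silico (e.g. with structure predictors) and then, as in the article, actually synthesized and assayed in the lab, which is how the ~100 esmGFP designs were tested for brightness.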
By @thebeardisred - 7 months
> For a smaller open-source version, certain sequences, such as those from viruses and a US government list of worrying pathogens and toxins, were excluded from training. Neither can ESM3-open — which scientists anywhere can download and run independently — be prompted to generate such proteins.

That sounds like a glove being thrown down.

By @flobosg - 7 months
> However, its amino-acid sequence is vastly different, matching less than 60% of the sequence of the most closely related fluorescent protein in its training data set.

Not to downplay this achievement, but 60% sequence identity is nowhere near “vastly different”.
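For readers unfamiliar with the metric being debated: sequence identity is just the fraction of matching positions between two aligned sequences. A minimal sketch, assuming the sequences are already aligned to equal length (real comparisons first compute an alignment, e.g. Needleman-Wunsch, and handle gaps):

```python
# Minimal sketch of percent sequence identity for two pre-aligned,
# equal-length sequences; "-" denotes an alignment gap.
def percent_identity(a: str, b: str) -> float:
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    return 100.0 * matches / len(a)

# Two made-up 10-residue fragments differing at 2 positions:
print(percent_identity("MSKGEELFTG", "MSKGAELFSG"))  # -> 80.0
```

Whether 60% identity counts as "vastly different" depends on context: random protein pairs align at far lower identity, which is why the commenter finds the article's phrasing generous.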

By @trhway - 7 months
Tangential - the laws of nature discovered by our brain usually involve just a few quantities, like f=ma. On one side this is a great capacity of our brain for analytical reduction; on the other side it is just an inability to deal with complex multiparameter phenomena without such a reduction. I wonder if, by pumping more and more data into the NNs, we'd be able to distill emerging multiparameter correlations that happen to be new laws of nature, irreducible to simpler ones.
By @btbuildem - 7 months
I wonder if this will ever come full circle (or.. spiral) and the AI tools we've created will in turn lead the way to discovering / inventing new proteins / cells / life forms that eventually outsmart and outcompete us.
By @nightowl_games - 7 months
Pretty sure the All In pod guys were using this company as an example of the AI bubble being overinflated.

"It's just three guys and a model and they think it's worth X hundred million"

By @yeutterg - 7 months
Just want to congratulate Tom Hayes on the big launch!
By @brcmthrowaway - 7 months
Can protein models help cure prion diseases?
By @yieldcrv - 7 months
distributed protein folding was a waste of time and energy

we just had to wait for this approach to become apparent

By @yldedly - 7 months
"Rives sees ESM3’s generation of new proteins by iterating through various sequences as analogous to evolution."

Except for the part where a sequence is actually deemed more fit, i.e. natural selection? And the part where mutations are random, instead of sampled from the training-data manifold, so much more constrained?

...so really it's a worse version of random search?