On July 20, 2022, Chinese biotech firm Helixon launched OmegaFold, the first computational method to successfully predict high-resolution protein structure from a single primary sequence. The study by Chinese researchers fills a frequently encountered gap in structure prediction and inches closer to understanding how proteins fold in nature.
Recently, the company open-sourced its project, joining the likes of DeepMind’s AlphaFold, RoseTTAFold, and Meta AI’s ESMFold, among others, which are also open source. The initial version of the code and model is available on GitHub.
Understanding protein folding helps researchers and scientists identify the underlying causes of many diseases and abnormalities. It also helps them find cures and design new medicines, pharmaceutical solutions, and alternative treatments.
Helixon claims that the new model outperforms RoseTTAFold and achieves prediction accuracy similar to AlphaFold 2 on recently released structures. In a study, the researchers said they had combined a protein language model, which allows them to make predictions from single sequences, with a geometry-inspired transformer model trained on protein structures.
In addition, OmegaFold enables accurate predictions on orphan proteins that do not belong to any functionally characterised protein family, and on antibodies, which tend to have noisy MSAs (multiple sequence alignments) due to fast evolution.
OmegaFold vs AlphaFold vs ESMFold
A month ago, Meta AI launched ESMFold, a breakthrough model based on Evolutionary Scale Modelling, or ESM, for faster protein structure prediction. This model, too, claimed accuracy similar to AlphaFold 2 and RoseTTAFold, and its faster inference enables the exploration of the structural space of metagenomic proteins.
There seem to be clear similarities between ESMFold, AlphaFold, and OmegaFold. The team said that the overall OmegaFold model is conceptually inspired by advances in language models for NLP, coupled with the deep neural networks used in AlphaFold 2.
OmegaFold leverages a deep transformer-based protein language model, trained on a large collection of unaligned and unlabeled protein sequences, to learn single- and pairwise-residue representations as powerful features that model the distribution of sequences.
The Omega protein language model (PLM) can capture structural and functional information encoded in the amino-acid sequences through the embeddings. These are later fed into Geoformer, a new geometry-inspired transformer neural network, to distill the structural and physical pairwise relationships between amino acids. Finally, a structural module predicts the 3D coordinates of all heavy atoms.
ESMFold, on the other hand, leverages a large-scale language model for protein structure prediction. Its improvements in language modelling perplexity and structure learning continue as the model scales up to 15 billion parameters. Meanwhile, AlphaFold uses a neural network-based architecture, and its training draws on evolutionary, physical and geometric constraints of protein structures.
The researchers noted that their model (OmegaFold) performs well on the CASP and CAMEO benchmark datasets, which span a wide range of prediction difficulty levels. With only a single sequence as input, OmegaFold was as accurate as the advanced MSA-based methods, including AlphaFold 2 and RoseTTAFold.
OmegaFold structures had a mean local-distance difference test (LDDT) score of 0.82 on the CAMEO dataset, comparable to RoseTTAFold structures (0.75 mean LDDT score) and similar to AlphaFold 2 structures (0.86 mean LDDT score) predicted from MSAs. LDDT is a commonly used metric for structure evaluation.
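To give a feel for what an LDDT score measures, here is a minimal sketch in Python. It is a simplified version that uses one point per residue (for example, the C-alpha atom), whereas the full metric works on all atoms and includes stereochemistry checks. The function names, the 15 Å inclusion radius, and the 0.5, 1, 2 and 4 Å tolerance thresholds are the commonly cited defaults for the metric. They are assumptions of this sketch, not details from the OmegaFold study.

```python
import math


def pairwise_distances(coords):
    """Return the full matrix of Euclidean distances between 3D points."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def lddt_ca(reference, model, radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified global LDDT on one point per residue.

    For every residue pair closer than `radius` in the reference, check
    whether the model reproduces that distance within each tolerance.
    The score is the fraction of preserved distances, averaged over the
    tolerances. No superposition of the two structures is needed.
    """
    if len(reference) != len(model):
        raise ValueError("reference and model must have the same length")
    ref_d = pairwise_distances(reference)
    mod_d = pairwise_distances(model)
    considered = 0
    preserved = 0
    n = len(reference)
    for i in range(n):
        for j in range(i + 1, n):
            if ref_d[i][j] < radius:
                considered += 1
                diff = abs(ref_d[i][j] - mod_d[i][j])
                # Count how many tolerance levels this distance satisfies.
                preserved += sum(1 for t in thresholds if diff < t)
    if considered == 0:
        return 0.0
    return preserved / (considered * len(thresholds))


if __name__ == "__main__":
    # Toy four-residue chain with 3.8 Å spacing (hypothetical coordinates).
    ref = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
    bad = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (14.4, 0.0, 0.0)]
    print(lddt_ca(ref, ref))
    print(lddt_ca(ref, bad))
```

Because the score compares distances within each structure rather than atom positions directly, moving or rotating the whole model does not change it. Only local distortions lower the score.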
On the CASP dataset, OmegaFold structures were also quite accurate, with an average TM-score of 0.79, slightly lower than that of RoseTTAFold structures (0.81 mean TM-score) and equivalent to AlphaFold 2 structures (0.79 mean TM-score). Meanwhile, ESMFold achieved a TM-score of 0.71 on the CAMEO test set and 0.53 on the CASP dataset. TM-score is a common metric for assessing the topological similarity of protein structures.
A score above 0.90 is considered roughly equivalent to the experimentally determined structure.
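For readers curious how a TM-score is computed, the sketch below applies the standard formula: the average over residues of 1 / (1 + (d / d0)^2), where d is the distance between matching residues and d0 is a length-dependent scale. It assumes the two structures are already superimposed and aligned residue by residue. Real tools also search for the superposition that maximises the score, which this sketch does not do. The function names and the 0.5 Å floor on d0 for short chains are assumptions made for illustration.

```python
import math


def tm_d0(target_length, floor=0.5):
    """Length-dependent distance scale d0 from the TM-score formula."""
    if target_length <= 15:
        # The cube-root term is undefined or negative here; use the floor.
        return floor
    return max(floor, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score_superimposed(reference, model):
    """TM-score for two pre-superimposed, residue-aligned structures.

    Each input is a list of (x, y, z) tuples, one per residue.
    """
    if len(reference) != len(model):
        raise ValueError("reference and model must have the same length")
    length = len(reference)
    d0 = tm_d0(length)
    total = 0.0
    for ref_xyz, mod_xyz in zip(reference, model):
        d = math.dist(ref_xyz, mod_xyz)
        total += 1.0 / (1.0 + (d / d0) ** 2)
    return total / length


if __name__ == "__main__":
    # Hypothetical 100-residue straight chain, then a copy displaced by d0.
    ref = [(3.8 * i, 0.0, 0.0) for i in range(100)]
    d0 = tm_d0(100)
    moved = [(x, y + d0, z) for (x, y, z) in ref]
    print(tm_score_superimposed(ref, ref))
    print(tm_score_superimposed(ref, moved))
```

The scale d0 grows with chain length, so a deviation of a few ångströms costs less in a large protein than in a small one. This is what makes the score comparable across proteins of different sizes.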
On single-sequence input, OmegaFold wins
Over the years, several companies have used deep learning to exploit the evolutionary information in MSAs (multiple sequence alignments) to predict protein structures accurately. However, MSAs of homologous proteins are not always available, as is the case for orphan proteins and antibodies, and in a natural setting a protein typically folds from its primary amino acid sequence into its 3D structure. The OmegaFold team suggested that evolutionary information and MSAs should not be necessary to predict a protein’s folded form.
This is where the new ‘super fast’ protein prediction model OmegaFold comes into the picture. It outperformed AlphaFold 2 and RoseTTAFold on single-sequence inputs. Further, OmegaFold achieved much higher prediction accuracy than AlphaFold 2 on both antibody loops and orphan proteins, likely due to the advantages of its single-sequence-based prediction method.