The Enformer vs the Basenji – The AI algorithms for gene expression predictions

Enformer, a deep learning model based on Transformers, advances genetic research by predicting how DNA sequences influence gene expression.

DeepMind and Calico, an Alphabet company, introduced a neural network architecture called Enformer that substantially improves the accuracy of predicting gene expression from DNA sequence.

In the paper “Effective gene expression prediction from sequence by integrating long-range interactions”, published in Nature Methods, the authors report that Enformer predicts gene expression more accurately than Basenji2, the earlier model.

Basenji2 and its limitations

The basic building blocks of gene expression prediction models have typically been convolutional neural networks. These networks, however, have been limited in how well they can model the effects of distal enhancers on gene expression.

DeepMind builds on Basenji2, which is implemented in TensorFlow. TensorFlow offers a variety of benefits, including distributed computing and a large, active developer community. Basenji2 itself is designed to predict quantitative signals using regression loss functions, rather than binary signals using classification loss functions.
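The difference between the two kinds of loss can be sketched in a few lines of Python. This is an illustration only, not Basenji2's code: it assumes a Poisson negative log-likelihood as the regression loss (a common choice for count-like signals) and binary cross-entropy as the classification loss. The function names and the numbers are made up for the example.

```python
import math

EPS = 1e-8  # avoids log(0)

def poisson_nll(predicted, observed):
    """Mean Poisson negative log-likelihood, dropping the term that
    does not depend on the prediction. Assumed regression loss."""
    total = sum(p - o * math.log(p + EPS) for p, o in zip(predicted, observed))
    return total / len(observed)

def binary_cross_entropy(probabilities, labels):
    """Mean binary cross-entropy. Assumed classification loss."""
    total = -sum(
        l * math.log(p + EPS) + (1 - l) * math.log(1 - p + EPS)
        for p, l in zip(probabilities, labels)
    )
    return total / len(labels)

# Made-up signal values for four genomic bins (illustration only).
observed_signal = [0.0, 3.0, 12.0, 1.0]
close_prediction = [0.1, 2.8, 11.5, 1.2]
flat_prediction = [5.0, 5.0, 5.0, 5.0]

# Regression: the loss rewards getting the magnitude right.
loss_close = poisson_nll(close_prediction, observed_signal)
loss_flat = poisson_nll(flat_prediction, observed_signal)

# Classification: the signal is first reduced to "active or not",
# so a bin with signal 3 and a bin with signal 12 look identical.
binary_labels = [1 if value > 0 else 0 for value in observed_signal]
loss_binary = binary_cross_entropy([0.1, 0.9, 0.9, 0.9], binary_labels)
```

With the regression loss, the prediction that tracks the observed magnitudes scores better than the flat one, which is information a binary label cannot carry.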

A key strength of Basenji is that it could predict regulatory activity from DNA sequences of about 40,000 base pairs at a time, although this still leaves more distant enhancers out of reach.

Enformer’s advances

Enformer, on the other hand, relies on Transformers, an architecture from Google that is common in natural language processing. Transformers use self-attention mechanisms, which allow the model to integrate much more DNA context. Just as Transformers can read long text passages, DeepMind adapted them to read DNA sequences of vastly extended length.
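To make the idea of self-attention over DNA concrete, here is a minimal sketch of single-head scaled dot-product attention applied to a one-hot encoded sequence. It is not Enformer's architecture: the helper names, the random weights, the tiny sequence, and the absence of positional information are all simplifying assumptions for illustration.

```python
import numpy as np

def one_hot_dna(sequence):
    """Encode a DNA string as a (length, 4) matrix in A, C, G, T order."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    encoded = np.zeros((len(sequence), 4))
    for position, base in enumerate(sequence):
        encoded[position, index[base]] = 1.0
    return encoded

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    Every position computes a weight for every other position,
    so distant positions can influence each other in one step.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
x = one_hot_dna("ACGTTGCA")  # toy sequence, 8 positions
dim = 8                       # hypothetical embedding size
w_q, w_k, w_v = (rng.normal(size=(4, dim)) for _ in range(3))

output, attention_weights = self_attention(x, w_q, w_k, w_v)
# attention_weights[i, j] is how much position i attends to position j.
```

Because the attention matrix connects every pair of positions directly, its reach is not limited by a stack of local filters the way a convolutional network's is. The cost is that the matrix grows with the square of the sequence length.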

Enformer outperformed the best team in the Critical Assessment of Genome Interpretation challenge (CAGI5) for noncoding variant interpretation, despite receiving no additional training. Furthermore, Enformer learned to predict promoter-enhancer interactions directly from DNA sequence, and was competitive with methods that take direct experimental data as input.

For training, DeepMind used Sonnet, its library for constructing neural networks in TensorFlow. It is defined in

DeepMind pre-computed variant effect scores for all frequent variants (MAF > 0.5% in any population) in the 1000 Genomes Project and stored them in one HDF5 file per chromosome, using the HG19 reference genome. Additionally, they provide the top 20 principal components of the variant effect scores per chromosome in a tabix-indexed TSV file (also HG19). These files have the following columns:

  • #CHROM – chromosome (e.g. chr1)
  • POS – variant position (1-based)
  • ID – dbSNP identifier
  • REF – reference allele (e.g. A)
  • ALT – alternate allele (e.g. T)
  • PC{i} – i-th principal component of the variant effect prediction.
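A file with these columns can be read with Python's standard library, as the sketch below suggests. The sample rows, the rsIDs, the values, and the column names PC0 and PC1 are invented for illustration; the real files may number the principal components differently. The function name `read_variant_pcs` is likewise hypothetical.

```python
import csv
import io

# Invented rows that follow the column layout described above.
sample_tsv = (
    "#CHROM\tPOS\tID\tREF\tALT\tPC0\tPC1\n"
    "chr1\t12345\trs0000001\tA\tT\t0.12\t-0.03\n"
    "chr1\t67890\trs0000002\tG\tC\t-0.40\t0.25\n"
)

def read_variant_pcs(handle):
    """Parse variant rows into dictionaries with typed fields."""
    reader = csv.DictReader(handle, delimiter="\t")
    records = []
    for row in reader:
        # Collect every PC{i} column, ordered by its index i.
        pc_columns = sorted(
            (name for name in row if name.startswith("PC")),
            key=lambda name: int(name[2:]),
        )
        records.append({
            "chrom": row["#CHROM"],
            "pos": int(row["POS"]),  # 1-based, as stated in the column list
            "id": row["ID"],
            "ref": row["REF"],
            "alt": row["ALT"],
            "pcs": [float(row[name]) for name in pc_columns],
        })
    return records

records = read_variant_pcs(io.StringIO(sample_tsv))
```

For the real tabix-indexed files, a decompressed text stream for the region of interest could be passed in place of the `io.StringIO` object.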

Hopefully, these advances will enable better mapping of the growing number of human disease associations to cell-type-specific gene regulatory mechanisms, and provide a framework for understanding how cis-regulatory evolution works.

Sohini Das
Sohini graduated from the University of Kalyani with a master's degree in nanosciences and nanotechnology. She hopes to become a tech journalist one day. Her work focuses on digital transformation, geopolitics, and emerging technologies.