Though the core data science skillset stays the same, its implementation varies from problem to problem. That is where domain knowledge comes into play. Any sector that produces data can use data science to make better decisions, overcome challenges and identify opportunities. Molecular biology is one of the more recent fields where data analytics is applied extensively.
In this article we will go through a brief introduction to bioinformatics, also referred to as computational biology, from the point of view of a beginner data scientist.
What is bioinformatics?
As the name indicates, bioinformatics deals with the computational analysis of biological data at a molecular level. It is a crossover of biology, computer science, statistics and mathematics, disciplines that are not usually studied together. Typically, an expert in one of these specialities decides to pursue bioinformatics and then has to become familiar with the remaining disciplines. This can be a difficult task; hence, this article aims to assist enthusiasts who have a competent computational and statistical background and are looking to get into bioinformatics.
The life sciences produce a plethora of data, and computational tools and frameworks are needed to manage this data and make it more readable and accessible. Bioinformatics provides these tools and techniques, which in turn require a good understanding of the problem’s domain. So what type of data are we talking about? Though the data usually takes the form of string sequences or numerical expression values of genes and proteins, its meaning can vary depending on the source and the perturbation of the data. These data types are discussed in more detail later in the article.
Why do we need quantitative computation in bioinformatics?
A significant amount of research is being carried out to understand basic human body functions and to deduce how the body reacts to perturbations. For this purpose, the cell behaviour of a healthy entity is compared with that of a perturbed entity; the differences in behaviour are a useful resource for developing drugs to deal with the perturbation. However, the data produced at the cell level is high-dimensional.
For instance, the activity of a single cell of one organism can produce sequences ranging from 450 to 100,00 genes. Hence, to handle such sensitive, noisy, high-dimensional data, it is imperative to use data analysis tools developed to store, analyse and compute on this data efficiently.
Types of data you can come across in bioinformatics
Gene sequences
Most of the data one comes across in bioinformatics consists of nucleic acid sequences written in the letters A, C, G and T, which stand for Adenine, Cytosine, Guanine and Thymine. These sequences can represent a single gene or the whole DNA. The bases occur in pairs (A with T, and C with G), so only one strand of the sequence is recorded; the other strand can be derived from the pairing rules. If you are dealing with sequences, most of your work will be identifying repetitive patterns, recognizing the protein formation pattern in different sequence strips and pinpointing differences when comparing sequence strips from a healthy cell and a perturbed cell.
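The pairing rules and the kind of pattern work described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production tool: the function names (`reverse_complement`, `count_kmers`, `mismatch_positions`) and the two short sequences are invented for this example, and a real project would more likely rely on an established library such as BioPython.

```python
from collections import Counter

# Pairing rules: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the opposite strand, read in its own direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def count_kmers(seq, k):
    """Count every overlapping substring of length k (a 'k-mer')."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def mismatch_positions(seq_a, seq_b):
    """Return 0-based positions where two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Toy sequences invented for illustration, not real genes.
healthy = "ATGCGATACGCTTGA"
perturbed = "ATGCGATTCGCTTGA"

other_strand = reverse_complement(healthy)
# Keep only the 2-letter patterns that occur more than once.
repeats = {kmer: n for kmer, n in count_kmers(healthy, 2).items() if n > 1}
differences = mismatch_positions(healthy, perturbed)

print(other_strand)
print(repeats)
print(differences)
```

Taking the reverse complement twice returns the original strand, which is why recording a single strand loses no information.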
Gene expressions
Every human’s biological data is hard-coded in their genes, which act as a guide to how the body will react to any action. A vast amount of the information that lies in an individual’s genes is yet to be discovered. Gene expression refers to the messenger RNA level of a gene at a certain time point and under a certain perturbation.
These values are numerical and represent the so-called expression of a gene at a certain time point. It has been shown biologically that, in a set of genes at a particular location, a few genes act as “regulatory genes” and the remaining genes are “target genes”. The regulatory genes can be seen as supervisors that control the expression of the target genes. For instance, X → Y means that gene X regulates gene Y.
A lot of research is being carried out to find these regulatory and target relationships between genes. Gene expression data suffers from high dimensionality, also referred to as the “curse of dimensionality”: the ratio of data points to data features is very small, because there are thousands of genes with their respective expression values, whereas recordings still typically cover only 10 to 30 time points.
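To make this imbalance concrete, the sketch below builds a simulated expression matrix with NumPy. The sizes (200 genes, 20 time points) and the random values are assumptions chosen for illustration, not real measurements. It shows that, once each gene is centred on its mean, the data cannot carry more than `n_timepoints - 1` independent directions, however many genes are measured.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n_timepoints = 20   # samples (rows): assumed, within the 10 to 30 range
n_genes = 200       # features (columns): assumed, real data has thousands

# Simulated expression matrix: one row per time point, one column per gene.
expression = rng.normal(size=(n_timepoints, n_genes))

# Ratio of data points to data features.
ratio = n_timepoints / n_genes

# Centre each gene (column) on its mean across time points.
centred = expression - expression.mean(axis=0)

# The rank of the centred matrix equals the rank of the gene-by-gene
# covariance matrix, so it is at most n_timepoints - 1.
rank = np.linalg.matrix_rank(centred)

print(f"samples/features ratio: {ratio:.2f}")
print(f"rank of centred data: {rank} (number of genes: {n_genes})")
```

Because the rank stays far below the number of genes, the gene-by-gene covariance matrix is singular. This is one reason methods applied to such data usually need some form of regularization or prior knowledge.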
If you are working with gene expression data, you will spend most of your time on representation models of gene regulatory networks, optimizing these models and dealing with computational complexity.
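One common way to represent a gene regulatory network is as a directed graph stored in an adjacency matrix. The sketch below is only an illustration of that idea: the gene names, the edges and the helper functions `targets_of` and `regulators_of` are all hypothetical.

```python
# Hypothetical gene names and regulatory edges, invented for illustration.
genes = ["X", "Y", "Z", "W"]
edges = [("X", "Y"), ("X", "Z"), ("Z", "W")]  # (regulator, target) pairs

index = {gene: i for i, gene in enumerate(genes)}

# adjacency[i][j] == 1 means gene i regulates gene j.
adjacency = [[0] * len(genes) for _ in genes]
for regulator, target in edges:
    adjacency[index[regulator]][index[target]] = 1


def targets_of(gene):
    """Genes directly regulated by `gene` (one row of the matrix)."""
    row = adjacency[index[gene]]
    return [genes[j] for j, flag in enumerate(row) if flag]


def regulators_of(gene):
    """Genes that directly regulate `gene` (one column of the matrix)."""
    col = index[gene]
    return [genes[i] for i in range(len(genes)) if adjacency[i][col]]


# Number of possible directed edges between distinct genes: n * (n - 1).
possible_edges = len(genes) * (len(genes) - 1)

print(targets_of("X"))
print(regulators_of("W"))
print(possible_edges)
```

The number of candidate edges grows quadratically with the number of genes, which is where much of the computational complexity mentioned above comes from.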
Popular data science tools and databases for bioinformatics
Databases for bioinformatics
- GenBank: Genetic sequence database from NCBI
- EMBL-EBI: Nucleotide Sequence Database
- UniProt: Protein sequence database
- GEO Database: Gene expression profiles from NCBI
- Expression Atlas: Gene expression across species and biological conditions
The top three tools/programming languages used by computational biologists are:
- Python: BioPython, Biotite, Scikit-Bio, SciPy
- R: CROME, InterMineR, rScudo, Repo
- Matlab: Bioinformatics Toolbox
Though the choice of tool depends largely on an individual’s background, Matlab does have an edge for visualization.
Where to begin?
Now that we have the basics laid out, let’s discuss the ideal way to approach a bioinformatics project to begin with.
Step 1: Identify the data type and the problem definition related to that data type
Step 2: Research the biology underlying the data type to improve your domain knowledge
Step 3: Data preparation: identify the database to be used, along with the required data points or data features. It is advisable to start with a small dataset such as the 5-gene IRMA network.
Step 4: Lay out the analysis solution in pseudocode to ensure you understand the problem statement and how the solution works
Step 5: Code your analysis, compare your results to the ground truth and draw your conclusions
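For a network inference project, the comparison with the ground truth in the final step is often summarised with precision and recall over the predicted edges. The sketch below assumes that both networks are available as sets of (regulator, target) pairs. The gene labels G1 to G5 and all edges are placeholders, not the actual IRMA topology, and `precision_recall` is a helper written for this example.

```python
def precision_recall(predicted, truth):
    """Score a set of predicted edges against the ground-truth edges."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Placeholder edges as (regulator, target) pairs; not real biology.
ground_truth = {("G1", "G2"), ("G2", "G3"), ("G3", "G4"), ("G4", "G5")}
predicted = {("G1", "G2"), ("G2", "G3"), ("G5", "G1")}

precision, recall = precision_recall(predicted, ground_truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision is the fraction of predicted edges that are correct, and recall is the fraction of true edges that were recovered. Reporting both guards against a method that looks good only because it predicts very few or very many edges.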
Bioinformatics can be challenging for any researcher with a non-medical background; jumping to solutions without an appropriate understanding of the underlying problem will significantly increase the complexity of the analysis. In this article, we scratched the surface of bioinformatics from the point of view of an intermediate data scientist, in order to lay a good foundation for those who have set out to pursue computational biology with a non-medical background.