Statistics forms the foundation of data science. Anyone building a career in data science needs a firm grasp of statistical concepts and of how they apply in business settings. Probability distributions and their properties are one area of statistics that a data scientist has to understand clearly.
Let us take a look at a few of the most common distributions a data scientist encounters in their career.
Normal distribution
In a normal distribution, the data is arranged in a way that most of the values form a cluster in the middle and taper off in a symmetric fashion towards either extreme. It is also called a Gaussian distribution. It appears as a bell curve when shown graphically. In a standard normal distribution, the mean is zero, and the standard deviation takes the value of 1 along with a zero skew. The mean, median and mode are all the same in a normal distribution.
The midpoint of a normal distribution has the maximum frequency. The proportion of the area under the curve lying between the mean and any given distance from it is constant when that distance is measured in standard deviation units: roughly 68% of values fall within one standard deviation of the mean, 95% within two, and 99.7% within three.
Values from a normal distribution are often expressed as standard scores, or Z scores. A Z score gives the distance between an actual score and the mean in terms of standard deviations: it is the score minus the mean, divided by the standard deviation.
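As a rough illustration of Z scores and of the area under the bell curve, here is a minimal Python sketch that uses only the standard library. The helper names z_score and area_within, and the exam-score numbers, are made up for this example and do not come from any particular dataset.

```python
from statistics import NormalDist


def z_score(x, mean, std_dev):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / std_dev


def area_within(k):
    """Share of a normal curve lying within k standard deviations of the mean."""
    standard = NormalDist(mu=0.0, sigma=1.0)  # standard normal: mean 0, sd 1
    return standard.cdf(k) - standard.cdf(-k)


# Hypothetical exam scores with mean 70 and standard deviation 10.
print(z_score(85, mean=70, std_dev=10))  # 1.5

for k in (1, 2, 3):
    # Approximately 0.6827, 0.9545 and 0.9973
    print(k, round(area_within(k), 4))
```

Because the area depends only on the number of standard deviations, the same three proportions hold for any normal distribution, whatever its mean and spread.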
Bernoulli distribution
In a Bernoulli distribution, there are two possible values for the random variable. (A random variable is a variable whose value depends on the outcome of an experiment. Random variables are of two types: discrete and continuous.)
A Bernoulli distribution is a discrete distribution. It has two possible outcomes and a single trial (called a Bernoulli trial). A Bernoulli trial is one of the simplest experiments conducted in statistics. It comes with two possible outcomes, success and failure. Examples of Bernoulli trials include a coin toss, or a roll of a die where rolling a six counts as success and anything else as failure. The probabilities of the mutually exclusive events that make up all the possible outcomes have to sum to one.
The two possible outcomes in the Bernoulli distribution are indicated by n=0 and n=1. Here, n=1 indicating success has a probability p and n=0 indicating failure has a probability 1-p (0<=p<=1).
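The two probabilities described above can be sketched in a few lines of Python. The function names bernoulli_pmf and bernoulli_trial and the value p = 0.3 are illustrative choices, and the simulation is only one simple way to draw a trial.

```python
import random


def bernoulli_pmf(n, p):
    """Probability of outcome n (1 = success, 0 = failure) for success probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    if n == 1:
        return p
    if n == 0:
        return 1.0 - p
    return 0.0  # any other value is impossible


def bernoulli_trial(p, rng):
    """Run one trial: return 1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0


p = 0.3  # hypothetical success probability
rng = random.Random(42)  # seeded so the run is repeatable
trials = [bernoulli_trial(p, rng) for _ in range(10_000)]

print(bernoulli_pmf(1, p), bernoulli_pmf(0, p))
print(sum(trials) / len(trials))  # share of successes, close to p
```

Over many repeated trials, the share of successes settles near p, which is what the probability of success means in practice.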
Uniform distribution
Uniform distribution is one of the simplest statistical distributions to understand. It is a probability distribution in which all the possible outcomes are equally likely to occur. Graphically, we can think of it as a straight horizontal line. Uniform distributions are of two types: discrete and continuous.
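A discrete uniform distribution has a finite number of outcomes, each with the same probability, such as the six faces of a fair die. A continuous uniform distribution has an infinite number of possible outcomes within a range, and any two intervals of the same length inside that range are equally likely.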
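To make the two types concrete, the sketch below computes probabilities for a fair die (discrete) and for a value spread evenly over a range from a to b (continuous). The function names and the range 0 to 2 are assumptions made for this example.

```python
from fractions import Fraction


def discrete_uniform_pmf(outcomes):
    """Give each of the n listed outcomes the same probability 1/n."""
    n = len(outcomes)
    return {outcome: Fraction(1, n) for outcome in outcomes}


def continuous_uniform_pdf(x, a, b):
    """Flat density 1/(b - a) inside [a, b] and zero outside it."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def continuous_uniform_prob(lo, hi, a, b):
    """Probability that a value uniform on [a, b] falls between lo and hi."""
    lo, hi = max(lo, a), min(hi, b)  # clip the interval to the range
    return max(hi - lo, 0.0) / (b - a)


die = discrete_uniform_pmf([1, 2, 3, 4, 5, 6])
print(die[3])  # 1/6

print(continuous_uniform_pdf(0.3, a=0.0, b=2.0))  # 0.5
# Two intervals of equal length have equal probability.
print(continuous_uniform_prob(0.0, 0.5, a=0.0, b=2.0))  # 0.25
print(continuous_uniform_prob(1.5, 2.0, a=0.0, b=2.0))  # 0.25
```

The flat density is the horizontal line mentioned above: its height is the same everywhere in the range, so probability depends only on how wide an interval is.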
Poisson distribution
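A Poisson distribution is a probability distribution that shows how many times an event is likely to occur over a fixed interval of time or space, given that events occur independently and at a constant average rate. It is named after French mathematician Siméon Denis Poisson. It is a discrete distribution whose variable takes only non-negative integer values (0, 1, 2, and so on). It is a limiting case of the binomial distribution, reached as the number of trials grows large and the probability of success on each trial becomes small while their product stays fixed.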
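The link to the binomial distribution can be checked numerically. The sketch below uses the standard Poisson formula, P(k) = e^(-λ) λ^k / k!, where λ is the average number of events per interval, and compares it with a binomial distribution that has many trials and a small success probability. The average rate of 3 and the 10,000 trials are arbitrary values chosen for illustration.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


lam = 3.0    # hypothetical average of 3 events per interval
n = 10_000   # many trials ...
p = lam / n  # ... each with a small chance of success, so n * p = lam

for k in range(6):
    # The two columns should agree to several decimal places.
    print(k, round(poisson_pmf(k, lam), 5), round(binomial_pmf(k, n, p), 5))
```

Increasing n while keeping n * p equal to the same average brings the binomial values even closer to the Poisson ones.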
T-distribution
The t-distribution resembles the normal distribution and is used mainly when the sample size is small and the population standard deviation is unknown. It is also known as Student’s t-distribution. Like the standard normal distribution, it is bell-shaped and symmetrical with zero mean. Its shape changes with the degrees of freedom. It has greater dispersion than the standard normal distribution, with heavier tails. As the degrees of freedom increase, the distribution more closely approximates a standard normal distribution.
Student’s t-distribution ranges from –∞ to ∞. Some important applications of the t-distribution include the test of hypothesis for a population mean, the test of hypothesis for the difference between two means, and the test of hypothesis for the difference between two means with dependent samples.
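As an illustration of the first application, a test about a population mean, here is a minimal sketch of the one-sample t statistic, t = (sample mean - hypothesised mean) / (s / √n), with n - 1 degrees of freedom. The function name one_sample_t and the sample values are invented for the example. Turning the statistic into a p-value needs the t-distribution's cumulative probabilities, which Python's standard library does not provide, so the sketch stops at the statistic.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return the t statistic and degrees of freedom for H0: population mean == mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    t = (mean - mu0) / (s / math.sqrt(n))
    return t, n - 1


# Hypothetical small sample; the population standard deviation is unknown.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
t_stat, dof = one_sample_t(sample, mu0=4)
print(round(t_stat, 4), dof)  # 1.3229 7
```

The statistic is then compared against a t-distribution with the reported degrees of freedom, for example through a printed t-table or a statistics package.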
Log-normal distribution
A log-normal distribution is the probability distribution of a random variable whose logarithm is normally distributed. A log-normally distributed random variable takes only positive real values.
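Both properties can be checked with a short simulation. The sketch below draws values with lognormvariate from Python's random module and then takes their logarithms. The helper name sample_lognormal and the parameters mu = 0 and sigma = 0.5 are arbitrary choices for illustration.

```python
import math
import random
import statistics


def sample_lognormal(mu, sigma, size, seed=0):
    """Draw log-normal values whose logarithm has mean mu and standard deviation sigma."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [rng.lognormvariate(mu, sigma) for _ in range(size)]


samples = sample_lognormal(mu=0.0, sigma=0.5, size=10_000)
logs = [math.log(x) for x in samples]

print(min(samples) > 0)        # every value is positive
print(statistics.mean(logs))   # close to mu
print(statistics.stdev(logs))  # close to sigma
```

The logarithms behave like draws from a normal distribution with the chosen mean and standard deviation, while the original values stay strictly positive.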