Major data distributions a data scientist should know

Data distributions and their properties are one area of statistics that a data scientist needs to understand clearly.

Statistics forms the foundation of data science. Anyone building a career in data science needs a firm grasp of statistical concepts and an understanding of how they can be applied in business settings.

Let us take a look at a few of the most common distributions a data scientist encounters in their career.

Normal distribution

In a normal distribution, most of the values cluster around the middle and taper off symmetrically towards either extreme. It is also called a Gaussian distribution, and it appears as a bell curve when plotted. Every normal distribution is symmetric, so its skew is zero and its mean, median and mode are all the same. The standard normal distribution is the special case with a mean of zero and a standard deviation of 1.

In a normal distribution, the midpoint has the maximum frequency. The proportion of the area under the curve that lies between the mean and any given distance from it is constant when that distance is measured in standard deviation units. Roughly 68% of values fall within one standard deviation of the mean, 95% within two, and 99.7% within three, whatever the mean and standard deviation happen to be.
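The constant-proportion property can be checked numerically. The following is a minimal sketch using `statistics.NormalDist` from the Python standard library; the helper name `area_within` and the second set of parameters (mean 50, standard deviation 12) are illustrative assumptions, not part of the article.

```python
from statistics import NormalDist

def area_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    standard = area_within(k)                   # standard normal: mean 0, sd 1
    shifted = area_within(k, mu=50, sigma=12)   # arbitrary mean and sd (assumed values)
    # Both columns agree: the proportion depends only on k.
    print(k, round(standard, 4), round(shifted, 4))
```

The two columns match for each k, which is why distances from the mean are usually quoted in standard deviation units.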

Values from a normal distribution are often expressed as standard scores, or Z scores. A Z score gives the distance between an actual score and the mean in units of standard deviations.
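As a small illustration, a Z score is the raw score minus the mean, divided by the standard deviation. The sketch below assumes a hypothetical exam with a mean of 70 and a standard deviation of 8; the function name `z_score` and the numbers are made up for the example.

```python
from statistics import NormalDist

def z_score(x, mean, std_dev):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / std_dev

# Hypothetical exam: mean 70, standard deviation 8, one student scores 86.
z = z_score(86, mean=70, std_dev=8)

# If scores are normally distributed, the standard normal CDF gives the
# share of scores that fall below this one.
share_below = NormalDist().cdf(z)

print(z, round(share_below, 4))
```

A score of 86 sits two standard deviations above the mean here, so it is higher than nearly 98% of scores under the normality assumption.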

Bernoulli distribution

In a Bernoulli distribution, the random variable has only two possible values. (A random variable is a variable whose value depends on the outcome of an experiment. Random variables are of two types: discrete and continuous.)

A Bernoulli distribution is a discrete distribution. It describes a single trial, called a Bernoulli trial, which is one of the simplest experiments in statistics and has exactly two possible outcomes: success and failure. Examples of Bernoulli trials include a coin toss, or a roll of a die where only "six" versus "not six" is recorded. The probabilities of the mutually exclusive events that make up all the possible outcomes must sum to one.

The two possible outcomes in the Bernoulli distribution are indicated by n=0 and n=1. Here, n=1 indicates success and has probability p, while n=0 indicates failure and has probability 1-p (0<=p<=1).
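The definition above can be written out directly. This is a minimal sketch with the standard library only; the function names, the value p = 0.3, and the seed are assumptions chosen for the example.

```python
import random

def bernoulli_pmf(n, p):
    """P(N = n) for a Bernoulli variable: p for n=1 (success), 1-p for n=0 (failure)."""
    if n == 1:
        return p
    if n == 0:
        return 1 - p
    return 0.0  # no other outcome is possible

def bernoulli_trial(p, rng):
    """Run one trial: return 1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0

p = 0.3                      # assumed success probability
rng = random.Random(0)       # seeded so the run is repeatable
draws = [bernoulli_trial(p, rng) for _ in range(10_000)]

# The observed share of successes should be close to p.
print(sum(draws) / len(draws))
```

Over many trials the share of successes settles near p, and the two probabilities always add up to one.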

Uniform distribution

The uniform distribution is one of the simplest statistical distributions to understand. It is a probability distribution in which all the possible outcomes are equally likely. Graphically, we can think of it as a straight horizontal line. Uniform distributions are of two types: discrete and continuous.

A discrete uniform distribution has a finite number of outcomes, each with the same probability. A continuous uniform distribution has infinitely many possible outcomes within an interval, and any two sub-intervals of the same length are equally likely.
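Both types can be sketched in a few lines. The example below assumes a fair six-sided die for the discrete case and an interval from 0 to 10 for the continuous case; the helper `uniform_interval_prob` is a hypothetical name introduced here.

```python
import random
from collections import Counter

rng = random.Random(42)  # seeded so the run is repeatable

# Discrete uniform: a fair six-sided die, each face has probability 1/6.
rolls = [rng.randint(1, 6) for _ in range(60_000)]
freq = {face: count / len(rolls) for face, count in sorted(Counter(rolls).items())}
print(freq)  # every relative frequency should be near 1/6

# Continuous uniform on [a, b]: probability is proportional to interval length.
def uniform_interval_prob(lo, hi, a, b):
    """P(lo <= X <= hi) for X uniform on [a, b]."""
    overlap = max(0.0, min(hi, b) - max(lo, a))
    return overlap / (b - a)

# Two sub-intervals of equal length carry equal probability.
print(uniform_interval_prob(2, 4, 0, 10), uniform_interval_prob(7, 9, 0, 10))
```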

Poisson distribution

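A Poisson distribution is a probability distribution that shows how many times an event is likely to occur over a fixed interval of time or space. It is named after the French mathematician Siméon Denis Poisson. It is a discrete distribution: the count can take only non-negative integer values (0, 1, 2, and so on). It is a limiting case of the binomial distribution, reached as the number of trials grows large and the probability of success in each trial becomes small while the expected number of events stays fixed.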
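The link to the binomial distribution can be seen numerically. The sketch below uses the standard Poisson and binomial probability formulas; the average rate of 3 events per interval and the function names are assumptions made for illustration.

```python
import math

def poisson_pmf(k, lam):
    """P(K = k) when events occur at an average rate of lam per interval."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def binomial_pmf(k, n, p):
    """P(K = k) successes in n independent trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def max_gap(n, lam=3.0):
    """Largest difference between Binomial(n, lam/n) and Poisson(lam) for k = 0..10."""
    p = lam / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(11))

# As n grows with n * p held at lam, the binomial probabilities approach the Poisson ones.
for n in (10, 100, 10_000):
    print(n, max_gap(n))
```

The gap shrinks as the number of trials increases, which is the sense in which the Poisson distribution arises as a limit of the binomial.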

T-distribution

The t-distribution, also known as Student’s t-distribution, resembles the normal distribution and is used mainly when the sample size is small and the population standard deviation is unknown. It is bell-shaped and symmetric about zero. Its shape changes with the degrees of freedom. It has heavier tails, and therefore greater dispersion, than the standard normal distribution. As the degrees of freedom increase, it approximates the standard normal distribution more and more closely.
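The effect of the degrees of freedom can be illustrated by comparing densities in the tail. This is a rough sketch that evaluates the standard t density formula with `math.gamma`; the function name `t_pdf`, the evaluation point x = 3, and the chosen degrees of freedom are assumptions for the example.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom.

    Note: math.gamma overflows for very large df, so this sketch is only
    meant for moderate values.
    """
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

normal = NormalDist()  # standard normal: mean 0, sd 1

# Three standard units from the centre, the t density is higher than the
# normal density, and the difference shrinks as df grows.
for df in (2, 10, 100):
    print(df, t_pdf(3.0, df), normal.pdf(3.0))
```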

Student’s t-distribution is defined over the whole real line, from –∞ to ∞. Its important applications include hypothesis tests for a population mean, for the difference between two means, and for the difference between two means from dependent (paired) samples.

Log-normal distribution

A log-normal distribution is the probability distribution of a random variable whose logarithm is normally distributed. A log-normally distributed random variable takes only positive real values.
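Both properties can be checked with a quick simulation. The sketch below draws samples with `random.lognormvariate` from the standard library; the parameters (mu = 1.0, sigma = 0.5), the sample size, and the seed are assumptions picked for illustration.

```python
import math
import random
import statistics

rng = random.Random(7)   # seeded so the run is repeatable
mu, sigma = 1.0, 0.5     # mean and sd of the underlying normal (assumed values)

samples = [rng.lognormvariate(mu, sigma) for _ in range(20_000)]
logs = [math.log(x) for x in samples]

# Every sample is positive.
print(min(samples) > 0)

# The logarithms behave like a normal sample with mean mu and sd sigma.
print(statistics.mean(logs), statistics.stdev(logs))
```

Taking logarithms of the samples recovers values whose mean and standard deviation sit close to the parameters of the underlying normal distribution.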

Sreejani Bhattacharyya
I am a technology journalist at AIM. What gets me excited is deep-diving into new-age technologies and analysing how they impact us for the greater good. Reach me at sreejani.bhattacharyya@analyticsindiamag.com
