We frequently publish resources for aspirants and job seekers in data science to help them build a career in this vibrant field. Cracking interviews, especially those that require an understanding of statistics, can be tricky. Here are 40 of the most commonly asked interview questions for data scientists, split into basic and advanced.

Here are some other interview question resources for data scientists:

10 Most Common SQL Questions & Answers You Must Know For Your Next Interview

10 Frequently Asked Interview Questions For Machine Learning In 2019

5 Mathematical Concepts Every Data Scientist Should Master Before An Interview

10 Important Pandas Interview Questions Every Beginner Must Know

11 Most Commonly Asked NLP Interview Questions For Beginners

12 Most Popular Python Interview Questions You Must Prepare For

10 Most Frequently Asked Questions In Data Science Interview

Top Interview Questions For A Data Engineer Job Profile

### Part 1 – Basic Statistics and Distributions

#### 20 Questions

1. **What is the difference between data analysis and machine learning?**

Data analysis requires strong knowledge of coding and basic knowledge of statistics.

Machine learning, on the other hand, requires basic knowledge of coding and strong knowledge of statistics and business.

2. **What is big data?**

Big data has 3 major components – volume (size of data), velocity (inflow of data) and variety (types of data)

Big data causes “overloads”

3. **What are the four main things we should know before studying data analysis?**

Descriptive statistics

Inferential statistics

Distributions (normal distribution / sampling distribution)

Hypothesis testing

4. **What is the difference between inferential statistics and descriptive statistics?**

Descriptive statistics – summarizes the data we actually have, so the information it provides is exact.

Inferential statistics – starts from information about a sample and uses it to reach a conclusion about the population.

5. **What is the difference between population and sample in inferential statistics?**

From the population we take a sample. We usually cannot work on the whole population, either because of computational costs or because not all data points for the population are available.

From the sample we calculate the statistics

From the sample statistics we draw conclusions about the population

6. **What are descriptive statistics?**

Descriptive statistics are used to describe the data (data properties)

The 5-number summary is the most commonly used set of descriptive statistics

7. **What are the most common characteristics used in descriptive statistics?**

- Center – the middle of the data. Mean / Median / Mode are the most commonly used measures.
  - Mean – average of all the numbers
  - Median – the number in the middle
  - Mode – the number that occurs the most. The disadvantage of using the mode is that there may be more than one mode.

- Spread – how the data is dispersed. Range / IQR / Standard Deviation / Variance are the most commonly used measures.
  - Range = Max – Min
  - Inter Quartile Range (IQR) = Q3 – Q1
  - Standard Deviation (σ) = √(∑(x-µ)^{2} / n)
  - Variance = σ^{2}

- Shape – the shape of the data can be symmetric or skewed
  - Symmetric – the part of the distribution on the left side of the median mirrors the part on the right side of the median
  - Left skewed – the left tail is longer than the right tail
  - Right skewed – the right tail is longer than the left tail

- Outlier – an outlier is an abnormal value
  - Keep the outlier based on judgement
  - Remove the outlier based on judgement
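The center and spread measures above can be computed with Python's standard `statistics` module. This is a minimal sketch on made-up numbers; the data values are hypothetical, and the population formulas (dividing by n) are used to match the standard deviation formula given above.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical values, for illustration only

# Center
mean = statistics.mean(data)
median = statistics.median(data)
modes = statistics.multimode(data)  # a list, because there may be more than one mode

# Spread
data_range = max(data) - min(data)
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
std_dev = statistics.pstdev(data)      # population standard deviation (divides by n)
variance = statistics.pvariance(data)  # population variance = std_dev squared

print(mean, median, modes, data_range, iqr, round(std_dev, 3), round(variance, 3))
```

Quartile conventions differ between tools, so the IQR reported by other software may differ slightly from the `inclusive` method used here.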

8. **What is quantitative data and qualitative data?**

Quantitative data is also known as numeric data

Qualitative data is also known as categorical data

9. **How to calculate range and interquartile range?**

Range = Max – Min

IQR = Q3 – Q1

Where Q3 is the third quartile (75th percentile)

and Q1 is the first quartile (25th percentile)

10. **Why do we need the 5-number summary?**

The 5-number summary describes the center and spread of the data with just five values:

Low extreme (minimum)

Lower quartile (Q1)

Median

Upper quartile (Q3)

Upper extreme (maximum)
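As an illustration, the five values can be collected with a small helper. This is a sketch using Python's standard library; the function name `five_number_summary` and the sample data are hypothetical, and the quartiles follow the `inclusive` convention of `statistics.quantiles` (other tools may use a slightly different rule).

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Hypothetical data set
summary = five_number_summary([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])
print(summary)
```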

11. **What is the benefit of using a box plot?**

Shows the 5-number summary pictorially

Can be used to compare several groups side by side, in place of a group of histograms

12. **What is the meaning of standard deviation?**

It represents how far the data points typically are from the mean

σ = √(∑(x-µ)^{2} / n)

Variance is the square of standard deviation

13. **What is left skewed distribution and right skewed distribution?**

- Left skewed
  - The left tail is longer than the right tail
  - Typically, mean < median < mode

- Right skewed
  - The right tail is longer than the left tail
  - Typically, mode < median < mean
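The ordering of mean, median and mode can be checked on a small example. The two data sets below are hypothetical and were chosen so that one has a long right tail and the other is its mirror image; real data does not always follow this ordering exactly.

```python
import statistics

right_skewed = [1, 1, 1, 2, 2, 3, 4, 6, 16]   # hypothetical: long right tail
left_skewed = [-x for x in right_skewed]      # mirror image: long left tail

for name, data in [("right skewed", right_skewed), ("left skewed", left_skewed)]:
    print(
        name,
        "mean =", statistics.mean(data),
        "median =", statistics.median(data),
        "mode =", statistics.mode(data),
    )
```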

14. **What does symmetric distribution mean?**

The part of the distribution on the left side of the median is the mirror image of the part on the right side of the median

A few examples are the uniform distribution, the binomial distribution (with p = 0.5) and the normal distribution

15. **What is the relationship between mean and median in normal distribution?**

In the normal distribution, the mean is equal to the median

16. **What is meant by bell curve distribution and Gaussian distribution?**

The normal distribution is also called the bell curve distribution / Gaussian distribution

It is called a bell curve because it has the shape of a bell

It is called the Gaussian distribution because it is named after Carl Friedrich Gauss

17. **How to convert normal distribution to standard normal distribution?**

Standardized normal distribution has mean = 0 and standard deviation = 1

To convert normal distribution to standard normal distribution we can use the formula

X (standardized) = (x-µ) / σ
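A short sketch of this conversion in Python follows. The function name `standardize` and the input values are hypothetical; µ and σ are estimated from the data itself using the population formulas.

```python
import statistics

def standardize(values):
    """Apply (x - mu) / sigma to every value."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]

z_scores = standardize([50, 60, 70, 80, 90])  # hypothetical scores
print([round(z, 3) for z in z_scores])

# After standardizing, the mean is 0 and the standard deviation is 1
print(round(statistics.mean(z_scores), 6), round(statistics.pstdev(z_scores), 6))
```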

18. **What is an outlier?**

An outlier is an abnormal value (It is at an abnormal distance from rest of the data points).

19. **Mention one method to find outliers?**

The 5-number summary can be used to identify outliers

Widely used rule – any data point that lies more than 1.5 * IQR below Q1 or above Q3 is treated as an outlier

Lower bound = Q1 – (1.5 * IQR)

Upper bound = Q3 + (1.5 * IQR)
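The two bounds translate directly into code. The sketch below uses a hypothetical helper name `iqr_outliers` and made-up data; the quartiles come from `statistics.quantiles` with the `inclusive` method, which is one of several quartile conventions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    return [x for x in values if x < lower_bound or x > upper_bound]

# Hypothetical data with one suspicious value
outliers = iqr_outliers([10, 12, 12, 13, 13, 14, 15, 16, 45])
print(outliers)
```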

20. **What can I do with an outlier?**

- Remove the outlier
  - When we know the data point is wrong (for example, a negative age for a person)
  - When we have lots of data
  - We should provide two analyses: one with outliers and another without outliers.

- Keep the outlier
  - When there are a lot of outliers (skewed data)
  - When results are critical
  - When outliers have meaning (fraud data)

### Part 2 – Advanced Statistics and Hypothesis Testing

#### 20 Questions

21. **What is the difference between population parameters and sample statistics?**

- Population parameters are:
  - Mean = µ
  - Standard deviation = σ

- Sample statistics are:
  - Mean = x (bar)
  - Standard deviation = s

22. **Why do we need sample statistics?**

Population parameters are usually unknown hence we need sample statistics.

23. **How to find the mean length of all fish in the sea?**

Define the confidence level (most common is 95%)

Take a sample of fish from the sea (for better results, the sample should contain more than 30 fish)

Calculate the mean length and standard deviation of the lengths in the sample

Find the t-value for the chosen confidence level and n – 1 degrees of freedom

Get the confidence interval in which the mean length of all the fish should lie
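These steps can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the fish lengths are invented, the helper name `mean_confidence_interval` is hypothetical, and the critical t-value (about 2.030 for a two-sided 95% level with 35 degrees of freedom) is passed in as if it had been read from a t-table, because the standard library has no t-distribution.

```python
import math
import statistics

def mean_confidence_interval(sample, t_critical):
    """Return (low, high) for the population mean: x_bar -/+ t * s / sqrt(n)."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    margin = t_critical * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical lengths (in cm) of 36 sampled fish
lengths = [20 + 2 * (i % 6) + (i // 6) for i in range(36)]

T_95_DF35 = 2.030  # assumed t-table value: two-sided 95%, DF = 36 - 1 = 35

low, high = mean_confidence_interval(lengths, T_95_DF35)
print(f"95% CI for the mean length: {low:.2f} cm to {high:.2f} cm")
```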

24. **What are the effects of the width of the confidence interval?**

- The confidence interval is used for decision making
- As the confidence level increases, the width of the confidence interval also increases
- As the width of the confidence interval increases, the interval tells us less about where the parameter actually lies
  - Wide CI – little useful information
  - Narrow CI – high risk of missing the population parameter

25. **Mention the relationship between standard error and margin of error?**

Margin of error = critical value (t or Z) × standard error, so as the standard error increases the margin of error also increases

26. **Mention the relationship between confidence interval and margin of error?**

The confidence interval is the sample mean ± the margin of error. As the confidence level increases, the margin of error increases and the interval becomes wider

27. **What is the proportion of confidence intervals that will not contain the population parameter?**

Alpha (α) is the proportion of confidence intervals that will not contain the population parameter

α = 1 – CL (where CL is the confidence level)

28. **What is the difference between 95% confidence level and 99% confidence level?**

The confidence interval becomes wider as we move from a 95% confidence level to a 99% confidence level
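The widening can be seen numerically. The sketch below uses the standard normal distribution (`statistics.NormalDist`) as a stand-in for the critical value, which is an assumption that fits large samples; the standard error of 2.0 is a made-up number and `z_critical` is a hypothetical helper name.

```python
from statistics import NormalDist

def z_critical(confidence_level):
    """Two-sided critical Z value for a given confidence level."""
    alpha = 1 - confidence_level
    return NormalDist().inv_cdf(1 - alpha / 2)

standard_error = 2.0  # hypothetical value

for cl in (0.90, 0.95, 0.99):
    margin = z_critical(cl) * standard_error
    print(f"{cl:.0%}: Z = {z_critical(cl):.3f}, margin of error = {margin:.2f}")
```

A higher confidence level gives a larger critical value, hence a larger margin of error and a wider interval.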

29. **What do you mean by degrees of freedom?**

Degrees of freedom (DF) is the number of values that are free to vary when a statistic is calculated

DF is used with the t-distribution and not with the Z-distribution

For a series, DF = n-1 (where n is the number of observations in the series)

30. **What happens if DF is more than 30?**

As DF increases, the t-distribution gets closer to the normal distribution

At low DF, the t-distribution has fat tails

If DF > 30, the t-distribution is close enough to the normal distribution for most practical purposes

31. **When to use the t-distribution and when to use the Z-distribution?**

- Both of the following conditions must be satisfied to use the Z-distribution
  - We know the population standard deviation (σ)
  - The sample size is > 30
  - CI = x (bar) – Z × σ/√n to x (bar) + Z × σ/√n

- Otherwise we should use the t-distribution
  - CI = x (bar) – t × s/√n to x (bar) + t × s/√n
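The decision rule and the Z-based interval can be sketched as below. The function names `choose_distribution` and `z_interval` and the example numbers are hypothetical; only the Z case is computed because Python's standard library provides the normal distribution but not the t-distribution.

```python
import math
from statistics import NormalDist

def choose_distribution(sigma_known, n):
    """Z only if the population standard deviation is known and n > 30, else t."""
    return "z" if sigma_known and n > 30 else "t"

def z_interval(x_bar, sigma, n, confidence_level=0.95):
    """CI = x_bar - Z * sigma / sqrt(n)  to  x_bar + Z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin = z * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

print(choose_distribution(sigma_known=True, n=36))   # both conditions hold
print(choose_distribution(sigma_known=False, n=36))  # sigma unknown
print(z_interval(x_bar=100, sigma=15, n=36))         # hypothetical numbers
```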

32. **What are H0 and H1? What are H0 and H1 for a two-tail test?**

- H0 is known as the null hypothesis. It is the normal case / default case.
  - For a one-tail test: x <= µ
  - For a two-tail test: x = µ

- H1 is known as the alternate hypothesis. It is the other case.
  - For a one-tail test: x > µ
  - For a two-tail test: x ≠ µ

33. **What is p-value in hypothesis testing?**

The p-value is the probability, assuming H0 is true, of getting a result at least as extreme as the one observed in the sample. It is compared with the significance level.

- If the p-value is more than the significance level, then we fail to reject H0
  - If p-value = 0.055 (significance level = 0.05) – weak evidence against H0

- If the p-value is less than the significance level, then we reject H0
  - If p-value = 0.015 (significance level = 0.05) – strong evidence against H0
  - If p-value = 0.005 (significance level = 0.05) – even stronger evidence against H0

34. **How to calculate p-value using the manual method?**

Find H0 and H1

Find n, x(bar) and s

Find DF for the t-distribution

Find the type of distribution – t or z distribution

Compute the t or z value and find the corresponding p-value (using the look-up table)

Compare the p-value to the significance level
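The manual steps can be mirrored in code for the Z case. This is a simplified sketch: it assumes the population standard deviation σ is known so that the normal distribution applies, and the normal CDF replaces the look-up table. For the t case, a t-table (or a statistics library) would be needed instead. The function name and all numbers are hypothetical.

```python
import math
from statistics import NormalDist

def z_test_p_value(x_bar, mu_0, sigma, n, two_tailed=True):
    """p-value of a one-sample Z test of H0: mean = mu_0."""
    z = (x_bar - mu_0) / (sigma / math.sqrt(n))  # test statistic
    tail = 1 - NormalDist().cdf(abs(z))          # area beyond |z| in one tail
    return 2 * tail if two_tailed else tail

# Hypothetical example: sample mean 52, hypothesized mean 50, sigma 6, n = 36
p_value = z_test_p_value(x_bar=52, mu_0=50, sigma=6, n=36)
significance_level = 0.05
print(round(p_value, 4), "reject H0" if p_value < significance_level else "fail to reject H0")
```

The one-tailed branch measures the tail on the side where the sample mean actually fell, so it is only meaningful when that side matches the direction stated in H1.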

35. **How to calculate p-value using EXCEL?**

Go to Data tab

Click on Data Analysis

Select Descriptive Statistics

Choose the column

Select summary statistics and confidence level (0.95)

36. **What do we mean by making a decision based on comparing the p-value with the significance level?**

If the p-value is more than the significance level, then we fail to reject H0

If the p-value is less than the significance level, then we reject H0
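The rule is a single comparison. A minimal sketch, with a hypothetical function name and a default significance level of 0.05 chosen only for illustration:

```python
def decide(p_value, significance_level=0.05):
    """Apply the decision rule: reject H0 only when p-value < significance level."""
    if p_value < significance_level:
        return "reject H0"
    return "fail to reject H0"

for p in (0.005, 0.015, 0.055):
    print(p, "->", decide(p))
```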

37. **What is the difference between one tail and two tail hypothesis testing?**

- 2-tail test: the critical region is on both sides of the distribution
  - H0: x = µ
  - H1: x ≠ µ

- 1-tail test: the critical region is on one side of the distribution
  - H0: x <= µ
  - H1: x > µ

38. **Which test (one tail or two tail) applies if H0 is equal to one value only?**

It is a two-tail test

39. **What is the critical value in a one tail or two-tail test?**

In a 1-tail test, the critical value cuts off a tail area of alpha

In a 2-tail test, each critical value cuts off a tail area of alpha / 2

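40. **Why is the t-value the same for a 90% two tail and a 95% one tail test?**

It is because in a two tail test there are 2 critical regions: a 90% two tail test has alpha = 0.10, which leaves an area of 0.05 in each tail

A 95% one tail test has alpha = 0.05, all of it in a single tail, so both tests use the same critical t-value

Similarly, P-value of 1-tail = P-value of 2-tail / 2 (when the observed effect is in the direction of H1)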
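The equality can be checked numerically. The sketch below uses the standard normal distribution as a stand-in for the t-distribution (an assumption made because Python's standard library has no t-distribution); the same tail-area argument applies to t at any fixed DF. The helper name `critical_value` is hypothetical.

```python
from statistics import NormalDist

def critical_value(confidence_level, tails):
    """Critical value that leaves alpha / tails in each critical region."""
    alpha = 1 - confidence_level
    tail_area = alpha / tails
    return NormalDist().inv_cdf(1 - tail_area)

two_tail_90 = critical_value(0.90, tails=2)  # 0.05 in each of two tails
one_tail_95 = critical_value(0.95, tails=1)  # 0.05 in a single tail

# Both are about 1.645; they match up to floating-point rounding
print(round(two_tail_90, 3), round(one_tail_95, 3))
```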