We frequently come out with resources for aspirants and job seekers in data science to help them make a career in this vibrant field. Cracking interviews, especially where an understanding of statistics is needed, can be tricky. Here are the 40 most commonly asked interview questions for data scientists, broken into basic and advanced.
Part 1 – Basic Statistics and Distributions
1. What is the difference between data analysis and machine learning?
Data analysis requires strong knowledge of coding and basic knowledge of statistics
Machine learning, on the other hand, requires basic knowledge of coding and strong knowledge of statistics and business.
2. What is big data?
Big data has 3 major components – volume (size of data), velocity (inflow of data) and variety (types of data)
Big data causes “overloads”
3. What are the four main things we should know before studying data analysis?
Distributions (normal distribution / sampling distribution)
4. What is the difference between inferential statistics and descriptive statistics?
Descriptive statistics – provides exact and accurate information.
Inferential statistics – starts from information about a sample and is used to reach a conclusion about the population.
5. What is the difference between population and sample in inferential statistics?
From the population we take a sample. We cannot work on the whole population, either because of computational costs or because not all data points for the population are available.
From the sample we calculate the statistics
From the sample statistics we conclude about the population
6. What are descriptive statistics?
Descriptive statistics are used to describe the data (data properties)
The 5-number summary is the most commonly used set of descriptive statistics
7. What are the most common characteristics used in descriptive statistics?
- Center – middle of the data. Mean / Median / Mode are the most commonly used as measures.
- Mean – average of all the numbers
- Median – the number in the middle
- Mode – the number that occurs the most. The disadvantage of using Mode is that there may be more than one mode.
- Spread – How the data is dispersed. Range / IQR / Standard Deviation / Variance are the most commonly used as measures.
- Range = Max – Min
- Inter Quartile Range (IQR) = Q3 – Q1
- Standard Deviation (σ) = √(∑(x – µ)² / n)
- Variance = σ²
- Shape – the shape of the data can be symmetric or skewed
- Symmetric – the part of the distribution that is on the left side of the median is same as the part of the distribution that is on the right side of the median
- Left skewed – the left tail is longer than the right tail
- Right skewed – the right tail is longer than the left tail
- Outlier – An outlier is an abnormal value
- Keep the outlier based on judgement
- Remove the outlier based on judgement
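The center and spread measures listed above can be computed with Python's standard `statistics` module. The following is a minimal sketch; the data values are hypothetical and chosen only for illustration, and the "inclusive" quartile method is one of several conventions (other tools may return slightly different quartiles).

```python
import statistics

# Hypothetical sample, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Center
mean = statistics.mean(data)
median = statistics.median(data)
modes = statistics.multimode(data)  # a list, because there may be more than one mode

# Spread
data_range = max(data) - min(data)
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
variance = statistics.pvariance(data)  # population variance: divides by n
std_dev = statistics.pstdev(data)      # population standard deviation

print("mean:", mean, "median:", median, "mode(s):", modes)
print("range:", data_range, "IQR:", iqr)
print("variance:", variance, "standard deviation:", std_dev)
```

`pvariance` and `pstdev` divide by n, matching the formula in the list; `statistics.variance` and `statistics.stdev` divide by n – 1 and are the usual choice when the data is a sample.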
8. What is quantitative data and qualitative data?
Quantitative data is also known as numeric data
Qualitative data is also known as categorical data
9. How to calculate range and interquartile range?
Range = Max – Min
IQR = Q3 – Q1
Where, Q3 is the third quartile (75th percentile)
Where, Q1 is the first quartile (25th percentile)
10. Why do we need the 5-number summary?
Low extreme (minimum)
Lower quartile (Q1)
Median
Upper quartile (Q3)
Upper extreme (maximum)
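As a rough illustration, the 5-number summary (minimum, Q1, median, Q3, maximum) can be computed as below. The function name `five_number_summary` and the sample values are hypothetical, and the "inclusive" quartile method is an assumption; spreadsheet and statistics packages use several slightly different quartile conventions.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical sample, chosen only for illustration.
sample = [1, 3, 5, 7, 9, 11, 13]
print(five_number_summary(sample))
```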
11. What is the benefit of using box plot?
Shows the 5-number summary pictorially
Can be used to compare the distributions of several groups side by side
12. What is the meaning of standard deviation?
It represents how far the data points are from the mean
σ = √(∑(x – µ)² / n)
Variance is the square of standard deviation
13. What is left skewed distribution and right skewed distribution?
- Left skewed
- The left tail is longer than the right side
- Mean < median < mode
- Right skewed
- The right tail is longer than the left tail
- Mode < median < mean
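The ordering of mean, median and mode can be checked on a small example. This is only a sketch on hypothetical data: the ordering is a rule of thumb that holds for typical skewed data, not a guarantee for every data set.

```python
import statistics


def center_measures(values):
    """Return (mean, median, mode) of a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.mode(values),
    )


# Hypothetical right-skewed data: a few large values stretch the right tail.
right_skewed = [1, 2, 2, 2, 3, 4, 5, 9, 15]
# Mirror image of the same data, which is left skewed.
left_skewed = [16 - x for x in right_skewed]

mean_r, median_r, mode_r = center_measures(right_skewed)
mean_l, median_l, mode_l = center_measures(left_skewed)

print("right skewed: mode < median < mean ->", mode_r < median_r < mean_r)
print("left skewed:  mean < median < mode ->", mean_l < median_l < mode_l)
```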
14. What does symmetric distribution mean?
The part of the distribution that is on the left side of the median is same as the part of the distribution that is on the right side of the median
A few examples are – uniform distribution, binomial distribution (when p = 0.5), normal distribution
15. What is the relationship between mean and median in normal distribution?
In the normal distribution mean is equal to median
16. What is meant by bell curve distribution and Gaussian distribution?
Normal distribution is called bell curve distribution / Gaussian distribution
It is called bell curve because it has the shape of a bell
It is called Gaussian distribution as it is named after Carl Friedrich Gauss
17. How to convert normal distribution to standard normal distribution?
Standardized normal distribution has mean = 0 and standard deviation = 1
To convert normal distribution to standard normal distribution we can use the formula
X (standardized) = (x-µ) / σ
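A minimal sketch of the standardization formula above. The function name `standardize` and the input values are hypothetical; the population standard deviation (dividing by n) is used here, which is an assumption about how σ is obtained.

```python
import statistics


def standardize(values):
    """Convert each value x to (x - mu) / sigma."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]


# Hypothetical data, chosen only for illustration.
scores = [10, 20, 30, 40, 50]
z_scores = standardize(scores)

print([round(z, 3) for z in z_scores])
# After standardization the mean is 0 and the standard deviation is 1
# (up to floating-point rounding).
print(round(statistics.mean(z_scores), 6), round(statistics.pstdev(z_scores), 6))
```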
18. What is an outlier?
An outlier is an abnormal value (It is at an abnormal distance from rest of the data points).
19. Mention one method to find outliers?
The 5-number summary can be used to identify outliers
Widely used – any data point that lies more than 1.5 * IQR below Q1 or above Q3
Lower bound = Q1 – (1.5 * IQR)
Upper bound = Q3 + (1.5 * IQR)
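The 1.5 * IQR rule can be sketched as follows. The helper names `iqr_bounds` and `find_outliers` and the readings are hypothetical, and the quartiles use the "inclusive" convention, which is an assumption; a different quartile method can shift the bounds slightly.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower bound, upper bound) using Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values that fall outside the IQR bounds."""
    lower, upper = iqr_bounds(values, k)
    return [x for x in values if x < lower or x > upper]


# Hypothetical readings with one suspicious value.
readings = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]

print("bounds:", iqr_bounds(readings))
print("outliers:", find_outliers(readings))
```

Whether a flagged point is then removed or kept remains a judgement call, as discussed in the next question.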
20. What can I do with outliers?
- Remove outlier
- When we know the data-point is wrong (negative age of a person)
- When we have lots of data
- We should provide two analyses. One with outliers and another without outliers.
- Keep outlier
- When there are a lot of outliers (skewed data)
- When results are critical
- When outliers have meaning (fraud data)
Part 2 – Advanced Statistics and Hypothesis Testing
21. What is the difference between population parameters and sample statistics?
- Population parameters are:
- Mean = µ
- Standard deviation = σ
- Sample statistics are:
- Mean = x (bar)
- Standard deviation = s
22. Why do we need sample statistics?
Population parameters are usually unknown hence we need sample statistics.
23. How to find the mean length of all fish in the sea?
Define the confidence level (most common is 95%)
Take a sample of fish from the sea (for better results the sample should contain more than 30 fish)
Calculate the mean length and standard deviation of the lengths
Get the confidence interval in which the mean length of all the fish should lie.
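The steps above can be sketched in a few lines. Everything specific here is an assumption for illustration: the function name `mean_confidence_interval`, the simulated fish lengths, and the use of the normal (Z) quantile, which is a reasonable approximation only because the sample has more than 30 observations. For small samples the t-distribution would be the appropriate choice.

```python
import math
import random
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence_level=0.95):
    """Return (lower, upper) for the population mean, using a normal approximation."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)              # sample standard deviation (n - 1)
    alpha = 1 - confidence_level
    z = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided quantile
    margin = z * s / math.sqrt(n)             # margin of error
    return x_bar - margin, x_bar + margin


# Simulated (hypothetical) lengths in cm for a sample of 50 fish.
random.seed(42)
fish_lengths_cm = [random.gauss(30, 5) for _ in range(50)]

low, high = mean_confidence_interval(fish_lengths_cm)
print(f"95% CI for the mean length: {low:.2f} cm to {high:.2f} cm")
```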
24. What are the effects of the width of the confidence interval?
- Confidence interval is used for decision making
- As the confidence level increases the width of the confidence interval also increases
- As the width of the confidence interval increases, we tend to get useless information also.
- Useless information – wide CI
- High risk – narrow CI
25. Mention the relationship between standard error and margin of error?
As the standard error increases the margin of error also increases
26. Mention the relationship between confidence interval and margin of error?
As the confidence level increases the margin of error also increases
27. What is the proportion of confidence intervals that will not contain the population parameter?
Alpha (α) is the proportion of confidence intervals, over repeated sampling, that will not contain the population parameter
α = 1 – CL
28. What is the difference between 95% confidence level and 99% confidence level?
The confidence interval becomes wider as we move from 95% confidence level to 99% confidence level
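The relationships in the last few answers (margin of error versus standard error, and margin of error versus confidence level) can be seen numerically. This is a sketch under stated assumptions: the function name `margin_of_error` and the inputs (standard deviation 6, sample size 36) are hypothetical, and the Z-distribution is used for the multiplier.

```python
import math
from statistics import NormalDist


def margin_of_error(std_dev, n, confidence_level):
    """Margin of error = z * standard error, with alpha = 1 - confidence level."""
    alpha = 1 - confidence_level
    z = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = std_dev / math.sqrt(n)
    return z * standard_error


# Same data summary, increasing confidence level: the margin grows.
for confidence_level in (0.90, 0.95, 0.99):
    print(confidence_level, round(margin_of_error(6, 36, confidence_level), 3))

# Same confidence level, larger standard error: the margin also grows.
m95 = margin_of_error(6, 36, 0.95)
m99 = margin_of_error(6, 36, 0.99)
m95_noisier = margin_of_error(12, 36, 0.95)
print(m95 < m99, m95 < m95_noisier)
```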
29. What do you mean by degree of freedom?
DF is defined as the number of values that are free to vary (the number of options we have)
DF is used with t-distribution and not with Z-distribution
For a series, DF = n-1 (where n is the number of observations in the series)
30. What happens if DF is more than 30?
As DF increases the t-distribution reaches closer to the normal distribution
At low DF, we have fat tails
If DF > 30, then t-distribution is as good as normal distribution
31. When to use t distribution and when to use z distribution?
- The following conditions must be satisfied to use Z-distribution
- Do we know the population standard deviation?
- Is the sample size > 30?
- CI = x (bar) – Z*σ/√n to x (bar) + Z*σ/√n
- Else we should use t-distribution
- CI = x (bar) – t*s/√n to x (bar) + t*s/√n
32. What is H0 and H1? What is H0 and H1 for two-tail test?
- H0 is known as null hypothesis. It is the normal case / default case.
- For one tail test x <= µ
- For two-tail test x = µ
- H1 is known as alternate hypothesis. It is the other case.
- For one tail test x > µ
- For two-tail test x <> µ
33. What is p-value in hypothesis testing?
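- The p-value is the probability of observing a result at least as extreme as the one in the sample, assuming H0 is true. It is compared with the significance level (α).
- If the p-value is more than the significance level, then we fail to reject the H0
- If p-value = 0.15 (significance level = 0.05) – far from significant
- If p-value = 0.055 (significance level = 0.05) – borderline
- If the p-value is less than the significance level, then we reject the H0
- If p-value = 0.045 (significance level = 0.05) – weak evidence against H0
- If p-value = 0.005 (significance level = 0.05) – strong evidence against H0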
34. How to calculate p-value using manual method?
Find H0 and H1
Find n, x(bar) and s
Find DF for t-distribution
Find the type of distribution – t or z distribution
Find t or z value (using the look-up table)
Compare the p-value to the significance level
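The manual steps above can be mirrored in code, with a cumulative distribution function standing in for the look-up table. This is a sketch under assumptions: the function name `z_test_p_value` and the numbers are hypothetical, and the Z-distribution is used because the illustrative sample has more than 30 observations. The one-tailed result assumes the sample deviates in the direction stated by H1.

```python
import math
from statistics import NormalDist


def z_test_p_value(sample_mean, hypothesized_mean, std_dev, n, two_tailed=True):
    """Return the p-value of a one-sample z-test."""
    standard_error = std_dev / math.sqrt(n)
    z = (sample_mean - hypothesized_mean) / standard_error
    # Area in the tail beyond |z| (replaces the look-up table).
    tail_area = 1 - NormalDist().cdf(abs(z))
    return 2 * tail_area if two_tailed else tail_area


# Hypothetical summary: H0 says the mean is 50; the sample of 36 has mean 52, s = 6.
p_two = z_test_p_value(52, 50, 6, 36, two_tailed=True)
p_one = z_test_p_value(52, 50, 6, 36, two_tailed=False)

alpha = 0.05  # significance level
decision = "reject H0" if p_two < alpha else "fail to reject H0"
print(round(p_two, 4), round(p_one, 4), decision)
```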
35. How to calculate p-value using EXCEL?
Go to Data tab
Click on Data Analysis
Select Descriptive Statistics
Choose the column
Select summary statistics and confidence level (0.95)
36. What do we mean by – making decision based on comparing p-value with significance level?
If the p-value is more than the significance level, then we fail to reject the H0
If the p-value is less than the significance level, then we reject the H0
37. What is the difference between one tail and two tail hypothesis testing?
- 2-tail test: Critical region is on both sides of the distribution
- H0: x = µ
- H1: x <> µ
- 1-tail test: Critical region is on one side of the distribution
- H0: x <= µ
- H1: x > µ
38. What do you think of the tail (one tail or two tail) if H0 is equal to one value only?
It is a two-tail test
39. What is the critical value in one tail or two-tail test?
Tail area beyond the critical value in a 1-tail test = alpha
Tail area beyond each critical value in a 2-tail test = alpha / 2
40. Why is the t-value same for 90% two tail and 95% one tail test?
P-value of 1-tail = P-value of 2-tail / 2
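It is because in a two-tail test there are 2 critical regions: a 90% two-tail test puts 5% in each tail, which is the same tail area as a 95% one-tail test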
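The tail-area argument can be checked numerically. Python's standard library has no t-distribution, so this sketch uses the Z-distribution as a stand-in (an assumption made for illustration); the same reasoning applies to t-values at any fixed degrees of freedom. The function name `critical_value` is hypothetical.

```python
from statistics import NormalDist


def critical_value(confidence_level, tails):
    """Critical value that leaves alpha / tails in each critical region."""
    alpha = 1 - confidence_level
    tail_area = alpha / tails
    return NormalDist().inv_cdf(1 - tail_area)


two_tail_90 = critical_value(0.90, tails=2)  # 5% in each of two tails
one_tail_95 = critical_value(0.95, tails=1)  # 5% in a single tail

print(round(two_tail_90, 4), round(one_tail_95, 4))
```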