Data science interviews can be gruelling, and rejections are only the beginning. While an academic degree, relevant training, skills, and coursework are essential to break into data science, they do not guarantee a job or job satisfaction.
When it comes to interviews, there are hundreds of reasons for a company to reject a candidate. Of course, it makes more sense for a company to reject a good candidate than to hire a bad one. But a talented data science professional stands out by making sure to stay ahead of the curve.
Data science hiring differs from hiring in other domains. Several things are critical, and most shortcomings can be overcome if the candidate is intrinsically strong in the basics: statistics, machine learning, modern programming languages, etc. Many experts believe that a sharp mind, a hunger to learn, the right attitude, and a strong work ethic are crucial to data science hiring.
In this article, we will highlight things that can get you rejected in a data science interview and how you can tackle them.
Earlier, in a LinkedIn post, data science expert and Aryma Labs co-founder Venkat Raman listed statements that can get a candidate rejected, especially when the interviewer is a knowledgeable data scientist or statistician.
Logistic regression is not regression, but a classification algorithm
There are many misconceptions about logistic regression. Many data scientists believe that ‘regression’ in logistic regression is a misnomer. It is not: logistic regression models the probability of an outcome, so it can be used for regression, and one can use it without ever setting an arbitrary cut-off point.
“Calling logistic regression a classification algorithm and believing it to be for ‘classification purpose only’ is akin to a person carrying a wheelbarrow on his head without even realising that it can be rolled,” said Raman.
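The point can be illustrated with a minimal sketch, which is not taken from Raman's post. It fits a one-variable logistic regression by plain gradient descent on simulated data (the data, the `fit_logistic` helper, and the learning-rate settings are all assumptions made for illustration) and shows that the fitted model returns a continuous probability. A class label appears only if someone chooses a cut-off afterwards.

```python
import math
import random

def sigmoid(z):
    """Logistic function: maps any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Hypothetical helper: fit P(y=1 | x) = sigmoid(b0 + b1*x) by gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # gradient of the log-loss w.r.t. the linear term
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Simulated data (an assumption for illustration): true intercept 0.5, true slope 1.5
random.seed(0)
xs = [random.uniform(-3, 3) for _ in range(500)]
ys = [1 if random.random() < sigmoid(0.5 + 1.5 * x) else 0 for x in xs]

b0, b1 = fit_logistic(xs, ys)

# The model's output is a probability on a continuous scale, not a class label
probs = [sigmoid(b0 + b1 * x) for x in (-2.0, 0.0, 2.0)]
print("estimated intercept and slope:", round(b0, 2), round(b1, 2))
print("predicted probabilities at x = -2, 0, 2:", [round(p, 3) for p in probs])

# Classification is an optional extra step that needs a chosen cut-off
labels_at_half = [int(p >= 0.5) for p in probs]
print("labels with a 0.5 cut-off:", labels_at_half)
```

The regression part is the estimate of the probability itself. The 0.5 threshold in the last step is a separate decision that the model does not make.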
‘P-value’ is the chance of obtaining a result by sheer chance
A P-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis and every other assumption of the test are true; it is not the probability that the result arose by chance. In the research paper ‘Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations,’ the researchers emphasise how violating often unstated analysis protocols (such as selecting analyses for presentation based on the P values they produce) can lead to small P values even if the declared test hypothesis is correct, and to large P values even if that hypothesis is wrong.
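The effect of selecting analyses by their P values can be sketched with a small simulation. This is an illustrative setup rather than anything taken from the paper: the null hypothesis is true in every analysis, each experiment runs ten independent z-tests (the sample size, number of analyses, and the `z_test_p_value` helper are assumptions), and we compare reporting the first analysis with reporting whichever analysis gave the smallest P value.

```python
import random
from statistics import NormalDist

def z_test_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided z-test P value for H0: mean == mu0, with known sigma (hypothetical helper)."""
    sample_mean = sum(sample) / len(sample)
    z = (sample_mean - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(1)
n_experiments = 2000   # assumed number of simulated studies
n_analyses = 10        # assumed number of analyses tried per study
sample_size = 20

honest_hits = 0     # the single pre-specified analysis is "significant"
selective_hits = 0  # the best-looking analysis out of ten is "significant"

for _ in range(n_experiments):
    # The null hypothesis is true for every analysis: data are N(0, 1)
    p_values = [
        z_test_p_value([random.gauss(0, 1) for _ in range(sample_size)])
        for _ in range(n_analyses)
    ]
    honest_hits += int(p_values[0] < 0.05)
    selective_hits += int(min(p_values) < 0.05)

honest_rate = honest_hits / n_experiments
selective_rate = selective_hits / n_experiments
print("false-positive rate, pre-specified analysis:", honest_rate)
print("false-positive rate, smallest P value of ten:", selective_rate)
```

With a pre-specified analysis the false-positive rate should sit near the nominal 5 per cent, while picking the smallest of ten P values pushes it far higher, even though nothing real is being detected.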
We accept the null hypothesis
“We often come across YouTube videos, blogs, posts, and private courses wherein they say ‘we accept the null hypothesis’ instead of saying ‘we fail to reject the null hypothesis,’” stated Raman. He said if you correct them, they ask what the big difference is: the opposite of ‘rejecting the null’ is ‘accepting,’ isn’t it?
Raman believes it is not as simple as it is made out to be, and said we need to rise above this and understand one crucial concept: ‘Popperian falsification.’ This concept holds the key to why we use the language ‘fail to reject the null,’ he added.
Popperian falsification means that ‘science is never settled’; it keeps changing or evolving. In other words, scientific theories held sacrosanct today can be refuted tomorrow. Therefore, under this principle, scientists never proclaim ‘theory X is true.’ Instead, they try to prove that ‘theory X is wrong.’ That is falsification, and it is where ‘we fail to reject the null’ comes into play.
Now, when ‘we fail to reject the null hypothesis,’ it means we could not prove that theory X is wrong. “But, does that really mean ‘theory X is correct?’ No, somebody more smarter in the future could prove theory X is wrong,” said Raman. That possibility always exists.
In linear regression, the dependent and independent variables need to follow a normal distribution; if not, they need to be log-transformed
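Linear regression is a method of modelling the linear relationship between a dependent variable and one or more independent variables. For example, simple linear regression can be defined by the following equation:
y = β0 + β1x
where:
x = independent variable
y = dependent variable
β1 = coefficient of x (slope)
β0 = intercept (constant), the value of y at which the line crosses the y-axis
Neither x nor y needs to follow a normal distribution; the usual normality assumption concerns the model's errors (residuals), not the variables themselves.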
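A short simulation can make this concrete. It is a sketch under assumed settings, not part of Raman's post: x is drawn from a strongly skewed exponential distribution, y is generated from an assumed line with intercept 1 and slope 2 plus normal noise, and ordinary least squares is computed by hand. No log transformation is applied to either variable.

```python
import random

def skewness(values):
    """Sample skewness (third standardised moment); roughly 0 for symmetric data."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

random.seed(2)
# Assumed data-generating process: skewed x, normal errors
xs = [random.expovariate(1.0) for _ in range(2000)]
ys = [1.0 + 2.0 * x + random.gauss(0, 0.5) for x in xs]

# Ordinary least squares for one predictor
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

print("estimated intercept and slope:", round(b0, 3), round(b1, 3))
print("skewness of x:", round(skewness(xs), 2))
print("skewness of y:", round(skewness(ys), 2))
print("skewness of residuals:", round(skewness(residuals), 2))
```

Both variables are clearly non-normal, yet the slope and intercept should be recovered closely and the residuals should look symmetric, which is the quantity the normality assumption is actually about.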
Stepwise regression is a type of regression
Stepwise regression is a step-by-step iterative construction of a regression model that involves selecting independent variables to be used in a final model. It typically involves adding or removing potential explanatory variables in succession and testing for statistical significance after each iteration. In other words, it is a procedure for selecting variables, not a distinct type of regression model.
Your modelling strategy is to try all the models on the data and choose one based on an accuracy metric via some low code library
Raman said if you ‘try all models’ blindly and choose one based on some accuracy metric, you are not doing data science; you are just gaming the system.
Further, he said, if you were clueless about which variables to select for your model, you would be clueless even after getting some variables through an ML/stat feature selection technique. “Just that you would falsely convince yourself that ‘these features that you have gotten are important,’” he added.
He said applied data science is serious business: we need to know what we are doing, and knowing maths and statistics helps. “We may not be ‘scientists’ in the strictest sense. But, at least to do justice to the word ‘science’ in data science,” said Raman, suggesting the following approach:
- One needs to ideate and think through the problems thoroughly
- Choose features guided by domain knowledge (or talk to domain experts before choosing features)
- Know and understand how the models work under the hood
- Weigh the pros and cons of applying an algorithm/technique to data
“AI could be a multi-billion dollar industry or even a trillion-dollar industry, but only if data science is done right. Doing it wrongly will only cause disenchantment and lower adoption,” said Raman.
The central limit theorem kicks in at n=30
“The central limit theorem does not ‘kick in’ at n = 30,” said Raman.
According to the central limit theorem, the sample mean will be approximately normally distributed for sufficiently large sample sizes (n), regardless of the distribution from which we are sampling. Busting the myths around this, Raman said the following common claims are not entirely correct:
- The sample mean tends to be approximately normally distributed if the sample size is at least 30
- If n>30, use a z-test, and if n<30, use a t-test (you can use a t-test for larger samples too)
But, the question is, how did n=30 come about?
Raman said the n=30 heuristic was not devised for purely mathematical or statistical reasons. Historically, when computers were still in their infancy, people used printed tables, much like a logarithm table, to look up the values of a distribution for any combination of degrees of freedom (DF) and significance level.
Because such a table had to fit many distributions (t, normal, chi-square, etc.), the t distribution page was restricted to only 30 entries. Also, at n=30, the normal distribution and the t distribution were deemed ‘approximately’ the same.
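Whether n=30 is ‘enough’ depends on the population, which a quick simulation can show. The sketch below is an illustration under assumed settings (an exponential population, 4,000 repetitions, and the hypothetical helpers `skewness` and `skew_of_sample_means`): it measures how skewed the distribution of the sample mean still is at n=30 and at n=500. A normal distribution has skewness 0.

```python
import random

def skewness(values):
    """Sample skewness (third standardised moment); 0 for a symmetric distribution."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def skew_of_sample_means(n, reps=4000):
    """Draw `reps` samples of size n from an exponential population and
    return the skewness of the resulting sample means (hypothetical helper)."""
    means = [sum(random.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]
    return skewness(means)

random.seed(3)
skew_30 = skew_of_sample_means(30)
skew_500 = skew_of_sample_means(500)

print("skewness of sample means at n=30: ", round(skew_30, 3))
print("skewness of sample means at n=500:", round(skew_500, 3))
```

For this population the sample mean is still visibly skewed at n=30 and only gradually becomes more symmetric as n grows. There is no threshold at which normality switches on, and a more heavy-tailed population would need a larger n still.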
A confidence interval means the probability that the parameter of interest lies within the interval is 90 per cent or 95 per cent
The confidence interval is one of the most confusing topics in statistics. Unfortunately, the internet is filled with incorrect explanations. “Even Wikipedia’s definition is so ambiguous that it has put a blaring sign of caution!” said Raman.
Raman said, for this reason, many aspiring data scientists/statisticians cannot be faulted for learning it wrongly. However, it is important to learn it correctly, as it plays a major role in null hypothesis significance testing (NHST) inference. Therefore, interpreting the confidence interval correctly in the output table is crucial.
Here are some of the ways confidence intervals are wrongly defined:
- The probability that the parameter of interest lies within the interval is 90 per cent or 95 per cent
- There is a 95 per cent probability that the mean weight is between 50 kilograms and 70 kilograms
- The probability that the mean will range between 50 kilograms and 70 kilograms is 95 per cent
“From a frequentist perspective, it does not make any sense,” said Raman. He noted that the parameter of interest either lies within the interval or does not. There is no probability or chance (per cent) associated with it.
Correct definition: If one repeated the same procedure on different samples and constructed a confidence interval each time, then in 95 per cent of the cases, the confidence interval constructed for that sample would contain the true parameter.
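The repeated-sampling definition can be checked with a simulation. The sketch below uses assumed values throughout (a true mean weight of 60 kg, a known standard deviation of 10 kg, samples of 25, and a z-based interval for simplicity): it builds a 95 per cent interval from each of many samples and counts how often the interval contains the true mean.

```python
import random
from statistics import NormalDist

random.seed(4)

# Assumed population for illustration: weights with true mean 60 kg, sd 10 kg
true_mean, sigma, n = 60.0, 10.0, 25

# 95 per cent z-interval with known sigma: sample_mean +/- z * sigma / sqrt(n)
z = NormalDist().inv_cdf(0.975)
half_width = z * sigma / n ** 0.5

reps = 5000
covered = 0
for _ in range(reps):
    sample_mean = sum(random.gauss(true_mean, sigma) for _ in range(n)) / n
    lower, upper = sample_mean - half_width, sample_mean + half_width
    # For any single interval the true mean is either inside or outside;
    # the 95 per cent describes the long-run share of intervals that capture it
    if lower <= true_mean <= upper:
        covered += 1

coverage = covered / reps
print("share of intervals containing the true mean:", coverage)
```

The true mean never moves; the intervals do. The 95 per cent is a property of the procedure across many samples, which is why a statement about one particular interval such as 50 to 70 kilograms does not carry a probability in the frequentist sense.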
Your strategy to deal with outliers is to throw them out just to fit the curve more smoothly
Adrian Olszewski, Principal Biostatistician at 2KMM CRO, believes in embracing outliers and skewness. Citing the Flint water crisis, he said that you need to investigate outliers even if they are errors. In that case, people ignored the abnormal concentration of lead in water, and some died.
“I don’t say ‘don’t clean your data.’ I say: don’t do things without the context and use proper methods. Go the right way,” wrote Olszewski in his LinkedIn post.
Your strategy for imputation of missing values is to fill it by average (mean) values
Most machine learning (ML) algorithms need numeric input values and a value for each row and column in a dataset. Therefore, missing values can cause problems for ML algorithms.
Identifying missing values in a dataset and replacing them with numeric values is called data imputing, or missing data imputation. One popular approach is to use a statistic estimated from the values present in a column and then replace all missing values in that column with the calculated statistic.
This approach is popular because the statistics are fast to calculate and often prove effective. Commonly used statistics include the column mean, the column median, the column mode, and a constant value.
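One well-known drawback of filling every gap with the column mean, which the description above does not spell out, is that it shrinks the column's variance. The following sketch demonstrates this on simulated data (the normal population, the 30 per cent missing rate, and all variable names are assumptions for illustration).

```python
import random
from statistics import mean, variance

random.seed(5)

# Simulated column (assumption): 1,000 values from N(50, 10)
complete = [random.gauss(50, 10) for _ in range(1000)]

# Remove roughly 30 per cent of the values completely at random
data = [v if random.random() > 0.3 else None for v in complete]
observed = [v for v in data if v is not None]

# Mean imputation: replace every missing entry with the mean of the observed values
fill_value = mean(observed)
imputed = [v if v is not None else fill_value for v in data]

var_observed = variance(observed)
var_imputed = variance(imputed)

print("missing entries:", len(data) - len(observed))
print("mean before / after imputation:", round(mean(observed), 3), round(mean(imputed), 3))
print("variance before / after imputation:", round(var_observed, 2), round(var_imputed, 2))
```

The mean is preserved, but every imputed row sits exactly at the centre, so the spread of the column is understated and relationships with other columns are weakened. This is one reason an interviewer may push back on mean imputation as a default strategy.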
Besides these, there are other factors that experts believe can get you rejected in a data science interview. As pointed out by Health IQ VP – product management, Praful Krishna, some of them include:
- Inability to demonstrate technical bent of mind on paper
- Inability to complete basic sums, puzzles, problems, etc.
- Inability to understand basic statistical concepts like Naive Bayes
- Inability to take initiative in the problem scenarios with respect to redundant approaches and calculations
- Inability to connect with the interviewer, most likely because of ego or abrasive personality of the candidate
At the same time, not knowing techniques to process quality data can also get you rejected. Tech expert Allen Woods said there is only one constant, and that is – garbage in, garbage out (GIGO). If folks do not get the significance of that, then all the analysis approaches under the sun are for nothing, he added.
“So, a vital question is what steps do you take to validate and verify your data and as part of that, how do you determine and resolve any shortfalls in the scope of observation? And if part of the resolution of observation shortfalls is the inclusion of new data forms, what would be the impact, in terms of scope, in, say, a graph schema in terms of shape? It’s not all about statistical technique,” said Woods.