# Top 6 most common statistical errors made by data scientists

Data scientists are often seen as a rare breed of professionals who can solve the world’s thorniest problems. These data-savvy professionals are believed to combine statistical and computational ingenuity, yet they too are prone to mistakes. Having covered the makings of a data scientist extensively, it is time to turn our gaze to the six most common statistical mistakes data scientists make. These errors typically involve the types of measurements used, the variability of the data, and the sample size. Statistics provides the answers, but in some cases it confuses too.

### Correlation is not causation

According to Tom Fawcett, a leading data science veteran and co-author of *Data Science for Business*, an underlying principle in statistics and data science is that correlation is not causation: just because two things appear to be related to each other doesn’t mean that one causes the other. This is arguably the most common mistake in time-series analysis. Fawcett cites the example of a stock market index and the unrelated time series of the number of times Jennifer Lawrence was mentioned in the media. The lines look amusingly similar, and such charts usually carry a statement like “Correlation = 0.86”. Recall that a correlation coefficient ranges from +1 (a perfect linear relationship) to -1 (a perfectly inverse relationship), with zero meaning no linear relationship at all. A value of 0.86 is high, suggesting that the statistical relationship between the two time series is strong. Fawcett goes on to add that when exploring relationships between two time series, what one really wants to know is whether the variations in one series are correlated with variations in the other.
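
Fawcett’s point can be demonstrated with a short simulation (a hypothetical sketch using NumPy, not his original data): two entirely independent upward-trending series correlate strongly at the level, but their period-to-period variations do not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
t = np.arange(n)

# Two unrelated series that both happen to trend upward over time.
a = t + rng.normal(0, 5, n)   # e.g. a stock market index
b = t + rng.normal(0, 5, n)   # e.g. media mentions of a celebrity

# Correlating raw levels: spuriously high because of the shared trend.
r_levels = np.corrcoef(a, b)[0, 1]

# Correlating period-to-period changes: close to zero, as expected
# for two independent series.
r_changes = np.corrcoef(np.diff(a), np.diff(b))[0, 1]

print(f"correlation of levels:  {r_levels:.2f}")
print(f"correlation of changes: {r_changes:.2f}")
```

Differencing the series before correlating them is one simple way to ask Fawcett’s question about variations rather than trends.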

### Biased Data

We have all heard of biased algorithms, but there is biased data as well. Biased sampling, in which the sample is unrepresentative of the population, leads to measurement error. In such cases, data scientists arrive at results that may look plausible but are inaccurate because the estimators are biased. An estimator is a rule for calculating an estimate of a given quantity from observed data. In fact, non-random samples are generally biased, and their data cannot be used to represent any population beyond themselves.
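
A toy simulation (an illustrative sketch with made-up income data, not from the article) shows how a non-random sample biases an otherwise sensible estimator like the sample mean:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: 100,000 skewed "incomes".
population = rng.lognormal(mean=10, sigma=0.5, size=100_000)
true_mean = population.mean()

# Random sample: the sample mean is an unbiased estimator here.
random_sample = rng.choice(population, size=500, replace=False)

# Biased sample: e.g. a survey that only reaches above-median earners.
biased_sample = rng.choice(
    population[population > np.median(population)], size=500, replace=False
)

print(f"true mean:        {true_mean:,.0f}")
print(f"random estimate:  {random_sample.mean():,.0f}")
print(f"biased estimate:  {biased_sample.mean():,.0f}")
```

No amount of extra data from the biased channel fixes the gap; only a representative sampling design does.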

### Regression Error

In basic linear or logistic regression, mistakes arise from not knowing what to test for in the regression table. In regression analysis, one models a dependent (response) variable that varies with the value of the independent (predictor) variables. The first step is to specify the model by defining the response and predictor variables, and this is where many data scientists trip up by misspecifying the model. To avoid model misspecification, one must establish what functional relationship, if any, exists between the variables under consideration.
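
A minimal sketch of misspecification (hypothetical synthetic data): fitting a straight line to a relationship that is actually quadratic produces a near-useless model, while including the right functional form fits well.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 200)
y = 2 + x**2 + rng.normal(0, 0.5, 200)   # true relationship is quadratic

def r_squared(y, y_hat):
    """Coefficient of determination for fitted values y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Misspecified model: a straight line through curved data.
lin = np.polynomial.Polynomial.fit(x, y, deg=1)
# Correctly specified model: includes the quadratic term.
quad = np.polynomial.Polynomial.fit(x, y, deg=2)

r2_lin = r_squared(y, lin(x))
r2_quad = r_squared(y, quad(x))
print(f"linear R^2:    {r2_lin:.2f}")
print(f"quadratic R^2: {r2_quad:.2f}")
```

Plotting the residuals of the linear fit would show the same story: a clear U-shaped pattern, a classic sign that the model is misspecified.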

### Misunderstanding P Value

Long pegged as the ‘gold standard’ of statistical validity, P values are a nebulous concept, and many scientists believe they are not as reliable as researchers often assume. P values are used to determine statistical significance in a hypothesis test. According to the American Statistical Association, P values measure neither the probability that the studied hypothesis is true nor the probability that the data were produced by random chance alone. Hence, business and organizational decisions should not be based solely on whether a P value passes a specific threshold. Many believe that data manipulation and significance chasing can make it impossible to draw the right conclusions from findings.
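
One reason threshold-chasing misleads can be shown in a small simulation (an illustrative sketch, not from the ASA statement): when the null hypothesis is true in every test, roughly 5% of tests still come out “significant” at P < 0.05, purely by chance.

```python
import math
import numpy as np

rng = np.random.default_rng(7)

def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))

n_tests, n, false_positives = 2000, 30, 0
for _ in range(n_tests):
    # Both samples come from the same N(0, 1): the null is true by construction.
    a = rng.normal(0, 1, n)
    b = rng.normal(0, 1, n)
    z = (a.mean() - b.mean()) / math.sqrt(2 / n)
    if two_sided_p(z) < 0.05:
        false_positives += 1   # "significant" purely by chance

rate = false_positives / n_tests
print(f"false positive rate: {rate:.3f}")   # close to 0.05 by design
```

Run enough comparisons and some will clear the threshold even when no real effect exists, which is why a P value alone should not drive a decision.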

### Inadequate Handling of Outliers and Influential Data Points

Outliers can distort any statistical analysis, so they should be investigated and then deleted, corrected, or explained as appropriate. For auditable work, the decision on how each outlier was treated should be documented. Sometimes a loss of information is a valid trade-off in return for enhanced comprehension.
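
One common way to flag candidate outliers for investigation is Tukey’s interquartile-range rule (a sketch with hypothetical sensor readings; the 1.5 multiplier is the conventional default, not the only choice):

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return (x < lo) | (x > hi)

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 55.0])  # one bad reading
mask = iqr_outliers(data)
print("flagged outliers:    ", data[mask])        # -> [55.]
print("mean with outlier:   ", round(data.mean(), 2))
print("mean without outlier:", round(data[~mask].mean(), 2))
```

Note that the rule only flags points; whether to delete, correct, or explain each one is still a judgment call that should be recorded.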

### Loss of information

The main objective of statistical data analysis is to deliver the best business outcome with minimal modeling or human bias. Sometimes, a loss of information in individual data points can affect the result and its relationship with the data set.

Richa Bhatia is a seasoned journalist with six years’ experience in reportage and news coverage and has had stints at Times of India and The Indian Express. She is an avid reader, mum to a feisty two-year-old and loves writing about the next-gen technology that is shaping our world.
