Right Data Selection Makes A Right Impact On Your Analysis

Analyzing your data correctly and selecting the right data go hand in hand. Checking that suitable data samples are coming through is a key activity that eventually leads to success. You cannot always get the data itself right, because it is user-driven, but choosing the right data for your analysis is entirely within a data analyst's control!

Data is often biased, mainly because of the nature of the business, the geographies it operates in, seasonal variations and many other factors. You should never let this bias flow into the sample you select.

Let's assume your business is running a survey before launching a new product in a geography where it has never operated. How would you arrive at a decision? What sort of sampling techniques would help? Bias in the survey results can sway the analysis towards whatever negativity appears in the responses, leading to a bad decision: launching the wrong product, or an even bigger mistake that further hurts revenue. It is highly important for data analysts to be involved in this process, as sample selection involves a great many considerations.



There are various sampling techniques out there and statistical measures of inference.  

Machine learning and predictive algorithms depend on the right data sample. It is not the model you fit to the problem that determines whether the outcome is sound; it is the earlier step of selecting the right sample that decides!


While it is difficult to get a representative data sample, there are a few sampling techniques out there to help you.

Let's return to the business that is surveying a new geographical market before a product launch, with the following scenario: the survey has been taken by more than 1 million individuals across 5 geographies.

The size of the sample drawn from a population is decided by the degree of confidence required, typically 95%. Here, the sample taken is about 0.1 million observations, roughly a tenth of the population.
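One way to keep geographic bias out of such a sample is stratified sampling: draw the same fraction from every geography so the sample keeps the population's mix. The sketch below is only an illustration under assumptions. The geography names, the group sizes and the `stratified_sample` helper are hypothetical, and the population is scaled down to 10,000 responses so it runs quickly.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction of records from every stratum (here: geography)."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        k = round(len(members) * fraction)  # proportional allocation
        sample.extend(rng.sample(members, k))  # sampling without replacement
    return sample

# Hypothetical survey: 10,000 responses spread unevenly over 5 geographies.
GEO_SIZES = {"North": 4000, "South": 2500, "East": 2000, "West": 1000, "Central": 500}
population = [
    {"respondent_id": f"{geo}-{i}", "geography": geo}
    for geo, size in GEO_SIZES.items()
    for i in range(size)
]

# Take one tenth of each geography, mirroring the one-tenth sample in the scenario.
sample = stratified_sample(population, key=lambda r: r["geography"], fraction=0.1)
counts = Counter(r["geography"] for r in sample)
print(len(sample), dict(counts))
```

Because every geography contributes in proportion to its size, a small market such as the hypothetical "Central" group is neither dropped nor over-represented.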

A number of inferential statistical measures can help you evaluate whether a sample has been selected well, decide whether it is fit to represent a given population, and make generalizations about that population. They help assess likelihood as accurately as possible and build confidence in the data sample chosen for researching a data science problem.

Testing conducted around an inferential problem statement is called hypothesis testing.

In this article, I will focus on the T-Test and its significance in inferential statistics. It can be used to compare two samples.

The test assumes that each sample is well distributed, meaning that its spread is approximately normal.

Here is how the T value is calculated for two samples:

T = ( Mean(Sample1) - Mean(Sample2) ) / SQRT( SQR(Standard Deviation(Sample1)) / N1 + SQR(Standard Deviation(Sample2)) / N2 )

where N1 and N2 are the sizes of Sample1 and Sample2.
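The calculation can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a full statistics package: the `t_statistic` name and the two small samples are made up for the example, and each squared standard deviation is divided by the size of its own sample.

```python
import math
import statistics

def t_statistic(sample1, sample2):
    """Difference of the sample means divided by its standard error."""
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    # statistics.variance is the sample variance, i.e. the squared standard deviation.
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)
    standard_error = math.sqrt(var1 / len(sample1) + var2 / len(sample2))
    return (mean1 - mean2) / standard_error

# Hypothetical survey scores from two samples.
sample_a = [1, 2, 3, 4, 5]
sample_b = [2, 3, 4, 5, 6]
t_value = t_statistic(sample_a, sample_b)
print(t_value)  # -1.0 (means 3 and 4, both variances 2.5, standard error 1)
```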

Null Hypothesis (H0): start from the assumption that there is no significant difference between the two samples we have selected.

Select a critical value and compare it with the absolute T value (ignoring its sign).

If the absolute T value < critical value, do not reject the Null Hypothesis.

If the absolute T value > critical value, reject the Null Hypothesis in favour of the alternate hypothesis.

The critical value is found using the T Table. The probability used to look it up is the significance level, conventionally 0.05, 0.025 or 0.01 in line with 6 Sigma standards; 0.05 is the most common choice. A significance of 0.05 means that if the Null Hypothesis were true and we repeated the sampling a hundred times, we would expect to wrongly reject it only about 5 times. The degrees of freedom used to read the table are the sum of the two sample sizes minus 2, that is N1+N2-2.
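Putting the steps together, the decision can be sketched as below. This is an illustration under assumptions rather than a definitive implementation: the `two_sample_t_test` helper and the sample values are hypothetical, the small lookup dictionary holds two-tailed critical values at 0.05 copied from a standard T table for just three degrees of freedom, and the two samples are the same size.

```python
import math
import statistics

# Two-tailed critical values at significance 0.05, from a standard T table.
# Only a few degrees of freedom are listed; extend the table as needed.
T_CRITICAL_05 = {8: 2.306, 10: 2.228, 18: 2.101}

def two_sample_t_test(sample1, sample2, critical_values=T_CRITICAL_05):
    """Return (t_value, degrees_of_freedom, reject_null)."""
    n1, n2 = len(sample1), len(sample2)
    standard_error = math.sqrt(
        statistics.variance(sample1) / n1 + statistics.variance(sample2) / n2
    )
    t_value = (statistics.mean(sample1) - statistics.mean(sample2)) / standard_error
    degrees_of_freedom = n1 + n2 - 2
    critical = critical_values[degrees_of_freedom]  # KeyError if not in the table
    # Two-tailed test: compare the absolute T value with the critical value.
    return t_value, degrees_of_freedom, abs(t_value) > critical

# Hypothetical samples: the first pair is close together, the second far apart.
similar = two_sample_t_test([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
different = two_sample_t_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(similar)    # T value of -1.0 with 8 degrees of freedom: do not reject H0
print(different)  # T value of -5.0 with 8 degrees of freedom: reject H0
```

With 8 degrees of freedom the critical value is 2.306, so the first pair (absolute T value 1.0) gives no reason to reject the Null Hypothesis, while the second pair (absolute T value 5.0) does.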

This article is presented by AIM Expert Network (AEN), an invite-only thought leadership platform for tech experts.


Phani Mantravadi
Phani Mantravadi is a Senior Technology Leader with about 19 years of rich product experience, currently associated with EPAM Systems (India) Private Limited, where he helps customers build world-class products. He is highly experienced in implementing products and solutions at large scale, having worked with product giants Tesco and CDKGLOBAL. He speaks on technology and engineering topics, has given sessions at IIIT Hyderabad and MRE University, and is the author of various technology articles on LinkedIn and AIM as well as self-published work. He is passionate about building solutions using Python and data analytics.
