Top 6 Common Biases In ML Models

Machine learning is not just about machines. At least not yet. There is still a human element in the loop, and it looks like this will continue for some time. In other words, artificial general intelligence (AGI) is a distant dream. Since humans remain involved in the learning processes of ML models, their underlying biases surface in the form of inaccurate results.

Having an unbiased model is almost impossible because humans generate the data, and a model is only as good as the data it is fed. So it is the job of the data engineer to keep an eye on the ways in which bias can enter the system. According to Google's developer documentation, the following are the biases most commonly encountered while training a machine learning model:

Automation Bias

Automation bias occurs when a human decision-maker favours recommendations made by an automated decision-making system over information obtained without automation, even when the automated system is known to make errors.


Confirmation Bias

Confirmation bias is the tendency to search for or interpret information in a way that confirms one’s prejudices (hypotheses). Machine learning developers may sometimes collect or label data in ways that satisfy their pre-existing beliefs. These biases seep into the results and can sometimes blow up on a large scale.

A related form is experimenter bias, where a data scientist keeps training a model until their previously held hypothesis is confirmed.
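The experimenter-bias pattern can be sketched as an optional-stopping simulation. This is a hypothetical toy, not any team's actual workflow: a fair coin is flipped repeatedly, and the "experiment" stops as soon as the running heads rate happens to support the hypothesis that the coin favours heads.

```python
import random

random.seed(1)

def run_until_confirmed(max_flips=1000, threshold=0.55):
    """Flip a fair coin, stopping as soon as the heads rate
    'confirms' the hypothesis that the coin is biased."""
    heads = 0
    for n in range(1, max_flips + 1):
        heads += random.random() < 0.5   # the coin is actually fair
        if n >= 20 and heads / n >= threshold:
            return n, heads / n          # stop the moment it "confirms"
    return max_flips, heads / max_flips

# Count how many of 200 such experiments "confirm" a bias
# that does not exist, purely by stopping at a convenient moment.
confirmed = sum(run_until_confirmed()[1] >= 0.55 for _ in range(200))
print(f"runs that 'confirmed' a bias in a fair coin: {confirmed}/200")
```

A large share of runs declare the fair coin biased, showing how choosing when to stop collecting evidence can manufacture confirmation.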

Group Attribution Bias

This bias occurs when the assumption that what is true of one person is true of the whole group is taken too seriously. Its effects can worsen further if convenience sampling is used for data collection, since the attributions made this way rarely reflect reality.

Out-Group Homogeneity Bias

Consider two groups of families: one has a pair of twins, the other does not. Asked to tell the twins apart, the non-twin family might falter, whereas the twins’ parents will identify them with ease and might even give a nuanced description. To the non-twin family, the twins look all but the same. The readiness with which we make sweeping assumptions about groups outside our own leads to out-group homogeneity bias. Its counterpart, in-group bias, works the other way around.

Selection Bias

Selection bias results from errors in the way sampling is done. For example, suppose we need to build an ML model that predicts audience sentiment about films. If data is collected by handing audience members a survey form, the following forms of bias can appear:

  • Coverage bias: When the population represented in the dataset does not match the population that the machine learning model is making predictions about. Taking the same movie example as above, by sampling from a population who chose to see the movie, the model’s predictions may not generalize to people who did not already express that level of interest in the film.
  • Sampling bias: This occurs when the sample is not random or diverse. If only the reviews of front-row viewers in a theatre are taken instead of those of a random group, then, needless to say, we will hardly capture the sentiments of the whole audience.
  • Non-response bias: This bias originates with the respondents rather than the data collectors. It occurs when certain sections of the audience choose not to review the movie. If the neutral audience stays away and only those with strong opinions, usually the fans, pile into the reviews, the results will lean in favour of the film. This bias is also known as participation bias.
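The non-response effect above can be sketched with a small simulation. The numbers are hypothetical: sentiment is on a 1–5 scale, and only strongly opinionated viewers bother to respond.

```python
import random

random.seed(0)

# Hypothetical true audience sentiment on a 1-5 scale,
# centred around a mildly positive 3.5.
population = [random.gauss(3.5, 1.0) for _ in range(10_000)]
population = [min(5.0, max(1.0, s)) for s in population]
true_mean = sum(population) / len(population)

# Non-response bias: only viewers far from the neutral score (3.0)
# submit a review; the indifferent majority stays silent.
respondents = [s for s in population if abs(s - 3.0) > 1.2]
observed_mean = sum(respondents) / len(respondents)

print(f"true mean sentiment:     {true_mean:.2f}")
print(f"observed mean sentiment: {observed_mean:.2f}")
```

Because the population leans slightly positive, the strongly opinionated respondents skew even more positive, so the observed average overstates how much the audience liked the film.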

Reporting Bias

Suppose an NLP model is trained on a dataset containing news from the last few decades. Though calling news biased is an understatement, a peculiar kind of bias emerges from the way actions are documented: the frequency with which events are reported does not match their frequency in the real world. For example, if the word ‘laughed’ appears more often than ‘breathed’ in the corpus, a machine learning model that relies on word frequency will conclude that laughing is more common than breathing!
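A minimal sketch of the ‘laughed’ vs. ‘breathed’ example, using a hypothetical toy corpus: reporters write up the remarkable, not the mundane, so a frequency count inverts reality.

```python
from collections import Counter

# Hypothetical toy "news corpus": laughing gets reported,
# breathing almost never does.
corpus = [
    "the crowd laughed at the comedian",
    "she laughed when she heard the verdict",
    "he laughed off the criticism",
    "the diver breathed through a regulator",
]

counts = Counter(word for sentence in corpus for word in sentence.split())

# A frequency-based model would infer that laughing is more common
# than breathing, even though everyone alive breathes constantly.
print(counts["laughed"], counts["breathed"])  # 3 1
```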

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
