Top 6 Common Biases In ML Models

Machine learning is not just about machines. At least not yet. There is still a human in the loop, and it looks like this will continue for some time; in other words, artificial general intelligence (AGI) remains a distant dream. Since humans shape the learning process of ML models, their underlying biases surface in the form of inaccurate results.

Having an unbiased model is almost impossible, as humans generate the data, and a model is only as good as the data it is fed. So it is the job of the data engineer to keep an eye on the ways in which bias can enter the system. According to the Google developers team, the following are the most commonly encountered biases during the training of a machine learning model:

Automation Bias

Automation bias occurs when a human decision-maker favours recommendations made by an automated decision-making system over information obtained without automation, even when the automated system is known to make errors.

Confirmation Bias

Confirmation bias is the tendency to search for or interpret information in a way that confirms one's existing beliefs or hypotheses. Machine learning developers may inadvertently collect or label data in a way that satisfies those preconceptions. These biases seep into the results and can sometimes blow up at scale.

A related form is experimenter bias, where a data scientist keeps training a model until a previously held hypothesis has been confirmed.
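
To make the experimenter-bias failure mode concrete, here is a minimal, purely illustrative sketch (all names and numbers are invented for this example): the "model" is just a fair coin, and the evaluation is simply re-run until random noise happens to favour the hypothesis that the coin is biased towards heads.

```python
import random

random.seed(0)

# Hypothetical illustration of experimenter bias: repeatedly re-running a
# noisy evaluation until one run happens to "confirm" a favoured hypothesis.
# The coin is fair, so any apparent bias found this way is pure noise.
def biased_evaluation(trials=30, threshold=0.6, max_attempts=100):
    for attempt in range(1, max_attempts + 1):
        heads = sum(random.random() < 0.5 for _ in range(trials))
        rate = heads / trials
        if rate >= threshold:   # stop as soon as the result looks favourable
            return attempt, rate
    return max_attempts, rate   # give up after max_attempts

attempt, rate = biased_evaluation()
print(f"'confirmed' a {rate:.0%} heads rate on attempt {attempt}")
```

Stopping the experiment the moment the numbers look good is exactly how a data scientist can "prove" a hypothesis that the data does not actually support.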

Group Attribution Bias

This bias occurs when the assumption that what is true of an individual is also true of their group is taken too far. Its effects can worsen if convenience sampling is used for data collection; attributions made this way rarely reflect reality.

Out-Group Homogeneity Bias

Consider two families: one has a pair of twins, the other does not. Asked to tell the twins apart, the non-twin family might falter, whereas the twins' parents will distinguish them with ease and might even offer a nuanced description. To the non-twin family, the twins look practically identical. This haste with which we make sweeping assumptions about groups outside our own leads to out-group homogeneity bias. There is also an in-group bias, which works the other way around.

Selection Bias

Selection bias results from errors in the way sampling is done. For example, suppose we need to build an ML model that predicts audience sentiment about films. If data is collected by handing the audience a survey form, the following forms of bias can appear:

  • Coverage bias: When the population represented in the dataset does not match the population the model makes predictions about. In the movie example, sampling from people who chose to see the movie means the model's predictions may not generalise to people who never expressed that level of interest in the film.
  • Sampling bias: This occurs when the sample is not random or diverse. If only the reviews of the people in the front rows of a theatre are collected instead of a random group, then, needless to say, we will hardly capture the sentiments of the whole audience.
  • Non-response bias: This occurs when certain sections of the audience choose not to review the movie at all, so the bias originates with the respondents rather than the data collectors. If the neutral audience stays away from reviewing and only those with strong opinions, usually the fans, pile into the reviews, the results will lean in the film's favour. This is also known as participation bias.
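
The sampling-bias bullet above can be sketched with a small simulation. Everything here is synthetic and for illustration only: we assume the front rows are mostly occupied by fans whose sentiment scores skew high, and compare a front-row-only estimate against a properly random sample of the whole theatre.

```python
import random

random.seed(42)

# Synthetic sentiment scores on a 0-10 scale (assumed for illustration):
# front-row fans skew high; the rest of the audience is more mixed.
fans     = [random.gauss(8.5, 1.0) for _ in range(50)]    # front rows
everyone = fans + [random.gauss(5.0, 2.0) for _ in range(450)]

def mean(xs):
    return sum(xs) / len(xs)

biased_sample = fans                          # survey only the front rows
random_sample = random.sample(everyone, 50)   # uniform over the audience

print(f"front-row mean sentiment: {mean(biased_sample):.1f}")
print(f"random-sample mean:       {mean(random_sample):.1f}")
# The front-row estimate overstates the audience's true sentiment.
```

Both samples have the same size; only the selection mechanism differs, and that alone is enough to shift the estimate by several points.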

Reporting Bias

Suppose an NLP model is trained on a dataset of news from the last few decades. Beyond the usual editorial bias, there is a peculiar kind of bias that emerges from the way actions are documented: notable events get written down, routine ones do not. For example, if the word 'laughed' is more prevalent than 'breathed' in the corpus, a machine learning model that relies on word frequencies will conclude that laughing is more common than breathing!
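
The laughed-versus-breathed effect is easy to reproduce on a toy corpus. The sentences below are invented for illustration; the point is only that raw word frequencies reflect what gets reported, not what actually happens.

```python
from collections import Counter

# Toy "news" corpus: notable actions get written down; routine ones don't.
corpus = [
    "the crowd laughed at the joke",
    "she laughed and waved",
    "he laughed until he cried",
    "the runner breathed hard at the finish",
]

counts = Counter(word for line in corpus for word in line.split())
print(counts["laughed"], counts["breathed"])   # 3 vs 1

# A frequency-based model would infer laughing is 3x as common as breathing,
# even though everyone in these stories was breathing the whole time.
```

Any model trained on such counts inherits the reporting bias of the corpus, no matter how good the modelling itself is.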

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
