The notion of machine learning fairness can be boiled down to the following criteria:
- Demographic parity
- Equal opportunity
- Equalised odds
- Disparate impact
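As a rough illustration of how these four criteria are commonly quantified for a binary classifier and two groups, here is a minimal sketch in plain Python. The function names (`group_rates`, `fairness_gaps`), the toy labels and the group names "a" and "b" are all hypothetical and exist only for this example. The definitions used are the usual textbook ones: demographic parity compares selection rates, equal opportunity compares true positive rates, equalised odds compares both true and false positive rates, and disparate impact is the ratio of selection rates.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, true positive rate and false positive rate."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        # "t"/"f" = prediction correct or not, "p"/"n" = predicted class
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        counts[g][key] += 1

    rates = {}
    for g, c in counts.items():
        total = c["tp"] + c["fp"] + c["fn"] + c["tn"]
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        rates[g] = {
            "selection_rate": (c["tp"] + c["fp"]) / total,
            "tpr": c["tp"] / positives if positives else float("nan"),
            "fpr": c["fp"] / negatives if negatives else float("nan"),
        }
    return rates


def fairness_gaps(rates, group, reference):
    """Compare one group against a reference group on the four criteria."""
    g, r = rates[group], rates[reference]
    tpr_gap = g["tpr"] - r["tpr"]
    fpr_gap = g["fpr"] - r["fpr"]
    ratio = (
        g["selection_rate"] / r["selection_rate"]
        if r["selection_rate"]
        else float("nan")
    )
    return {
        "demographic_parity_diff": g["selection_rate"] - r["selection_rate"],
        "equal_opportunity_diff": tpr_gap,
        # Equalised odds needs both TPR and FPR to match; report the worse gap.
        "equalised_odds_gap": max(abs(tpr_gap), abs(fpr_gap)),
        "disparate_impact_ratio": ratio,
    }


# Hypothetical toy data: 1 = positive outcome, two groups of four people each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = group_rates(y_true, y_pred, groups)
gaps = fairness_gaps(rates, group="b", reference="a")
print(gaps)
```

A difference of zero (or a ratio of one) would mean the two groups are treated identically under that criterion. In this toy data group "b" is selected a third as often as group "a", which would fall well below the four-fifths threshold often used as a rule of thumb for disparate impact.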
Removing prejudice from a model might not be an impossible task, but can any application that serves humans be immune to human bias itself? And even if the human element is accounted for, how much of it is too much?
With great ML deployment comes great responsibility. The spike in interest in test fairness in the 1960s arose during a time of social and political upheaval, with quantitative definitions catalysed in part by U.S. federal anti-discrimination legislation in the domains of education and employment.
Here’s a look at a few significant events that exposed the unfairness of the system:
- Concerned with the fairness of tests for black and white students, T. Anne Cleary defined a quantitative measure of test bias for the first time, cast in terms of a formal model for predicting educational outcomes from test scores.
- While Cleary’s focus was on education, her contemporary Robert Guion was concerned with unfair discrimination in employment. Arguing for the importance of quantitative analyses in 1966, he wrote that: “Illegal discrimination is largely an ethical matter, but the fulfillment of ethical responsibility begins with technical competence”, and defined unfair discrimination to be “when persons with equal probabilities of success on the job have unequal probabilities of being hired for the job.”
- Responding to these concerns, the Association of Black Psychologists formed in 1969.
- The 1970s saw researchers such as Thorndike argue that judgments about test fairness must rest on the inferences made from the test rather than on a comparison of mean scores in the two populations. Thorndike held that one must therefore focus attention on the fair use of test scores, rather than on the scores themselves.
- As an alternative to Cleary, Thorndike proposed that the ratio of predicted positives to ground truth positives be equal for each group. Using confusion matrix terminology, this is equivalent to requiring that the ratio (TP + FP) / (TP + FN) be equal for each subgroup.
- In his 1976 book Computer Power and Human Reason, artificial intelligence pioneer Joseph Weizenbaum suggested that bias could arise both from the data used in a program and from the way a program is coded.
- With the start of the 1980s came renewed public debate about the existence of racial differences in general intelligence, and the implications for fair testing.
- In 1981, with no public debate, the United States Employment Services implemented a score-adjustment strategy that was sometimes called “race-norming”.
- As many as 60 women and ethnic minorities were denied entry to St. George’s Hospital Medical School from 1982 to 1986, after the school implemented a new computer-guidance assessment system that, based on historical trends in admissions, denied entry to women and to men with “foreign-sounding names”.
- Batya Friedman and the philosopher Helen Nissenbaum (1996) discussed bias concerns in the use of computer systems for tasks as diverse as scheduling, employment matching, flight routing, and automated legal aid for immigration.
- Friedman and Nissenbaum (1996) also examined the history of the algorithm for the National Resident Match Program, which matches medical residents to hospitals throughout the United States. The algorithm’s seemingly equitable assignment rules favoured hospital preferences over resident preferences and single residents over married residents.
- For its flawed recruiting tool, Amazon created 500 computer models focused on specific job functions and locations, and taught each to recognise some 50,000 terms that showed up on past candidates’ resumes. In practice, the technology favoured candidates who described themselves using verbs more commonly found on male engineers’ resumes, such as “executed” and “captured,” one person said.
- Microsoft’s Twitter-based AI chatbot Tay, despite being stress-tested “under a variety of conditions, specifically to make interacting with Tay a positive experience,” learned anti-Semitic and racist behaviour due to the efforts of a specific group of individuals. By being repeatedly exposed to similar types of discriminatory content, Tay acquired numerous discriminatory biases.
- A year after Tay was shut down, Microsoft launched another chatbot known as Zo. To avoid exhibiting bias, Zo included filters for rejecting discussion about topics that referenced religion or politics, and these measures made it more resistant to exhibiting discriminatory biases. Even so, Zo faced similar public backlash after exhibiting learned anti-Islamic biases.
- In 2015, Google came under fire after its new Photos application categorised photos of Jacky Alciné and his girlfriend as “gorillas.” Google attempted to fix the algorithm but ultimately removed the gorilla label altogether.
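Of the events listed, Thorndike's criterion is the one that translates most directly into a calculation. The following is a minimal sketch in plain Python of the ratio (TP + FP) / (TP + FN), computed per group; the function names and the toy data are hypothetical and are not taken from Thorndike's work.

```python
def thorndike_ratio(y_true, y_pred):
    """Predicted positives over ground truth positives: (TP + FP) / (TP + FN)."""
    predicted_positives = sum(1 for p in y_pred if p == 1)  # TP + FP
    actual_positives = sum(1 for t in y_true if t == 1)     # TP + FN
    if actual_positives == 0:
        raise ValueError("ratio is undefined without ground truth positives")
    return predicted_positives / actual_positives


def thorndike_by_group(y_true, y_pred, groups):
    """Split the labels by group and compute the ratio for each one."""
    split = {}
    for t, p, g in zip(y_true, y_pred, groups):
        truths, preds = split.setdefault(g, ([], []))
        truths.append(t)
        preds.append(p)
    return {g: thorndike_ratio(ts, ps) for g, (ts, ps) in split.items()}


# Hypothetical toy data: 1 = selected / successful, two groups of four.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

ratios = thorndike_by_group(y_true, y_pred, groups)
print(ratios)
```

Under this criterion the selection procedure would count as fair only if the per-group ratios are equal. In the toy data group "a" is selected more often than its ground truth positives warrant and group "b" less often, so the criterion is not met.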
In a study covering 50 years of work on test and machine learning fairness, Google researchers concluded the following:
- In the 1960s and 1970s, the fascination with determining fairness ultimately died out as the work became less tied to the practical needs of society, politics and the law, and more tied to unambiguously identifying fairness.
- The rise of interest in fairness today has corresponded with the public interest in the use of machine learning in criminal sentencing and predictive policing.
- Careful attention should be paid to legal and public concerns about fairness.
- The experiences of the test fairness field suggest that in the coming years, courts may start ruling on the fairness of ML models.
- If technical definitions of fairness stray too far from the public’s perceptions of fairness, then the political will to use scientific contributions in advance of public policy may be difficult to obtain.
Machine learning fairness is an active area of research at many big tech companies. For instance, in September 2018, Google debuted its What-If Tool, which allows users to generate visualisations on the fly that explore the impact of algorithmic tweaks and adjustments on bias in their datasets. Google researchers have also tackled the problem of bias in labelling by providing a mathematical formulation of how biases arise in labelling and how they can be mitigated.
The proposed solutions barely scratch the surface, and right now bias is almost inevitable. Whether the bias in question concerns gender, race or culture, the problem often seems to be the over-representation of certain groups, which goes back to the way data is collected. Speaking at the recently concluded Analytics India Magazine’s conference, TheMath Company stressed the need for data ethnographers in the current scenario.
Datasets have to be prepared or collected from some source, and that source is a by-product of human interactions. The collected data is then cleaned and labelled with classes. No matter how unbiased these sub-groups were planned to be, an unwritten, underlying bias remains.