The past decade has been the decade of artificial intelligence. With the abundance of data generated every day, more and more decisions are made using data analytics and algorithms. The rise of AI in the 21st century has been swift, and it has made an impact on almost every field. However, like many good things, artificial intelligence also has a downside, one that came to light in the 1980s when staff members of St. George’s Hospital Medical School noticed a lack of diversity among their students. They started an inquiry to investigate the matter. In the end, it turned out that the algorithm in charge of the initial screening of applicants classified candidates as “Caucasian” or “non-Caucasian” based on their name and place of birth, and counted the latter against them. Even being female would reduce one’s score by three points. Dr. Geoffrey Franglen, vice-dean and author of the algorithm, stated that his system was merely sustaining the biases that already existed in the admissions process. After all, Franglen had tested the system for a few years against human assessors and found that his program agreed with their decisions 90 to 95 percent of the time. Nevertheless, St. George’s was found guilty of practicing discrimination in its admissions policy, and the world was introduced to a new form of discrimination: prejudice through artificial intelligence.
Prejudiced algorithms
Artificial intelligence and the self-learning programs that underlie it already play a big part in everyday life. Algorithms decide what ends up in our inbox and what gets sent straight to spam, they help when we search on Google, and they power Spotify’s personalized music recommendations. Sometimes, however, these algorithms copy or amplify our prejudice. An example of this is PredPol, an algorithm already in use in the USA to predict when and, more importantly, where crimes will take place. Developed in collaboration with the Los Angeles Police Department, the algorithm uses reports recorded by the police to make a map highlighting all the “crime hotspots”.
However, in 2016 the Human Rights Data Analysis Group found that the program was sending officers to neighborhoods with a high proportion of racial minorities, regardless of the actual crime rates in those areas; in fact, drug crime was estimated to be much higher in predominantly white neighborhoods. They concluded that the algorithm reinforces the bias in police data: it targets areas that are already over-represented in historical police records rather than the areas where drug crime is actually expected to be most prevalent.
However, ethnic minorities are not the only victims of discriminating algorithms. When Amazon tried to improve its recruitment process, it chose to use AI to select candidates for open positions. The algorithm was trained on data from the past ten years, including information about who was hired and who was not. After a while, Amazon came to the conclusion that the algorithm preferred (white) men over women, even though ethnicity and gender were not part of the dataset: the algorithm was still able to infer the gender and ethnicity of candidates from other information, and labeled being a white man as a positive characteristic. It is possible that over those ten years Amazon had (unintentionally) discriminated against candidates on the basis of gender and ethnicity, and that the algorithm simply reinforced this behavior. Amazon itself survived the ordeal without a scratch; the algorithm, however, turned out to be so biased that it could no longer be fixed.
Not every algorithmic verdict is taken as lightly as Amazon’s, however. In 2016, one of the most notorious cases of AI prejudice was exposed: COMPAS, a racially biased algorithm designed to guide sentencing by predicting the likelihood of a criminal reoffending. According to an analysis by the news organization ProPublica, the algorithm predicted criminals of color to be more prone to recidivism than they actually were, and the reverse for white criminals. Because the company that developed the algorithm disputed that claim and the software remains proprietary, the truth is still somewhat of a mystery.
Measuring discrimination in AI
Luckily, such mysteries can be unraveled using proper measures of prejudice. In an article in STAtOR, Jan Steen explains four methods to measure the level of discrimination in artificial intelligence. He illustrates these methods using data from the Florida Department of Corrections (ProPublica, 2016). This sample again concerns the chance of a criminal reoffending, but instead of focusing on ethnicity, the analysis distinguishes between men and women. Steen estimates the chance of a defendant reoffending using a logistic regression:
\begin{align*}
\log\left(\frac{p}{1-p}\right) = {} & 0.55 - \text{degree} \cdot 0.31 - \text{age} \cdot 0.04 + \text{crimes} \cdot 0.15\\
& + \text{violence} \cdot 3.24 + \text{gender} \cdot 0.21
\end{align*}
Where
– $p$ = the chance of reoffending (the model-score).
– Degree = the degree of the crime.
– Age = the age of the defendant.
– Crimes = the number of previously committed crimes.
– Violence = 0 if no violence was involved in the crime, 1 otherwise.
– Gender = 0 for a woman, 1 for a man.
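As an illustration of how such a model could be estimated in practice, the sketch below fits a logistic regression of the same form in Python. The file name, the column names and the `reoffended` outcome label are assumptions made for this example; they are not taken from Steen’s article or the underlying data.

```python
# Minimal sketch: estimating a logistic regression like the one above.
# The file and column names ("degree", "age", "crimes", "violence",
# "gender", "reoffended") are hypothetical, chosen to mirror the formula.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("recidivism_sample.csv")      # hypothetical data file

X = sm.add_constant(df[["degree", "age", "crimes", "violence", "gender"]])
y = df["reoffended"]                           # 1 = reoffended, 0 = did not

model = sm.Logit(y, X).fit()
print(model.params)                            # coefficients as in the formula

# The fitted probabilities are the "model-scores" used below.
df["model_score"] = model.predict(X)
```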
To examine the discrimination in our sample, we can compare the model-scores of men and women. When we look at the two distributions, we see that they differ slightly.
To test whether there is a significant difference between the two distributions, we can use the Kolmogorov-Smirnov test. In Steen’s example the p-value equals $2.2 \times 10^{-16}$, which is practically zero, so we can conclude that the two distributions differ significantly. Another method to measure discrimination is the Area Under the Curve (AUC), a general measure of how well a model separates two groups. Normally, the AUC measures the chance that a random person who eventually turns out to be a recidivist has a higher model-score than a random person who does not reoffend. In our example, it is repurposed as the probability that a random man has a higher model-score than a random woman. Steen calculates an AUC of approximately 0.69, which means that in our sample a random man has a 69% chance of having a higher model-score than a random woman. Once all model-scores have been calculated, the algorithm needs a cutoff point: the value that separates a positive from a negative verdict. In our sample, all individuals who score above 0.5 are labeled as recidivists.
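As a hedged sketch of how these quantities could be computed, the snippet below reuses `df` and the `model_score` column from the previous example (again, the column names are assumptions, not Steen’s).

```python
# Sketch: Kolmogorov-Smirnov test, AUC, and the 0.5 cutoff, reusing `df`
# and the hypothetical column names from the previous snippet.
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score

scores_men = df.loc[df["gender"] == 1, "model_score"]
scores_women = df.loc[df["gender"] == 0, "model_score"]

# Kolmogorov-Smirnov test: do the two score distributions differ?
ks_stat, ks_pvalue = ks_2samp(scores_men, scores_women)

# AUC with gender as the "label": the probability that a random man
# receives a higher model-score than a random woman.
auc = roc_auc_score(df["gender"], df["model_score"])

# Cutoff point: everyone scoring above 0.5 is labeled a recidivist.
df["labeled_recidivist"] = (df["model_score"] > 0.5).astype(int)

print(f"KS p-value: {ks_pvalue:.2e}, AUC: {auc:.2f}")
```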
Another way to express discrimination as a number is the selection rate: the ratio between the percentage of men and the percentage of women who are labeled as recidivists. In our case this equals
$$\frac{505/(955+505)}{46/(298+46)} \approx 2.6.$$
Hence, a man’s probability of being labeled a recidivist is roughly 2.6 times as large as a woman’s. In America the 80%-rule is used, which states that a selection rate may lie between 0.8 and 1.25 before it counts as discrimination. The final method to test for discrimination is the chi-squared test, which tests whether two categorical variables are associated; here, gender and the recidivism label. According to Steen, the p-value of the chi-squared test was again approximately zero, so once more we find that women are favored over men. Even when the gender variable was excluded from the model, all four methods still found evidence of discrimination in favor of women.
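The selection rate and the chi-squared test can be reproduced directly from the counts quoted above; the short sketch below does so with SciPy (the variable names are chosen purely for illustration).

```python
# Sketch: selection rate and chi-squared test on the 2x2 table of
# gender versus recidivism label, using the counts quoted in the text.
from scipy.stats import chi2_contingency

# rows: men, women; columns: labeled recidivist, not labeled recidivist
table = [[505, 955],
         [46, 298]]

rate_men = 505 / (505 + 955)
rate_women = 46 / (46 + 298)
selection_rate = rate_men / rate_women     # roughly 2.6, far outside 0.8-1.25

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"selection rate: {selection_rate:.1f}, chi-squared p-value: {p_value:.2e}")
```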
Improving algorithms
There are some steps we can take to improve AI when it comes to fairness and equality. We can start by assessing the quality of the data we feed our algorithms. For the PredPol example, this could mean including not only police reports but also crime estimates from other organizations, to prevent the disproportionate policing of areas with racial minorities. The European Union Agency for Fundamental Rights promotes transparency: being open about the data and code used to build an algorithm as well as the logic underlying it, and providing meaningful explanations of how it is being used. Furthermore, it emphasizes the importance of involving experts in oversight: to be effective, reviews need to involve statisticians, lawyers, social scientists, computer scientists, mathematicians and experts in the subject at issue.
Artificial intelligence will only become more prominent in the coming decades. Andrew Ng, the brains behind Google Brain, even said: “Just as electricity transformed almost everything 100 years ago, today I actually have a hard time thinking of an industry that I don’t think AI will transform in the next several years.” It is therefore especially important that algorithms that affect the course of someone’s life are tested and evaluated repeatedly. At the same time, we cannot deny that in many cases discriminating artificial intelligence has its roots in our own prejudice. As in the cases of Amazon and St. George’s Hospital Medical School, biases from the past are simply reinforced by the algorithms of the present. Because in the end, algorithms are only as good as the data we feed them.
Source
Steen, J. (2019). Het meten van discriminatie in algoritmische besluitvorming [Measuring discrimination in algorithmic decision-making]. STAtOR, 20(4), 4–9.
This article is written by Fenna Beentjes