An Introduction to Ethical Data Science

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



I. The 'Big' Picture

Big data. In the last decade, these words have evolved from just an idea to a casually and commonly used phrase. Big data surrounds us in every Google search we make, every button we click on the internet, and every online interaction we have. It has become so deeply intertwined with daily human interaction, and yet most of the time we don't even realize it is there. Moreover, most people don't really understand what it is. So, what is big data?

Definitions for big data usually involve the "3V's": volume, velocity, and variety. In this sense, it is just a large, constantly updated collection of information about anything. Today, we are lucky that it can also be processed with relative technological ease. Now that most data is stored on a computer, gaining access to and drawing insights from large datasets is easier than ever before. With access to this data, scientists and statisticians can draw correlations and spot patterns at an exponentially accelerated rate using machine learning. While this has been an incredible optimization for the field of prediction, machine learning has grown so fast that a few important precautions are being swept under the rug. Machine learning researcher Michael Jordan sums it up nicely:
The issue is not just size—we’ve always had big data sets — the issue is granularity.
Granularity of big data is a topic that data scientists commonly avoid. But why is this? Isn't more scale and detail just more fuel for training? While that is true, the problem lies in the kinds of data being observed. Since the internet is by far the main source of collection, most datasets that fall under the concept of big data are about people: our actions, attributes, desires, and interactions. In some cases this is very good. For example, if statisticians had access to a highly granular dataset of medical information, they could use it to predict the likelihood of an individual developing a genetic disease, their children having birth defects, or the possible cause of undiagnosed symptoms. But unfortunately, not all predictors are looking to produce beneficial outcomes. Sometimes data can be the deciding factor in a decidedly bad outcome for an individual, which is why it is very important to look at where a 'big data set' comes from before drawing any conclusions.

Even in vastly large data sets, there tends to be proportionally less data available about minorities. Along with this, statistical patterns that are valid for the majority might prove invalid for a minority. This means that although machine learning algorithms strive for good performance on average, optimizing for those average metrics can actually be detrimental to individuals within a minority group. Pratik Gajane, a machine learning researcher, stated:
Training data may simply reflect the wide-spread biases that persist in society at large... data mining can discover surprisingly useful regularities that are really just pre-existing patterns of exclusion and inequality.  
While it's great that huge investments are being made to optimize machine learning algorithms and predictors to be as mathematically precise as possible, an important piece of the puzzle is being forgotten. How do we create mathematical calculations for ethical accuracy?

II. Introducing Non-Discriminatory Predictors

One of the most important aspects of a predictor is its measure of accuracy and performance. This is true not only for classifiers that determine critical decisions for human lives, but for any prediction model. Non-discriminatory predictors are a subset of fairness prediction models. It is important to note that fairness models do not seek to completely eliminate biased predictions; in some cases, a biased prediction will actually yield a more accurate result. So I must distinguish that for a biased result to be a fairness problem, one or more of the outcomes must be more beneficial or desirable than others.

Fairness plays a large role in the accuracy of a predictor, and while accuracy and performance can be easily measured mathematically, this notion of fairness is much harder to calculate. Even though fairness frameworks for human decision making have been formally delineated, fairness in data science is not officially formalized anywhere. There are, however, notions of fairness models that have been heavily researched within the last few years, so let's dive into their methodologies.

III. Methodology

(If you aren't up to date with data science jargon, or are uninterested in the technicality of these models and just came here for an analysis, feel free to skip this section!)

Since fairness in machine learning and computational social science are relatively new concepts, there hasn't been a large amount of research done in this area. I will begin by briefly explaining a few of the most promising fairness predictors today and their current drawbacks, keeping in mind that since this research is so new, any of this information could change at any time.

Before introducing these models, though, it is necessary to explain what a protected attribute is. In terms of legality, non-discrimination laws aim to stop unfair treatment of individuals who can be identified as members of certain demographic groups. These groups are distinguished by attributes that are protected by law. When an individual is treated unfairly based on one of these protected attributes (e.g., race), that is considered discrimination.

Treatment Parity (race-blind approach)
Treatment parity is more commonly known as the 'race-blind' approach because it omits all protected attributes from the prediction process entirely. While this seemingly eliminates discrimination, it can also lead to less accurate results. Also, similar to real-world race blindness, it is considered a naive approach to egalitarianism. Think about implementing a race-blind predictor for college admittance. While it may seem like an optimal model, it is also blind to counter-discrimination (hence the motivation for affirmative action).
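To make the idea concrete, here is a minimal sketch of a treatment-parity setup, assuming a hypothetical pandas DataFrame whose protected columns are named "race" and "gender" and using a plain logistic regression as a stand-in classifier:

```python
# A minimal sketch of treatment parity ("race-blind" prediction).
# Column names and the choice of classifier are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

PROTECTED_ATTRIBUTES = ["race", "gender"]  # hypothetical protected columns

def fit_race_blind(df: pd.DataFrame, label_column: str) -> LogisticRegression:
    """Fit a classifier that never sees the protected attributes."""
    features = df.drop(columns=PROTECTED_ATTRIBUTES + [label_column])
    model = LogisticRegression(max_iter=1000)
    model.fit(features, df[label_column])
    return model
```

Note that dropping the columns does not remove correlated proxies (a zip code, for instance, can still encode race), which is part of why this approach is considered naive.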

Individual Fairness
For this model, each individual in a population is assigned a probability distribution over the set of outcomes A. Fairness is ensured by checking that the distributions assigned to similar people are similar. For each member of a protected group with a negative outcome, the model tries to find individuals from a non-protected group with similar non-protected attributes; if the outcomes for the two individuals are significantly different, discrimination has occurred. This model delegates the responsibility of ensuring fairness to the distance metric instead of the predictor. Unfortunately, this means that if a distance metric uses protected attributes (directly or indirectly) to compute the distance between two individuals, then it is itself discriminatory and this model would not work for that dataset.
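As a rough sketch of how such a check might look, the function below flags pairs of individuals whose predicted probabilities differ by more than a Lipschitz constant times their distance. Both `predict_proba` and `distance` are assumed, hypothetical callables; in the actual framework the distance metric is task-specific and must be supplied:

```python
# A minimal sketch of an individual-fairness (Lipschitz-style) audit.
# `predict_proba(x)` is assumed to return the probability of the positive
# outcome for one individual; `distance(a, b)` is a task-specific metric
# over non-protected attributes. Both are hypothetical inputs.
import itertools

def lipschitz_violations(X, predict_proba, distance, L=1.0):
    """Return index pairs whose prediction gap exceeds L times their distance."""
    probs = [predict_proba(x) for x in X]
    violations = []
    for i, j in itertools.combinations(range(len(X)), 2):
        if abs(probs[i] - probs[j]) > L * distance(X[i], X[j]):
            violations.append((i, j))
    return violations
```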

Group Fairness
Also known as statistical or demographic parity, group fairness ensures that a protected group is treated statistically similarly to the general population. Specifically, it makes sure that a person belonging to one group is equally likely (up to a bias e) to receive a particular outcome as an individual belonging to another group. The most widely known implementation of this kind of model is affirmative action.
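A minimal sketch of that check, assuming binary predictions and a group label per individual, with e as the allowed bias:

```python
# A minimal sketch of a demographic-parity check: positive-prediction rates
# for any two groups should differ by at most a tolerance e.
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Largest gap in positive-prediction rate between any two groups."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return max(rates) - min(rates)

def satisfies_demographic_parity(y_pred, group, e=0.05):
    return demographic_parity_gap(y_pred, group) <= e
```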

Post Hoc Correction
This model is, in a sense, layered on top of other models: it specifies applying a non-discriminatory correction to an already working, possibly discriminatory predictor. The original predictor's output is then corrected by taking the protected attribute(s) into account and readjusting its decisions in order to minimize loss. This is a promising model, though it does have drawbacks when it comes to deciding how loss should be calculated. For some companies, loss could equate to the cost the company itself would have to pay rather than the loss of the desired benefit for a protected group.
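One simple way to picture a post hoc correction is group-specific decision thresholds applied on top of the original model's scores. The sketch below picks each group's threshold so that roughly the same fraction of each group is accepted; this quantile rule is a simplistic stand-in for whatever loss a real implementation would minimize, and all names are hypothetical:

```python
# A minimal sketch of a post hoc correction: keep the original scoring model,
# but choose a separate decision threshold per protected group. The target
# rate here is an illustrative stand-in for a proper loss-minimization step.
import numpy as np

def per_group_thresholds(scores, group, target_positive_rate=0.3):
    """Pick a threshold per group so roughly `target_positive_rate` is accepted."""
    scores, group = np.asarray(scores), np.asarray(group)
    return {g: np.quantile(scores[group == g], 1.0 - target_positive_rate)
            for g in np.unique(group)}

def corrected_predictions(scores, group, thresholds):
    """Apply the group-specific thresholds on top of the original scores."""
    return np.array([s >= thresholds[g] for s, g in zip(scores, group)])
```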

Equality of Capability of Functioning
Equality of Capability of Functioning (ECOF) addresses inequality due to social endowments (like gender) as well as natural endowments (like sex). It holds that an unequal distribution of social benefits is only fair when it results from the intentional decisions and actions of the individuals concerned. In order to implement this, the data scientist needs to be able to determine which attributes an individual has no say in, which makes this model particularly difficult because it has a very high informational requirement. ECOF also factors in endowment sensitivity, which allows unchosen circumstances to be offset or at least compensated for. While this is a very worthwhile model for domains where natural and/or social endowments have historically impeded a group from receiving social benefits, it likely won't be used widely until the high informational requirement barrier can be overcome.

Equalized Odds
This model states that a predictor Ŷ satisfies fairness with respect to a protected attribute A and outcome Y if Ŷ and A are independent conditional on Y. This means that Ŷ has equal true positive rates and equal false positive rates across all protected groups. While this model is more promising than the previously mentioned models (because it has no major barriers to entry), it does seem to unnecessarily enforce fairness even among individuals who shouldn't receive a beneficial outcome from the predictor.
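A minimal sketch of an equalized-odds audit, assuming binary labels, binary predictions, and a group label per individual:

```python
# A minimal sketch of an equalized-odds audit: the true positive rate and
# false positive rate are computed per protected group and should match
# (up to a tolerance) across groups.
import numpy as np

def equalized_odds_rates(y_true, y_pred, group):
    """Return {group: (true positive rate, false positive rate)}."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    rates = {}
    for g in np.unique(group):
        yt, yp = y_true[group == g], y_pred[group == g]
        tpr = yp[yt == 1].mean() if (yt == 1).any() else float("nan")
        fpr = yp[yt == 0].mean() if (yt == 0).any() else float("nan")
        rates[g] = (tpr, fpr)
    return rates
```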

Equality of Opportunity
Equality of Opportunity is very similar to equalized odds, except that it resolves the drawback of unnecessarily enforcing fairness for all groups. This model requires non-discrimination only within the advantaged outcome group. In other words, out of the people who actually do deserve a predicted benefit, there should be an equal opportunity to receive that benefit no matter what group you are a part of.
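The corresponding check is the same as the equalized-odds sketch above but restricted to the qualified individuals (y_true == 1), comparing only true positive rates:

```python
# A minimal sketch of an equality-of-opportunity check: among individuals with
# the advantaged true outcome, the chance of a positive prediction should be
# roughly equal across groups. Assumes every group has at least one qualified member.
import numpy as np

def opportunity_gap(y_true, y_pred, group):
    """Largest gap in true positive rate between any two groups."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    qualified = y_true == 1
    tprs = [y_pred[qualified & (group == g)].mean() for g in np.unique(group)]
    return max(tprs) - min(tprs)
```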

IV. What's Next?

If you got lost in the technicality of these models, that is okay.

I didn't know about any of this two months ago. Very few people have dipped their toes into this field, which is incredible to me (and not in a good way), because it is so vastly important for the future of AI and, ultimately, for the creation of moral machines.

It's at this point that I should mention that this post is an introduction to the research I've been working on for the past few months. While I'm still very new to this field, discussing the importance of non-discriminatory predictors has helped me find a huge passion for ethical data science and computational social science. This was a brief introduction, but I am currently working on a publication that will (hopefully) be released by the end of this year, as well as another publication that involves a different sector of ethical data science: data privacy.

I'll be posting about these when they are finalized, but in the meantime I'd love to hear your thoughts, worries, or general excitement about the emerging field of computational social science. Please feel free to reach out to me with any questions or comments you may have regarding my work.


As always, thank you for reading and until next time.








Acknowledgements and Citations:
Big Data, Machine Learning, and the Social Sciences: Fairness, Accountability, and Transparency
On Formalizing Fairness Through Machine Learning
Fairness Through Awareness
Learning Non-Discriminatory Predictors
Inequality Reexamined
Equality of Opportunity in Supervised Learning


