Building Trust in AI Part I: Trusting Data

7 minute read

Training data and training features are central to any machine learning project. Machines are learning, but what are they being taught, and who is teaching them? This matters because machines don't bring prior experience or contextual beliefs to the classroom, so to speak. They learn only from what they are shown in the classroom, from the textbook of data, and like our highly open-minded and idealistic youth, machine learning algorithms aren't much in the way of critical thinkers. In machine learning, the "training data" (the textbook) is a collection of examples you've examined and labeled with the answer you want; you show it to your software to teach it how to make decisions. Often this historical data used to train machine learning algorithms is riddled with all sorts of one-sided personal, cultural, geographical and biological biases, endorsed with varying degrees of naiveté by those who prepare the data. The teacher decides what questions to ask and tells the machine learners what matters: the teacher is responsible for "feature selection."
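To make the "teacher" role concrete, here is a minimal sketch, assuming scikit-learn and made-up column names, of how a human decides both the labels (the answers written in the textbook) and the features the algorithm is allowed to learn from:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Hypothetical, hand-labeled historical data: every value here was
    # produced and labeled by people, with their biases baked in.
    data = pd.DataFrame({
        "years_experience":     [1, 5, 3, 10, 2, 7],
        "referred_by_employee": [0, 1, 1, 1, 0, 1],
        "hired":                [0, 1, 0, 1, 0, 1],  # the "answer" in the textbook
    })

    # Feature selection: the teacher decides what "matters".
    # Leaving a column out (or in) is itself an editorial choice.
    features = ["years_experience", "referred_by_employee"]

    model = LogisticRegression()
    model.fit(data[features], data["hired"])

    # The model faithfully reproduces whatever patterns, fair or unfair,
    # the labeled examples contain.
    print(model.predict(data[features].head(1)))

Nothing in that code questions the labels. Whatever the teacher wrote down as the answer is, to the algorithm, simply the truth.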

The solution? Rigorous scientific methodology and diversity of perspectives poured into the data collection process.

This is where machine learning engineers and data scientists expose themselves as mathematicians and businesspeople rather than scientists like those found in medicine, biology and psychology. Why? Because in my years of working with a wide range of researchers and specialists in business, engineering and medicine, it's only the latter group, the natural scientists, who don't give me clueless, vacant stares when I utter phrases like "selection bias," "randomized population sampling," "representative data" and "pilot study" in a study group meeting. High school concepts that have somehow eluded some of the sharpest and most gifted minds. This kind of cognitive disconnect won't be solved with a "bias detecting app" (I bet they are making one as I write this post, so predictable). Instead, as cliché as it sounds, these bias issues will really only be solved through exposure, experience, inclusivity and diversity.

Even though it's tempting, there's no need to reinvent the wheel: nature has already solved this problem for us, we just need to apply the solution!

Having a "bias detecting app" is kind of like thinking that the key to a great novel, one that is universally appealing and culturally relevant to all nations and peoples, is a spellchecker... it's laughable, as it's simply no substitute for the human element, but they're gonna make it anyway. What does this say about the current line of problem solving among the sharpest minds today? Pretty narrow if you ask me, kind of like the narrow AI that keeps being made. It's no wonder another AI winter is coming. Need I say more?

Algorithms are becoming more and more involved in major decisions across many industries. They are already helping decide who gets a loan, who is hired or fired, who can travel freely, and even who is arrested and how long they go to jail. If these algorithms are flawed or biased, they can actually amplify injustice and inequality. AI and machine learning systems learn by finding patterns in data, and because the training process is automated it's natural to think there is no human bias. However, just because a process has been automated based on data doesn't automatically make that process neutral. In fact, a "neutral" learning algorithm can yield a model that strongly deviates from the actual population statistics, or from a morally justifiable model, simply because the input or training data is biased in some way.
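To illustrate that last point, here is a toy sketch with made-up numbers: a perfectly neutral calculation (a simple mean) applied to a biased sample still lands far from the true population statistic:

    import numpy as np

    rng = np.random.default_rng(42)

    # Hypothetical population: two groups with different outcomes (say, salaries).
    group_a = rng.normal(loc=50_000, scale=5_000, size=9_000)
    group_b = rng.normal(loc=70_000, scale=5_000, size=1_000)
    population = np.concatenate([group_a, group_b])

    # Perfectly "neutral" estimator, biased sample: suppose we only ever
    # collected data from group_b (the group easiest to survey).
    biased_sample = group_b[:500]

    print(f"True population mean: {population.mean():,.0f}")
    print(f"Biased sample mean:   {biased_sample.mean():,.0f}")
    # The arithmetic is neutral; the input was not.

The estimator did nothing wrong. It was handed a sample that never had a chance of representing the population.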

AI is only as sharp and as useful as the data it learns from.

Compounding this effect further is the convenience and reusability of code libraries. Many programmers (including myself) use popular coding libraries to do common tasks without a second thought. If the code libraries being reused contain bias, that bias will propagate like a virus. The virulent proliferation of such encoded bias will be the demise and reputational suicide of many businesses, large companies and governments once narrow AI moves into full swing. The same entities that endorse the promise of AI technologies will be the ones devoured by it. Right now Big Tech and internet giants like Facebook and Google go about their business with a too-big-to-fail mentality, but we all know where that line of thinking led in 2008. Once again business and government have created the right socioeconomic and geopolitical cocktail for such a crisis. This time the crisis will be a digital one, fueled by a blend of human bias, misunderstanding and emotion interacting with algorithms designed to exploit and gamify these very biological mechanisms for financial profit.

Even with good intentions it's a cognitively difficult task to separate ourselves from our own human biases. This is because bias is one of our brain's primary organizing principles, one of its primary decision-making systems. It helps us build schemas, which compound into heuristics that we then use to interpret, navigate and interact with our environment. Without bias we would not be able to prioritize, as it tells us what to ignore and what to pay attention to. However, bias is a double-edged sword: unwanted biases can persist in the information environment, leading to beliefs which develop into stereotypes, which then lead to prejudice when acted upon, causing inequality, unrest and societal tension. Our biases reflect our values and culture. For better or for worse they are a part of us; they inform our decision making, our policies and even the technology we create.

Bias can manifest in many different ways. For example, a dataset spanning ten years of candidates who interviewed and joined a company could be used to train a machine learning system to automatically select candidates for future postings. If the original data contains implicit bias, such as hiring disproportionately more of one demographic than another (say, more white men than women or minorities), or paying higher salaries to one demographic than another, such biases (albeit unintentional) will be carried forward into the machine learning models of the future. In this way the kinds of candidates hired in the past become reinforced; a sketch after the list below makes this concrete. This is bad on at least two levels:

  • First, it will disenfranchise and marginalize large populations of society, which history has shown is never a good idea, bringing about social unrest, distrust, anarchy, revolution and election outcomes that simply dumbfound the intelligentsia, who to be honest seem to have spent most of their time thinking about society and economics as if it were a game of high-order chmess, a variant of chess that nobody plays. (Okay, rant over.)

  • Second is missing data. AI and machine learning algorithms can't infer new trends or provide updated recommendations if they don't have the data to do so. If past employee selections become reinforced, machine learning algorithms overfit to the data and your business misses out. What makes this especially nasty is that your organization will be none the wiser about the bias. This is what one calls a data trap.
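Here is the sketch promised above: a minimal, hypothetical illustration (synthetic data, scikit-learn) of how a model trained on skewed hiring history reproduces that skew in its future selections:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 2_000

    # Hypothetical ten-year hiring history. Suppose past recruiters
    # favored demographic group 0, independent of skill.
    demographic = rng.integers(0, 2, size=n)  # 0 or 1
    skill = rng.normal(size=n)
    favoritism = (demographic == 0) * 1.0
    hired = (skill + favoritism + rng.normal(scale=0.5, size=n) > 1.0).astype(int)

    df = pd.DataFrame({"demographic": demographic, "skill": skill, "hired": hired})

    # Train on the biased history, then "select" candidates with the model.
    model = LogisticRegression().fit(df[["demographic", "skill"]], df["hired"])
    df["selected"] = model.predict(df[["demographic", "skill"]])

    # The model reproduces the historical skew: group 0 is selected far
    # more often. Future data collected from these selections only
    # reinforces the pattern -- the data trap.
    print(df.groupby("demographic")["selected"].mean())

The model never decided to discriminate; it simply learned that favoritism was a reliable pattern in the answers it was given.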

It simply boils down to this: AI and machine learning algorithms can't recognize bias, and as such will fail to adapt to changes in the socioeconomic climate, which may affect the growth and reputation of your business, nation or government.

When collecting data we need to make sure the dataset we pull is randomized and accurately reflects the entire population. It is also important to allocate time to screen for selection bias; a sketch after the list below shows one simple screening check. Here are just some examples of selection bias that engineers, researchers and corporations should be aware of:

  • Voluntary bias - When the subjects who volunteer to participate in a research project differ in some way from the general population, which skews the data and thus the results.

  • Undercoverage bias - When some members of the population are inadequately represented in the sample, so the results do not reflect the full population.

  • Non-response bias - When certain groups of people do not participate, so their data is not represented in the results.

  • Convenience bias - When you select only members of a population who are conveniently and readily available, which skews the data and thus the results.

  • Focus bias - The deliberate use, or deliberate omission, of certain types of legally required or protected information, which can lead to algorithmic decisions that are biased relative to a moral standard.
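As promised above, here is a minimal sketch of one screening check, using hypothetical group labels and census figures, that compares a sample's makeup against known population proportions to flag possible undercoverage:

    import pandas as pd

    # Known population proportions (hypothetical census figures).
    population_share = {"group_a": 0.48, "group_b": 0.40, "group_c": 0.12}

    # The dataset we actually managed to collect.
    sample = pd.Series(["group_a"] * 700 + ["group_b"] * 280 + ["group_c"] * 20)
    sample_share = sample.value_counts(normalize=True)

    for group, expected in population_share.items():
        observed = sample_share.get(group, 0.0)
        flag = "  <-- possible undercoverage" if observed < 0.5 * expected else ""
        print(f"{group}: expected {expected:.0%}, observed {observed:.0%}{flag}")

A formal test (a chi-square goodness-of-fit test, for example) would be the rigorous next step, but even a crude comparison like this catches glaring gaps before they ever reach a model.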

Screening for the selection bias present in our datasets, then documenting our findings and the methods we used to address it, goes a long way in the eyes of the public. It helps establish trust between you and the public, which in turn works toward building trust in the promise of AI.

Hope this helps