Explainable AI: Cracking Open Deep Learning’s Black Box

How does artificial intelligence actually work? What are the pitfalls of this technology that we need to know about? Explore the complex nature of this "black box" technology as we break down what artificial intelligence really means.

By Kathryn Nave, Contributor

In 1997, a team of researchers from the University of Pittsburgh reported the results of a large-scale investigation into the possible use of different machine learning technologies in healthcare. The good news: When tested on 4,352 pneumonia patients from across the United States, a neural network could predict a patient’s mortality risk with 98.5 percent accuracy.

In seemingly better news for asthma sufferers, the highly successful model also predicted that the pre-existing respiratory condition would reduce the likelihood of pneumonia becoming fatal. As such, the model recommended, asthmatic pneumonia patients should be treated at home to reduce the burden on the healthcare system.

“Medically, of course, this doesn’t make any sense,” explained University of Washington artificial intelligence (AI) researcher Marco Tulio Ribeiro, who cites the project as his favorite illustration of the potential dangers of deferring to AI decision making. “Asthma obviously increases the risk of pneumonia. But when they dove deeper into the data, they realized it was because historically, patients who had asthma were treated more aggressively, and as a result of this, they died at lower rates. If they’d relied on this system to triage, those people would have just been sent home to die.”

Fortunately, in this case, the error was discovered.

Yet, the risk of acting on critical recommendations made by the neural network led researchers to recommend alternative—and less accurate—rule-based machine learning models that rely on explicit instructions informed by human expertise.

The Black Box Problem

Neural networks work by passing input data (such as an image or a medical record) through a series of simple processing layers that, over the course of training, learn to extract the features of the data most relevant to a given classification.

So, if the task of a network is to identify animals in pictures, then during training it might learn that whenever a picture is labeled as ‘cat’ it typically includes ‘pointed ears’, ‘whiskers’, ‘paws’, and so forth. Subsequently, the detection of each of these features in the input image will increase the network’s confidence that this image should be classified as an image of a cat.

Crucially, unlike rule-based methods, during training, a neural network is only given a particular input and the correct label for that input. It is not explicitly told to look for specific features like ‘whiskers’ or ‘paws’. This means that even the person who created and trained the network may not know what is being detected at the intermediate stages of the process—or why the model reaches the conclusion that it does.
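To make that concrete, here is a minimal sketch of what “only inputs and labels” looks like in practice, using scikit-learn’s MLPClassifier on invented data. The feature names and the rule that generates the labels are hypothetical; the point is that the network is never told which columns to attend to.

```python
# Minimal sketch: a neural network learns from (input, label) pairs only.
# The synthetic data and feature names below are invented for illustration;
# nothing tells the network which columns matter.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Toy data: 200 "images" summarized as 4 numeric features.
# Hypothetical column meanings: pointed_ears, whiskers, paws, background_color
X = rng.random((200, 4))

# Ground-truth rule the network must discover on its own:
# the label is 'cat' (1) when the first three features are strong.
y = (X[:, :3].mean(axis=1) > 0.5).astype(int)

# The network is given only X and y -- no hint about which features to use.
model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
model.fit(X, y)

# The learned intermediate weights are just arrays of numbers; by themselves
# they do not say "this hidden unit detects whiskers".
print(model.coefs_[0].shape)       # weights from the inputs to the hidden layer
print(model.predict_proba(X[:1]))  # confidence for one example
```

Recovering a human-readable reason for any single prediction from those weights is exactly the problem that explainability research tries to solve.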

This can be particularly problematic if a significant proportion of cat-labeled images in the training set coincidentally share an irrelevant feature, as Google Brain research scientist Ian Goodfellow discovered when the inclusion of a large number of LOLcat memes in his training set led his neural network to encode ‘white text’ as a key feature of cat images.

Quirks of the training dataset are not the only way for undesirable decision-making logic to creep into a neural network. In other cases, the model may latch onto a trend that is a consistently successful predictor of a particular classification across a wide range of domains, but only because of human biases that its creator would likely not intend to reproduce. For example, when trained to spot correlations between words in a large corpus of written text, a neural network often learns that ‘programmer’ is reliably associated with ‘male.’
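As a rough illustration of how such an association can be surfaced, the sketch below compares toy word vectors with cosine similarity. The two-dimensional vectors are invented for the example; in practice they would be embeddings learned from a large text corpus.

```python
# Minimal sketch of probing learned word associations for bias.
# The 2-D vectors are invented toy values, not real embeddings.
import numpy as np

vectors = {
    "programmer": np.array([0.9, 0.2]),
    "male":       np.array([0.8, 0.1]),
    "female":     np.array([0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: values near 1.0 mean the words appear in similar contexts."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# If the corpus mostly mentions programmers alongside male terms, the learned
# vectors end up closer to 'male' than to 'female' -- the bias described above.
print(cosine(vectors["programmer"], vectors["male"]))    # high
print(cosine(vectors["programmer"], vectors["female"]))  # lower
```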

Unless the influence of such biases can be explicitly identified and controlled, ethical concerns will limit the possible uses of such neural network models. Teasing out these influences becomes especially difficult with the increasingly common method of deep learning, which uses networks with a particularly large number of processing layers that correspond to abstract patterns in the data set.

Today, neural networks are a widespread (if often unnoticed) part of everyday life. They power everything from Netflix’s movie recommendations to Google Translate. Here, at least, the cost of an occasional errant prediction is an acceptable loss compared to the gains in accuracy and efficiency. In high-risk areas such as the military, medicine, finance, or criminal justice, however, this lack of interpretability is the main barrier to relying on the accuracy and efficiency of deep neural networks.

“Accuracy during testing is not the only metric we should care about, and everybody who has put a model in production knows that it never behaves the same as it did during tests,” Ribeiro said. “Mistakes have different costs. If you’re using machine learning to optimize ads, the stakes of a mistake are just losing money, which is bad, but it’s not as bad as sending people to jail incorrectly.”

Generating Explanations

In August 2016, the Defense Advanced Research Projects Agency (DARPA) announced that it would invest $75 million over the next five years in research, including Ribeiro’s, on how to render these complex models interpretable to human observers.

Ribeiro’s solution is a technique called Local Interpretable Model-agnostic Explanations, or LIME for short. It works by making alterations to different features of a particular input and seeing which of these alterations makes the greatest difference to the output classification, thus highlighting the features most relevant to the network’s decision.

The key to LIME’s effectiveness is the ‘local’ element. That is, it doesn’t attempt to explain all of the decisions a network might make across all possible inputs, only the factors that determined its classification of one particular input. For a particular image, such as a cat seated on a chair, LIME might reveal that altering features like the paws or whiskers would cause the greatest reduction in the network’s confidence in the ‘cat’ classification, while altering the color of the chair would have no impact at all.
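The sketch below illustrates the perturb-and-refit idea behind LIME rather than the official lime library; the black-box model, feature names, and weighting scheme are simplified stand-ins. A throwaway ‘network’ scores cat confidence, features of one input are randomly switched off, and a weighted linear model fit to the results shows which features matter locally.

```python
# Minimal sketch of LIME's local-perturbation idea (not the official 'lime'
# library). The black-box model and feature names are hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import Ridge

feature_names = ["pointed_ears", "whiskers", "paws", "chair_color"]

def black_box_cat_confidence(x):
    """Stand-in for an opaque network: it only cares about the first 3 features."""
    return 1 / (1 + np.exp(-(4 * x[:, :3].mean(axis=1) - 2)))

instance = np.array([0.9, 0.8, 0.9, 0.2])   # the one input we want explained

# 1. Perturb the instance: randomly "switch off" (zero out) features.
rng = np.random.default_rng(0)
masks = rng.integers(0, 2, size=(500, 4))
perturbed = instance * masks

# 2. Ask the black box for its confidence on each perturbed sample.
preds = black_box_cat_confidence(perturbed)

# 3. Weight samples by how close they stay to the original instance.
weights = np.exp(-np.sum(1 - masks, axis=1))

# 4. Fit a simple, interpretable linear model locally around the instance.
local_model = Ridge(alpha=1.0)
local_model.fit(masks, preds, sample_weight=weights)

# Large coefficients = features whose removal most changes the prediction.
for name, coef in zip(feature_names, local_model.coef_):
    print(f"{name:>13}: {coef:+.3f}")
```

The open-source lime package implements a more careful version of this procedure, but the local logic is the same: features whose removal most reduces the model’s confidence receive the largest coefficients, while irrelevant ones, like the chair’s color, receive coefficients near zero.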

“It is a trade-off,” Ribeiro said. “Because if you’re trying to decide whether to actually use that model to make decisions in the future, you’d want to understand all the decisions it might make, not just why it made this particular decision in this case.”

On the whole, though, he thinks the lack of such a global explanation of a model’s potential behavior might actually be a good thing. It will force those using such complex models to check decisions individually, rather than deferring decision-making to the judgment of the neural network altogether.

“Explanations are a step in the right direction, but they do not solve the problem,” Ribeiro said. “Especially for things like medical judgments, or whether someone should be released on bail, I would not be comfortable with a machine learning model making those decisions. Explanations should be about enabling people to use the help of the model in making their own decisions—not to trust in it blindly.”