It's easy to be impressed these days by how well neural nets can categorize and classify images. Given millions of examples of a variety of things, a neural net can be trained to distinguish between a raccoon and a squirrel, a baseball and a basketball, or a hot dog and a hamburger. But as these systems are used more frequently to automatically classify things like faces in a crowd or cars on the street, it has become apparent that odd things sometimes happen when very small, almost unnoticeable changes are made to the original input image. This clearly illustrates that our fancy neural net image recognition systems just don't see things quite the same way we do. This is just one facet of the bigger "black box" problem with current neural network based systems: we don't know how a system decides; it just does. Not only do we not know how the system arrives at its answer, but we have little idea how close the system is to making an error. When some of us actively try to get the system to make mistakes, it turns out to be pretty easy (maybe a bit harder than hacking a voting machine). If we want to take this a step further and want our "trained" system to make a particular mistake (e.g. mistake a raccoon for a rhododendron), that's harder — but not much harder — to do. Welcome to the world of adversarial Machine Learning.

One reason for the fragility of the solution space is that there are so many input values for an image. At the point of initial input, each pixel is a unique input value and there can be thousands or even millions of pixels in an image. In a very real sense, each of those input values can be thought of as an independent dimension.

Odd, unintuitive things can happen at very high dimensions (sometimes referred to as the Curse of Dimensionality). One way to think about this is that the vertices of the unit cube become quite distant from each other as the dimension increases. In a one-dimensional "cube" (aka a line segment), the distance between the two vertices is one. In a two-dimensional cube (aka a square), the largest diagonal distance is the square root of two, or about 1.414. This is just the simple Pythagorean theorem: A squared plus B squared equals C squared. Likewise, as we go to three dimensions, the largest diagonal distance of our unit cube is the square root of three, or about 1.732. For four dimensions, the distance is 2.00. You can see the pattern: the maximum diagonal distance is the square root of the dimension of the cube. So in the case of a modestly sized image of 100 x 100 pixels, we have a dimensionality of 10,000, and the square root of 10,000 is 100, so the maximum diagonal distance of the unit cube is 100! Points in high-dimensional space get really spread out. To make any sense of how data points cluster, you must provide increasingly huge numbers of examples (which is partly why so many input examples are required for machine learning from images).
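To make the pattern concrete, here's a quick sketch in Python (the function name is mine, just for illustration) that computes the longest diagonal of a unit hypercube:

```python
import math

# The longest diagonal of a unit hypercube in d dimensions runs from the
# origin to the opposite corner (1, 1, ..., 1); by the Pythagorean theorem
# extended to d dimensions, its length is sqrt(d).
def max_diagonal(d):
    return math.sqrt(sum(1.0 ** 2 for _ in range(d)))  # = sqrt(d)

for d in (1, 2, 3, 4, 10_000):
    print(d, max_diagonal(d))
```

For a 100 x 100 pixel image treated as 10,000 independent dimensions, the corners of the unit cube are a full 100 units apart.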

The Machine Learning we apply to these types of problems today does not fully model how we as humans think about images. A (very) simplistic way of thinking about today's AI image processing is that the algorithm places images in a large dimensional space and measures the distance to known training examples.
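As a toy illustration of that simplistic view, here is a sketch of a nearest-neighbor "classifier" that does nothing but measure distances to a couple of made-up training examples (the data and labels are invented for illustration; real systems learn far richer representations):

```python
import math

# Toy sketch of the "distance to known examples" view: a 1-nearest-neighbor
# classifier over flattened pixel vectors. The "images" are tiny made-up
# 2x2 grayscale examples.
def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

training = [
    ([0.0, 0.0, 1.0, 1.0], "half-dark"),
    ([1.0, 1.0, 1.0, 1.0], "bright"),
]

def classify(pixels):
    # Return the label of the closest training example.
    return min(training, key=lambda ex: distance(pixels, ex[0]))[1]

print(classify([0.9, 1.0, 0.8, 1.0]))  # closest to the "bright" example
```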

Before some of the more experienced readers attack me for this oversimplification: it's not really as horrible as it seems at first glance. This is where the application of deep neural nets helps with the problem of dimensionality and perhaps leads to results closer to what a human vision system does. One of the functional results of multilayered neural nets is that the dimensionality is reduced by learning more and more complex features at each layer. The very nature of a complex feature is that a single feature embodies a collection of inputs from the previous layer: many dimensions are reduced to one dimension (e.g. many pixels can represent one square). And as the network gets deeper, multiple complex features can be combined into a single, yet more complex feature (e.g. a bunch of squares can represent a tiling, like a whole chessboard). This too is a bit of an oversimplification of what really happens in the layers, but you get the idea. Biological research has shown that the visual systems of vertebrates have evolved layers of neurons that turn the individual retinal cell outputs (pixels) into more generalized features (e.g. short vertical or horizontal lines), and deeper layers combine those into even more generalized features (e.g. circles, blobs, squares, etc.). So our deep neural net approach seems to be headed in the right direction.
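To illustrate the many-inputs-to-one-feature idea, here's a minimal sketch using 2x2 average pooling. (Pooling is just one simple mechanism for collapsing dimensions; real networks learn their filters, but the dimensional bookkeeping is the same.)

```python
# Each layer of 2x2 average pooling halves each spatial dimension, so a
# 4x4 patch of pixels ends up summarized by a single number after two layers.
def pool2x2(grid):
    h, w = len(grid), len(grid[0])
    return [[(grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4.0
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]

image = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]

layer1 = pool2x2(image)    # 2x2: each value summarizes 4 pixels
layer2 = pool2x2(layer1)   # 1x1: one value summarizes all 16 pixels
print(layer1, layer2)
```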

But back to the adversarial issue: the way our neural nets learn today does not equip these systems to extrapolate successfully at the edges of their training sample space. Projecting outward from the boundaries of the known examples means predicting solution points that are even farther away from the already sparse example points. These networks don't reason about images the way humans do. Otherwise, we couldn't trick our AI into doing this:

I'm sure that, just like me, you see the two images as somewhat blurry stop signs. Note: this is typical of the resolution at which a self-driving car gets input from its camera. But these images are slightly different. If you look very closely you can see incredibly subtle differences between them. Yet neither you nor I, nor even a three-year-old who had been shown a stop sign only once or twice before being presented with this image, would deny that both of these are images of stop signs. The truth is that the image on the left is an original, unaltered image of a stop sign. It is very typical of the images of stop signs that were used to train a neural net to recognize traffic signs. And in fact, the neural net does recognize the image on the left as a stop sign and distinguishes it from all of the other traffic signs it is capable of recognizing. The image on the right, however, is recognized by our perfectly good, well-trained neural net as a yield sign. The network thinks it's this:

Much more detail about how this was done can be found in this ACM article.

Obviously, the system has not learned to see the way people see. The adversarial image of the stop sign (on the right) was created by a program that makes small changes to the input image at the pixel level and tests what the network under attack thinks the image represents. This procedure runs in a loop in which changes that make the output probability for a yield sign ever so slightly higher are kept. Eventually, enough of the small random changes to individual pixel RGB values accumulate to push the output of the neural net to the point where it decides a yield sign is the most likely result. Even with our high-level human reasoning, it's not possible to compare and contrast the true and adversarial images and decide exactly why the latter should be confused with a yield sign.
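The loop described above can be sketched in a few lines. This is a toy version: the `score` function below is a made-up stand-in for the network's output probability for the target class, and a real attack would query the actual network under attack rather than this stand-in:

```python
import random

random.seed(0)

def score(image):
    # Stand-in for P("yield sign" | image). Invented for illustration:
    # it simply prefers brighter pixels, so the greedy loop has a gradient
    # to climb. A real attack would call the model being attacked here.
    return sum(image) / len(image)

image = [0.2] * 16               # the "original" image: 16 gray pixels
best = score(image)
for _ in range(1000):
    i = random.randrange(len(image))
    delta = random.choice((-0.01, 0.01))          # tiny per-pixel nudge
    candidate = image[:]
    candidate[i] = min(1.0, max(0.0, candidate[i] + delta))
    if score(candidate) > best:                   # keep only changes that help
        image, best = candidate, score(candidate)

print(round(best, 3))  # the target-class score has crept upward from 0.2
```

Each individual change is imperceptible; only their accumulation flips the classification.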

It is possible to modify images so subtly that humans can barely detect the difference between the original and the modified image yet the neural net will confidently make a completely absurd classification of the object. Welcome to the world of adversarial Machine Learning. Machine Learning with the goal of getting it wrong.

By the way, the first picture in the article was indeed a bookcase, according to a well-trained image processing neural net. Just a tiny bit of noise added to a perfectly normal picture was all it took: