Breaking things is easy

Dec 16, 2016

by Nicolas Papernot and Ian Goodfellow

Until a few years ago, machine learning algorithms simply did
not work very well on many meaningful tasks like recognizing objects or translation. Thus, when
a machine learning algorithm failed
to do the right thing, this was the rule, rather than the
exception. Today, machine learning algorithms have advanced to the
next stage of development: when presented with
naturally occurring inputs, they can outperform humans.
Machine learning has not yet reached true human-level performance,
because when
confronted by even a trivial adversary,
most machine learning algorithms fail dramatically.
In other words, we have reached the point where machine learning works,
but may easily be broken.

This blog post serves to introduce our new Clever Hans blog,
in which we will discuss all
of the many ways an attacker can break a machine learning algorithm.
Academically speaking, our topic is the
security and privacy of machine learning [PMS16].
The blog is jointly written by Ian
Goodfellow and Nicolas Papernot. Ian
is a research scientist at OpenAI, and Nicolas
a Google PhD Fellow in
Security at Penn State. We jointly created
cleverhans, an open-source library for
benchmarking the vulnerability of machine learning models to adversarial
examples.
The blog gives us a way to informally share ideas about machine learning
security and privacy that are not yet concrete enough for traditional
academic publishing, and to share news and updates relevant to the cleverhans
library.

A
secure system is one that can be depended upon and is
guaranteed to behave as expected [GSS03].
When we attempt to provide guarantees about how the system will behave,
we do so with a particular threat model in mind.
The threat model is a formally defined set of assumptions about the
capabilities and goals of any attacker who may wish the
system to misbehave.

So far, most machine learning has been developed with a very weak
threat model, in which there is no opponent.
The machine learning system is designed to behave correctly when
confronted by nature.
Today, we are beginning to design machine learning systems that
behave correctly even when confronted with a malicious person
or a malicious machine learning-based adversary.

For instance, a machine learning system may be targeted by
an adversary while the model is being trained (the learning phase) or
when the model is making predictions (the inference phase). Adversaries also have varying degrees of capabilities, which may include access to the model internals like its architecture
and parameters, or to the model inputs and outputs.

To break a machine learning model, an attacker can compromise its
confidentiality, integrity, or availability.
Together, these properties form the
CIA model
of security.

To provide confidentiality, a machine learning system must not leak
information to unauthorized users. In practice, confidentiality in machine
learning makes more sense when thought about as privacy: the model
must not leak sensitive data.
For example, suppose that researchers built a machine learning model
that could examine a patients’ medical records and offer disease diagnoses.
Publishing such a model could provide a valuable resource to doctors,
but it is important to make sure that a malicious person cannot examine
the model and recover private medical data regarding the patients
who helped to train the model.

Adversaries capable
of tampering with the model’s integrity can alter its predictions
to differ from the intended ones.
For instance, spammers try to design their e-mail messages to be
incorrectly recognized as legitimate messages.

An attacker can also compromise a system’s availability.
For example, a self-driving car might be forced to go into a failsafe
mode and pull over if an adversary placed an extremely confusing object
next to the road ahead of a car.

Of course, all of this so far is hypothetical.
What kinds of attacks have security researchers actually demonstrated so far?
Later posts in this blog will give many more examples,
but let’s start with just three:
integrity attacks
at training time, integrity attack during inference, and privacy attacks.

Poisoning training sets

The adversary may interfere with the integrity of the training process by making
modifications to existing training data or inserting additional
data in the existing training set.
For example, suppose Professor Moriarty wishes to frame Sherlock Holmes
for a crime.
He may arrange for an unsuspected accomplice to give Sherlock Holmes a
pair of very unique and ornate boots.
After Sherlock has worn these boots in the presence of the policemen he
routinely assists, the policemen will learn to associate the unique boots
with him.
Professor Moriarty may then commit a crime while wearing a second copy
of the same pair of boots, leaving behind tracks that will cause
Holmes to fall under suspicion.

In machine learning, the strategy followed by the adversary is to
perturb training points in a way that increases the prediction error
of the machine learning
when it is used in production.
For instance, this method was used to poison the training
set of support vector machines: the convexity of the
loss function used to measure their prediction error
allows the adversary to find exactly the set of points that
when perturbed will most harm the model’s performance at inference [BNL12].
The problem of finding effective poisoning points remains open for more complex models, which
don’t yield convex optimization problems, like deep neural networks.

Forcing models to make mistakes with adversarial examples

In fact, breaking things is so easy that adversaries need not
poison the training data analyzed to learn the parameters of
a machine learning model. Instead,
adversaries can force models to make mistakes instantly by only perturbing the
inputs on which the model makes predictions (after completion of training—during
the inference phase).

A common way to find perturbations forcing
models to make wrong predictions is to compute adversarial examples [SZS13].
They yield perturbations that are very slight and often indistinguishable
to humans, yet force machine learning models to produce wrong predictions.
For instance, in the illustration below reproduced from [GSS14], the image on the left is correctly
classified by a machine learning model as a panda, but adding the
noise represented in the middle to that same image results in the image
on the right, which is classified as a gibbon by the model.

Note that despite being indistinguishable to the human eye, the
perturbation is sufficient to change the model’s prediction.
Indeed, the perturbation is computed to minimize a specific norm
in the input domain while increasing the model’s prediction error.
This effectively pushes the input, which was originally
correctly classified, across the model’s decision boundary into
a wrong class. This phenomenon is illustrated in the 2D Figure below,
for a problem with two output classes.

Many attacks based on adversarial examples require that the adversary
have knowledge of the machine learning model parameters
to solve the optimization problem for the input perturbation.
In constrast, some follow-up work considers the more realistic threat model
of an adversary only capable of interacting with the model by observing
its predictions on chosen inputs. For instance, this could be used by
an attacker to figure out how to design webpages that are well ranked
by a machine learning model designed a la PageRank, or how to craft
spam that evades detection. In these black-box settings, the machine learning model is said to
act as an oracle. A strategy is to first query
the oracle in order to extract
an approximation of its decision boundaries—the substitute model—and
then use that extracted model to craft
adversarial examples that are misclassified by the oracle [PMG16]. This is one
of the attacks that exploit the transferability
of adversarial examples:
they are often
misclassified simultaneously across different models solving the same
machine learning task, despite the fact that these models differ in their
architecture or training data [SZS13].

Privacy issues in machine learning

Privacy issues in machine learning need not necessarily involve adversaries.
For instance, an emerging field is addressing concerns related to the lack of
fairness and transparency when learning algorithms process the training data.
In fact, it has been pointed out recently that social biases encoded in the training
dataset translate to biased model predictions upon completion of learning [BS16].
In the following, we instead focus on scenarios involving adversaries.

The goal of adversaries is often to recover part of the training data used to
learn the model, or infer some sensitive properties of users by
observing the model’s prediction. For instance, smartphone keyboards
now learn better predictive completions from the patterns entered by
their users. However, typing a certain sequence of letters specific to a user
should not lead to its appearance on other phones—unless a sufficiently
large pool of users typed that sequence. As such, privacy attacks are
primarily meaningful at the inference stage, but mitigating them
typically requires some randomization in the learning algorithm [CMS11].

For instance, adversaries may seek to perform membership inference queries:
know whether a particular training point was used to train the model.
A recent paper looks closely at this question in the context of deep
neural networks [SSS16]. In a somewhat opposite manner to the gradients
used to craft adversarial examples (that change the model’s confidence in
the right answer), the membership attack searches for points classified
with very high confidence in directions identified using gradients.
It is also possible to infer more general statistical information about the training
data used to learn a deployed model [AMS15].

Conclusion

It is now December 2016. At present, we know many different ways of attacking machine learning models, and very few ways of defending. We hope that by December 2017, we will have more effective defenses.
The goal of this blog is to push forward the state of the art in machine learning security and privacy, by documenting advances as they happen, by provoking discussion within the community of researchers involved in these topics, and by inspiring a new generation of researchers to join the community.

Acknowledgements

Thanks to Catherine Olsson, for pointing out that we had
“the rule, rather than the
exception” backwards in the original post. We have since fixed the mistake.