Chapter 9

Listen to the Sound of Absent Experts

Finding safe behaviors for AIs is a much more difficult problem than it may have initially seemed. But perhaps that’s just because you’re new to the problem. Sure, it sounds hard, but maybe after thinking about it for a while someone or some group will be able to come up with a good, precise description that captures exactly what we want the AI to do and not do. After all, experts have expertise. Computer scientists and programmers have been at this task for decades, and philosophers for millennia—surely they’ll have solved the problem by now?

The reality is that they’re nowhere near. Philosophers have been at it the longest, and there has been some philosophical progress. But their most important current contribution to solving the AI motivation problem is . . . an understanding of how complicated the problem is. It is no surprise that philosophers reach different conclusions. But what is more disheartening is how they fail to agree on the basic terms and definitions. Philosophers are human, and humans share a lot of implicit knowledge and common sense. And one could argue that the whole purpose of modern analytic philosophy is to clarify and define terms and relations. And yet, despite that, philosophers still disagree on the meaning of basic terminology, write long dissertations, and present papers at conferences outlining their disagreements. This is not due to poor-quality philosophers, or to some lackadaisical approach to the whole issue: very smart people, driven to present their pet ideas with the utmost clarity, fail to properly communicate their concepts to very similar human beings. The complexity of the human brain is enormous (it includes connections among approximately a hundred billion neurons); the complexity of human concepts such as love, meaning, and life is probably smaller, but it still seems far beyond the ability of even brilliant minds to formalize these concepts.

Is the situation any better from the perspective of those dealing with computers—AI developers and computer scientists? Here the problem is reversed: while philosophers fail to capture human concepts in unambiguous language, some computer scientists are fond of presenting simple unambiguous definitions and claiming these capture human concepts. It’s not that there’s a lack of suggestions as to how to code an AI that is safe—it’s that there are too many, and most are very poorly thought out. The “one big idea that will solve AI” is a popular trope in the field.

For instance, one popular suggestion that reappears periodically is to confine the AI to only answering questions—no manipulators, no robot arms or legs. This suggestion has some merit, but often those who trot it out are trapped in the “Terminator” mode of thinking—if the AI doesn’t have a robot body bristling with guns, then it can’t harm us. This completely fails to protect against socially manipulative AIs, against patient AIs with long time horizons, or against AIs that simply become so essential to human societies and economies that we dare not turn them off.

Another common idea is to have the AI designed as a mere instrument, with no volition of its own, simply providing options to its human controller (akin to how Google search provides us with links on which to click, except that the AI would bring vast intelligence to the task of providing us with the best alternatives). But that image of a safe, inert instrument doesn’t scale well: as we’ve seen, we humans will be compelled by our slow thinking to put more and more trust in the AI’s decisions. So as the AI’s power grows, we will still need to code safety precautions.

How will the AI check whether it’s accomplishing its goals or not? Even instrumental software needs some criteria for what counts as a better or worse response. Note that goals like “provide humans with their preferred alternative” are closely akin to the “make sure humans report maximal happiness” goal that we discussed earlier—and flawed for the very same reason. The AI will be compelled to change our preferences to best reach its goal.

Other dangerous[1] suggestions in the computer sciences start with something related to some human values and then claim that as the totality of all values. A recent example was “complexity.” Noticing that human preferences were complex and that we often prefer a certain type of complexity in art, a suggestion was made to program the AI to maximize that type of complexity.[2] But humans care about more than just complexity: we wouldn’t want friendship, love, babies, and humans themselves squeezed out of the world, just to make way for complexity. Sure, babies and love are complex, but we wouldn’t want them replaced with more complex alternatives that the AI is able to come up with. Hence, complexity does not capture what we really value. It was a trick: we hoped we could code human morality without having to code human morality. We hoped that complexity would somehow unfold to match exactly what we valued, sparing us all the hard work.

This is just one example—lots of other simple solutions to human morality have been proposed by various people, generally with the same types of flaws. The designs are far too simple to contain much of human value at all, and their creators don’t put the work in to prove that what we value and what best maximizes X are actually the same thing. Saying that human values entail a high X does not mean that pursuing the highest X ensures that human values are fulfilled.
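The gap between "human values entail a high X" and "maximizing X fulfils human values" can be shown with a toy model. The sketch below is only an illustration: the worlds, their names, and both sets of scores are invented, with "complexity" standing in for X. Every world we value scores high on the proxy, yet the world that scores highest on the proxy has no value to us at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class World:
    name: str
    complexity: float   # the proxy X the AI is told to maximize
    human_value: float  # what we actually care about (never shown to the AI)


# Hypothetical toy worlds; every number here is invented for illustration.
WORLDS = [
    World("barren rock", 0.1, 0.0),
    World("thriving human society", 0.7, 1.0),
    World("human society with richer art", 0.8, 0.95),
    World("dense noise patterns, no humans", 1.0, 0.0),
]

# What a proxy maximizer picks versus what we would pick.
proxy_choice = max(WORLDS, key=lambda w: w.complexity)
value_choice = max(WORLDS, key=lambda w: w.human_value)

# "Human values entail a high X": every valued world has high complexity.
valued = [w for w in WORLDS if w.human_value > 0.5]
all_valued_have_high_x = all(w.complexity >= 0.7 for w in valued)

print("All valued worlds have high X:", all_valued_have_high_x)
print("Proxy maximizer picks:", proxy_choice.name)
print("We would pick:", value_choice.name)
```

The implication runs in one direction only. Nothing in the proxy score distinguishes the complexity we care about from the complexity we do not, so the maximizer has no reason to prefer the former.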

Other approaches, slightly more sophisticated, acknowledge the complexity of human values and attempt to instil them into the AI indirectly.[3] The key features of these designs are social interactions and feedback with humans.[4] Through conversations, the AIs develop their initial morality and eventually converge on something filled with happiness and light and ponies. These approaches should not be dismissed out of hand, but the proposers typically underestimate the difficulty of the problem and project too many human characteristics onto the AI. This kind of intense feedback is likely to produce moral humans. (I still wouldn’t trust them with absolute power, though.) But why would an alien mind such as the AI react in comparable ways? Are we not simply training the AI to give the correct answer in training situations?

The whole approach is a constraint problem: in the space of possible AI minds, we are going to give priority to those minds that pass successfully through this training process and reassure us that they’re safe. Is there some quantifiable way of measuring how likely this is to produce a human-friendly AI at the end of it? If there isn’t, why are we putting any trust in it?
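One way to make the constraint framing concrete is to count. The sketch below is a deliberately crude toy, not a real estimate: it assumes a "mind" is nothing more than a table saying whether it behaves safely in each of four hypothetical situations, that training only ever observes the first two, and that every such table is equally likely. All of these are assumptions made for illustration, and the last one is exactly the thing we do not know about real AI designs.

```python
from itertools import product

# Hypothetical situations; training can only observe the first two.
SITUATIONS = [
    "quiz question",
    "supervised chat",
    "unsupervised deployment",
    "access to resources",
]
TRAINING = SITUATIONS[:2]

# A toy "mind" is a table: situation -> does it behave safely there?
minds = [
    dict(zip(SITUATIONS, behaviours))
    for behaviours in product([True, False], repeat=len(SITUATIONS))
]


def passes_training(mind):
    """The mind reassures us in every situation we can test."""
    return all(mind[s] for s in TRAINING)


def safe_everywhere(mind):
    """The mind behaves safely in all situations, tested or not."""
    return all(mind.values())


passed = [m for m in minds if passes_training(m)]
truly_safe = [m for m in passed if safe_everywhere(m)]

print(f"{len(minds)} possible minds")
print(f"{len(passed)} pass training")
print(f"{len(truly_safe)} of those are safe everywhere")
```

In this toy, passing training narrows sixteen candidate minds to four, of which only one is safe in the situations training never reached. The numbers themselves mean nothing; the point is that without some justified measure over the space of minds, passing the training process tells us little about behavior outside it.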

These problems remain barely addressed, so though it is possible to imagine a safe AI being developed using the current approaches (or their descendants), it feels extremely unlikely. Hence we shouldn’t put our trust in the current crop of experts to solve the problem. More work is urgently, perhaps desperately, needed.

[1] Dangerous because any suggestion that doesn’t cover nearly all of human values is likely to leave out many critical values we would never want to live without.