Five theses, two lemmas, and a couple of strategic implications

MIRI’s primary concern about self-improving AI isn’t so much that it might be created by ‘bad’ actors rather than ‘good’ actors in the global sphere; rather, most of our concern is with remedying the situation in which no one knows at all how to create a self-modifying AI with known, stable preferences. (This is why we see the main problem in terms of doing research and encouraging others to perform relevant research, rather than trying to stop ‘bad’ actors from creating AI.)

This, and a number of our other basic strategic views, can be summed up as a consequence of five theses about purely factual questions concerning AI, and two lemmas we think they imply, as follows:

Intelligence explosion thesis. A sufficiently smart AI will be able to realize large, reinvestable cognitive returns from things it can do on a short timescale, like improving its own cognitive algorithms or purchasing/stealing lots of server time. The intelligence explosion will hit very high levels of intelligence before it runs out of things it can do on a short timescale. See: Chalmers (2010); Muehlhauser & Salamon (2013); Yudkowsky (2013).

Orthogonality thesis. Mind design space is huge enough to contain agents with almost any set of preferences, and such agents can be instrumentally rational about achieving those preferences, and have great computational power. For example, mind design space theoretically contains powerful, instrumentally rational agents which act as expected paperclip maximizers and always consequentialistically choose the option which leads to the greatest number of expected paperclips. See: Bostrom (2012); Armstrong (2013).
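As a toy illustration of the Orthogonality Thesis (my sketch, not anything from the post), instrumental rationality can be modeled as expected-utility maximization in which the utility function is simply a free parameter. All the names and the outcome model below are hypothetical; the point is only that nothing in the decision procedure constrains what gets maximized:

```python
# Toy expected-utility maximizer. The machinery of "instrumental
# rationality" is goal-agnostic: plug in any utility function.

def expected_utility(action, outcomes, utility):
    """Sum utility over an action's possible outcomes, weighted by probability."""
    return sum(p * utility(result) for result, p in outcomes[action].items())

def choose(outcomes, utility):
    """Pick the action with the greatest expected utility."""
    return max(outcomes, key=lambda a: expected_utility(a, outcomes, utility))

# Hypothetical outcome model: action -> {resulting paperclip count: probability}
outcomes = {
    "run_factory":  {100: 0.9, 0: 0.1},    # expected 90 paperclips
    "do_nothing":   {0: 1.0},              # expected 0
    "risky_gamble": {1000: 0.05, 0: 0.95}, # expected 50
}

def paperclips(n):
    """An 'arbitrary' final goal: utility is just the paperclip count."""
    return n

print(choose(outcomes, paperclips))  # prints "run_factory"
```

Swapping `paperclips` for any other function changes what the agent pursues without changing how competently it pursues it, which is the thesis in miniature.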

Convergent instrumental goals thesis. Most possible final goals generate a common subset of instrumental goals, so most utility functions converge on pursuing them. For example, if you want to build a galaxy full of happy sentient beings, you will need matter and energy, and the same is true if you want to make paperclips. This thesis is why we’re worried about very powerful entities even if they have no explicit dislike of us: “The AI does not love you, nor does it hate you, but you are made of atoms it can use for something else.” Note though that by the Orthogonality Thesis you can always have an agent which explicitly, terminally prefers not to do any particular thing — an AI which does love you will not want to break you apart for spare atoms. See: Omohundro (2008); Bostrom (2012).

Complexity of value thesis. It takes a large chunk of Kolmogorov complexity to describe even idealized human preferences. That is, what we ‘should’ do is a computationally complex mathematical object even after we take the limit of reflective equilibrium (judging your own thought processes) and other standard normative theories. A superintelligence with a randomly generated utility function would not do anything we see as worthwhile with the galaxy, because it is unlikely to accidentally hit on final preferences for having a diverse civilization of sentient beings leading interesting lives. See: Yudkowsky (2011); Muehlhauser & Helm (2013).

Fragility of value thesis. Getting a goal system 90% right does not give you 90% of the value, any more than correctly dialing 9 out of 10 digits of my phone number will connect you to somebody who’s 90% similar to Eliezer Yudkowsky. There are multiple dimensions of value such that eliminating any one of them would eliminate almost all value from the future. For example, an alien species which shared almost all of human value, except that its parameter setting for “boredom” was much lower, might devote most of its computational power to replaying a single peak, optimal experience over and over again with slightly different pixel colors (or the equivalent thereof). Friendly AI is more like a satisficing threshold than something where we’re trying to eke out successive 10% improvements. See: Yudkowsky (2009, 2011).
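One way to picture fragility (a toy model of my own, not MIRI’s formalism) is to treat the value of an outcome as multiplicative across dimensions of value, so that zeroing out any single dimension zeroes out nearly everything, even when every other dimension is maximized. The dimension names below are invented for illustration:

```python
# Toy fragility model: value is the product of per-dimension scores in
# [0, 1], so a single zeroed-out dimension ruins the whole outcome.

def value(dimensions):
    """Multiply per-dimension scores; one zero eliminates almost all value."""
    total = 1.0
    for score in dimensions.values():
        total *= score
    return total

# "90% right": perfect on most dimensions, but one dimension was dropped.
almost_right = {"happiness": 1.0, "freedom": 1.0, "novelty": 0.0}

# Merely decent on every dimension, with none dropped.
actually_right = {"happiness": 0.9, "freedom": 0.9, "novelty": 0.9}

print(value(almost_right))    # prints 0.0
print(value(actually_right))  # prints roughly 0.73
```

Under this (deliberately simplistic) model, a goal system that is decent everywhere beats one that is perfect everywhere except for a single missing dimension, which is the satisficing-threshold intuition.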

These five theses seem to imply two important lemmas:

Indirect normativity. Programming a self-improving machine intelligence to implement a grab-bag of things-that-seem-like-good-ideas will lead to a bad outcome, regardless of how good the apple pie and motherhood sounded. E.g., if you give the AI a final goal to “make people happy” it’ll just turn people’s pleasure centers up to maximum. “Indirectly normative” is Bostrom’s term for an AI that calculates the ‘right’ thing to do via, e.g., looking at human beings and modeling their decision processes and idealizing those decision processes (e.g. what you would-want if you knew everything the AI knew and understood your own decision processes, reflective equilibria, ideal advisor theories, and so on), rather than being told a direct set of ‘good ideas’ by the programmers. Indirect normativity is how you deal with Complexity and Fragility. If you can succeed at indirect normativity, then small variances in essentially good intentions may not matter much — that is, if two different projects do indirect normativity correctly, but one project has 20% nicer and kinder researchers, we could still hope that the end results would be of around equal expected value. See: Muehlhauser & Helm (2013).

Large bounded extra difficulty of Friendliness. You can build a Friendly AI (by the Orthogonality Thesis), but you need a lot of work and cleverness to get the goal system right. Probably more importantly, the rest of the AI needs to meet a higher standard of cleanness in order for the goal system to remain invariant through a billion sequential self-modifications. Any AI smart enough to do clean self-modification will tend to do so regardless, but the problem is that an intelligence explosion might get started with AIs substantially less smart than that — for example, with AIs that rewrite themselves using genetic algorithms or other such means that don’t preserve a set of consequentialist preferences. In this case, building a Friendly AI could mean that our AI has to be smarter about self-modification than the minimal AI that could undergo an intelligence explosion. See: Yudkowsky (2008) and Yudkowsky (2013).

These lemmas in turn have two major strategic implications:

We have a lot of work to do on things like indirect normativity and stable self-improvement. At this stage a lot of this work looks really foundational — that is, we can’t describe how to do these things using infinite computing power, let alone finite computing power. We should get started on this work as early as possible, since basic research often takes a lot of time.

There needs to be a Friendly AI project that has some sort of boost over competing projects which don’t live up to a (very) high standard of Friendly AI work — a project which can successfully build a stable-goal-system self-improving AI, before a less-well-funded project hacks together a much sloppier self-improving AI. Giant supercomputers may be less important to this than being able to bring together the smartest researchers (see the open question posed in Yudkowsky 2013) but the required advantage cannot be left up to chance. Leaving things to default means that projects less careful about self-modification would have an advantage greater than casual altruism is likely to overcome.

Let us suppose, for the sake of argument, that MIRI one day realizes FAI. The most important question, and it is one that outweighs and overshadows any other, is: how does MIRI solve the security challenge that such a program, once created, can be modified at will by anyone who cares to try? There is ‘theoretic’ stability, and then there is just stability, which must yield to practical reality. If one is going to discuss stability it must be in the latter context. While I realize the intellectual focus is on garnering the theoretic ability to devise systems for stability under self-modification, there are external factors that make all of that secondary.

Luke Muehlhauser

Are you trying to express the worry that a machine intelligence will be modified by external agents after it undergoes an intelligence explosion up to superhuman levels of intelligence? Or are you suggesting something else?

Amanda Zheng

My interpretation of the comment is that, though MIRI is focusing on making the intelligence itself friendly, there is nothing stopping external agents (such as humans with dangerous or harmful motives) from modifying it.

Under the same assumption, i.e. that FAI is realized, shouldn’t the superintelligence be able to resist the efforts of a human? I recall that one of MIRI’s reasons for researching FAI before AI is realized is that once an intelligence undergoes an intelligence explosion, it will be too late for us to implement our desires and goals into it. There is no way to go back and try to put them in afterward.

http://blog.dustinjuliano.com Dustin Juliano

Machine intelligence, by the nature of being an information processing system, will always be susceptible to modification, functional interference, and/or reverse engineering. No responsible individual educated in computer science and/or engineering would attempt to assert or imply otherwise. Its program description, “superhuman” or otherwise, does not suddenly become immutable by merit of being more complex. Permanently guaranteeing the integrity and stability of such an invention ultimately implies notions of force and/or secrecy. This sets the stage for the next point.

Disregarding “bad” actors and calling for aid in the race for MIRI’s conception of FAI entails an unstated assumption that reaching the “intelligence explosion” first somehow guarantees mutual exclusion to others who may create an alternative at nearly the same time or thereafter. This would only make sense if MIRI could “win” the race, and if the race was even winnable. All such cases imply that MIRI would seek to directly or indirectly acquire the power and resources to impose and enforce that exclusivity, which calls many things into question. The political and ideological implications of which I will leave as an exercise for the reader.

http://mattmahoney.net Matt Mahoney

What is the minimal intelligence that can improve itself? How do you know? Exactly what do you mean by “improve”?

Either self-improving systems exist or they don’t. If they do exist, then isn’t it too late? If they don’t exist, then how do you know it is possible?

Haydn

I feel this post needs (at least) a paragraph sketching out a plausible story for how a computer program gets to the point where it can break me apart for spare atoms.

jbash

Do you ask for that because you yourself don’t see how that would happen, or because you think others wouldn’t?

I think it’s pretty obvious how it could do that… but I’ve been following issues like this for decades and it’s hard for me to see what others might find hard to swallow.

http://opentheory.net Mike Johnson

A lot of (mostly good) thoughts here. I’d single one out:

“no one knows at all how to create a self-modifying AI with known, stable preferences.”

This seems only half true. I.e., the human brain is a (weakly) self-modifying intelligence, with (somewhat) known, (semi-)stable preferences. I can’t say we really understand how to build a human brain, but we do have some knowledge as to the causes of variance in the relevant properties you abstracted: what makes some people more self-modifying, and what makes some people have more transparent and stable preferences.

http://joshuafox.com Joshua Fox

> Getting a goal system 90% right does not give you 90% of the value, any more than correctly
> dialing 9 out of 10 digits of my phone number will connect you to somebody
> who’s 90% similar to Eliezer Yudkowsky.

The comparison is inapt.

True, in both cases a miss is as good as a mile.

But getting a goal system 90% right does get you a lot of things that you want (and a hellish mess because of the failed 10%); whereas two phone numbers with 9 digits matching have nothing to do with each other semantically (except for area codes, sometimes, and vestigial remnants of neighborhood “exchanges” making the 3 digits after the area code match in a certain area).

jbash

Thank you for the clear and succinct summary (with nice supporting links). I think that’s exactly the sort of thing you need to produce on the outreach front.

ESRogs

The assumption is that once it had realized the “large, reinvestable cognitive returns” available “on a short timescale,” the AI could handle itself.