Disentangling arguments for the importance of AI safety

I recently attended the 2019 Beneficial AGI conference organised by the Future of Life Institute. I'll publish a more complete write-up later, but I was particularly struck by how varied attendees' reasons for considering AI safety important were. Before this, I'd observed a few different lines of thought, but interpreted them as different facets of the same idea. Now, though, I've identified at least 6 distinct serious arguments for why AI safety is a priority. By distinct I mean that you can believe any one of them without believing any of the others, although of course the particular categorisation I use is rather subjective, and there's a significant amount of overlap. In this post I give a brief overview of my own interpretation of each argument (note that I don't necessarily endorse them myself). They are listed roughly from most specific and actionable to most general. I finish with some thoughts on what to make of this unexpected proliferation of arguments. Primarily, I think it increases the importance of clarifying and debating the core ideas in AI safety.

Maximisers are dangerous. Superintelligent AGI will behave as if it's maximising the expectation of some utility function, since doing otherwise can be shown to be irrational. Yet we can't write down a utility function which precisely describes human values, and optimising very hard for any other function will lead to that AI rapidly seizing control (as a convergent instrumental subgoal) and building a future which contains very little of what we value (because of Goodhart's law and the complexity and fragility of values). We won't have a chance to notice and correct misalignment, because an AI which has exceeded human level will improve its intelligence very quickly (either by recursive self-improvement or by scaling up its hardware), and then prevent us from modifying it or shutting it down.
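The Goodhart's-law step in this argument can be illustrated with a toy model (my own construction, not from the sources above): an agent with a fixed effort budget optimises a proxy that measures only part of what we actually value.

```python
# Toy illustration of Goodhart's law: optimising a proxy objective
# destroys the part of "true" value that the proxy fails to measure.
# All numbers and functions here are hypothetical, for illustration only.

def true_utility(x, y):
    # What we actually value: both dimensions matter.
    return min(x, y)

def proxy(x, y):
    # What we managed to write down: only x is measured.
    return x

# The agent splits a fixed budget of 10 effort units between x and y.
allocations = [(x, 10 - x) for x in range(11)]

best_for_proxy = max(allocations, key=lambda p: proxy(*p))
best_for_truth = max(allocations, key=lambda p: true_utility(*p))

print(best_for_proxy)                 # (10, 0): all effort on the measured dimension
print(true_utility(*best_for_proxy))  # 0: the unmeasured value is wiped out
print(true_utility(*best_for_truth))  # 5: what optimising the true utility achieves
```

The proxy roughly agrees with the true utility on middling allocations; the divergence appears precisely when the proxy is optimised as hard as possible, which is the worry about very capable maximisers.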

This was the main thesis advanced by Yudkowsky and Bostrom when founding the field of AI safety. Here I've tried to convey the original line of argument, although some parts of it have been strongly critiqued since then. In particular, Drexler and Shah have disputed the relevance of expected utility maximisation (the latter suggesting the concept of goal-directedness as a replacement), while Hanson and Christiano disagree that AI intelligence will increase in a very fast and discontinuous way.

Most of the arguments in this post originate from or build on this one in some way. This is particularly true of the next two arguments; nevertheless, I think that there's enough of a shift in focus in each to warrant separate listings.

The target loading problem. Even if we knew exactly what we wanted a superintelligent agent to do, we don't currently know (even in theory) how to make an agent which actually tries to do that. In other words, if we were to create a superintelligent AGI before solving this problem, the goals we would ascribe to that AGI (by taking the intentional stance towards it) would not be the ones we had intended to give it. As a motivating example, evolution selected humans for their genetic fitness, yet humans have goals which are very different from just spreading their genes. In a machine learning context, while we can specify a finite number of data points and their rewards, neural networks may then extrapolate from these rewards in non-humanlike ways.
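The machine-learning point at the end can be made concrete with a deliberately simple sketch (my own, with hypothetical numbers): a learner that fits the labelled points perfectly can still diverge arbitrarily far from the intended reward outside the training range.

```python
# Minimal sketch of the extrapolation worry: rewards specified on a
# finite set of points underdetermine behaviour far outside that set.
# The functions and data here are invented purely for illustration.

def true_reward(x):
    # What we actually want: more is better only up to a point.
    return min(x, 5)

# We only label a few small inputs, where the true reward looks linear.
training_data = [(x, true_reward(x)) for x in range(5)]  # (0,0) ... (4,4)

def learned_reward(x):
    # The simplest hypothesis consistent with every training point:
    # the identity function fits the labelled data exactly.
    return x

assert all(learned_reward(x) == r for x, r in training_data)

# Far from the training distribution, the learned reward diverges badly.
print(learned_reward(1000))  # 1000: the learner expects unbounded reward
print(true_reward(1000))     # 5: we actually stopped caring long ago
```

Nothing in the training data distinguishes the intended saturating reward from the unbounded one; the gap only shows up once the learner acts outside the regime we labelled.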

This is a more general version of the “inner optimiser problem”, and I think it captures the main thrust of the latter while avoiding the difficulties of defining what actually counts as an “optimiser”. I'm grateful to Nate Soares for explaining the distinction, and arguing for the importance of this problem.

The prosaic alignment problem. It is plausible that we build “prosaic AGI”, which replicates human behaviour without requiring breakthroughs in our understanding of intelligence. Shortly after they reach human level (or possibly even before), such AIs will become the world's dominant economic actors. They will quickly come to control the most important corporations, earn most of the money, and wield enough political influence that we will be unable to coordinate to place limits on their use. Due to economic pressures, corporations or nations who slow down AI development and deployment in order to focus on aligning their AI more closely with their values will be outcompeted. As AIs exceed human-level intelligence, their decisions will become too complex for humans to understand or provide feedback on (unless we develop new techniques for doing so), and eventually we will no longer be able to correct the divergences between their values and ours. Thus the majority of the resources in the far future will be controlled by AIs which don't prioritise human values. This argument was explained in this blog post by Paul Christiano.

More generally, aligning multiple agents with multiple humans is much harder than aligning one agent with one human, because value differences might lead to competition and conflict even between agents that are each fully aligned with some humans. (As my own speculation, it's also possible that having multiple agents would increase the difficulty of single-agent alignment; e.g. the question “what would humans want if I didn't manipulate them?” would no longer track our values if we would counterfactually be manipulated by a different agent.)

The human safety problem. This line of argument (which Wei Dai has recently highlighted) claims that no human is “safe” in the sense that giving them absolute power would produce good futures for humanity in the long term, and therefore that building AI which extrapolates and implements the values of even a very altruistic human is insufficient. A prosaic version of this argument emphasises the corrupting effect of power, and the fact that morality is deeply intertwined with social signalling; however, I think there's a stronger and more subtle version. In everyday life it makes sense to model humans as mostly rational agents pursuing their goals and values. However, this abstraction breaks down badly in more extreme cases (e.g. addictive superstimuli, unusual moral predicaments), implying that human values are somewhat incoherent. One such extreme case is running my brain for a billion years, after which it seems very likely that my values will have shifted or distorted radically, in a way that my original self wouldn't endorse. Yet if we want a good future, this is the process which we require to go well: a human (or a succession of humans) needs to maintain broadly acceptable and coherent values for astronomically long time periods.

An obvious response is that we shouldn't entrust the future to one human, but rather to some group of humans following a set of decision-making procedures. However, I don't think any currently-known institution is actually much safer than individuals over the sort of timeframes we're talking about. Presumably a committee of several individuals would have lower variance than just one, but as that committee grows you start running into well-known problems with democracy. And while democracy isn't a bad system, it seems unlikely to be robust on the timeframe of millennia or longer. (Alex Zhu has made the interesting argument that the problem of an individual maintaining coherent values is roughly isomorphic to the problem of a civilisation doing so, since both are complex systems composed of individual “modules” which often want different things.)

While AGI amplifies the human safety problem, it may also help solve it if we can use it to decrease the value drift that would otherwise occur. Also, while it's possible that we need to solve this problem in conjunction with other AI safety problems, it might be postponable until after we've achieved civilisational stability.

Note that I use “broadly acceptable values” rather than “our own values”, because it's very unclear to me which types or extent of value evolution we should be okay with. Nevertheless, there are some values which we definitely find unacceptable (e.g. having a very narrow moral circle, or wanting your enemies to suffer as much as possible) and I'm not confident that we'll avoid drifting into them by default.

Misuse and vulnerabilities. These might be catastrophic even if AGI always carries out our intentions to the best of its ability:

AI which is superhuman at science and engineering R&D will be able to invent very destructive weapons much faster than humans can. Humans may well be irrational or malicious enough to use such weapons even when doing so would lead to our extinction, especially if they're invented before we improve our global coordination mechanisms. It's also possible that we invent some technology which destroys us unexpectedly, either through unluckiness or carelessness. For more on the dangers from technological progress in general, see Bostrom's paper on the vulnerable world hypothesis.

AI could be used to disrupt political structures, for example via unprecedentedly effective psychological manipulation. In an extreme case, it could be used to establish very stable totalitarianism, with automated surveillance and enforcement mechanisms ensuring an unshakeable monopoly on power for leaders.

AI could be used for large-scale projects (e.g. climate engineering to prevent global warming, or managing the colonisation of the galaxy) without sufficient oversight or verification of robustness. Software or hardware bugs might then induce the AI to make unintentional yet catastrophic mistakes.

Argument from large impacts. Even if we're very uncertain about what AGI development and deployment will look like, it seems likely that AGI will have a very large impact on the world in general, and that further investigation into how to direct that impact could prove very valuable.

Weak version: development of AGI will be at least as big an economic jump as the industrial revolution, and therefore affect the trajectory of the long-term future. See Ben Garfinkel's talk at EA Global London 2018 (which I'll link when it's available online). Ben noted that to consider work on AI safety important, we also need to believe the additional claim that there are feasible ways to positively influence the long-term effects of AI development, something which may not have been true for the industrial revolution. (Personally my guess is that since AI development will happen more quickly than the industrial revolution, power will be more concentrated during the transition period, and so influencing its long-term effects will be more tractable.)

Strong version: development of AGI will make humans the second most intelligent species on the planet. Given that it was our intelligence which allowed us to control the world to the large extent that we do, we should expect that entities which are much more intelligent than us will end up controlling our future, unless there are reliable and feasible ways to prevent it. So far we have not discovered any.

What should we think about the fact that there are so many arguments for the same conclusion? As a general rule, the more arguments support a statement, the more likely it is to be true. However, I'm inclined to believe that quality matters much more than quantity: it's easy to make up weak arguments, but you only need one strong one to outweigh all of them. And this proliferation of arguments is (weak) evidence against their quality: if the conclusions of a field remain the same but the reasons given for holding those conclusions change, that's a warning sign for motivated cognition (especially when those beliefs are considered socially important). This problem is exacerbated by a lack of clarity about which assumptions and conclusions are shared between arguments, and which aren't.

On the other hand, superintelligent AGI is a very complicated topic, and so perhaps it's natural that there are many different lines of thought. One way to put this in perspective (which I credit to Beth Barnes) is to think about the arguments which might have been given for worrying about nuclear weapons, before they had been developed. Off the top of my head, there are at least three:

They might be used deliberately.

They might be set off accidentally.

They might cause a nuclear chain reaction much larger than anticipated.

And there are probably more which would have been credible at the time, but which seem silly now due to hindsight bias. So if there'd been an active anti-nuclear movement in the '30s or early '40s, the motivations of its members might well have been as disparate as those of AI safety advocates today. Yet the overall concern would have been (and still is) totally valid and reasonable.

I think the main takeaway from this post is that the AI safety community as a whole is still confused about the very problem we are facing. The only way to dissolve this tangle is to have more communication and clarification of the fundamental ideas in AI safety, particularly in the form of writing which is made widely available. And while it would be great to have AI safety researchers explaining their perspectives more often, I think there is still a lot of explicatory work which can be done regardless of technical background. In addition to analysis of the arguments discussed in this post, I think it would be particularly useful to see more descriptions of deployment scenarios and corresponding threat models. It would also be valuable for research agendas to highlight which problem they are addressing, and the assumptions they require to succeed.

Strong upvote. This is exactly the kind of post I'd like to see more often on the Forum: It summarizes many different points of view without trying to persuade anyone, points out some core areas of agreement, and names people who seem to believe different things (perhaps opening lines for productive discussion in the process). Work like this will be critical for EA's future intellectual progress.

And this proliferation of arguments is (weak) evidence against their quality: if the conclusions of a field remain the same but the reasons given for holding those conclusions change, that's a warning sign for motivated cognition (especially when those beliefs are considered socially important).

I'm not sure these considerations should be too concerning in this case, for a couple of reasons.

I agree that it's concerning where “conclusions… remain the same but the reasons given for holding those conclusions change” in cases where people originally (putatively) believe p because of x, then x is shown to be a weak consideration and so they switch to citing y as a reason to believe p. But from your post it doesn't seem like that's necessarily what has happened, rather than a conclusion being overdetermined by multiple lines of evidence. Of course, particular people in the field may have switched between some of these reasons, having decided that some of them are not so compelling. But in the case of many of the reasons cited above, the differences between the positions seem sufficiently subtle that we should expect cases of people clarifying their own understanding by shifting to closely related positions (e.g. it seems plausible someone might reasonably switch from thinking that the main problem is knowing how to precisely describe what we value to thinking that the main problem is not knowing how to make an agent try to do that).

It also seems like a proliferation of arguments in favour of a position is not too concerning where there are plausible reasons why we should expect multiple of the considerations to apply simultaneously. For example, you might think that any kind of powerful agent typically presents a threat in multiple different ways, in which case it wouldn't be suspicious if people cited multiple distinct considerations as to why such agents are dangerous.

I agree that it's not too concerning, which is why I consider it weak evidence. Nevertheless, there are some changes which don't fit the patterns you described. For example, it seems to me that newer AI safety researchers tend to consider intelligence explosions less likely, despite them being a key component of argument 1. For more details along these lines, check out the exchange between me and Wei Dai in the comments on the version of this post on the alignment forum.

Agreed. I think these reasons seem to fit fairly easily into the following schema: Each of A, B, C, and D is necessary for a good outcome. Different people focus on failures of A, failures of B, etc. depending on which necessary criterion seems to them most difficult to satisfy and most salient.

Hi Richard, really interesting! However, I think all your 6 reasons still think of AGI as being an independent agent. What do you think of this https://www.fhi.ox.ac.uk/reframing/ by Drexler, which frames AGI as a comprehensive set of services? To me this makes the problem much more tractable and better aligns with how we see things actually progressing.

Drexler would disagree with some of Richard's phrasing, but he seems to agree that most (possibly all) of (somewhat modified versions of) those 6 reasons should cause us to be somewhat worried. In particular, he's pretty clear that powerful utility maximisers are possible and would be dangerous.

6 describes the AGI as a “species”; services are not a species, agents are a species. 4 and 5 as written describe the AGI as an agent: once the AGI is described as an “it” that is doing something, it certainly sounds like an independent agent to me. A service and an agent are fundamentally different in nature, not just different views of the same thing, as the outcome would depend on the objectives of the instructing agent.

I've actually spent a fair while thinking about CAIS, and written up my thoughts here. Overall I'm skeptical about the framework, but if it turns out to be accurate I think that would heavily mitigate arguments 1 and 2, somewhat mitigate 3, and not affect the others very much. Insofar as 4 and 5 describe AGI as an agent, that's mostly because it's linguistically natural to do so; I've now edited some of those phrases. 6b does describe AI as a species, but it's unclear whether that conflicts with CAIS, insofar as the claim that AI will never be agentlike is a very strong one, and I'm not sure whether Drexler makes it explicitly (I discuss this point in the blog post I linked above).

“Skeptical about the framework” I do not agree with. Indeed, it seems a useful model for how we as humans are. We become expert to varying degrees at a range of tasks or services through training: as we get in a car we turn on our “driving services” module (and sub-modules), for example. And then, underlying and separately, we have our unconscious, which drives the majority of our motivations as a “free agent”: our mammalian brain, which drives our socialising and norming actions, and underneath that our limbic brain, which deals with emotions like fear and status, which in my experience are the things that “move the money” if they are encouraged.

It does not seem to me we are particularly “generally intelligent”. Put in a completely unfamiliar setting without all the tools that now prop us up, we would struggle far more than a species already familiar with that environment.

The intelligent agent approach to me takes the debate in the wrong direction, and most concerningly dramatically understates the near and present danger of utility-maximising services (“this is not superintelligence”), such as this example discussed by Yuval Noah Harari and Tristan Harris.

I think this is a good comment about how the brain works, but do remember that the human brain can both hunt in packs and do physics. Most systems you might build to hunt are not able to do physics, and vice versa. We're not perfectly competent, but we're still general.

I agree that the extent to which individual humans are rational agents is often overstated. Nevertheless, there are many examples of humans who spend decades striving towards distant and abstract goals, who learn whatever skills and perform whatever tasks are required to reach them, and who strategically plan around or manipulate the actions of other people. If AGI is anywhere near as agentlike as humans in the sense of possessing the long-term goal-directedness I just described, that's cause for significant concern.

A lifetime learning to be a 9th dan master at Go, perhaps? Building on the back of thousands of years of human knowledge and wisdom? Demolished in hours… I still look at the game and it looks incredibly abstract!

Don't get me wrong, I am really concerned; I just consider the danger much closer than others do, but also more soluble if we look at the right problem and ask the right questions.