Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I’m always happy to hear feedback.

Highlights

A shift in arguments for AI risk (Tom Sittler): Early arguments for AI safety focused on existential risk caused by a failure of alignment combined with a sharp, discontinuous jump in AI capabilities. The discontinuity assumption is needed in order to argue for a treacherous turn, for example: without a discontinuity, we would presumably see less capable AI systems fail to hide their misaligned goals from us, or attempt to deceive us without success. Similarly, in order for an AI system to obtain a decisive strategic advantage, it would need to be significantly more powerful than all the other AI systems already in existence, which requires some sort of discontinuity.

Now, there are several other arguments for AI risk, though none of them has been made in great detail, and they are spread out over a few blog posts. This post analyzes several of them and points out some open questions.

First, even without a discontinuity, a failure of alignment could lead to a bad future: since the AIs have more power and intelligence, their values will determine what happens in the future, rather than ours. (Here it is the difference between AIs and humans that matters, whereas for a decisive strategic advantage it is the difference between the most intelligent agent and the next-most intelligent agents that matters.) See also More realistic tales of doom (AN #50) and Three impacts of machine intelligence. However, it isn’t clear why we wouldn’t be able to fix the misalignment at the early stages when the AI systems are not too powerful.

Even if we ignore alignment failures, there are other AI risk arguments. In particular, since AI will be a powerful technology, it could be used by malicious actors; it could help ensure robust totalitarian regimes; it could increase the likelihood of great-power war; and it could lead to stronger competitive pressures that erode value. With all of these arguments, it’s not clear why they are specific to AI, as opposed to any important technology, and the arguments for risk have not been sketched out in detail.

The post ends with an ex­hor­ta­tion to AI safety re­searchers to clar­ify which sources of risk mo­ti­vate them, be­cause it will in­fluence what safety work is most im­por­tant, it will help cause pri­ori­ti­za­tion efforts that need to de­ter­mine how much money to al­lo­cate to AI risk, and it can help avoid mi­s­un­der­stand­ings with peo­ple who are skep­ti­cal of AI risk.

Ro­hin’s opinion: I’m glad to see more work of this form; it seems par­tic­u­larly im­por­tant to gain more clar­ity on what risks we ac­tu­ally care about, be­cause it strongly in­fluences what work we should do. In the par­tic­u­lar sce­nario of an al­ign­ment failure with­out a dis­con­ti­nu­ity, I’m not satis­fied with the solu­tion “we can fix the mis­al­ign­ment early on”, be­cause early on even if the mis­al­ign­ment is ap­par­ent to us, it likely will not be easy to fix, and the mis­al­igned AI sys­tem could still be use­ful be­cause it is “al­igned enough”, at least at this low level of ca­pa­bil­ity.

Personally, the argument that motivates me most is “AI will be very impactful, and it’s worth putting effort into making sure that that impact is positive”. I think the scenarios involving alignment failures without a discontinuity are a particularly important subcategory of this argument: while I do expect we will be able to handle this issue if it arises, this is mostly because of meta-level faith in humanity to deal with the problem. We don’t currently have a good object-level story for why the issue won’t happen, or why it will be fixed when it does happen, and it would be good to have such a story in order to be confident that AI will in fact be beneficial for humanity.

I know less about the non-al­ign­ment risks, and my work doesn’t re­ally ad­dress any of them. They seem worth more in­ves­ti­ga­tion; cur­rently my feel­ing to­wards them is “yeah, those could be risks, but I have no idea how likely the risks are”.

In this pa­per, my coau­thors and I pro­pose that we learn the cog­ni­tive bi­ases of the demon­stra­tor, by learn­ing their plan­ning al­gorithm. The hope is that the cog­ni­tive bi­ases are en­coded in the learned plan­ning al­gorithm. We can then perform bias-aware IRL by find­ing the re­ward func­tion that when passed into the plan­ning al­gorithm re­sults in the ob­served policy. We have two al­gorithms which do this, one which as­sumes that we know the ground-truth re­wards for some tasks, and one which tries to keep the learned plan­ner “close to” the op­ti­mal plan­ner. In a sim­ple en­vi­ron­ment with simu­lated hu­man bi­ases, the al­gorithms perform bet­ter than the stan­dard IRL as­sump­tions of perfect op­ti­mal­ity or Boltz­mann ra­tio­nal­ity—but they lose a lot of perfor­mance by us­ing an im­perfect differ­en­tiable plan­ner to learn the plan­ning al­gorithm.
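As a rough illustration of the step "find the reward function that, when passed into the planning algorithm, results in the observed policy", here is a minimal sketch. It is not the paper's method: the planner is hand-written rather than learned, the bias is a hypothetical myopic discount factor, and the environment (a five-cell line with rewards at both ends), the names `plan`, `policy` and `consistent_rewards`, and the candidate grid are all assumptions made for illustration.

```python
from itertools import product

REWARD_CELLS = (0, 4)   # hypothetical: rewards sit at both ends of a 5-cell line
STARTS = (1, 2, 3)      # hypothetical: start cells for the demonstrations

def plan(reward, start, gamma):
    """Action chosen by a planner that discounts each reward by gamma**distance."""
    def value(cell):
        return gamma ** abs(cell - start) * reward[cell]
    best = max(REWARD_CELLS, key=value)  # ties go to the first cell (cell 0)
    return "left" if best < start else "right"

def policy(reward, gamma):
    """The planner's action from every start cell."""
    return {s: plan(reward, s, gamma) for s in STARTS}

def consistent_rewards(observed, gamma, values=range(1, 9)):
    """All candidate rewards whose planned policy equals the observed policy."""
    found = []
    for r0, r4 in product(values, repeat=2):
        if policy({0: r0, 4: r4}, gamma) == observed:
            found.append((r0, r4))
    return found

# Simulated demonstrator: true reward (3, 4), but a myopic planner (gamma = 0.5).
TRUE_REWARD = {0: 3, 4: 4}
observed = policy(TRUE_REWARD, gamma=0.5)

bias_aware = consistent_rewards(observed, gamma=0.5)       # planner matches the demonstrator
assume_optimal = consistent_rewards(observed, gamma=0.99)  # near-optimal planner assumed
```

In this toy setup, `bias_aware` contains the true reward `(3, 4)` (along with other candidates, so some ambiguity remains), while `assume_optimal` is empty: no reward on the grid explains the demonstrations if the demonstrator is assumed to plan near-optimally.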

Ro­hin’s opinion: Although this only got pub­lished re­cently, it’s work I did over a year ago. I’m no longer very op­ti­mistic about am­bi­tious value learn­ing (AN #31), and so I’m less ex­cited about its im­pact on AI al­ign­ment now. In par­tic­u­lar, it seems un­likely to me that we will need to in­fer all hu­man val­ues perfectly, with­out any edge cases or un­cer­tain­ties, which we then op­ti­mize as far as pos­si­ble. I would in­stead want to build AI sys­tems that start with an ad­e­quate un­der­stand­ing of hu­man prefer­ences, and then learn more over time, in con­junc­tion with op­ti­miz­ing for the prefer­ences they know about. How­ever, this pa­per is more along the former line of work, at least for long-term AI al­ign­ment.

I do think that this is a con­tri­bu­tion to the field of in­verse re­in­force­ment learn­ing—it shows that by us­ing an ap­pro­pri­ate in­duc­tive bias, you can be­come more ro­bust to (cog­ni­tive) bi­ases in your dataset. It’s not clear how far this will gen­er­al­ize, since it was tested on simu­lated bi­ases on sim­ple en­vi­ron­ments, but I’d ex­pect it to have at least a small effect. In prac­tice though, I ex­pect that you’d get bet­ter re­sults by pro­vid­ing more in­for­ma­tion, as in T-REX (AN #54).

Cognitive Model Priors for Predicting Human Decisions (David D. Bourgin, Joshua C. Peterson et al) (summarized by Cody): Human decision making is notoriously difficult to predict, being a combination of expected value calculation and likely-not-fully-enumerated cognitive biases. Normally we could predict well using a neural net with a ton of data, but data about human decision making is expensive and scarce. This paper proposes that we pretrain a neural net on lots of data simulated from theoretical models of human decision making and then finetune on the small real dataset. In effect, we are using the theoretical model as a kind of prior that provides the neural net with a strong inductive bias. The method achieves better performance than existing theoretical or empirical methods, without requiring feature engineering, both on existing datasets and a new, larger dataset collected via Mechanical Turk.
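The pretrain-then-finetune recipe can be sketched in a few lines. This is only an illustration under loud assumptions: a logistic regression stands in for the paper's neural net, the "theoretical model" is a hypothetical expected-value rule, and the "real" dataset is a synthetic placeholder with some added risk aversion, not human data. The names `train`, `log_loss` and `sample` are made up for this sketch.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, data):
    """Mean logistic loss of weights w on (features, label) pairs."""
    total = 0.0
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x))
        margin = z if y == 1 else -z
        total += math.log1p(math.exp(-margin))
    return total / len(data)

def train(w, data, steps, lr):
    """Full-batch gradient descent on the logistic loss, starting from w."""
    w = list(w)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for i, xi in enumerate(x):
                grad[i] += err * xi / len(data)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

rng = random.Random(0)

def sample(n, chooser):
    """n choice problems with features (ev_diff, risk_diff, bias), labelled by chooser."""
    data = []
    for _ in range(n):
        ev_diff, risk_diff = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append(((ev_diff, risk_diff, 1.0), chooser(ev_diff, risk_diff)))
    return data

# Hypothetical theoretical model: pick option A exactly when its expected value is higher.
theory = lambda ev_diff, risk_diff: 1 if ev_diff > 0 else 0
# Synthetic stand-in for scarce human data: the same rule plus some risk aversion.
human = lambda ev_diff, risk_diff: 1 if ev_diff - 0.5 * risk_diff > 0 else 0

simulated = sample(1000, theory)  # cheap and plentiful
real = sample(20, human)          # expensive and scarce

w_pre = train([0.0, 0.0, 0.0], simulated, steps=100, lr=0.5)  # pretraining
w_fine = train(w_pre, real, steps=50, lr=0.5)                 # finetuning

loss_before = log_loss(w_pre, real)
loss_after = log_loss(w_fine, real)
```

The design point is that finetuning starts from `w_pre` rather than from zero, so the simulated data acts as a prior that the small real dataset only has to adjust.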

Cody’s opinion: I am a lit­tle cau­tious to make a strong state­ment about the im­por­tance of this pa­per, since I don’t have as much do­main knowl­edge in cog­ni­tive sci­ence as I do in ma­chine learn­ing, but over­all this “treat your the­o­ret­i­cal model like a gen­er­a­tive model and sam­ple from it” idea seems like an el­e­gant and plau­si­bly more broadly ex­ten­si­ble way of in­cor­po­rat­ing the­o­ret­i­cal pri­ors alongside real data.

Mis­cel­la­neous (Align­ment)

Self-confirming prophecies, and simplified Oracle designs (Stuart Armstrong): This post presents a toy environment to model self-confirming predictions by oracles, and demonstrates the results of running a deluded oracle (that doesn’t realize its predictions affect the world), a low-bandwidth oracle (that must choose from a small set of possible answers), a high-bandwidth oracle (that can choose from a large set of answers) and a counterfactual oracle (that chooses the correct answer, conditional on us not seeing the answer).

Ro­hin’s opinion: While this doesn’t men­tion AI ex­plic­itly, I think it’s use­ful to read any­way, be­cause of­ten which of the five con­cepts you use will af­fect what you think the im­por­tant risks are.

AI strat­egy and policy

AGI will drastically increase economies of scale (Wei Dai): Economies of scale would normally mean that companies would keep growing larger and larger. With human employees, the coordination costs grow superlinearly, which ends up limiting the size to which a company can grow. However, with the advent of AGI, many of these coordination costs will be removed. If we can align AGIs to particular humans, then a corporation run by AGIs aligned to a single human would at least avoid principal-agent costs. As a result, the economies of scale would dominate, and companies would grow much larger, leading to more centralization.
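One way to make the shape of this argument concrete is a toy model, which is an assumption of this sketch and does not appear in the post: suppose a firm of size n produces output a * n**alpha and pays coordination costs c * n**beta, with beta > alpha so that costs eventually outgrow output. Setting the derivative of profit to zero gives an optimal size n* = (a * alpha / (c * beta)) ** (1 / (beta - alpha)), which grows without bound as the cost coefficient c shrinks. All parameter values below are hypothetical.

```python
def profit(n, a, alpha, c, beta):
    """Toy profit: output a*n**alpha minus coordination cost c*n**beta."""
    return a * n ** alpha - c * n ** beta

def optimal_size(a, alpha, c, beta):
    """Size n* that maximizes profit when beta > alpha (derivative set to zero)."""
    return (a * alpha / (c * beta)) ** (1.0 / (beta - alpha))

# Hypothetical parameters: mild economies of scale, quadratic coordination costs.
params = dict(a=1.0, alpha=1.2, beta=2.0)

n_human = optimal_size(c=0.01, **params)   # substantial coordination costs
n_agi = optimal_size(c=0.0001, **params)   # coordination costs mostly removed
```

Under these assumptions `n_agi` is far larger than `n_human`, which mirrors the claim that removing coordination costs lets economies of scale dominate.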

Ro­hin’s opinion: This ar­gu­ment is quite com­pel­ling to me un­der the as­sump­tion of hu­man-level AGI sys­tems that can be in­tent-al­igned. Note though that while the de­vel­op­ment of AGI sys­tems re­moves prin­ci­pal-agent prob­lems, it doesn’t re­move is­sues that arise due to differ­ent agents hav­ing differ­ent (non-value-re­lated) in­for­ma­tion.

The ar­gu­ment prob­a­bly doesn’t hold with CAIS (AN #40), where each AI ser­vice is op­ti­mized for a par­tic­u­lar task, since there would be prin­ci­pal-agent prob­lems be­tween ser­vices.

It seems like the ar­gu­ment should mainly make us more wor­ried about sta­ble au­thor­i­tar­ian regimes: the main effect based on this ar­gu­ment is a cen­tral­iza­tion of power in the hands of the AGI’s over­seers. This is less likely to hap­pen with com­pa­nies, be­cause we have in­sti­tu­tions that pre­vent com­pa­nies from gain­ing too much power, though per­haps com­pe­ti­tion be­tween coun­tries could weaken such in­sti­tu­tions. It could hap­pen with gov­ern­ment, but if long-term gov­ern­men­tal power still rests with the peo­ple via democ­racy, that seems okay. So the risky situ­a­tion seems to be when the gov­ern­ment gains power, and the peo­ple no longer have effec­tive con­trol over gov­ern­ment. (This would in­clude sce­nar­ios with e.g. a gov­ern­ment that has suffi­ciently good AI-fueled pro­pa­ganda that they always win elec­tions, re­gard­less of whether their gov­ern­ing is ac­tu­ally good.)

Other progress in AI

Re­in­force­ment learning

Unsupervised State Representation Learning in Atari (Ankesh Anand, Evan Racah, Sherjil Ozair et al) (summarized by Cody): This paper has two main contributions: an actual technique for learning representations in an unsupervised way, and an Atari-specific interface for giving access to the underlying conceptual state of the game (e.g. the locations of agents, locations of small objects, current remaining lives, etc) by parsing out the RAM associated with each state. Since the notional goal of unsupervised representation learning is often to find representations that can capture conceptually important features of the state without having direct access to them, this supervision system allows for more meaningful evaluation of existing methods by asking how well conceptual features can be predicted by learned representation vectors. The object-level method of the paper centers around learning representations that capture information about temporal state dynamics, which they do by maximizing mutual information between representations at adjacent timesteps. More specifically, they have both a local version of this, where a given 1/16th patch of the image has a representation that is optimized to be predictive of that same patch’s next-timestep representation, and a local-global version, where the global representation is optimized to be predictive of representations of each patch. They argue this patch-level prediction makes their method better at learning concepts attached to small objects, and the empirical results do seem to support this interpretation.
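For readers unfamiliar with how "maximizing mutual information between representations at adjacent timesteps" is typically made trainable, a common approach is a contrastive (InfoNCE-style) loss: each representation at time t must pick out its own successor from among the successors of other batch elements. The sketch below assumes a plain dot-product score and flat vectors; it does not reproduce the paper's patch structure or its exact score function, and `info_nce` is a name chosen for this example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(z_t, z_next):
    """InfoNCE loss over a batch: each z_t[i] should score its own z_next[i]
    higher than the z_next[j] of the other batch elements (the negatives)."""
    n = len(z_t)
    total = 0.0
    for i in range(n):
        scores = [dot(z_t[i], z_next[j]) for j in range(n)]
        m = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_norm - scores[i]
    return total / n

# Representations that identify their successor vs. ones that carry no information.
aligned = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
uninformative = [[0.0, 0.0, 0.0]] * 3

loss_aligned = info_nce(aligned, aligned)
loss_chance = info_nce(uninformative, uninformative)
```

With a batch of size N, the quantity log N minus this loss is a lower bound on the mutual information, so `loss_chance` sits at log 3 (no information) while `loss_aligned` is close to zero.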

Cody’s opinion: The spe­cific method is an in­ter­est­ing mod­ifi­ca­tion of pre­vi­ous Con­trastive Pre­dic­tive Cod­ing work, but what I found most im­pres­sive about this pa­per was the en­g­ineer­ing work in­volved in pul­ling meta­data su­per­vi­sion sig­nals out of the game by read­ing com­ments on dis­assem­bled source code to see ex­actly how meta­data was be­ing stored in RAM. This seems to have the po­ten­tial of be­ing a use­ful bench­mark for Atari rep­re­sen­ta­tion learn­ing go­ing for­ward (though ad­mit­tedly Atari games are fairly con­cep­tu­ally straight­for­ward to be­gin with).

I have the op­po­site in­tu­ition re­gard­ing economies of scale and CAIS: I feel like it would hold, just to a lesser de­gree than to a uni­tary agent. The core of my in­tu­ition is that with differ­ent op­ti­mized AIs, it will be straight­for­ward to de­ter­mine ex­actly what the prin­ci­pal-agent prob­lem con­sists of, and this can be com­pen­sated for. I would go as far as to say that such a func­tion seems like a high-like­li­hood tar­get for mon­i­tor­ing AIs within CAIS, in broadly the same way we can do re­source op­ti­miza­tion now.

I sus­pect the limits of both types are prob­a­bly some­where north of the cur­rent size of the planet’s econ­omy, though.

“The core of my intuition is that with different optimized AIs, it will be straightforward to determine exactly what the principal-agent problem consists of, and this can be compensated for.”

I feel like it is not too hard to de­ter­mine prin­ci­pal-agent prob­lems with hu­mans ei­ther? It’s just hard to ad­e­quately com­pen­sate for them.

I part­way agree with this: it is much harder to com­pen­sate with peo­ple than to de­ter­mine what the prob­lem is.

The reason I still see determining the principal-agent problem as a hard problem with people is that we are highly inconsistent: a single AI is more consistent than a single person, and much more consistent than several people in succession (as is the case with any normal job).

My model for this is that determining what the problem is costs only slightly more for a person than for the AI, but you will have to repeat the process many times for a human position, probably about once per person who fills it.