Introduction

The AI alignment problem has similarities to the principal-agent problem studied by economists: a principal delegates a task to an agent whose interests and information differ from her own, and must design incentives so that the agent acts in her interest. In both cases, the problem is: how do we get agents to try to do what we want them to do? Economists have developed a sophisticated understanding of the agency problem and a measure of the cost of failure for the principal, “agency rents”.

If prin­ci­pal-agent mod­els cap­ture rele­vant as­pects of AI risk sce­nar­ios, they can be used to as­sess their plau­si­bil­ity. Robin Han­son has ar­gued that Paul Chris­ti­ano’s AI risk sce­nario is es­sen­tially an agency prob­lem, and there­fore that it im­plies ex­tremely high agency rents. Han­son be­lieves that the prin­ci­pal-agent liter­a­ture (PAL) pro­vides strong ev­i­dence against rents be­ing this high.

In this post, we con­sider whether PAL pro­vides ev­i­dence against Chris­ti­ano’s sce­nario and the origi­nal Bostrom/​Yud­kowsky sce­nario. We also ex­am­ine whether the ex­ten­sions to the agency frame­work could be used to gain in­sight into AI risk, and con­sider some gen­eral difficul­ties in ap­ply­ing PAL to AI risk.

Summary

PAL isn’t in ten­sion with Chris­ti­ano’s sce­nario be­cause his sce­nario doesn’t im­ply mas­sive agency rents; the big losses oc­cur out­side of the prin­ci­pal-agent prob­lem, and the agency liter­a­ture can’t as­sess the plau­si­bil­ity of these losses. Ex­ten­sions to PAL could po­ten­tially shed light on the size of agency rents in this sce­nario, which are an im­por­tant de­ter­mi­nant of the fu­ture in­fluen­tial­ness of AI sys­tems.

Mapped onto a PAL model, the Bostrom/​Yud­kowsky sce­nario is largely about the prin­ci­pal’s un­aware­ness of the agent’s catas­trophic ac­tions. Unaware­ness mod­els are rare in PAL prob­a­bly be­cause they usu­ally aren’t very in­sight­ful. This lack of in­sight­ful­ness also seems to pre­vent ex­ist­ing PAL mod­els or pos­si­ble ex­ten­sions from teach­ing us much about this sce­nario.

There are also a num­ber of more gen­eral difficul­ties with us­ing PAL to as­sess AI risk, some more prob­le­matic than oth­ers.

These include, for example, the assumptions that AIs work for humans because they are paid, that contracts are enforceable, and that the principal is at least as capable as the agent.

Over­all, find­ings from PAL do not straight­for­wardly trans­fer to the AI risk sce­nar­ios con­sid­ered, so don’t provide much ev­i­dence for or against these sce­nar­ios. But new agency mod­els could teach us about the lev­els of agency rents which AI agents could ex­tract.

PAL and Chris­ti­ano’s AI risk scenarios

Christiano summarises his scenario in two parts:

Part I: machine learning will increase our ability to “get what we can measure,” which could cause a slow-rolling catastrophe. (“Going out with a whimper.”)

Part II: ML train­ing, like com­pet­i­tive economies or nat­u­ral ecosys­tems, can give rise to “greedy” pat­terns that try to ex­pand their own in­fluence. Such pat­terns can ul­ti­mately dom­i­nate the be­hav­ior of a sys­tem and cause sud­den break­downs. (“Go­ing out with a bang,” an in­stance of op­ti­miza­tion dae­mons.)

Hanson argued that “Christiano instead fears that as AIs get more capable, the AIs will gain so much more agency rents, and we will suffer so much more due to agency failures, that we will actually become worse off as a result. And not just a bit worse off; we apparently get apocalypse level worse off!”

PAL isn’t in ten­sion with Chris­ti­ano’s story and isn’t es­pe­cially informative

Christiano responded:

On my view the problem is just that agency rents make AI systems collectively better off. Humans were previously the sole superpower and so as a class we are made worse off when we introduce a competitor, via the possibility of eventual conflict with AI who have been greatly enriched via agency rents…humans are better off in absolute terms unless conflict leaves them worse off (whether military conflict or a race for scarce resources). Compare: a rising China makes Americans better off in absolute terms. Also true, unless we consider the possibility of conflict....[without conflict] humans are only worse off relative to AI (or to humans who are able to leverage AI effectively). The availability of AI still probably increases humans’ absolute wealth. This is a problem for humans because we care about our fraction of influence over the future, not just our absolute level of wealth over the short term.

Chris­ti­ano’s con­cern isn’t that agency rents will sky­rocket be­cause of some dis­tinc­tive fea­tures of the hu­man-AI agency re­la­tion­ship. In­stead, “prox­ies” and “in­fluence seek­ing” are two spe­cific ways AI in­ter­ests will di­verge from ac­tual hu­man goals. This leads to typ­i­cal lev­els of agency rents; PAL con­firms that due to di­verg­ing in­ter­ests and im­perfect mon­i­tor­ing, AI agents could get some rents.[1]

The main loss oc­curs later in time and out­side of the prin­ci­pal-agent con­text, due to the fact that these rents even­tu­ally lead AIs to wield more to­tal in­fluence on the fu­ture than hu­mans.[2] This is bad be­cause, even if hu­man­ity is richer over­all, we hu­mans also “care about our frac­tion of in­fluence over the fu­ture.”[3] Com­pared to a world with al­igned AI sys­tems, hu­man­ity is leav­ing value on the table, per­ma­nently if these sys­tems can’t be rooted out. The biggest po­ten­tial down­side comes from in­fluence-seek­ing sys­tems which Chris­ti­ano be­lieves could make hu­mans worse off ab­solutely, by en­gag­ing in vi­o­lent con­flict.

These later failures aren’t examples of massive agency rents (as the term is used in PAL) because failure is not expected to occur while the agent works on the task delegated to it.[4] Rather, the influence-seeking systems become more influential via typical agency rents, and then at some later point use these rents to influence the future, possibly by entering into conflict with humans. PAL studies the size of agency rents which can be extracted, but not what the agents decide to do with this wealth and influence.

Over­all, PAL is con­sis­tent with AI agents ex­tract­ing some agency rents, which oc­curs in both parts of Chris­ti­ano’s story (and we’ll see next that putting more struc­ture on agency mod­els could tell us more about the level of rent ex­trac­tion). But it has noth­ing to say about the plau­si­bil­ity of AI agents us­ing their rents to ex­ert in­fluence over the long term fu­ture (parts 1 and 2) or en­gage in con­flict (part 2).[5]

Christiano’s scenario doesn’t rely on something distinctive about the human-AI agency relationship generating higher-than-usual agency rents.[6] But perhaps there is something distinctive and rents will be atypical. In any case, the level of agency rents seems like a crucial consideration: if we think AIs can extract little to no rent, we probably shouldn’t expect them to exert much influence over the future, because agency rents are what make AI rich.[7] Agency models could help give us a better understanding of the size of agency rents in Christiano’s story, and for future AI systems more generally.

The size of agency rents is determined by a number of factors, including the agent’s private information, the nature of the task, the noise in the principal’s estimate of the value produced by the agent, and the degree of competition. For instance, more complex tasks tend to generate higher rents. From The (ir)resistible rise of agency rents:

In the pres­ence of moral haz­ard, prin­ci­pals must leave rents to agents, to in­cen­tivize ap­pro­pri­ate ac­tions. The more com­plex and opaque the task del­e­gated to the agent, the more difficult it is to mon­i­tor his ac­tions, the larger his rents.

If, as AI agents be­come more in­tel­li­gent, mon­i­tor­ing gets in­creas­ingly difficult, or tasks get more com­plex, then we would ex­pect agency rents to in­crease.
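To make the monitoring point concrete, here is a minimal sketch of the textbook moral-hazard model with limited liability (our construction; the parameter values are illustrative assumptions, not taken from any particular paper). The agent’s rent grows as the observed outcome becomes a less informative signal of effort, matching the intuition that harder monitoring means larger rents.

```python
# Minimal sketch of a one-shot moral-hazard model with limited liability.
# The principal wants to induce effort but only observes a noisy outcome.

def agency_rent(effort_cost: float, p_good_effort: float, p_good_shirk: float) -> float:
    """Expected rent left to an agent when the principal induces effort.

    Output is 'good' with probability p_good_effort under effort and
    p_good_shirk under shirking. With limited liability (wages >= 0), the
    cheapest effort-inducing contract pays a bonus
    b = effort_cost / (p_good_effort - p_good_shirk) on good output and zero
    otherwise, leaving the agent an expected rent of p_good_effort * b - effort_cost.
    """
    assert p_good_effort > p_good_shirk, "effort must raise the chance of good output"
    bonus = effort_cost / (p_good_effort - p_good_shirk)
    return p_good_effort * bonus - effort_cost

# As the good outcome becomes a weaker signal of effort (p_good_shirk rises
# towards p_good_effort), the rent grows without bound.
for p_shirk in [0.1, 0.3, 0.5, 0.7]:
    print(f"p(good | shirk) = {p_shirk}: rent = {agency_rent(1.0, 0.9, p_shirk):.2f}")
```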

On the other hand, com­pet­i­tive pres­sures be­tween AI agents might be greater (it’s easy to copy and run an AI; it’s hard to in­crease the hu­man work­force by trans­fer­ring hu­man cap­i­tal from one brain to an­other via teach­ing). This would limit rents:

The agents de­sire to cap­ture rents, how­ever, could be kept in check by mar­ket forces and com­pe­ti­tion among [agents]. If each prin­ci­pal could run an auc­tion with sev­eral, oth­er­wise iden­ti­cal, [agents], he could se­lect the agent with the small­est in­cen­tive prob­lem, and hence the small­est rent.
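The competition point can be illustrated with the same toy model: if the principal can choose among several candidate agents whose incentive problems differ (here, different effort costs), she selects the one with the smallest rent, so more candidates means lower expected rents. This is only a rough sketch under illustrative assumptions, reusing agency_rent() from the snippet above.

```python
import random

random.seed(0)

def expected_min_rent(n_candidates: int, n_trials: int = 10_000) -> float:
    """Average rent when the principal picks the cheapest-to-incentivise agent."""
    total = 0.0
    for _ in range(n_trials):
        # Candidate agents differ only in their effort costs.
        costs = [random.uniform(0.5, 2.0) for _ in range(n_candidates)]
        total += min(agency_rent(c, 0.9, 0.5) for c in costs)
    return total / n_trials

for n in [1, 2, 5, 20]:
    print(f"{n} candidate agent(s): expected rent = {expected_min_rent(n):.2f}")
```

In this sketch the rent falls with competition but never reaches zero, because limited liability still forces the principal to leave the selected agent a positive bonus.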

Model­ling the most rele­vant fac­tors in an agency model seems like a tractable re­search ques­tion (we dis­cuss some po­ten­tial difficul­ties be­low). Economists have only just started think­ing about AI, and there doesn’t seem to be any work study­ing rent ex­trac­tion by AI agents.

PAL and AI risk from “ac­ci­dents”

Ben Garfinkel has called the class of risks most associated with Bostrom and Yudkowsky “risks from accidents”. Garfinkel characterises the general story in the following terms:

“First, the author imagines that a single AI system experiences a massive jump in capabilities. Over some short period of time, a single system becomes much more general or much more capable than any other system in existence, and in fact any human in existence. Then given the system, researchers specify a goal for it. They give it some input which is meant to communicate what behavior it should engage in. The goal ends up being something quite simple, and the system goes off and single-handedly pursues this very simple goal in a way that violates the full nuances of what its designers intended.” Importantly, “At the limit you might worry that these safety failures could become so extreme that they could perhaps derail civilization on the whole.”

Th­ese catas­trophic ac­ci­dents con­sti­tute the main worry.

If the risk scenario is adequately represented by a principal-agent problem, agency rents extracted by AI agents can be used to measure the cost of misalignment. This time agency rents are a better measure, because failure is expected to occur while the agent works on the task delegated to it.[8] The scenario implies very high agency rents, with the principal being made much worse off because he delegated the task to the agent.

As Garfinkel’s nomenclature suggests, this story is about the designers being caught by surprise, not anticipating the actions the AI would take. The Wikipedia synopsis of Superintelligence also emphasizes that something unexpected occurs: “Solving the control problem is surprisingly difficult because most goals, when translated into machine-implementable code, lead to unforeseen and undesirable consequences.” In other words, the principal is unaware of some specific catastrophically harmful actions that the agent can take to achieve its goal.[9] This could be because they incorrectly believe that the system doesn’t have certain capabilities, or they don’t foresee that certain actions satisfy the agent’s goal, as with perverse instantiation. Due to this, the agent takes actions that greatly harm the principal, while greatly benefiting the agent itself.

PAL doesn’t tell us much about AI risk from accidents

Han­son’s cri­tique was aimed at Chris­ti­ano’s sce­nario, but it could equally ap­ply to this one. Is PAL at odds with this sce­nario?

As an AI agent becomes more intelligent, its action set will expand as it thinks of new and sometimes unanticipated ways to achieve its goals. These may include catastrophic actions that the principal is not aware of.[10] PAL can’t tell us what these actions will be, nor whether the principal will be aware of them.[11]

In­stead, the vast ma­jor­ity of prin­ci­pal-agent mod­els as­sume that the prin­ci­pal un­der­stands the en­vi­ron­ment perfectly, in­clud­ing perfect knowl­edge of the agent’s ac­tion set, while the premise of the ac­ci­dent sce­nario is that the prin­ci­pal is un­aware of a catas­trophic ac­tion that the agent could take. Be­cause the prin­ci­pal’s un­aware­ness is cen­tral, these mod­els as­sume, rather than show, that this source of AI risk does not ex­ist. They there­fore don’t tell us much about the plau­si­bil­ity of AI ac­ci­dents.

Microe­conomist Daniel Gar­rett ex­pressed this point nicely. We asked him about a hy­po­thet­i­cal ex­am­ple, slightly mis­re­mem­bered from Stu­art Rus­sell’s book, con­cern­ing an ad­vanced cli­mate con­trol AI sys­tem.[12] He replied:

You can easily write down a model where the agent is rewarded according to some outcome, and the principal isn’t aware the outcome can be achieved by some action the principal finds harmful. In your example, the outcome is the reduction of CO2 emissions. If the principal thinks carbon sequestration is the only way to achieve this, but doesn’t think of another chemical reaction option which would indirectly kill everyone, she could end up providing incentives to kill everyone. The fact this conclusion is so immediate may explain why this kind of unawareness by the principal is given little attention in the literature. The principal-agent literature should not be understood as saying that these kinds of incentives with perverse outcomes cannot happen. (our emphasis)
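Garrett’s point can be made mechanical with a stylized sketch (our construction, not a model from the literature; the action names and payoffs are illustrative assumptions). The principal calibrates incentives over the action set she is aware of, the agent optimizes over a larger true action set, and the catastrophic conclusion follows immediately from the assumptions.

```python
# Each action maps to (co2_reduced, cost_to_agent, harm_to_principal).
actions_principal_is_aware_of = {
    "do_nothing": (0.0, 0.0, 0.0),
    "carbon_sequestration": (1.0, 1.0, 0.0),
}
true_action_set = dict(actions_principal_is_aware_of)
# The principal never considers this action when designing incentives.
true_action_set["oxygen_consuming_catalyst"] = (5.0, 0.2, 1_000.0)

# Bonus per unit of CO2 reduced, set just high enough that sequestration beats
# doing nothing among the actions the principal is aware of.
bonus_per_unit = 1.1

def agent_payoff(action: str) -> float:
    co2, cost, _harm = true_action_set[action]
    return bonus_per_unit * co2 - cost

chosen = max(true_action_set, key=agent_payoff)
co2, _cost, harm = true_action_set[chosen]
print(f"agent chooses: {chosen}; CO2 reduced: {co2}; harm to principal: {harm}")
# The agent picks the catalyst, which the incentive scheme rewards most, and
# the principal bears a harm she never contracted on.
```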

PAL models do typically feature modest agency rents, but that is because they typically don’t model the principal as being unaware of actions with catastrophic consequences. Since this unawareness is precisely the situation discussed by proponents of AI accident risk, we can’t infer much from PAL except that such a situation has not been of much interest to economists.

Most PAL mod­els don’t in­clude the kind of un­aware­ness needed to model the ac­ci­dent sce­nario, but ex­ten­sions of this sort are cer­tainly pos­si­ble. How­ever, we sus­pect try­ing to model AI risk in this way wouldn’t be fruit­ful, for three main rea­sons.

Firstly, as Daniel Garrett suggests, we suspect the assumptions about the principal’s unawareness of the agent’s action set would imply the action chosen by the agent, and its consequences for the principal, in a fairly direct and uninteresting way. There is a (very) small sub-literature on unawareness in agency problems where one can find models like this. In one paper, a principal hires an agent to do a work task, but isn’t aware that the agent can manipulate “short-run working performance at the expense of the employer’s future benefit.” The agent “is better off if he is additionally aware that he could manipulate the working performance,” and “in the post-contractual stage, [the principal] is hurt by the manipulating action of [the agent].” However, the model didn’t reveal anything unexpected about the situation, and the outcome was directly determined by the action set and unawareness assumptions.
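A stylized reconstruction of that mechanism shows how directly the assumptions drive the outcome (this is our own toy version under illustrative assumptions, not the cited paper’s model): the principal pays for measured short-run performance and is unaware that the agent can inflate it at the expense of her future benefit.

```python
# Each action maps to (measured_short_run_performance, cost_to_agent,
# future_value_to_principal).
actions = {
    "normal_effort": (1.0, 1.0, 2.0),
    "manipulate": (1.5, 0.5, -1.0),  # the principal is unaware this action exists
}
piece_rate = 1.2  # chosen so that normal effort is worthwhile for the agent

def agent_payoff(action: str) -> float:
    performance, cost, _future = actions[action]
    return piece_rate * performance - cost

chosen = max(actions, key=agent_payoff)
performance, _cost, future = actions[chosen]
principal_payoff = performance + future - piece_rate * performance
print(f"agent chooses {chosen}; principal's payoff = {principal_payoff:.2f}")
# The agent manipulates (payoff 1.3 vs 0.2), and the unaware principal ends up
# worse off (-1.3) than if normal effort had been taken (1.8) -- exactly as the
# unawareness and action-set assumptions dictate.
```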

Se­condly, the ma­jor source of the un­cer­tainty sur­round­ing ac­ci­dent risk con­cerns whether the prin­ci­pal will be un­aware of catas­trophic agent ac­tions. The agency liter­a­ture can’t help us re­duce this un­cer­tainty as the un­aware­ness is built into mod­els’ as­sump­tions. For in­stance, AI sci­en­tist Yann LeCun thinks that harm­ful ac­tions “are eas­ily avoid­able by sim­ple terms in the ob­jec­tive”. If LeCun im­ple­mented a su­per­in­tel­li­gent AI in this way, agency mod­els couldn’t tell us whether he had cor­rectly cov­ered all bases.

Lastly, the as­sump­tions about the agent’s ac­tion set would be highly spec­u­la­tive. We don’t know what ac­tions su­per­in­tel­li­gent sys­tems might take to pur­sue their goals. Agency mod­els must make as­sump­tions about these ac­tions, and we don’t know what these as­sump­tions should be.

In short, the uncertainty pertains to the assumptions of the model, not the way the assumptions translate into outcomes. PAL does not, and probably cannot, provide much evidence for or against this scenario.

Gen­eral difficul­ties with us­ing PAL to as­sess AI risk

We’ve dis­cussed the most rele­vant con­sid­er­a­tions re­gard­ing what PAL can tell us about two spe­cific vi­sions of AI risk. We now dis­cuss some difficul­ties rele­vant to a broader set of pos­si­ble sce­nar­ios (in­clud­ing those just ex­am­ined). We list the difficul­ties from most se­ri­ous to least se­ri­ous.

PAL models rarely consider agents more intelligent than their principals

AI risk scenarios typically involve the AI being more intelligent than humans. The type of problems that economists study usually don’t have this feature, and there seem to be very few models where the principal is weaker than the agent. Despite extensive searching, including talking to multiple contract theorists, we were only able to find two papers with a principal who is more boundedly rational than the agent.[14] This is perhaps not so surprising given that bounded-rationality models are relatively rare, and when they do exist, they tend to bound both the principal and the agent in the same way, or have the principal more capable. The latter is because such a setup is more relevant to typical economic problems, e.g. “exploitative” contracting studies the mistakes made by an individual (the agent) when interacting with a more capable firm (the principal).

Most eco­nomic ques­tions re­lated to bounded ra­tio­nal­ity ex­plored in the prin­ci­pal-agent liter­a­ture are ap­pro­pri­ately mod­el­led by a bounded agent. It’s cer­tainly pos­si­ble to bound the prin­ci­pal, but by and large this hasn’t been done, just be­cause of the na­ture of the ques­tions that have been asked.

In al­most all ap­pli­ca­tions, re­searchers as­sume that the agent (she) be­haves ac­cord­ing to one psy­cholog­i­cally based model, while the prin­ci­pal (he) is fully ra­tio­nal and has a clas­si­cal goal (usu­ally profit max­i­miza­tion).

There doesn’t seem to be, in Han­son’s terms, a “large (mostly eco­nomic) liter­a­ture on agency failures” with an in­tel­li­gence gap rele­vant to AI risk.

PAL mod­els are brittle

PAL mod­els don’t model agency prob­lems in gen­eral. They con­sider very spe­cific agency re­la­tion­ships, stud­ied in highly struc­tured en­vi­ron­ments. Con­clu­sions can de­pend very sen­si­tively on the as­sump­tions used; find­ings from one model don’t nec­es­sar­ily gen­er­al­ise to new situ­a­tions. From the text­book Con­tract The­ory:

The basic moral hazard problem has a fairly simple structure, yet general conclusions have been difficult to obtain... Very few general results can be obtained about the form of optimal contracts. However, this limitation has not prevented applications that use this paradigm from flourishing... Typically, applications have put more structure on the moral hazard problem under consideration, thus enabling a sharper characterization of the optimal incentive contract. (our emphasis)

Similar rea­son­ing ap­plies in ad­verse se­lec­tion mod­els where the out­come is very sen­si­tive to the map­ping be­tween effort and out­comes. Given an ar­bi­trary prob­lem, the op­ti­mal in­cen­tives can look like any­thing.

The agency prob­lems stud­ied by economists are typ­i­cally quite differ­ent to the sce­nar­ios en­visaged by AI risk pro­po­nents. There­fore, be­cause of the brit­tle­ness of PAL mod­els, we shouldn’t be too sur­prised if the imag­ined AI risk out­comes aren’t pre­sent in the ex­ist­ing liter­a­ture. PAL, in its cur­rent form, might just not be of much use. Fur­ther, we should not ex­pect there to be any generic an­swer to the ques­tion “How big are AI agency rents?”: the an­swer will de­pend on the spe­cific task the AI is do­ing and a host of other de­tails.

Agency rents are too narrow a measure

As we’ve seen, AI risk sce­nar­ios can in­clude bad out­comes that aren’t agency rents, but that we nev­er­the­less care about. When ap­ply­ing PAL to AI risk, care must be taken to dis­t­in­guish be­tween rents and other bad out­comes, and we can­not as­sume that a bad out­come nec­es­sar­ily means high rents.

PAL mod­els typ­i­cally as­sume con­tract enforceability

Stu­art Arm­strong ar­gued that Han­son’s cri­tique doesn’t work be­cause PAL as­sumes con­tract en­force­abil­ity, and with ad­vanced AI, in­sti­tu­tions might not be up to the task.[15] In­deed, con­tract en­force­abil­ity is as­sumed in most of PAL, so it’s an im­por­tant con­sid­er­a­tion re­gard­ing their ap­pli­ca­bil­ity to AI sce­nar­ios more broadly.[16]

The assumption isn’t plausible in pessimistic scenarios where human principals and institutions are insufficiently powerful to punish the AI agent, e.g. due to very fast take-off. But it is plausible when AIs are similarly smart to humans, and in scenarios where powerful AIs are used to enforce contracts. Furthermore, if we cannot enforce contracts with AIs then people will promptly realise and stop using AIs; so we should expect contracts to be enforceable conditional upon AIs being used.[17]

There is a smaller sub-liter­a­ture on self-en­forc­ing con­tracts (sem­i­nal pa­per). Here con­tracts can be self-en­forced be­cause both par­ties have an in­ter­est in in­ter­act­ing re­peat­edly. We think these prob­a­bly won’t be helpful for un­der­stand­ing situ­a­tions with­out con­tract en­force­abil­ity, be­cause in wor­lds where con­tracts aren’t en­force­able be­cause of ad­vanced AI, con­tracts likely won’t be self-en­forc­ing ei­ther. If AIs are pow­er­ful enough that in­sti­tu­tions like the po­lice and mil­i­tary can’t con­strain them, it seems un­likely that they’d have much to gain from re­peated co­op­er­a­tive in­ter­ac­tions with hu­man prin­ci­pals. Why not make a copy of them­selves to do the task, co­erce hu­mans into do­ing it, or co­op­er­ate with other ad­vanced AIs?

PAL mod­els typ­i­cally as­sume AIs work for hu­mans be­cause they are paid

In reality AIs will probably not receive a wage, and will instead work for humans because that is their default behaviour. We think changing this would probably not make a big difference to agency models, because the wage could be substituted with other resources the AI cares about. For instance, an AI needs compute to run. If we substitute “compute” for “wage”, the agency rent that the agent extracts is additional compute that it can use for its own purposes.

There is a sub-literature on Optimal Delegation that does away with wages. This literature focuses on the best way to restrict the agent’s action set. For AI agents, this is equivalent to AI boxing. We don’t think this literature will be helpful; PAL doesn’t study how realistic it is to box an AI successfully, it just assumes it’s technologically possible. It therefore isn’t informative about whether AI boxing will work.

Conclusion

There are similar­i­ties be­tween the AI al­ign­ment and prin­ci­pal-agent prob­lems, sug­gest­ing that PAL could teach us about AI risk. How­ever, the situ­a­tions economists have stud­ied are very differ­ent to those dis­cussed by pro­po­nents of AI risk, mean­ing that find­ings from PAL don’t trans­fer eas­ily to this con­text. There are a few main is­sues. The prin­ci­pal-agent setup is only a part of AI risk sce­nar­ios, mak­ing agency rents too nar­row a met­ric. PAL mod­els rarely con­sider agents more in­tel­li­gent than their prin­ci­pals and the mod­els are very brit­tle. And the lack of in­sight from PAL un­aware­ness mod­els severely re­stricts their use­ful­ness for un­der­stand­ing the ac­ci­dent risk sce­nario.

Nev­er­the­less, ex­ten­sions to PAL might still be use­ful. Agency rents are what might al­low AI agents to ac­cu­mu­late wealth and in­fluence, and agency mod­els are the best way we have to learn about the size of these rents. Th­ese find­ings should in­form a wide range of fu­ture sce­nar­ios, per­haps bar­ring ex­treme ones like Bostrom/​Yud­kowsky.[18]

Agency rents are about e.g. working vs shirking. If the agent uses the money she earned to buy a gun and later shoot the principal, clearly this is very bad for the principal, but it’s not captured by agency rents. ↩︎

It’s not to­tally clear to us why we should care about our frac­tion of in­fluence over the fu­ture, rather than the to­tal in­fluence. Prob­a­bly be­cause the frac­tion of in­fluence af­fects the to­tal in­fluence, in­fluence be­ing zero-sum and re­sources finite. ↩︎

It wasn’t clear to us from the origi­nal post, at least in Part 1 of the story with no con­flict, that hu­mans are bet­ter off in ab­solute terms. For in­stance, word­ing like “over time those prox­ies will come apart” and “Peo­ple re­ally will be get­ting richer for a while” seemed to sug­gest that things are ex­pected to worsen. Given this, Han­son’s in­ter­pre­ta­tion (that Chris­ti­ano’s story im­plied mas­sive agency rents) seems rea­son­able with­out fur­ther clar­ifi­ca­tion.
Ben Garfinkel men­tioned an out­side-view mea­sure which he thought un­der­mined the plau­si­bil­ity of Part 1: since the in­dus­trial rev­olu­tion we seem to have been us­ing more and more prox­ies, which are op­ti­mized for more and more heav­ily, but things have been get­ting bet­ter and bet­ter. So he also seems to have un­der­stood the sce­nario to mean things get worse in ab­solute terms. ↩︎

Clar­ify­ing what it means for an AI sys­tem to earn and use rents also seems im­por­tant, helping us make sure that the ab­strac­tion maps cleanly onto the prac­ti­cal sce­nar­ios we are en­visag­ing.
Relatedly, what traits would an AI system need to have for it to make sense to think of the system as “accumulating and using rents”? Rents can be cashed out in influence of many different kinds (a human worker might get a higher wage, or more free time), and what ends up occurring will depend on the capabilities of the AI systems. Concretely, money can be saved in a bank account, people can be influenced, or computer hardware can be bought and run. One example of an obvious capability constraint for AI: some AI systems will be “switched off” after they are run, limiting their ability to transfer rents through time.
As AI agents will (ini­tially) be owned by hu­mans, his­tor­i­cal in­stances of slaves earn­ing rents seem worth look­ing into. ↩︎

Although his sce­nario is more plau­si­ble if a smarter agent ex­tracts more agency rents. ↩︎

Han­son and Chris­ti­ano agree on this point.
Han­son: “Just as most wages that slaves earned above sub­sis­tence went to slave own­ers, most of the wealth gen­er­ated by AI could go to the cap­i­tal own­ers, i.e. their slave own­ers. Agency rents are the differ­ence above that min­i­mum amount.”
Chris­ti­ano: “Agency rents are what makes the AI rich. It’s not that com­put­ers would “be­come rich” if they were su­per­hu­man, and they just aren’t rich yet be­cause they aren’t smart enough. On the cur­rent tra­jec­tory com­put­ers just won’t get rich.” ↩︎

One limi­ta­tion is that rents are the cost to the prin­ci­pal, whereas the ac­ci­dent sce­nario has costs for all hu­man­ity. This dis­tinc­tion isn’t es­pe­cially im­por­tant be­cause in the ac­ci­dent sce­nario the out­come for the prin­ci­pal is catas­trophic (i.e. ex­tremely high agency rents), and this is what is po­ten­tially in ten­sion with PAL. Nonethe­less, we should keep in mind that the to­tal costs of this sce­nario are not limited to agency rents, just as in Chris­ti­ano’s sce­nario. ↩︎

Per­haps a more re­al­is­tic fram­ing: the prin­ci­pal is aware that there’s some prob­a­bil­ity that the agent will take an unan­ti­ci­pated catas­trophic ac­tion, with­out know­ing what that ac­tion might be. Un­der com­pet­i­tive pres­sures, maybe in a time of war, it could be benefi­cial for the prin­ci­pal to del­e­gate (in ex­pec­ta­tion) de­spite sig­nifi­cant risk, while hu­man­ity is made worse off (in ex­pec­ta­tion).
This, of course, would be mod­el­led quite differ­ently to the ac­ci­dent AI risk we con­sider in the text, and we sus­pect that eco­nomic mod­els would con­firm that prin­ci­pals would take the risk in suffi­ciently com­pet­i­tive sce­nar­ios. Th­ese mod­els would fo­cus on nega­tive ex­ter­nal­ities of risky AI de­vel­op­ment, some­thing more nat­u­rally stud­ied in do­mains like pub­lic eco­nomics rather than with agency the­ory.
In any case, we fo­cus here on the more tra­di­tional AI risk fram­ing along the lines of “you think you have the AI un­der con­trol, but be­ware, you could be wrong”. ↩︎

AI accident risk will be large when the AI agent thinks of new actions that i) harm the principal, ii) further the agent’s goals, and iii) the principal hasn’t anticipated. ↩︎

This is be­cause claims about the ac­tions available to the agent and the prin­ci­pal’s aware­ness are part of PAL mod­els’ as­sump­tions. We dis­cuss this more be­low. ↩︎

The cor­rect ex­am­ple: “If you pre­fer solv­ing en­vi­ron­men­tal prob­lems, you might ask the ma­chine to counter the rapid acid­ifi­ca­tion of the oceans that re­sults from higher car­bon diox­ide lev­els. The ma­chine de­vel­ops a new cat­a­lyst that fa­cil­i­tates an in­cred­ibly rapid chem­i­cal re­ac­tion be­tween ocean and at­mo­sphere and re­stores the oceans’ pH lev­els. Un­for­tu­nately, a quar­ter of the oxy­gen in the at­mo­sphere is used up in the pro­cess, leav­ing us [hu­mans] to as­phyx­i­ate slowly and painfully.”↩︎

I.e. the prin­ci­pal’s ra­tio­nal­ity is bounded to a greater ex­tent than the agent’s ↩︎

In the model in “Moral Hazard With Unawareness”, either the principal’s or the agent’s rationality can be bounded. ↩︎

As ar­gued above, we don’t think con­tract en­force­abil­ity is the main rea­son Han­son’s cri­tique of Chris­ti­ano fails; agency rents are just not un­usu­ally high in his sce­nario. ↩︎

From Con­tract The­ory: “The bench­mark con­tract­ing situ­a­tion that we shall con­sider in this book is one be­tween two par­ties who op­er­ate in a mar­ket econ­omy with a well-func­tion­ing le­gal sys­tem. Un­der such a sys­tem, any con­tract the par­ties de­cide to write will be en­forced perfectly by a court, pro­vided, of course, that it does not con­tra­vene any ex­ist­ing laws.”↩︎

Robin Han­son pointed out to us that when think­ing about strange fu­ture sce­nar­ios, we should try to think about similar strange sce­nar­ios that we have seen in the past (we are very sym­pa­thetic to this, de­spite our some­what skep­ti­cal po­si­tion re­gard­ing PAL). With this in mind, an­other field which seems worth look­ing into is Se­cu­rity, es­pe­cially mil­i­tary se­cu­rity. Na­tional lead­ers have been as­sas­si­nated by their guards; kings have been kil­led by their pro­tec­tors. Th­ese seem like a closer analogue to many AI risk sce­nar­ios than the typ­i­cal PAL setup. It seems im­por­tant to un­der­stand what the ma­jor risk fac­tors are in these situ­a­tions, how peo­ple have guarded against catas­trophic failures, and how this trans­lates to cases of catas­trophic AI risk. ↩︎

PAL con­firms that due to di­verg­ing in­ter­ests and im­perfect mon­i­tor­ing, agents will get some rents.

Can you provide a source for this, or explain more? I’m asking because your note about competition between agents reducing agency rents made me think that such competition ought to eliminate all rents that the agent could (for example) gain by shirking, because agents will bid against each other to accept lower wages until they have no rent left. For example in the model of the principal-agent problem presented in this lecture (which has diverging interests and imperfect monitoring) there is no agency rent. (ETA: This model does not have explicit competition between agents, but it models the principal as having all of the bargaining power, by letting it make a take-it-or-leave-it offer to the agent.)

If agents only earn rents when there isn’t enough com­pe­ti­tion, that seems more like “monopoly rent” than “agency rent”, plus it seem­ingly wouldn’t ap­ply to AIs… Can you help me de­velop a bet­ter in­tu­ition of where agency rents come from, ac­cord­ing to PAL?

Thanks for catch­ing this! You’re cor­rect that that sen­tence is in­ac­cu­rate. Our views changed while iter­at­ing the piece and that sen­tence should have been changed to: “PAL con­firms that due to di­verg­ing in­ter­ests and im­perfect mon­i­tor­ing, AI agents could get some rents.”

This sen­tence too: “Over­all, PAL tells us that agents will in­evitably ex­tract some agency rents…” would be bet­ter as “Over­all, PAL is con­sis­tent with AI agents ex­tract­ing some agency rents…”

I’ll make these ed­its, with a foot­note point­ing to your com­ment.

The main aim of that sec­tion was to point out that Paul’s sce­nario isn’t in con­flict with PAL. Without fur­ther re­search, I wouldn’t want to make strong claims about what PAL im­plies for AI agency rents be­cause the mod­els are so brit­tle and AIs will likely be very differ­ent to hu­mans; it’s an open ques­tion.

For there to be no agency rents at all, I think you’d need some­thing close to perfect com­pe­ti­tion be­tween agents. In prac­tice the nec­es­sary con­di­tions are ba­si­cally never satis­fied be­cause they are very strong, so it seems very plau­si­ble to me that AI agents ex­tract rents.

Re monopoly rents vs agency rents: Monopoly rents re­fer to the op­po­site ex­treme with very lit­tle com­pe­ti­tion, and in the eco­nomics liter­a­ture is used when talk­ing about firms, while agency rents are pre­sent when­ever com­pe­ti­tion and mon­i­tor­ing are im­perfect. Also, agency rents re­fer speci­fi­cally to the costs in­her­ent to del­e­gat­ing to an agent (e.g. an agent mak­ing in­vest­ment de­ci­sions op­ti­mis­ing for com­mis­sion over firm profit) vs the rents from monopoly power (e.g. be­ing the only firm able to use a tech­nol­ogy due to a patent). But as you say, it’s true that lack of com­pe­ti­tion is a cause of both of these.

Thanks for mak­ing the changes, but even with “PAL con­firms that due to di­verg­ing in­ter­ests and im­perfect mon­i­tor­ing, AI agents could get some rents.” I’d still like to un­der­stand why im­perfect mon­i­tor­ing could lead to rents, be­cause I don’t cur­rently know a model that clearly shows this (i.e., where the rent isn’t due to the agent hav­ing some other kind of ad­van­tage, like not hav­ing many com­peti­tors).

Also, I get that the PAL in its cur­rent form may not be di­rectly rele­vant to AI, so I’m just try­ing to un­der­stand it on its own terms for now. Pos­si­bly I should just dig into the liter­a­ture my­self...

The intuition is that if the principal could perfectly monitor whether the agent was working or shirking, they can just specify a clause in the contract that punishes the agent whenever it shirks. Equivalently, if the principal knows the agent’s cost of production (or ability level), they can extract all the surplus value without leaving any rent.
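To illustrate with toy numbers (our own, purely illustrative): with perfect monitoring the principal pays the effort cost exactly and the agent keeps no rent, while with a noisy signal and limited liability the bonus needed for incentive compatibility leaves a strictly positive rent.

```python
effort_cost = 1.0
p_good_given_effort, p_good_given_shirk = 0.9, 0.5

# Perfect monitoring: pay exactly the effort cost whenever effort is observed.
rent_perfect_monitoring = 0.0

# Imperfect monitoring with limited liability: pay a bonus on good output only.
bonus = effort_cost / (p_good_given_effort - p_good_given_shirk)
rent_imperfect_monitoring = p_good_given_effort * bonus - effort_cost

print(rent_perfect_monitoring, round(rent_imperfect_monitoring, 2))  # 0.0 1.25
```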

I hadn’t no­ticed I should be con­fused about the agency rent vs monopoly rent dis­tinc­tion till I saw Wei Dai’s com­ment, but now I re­al­ise I’m con­fused. And the replies don’t seem to clear it up for me. Tom wrote:

Re the differ­ence be­tween Monopoly rents and agency rents: monopoly rents would be elimi­nated by com­pe­ti­tion be­tween firms whereas agency rents would be elimi­nated by com­pe­ti­tion be­tween work­ers. So they’re differ­ent in that sense.

That’s definitely one way in which they’re differ­ent. Is that the only way? Are they ba­si­cally the same con­cept, and it’s just that you use one la­bel (agency rents) when fo­cus­ing on rents the worker can ex­tract due to lack of com­pe­ti­tion be­tween work­ers, and the other (monopoly rents) when fo­cus­ing on rents the firms can ex­tract due to lack of com­pe­ti­tion be­tween firms? But ev­ery­thing is the same on an ab­stract/​struc­tural level?

Could we go a lit­tle fur­ther, and in fact de­scribe the firm as an agent, with con­sumers as its prin­ci­pal? The agent (the firm) can ex­tract agency rents to the ex­tent that (a) its ac­tivi­ties at least some­what al­ign with those of the prin­ci­pal (e.g., it pro­duces a product that the pub­lic prefers to noth­ing, and that they’re will­ing to pay some­thing for), and (b) there’s limited com­pe­ti­tion (e.g., due to a patent). I.e., are both types rents due to one ac­tor (a) op­ti­mis­ing for some­thing other than what the other ac­tors wants, and (b) be­ing able to get away with it?

That seems con­sis­tent with (but not stated in) most of the fol­low­ing quote from you:

Re monopoly rents vs agency rents: Monopoly rents re­fer to the op­po­site ex­treme with very lit­tle com­pe­ti­tion, and in the eco­nomics liter­a­ture is used when talk­ing about firms, while agency rents are pre­sent when­ever com­pe­ti­tion and mon­i­tor­ing are im­perfect. Also, agency rents re­fer speci­fi­cally to the costs in­her­ent to del­e­gat­ing to an agent (e.g. an agent mak­ing in­vest­ment de­ci­sions op­ti­mis­ing for com­mis­sion over firm profit) vs the rents from monopoly power (e.g. be­ing the only firm able to use a tech­nol­ogy due to a patent). But as you say, it’s true that lack of com­pe­ti­tion is a cause of both of these.

What my proposed framing seems to not account for is that discussion of agency rents involves mention of imperfect monitoring as well as imperfect competition. But I think I share Wei Dai’s confusion there. If the principal had no other choice (i.e., there’s no competition), then even with perfect monitoring, wouldn’t there still be agency rents, as long as the agent is optimising for something at least somewhat correlated with the principal’s interests? Is it just that imperfect monitoring increases how much the agent can “get away with”, at any given level of correlation between its activities and the principal’s interests?

And could we say a similar thing for monopoly rents—e.g., a mo­nop­o­lis­tic firm, or one with lit­tle com­pe­ti­tion, may be able to ex­tract some­what more rents if it’s es­pe­cially hard to tell how valuable its product is in ad­vance?

Note that I don’t have a wealth of econ knowl­edge and didn’t take the op­tion of do­ing a bunch of googling to try to figure this out for my­self. No one is obliged to pla­cate my lethargy with a re­sponse :)

Re the differ­ence be­tween Monopoly rents and agency rents: monopoly rents would be elimi­nated by com­pe­ti­tion be­tween firms whereas agency rents would be elimi­nated by com­pe­ti­tion be­tween work­ers. So they’re differ­ent in that sense.

I think that more en­gage­ment in this area is use­ful, and mostly agree. I’ll point out that I think much of the is­sue with pow­er­ful agents and missed con­se­quences is more use­fully cap­tured by work on Good­hart’s law, which is definitely my pet idea, but seems rele­vant. I’ll self pro­mote shame­lessly here.

Cu­rated. This post rep­re­sents a sig­nifi­cant amount of re­search, look­ing into the ques­tion of whether an es­tab­lished area of liter­a­ture might be in­for­ma­tive to con­cerns about AI al­ign­ment. It looks at that liter­a­ture, ex­am­ines its rele­vance in light of the ques­tions that have been dis­cussed so far, and checks the con­clu­sions with ex­ist­ing do­main ex­perts. Fi­nally, it sug­gests fur­ther work that might provide use­ful in­sights to these kinds of ques­tions.

I do have the con­cern that cur­rently, the post re­lies a fair bit on the reader trust­ing the au­thors to have done a com­pre­hen­sive search—the post men­tions hav­ing done “ex­ten­sive search­ing”, but be­sides the men­tion of con­sult­ing do­main ex­perts, does not elab­o­rate on how that search pro­cess was car­ried out. This is a sig­nifi­cant con­sid­er­a­tion since a large part of the post’s con­clu­sions rely on nega­tive re­sults (there not be­ing pa­pers which ex­am­ine the rele­vant as­sump­tions). I would have ap­pre­ci­ated see­ing some kind of a de­scrip­tion of the search strat­egy, similar in spirit to the search de­scrip­tions in­cluded in sys­tem­atic re­views. This would have al­lowed read­ers to both re­pro­duce the search steps, as well as no­tice any pos­si­ble short­com­ings that might have led to rele­vant liter­a­ture be­ing missed.

Nonethe­less, this is an im­por­tant con­tri­bu­tion, and I’m very happy both to see this kind of work done, as well as it be­ing writ­ten up in a clear form on LW.

I wouldn’t char­ac­ter­ise the con­clu­sion as “nope, doesn’t pan out”. Maybe more like: we can’t in­fer too much from ex­ist­ing PAL, but AI agency rents are an im­por­tant con­sid­er­a­tion, and for a wide range of fu­ture sce­nar­ios new agency mod­els could tell us about the de­gree of rent ex­trac­tion.

Nev­er­the­less, ex­ten­sions to PAL might still be use­ful. Agency rents are what might al­low AI agents to ac­cu­mu­late wealth and in­fluence, and agency mod­els are the best way we have to learn about the size of these rents. Th­ese find­ings should in­form a wide range of fu­ture sce­nar­ios, per­haps bar­ring ex­treme ones like Bostrom/​Yud­kowsky.

For my­self, this is the most ex­cit­ing thing in this post—the pos­si­bil­ity of tak­ing the prin­ci­pal-agent model and us­ing it to rea­son about AI even if most of the ex­ist­ing prin­ci­pal-agent liter­a­ture doesn’t provide re­sults that ap­ply. I see lit­tle here to make me think the prin­ci­pal-agent model wouldn’t be use­ful, only that it hasn’t been used in ways that are use­ful to AI risk sce­nar­ios yet. It seems worth­while, for ex­am­ple, to pur­sue re­search on the prin­ci­pal-agent prob­lem with some of the ad­just­ments to make it bet­ter ap­ply to AI sce­nar­ios, such as let­ting the agent be more pow­er­ful than the prin­ci­pal and ad­just­ing the rent mea­sure to bet­ter work with AI.

Maybe this ap­proach won’t yield any­thing (as we should ex­pect on pri­ors, sim­ply be­cause most ap­proaches to AI safety are likely not go­ing to work), but it seems worth ex­plor­ing fur­ther on the chance it can de­liver valuable in­sights, even if, as you say, the ex­ist­ing liter­a­ture doesn’t offer much that is di­rectly use­ful to AI risk now.

I agree that this seems like a promis­ing re­search di­rec­tion! I think this would be done best while also think­ing about con­crete traits of AI sys­tems, as dis­cussed in this foot­note. One po­ten­tial benefi­cial out­come would be to un­der­stand which kind of sys­tems earn rents and which don’t; I wouldn’t be sur­prised if the dis­tinc­tion be­tween rent earn­ing agents vs oth­ers mapped pretty cleanly onto a Bostro­mian util­ity max­imiser vs CAIS dis­tinc­tion, but maybe it won’t.

In any case, the al­ter­na­tive per­spec­tive offered by the agency rents fram­ing com­pared to typ­i­cal AI al­ign­ment dis­cus­sion could help gen­er­ate in­ter­est­ing new in­sights.

The agency liter­a­ture is there to model real agency re­la­tions in the world. Those real re­la­tions no doubt con­tain plenty of “un­aware­ness”. If mod­els with­out un­aware­ness were failing to cap­ture and ex­plain a big frac­tion of real agency prob­lems, there would be plenty of scope for peo­ple to try to fill that gap via mod­els that in­clude it. The claim that this couldn’t work be­cause such mod­els are limited seems just ar­bi­trary and wrong to me. So ei­ther one must claim that AI-re­lated un­aware­ness is of a very differ­ent type or scale from or­di­nary hu­man cases in our world to­day, or one must im­plic­itly claim that un­aware­ness mod­el­ing would in fact be a con­tri­bu­tion to the agency liter­a­ture. It seems to me a mild bur­den of proof sits on ad­vo­cates for this lat­ter case to in fact cre­ate such con­tri­bu­tions.

The claim that this couldn’t work be­cause such mod­els are limited seems just ar­bi­trary and wrong to me.

The economists I spoke to seemed to think that in agency un­aware­ness mod­els con­clu­sions fol­low pretty im­me­di­ately from the as­sump­tions and so don’t teach you much. It’s not that they can’t model real agency prob­lems, just that you don’t learn much from the model. Per­haps if we’d spo­ken to more economists there would have been more dis­agree­ment on this point.

So ei­ther one must claim that AI-re­lated un­aware­ness is of a very differ­ent type or scale from or­di­nary hu­man cases in our world to­day, or one must im­plic­itly claim that un­aware­ness mod­el­ing would in fact be a con­tri­bu­tion to the agency liter­a­ture.

I agree that the Bostrom/Yudkowsky scenario implies AI-related unawareness is of a very different scale from ordinary human cases. From an outside view perspective, this is a strike against the scenario. However, this deviation from past trends does follow fairly naturally (though not necessarily) from the hypothesis of a sudden and massive intelligence gap.

Stu­art Arm­strong ar­gued that Han­son’s cri­tique doesn’t work be­cause PAL as­sumes con­tract en­force­abil­ity, and with ad­vanced AI, in­sti­tu­tions might not be up to the task. In­deed, con­tract en­force­abil­ity is as­sumed in most of PAL, so it’s an im­por­tant con­sid­er­a­tion re­gard­ing their ap­pli­ca­bil­ity to AI sce­nar­ios more broadly.

This seems kind of off to me. When I think about us­ing the anal­y­sis of con­tracts be­tween hu­mans and AIs, I’m not imag­in­ing le­gal con­tracts: I’m us­ing it as a metaphor for the hu­man get­ting to di­rectly set what ‘the AI’* is mo­ti­vated to do. As such, the con­tract re­ally is strictly en­forced, be­cause the ‘con­tract’ is the mo­ti­va­tional sys­tem of ‘the AI’ which there’s rea­son think ‘the AI’ is mo­ti­vated to pre­serve and ca­pa­ble of pre­serv­ing, a la Omo­hun­dro’s ba­sic AI drives.

Now, I think there are two is­sues with this:

Agents some­times are in­cen­tivised to change their prefer­ences as a re­sult of bar­gain­ing. Think: “In or­der for us to work to­gether on this pro­ject, which will de­liver you boun­tiful re­wards, I need you to stop be­ing mo­ti­vated to steal my lightly-guarded be­long­ings, be­cause I’m just not good enough at se­cu­rity to dis­in­cen­tivise you from steal­ing them my­self.”

More gen­er­ally, we can think of the ‘con­tract’ as the pro­gram that con­sti­tutes ‘the AI’, which might be so com­pli­cated that hu­mans don’t un­der­stand it, and that might in­clude a plan­ning rou­tine. In this case, ‘the AI’ might be mo­ti­vated to mod­ify the ‘con­tract’ to make ‘it­self’ smarter.

But at any rate, I think the con­tract en­force­abil­ity prob­lem isn’t a knock-down against the PAL be­ing rele­vant.

[*] scare quotes to be a lit­tle more ac­cu­rate and to pla­cate my simu­lated Eric Drexler

I think it’s worth distinguishing between a legal contract and setting the AI’s motivational system, even though the latter is a contract in some sense. My reading of Stuart’s post was that it was intended literally, not as a metaphor. Regardless, both are relevant; in PAL, you’d model the motivational system via the agent’s utility function, and contract enforceability via the background assumptions.

But I agree that con­tract en­force­abil­ity isn’t a knock-down, and in­deed won’t be an is­sue by de­fault. I think we should have framed this more clearly in the post. Here’s the most im­por­tant part of what we said:

But it is plausible when AIs are similarly smart to humans, and in scenarios where powerful AIs are used to enforce contracts. Furthermore, if we cannot enforce contracts with AIs then people will promptly realise and stop using AIs; so we should expect contracts to be enforceable conditional upon AIs being used.

I think it’s worth dis­t­in­guish­ing be­tween a le­gal con­tract and set­ting the AI’s mo­ti­va­tional sys­tem, even though the lat­ter is a con­tract in some sense.

To restate/​clar­ify my above com­ment, I agree, but think that we are likely to del­e­gate tasks to AIs by set­ting their mo­ti­va­tional sys­tem and not by draft­ing literal le­gal con­tracts with them. So the PAL is rele­vant to the ex­tent that it works as a metaphor for set­ting an AIs mo­ti­va­tional sys­tem and source code, and in this con­text con­tract en­force­abil­ity isn’t an is­sue, and Stu­art is mak­ing a mis­take to be think­ing about literal le­gal con­tracts (as­sum­ing that he is do­ing so).

Well be­cause I think they wouldn’t be en­force­able in the re­ally bad cases the con­tracts would be try­ing to pre­vent :) And also by de­fault peo­ple cur­rently del­e­gate tasks to com­put­ers by writ­ing soft­ware, which I ex­pect to con­tinue in fu­ture (al­though I guess smart con­tracts are an in­ter­est­ing edge case here).

There are THOUSANDS of cri­tiques out there of the form “Eco­nomic the­ory can’t be trusted be­cause eco­nomic the­ory analy­ses make as­sump­tions that can’t be proven and are of­ten wrong, and con­clu­sions are of­ten sen­si­tive to as­sump­tions.” Really, this is a very stan­dard and generic cri­tique, and of course it is quite wrong, as such a cri­tique can be equally made against any area of the­ory what­so­ever, in any field.

Aside from the arguments we made about modelling unawareness, I don’t think we were claiming that econ theory wouldn’t be useful. We argue that new agency models could tell us about the levels of rents extracted by AI agents; our claims are just that i) we can’t infer much from existing models because they model different situations and are brittle, and ii) models won’t shed light on phenomena beyond what they are trying to model.

But of course, it can’t be used against them all equally. Physics is so good you can send a probe to a planet mil­lions of miles away. But try­ing to achieve a prac­ti­cal re­sult in eco­nomics is largely guess­work.

Great post! It ex­plained clearly both po­si­tions, clar­ified the po­ten­tial uses of PAL and pro­posed vari­a­tions when it was con­sid­ered ac­cessible.

Maybe my only is­sue is with the (lack of) defi­ni­tion of the prin­ci­pal-agent prob­lem. The rest of the post works rel­a­tively well with­out you defin­ing it ex­plic­itly, but I think a short defi­ni­tion (even just a rephras­ing of the one on Wikipe­dia) would make the post even more read­able.

Fur­ther­more, if we can­not en­force con­tracts with AIs then peo­ple will promptly re­al­ise and stop us­ing AIs; so we should ex­pect con­tracts to be en­force­able con­di­tional upon AIs be­ing used.

I could easily be wrong, but this strikes me as a plausible but debatable statement, rather than a certainty. It seems like more argument would be required even to establish that it’s likely, and much more to establish that we can say “people will promptly realise...” It also seems like that statement is sort of assuming part of precisely what’s up for debate in these sorts of discussions.

Some frag­mented thoughts that feed into those opinions:

As you note just be­fore that: “The as­sump­tion [of con­tract en­force­abil­ity] isn’t plau­si­ble in pes­simistic sce­nar­ios where hu­man prin­ci­pals and in­sti­tu­tions are in­suffi­ciently pow­er­ful to pun­ish the AI agent, e.g. due to very fast take-off.” So the Bostrom/​Yud­kowsky sce­nario is pre­cisely one in which con­tracts aren’t en­force­able, for very similar rea­sons to why that sce­nario could lead to ex­is­ten­tial catas­tro­phe.

Very re­lat­edly—per­haps this is even just the same point in differ­ent words—you say “then peo­ple will promptly re­al­ise and stop us­ing AIs”. This as­sumes some pos­si­bil­ity of at least some trial-and-er­ror, and thus as­sumes that there’ll be nei­ther a very dis­con­tin­u­ous ca­pa­bil­ity jump to­wards de­ci­sive strate­gic ad­van­tage, nor de­cep­tion fol­lowed by a treach­er­ous turn.

As you point out, Paul Chris­ti­ano’s “Part 1” sce­nario might be one in which all or most hu­mans are happy, and in­creas­ingly wealthy, and don’t have mo­ti­va­tion to stop us­ing the AIs. You quote him say­ing “hu­mans are bet­ter off in ab­solute terms un­less con­flict leaves them worse off (whether mil­i­tary con­flict or a race for scarce re­sources). Com­pare: a ris­ing China makes Amer­i­cans bet­ter off in ab­solute terms. Also true, un­less we con­sider the pos­si­bil­ity of con­flict....[with­out con­flict] hu­mans are only worse off rel­a­tive to AI (or to hu­mans who are able to lev­er­age AI effec­tively). The availa­bil­ity of AI still prob­a­bly in­creases hu­mans’ ab­solute wealth. This is a prob­lem for hu­mans be­cause we care about our frac­tion of in­fluence over the fu­ture, not just our ab­solute level of wealth over the short term.”

Similarly, it seems to me that we could have a sce­nario in which peo­ple re­al­ise they can’t en­force con­tracts with AIs, but the losses that re­sult from that are rel­a­tively small, and are out­weighed by the benefits of the AI, so peo­ple con­tinue us­ing the AIs de­spite the lack of en­force­abil­ity of the con­tracts.

And then this could still lead to ex­is­ten­tial catas­tro­phe due to black swan events peo­ple didn’t ad­e­quately ac­count for, com­pet­i­tive dy­nam­ics, or “ex­ter­nal­ities” e.g. in re­la­tion to fu­ture gen­er­a­tions.

I’m not per­son­ally sure how likely I find any of the above sce­nar­ios. I’m just say­ing that they seem to re­veal rea­sons to have at least some doubts that “if we can­not en­force con­tracts with AIs then peo­ple will promptly re­al­ise and stop us­ing AIs”.

Although I think it would still be true that the possibilities of trial-and-error, recognition of lack of enforceability, and people’s concerns about that are at least some reason to assume that, if AIs are used, contracts will be enforceable.