In the pre­vi­ous post, I ar­gued that sim­ply know­ing that an AI sys­tem is su­per­in­tel­li­gent does not im­ply that it must be goal-di­rected. How­ever, there are many other ar­gu­ments that sug­gest that AI sys­tems will or should be goal-di­rected, which I will dis­cuss in this post.

Note that I don’t think of this as the Tool AI vs. Agent AI ar­gu­ment: it seems pos­si­ble to build agent AI sys­tems that are not goal-di­rected. For ex­am­ple, imi­ta­tion learn­ing al­lows you to cre­ate an agent that be­haves similarly to an­other agent—I would clas­sify this as “Agent AI that is not goal-di­rected”. (But see this com­ment thread for dis­cus­sion.)

Note that these ar­gu­ments have differ­ent im­pli­ca­tions than the ar­gu­ment that su­per­in­tel­li­gent AI must be goal-di­rected due to co­her­ence ar­gu­ments. Sup­pose you be­lieve all of the fol­low­ing:

Any of the ar­gu­ments in this post.

Su­per­in­tel­li­gent AI is not re­quired to be goal-di­rected, as I ar­gued in the last post.

Goal-di­rected agents cause catas­tro­phe by de­fault.

Then you could try to cre­ate al­ter­na­tive de­signs for AI sys­tems such that they can do the things that goal-di­rected agents can do with­out them­selves be­ing goal-di­rected. You could also try to per­suade AI re­searchers of these facts, so that they don’t build goal-di­rected sys­tems.

Eco­nomic effi­ciency: goal-di­rected humans

Hu­mans want to build pow­er­ful AI sys­tems in or­der to help them achieve their goals—it seems quite clear that hu­mans are at least par­tially goal-di­rected. As a re­sult, it seems nat­u­ral that they would build AI sys­tems that are also goal-di­rected.

This is re­ally an ar­gu­ment that the sys­tem com­pris­ing the hu­man and AI agent should be di­rected to­wards some goal. The AI agent by it­self need not be goal-di­rected as long as we get goal-di­rected be­hav­ior when com­bined with a hu­man op­er­a­tor. How­ever, in the situ­a­tion where the AI agent is much more in­tel­li­gent than the hu­man, it is prob­a­bly best to del­e­gate most or all de­ci­sions to the agent, and so the agent could still look mostly goal-di­rected.

Even so, you could imag­ine that even the small part of the work that the hu­man con­tinues to do al­lows the agent to not be goal-di­rected, es­pe­cially over long hori­zons. For ex­am­ple, per­haps the hu­man de­cides what the agent should do each day, and the agent ex­e­cutes the in­struc­tion, which in­volves plan­ning over the course of a day, but no longer. (I am not ar­gu­ing that this is safe; on the con­trary, hav­ing very pow­er­ful op­ti­miza­tion over the course of a day seems prob­a­bly un­safe.) This could be ex­tremely pow­er­ful with­out the AI be­ing goal-di­rected over the long term.

Another ex­am­ple would be a cor­rigible agent, which could be ex­tremely pow­er­ful while not be­ing goal-di­rected over the long term. (Though the mean­ings of “goal-di­rected” and “cor­rigible” are suffi­ciently fuzzy that this is not ob­vi­ous and de­pends on the defi­ni­tions we set­tle on for each.)

Eco­nomic effi­ciency: be­yond hu­man performance

Another benefit of goal-di­rected be­hav­ior is that it al­lows us to find novel ways of achiev­ing our goals that we may not have thought of, such as AlphaGo’s move 37. Goal-di­rected be­hav­ior is one of the few meth­ods we know of that al­low AI sys­tems to ex­ceed hu­man perfor­mance.

I think this is a good ar­gu­ment for goal-di­rected be­hav­ior, but given the prob­lems of goal-di­rected be­hav­ior I think it’s worth search­ing for al­ter­na­tives, such as the two ex­am­ples in the pre­vi­ous sec­tion (op­ti­miz­ing over a day, and cor­rigi­bil­ity). Alter­na­tively, we could learn hu­man rea­son­ing, and ex­e­cute it for a longer sub­jec­tive time than hu­mans would, in or­der to make bet­ter de­ci­sions. Or we could have sys­tems that re­main un­cer­tain about the goal and clar­ify what they should do when there are mul­ti­ple very differ­ent op­tions (though this has its own prob­lems).

Cur­rent progress in re­in­force­ment learning

If we had to guess to­day which paradigm would lead to AI sys­tems that can ex­ceed hu­man perfor­mance, I would guess re­in­force­ment learn­ing (RL). In RL, we have a re­ward func­tion and we seek to choose ac­tions that max­i­mize the sum of ex­pected dis­counted re­wards. This sounds a lot like an agent that is search­ing over ac­tions for the best one ac­cord­ing to a mea­sure of good­ness (the re­ward func­tion [1]), which I said pre­vi­ously is a goal-di­rected agent. And the math be­hind RL says that the agent should be try­ing to max­i­mize its re­ward for the rest of time, which makes it long-term [2].
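As a rough illustration of that objective (a minimal sketch with hypothetical names; `simulate` stands in for whatever gives the agent its predicted future rewards):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over a trajectory of rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def best_action(actions, simulate, gamma=0.99):
    """Pick the action whose predicted reward trajectory has the highest
    discounted return -- the 'measure of goodness' described above."""
    return max(actions, key=lambda a: discounted_return(simulate(a), gamma))
```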

That said, cur­rent RL agents learn to re­play be­hav­ior that in their past ex­pe­rience worked well, and typ­i­cally do not gen­er­al­ize out­side of the train­ing dis­tri­bu­tion. This does not seem like a search over ac­tions to find ones that are the best. In par­tic­u­lar, you shouldn’t ex­pect a treach­er­ous turn, since the whole point of a treach­er­ous turn is that you don’t see it com­ing be­cause it never hap­pened be­fore.

In ad­di­tion, cur­rent RL is epi­sodic, so we should only ex­pect that RL agents are goal-di­rected over the cur­rent epi­sode and not in the long-term. Of course, many tasks would have very long epi­sodes, such as be­ing a CEO. The vanilla deep RL ap­proach here would be to spec­ify a re­ward func­tion for how good a CEO you are, and then try many differ­ent ways of be­ing a CEO and learn from ex­pe­rience. This re­quires you to col­lect many full epi­sodes of be­ing a CEO, which would be ex­tremely time-con­sum­ing.

Per­haps with enough ad­vances in model-based deep RL we could train the model on par­tial tra­jec­to­ries and that would be enough, since it could gen­er­al­ize to full tra­jec­to­ries. I think this is a ten­able po­si­tion, though I per­son­ally don’t ex­pect it to work since it re­lies on our model gen­er­al­iz­ing well, which seems un­likely even with fu­ture re­search.

Th­ese ar­gu­ments lead me to be­lieve that we’ll prob­a­bly have to do some­thing that is not vanilla deep RL in or­der to train an AI sys­tem that can be a CEO, and that thing may not be goal-di­rected.

Over­all, it is cer­tainly pos­si­ble that im­proved RL agents will look like dan­ger­ous long-term goal-di­rected agents, but this does not seem to be the case to­day and there seem to be se­ri­ous difficul­ties in scal­ing cur­rent al­gorithms to su­per­in­tel­li­gent AI sys­tems that can op­ti­mize over the long term. (I’m not ar­gu­ing for long timelines here, since I wouldn’t be sur­prised if we figured out some way that wasn’t vanilla deep RL to op­ti­mize over the long term, but that method need not be goal-di­rected.)

Ex­ist­ing in­tel­li­gent agents are goal-directed

So far, humans and perhaps animals are the only examples of generally intelligent agents that we know of, and they seem to be quite goal-directed. This is some evidence that we should expect intelligent agents that we build to also be goal-directed.

Ul­ti­mately we are ob­serv­ing a cor­re­la­tion be­tween two things with sam­ple size 1, which is re­ally not much ev­i­dence at all. If you be­lieve that many an­i­mals are also in­tel­li­gent and goal-di­rected, then per­haps the sam­ple size is larger, since there are in­tel­li­gent an­i­mals with very differ­ent evolu­tion­ary his­to­ries and neu­ral ar­chi­tec­tures (eg. oc­to­puses).

How­ever, this is speci­fi­cally about agents that were cre­ated by evolu­tion, which did a rel­a­tively stupid blind search over a large space, and we could use a differ­ent method to de­velop AI sys­tems. So this ar­gu­ment makes me more wary of cre­at­ing AI sys­tems us­ing evolu­tion­ary searches over large spaces, but it doesn’t make me much more con­fi­dent that all good AI sys­tems must be goal-di­rected.

Interpretability

Another argument for building a goal-directed agent is that it allows us to predict what it’s going to do in novel circumstances. While you may not be able to predict the specific actions it will take, you can predict some features of the final world state, in the same way that if I were to play Magnus Carlsen at chess, I couldn’t predict how he would play, but I could predict that he would win.

I do not un­der­stand the in­tent be­hind this ar­gu­ment. It seems as though faced with the nega­tive re­sults that sug­gest that goal-di­rected be­hav­ior tends to cause catas­trophic out­comes, we’re ar­gu­ing that it’s a good idea to build a goal-di­rected agent so that we can more eas­ily pre­dict that it’s go­ing to cause catas­tro­phe.

I also think that we would typ­i­cally be able to pre­dict sig­nifi­cantly more about what any AI sys­tem we ac­tu­ally build will do (than if we mod­eled it as try­ing to achieve some goal). This is be­cause “agent seek­ing a par­tic­u­lar goal” is one of the sim­plest mod­els we can build, and with any sys­tem we have more in­for­ma­tion on, we start re­fin­ing the model to make it bet­ter.

Summary

Over­all, I think there are good rea­sons to think that “by de­fault” we would de­velop goal-di­rected AI sys­tems, be­cause the things we want AIs to do can be eas­ily phrased as goals, and be­cause the stated goal of re­in­force­ment learn­ing is to build goal-di­rected agents (al­though they do not look like goal-di­rected agents to­day). As a re­sult, it seems im­por­tant to figure out ways to get the pow­er­ful ca­pa­bil­ities of goal-di­rected agents through agents that are not them­selves goal-di­rected. In par­tic­u­lar, this sug­gests that we will need to figure out ways to build AI sys­tems that do not in­volve spec­i­fy­ing a util­ity func­tion that the AI should op­ti­mize, or even learn­ing a util­ity func­tion that the AI then op­ti­mizes.

[1] Tech­ni­cally, ac­tions are cho­sen ac­cord­ing to the Q func­tion, but the dis­tinc­tion isn’t im­por­tant here.

[2] Dis­count­ing does cause us to pri­ori­tize short-term re­wards over long-term ones. On the other hand, dis­count­ing seems mostly like a hack to make the math not spit out in­fini­ties, and so that learn­ing is more sta­ble. On the third hand, in­finite hori­zon MDPs with undis­counted re­ward aren’t solv­able un­less you al­most surely en­ter an ab­sorb­ing state. So dis­count­ing com­pli­cates the pic­ture, but not in a par­tic­u­larly in­ter­est­ing way, and I don’t want to rest an ar­gu­ment against long-term goal-di­rected be­hav­ior on the pres­ence of dis­count­ing.
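To make the “no infinities” point concrete (a standard bound, not specific to this post): with rewards bounded in magnitude by R_max and a discount factor strictly less than one, the discounted return is finite, whereas the undiscounted sum can diverge.

```latex
\left|\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right|
  \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  \;=\; \frac{R_{\max}}{1-\gamma}
  \qquad \text{for } 0 \le \gamma < 1,\ |r_t| \le R_{\max}.
```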

Note that I don’t think of this as the Tool AI vs. Agent AI ar­gu­ment: it seems pos­si­ble to build agent AI sys­tems that are not goal-di­rected. For ex­am­ple, imi­ta­tion learn­ing al­lows you to cre­ate an agent that be­haves similarly to an­other agent—I would clas­sify this as “Agent AI that is not goal-di­rected”.

I’m not very con­vinced by this ex­am­ple, or al­ter­na­tively I’m not get­ting the dis­tinc­tion you’re draw­ing be­tween “agent” and “goal-di­rected”. Sup­pose the agent you’re try­ing to imi­tate is it­self goal-di­rected. In or­der for the imi­ta­tor to gen­er­al­ize be­yond its train­ing dis­tri­bu­tion, it seem­ingly has to learn to be­come goal-di­rected (i.e., perform the same sort of com­pu­ta­tions that a goal-di­rected agent would). I don’t see how else it can pre­dict what the goal-di­rected agent would do in a novel situ­a­tion. If the imi­ta­tor is not able to gen­er­al­ize, then it seems more tool-like than agent-like. On the other hand, if the imi­ta­tee is not goal-di­rected… I guess the agent could imi­tate hu­mans and be not en­tirely goal-di­rected to the ex­tent that hu­mans are not en­tirely goal-di­rected. (Is this the point you’re try­ing to make, or are you say­ing that an imi­ta­tion of a goal-di­rected agent would con­sti­tute a non-goal-di­rected agent?)

Your post re­minded me of Paul Chris­ti­ano’s ap­proval-di­rected agents which was also about try­ing to find an al­ter­na­tive to goal-di­rected agents. Look­ing at it again, it ac­tu­ally sounds a lot like ap­ply­ing imi­ta­tion learn­ing to hu­mans (ex­cept imi­tat­ing a speeded-up hu­man):

Es­ti­mate the ex­pected rat­ing Hugh would give each ac­tion if he con­sid­ered it at length. Take the ac­tion with the high­est ex­pected rat­ing.
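As a rough sketch of the quoted rule (hypothetical names; `estimate_approval` stands in for whatever model predicts Hugh’s considered rating of an action):

```python
def approval_directed_step(candidate_actions, estimate_approval):
    """Take the single action with the highest estimated considered rating.
    Note that the agent only scores individual actions; it does no
    long-horizon planning of its own."""
    return max(candidate_actions, key=estimate_approval)
```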

Can ap­proval-di­rected agents be con­sid­ered a form of imi­ta­tion learn­ing, and if not, are there any safety-rele­vant differ­ences be­tween imi­ta­tion learn­ing of (speeded-up) hu­mans, and ap­proval-di­rected agents?

Can ap­proval-di­rected agents be con­sid­ered a form of imi­ta­tion learn­ing, and if not, are there any safety-rele­vant differ­ences be­tween imi­ta­tion learn­ing of (speeded-up) hu­mans, and ap­proval-di­rected agents?

I think that the only rea­son to be in­ter­ested in ap­proval-di­rected agents rather than straight­for­ward imi­ta­tion learn­ers is that it may be harder to effec­tively imi­tate be­hav­ior than to solve the same task in a very differ­ent way.

On the other hand, if the imi­ta­tee is not goal-di­rected… I guess the agent could imi­tate hu­mans and be not en­tirely goal-di­rected to the ex­tent that hu­mans are not en­tirely goal-di­rected. (Is this the point you’re try­ing to make, or are you say­ing that an imi­ta­tion of a goal-di­rected agent would con­sti­tute a non-goal-di­rected agent?)

I definitely en­dorse this point, think that it’s an im­por­tant as­pect, and that it alone jus­tifies the claim that I was mak­ing about non-goal-di­rected Agent AI be­ing pos­si­ble.

That said, I do have an in­tu­ition that agents whose goal-di­rect­ed­ness comes from other agents shouldn’t be con­sid­ered goal-di­rected, at least if it hap­pens in a par­tic­u­lar way. Let’s say that I’m pur­su­ing goal X, and my as­sis­tant AI agent is also pur­su­ing goal X as a re­sult. If I then start to pur­sue goal Y, and my AI agent also starts pur­su­ing Y be­cause it is al­igned with me, then it feels like the AI was not re­ally di­rected at goal X, but more di­rected at “what­ever goal Ro­hin has”, and this feels dis­tinctly less goal-di­rected to me. (In par­tic­u­lar, my AI agent would not have all of the con­ver­gent in­stru­men­tal sub­goals in this set­ting, so it is re­ally differ­ent in kind from an AI agent that was sim­ply pur­su­ing X to the best of its abil­ity.)

“Goal-di­rected” may not be the right word to cap­ture the prop­erty I’m think­ing about. It might be some­thing like “thing that pur­sues the stan­dard con­ver­gent in­stru­men­tal sub­goals”, or “thing that pur­sues a goal that is not defined in terms of some­one else’s goal”.

Your post re­minded me of Paul Chris­ti­ano’s ap­proval-di­rected agents which was also about try­ing to find an al­ter­na­tive to goal-di­rected agents.

Yeah, that idea was a big in­fluence on the views that caused me to write this post.

Can ap­proval-di­rected agents be con­sid­ered a form of imi­ta­tion learn­ing, and if not, are there any safety-rele­vant differ­ences be­tween imi­ta­tion learn­ing of (speeded-up) hu­mans, and ap­proval-di­rected agents?

It’s not ex­actly the same, but it is very similar. You could think of ap­proval-di­rec­tion as imi­ta­tion of a par­tic­u­lar weird kind of hu­man, who de­liber­ates for a while be­fore choos­ing any ac­tion.

They feel differ­ent enough to me that there prob­a­bly are safety-rele­vant differ­ences, but I don’t know of any off the top of my head. Ini­tially I was go­ing to say that my­opia was a safety-rele­vant differ­ence, but think­ing about it more I don’t think that’s an ac­tual differ­ence. Ap­proval-di­rected agents are more ex­plic­itly my­opic, but I think imi­ta­tion learn­ing could be my­opic in the same way.

Btw, this post also views Paul’s agenda through the lens of con­struct­ing imi­ta­tions of hu­mans.

For ex­am­ple, imi­ta­tion learn­ing al­lows you to cre­ate an agent that be­haves similarly to an­other agent—I would clas­sify this as “Agent AI that is not goal-di­rected”.

Let’s say that I’m pur­su­ing goal X, and my as­sis­tant AI agent is also pur­su­ing goal X as a re­sult. If I then start to pur­sue goal Y, and my AI agent also starts pur­su­ing Y be­cause it is al­igned with me, then it feels like the AI was not re­ally di­rected at goal X, but more di­rected at “what­ever goal Ro­hin has”

What causes the agent to switch from X to Y?

Are you think­ing of the “agent” as A) the product of the demon­stra­tions and train­ing (e.g. the re­sult­ing neu­ral net­work), or as B) a sys­tem that in­cludes both the trained agent and also the train­ing pro­cess it­self (and fa­cil­ities for con­tinual on­line learn­ing)?

I would as­sume A by de­fault, but then I would ex­pect that if you trained such an agent with imi­ta­tion learn­ing while pur­su­ing goal X, you’d likely get an agent that con­tinues to pur­sue goal X even af­ter you’ve switched to pur­su­ing goal Y. (Un­less the agent also learned to imi­tate what­ever the de­ci­sion-mak­ing pro­cess was that led you to switch from X to Y, in which case the agent seems non-goal-di­rected only in­so­far as you de­cided to switch from X to Y for non-goal-re­lated rea­sons rather than in ser­vice of some higher level goal Ω. Is that what you want?)

Are you think­ing of the “agent” as A) the product of the demon­stra­tions and train­ing (e.g. the re­sult­ing neu­ral net­work), or as B) a sys­tem that in­cludes both the trained agent and also the train­ing pro­cess it­self (and fa­cil­ities for con­tinual on­line learn­ing)?

I was imag­in­ing some­thing more like B for the imi­ta­tion learn­ing case.

I would as­sume A by de­fault, but then I would ex­pect that if you trained such an agent with imi­ta­tion learn­ing while pur­su­ing goal X, you’d likely get an agent that con­tinues to pur­sue goal X even af­ter you’ve switched to pur­su­ing goal Y. (Un­less the agent also learned to imi­tate what­ever the de­ci­sion-mak­ing pro­cess was that led you to switch from X to Y, in which case the agent seems non-goal-di­rected only in­so­far as you de­cided to switch from X to Y for non-goal-re­lated rea­sons rather than in ser­vice of some higher level goal Ω. Is that what you want?)

That anal­y­sis seems right to me.

With re­spect to whether it is what I want, I wouldn’t say that I want any of these things in par­tic­u­lar, I’m more point­ing at the ex­is­tence of sys­tems that aren’t goal-di­rected, yet be­have like an agent.

With re­spect to whether it is what I want, I wouldn’t say that I want any of these things in par­tic­u­lar, I’m more point­ing at the ex­is­tence of sys­tems that aren’t goal-di­rected, yet be­have like an agent.

Would you agree that a B-type agent would be ba­si­cally as goal-di­rected as a hu­man (be­cause it ex­hibits goal-di­rected be­hav­ior when the hu­man does, and doesn’t when the hu­man doesn’t)?

In which case, would it be fair to sum­ma­rize (part of) your ar­gu­ment as:

1) Many of the po­ten­tial prob­lems with build­ing safe su­per­in­tel­li­gent sys­tems comes from them be­ing too goal-di­rected.

2) An agent that is only as goal-di­rected as a hu­man is much less sus­cep­ti­ble to many of these failure modes.

3) It is likely pos­si­ble to build su­per­in­tel­li­gent sys­tems that are only as goal-di­rected as hu­mans.

I don’t think so. Maybe this would be true if you had a perfect imi­ta­tion of a hu­man, but in prac­tice you’ll be un­cer­tain about what the hu­man is go­ing to do. If you’re un­cer­tain in this way, and you are get­ting your goals from a hu­man, then you don’t do all of the in­stru­men­tal sub­goals. (See The Off-Switch Game for a sim­ple anal­y­sis show­ing that you can avoid the sur­vival in­cen­tive.)
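A toy numerical version of the Off-Switch Game intuition (my own illustrative numbers, not taken from the paper; it assumes the human rationally allows the action exactly when its true value is positive):

```python
import numpy as np

rng = np.random.default_rng(0)
# The agent's belief over the (unknown) value U of its proposed action.
U = rng.normal(loc=0.0, scale=1.0, size=100_000)

value_act_now = U.mean()                 # bypass the human / disable the switch
value_defer = np.maximum(U, 0.0).mean()  # let the human veto whenever U < 0

print(value_act_now, value_defer)  # deferring is at least as good in expectation,
                                   # so there is no incentive to avoid being switched off
```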

It may be that “goal-di­rected” is the wrong word for the prop­erty I’m talk­ing about, but I’m pre­dict­ing that agents of this form are less sus­cep­ti­ble to con­ver­gent in­stru­men­tal sub­goals than hu­mans are.

If you’ve seen the hu­man ac­quire re­sources, then you’ll ac­quire re­sources in the same way.

If there’s now some new re­source that you’ve never seen be­fore, you may ac­quire it if you’re suffi­ciently con­fi­dent that the hu­man would, but oth­er­wise you might try to gather more ev­i­dence to see what the hu­man would do. This is as­sum­ing that we have some way of do­ing imi­ta­tion learn­ing that al­lows the re­sult­ing sys­tem to have un­cer­tainty that it can re­solve by watch­ing the hu­man, or ask­ing the hu­man. If you imag­ine the ex­act way that we do imi­ta­tion learn­ing to­day, it would ex­trap­o­late some­how in a way that isn’t ac­tu­ally what the hu­man would do. Maybe it ac­quires the new re­source, maybe it leaves it alone, maybe it burns it to pre­vent any­one from hav­ing it, who knows.
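Here is a minimal sketch of the kind of imitation learner described above (the interface — `predict_with_confidence`, `query_human`, `update` — is hypothetical, not an existing library):

```python
def imitation_step(state, policy_model, query_human, confidence_threshold=0.9):
    """Imitate the human when confident; otherwise resolve the uncertainty
    by asking (or watching) the human and updating online."""
    action, confidence = policy_model.predict_with_confidence(state)
    if confidence >= confidence_threshold:
        return action  # confident imitation of what the human would do
    demonstration = query_human(state)  # gather more evidence instead of extrapolating
    policy_model.update(state, demonstration)
    return demonstration
```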

Btw, this post also views Paul’s agenda through the lens of con­struct­ing imi­ta­tions of hu­mans.

Right, so I think I wasn’t re­ally mak­ing a new ob­ser­va­tion, but just clear­ing up a con­fu­sion on my own part, where for a long time I didn’t un­der­stand how the idea of ap­proval-di­rected agency fits into IDA be­cause peo­ple switched from talk­ing about ap­proval-di­rected agency to imi­ta­tion learn­ing (or were talk­ing about them in­ter­change­ably) and I didn’t catch the con­nec­tion. So at this point I un­der­stand Paul’s tra­jec­tory of views as fol­lows:

goal-directed agent ⇒ approval-directed agent ⇒ use IDA to scale up approval-directed agent ⇒ approval-directed agency as a form of imitation learning / generalize to other forms of imitation learning ⇒ generalize IDA to safely scale up other (including more goal-directed / consequentialist) forms of ML (see An Unaligned Benchmark, which I think represents his current views)

They feel differ­ent enough to me that there prob­a­bly are safety-rele­vant differences

It looks like imitation learning isn’t one thing but a fairly broad category in ML which even includes IRL. But if we compare approval direction to the narrower kinds of imitation learning, approval direction seems a lot riskier because you’re optimizing over an estimation of human approval, which seems to be an adversarial process that could easily trigger safety problems in both the ground-truth human approval as well as in the estimation process. I wonder which form of imitation learning you had in mind when you wrote the OP.

ETA: From this com­ment it looks like you were think­ing of an on­line ver­sion of nar­row imi­ta­tion learn­ing. Might be good to clar­ify that in the post?

But if we com­pare ap­proval di­rec­tion to the nar­rower kinds of imi­ta­tion learn­ing, ap­proval di­rec­tion seems a lot riskier be­cause you’re op­ti­miz­ing over an es­ti­ma­tion of hu­man ap­proval, which seems to be an ad­ver­sar­ial pro­cess that could eas­ily trig­ger safety prob­lems in both the ground-truth hu­man ap­proval as well as in the es­ti­ma­tion pro­cess.

But if there are safety prob­lems in ap­proval, wouldn’t there also be safety prob­lems in the hu­man’s be­hav­ior, which imi­ta­tion learn­ing would copy?

Similarly, if there are safety prob­lems in the es­ti­ma­tion pro­cess, wouldn’t there also be safety prob­lems in the pre­dic­tion of what ac­tion a hu­man would take?

From this com­ment it looks like you were think­ing of an on­line ver­sion of nar­row imi­ta­tion learn­ing. Might be good to clar­ify that in the post?

I some­what think that it ap­plies to most imi­ta­tion learn­ing, not just the on­line var­i­ant of nar­row imi­ta­tion learn­ing, but I am pretty con­fused/​un­sure. I’ll add a poin­ter to this dis­cus­sion to the post.

But if there are safety prob­lems in ap­proval, wouldn’t there also be safety prob­lems in the hu­man’s be­hav­ior, which imi­ta­tion learn­ing would copy?

The hu­man’s be­hav­ior could be safer be­cause a hu­man mind doesn’t op­ti­mize so much as to move out­side of the range of in­puts where ap­proval is safe, or it has a “pro­posal gen­er­a­tor” that only gen­er­ates pos­si­ble ac­tions that with high prob­a­bil­ity stay within that range.

Similarly, if there are safety prob­lems in the es­ti­ma­tion pro­cess, wouldn’t there also be safety prob­lems in the pre­dic­tion of what ac­tion a hu­man would take?

Same here, if you just pre­dict what ac­tion a hu­man would take, you’re less likely to op­ti­mize so much that you likely end up out­side of where the es­ti­ma­tion pro­cess is safe.

I some­what think that it ap­plies to most imi­ta­tion learn­ing, not just the on­line var­i­ant of nar­row imi­ta­tion learn­ing, but I am pretty con­fused/​un­sure.

Your post re­minded me of Paul Chris­ti­ano’s ap­proval-di­rected agents which was also about try­ing to find an al­ter­na­tive to goal-di­rected agents. Look­ing at it again, it ac­tu­ally sounds a lot like ap­ply­ing imi­ta­tion learn­ing to hu­mans (ex­cept imi­tat­ing a speeded-up hu­man):

It seems like ap­proval di­rec­tion al­lows for cre­ative ac­tions that the hu­man op­er­a­tor ap­proves of but would not have thought of do­ing them­selves. Not sure if imi­ta­tion learn­ing does this.

That’s a good ques­tion. It looks like imi­ta­tion learn­ing ac­tu­ally cov­ers a num­ber of ML tech­niques (see this) none of which ex­actly matches ap­proval-di­rected agents. But the cat­e­gory seems broad enough that I think ap­proval-di­rected agents can be con­sid­ered to be a form of imi­ta­tion learn­ing. In par­tic­u­lar, IRL is con­sid­ered a form of imi­ta­tion learn­ing and IRL would also be able to perform ac­tions that the hu­man would not have thought of do­ing them­selves.

A lit­tle bit of nu­ance: IRL is con­sid­ered to be a form of imi­ta­tion learn­ing be­cause in many cases the in­ferred re­ward in IRL is only meant to re­pro­duce the hu­man’s perfor­mance and isn’t ex­pected to gen­er­al­ize out­side of the train­ing dis­tri­bu­tion.

There are ver­sions of IRL which are meant to go be­yond imi­ta­tion. For ex­am­ple, ad­ver­sar­ial IRL was try­ing to in­fer a re­ward that would gen­er­al­ize to new en­vi­ron­ments, in which case it would be do­ing some­thing more than imi­ta­tion.

Sup­pose the agent you’re try­ing to imi­tate is it­self goal-di­rected. In or­der for the imi­ta­tor to gen­er­al­ize be­yond its train­ing dis­tri­bu­tion, it seem­ingly has to learn to be­come goal-di­rected (i.e., perform the same sort of com­pu­ta­tions that a goal-di­rected agent would). I don’t see how else it can pre­dict what the goal-di­rected agent would do in a novel situ­a­tion. If the imi­ta­tor is not able to gen­er­al­ize, then it seems more tool-like than agent-like. On the other hand, if the imi­ta­tee is not goal-di­rected… I guess the agent could imi­tate hu­mans and be not en­tirely goal-di­rected to the ex­tent that hu­mans are not en­tirely goal-di­rected. (Is this the point you’re try­ing to make, or are you say­ing that an imi­ta­tion of a goal-di­rected agent would con­sti­tute a non-goal-di­rected agent?)

I’m not sure these are the points Ro­hin was try­ing to make, but there seem to be at least two im­por­tant points here:

Imi­ta­tion learn­ing ap­plied to hu­mans pro­duces agents no more ca­pa­ble than hu­mans. (I think IDA goes be­yond this by adding am­plifi­ca­tion steps, which are sep­a­rate. And IRL goes be­yond this by try­ing to cor­rect “er­rors” that the hu­mans make.)

Second, there’s a safety-relevant sense in which a human-imitating agent is less goal-directed than the human: if you scale the human’s capabilities, the human will become better at achieving its personal objectives, whereas if you scale the imitator’s capabilities, it’s only supposed to become even better at imitating the unscaled human.

I will list—just for my own understanding—the non-goal-directed types of agents.

1. Universal library. This is an agent which creates all significant solutions to all possible significant problems and then stops. An example is past biological evolution, which invented an enormous number of adaptations—flying, proteins, etc.—and could be used as inspiration for technological progress. Past human history, or some unconscious processes in the brain like dreaming, may be other possible examples.

2. Human-mimicking neural net—this is an example of an agent which mimics another agent.

3. Obviously, AI Oracles and AI Tools.

4. “Homeostatic” superintelligence. An example of such a system is an OS like Windows, which doesn’t do anything in a goal-directed sense but just supports processes. Most national states also work this way (except ideologically driven ones like the USSR or Iran).

6. Swarm intelligences which compete to solve a task. If one creates a prize for X, many people will compete to get it. The whole swarm is not a goal-directed agent, while its elements are such agents. Scott’s Moloch is a bad example of such swarm behaviour.

Thanks for do­ing this—it’s helpful for me as well. I have some ques­tions/​quib­bles:

Isn’t #2 as goal-di­rected as the hu­man it mimics, in all the rele­vant ways? If I learn that a cer­tain ma­chine runs a neu­ral net that mimics Hitler, shouldn’t I worry that it will try to take over the world? Maybe I don’t get what you mean by “mimics.”

What ex­actly is the differ­ence be­tween an Or­a­cle and a Tool? I thought an Or­a­cle was a kind of Tool; I thought Tool was a catch-all cat­e­gory for ev­ery­thing that’s not a Sovereign or a Ge­nie.

I’m skep­ti­cal of this no­tion of “home­o­static” su­per­in­tel­li­gence. It seems to me that na­tions like the USA are fully goal-di­rected in the rele­vant senses; they ex­hibit the ba­sic AI drives, they are ca­pa­ble of things like the treach­er­ous turn, etc. As for Win­dows, how is it an agent at all? What does it do? Allo­cate mem­ory re­sources across cur­rently-be­ing-run pro­grams? How does it do that—is there an ex­plicit func­tion that it fol­lows to do the al­lo­ca­tion (e.g. give all pro­grams equal re­sources), or does it do some­thing like con­se­quen­tial­ist rea­son­ing?

On #6, it seems to me that it might ac­tu­ally be cor­rect to say that the swarm is an agent—it’s just that the swarm has differ­ent goals than each of its in­di­vi­d­ual mem­bers. Maybe Moloch is an agent af­ter all! On the other hand, some­thing seems not quite right about this—what is Moloch’s util­ity func­tion? What­ever it is, Moloch seems par­tic­u­larly un­in­ter­ested in self-preser­va­tion, which makes it hard to think of it as an agent with nor­mal-ish goals. (Ar­gu­ment: Sup­pose some­one were to ini­ti­ate a pro­ject that would, with high prob­a­bil­ity, kill Moloch for­ever in 100 years time. Sup­pose the pro­ject has no other effects, such that al­most all hu­mans think it’s a good idea. And ev­ery­one knows about it. All it would take to stop the pro­ject is a mil­lion peo­ple vot­ing against it. Now, is there a sense in which Moloch would re­sist it or seek to un­der­mine the pro­ject? It would maaaybe in­cen­tivize most peo­ple not to con­tribute to the pro­ject (tragedy of the com­mons!) but that’s it. So ei­ther Moloch isn’t an agent, or it’s an agent that doesn’t care about dy­ing, or it’s an agent that doesn’t know it’s go­ing to die, or it’s a very weak agent—can’t even stop one pro­ject!)

Something could exhibit goal-like behaviour to outside viewers without having the internal structure of an agent. For example, a brick falling to the ground—we could say that it is aimed at a specific point on the ground, but it is not an agent. In the same way, an infectious disease can take over the world without being an agent. Moreover, even some humans are sometimes not agents.

In my opinion, an Oracle AI outputs only answers to questions, while a Tool AI can do other stuff, like continuous data stream transformation or controlling mechanisms.

National states, the human body, and OSs are all good and even clever at preserving a homeostatic state (except during a government shutdown)—but they typically achieve this not via high-level agential reasoning.

A swarm of agents could exhibit behaviour different from the behaviour or goals of any separate agent.

Hu­mans want to build pow­er­ful AI sys­tems in or­der to help them achieve their goals—it seems quite clear that hu­mans are at least par­tially goal-di­rected. As a re­sult, it seems nat­u­ral that they would build AI sys­tems that are also goal-di­rected.

This is re­ally an ar­gu­ment that the sys­tem com­pris­ing the hu­man and AI agent should be di­rected to­wards some goal. The AI agent by it­self need not be goal-di­rected as long as we get goal-di­rected be­hav­ior when com­bined with a hu­man op­er­a­tor. How­ever, in the situ­a­tion where the AI agent is much more in­tel­li­gent than the hu­man, it is prob­a­bly best to del­e­gate most or all de­ci­sions to the agent, and so the agent could still look mostly goal-di­rected.

Even so, you could imag­ine that even the small part of the work that the hu­man con­tinues to do al­lows the agent to not be goal-di­rected, es­pe­cially over long hori­zons.

An ad­di­tional is­sue is that if you have a com­pet­i­tive situ­a­tion, there may be an in­cen­tive to min­i­mize the amount of hu­man in­volve­ment in the sys­tem, in or­der to speed up re­sponse time and avoid los­ing ground to com­peti­tors. I dis­cussed this a bit in Disjunc­tive Sce­nar­ios of Catas­trophic AI Risk:

… the U.S. mil­i­tary is seek­ing to even­tu­ally tran­si­tion to a state where the hu­man op­er­a­tors of robot weapons are “on the loop” rather than “in the loop” (Wal­lach & Allen 2013). In other words, whereas a hu­man was pre­vi­ously re­quired to ex­plic­itly give the or­der be­fore a robot was al­lowed to ini­ti­ate pos­si­bly lethal ac­tivity, in the fu­ture hu­mans are meant to merely su­per­vise the robot’s ac­tions and in­terfere if some­thing goes wrong. While this would al­low the sys­tem to re­act faster, it would also limit the win­dow that the hu­man op­er­a­tors have for over­rid­ing any mis­takes that the sys­tem makes. For a num­ber of mil­i­tary sys­tems, such as au­to­matic weapons defense sys­tems de­signed to shoot down in­com­ing mis­siles and rock­ets, the ex­tent of hu­man over­sight is already limited to ac­cept­ing or over­rid­ing a com­puter’s plan of ac­tions in a mat­ter of sec­onds, which may be too lit­tle to make a mean­ingful de­ci­sion in prac­tice (Hu­man Rights Watch 2012).

Cur­rently ex­ist­ing re­motely pi­loted mil­i­tary “drones,” such as the U.S. Preda­tor and Reaper, re­quire a high amount of com­mu­ni­ca­tions band­width. This limits the amount of drones that can be fielded at once, and makes them de­pen­dent on com­mu­ni­ca­tions satel­lites which not ev­ery na­tion has, and which can be jammed or tar­geted by en­e­mies. A need to be in con­stant com­mu­ni­ca­tion with re­mote op­er­a­tors also makes it im­pos­si­ble to cre­ate drone sub­marines, which need to main­tain a com­mu­ni­ca­tions black­out be­fore and dur­ing com­bat. Mak­ing the drones au­tonomous and ca­pa­ble of act­ing with­out hu­man su­per­vi­sion would avoid all of these prob­lems.

Par­tic­u­larly in air-to-air com­bat, vic­tory may de­pend on mak­ing very quick de­ci­sions. Cur­rent air com­bat is already push­ing against the limits of what the hu­man ner­vous sys­tem can han­dle: fur­ther progress may be de­pen­dent on re­mov­ing hu­mans from the loop en­tirely.

Much of the rou­tine op­er­a­tion of drones is very monotonous and bor­ing, which is a ma­jor con­trib­u­tor to ac­ci­dents. The train­ing ex­penses, salaries, and other benefits of the drone op­er­a­tors are also ma­jor ex­penses for the mil­i­taries em­ploy­ing them.

Spar­row’s ar­gu­ments are spe­cific to the mil­i­tary do­main, but they demon­strate the ar­gu­ment that “any broad do­main in­volv­ing high stakes, ad­ver­sar­ial de­ci­sion mak­ing, and a need to act rapidly is likely to be­come in­creas­ingly dom­i­nated by au­tonomous sys­tems” (So­tala & Yam­polskiy 2015, p. 18). Similar ar­gu­ments can be made in the busi­ness do­main: elimi­nat­ing hu­man em­ploy­ees to re­duce costs from mis­takes and salaries is some­thing that com­pa­nies would also be in­cen­tivized to do, and mak­ing a profit in the field of high-fre­quency trad­ing already de­pends on out­perform­ing other traders by frac­tions of a sec­ond. While the cur­rently ex­ist­ing AI sys­tems are not pow­er­ful enough to cause global catas­tro­phe, in­cen­tives such as these might drive an up­grad­ing of their ca­pa­bil­ities that even­tu­ally brought them to that point.

In the ab­sence of suffi­cient reg­u­la­tion, there could be a “race to the bot­tom of hu­man con­trol” where state or busi­ness ac­tors com­peted to re­duce hu­man con­trol and in­creased the au­ton­omy of their AI sys­tems to ob­tain an edge over their com­peti­tors (see also Arm­strong et al. 2016 for a sim­plified “race to the precipice” sce­nario). This would be analo­gous to the “race to the bot­tom” in cur­rent poli­tics, where gov­ern­ment ac­tors com­pete to dereg­u­late or to lower taxes in or­der to re­tain or at­tract busi­nesses.

AI sys­tems be­ing given more power and au­ton­omy might be limited by the fact that do­ing this poses large risks for the ac­tor if the AI malfunc­tions. In busi­ness, this limits the ex­tent to which ma­jor, es­tab­lished com­pa­nies might adopt AI-based con­trol, but in­cen­tivizes star­tups to try to in­vest in au­tonomous AI in or­der to out­com­pete the es­tab­lished play­ers. In the field of al­gorith­mic trad­ing, AI sys­tems are cur­rently trusted with enor­mous sums of money de­spite the po­ten­tial to make cor­re­spond­ing losses—in 2012, Knight Cap­i­tal lost $440 mil­lion due to a glitch in their trad­ing soft­ware (Pop­per 2012, Se­cu­ri­ties and Ex­change Com­mis­sion 2013). This sug­gests that even if a malfunc­tion­ing AI could po­ten­tially cause ma­jor risks, some com­pa­nies will still be in­clined to in­vest in plac­ing their busi­ness un­der au­tonomous AI con­trol if the po­ten­tial profit is large enough.

U.S. law already al­lows for the pos­si­bil­ity of AIs be­ing con­ferred a le­gal per­son­al­ity, by putting them in charge of a limited li­a­bil­ity com­pany. A hu­man may reg­ister a limited li­a­bil­ity cor­po­ra­tion (LLC), en­ter into an op­er­at­ing agree­ment spec­i­fy­ing that the LLC will take ac­tions as de­ter­mined by the AI, and then with­draw from the LLC (Bay­ern 2015). The re­sult is an au­tonomously act­ing le­gal per­son­al­ity with no hu­man su­per­vi­sion or con­trol. AI-con­trol­led com­pa­nies can also be cre­ated in var­i­ous non-U.S. ju­ris­dic­tions; re­stric­tions such as ones for­bid­ding cor­po­ra­tions from hav­ing no own­ers can largely be cir­cum­vented by tricks such as hav­ing net­works of cor­po­ra­tions that own each other (LoPucki 2017). A pos­si­ble start-up strat­egy would be for some­one to de­velop a num­ber of AI sys­tems, give them some ini­tial en­dow­ment of re­sources, and then set them off in con­trol of their own cor­po­ra­tions. This would risk only the ini­tial re­sources, while promis­ing what­ever prof­its the cor­po­ra­tion might earn if suc­cess­ful. To the ex­tent that AI-con­trol­led com­pa­nies were suc­cess­ful in un­der­min­ing more es­tab­lished com­pa­nies, they would pres­sure those com­pa­nies to trans­fer con­trol to au­tonomous AI sys­tems as well.

I get why the MCTS is im­por­tant, but what about the train­ing? It seems to me that if we stop train­ing AlphaGo (Zero) and I play a game against it, it’s goal-di­rected even though we have stopped train­ing it.

Here are a few more rea­sons for hu­mans to build goal-di­rected agents:

Goal di­rected AI is a way to defend against value drift/​cor­rup­tion/​ma­nipu­la­tion. Peo­ple might be forced to build goal di­rected agents if they can’t figure out an­other way to do that.

Goal di­rected AI is a way to co­op­er­ate and thereby in­crease eco­nomic effi­ciency and/​or mil­i­tary com­pet­i­tive­ness. (A group of peo­ple can build a goal di­rected agent that they can ver­ify rep­re­sents an ag­gre­ga­tion of their val­ues.) Peo­ple might be forced to build or trans­fer con­trol to goal di­rected agents in or­der to par­ti­ci­pate in such co­op­er­a­tion to re­main com­pet­i­tive, un­less they can figure out an­other way to co­op­er­ate that is as effi­cient as this.

Goal di­rected AI is a way to ad­dress other hu­man safety prob­lems. Peo­ple might trust an AI with ex­plicit and ver­ifi­able val­ues more than an AI that is con­trol­led by a dis­tant stranger.

For the first one, I guess I would use “ar­gu­ment for defense against value drift” in­stead since you could con­ceiv­ably use a goal-di­rected AI to defend against value drift with­out lock in, e.g., by do­ing some­thing like Paul Chris­ti­ano’s 2012 ver­sion of in­di­rect nor­ma­tivity (which I don’t think is fea­si­ble but maybe there’s some­thing like it that is, like my hy­brid ap­proach, if you con­sider that goal-di­rected).

For the third one, I guess in­ter­pretabil­ity is part of it, but a big­ger prob­lem is that it seems hard to make a suffi­ciently trust­wor­thy hu­man over­seer even if we could “in­ter­pret” them. In other words, in­ter­pretabil­ity for a hu­man might just let us see ex­actly why we shouldn’t trust them.

That said, cur­rent RL agents learn to re­play be­hav­ior that in their past ex­pe­rience worked well, and typ­i­cally do not gen­er­al­ize out­side of the train­ing dis­tri­bu­tion. This does not seem like a search over ac­tions to find ones that are the best.

What is stop­ping AI re­searchers from us­ing RL to (end-to-end) train agents that do search over ac­tions to find ones that are the best? It seems like an ob­vi­ous next step to take in or­der to build agents that gen­er­al­ize bet­ter than cur­rent RL agents, doesn’t it? Is it just that the challenges they’ve at­tempted so far haven’t re­quired go­ing be­yond build­ing agents that are es­sen­tially just lossy com­pres­sions of be­hav­iors that work well on the train­ing dis­tri­bu­tion, or is there a fun­da­men­tal rea­son why us­ing RL to train goal-di­rected agents would be hard?

What is stop­ping AI re­searchers from us­ing RL to (end-to-end) train agents that do search over ac­tions to find ones that are the best?

That tech­nique is called model-based RL, and in prac­tice, given suffi­cient data and com­pute, it ends up perform­ing worse than model-free RL. (It does perform bet­ter in low-data regimes, and my guess is that it will also gen­er­al­ize slightly bet­ter but not much.) In model-based RL, you learn a model of the world, and then search over se­quences of ac­tions and take the one that seems best.

Spec­u­la­tion on why it doesn’t work: In prac­tice, your model of the world only makes good pre­dic­tions for states and ac­tions that you have already ex­pe­rienced. So search­ing over ac­tions for the best one ei­ther gives you some­thing you have already ex­pe­rienced, or some non­sense ac­tion (sort of like an ad­ver­sar­ial ex­am­ple for the world model).

It is worth not­ing that this isn’t end-to-end: the model is trained “end-to-end”, but the ac­tion se­lec­tion is typ­i­cally some hard­coded func­tion like “sam­ple 1000 tra­jec­to­ries from the model, choose the tra­jec­tory that gives the best re­ward, and take the first ac­tion of that tra­jec­tory”. I don’t know how you would train an agent end-to-end such that it ex­plic­itly learns to search over ac­tions (as op­posed to an im­plicit search that model-free RL al­gorithms might already be do­ing).
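For concreteness, here is a minimal sketch of that hardcoded action-selection rule (random-shooting planning; `model.step` and `reward_fn` are assumed interfaces, and the trajectory count and horizon are arbitrary):

```python
import numpy as np

def plan_with_learned_model(state, model, reward_fn, sample_action,
                            num_trajectories=1000, horizon=20):
    """Sample random action sequences, roll them out in the learned model,
    and return the first action of the highest-reward trajectory."""
    best_return, best_first_action = -np.inf, None
    for _ in range(num_trajectories):
        s, total_reward, first_action = state, 0.0, None
        for t in range(horizon):
            a = sample_action()
            if t == 0:
                first_action = a
            s = model.step(s, a)           # predicted next state
            total_reward += reward_fn(s, a)
        if total_reward > best_return:
            best_return, best_first_action = total_reward, first_action
    return best_first_action
```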

When you are given an ac­cu­rate model of the world, then you can in fact search over ac­tions and do much bet­ter, see for ex­am­ple value iter­a­tion or policy iter­a­tion. (Those are for very small en­vi­ron­ments, but you could cre­ate ap­prox­i­mate ver­sions for more com­plex en­vi­ron­ments.)
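And a minimal tabular value iteration sketch, as a concrete instance of searching over actions with an accurate model (this is the standard algorithm; the P/R data structures are my own choice of representation):

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """P[s][a] is a list of (probability, next_state) pairs; R[s][a] is the
    reward for taking action a in state s. Returns the optimal state values."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        V_new = np.array([
            max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s])))
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```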

Spec­u­la­tion on why it doesn’t work: In prac­tice, your model of the world only makes good pre­dic­tions for states and ac­tions that you have already ex­pe­rienced. So search­ing over ac­tions for the best one ei­ther gives you some­thing you have already ex­pe­rienced, or some non­sense ac­tion (sort of like an ad­ver­sar­ial ex­am­ple for the world model).

I don’t know how you would train an agent end-to-end such that it ex­plic­itly learns to search over actions

I was think­ing you could train the world model sep­a­rately at first, man­u­ally im­ple­ment an ini­tial ac­tion se­lec­tion method as a neu­ral net­work or some other kind of differ­en­tiable pro­gram, and then let RL act on the agent to op­ti­mize it as a whole.

What kind of im­plicit search are model-free RL al­gorithms already do­ing? If we just keep scal­ing up model-free RL, can they even­tu­ally be­come goal-di­rected agents through this kind of im­plicit search?

Our en­vi­ron­ment is suffi­ciently harsh and com­plex that ev­ery­thing is in-distribution

Our brains are so small and our en­vi­ron­ment is so harsh and com­plex that the only way that they can get good perfor­mance is to have struc­tured, mod­u­lar rep­re­sen­ta­tions, which lead to worse perfor­mance in dis­tri­bu­tion but bet­ter generalization

Some sys­tem that lets us know what we know, and only gen­er­ates ac­tions for con­sid­er­a­tion where we know what the con­se­quences will be

I don’t know. This is mostly an ex­pres­sion of un­cer­tainty about what model-free RL agents are do­ing. Maybe some of the mul­ti­pli­ca­tions and ad­di­tions go­ing on in there turn out to be equiv­a­lent to a search over ac­tions. Maybe not.

My in­tu­ition says “nah, our cur­rent en­vi­ron­ments are all sim­ple enough that you can solve them by us­ing heuris­tics to com­pute ac­tions, and the train­ing pro­cess is go­ing to dis­till those heuris­tics into the policy rather than turn­ing the policy into a search al­gorithm”. But even if I trust that in­tu­ition, there is some level of en­vi­ron­ment com­plex­ity at which this would stop be­ing true, and I don’t trust my in­tu­ition on what that level is.

If we just keep scal­ing up model-free RL, can they even­tu­ally be­come goal-di­rected agents through this kind of im­plicit search?

Plau­si­bly, but plau­si­bly not. I have con­flict­ing not-well-formed in­tu­itions that pull in both di­rec­tions.

Then you could try to cre­ate al­ter­na­tive de­signs for AI sys­tems such that they can do the things that goal-di­rected agents can do with­out them­selves be­ing goal-di­rected. You could also try to per­suade AI re­searchers of these facts, so that they don’t build goal-di­rected sys­tems.

I’m not sure this strategy is net positive. If dangerous AI (at least as dangerous as Slaughterbots) is developed before alignment is solved, the world is probably better off if the first visibly dangerous AI is goal-directed rather than, say, an Oracle. The former would probably be a much weaker optimization process and probably won’t result in an existential catastrophe; and perhaps it will make some governance solutions more feasible.

I’m not op­ti­miz­ing for rais­ing aware­ness via an “ob­vi­ous AI dis­aster” due to mul­ti­ple rea­sons, in­clud­ing the huge risk to the rep­u­ta­tion of the AI safety com­mu­nity and the unilat­er­al­ist’s curse.

I do think that when con­sid­er­ing whether to in­vest in an effort which might pre­vent re­cov­er­able near-term AI ac­ci­dents, one should con­sider the pos­si­bil­ity that the effort would pre­vent pivotal events (e.g. one that would have en­abled use­ful gov­er­nance solu­tions re­sult­ing in more time for al­ign­ment re­search).

Efforts that pre­vent re­cov­er­able near-term AI ac­ci­dents might be as­tro­nom­i­cally net-pos­i­tive if they help make AI al­ign­ment more main­stream in the gen­eral ML com­mu­nity.

(any­one who thinks I shouldn’t dis­cuss this pub­li­cly is wel­come to let me know via a PM or anony­mously here)

In this sce­nario, wouldn’t you even­tu­ally build a suffi­ciently pow­er­ful goal-di­rected AI that leads to an ex­is­ten­tial catas­tro­phe?

Per­haps the hope is that when ev­ery­one sees that the first goal-di­rected AI is visi­bly dan­ger­ous then they ac­tu­ally be­lieve that goal-di­rected AI is dan­ger­ous. But in the sce­nario where we are build­ing al­ter­na­tives to goal-di­rected AI and they are ac­tu­ally get­ting used, I would pre­dict that we have con­vinced most AI re­searchers that goal-di­rected AI is dan­ger­ous.

(Also, I think you can level this ar­gu­ment at nearly all AI safety re­search agen­das, with pos­si­bly the ex­cep­tion of Agent Foun­da­tions.)

I think I didn’t ar­tic­u­late my ar­gu­ment clearly, I tried to clar­ify it in my re­ply to Jes­sica.

I think my ar­gu­ment might be es­pe­cially rele­vant to the effort of per­suad­ing AI re­searchers not to build goal-di­rected sys­tems.

If a re­sult of this effort is con­vinc­ing more AI re­searchers in the gen­eral premise that x-risk from AI is some­thing worth wor­ry­ing about, then that’s a very strong ar­gu­ment in fa­vor of car­ry­ing out the effort (and I agree this re­sult should cor­re­late with con­vinc­ing AI re­searchers not to build goal-di­rected sys­tems—if that’s what you ar­gued in your com­ment).

Build­ing a non goal di­rected agent is like build­ing a cart out of non-wood ma­te­ri­als. Goal di­rected be­hav­ior is rel­a­tively well un­der­stood. We know that most goal di­rected de­signs don’t do what we want. Most ar­range­ments of wood do not form a func­tion­ing cart.

I sus­pect that a ran­domly se­lected agent from the space of all non goal di­rected agents is also use­less or dan­ger­ous, in much the same way that a ran­dom ar­range­ment of non wood ma­te­ri­als is.

Now there are a couple of regions of design space that are not goal directed and look like they contain useful AIs. We might be better off making our cart from iron, but iron has its own problems.

I sus­pect that a ran­domly se­lected agent from the space of all non goal di­rected agents is also use­less or dan­ger­ous, in much the same way that a ran­dom ar­range­ment of non wood ma­te­ri­als is.

Sure. We aren’t go­ing to choose an agent ran­domly.

Now there are a couple of regions of design space that are not goal directed and look like they contain useful AIs. We might be better off making our cart from iron, but iron has its own problems.