Human-AI Interaction

The importance of feedback

Consider trying to program a self-driving car to drive from San Francisco to Los Angeles—with no sensors that allow it to gather information as it is driving. This is possible in principle. If you can predict the exact weather conditions, the exact movement of all of the other cars on the road, the exact amount of friction along every part of the road surface, the exact impact of (the equivalents of) pressing the gas or turning the steering wheel, and so on, then you could compute ahead of time exactly how to control the car such that it gets from SF to LA. Nevertheless, it seems unlikely that we will ever be able to accomplish such a feat, even with powerful AI systems.

In practice, there is going to be some uncertainty about how the world will evolve, so any plan computed ahead of time will have errors that compound over the course of its execution. The solution is to use sensors to gather information while executing the plan, so that we can notice errors or deviations from the plan and take corrective action. It is much easier to build a controller that keeps you pointed in the general direction than to build a plan that will get you there perfectly without any adaptation.
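The contrast between an open-loop plan and a feedback controller can be sketched in a toy one-dimensional simulation (the unit-speed target, the disturbance range, and the gain of 0.5 are all illustrative assumptions, not a real control design):

```python
import random

def drive(steps, feedback_gain, seed=0):
    """Track a target moving at unit speed along a line, under random
    disturbances. With feedback_gain == 0 this is a pure open-loop plan
    (just move +1 each step); with feedback_gain > 0 a sensor reading of
    the current error is used to correct course."""
    rng = random.Random(seed)
    position, target = 0.0, 0.0
    for _ in range(steps):
        error = target - position             # requires a sensor to observe
        disturbance = rng.uniform(-0.5, 0.5)  # unpredictable environment
        target += 1.0
        position += 1.0 + feedback_gain * error + disturbance
    return abs(target - position)

# Open-loop errors accumulate like a random walk over 1000 steps, while
# the feedback controller keeps the final error bounded (here, <= 1.0).
open_loop = [drive(1000, 0.0, seed=s) for s in range(20)]
closed_loop = [drive(1000, 0.5, seed=s) for s in range(20)]
```

With gain 0.5 the error recursion is e' = 0.5·e − d, so the error can never exceed 1.0 no matter what the bounded disturbances do; without feedback the disturbances simply sum up.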

Control theory studies these sorts of systems, and you can see the general power of feedback controllers in the theorems that can be proven. Especially for motion tasks, you can build feedback controllers that are guaranteed to safely achieve the goal, even in the presence of adversarial environmental forces (that are bounded in size, so you can't have arbitrarily strong wind). In the presence of an adversary, in most environments it becomes impossible even in principle to make such a guarantee if you do not have any sensors or feedback and must compute a plan in advance. Typically, for every such plan, there is some environmental force that would cause it to fail.

The control theory perspective on AI alignment

With ambitious value learning, we're hoping that we can learn a utility function that tells us the optimal thing to do arbitrarily far into the future. Such a utility function needs to encode exactly how to behave in all possible environments, no matter what new things happen in the future, even things we humans have never considered a possibility so far.

This is analogous to the problem of trying to program a self-driving car. Just as in that case, we might hope that we can solve the problem by introducing sensors and feedback. In this case, the “feedback” would be human data that informs our AI system what we want it to do, that is, data that can be used to learn values. The evolution of human values and preferences in new environments with new technologies is analogous to the unpredictable environmental disturbances that control theory assumes.

This does not mean that an AI system must be architected in such a way that human data is explicitly used to “control” the AI every few timesteps in order to keep it on track. It does mean that any AI alignment proposal should have some method of incorporating information about what humans want in radically different circumstances. I have found this an important frame with which to view AI alignment proposals. For example, with indirect normativity or idealized humans, it's important that the idealized or simulated humans go through experiences similar to those that real humans go through, so that they provide good feedback.

Feedback through interaction

Of course, while the control theory perspective does not require the feedback controller to be explicit, one good way to ensure that there is feedback would be to make it explicit. This would mean that we create an AI system that explicitly collects fresh data about what humans want in order to inform what it should do. This is basically calling for an AI system that is constantly using tools from narrow value learning to figure out what to do. In practice, this will require interaction between the AI and the human. However, there are still issues to think about:

Convergent instrumental subgoals: A simple way of implementing human-AI interaction would be to have an estimate of a reward function that is continually updated using narrow value learning. Whenever the AI needs to choose an action, it uses the current reward estimate to choose.
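That setup can be sketched as an act-then-update loop. Everything here is a hypothetical illustration: the feature names, the greedy action selection, and the simple interpolating update rule standing in for a real narrow value learning algorithm are all assumptions, not an actual system.

```python
def narrow_value_learning_update(estimate, human_feedback, lr=0.5):
    """Nudge each feature weight of the reward estimate toward the
    human's latest feedback (a stand-in for real narrow value learning)."""
    return {f: w + lr * (human_feedback.get(f, w) - w)
            for f, w in estimate.items()}

def choose_action(actions, estimate):
    """Greedy choice under the *current* reward estimate."""
    return max(actions, key=lambda a: sum(estimate[f] * v
                                          for f, v in a["features"].items()))

# Illustrative reward features and candidate actions.
reward_estimate = {"task_progress": 1.0, "human_approval": 0.0}
actions = [
    {"name": "work",      "features": {"task_progress": 1.0, "human_approval": 0.2}},
    {"name": "ask_human", "features": {"task_progress": 0.1, "human_approval": 1.0}},
]

# Interaction loop: act on the current estimate, then fold in fresh feedback.
for feedback in [{"human_approval": 1.0}, {"human_approval": 1.0}]:
    action = choose_action(actions, reward_estimate)
    reward_estimate = narrow_value_learning_update(reward_estimate, feedback)
```

The important structural feature is that the estimate used for action selection is always the current one, which is exactly what creates the incentive discussed next.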

With this sort of setup, we still have the problem that we are maximizing a reward function, which leads to convergent instrumental subgoals. In particular, the plan “disable the narrow value learning system” is likely very good according to the current estimate of the reward function, because it prevents the reward from changing, causing all future actions to continue to optimize the current reward estimate.

Another way of seeing that this setup is a bit weird is that it has inconsistent preferences over time—at any given point in time, it treats the expected change in its reward as an obstacle that should be undone if possible.

That said, it is worth noting that in this setup, the goal-directedness is coming from the human. In fact, any approach where goal-directedness comes from the human requires some form of human-AI interaction. We might hope that some system of this form allows us to have a human-AI system that is overall goal-directed (in order to achieve economic efficiency), while the AI system itself is not goal-directed, and so the overall system pursues the human's instrumental subgoals. The next post will talk about reward uncertainty as a potential approach to get this behavior.

Humans are unable to give feedback: As our AI systems become more and more powerful, we might worry that they are able to vastly outthink us, such that they would need our feedback on scenarios that are too hard for us to comprehend.

On the one hand, if we're actually in this scenario I feel quite optimistic: if the questions are so difficult that we can't answer them, we've probably already solved all the simple parts of the reward, which means we've probably stopped x-risk.

But even if it is imperative that we answer these questions accurately, I'm still optimistic: as our AI systems become more powerful, we can have better AI-enabled tools that help us understand the questions on which we are supposed to give feedback. This could be AI systems that do cognitive work on our behalf, as in recursive reward modeling, or it could be AI-created technologies that make us more capable, such as brain enhancement or the ability to be uploaded and have bigger “brains” that can understand larger things.

Humans don’t know the goal: An important disanalogy between the control theory/self-driving car example and the AI alignment problem is that in control theory it is assumed that the general path to the destination is known, and we simply need to stay on it; whereas in AI alignment even the human does not know the goal (i.e. the “true human reward”). As a result, we cannot rely on humans to always provide adequate feedback; we also need to manage the process by which humans learn what they want. Concerns about human safety problems and manipulation fall into this bucket.

Summary

If I want an AI system that acts autonomously over a long period of time, but it isn’t doing ambitious value learning (only narrow value learning), then we necessarily require a feedback mechanism that keeps the AI system “on track” (since my instrumental values will change over that period of time).

While the feedback mechanism need not be explicit (and could arise simply because it is an effective way to actually help me), we could consider AI designs that have an explicit feedback mechanism. There are still many problems with such a design; most notably, at any given point the AI system looks like it could be goal-directed with a long-term reward function, which is the sort of system we are most worried about.

One small piece of such interaction could be the rephrasing of commands:

Human: “I want an apple.”

Robot: “Do you want a computer or a fruit?”

Another form of interaction is presenting the plan of actions, perhaps drawing it as a visual image:

Robot: “To give you an apple, I have to go to the shop, which will take at least one hour.”

Human: “No, just find the apple in the refrigerator.”

The third way is to confirm that the human still wants X after a reasonable amount of time, say, every day:

Robot: “Yesterday you asked for an apple. Do you still want it?”

The fourth is sending a report after every timestep, describing how the project is going and which new subgoals have appeared:

Robot: “There are no apples in the shop. I am going to another village, but it will take two more hours.”

Human: “No, that is too long; buy me an orange instead.”
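The four interaction patterns above could be sketched as hooks in a task wrapper. All names, message formats, and the once-a-day interval are hypothetical choices for illustration, not a proposed design:

```python
from dataclasses import dataclass, field
import time

@dataclass
class InteractiveTask:
    """Hypothetical wrapper implementing the four patterns above:
    clarification, plan confirmation, periodic re-confirmation,
    and progress reports on new subgoals."""
    request: str
    confirm_interval: float = 24 * 3600  # re-confirm once a day (illustrative)
    last_confirmed: float = field(default_factory=time.time)

    def clarify(self, interpretations):
        # 1. Rephrase an ambiguous command and ask which reading is meant.
        return f"Do you want {' or '.join(interpretations)}?"

    def present_plan(self, plan, eta_hours):
        # 2. Show the plan and its cost before acting on it.
        return f"To {self.request}, I will {plan}; this takes ~{eta_hours}h. OK?"

    def needs_reconfirmation(self, now):
        # 3. After a long enough delay, check the human still wants the task.
        return now - self.last_confirmed >= self.confirm_interval

    def report(self, obstacle, new_subgoal):
        # 4. Surface unexpected obstacles and the new subgoals they create.
        return f"{obstacle}; new subgoal: {new_subgoal}. Proceed?"

task = InteractiveTask("get you an apple")
```

Each hook gives the human a chance to veto or redirect before the robot commits further resources, which is the common thread in all four dialogues.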

The general intuition pump for such interactions is the relationship between a human and an ideal human secretary, and such a pump could even be used to train the robot. Again, this type of learning is possible only after the biggest part of AI safety is solved, or the robot will go foom after the first question.

I agree that all of these seem like good aspects of human-AI interaction to have, especially for narrow AI systems. For superhuman AI systems, there's a question of how much of this the AI should infer for itself vs. make sure to ask the human.

There is a problem of “moral unemployment”—that is, if a superintelligent AI does all the hard work of analyzing “what I should want”, it will strip me of the last pleasant duty I may have.

E.g.: Robot: “I know that your deepest desire, which you may not be fully aware of but would, after a lot of suffering, learn for sure, is to write a novel. And I have already written this novel for you—the best one you could possibly write.”

In the previous “Ambitious vs. narrow value learning” post, Paul Christiano characterized narrow value learning as learning “subgoals and instrumental values”. From that post, I got the impression that ambitious vs. narrow was about the scope of the task. However, in this post you suggest that ambitious vs. narrow value learning is about the amount of feedback the algorithm requires. I think there is actually a 2x2 matrix of possible approaches here: we can imagine approaches which do or don’t depend on feedback, and we can imagine approaches which try to learn all of my values or just some instrumental subset.

With this sort of setup, we still have the problem that we are maximizing a reward function, which leads to convergent instrumental subgoals. In particular, the plan “disable the narrow value learning system” is likely very good according to the current estimate of the reward function, because it prevents the reward from changing, causing all future actions to continue to optimize the current reward estimate.

I think it depends on the details of the implementation:

We could construct the system’s world model so that human feedback is a special event that exists in a separate magisterium from the physical world, and the system doesn’t believe any action taken in the physical world could affect the type or quantity of human feedback that’s given.

For redundancy, if the narrow value learning system is trying to learn how much humans approve of various actions, we can tell the system that the negative score from our disapproval of tampering with the value learning system outweighs any positive score it could achieve through tampering.

If the reward function weights rewards according to the certainty of the narrow value learning system that they are the correct reward, that creates an incentive to keep the narrow value learning system operating, so the narrow value learning system can acquire greater certainty and provide a greater reward.
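This incentive can be illustrated with a toy value-of-information calculation. The two candidate “worlds” (reward hypotheses), their probabilities, and the payoffs are all made-up numbers for illustration:

```python
# The learner is currently 60/40 unsure which of two candidate reward
# functions is correct. Disabling the learner forces the agent to commit
# to one action under that uncertainty; keeping the learner running lets
# it resolve the uncertainty first and then act.
p_a, p_b = 0.6, 0.4
reward = {            # reward of each action under each candidate
    "act_1": {"world_a": 1.0, "world_b": -1.0},
    "act_2": {"world_a": -1.0, "world_b": 1.0},
}

def expected(action):
    return p_a * reward[action]["world_a"] + p_b * reward[action]["world_b"]

# Best the agent can do after disabling the learner: commit now.
value_if_disabled = max(expected(a) for a in reward)

# If the learner keeps running, the agent first learns which reward is
# correct, then picks the best action in each world.
value_if_kept = (p_a * max(r["world_a"] for r in reward.values())
               + p_b * max(r["world_b"] for r in reward.values()))
```

Because acting after the uncertainty is resolved is worth strictly more here, the agent prefers to keep the value learning system running rather than tamper with it.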

To elaborate a bit on the first two bullet points: it matters a lot whether the system thinks our approval is contingent on the physical configuration of the atoms in our brains. If the system thinks we will continue to disapprove of an action even after it has reconfigured our brains’ atoms, that’s what we want.

However, in this post you suggest that ambitious vs. narrow value learning is about the amount of feedback the algorithm requires.

That wasn’t exactly my point. My main point was that if we want an AI system that acts autonomously over a long period of time (think centuries), but it isn’t doing ambitious value learning (only narrow value learning), then we necessarily require a feedback mechanism that keeps the AI system “on track” (since my instrumental values will change over that period of time). Will add a summary sentence to the post.

I think it depends on the details of the implementation

Agreed, I was imagining the “default” implementation (e.g. as in this paper).

For redundancy, if the narrow value learning system is trying to learn how much humans approve of various actions, we can tell the system that the negative score from our disapproval of tampering with the value learning system outweighs any positive score it could achieve through tampering.

Something along these lines seems promising; I hadn’t thought of this possibility before.

If the reward function weights rewards according to the certainty of the narrow value learning system that they are the correct reward, that creates an incentive to keep the narrow value learning system operating, so the narrow value learning system can acquire greater certainty and provide a greater reward.

Yeah, uncertainty can definitely help get around this problem. (See also the next post, which should hopefully go up soon.)

When thinking about how a smarter-than-human AI would treat human input to close the control loop, it pays to consider the cases where humans are the smarter intelligence. How do we close the loop when dealing with young children? Primates/dolphins/magpies? Dogs/cats? Fish? Insects? Bacteria? In all these cases the apparent values/preferences of the “environment” are basically adversarial, something that must be taken into account, but definitely not obeyed. In the original setup, a super-intelligent aligned AI’s actions would be incomprehensible to us, no matter how much it tried to explain them to us (go explain to a baby that eating all the chocolate it wants is not a good idea, or to a cat that its favorite window must remain closed). Again, in the original setup it could be as drastic as an AI culling the human population to help save us from a worse fate. Sadly, this is not far from the “God works in mysterious ways” excuse one hears as a universal answer to the questions of theodicy.