We kind of open with how agents have to grow and learn and be stable, but talk most of the time about this two-agent problem, where there is an initial agent and a successor agent. When thinking about it as the succession problem, it seems like a bit of a stretch as a fundamental part of agency. The first two sections were about how agents have to make decisions and have models, and choosing a successor does not seem like as much of a fundamental part of agency. However, when you think of it as an agent having to stably continue to optimize over time, it seems a lot more fundamental.

So, I want to emphasize that when we say there are multiple forms of the problem, like choosing successors or learning/growing over time, the view in which these are different at all is a dualistic view. To an embedded agent, the future self is not privileged; it is just another part of the environment, so there is no difference between making a successor and preserving your own goals.

It feels very different to humans. This is because it is much easier for us to change ourselves over time than it is to make a clone of ourselves and change the clone, but that difference is not fundamental.

I think this has been the longest and most informationally dense/complicated part of the sequence so far. It's a lot to take in, and definitely worth reading a couple times. That said, this is a great sequence, and I look forward to the next installment.

I want to expand a bit on adversarial Goodhart, which this post describes as when another agent actively attempts to make the metric fail, and which the paper I wrote with Scott split into several sub-categories, but which I now think of in somewhat simpler terms. There is nothing special happening in the multi-agent setting in terms of metrics or models; it's the same three failure modes we see in the single-agent case.

What changes more fundamentally is that there are now coordination problems, resource contention, and game-theoretic dynamics that make the problem potentially much worse in practice. I'm beginning to think of these multi-agent issues as a problem closely related both to the other parts of embedded agency (needing small models of complex systems, reflexive consistency, and needing self-models) and to issues less intrinsically about embedded agency: coordination problems and game-theoretic competition.

I usually think of the non-wireheading preference in terms of multiple values: humans value both freedom and pleasure. We are not willing to fully maximize one at the expense of the other. Wireheading is always defined by giving up freedom of action by maximizing "pleasure", defined in some way that does not include choice.

Could anyone help explain the difference between an unbiased estimator and a Bayes estimator? The unbiased estimator is unbiased on what exactly? The already existing data or new data points? And surely the Bayes estimator is unbiased in some sense as well, but in what sense?

Update: If I understand correctly, the unbiased estimator would be the estimate that you should make if you drew a new point with a given x-value, while the Bayes estimator takes into account the fact that you don't just have a single point, but that you also know its ranking within a distribution (in particular, that outliers from a distribution are likely to have a higher error term).

I think you've got a lot of the core idea. But it's not important that we know that the data point has some ranking within a distribution. Let me try and explain the ideas as I understand them.

The unbiased estimator is unbiased in the sense that, for any actual value of the thing being estimated, the expected value of the estimate across the possible data is the true value.
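In symbols (the standard textbook formulation, using notation I'm introducing here rather than notation from this thread): an estimator $\hat{\theta}(X)$ of a parameter $\theta$ is unbiased if

$$\mathbb{E}\left[\hat{\theta}(X)\mid\theta\right]=\theta \quad\text{for every value of }\theta,$$

where the expectation is over the possible data $X$ generated when the true value is $\theta$.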

To be concrete, suppose I tell you that I will generate a true value, and then add either +1 or −1 to it with equal probability. An unbiased estimator is just to report back the value you get:

E[estimate(observed)] = estimate(x + 1)/2 + estimate(x − 1)/2, where x is the true value.

If the estimate function is the identity, we have ((x + 1) + (x − 1))/2 = x. So it's unbiased.

Now suppose I tell you that I will generate the true value by drawing from a normal distribution with mean 0 and variance 1, and then I tell you 23,000 as the reported value. Via Bayes, you can see that it is more likely that the true value is 22,999 than 23,001. But the unbiased estimator blithely reports 23,000.
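Here is a minimal sketch of that Bayes step (Python; the standard-normal prior and the ±1 noise model come from the example above, while the function names are my own):

```python
import math

def log_normal_pdf(x, mean=0.0, var=1.0):
    """Log-density of a normal distribution (logs avoid underflow this far out in the tail)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def log_posterior_odds(reported):
    """Log odds that the true value is reported - 1 rather than reported + 1,
    given a standard-normal prior and +/-1 noise with equal probability.
    The two noise likelihoods (1/2 each) cancel, leaving a ratio of prior densities."""
    return log_normal_pdf(reported - 1) - log_normal_pdf(reported + 1)

print(log_posterior_odds(23_000))  # 46000.0, i.e. odds of e^46000 to 1 in favor of 22,999
```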

So, though the asymmetry is doing some work here (the further we move above 0, the more likely that +1 rather than −1 is doing some of the work), it could still be that 23,000 is the smallest of the values I sampled.

"So, though the asymmetry is doing some work here (the further we move above 0, the more likely that +1 rather than −1 is doing some of the work), it could still be that 23,000 is the smallest of the values I sampled" That's very interesting.

So I looked at the definition on Wikipedia and it says: "An estimator is said to be unbiased if its bias is equal to zero for all values of parameter θ."

This greatly clarifies the situation for me, as I had thought that the bias was a global aggregate, rather than a value calculated for each value of the parameter being optimised (say, basketball ability). Bayesian estimates are only unbiased in the former, weaker sense. For normal distributions, the Bayesian estimate is happy to underestimate the extremeness of values in order to narrow the probability distribution of predictions for less extreme values. In other words, it is accepting a level of bias in order to narrow the range.
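To make the two senses of "unbiased" concrete, here is a minimal simulation sketch (Python; the normal prior, the posterior-mean shrinkage factor, and all specific numbers are illustrative assumptions of mine, not taken from the thread):

```python
import random

TAU2, SIGMA2 = 1.0, 1.0          # prior variance and noise variance (illustrative)
SHRINK = TAU2 / (TAU2 + SIGMA2)  # posterior-mean shrinkage factor (0.5 here)

def bayes_estimate(observed):
    """Posterior mean under a N(0, TAU2) prior with N(0, SIGMA2) noise."""
    return SHRINK * observed

# Bias at a fixed parameter value: hold the true value at 2.0, average over noise.
theta = 2.0
errors = [bayes_estimate(theta + random.gauss(0, SIGMA2 ** 0.5)) - theta
          for _ in range(100_000)]
print(sum(errors) / len(errors))  # about -1.0: biased toward 0 for this theta

# Aggregate bias: also draw the true value from the prior each time.
errors = []
for _ in range(100_000):
    theta = random.gauss(0, TAU2 ** 0.5)
    errors.append(bayes_estimate(theta + random.gauss(0, SIGMA2 ** 0.5)) - theta)
print(sum(errors) / len(errors))  # about 0: unbiased on average over the prior
```

The first average is nonzero because the posterior mean shrinks every observation toward the prior mean; the second is roughly zero because, over parameter values drawn from the prior, the over- and underestimates cancel.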

Reading this, I had an idea about using the reward-hacking capability to self-limit an AI's power. Maybe it was already discussed somewhere?

In this setup, the AI's reward function is protected by a task of some known complexity (e.g. cryptography, or a need to create nanotechnology). If the AI increases its intelligence above a certain level, it will be able to hack its reward function, and then the AI will stop.

This gives us a chance to create a fuse against uncontrollably self-improving AI: if it becomes too clever, it will self-terminate.

Also, the AI may do useful work while trying to hack its own reward function, like creating nanotech or solving certain types of equations or math problems (especially in cryptography).