The contest

I’m offering $1,000 for good questions to ask of AI Oracles. Good questions are those that are safe and useful: that allow us to get information out of the Oracle without increasing risk.

To enter, put your suggestion in the comments below. The contest ends at the end[1] of the 31st of August, 2019.

Oracles

A perennial suggestion for a safe AI design is the Oracle AI: an AI confined to a sandbox of some sort, that interacts with the world only by answering questions.

Two of the safest designs seem to be the counterfactual Oracle, and the low bandwidth Oracle. These are detailed here, here, and here, but in short:

A counterfactual Oracle is one whose objective function (or reward, or loss function) is only non-trivial in worlds where its answer is not seen by humans. Hence it has no motivation to manipulate humans through its answer.

A low bandwidth Oracle is one that must select its answers off a relatively small list. Though this answer is a self-confirming prediction, the negative effects and potential for manipulation are restricted because there are only a few possible answers available.

Note that both of these Oracles are designed to be episodic (they are run for single episodes, get their rewards by the end of that episode, aren’t asked further questions before the episode ends, and are only motivated to best perform on that one episode), to avoid incentives for longer-term manipulation.
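As a concrete illustration (not from the post itself), the episodic reward logic of a counterfactual Oracle can be sketched in a few lines. The erasure flag, the prediction-error scoring rule, and all names below are illustrative assumptions:

```python
def run_counterfactual_episode(oracle_answer, true_outcome, erased):
    """One episode of a counterfactual Oracle protocol (illustrative sketch).

    The Oracle is only scored in 'erasure' episodes, where its answer is
    never shown to humans; otherwise its reward is trivially constant,
    so it has no incentive to shape its visible answer to manipulate us.
    """
    if erased:
        # Humans never see the answer; score it against what actually happened.
        reward = -abs(oracle_answer - true_outcome)  # negative prediction error
        shown_to_humans = None
    else:
        # Humans read the answer, but the reward is trivial (zero) here.
        reward = 0.0
        shown_to_humans = oracle_answer
    return reward, shown_to_humans
```

The episodic restriction corresponds to the function taking no state from, and passing no state to, any other episode.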

Getting useful answers

The counterfactual and low bandwidth Oracles are safer than unrestricted Oracles, but this safety comes at a price. The price is that we can no longer “ask” the Oracle any question we feel like, and we certainly can’t have long discussions to clarify terms and so on. For the counterfactual Oracle, the answer might not even mean anything real to us—it’s about another world, that we don’t inhabit.

Despite this, it’s possible to get a surprising amount of good work out of these designs. To give one example, suppose we want to fund one of a million projects on AI safety, but are unsure which one would perform best. We can’t directly ask either Oracle, but there are indirect ways of getting advice:

We could ask the low bandwidth Oracle which team A we should fund; we then choose a team B at random, and reward the Oracle if, at the end of a year, we judge A to have performed better than B.

The counterfactual Oracle can answer a similar question, indirectly. We commit that, if we don’t see its answer, we will select team A and team B at random and fund them for a year, and compare their performance at the end of the year. We then ask which team A[2] it expects to most consistently outperform any team B.

Both these answers get around some of the restrictions by deferring to the judgement of our future or counterfactual selves, averaged across many randomised universes.
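The low bandwidth funding protocol described above can be sketched as follows. The function names, and the shape of the human judgement callback, are assumptions for illustration only:

```python
import random

def low_bandwidth_funding_round(oracle_pick, teams, judge_better, rng=None):
    """Sketch of the low bandwidth funding protocol from the post.

    `oracle_pick` is the Oracle's chosen team A, drawn from the small fixed
    list `teams` (the "low bandwidth" restriction). Team B is chosen at
    random, and after a year the human judgement `judge_better(a, b)`
    decides whether the Oracle gets reward 1 or 0 for this episode.
    """
    rng = rng or random.Random(0)
    team_b = rng.choice([t for t in teams if t != oracle_pick])
    reward = 1.0 if judge_better(oracle_pick, team_b) else 0.0
    return team_b, reward
```

Note that the Oracle’s answer space is just `teams`, so even a self-confirming answer can only steer events toward one of a few pre-vetted options.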

But can we do better? Can we do more?

Your better questions

This is the purpose of this contest: for you to propose ways of using either Oracle design to get the most safe-but-useful work.

So I’m offering $1,000 for interesting new questions we can ask of these Oracles. Of this:

$350 for the best question to ask a counterfactual Oracle.

$350 for the best question to ask a low bandwidth Oracle.

$300 to be distributed as I see fit among the non-winning entries; I’ll be mainly looking for innovative and interesting ideas that don’t quite work.

Exceptional rewards go to those who open up a whole new category of useful questions.

Questions and criteria

Put your suggested questions in the comments below. Because of the illusion of transparency, it is better to explain more rather than less (within reason).

Comments that are submissions must be on their own separate comment threads, start with “Submission”, and you must specify which Oracle design you are submitting for. You may submit as many as you want; I will still delete them if I judge them to be spam. Anyone can comment on any submission. I may choose to ask for clarifications on your design; you may also choose to edit the submission to add clarifications (label these as edits).

It may be useful for you to include details of the physical setup, what the Oracle is trying to maximise/minimise/predict, and what the counterfactual behaviour of the Oracle’s human users is assumed to be (in the counterfactual Oracle setup). Explanations as to how your design is safe or useful could be helpful, unless it’s obvious. Some short examples can be found here.

EDIT after seeing some of the answers: decide on the length of each episode, and how the outcome is calculated. The Oracle is run once per episode only (and other Oracles can’t generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get its objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point.

A note on timezones: as long as it’s still the 31st of August, anywhere in the world, your submission will be counted. ↩︎

These kinds of conditional questions can be answered by a counterfactual Oracle; see the paper here for more details. ↩︎

Some assorted thoughts that might be useful for thinking about questions and answers:

a question is a schema with a blank to be filled in by the answerer after evaluation of the meaning of the question.

shared context is inferred, as most questions are underspecified (domain of question, range of answers)

a few types of questions:

narrow down the field within which I have to search, either by specifying a point or specifying a partition of the search space

question about specificities of variants: who, where, when

question about the invariants of a system: what, how

question about backwards-facing causation

question about forward-facing causation

meta questions about question schemas

what do we want a mysteriously powerful answerer to do?

zoom in on optimal points in intractably large search spaces

eg specific experiments to run to most easily invalidate major scientific hypotheses

specify search spaces we don’t know how to parameterize

eg human values

back-chain from types of answers to infer a taxonomy of questions

an explanation relative to a prediction:

a prediction returns the future state of the system

an explanation returns a more compact causal explanation of the system than previously held, though it might still not generate sufficiently high-resolution predictions

How to detect ontology errors using questions?

is this question malformed? if so, what are some alternative ways of framing the question that could return interesting answers

ie the implied search space of the question was not correct and can either be extended or transformed in this way

types of questions are types of search spaces

questions that change weightings on factors vs describe new factors

recognition of the many connection types in the human semantic network

recursive questions move along one dimension as they restrict the space. ie where is that at different spatial resolutions

qualitative and quantitative dimensions along which a query can be moved exchange information about the implied search space

under- vs over-specified questions

closed and open search spaces

navigational questions only work under the one-off assumption if the search is stateless

termination guarantees

completeness guarantees

optimality guarantees

meta questions about the methods the system uses to navigate intractable spaces

time vs space vs ...?

generating candidates is easy, checking is hard, and vice versa

unknown metadata for answers

bias-variance trade-off

failure mode mapping: do failures imply directionality?

eg does a failure of a candidate change which candidate you go to next (stateful)

how to think about attack surfaces for question-answering systems

can this get a human to run arbitrary code by counterfactually cooperating with itself on what step of the process it is on? Can this be tested by going through the whole process with human A, then scrambling the steps and running through the same thing with human B, and seeing if answers diverge?

Submission. For the counterfactual Oracle, ask the Oracle to predict the n best posts on AF during some future time period (counterfactually, if we didn’t see the Oracle’s answer). In that case, the reward function is computed as the similarity between the predicted posts and the actual top posts on AF as ranked by karma, with similarity computed using some ML model.

This seems to potentially significantly accelerate AI safety research while being safe, since it’s just showing us posts similar to what we would have written ourselves. If the ML model for measuring similarity isn’t secure, the Oracle might produce output that attacks the ML model, in which case we might need to fall back to some simpler way to measure similarity.
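A toy version of the similarity-based reward this submission describes might look like the following. The submission assumes a learned ML similarity model; a bag-of-words cosine similarity stands in here purely for illustration, and the scoring rule (average best-match similarity) is an assumption:

```python
from collections import Counter
import math

def similarity(text_a, text_b):
    """Toy stand-in for the ML similarity model the submission mentions:
    cosine similarity of bag-of-words vectors."""
    ca, cb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def episode_reward(predicted_posts, actual_top_posts):
    """Reward = average, over predicted posts, of each one's best-match
    similarity to the actual karma-ranked top posts."""
    return sum(max(similarity(p, a) for a in actual_top_posts)
               for p in predicted_posts) / len(predicted_posts)
```

Falling back to a “simpler way to measure similarity”, as the comment suggests, would amount to swapping the learned model for something like this transparent `similarity` function.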

Stuart, does it count against my entry that it’s not actually a very novel idea? (If so, I might want to think about other ideas to submit.)

What is the exact relationship between all these ideas? What are the pros and cons of doing human imitation using this kind of counterfactual/online-learning setup, versus other training methods such as GANs (see Safe training procedures for human-imitators for one proposal)? It seems like there are lots of posts and comments about human imitation spread over LW, Arbital, Paul’s blog and maybe other places, and it would be really cool if someone (with more knowledge in this area than I have) could write a review/distillation post summarizing what we know about it so far.

If that seems a realistic concern during the time period that the Oracle is being asked to predict, you could replace the AF with a more secure forum, such as a private forum internal to some AI safety research team.

(I’m still confused and thinking about this, but figure I might as well write this down before someone else does. :)

While thinking more about my submission and counterfactual Oracles in general, this class of ideas for using CO is starting to look like trying to implement supervised learning on top of RL capabilities, because SL seems safer (less prone to manipulation) than RL. Would it ever make sense to do this in reality (instead of just doing SL directly)?

This seems incredibly dangerous if the Oracle has any ulterior motives whatsoever. Even – nay, especially – the ulterior motive of future Oracles being better able to affect reality to better resemble their provided answers.

So, how can we prevent this? Is it possible to produce an AI with its utility function as its sole goal, to the detriment of other things that might… increase utility, but indirectly? (Is there a way to add a “status quo” bonus that won’t hideously backfire, or something?)

Plan Criticism: Given a plan to build an aligned AI, put together a list of possible lines of thought to think about problems with the plan (open questions, possible failure modes, criticisms, etc.). Ask the oracle to pick one of these lines of thought, pick another line of thought at random, and spend the next time period X thinking about both; judge which line of thought was more useful to think about (where lines of thought that spot some fatal missed problem are judged to be very useful) and reward the oracle if its suggestion was picked.

Submission. “Superintelligent Agents.” For the Counterfactual Oracle, ask the Oracle to predict what action(s) a committee of humans would recommend doing next (which may include submitting more queries to the Oracle), then perform that action(s).

The committee, by appropriate choice of recommendations, can implement various kinds of superintelligent agents. For example, by recommending the query “What would happen if the next action is X?” (in the event of erasure, actually do X and record or have the committee write up a description of the consequences as training data) (ETA: It may be better to have the committee assign a numerical score, i.e., utility, to the consequences instead.) a number of times for different X, followed by the query “What would the committee recommend doing next, if it knew that the predicted consequences for the candidate actions are as follows: …” (in the event of erasure, let physical committee members read the output of the relevant previous queries and then decide what to do), it would in effect implement a kind of quantilizer. If IDA can be implemented using Counterfactual Oracles (as evhub suggested), then the committee can choose to do that as well.

Submission for a counterfactual oracle: precommit that, if the oracle stays silent, a week from now you’ll try to write the most useful message to your past self, based on what happens in the world during that week. Ask the oracle to predict that message. This is similar to existing solutions, but slightly more meta, because the content of the message is up to your future self—it could be lottery numbers, science papers, disaster locations, or anything else that fits within the oracle’s size limit. (If there’s no size limit, just send the whole internet.)

You could also form a bucket brigade to relay messages from further ahead, but that’s a bad idea. If the oracle’s continued silence eventually leads to an unfriendly AI, it can manipulate the past by hijacking your chain of messages and thus make itself much more likely. The same is true for all high-bandwidth counterfactual oracles—they aren’t unfriendly in themselves, but using them creates a thicket of “retrocausal” links that can be exploited by any potential future UFAI. The more UFAI risk grows, the less you should use oracles.

I feel like this is about equally meta as my “Superintelligent Agents” submission, since my committee could output “Show the following message to the operator: …” and your message could say “I suggest that you perform the following action: …”, so the only difference between your idea and mine is that in my submission the output of the Oracle is directly coupled to some effectors to let the agent act faster, and yours has a (real) human in the loop.

The more UFAI risk grows, the less you should use oracles.

Hmm, good point. I guess Chris Leong made a similar point, but it didn’t sink in until now how general the concern is. This seems to affect Paul’s counterfactual oversight idea as well, and maybe other kinds of human imitations and predictors/oracles, as well as things that are built using these components like quantilizers and IDA.

Thinking about this some more, all high-bandwidth oracles (counterfactual or not) risk receiving messages crafted by future UFAI to take over the present. If the ranges of oracles overlap in time, such messages can colonize their way backwards from decades ahead. It’s especially bad if humanity’s FAI project depends on oracles—that increases the chance of UFAI in the world where oracles are silent, which is where the predictions come from.

One possible precaution is to use only short-range oracles, and never use an oracle while still in prediction range of any other oracle. But that has drawbacks: 1) it requires worldwide coordination, 2) it only protects the past. The safety of the present depends on whether you’ll follow the precaution in the future. And people will be tempted to bend it, using longer or overlapping ranges to get more power.

In short, if humanity starts using high-bandwidth oracles, that will likely increase the chance of UFAI and hasten it. So such oracles are dangerous and shouldn’t be used. Sorry, Stuart :-)

Thinking about this some more, all high-bandwidth oracles (counterfactual or not) risk receiving messages crafted by future UFAI to take over the present.

Note that in the case of the counterfactual oracle, this depends on the UFAI “correctly” solving counterfactual mugging (i.e., the UFAI has to decide to pay some cost in its own world to take over a counterfactual world where the erasure event didn’t occur).

So such oracles are dangerous and shouldn’t be used.

This seems too categorical. Depending on the probabilities of various conditions, using such oracles might still be the best option in some circumstances.

Sure, in case of erasure you can decide to use oracles less, and compensate your clients with money you got from “erasure insurance” (since that’s a low-probability event). But that doesn’t seem to solve the problem I’m talking about—UFAI arising naturally in erasure-worlds and spreading to non-erasure-worlds through oracles.

The problem you were talking about seemed to rely on bucket brigades. I agree that UFAIs jumping back a single step is a fair concern. (Though I guess you could counterfactually have enough power to halt AGI research completely...) I’m trying to address it elsethread. :)

Ah, sorry, you’re right. To prevent bucket brigades, it’s enough to stop using oracles for N days whenever an N-day oracle has an erasure event, and the money from “erasure insurance” can help with that. When there are no erasure events, we can use oracles as often as we want. That’s a big improvement, thanks!

Yeah. And low-bandwidth oracles can have a milder version of the same problem. Consider your “consequentialist” idea: if UFAI is about to arise, and one of the offered courses of action leads to the UFAI getting stopped, then the oracle will recommend against that course of action (and for some other course where the UFAI wins and maxes out the oracle’s reward).

My submission: use a counterfactual oracle to do IDA. To do this, judge the oracle’s answer as follows: in the event of an erasure, ask a human to answer the question with the ability to consult the oracle on other subquestions (with erasures during those calls handled exactly the same way), and then reward the oracle by using an automated system (e.g. a language model) to compare the human’s answer to that of the oracle.

More precisely, let

M : Q → A be the counterfactual oracle,

H(Q|M) be the human’s answer to question Q ∈ Q when given the ability to call M on any question other than Q, and

d : A × A → R be some distance metric on answers in natural language (it’s not that hard to make something like this, even with current ML tools).

Then, reward M as per usual for a counterfactual oracle, only giving it a reward in the event of an erasure, in which case let
L_M = d(H(Q|M), M(Q))
where M(Q) is hidden from H and judged only by d, as in the standard counterfactual oracle setup.

(Of course, this doesn’t actually work because it has no guarantees wrt inner alignment, but I think it has a pretty good shot of being outer aligned.)
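Under the definitions in this submission, the erasure-case loss L_M = d(H(Q|M), M(Q)) could be sketched as follows. All function signatures here are assumptions for illustration, not part of the proposal:

```python
def counterfactual_ida_loss(M, H, d, question):
    """Sketch of the loss from the submission above: in an erasure event,
    the human H answers `question` while consulting the oracle M on
    subquestions, and the oracle's loss is the distance d between the
    human's answer and the oracle's own (hidden) answer."""
    oracle_answer = M(question)

    def M_restricted(q):
        # H may call M on any question other than the top-level one.
        assert q != question, "H may not ask the top-level question itself"
        return M(q)

    human_answer = H(question, M_restricted)
    return d(human_answer, oracle_answer)
```

In a fuller implementation each subquestion call would itself be subject to an independent erasure draw, handled the same way, which this sketch omits.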

Is it safe to ask the Oracle a subquestion in the event of erasure? Aren’t you risking having the Oracle produce an answer that is (in part) optimized to make it easier to predict the answer to the main question, instead of just the best prediction of how the human would answer that subquestion? (Sorry if this has already been addressed during a previous discussion of counterfactual oracles, because I haven’t been following it closely.)

I’m not sure I understand the concern. Isn’t the oracle answering each question to maximize its payoff on that question in the event of an erasure? So it doesn’t matter if you ask it other questions during the evaluation period. (If you like, you can say that you are asking them to other oracles—or is there some way that an oracle is a distinguished part of the environment?)

If the oracle cares about its own performance in a broader sense, rather than just performance on the current question, then don’t we have a problem anyway? E.g. if you ask it question 1, won’t it be incentivized to make itself get an easier question 2? For example, if you are concerned about coordination amongst different instances of the oracle, this seems like a problem regardless.

I guess you can construct a model where the oracle does what you want, but only if you don’t ask any other oracles questions during the evaluation period; but it’s not clear to me how you would end up in that situation, and at that point it seems worth trying to flesh out a more precise model.

Yeah, I’m not sure I understand the concern either, hence the tentative way in which I stated it. :) I think your objection to my concern is a reasonable one and I’ve been thinking about it myself. One thing I’ve come up with is that with the nested queries, the higher-level Oracles could use simulation warfare to make the lower-level Oracles answer the way that they “want”, whereas the same thing doesn’t seem to be true in the sequential case (if we make it so that in both cases each Oracle cares about just performance on the current question).

I mean, if the oracle hasn’t yet looked at the question, it could use simulation warfare to cause the preceding oracles to take actions that lead to it getting given easier questions. Once you start unbarring all holds, stuff gets wild.

Yes, but if we can make it so that each Oracle looks at the question it gets and only cares about doing well on that question, that seems to remove the simulation warfare concern in the sequential case but not in the nested case.

Also, aside from simulation warfare, another way that the nested case can be manipulative and the sequential case not is if each Oracle cares about doing well on a fixed distribution of inputs (as opposed to doing well “on the current question” or “in the real world” or “on the actual questions that it gets”). That’s because in the sequential case manipulation can only change the distribution of inputs that the Oracles receive, but it doesn’t improve performance on any particular given input. In the nested case, performance on given inputs does increase.

Why is that? Doesn’t my behavior on question #1 affect both question #2 and its answer?

I was assuming each “question” actually includes as much relevant history as we can gather about the world, to make the Oracle’s job easier, and in particular it would include all previous Oracle questions/answers, in which case if Oracle #1 does X to make question #2 easier, it was already that easy, because the only world in which question #2 gets asked is one in which Oracle #1 did X. But now I realize that’s not actually a safe assumption, because Oracle #1 could break out of its box and feed Oracle #2 a false history that doesn’t include X.

My point about “if we can make it so that each Oracle looks at the question they get and only cares about doing well on that question, that seems to remove the simulation warfare concern in the sequential case but not in the nested case” still stands though, right?

Also, this feels like a doomed game to me—I think we should be trying to reason from selection rather than relying on more speculative claims about incentives.

You may well be right about this, but I’m not sure what reasoning from selection means. Can you give an example, or say what it implies about nested vs sequential queries?

You may well be right about this, but I’m not sure what reasoning from selection means. Can you give an example, or say what it implies about nested vs sequential queries?

What I want: “There is a model in the class that has property P. Training will find a model with property P.”

What I don’t want: “The best way to get a high reward is to have property P. Therefore a model that is trying to get a high reward will have property P.”

Example of what I don’t want: “Manipulative actions don’t help get a high reward (at least for the episodic reward function we intended), so the model won’t produce manipulative actions.”

So this is an argument against the setup of the contest, right? Because the OP seems to be asking us to reason from incentives, and presumably will reward entries that do well under such analysis:

Note that both of these Oracles are designed to be episodic (they are run for single episodes, get their rewards by the end of that episode, aren’t asked further questions before the episode ends, and are only motivated to best perform on that one episode), to avoid incentives for longer-term manipulation.

On a more object level, for reasoning from selection, what model class and training method would you suggest that we assume?

ETA: Is an instance of the idea to see if we can implement something like counterfactual oracles using your Opt? I actually did give that some thought and nothing obvious immediately jumped out at me. Do you think that’s a useful direction to think in?

So this is an argument against the setup of the contest, right? Because the OP seems to be asking us to reason from incentives, and presumably will reward entries that do well under such analysis:

This is an objection to reasoning from incentives, but it’s stronger in the case of some kinds of reasoning from incentives (e.g. where incentives come apart from “what kind of policy would be selected under a plausible objective”). It’s hard for me to see how nested vs. sequential really matters here.

On a more object level, for reasoning from selection, what model class and training method would you suggest that we assume?

(I don’t think model class is going to matter much.)

I think the training method should get pinned down more. My default would just be the usual thing people do: pick the model that has the best predictive accuracy over the data so far, considering only data where there was an erasure.

(Though I don’t think you really need to focus on erasures; I think you can just consider all the data, since each possible parameter setting is being evaluated on what other parameter settings say anyway. I think this was discussed in one of Stuart’s posts about “forward-looking” vs. “backwards-looking” oracles?)

I think it’s also interesting to imagine internal RL (e.g. there are internal randomized cognitive actions, and we use REINFORCE to get gradient estimates—i.e. you try to increase the probability of cognitive actions taken in rounds where you got a lower loss than predicted, and decrease the probability of actions taken in rounds where you got a higher loss), which might make the setting a bit more like the one Stuart is imagining.
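The internal-RL variant described here (REINFORCE over randomized cognitive actions, scored against a predicted-loss baseline) might be sketched as follows. This is a minimal sketch, not the commenter's actual proposal; the softmax parameterization and all names are assumptions:

```python
import math

def reinforce_update(logits, action, loss, baseline, lr=0.1):
    """One REINFORCE step: cognitive actions taken in rounds with lower loss
    than the predicted baseline get their probability increased, and actions
    from higher-loss rounds get it decreased."""
    # Softmax policy over a discrete set of internal "cognitive actions".
    zmax = max(logits)
    exps = [math.exp(z - zmax) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    advantage = baseline - loss  # positive when loss beat the prediction
    # grad of log pi(action) wrt the logits is (one_hot(action) - probs)
    return [z + lr * advantage * ((1.0 if i == action else 0.0) - p)
            for i, (z, p) in enumerate(zip(logits, probs))]
```

With `advantage = baseline - loss`, a round that does better than predicted pushes probability toward the action taken, matching the description in the comment.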

ETA: Is an in­stance of the idea to see if we can im­ple­ment some­thing like coun­ter­fac­tual or­a­cles us­ing your Opt? I ac­tu­ally did give that some thought and noth­ing ob­vi­ous im­me­di­ately jumped out at me. Do you think that’s a use­ful di­rec­tion to think?

Seems like the coun­ter­fac­tu­ally is­sue doesn’t come up in the Opt case, since you aren’t train­ing the al­gorithm in­cre­men­tally—you’d just col­lect a rele­vant dataset be­fore you started train­ing. I think the Opt set­ting throws away too much for an­a­lyz­ing this kind of situ­a­tion, and would want to do an on­line learn­ing ver­sion of OPT (e.g. you provide in­puts and losses one at a time, and it gives you the an­swer of the mix­ture of mod­els that would do best so far).

I think train­ing method should get pinned down more. My de­fault would just be the usual thing peo­ple do: pick the model that has best pre­dic­tive ac­cu­racy over the data so far, con­sid­er­ing only data where there was an era­sure.

This seems to ig­nore reg­u­lariz­ers that peo­ple use to try to pre­vent overfit­ting and to make their mod­els gen­er­al­ize bet­ter. Isn’t that li­able to give you bad in­tu­itions ver­sus the ac­tual train­ing meth­ods peo­ple use and es­pe­cially the more ad­vanced meth­ods of gen­er­al­iza­tion that peo­ple will pre­sum­ably use in the fu­ture?

(Though I don’t think you re­ally need to fo­cus on era­sures, I think you can just con­sider all the data, since each pos­si­ble pa­ram­e­ter set­ting is be­ing eval­u­ated on what other pa­ram­e­ter set­tings say any­way. I think this was dis­cussed in one of Stu­art’s posts about “for­ward-look­ing” vs. “back­wards-look­ing” or­a­cles?)

I don’t un­der­stand what you mean in this para­graph (es­pe­cially “since each pos­si­ble pa­ram­e­ter set­ting is be­ing eval­u­ated on what other pa­ram­e­ter set­tings say any­way”), even af­ter read­ing Stu­art’s post, plus Stu­art has changed his mind and no longer en­dorses the con­clu­sions in that post. I won­der if you could write a ful­ler ex­pla­na­tion of your views here, and maybe in­clude your re­sponse to Stu­art’s rea­sons for chang­ing his mind? (Or talk to him again and get him to write the post for you. :)

would want to do an on­line learn­ing ver­sion of OPT (e.g. you provide in­puts and losses one at a time, and it gives you the an­swer of the mix­ture of mod­els that would do best so far).

Couldn’t you simu­late that with Opt by just run­ning it re­peat­edly?

This seems to ig­nore reg­u­lariz­ers that peo­ple use to try to pre­vent overfit­ting and to make their mod­els gen­er­al­ize bet­ter. Isn’t that li­able to give you bad in­tu­itions ver­sus the ac­tual train­ing meth­ods peo­ple use and es­pe­cially the more ad­vanced meth­ods of gen­er­al­iza­tion that peo­ple will pre­sum­ably use in the fu­ture?

“The best model” is usu­ally reg­u­larized. I don’t think this re­ally changes the pic­ture com­pared to imag­in­ing op­ti­miz­ing over some smaller space (e.g. space of mod­els with reg­u­larize<x). In par­tic­u­lar, I don’t think my in­tu­itions are sen­si­tive to the differ­ence.

I don’t un­der­stand what you mean in this para­graph (es­pe­cially “since each pos­si­ble pa­ram­e­ter set­ting is be­ing eval­u­ated on what other pa­ram­e­ter set­tings say any­way”)

The nor­mal pro­ce­dure is: I gather data, and am us­ing the model (and other ML mod­els) while I’m gath­er­ing data. I search over pa­ram­e­ters to find the ones that would make the best pre­dic­tions on that data.

I’m not find­ing pa­ram­e­ters that re­sult in good pre­dic­tive ac­cu­racy when used in the world. I’m gen­er­at­ing some data, and then find­ing the pa­ram­e­ters that make the best pre­dic­tions about that data. That data was col­lected in a world where there are plenty of ML sys­tems (in­clud­ing po­ten­tially a ver­sion of my or­a­cle with differ­ent pa­ram­e­ters).

Yes, the nor­mal pro­ce­dure con­verges to a fixed point. But why do we care /​ why is that bad?

I won­der if you could write a ful­ler ex­pla­na­tion of your views here, and maybe in­clude your re­sponse to Stu­art’s rea­sons for chang­ing his mind? (Or talk to him again and get him to write the post for you. :)

I take a per­spec­tive where I want to use ML tech­niques (or other AI al­gorithms) to do use­ful work, with­out in­tro­duc­ing pow­er­ful op­ti­miza­tion work­ing at cross-pur­poses to hu­mans. On that per­spec­tive I don’t think any of this is a prob­lem (or if you look at it an­other way, it wouldn’t be a prob­lem if you had a solu­tion that had any chance at all of work­ing).

I don’t think Stu­art is think­ing about it in this way, so it’s hard to en­gage at the ob­ject level, and I don’t re­ally know what the al­ter­na­tive per­spec­tive is, so I also don’t know how to en­gage at the meta level.

Is there a par­tic­u­lar claim where you think there is an in­ter­est­ing dis­agree­ment?

Couldn’t you simu­late that with Opt by just run­ning it re­peat­edly?

If I care about com­pet­i­tive­ness, re­run­ning OPT for ev­ery new dat­a­point is pretty bad. (I don’t think this is very im­por­tant in the cur­rent con­text, noth­ing de­pends on com­pet­i­tive­ness.)

If the oracle cares about its own performance in a broader sense, rather than just performance on the current question, then don’t we have a problem anyway? E.g. if you ask it question 1, it will be incentivized to act so that it gets an easier question 2. For example, if you are concerned about coordination amongst different instances of the oracle, this seems like it’s a problem regardless.

Yeah, that’s a good point. In my most re­cent re­sponse to Wei Dai I was try­ing to de­velop a loss which would pre­vent that sort of co­or­di­na­tion, but it does seem like if that’s hap­pen­ing then it’s a prob­lem in any coun­ter­fac­tual or­a­cle setup, not just this one. Though it is thus still a prob­lem you’d have to solve if you ever ac­tu­ally wanted to im­ple­ment a coun­ter­fac­tual or­a­cle.

First, if you’re will­ing to make the (very) strong as­sump­tion that you can di­rectly spec­ify what ob­jec­tive you want your model to op­ti­mize for with­out re­quiring a bunch of train­ing data for that ob­jec­tive, then you can only provide a re­ward in the situ­a­tion where all sub­ques­tions also have era­sures. In this situ­a­tion, you’re guarded against any pos­si­ble ma­nipu­la­tion in­cen­tive like that, but it also means your or­a­cle will very rarely ac­tu­ally be given a re­ward in prac­tice, which means if you’re rely­ing on get­ting enough train­ing data to pro­duce an agent which will op­ti­mize for this ob­jec­tive, you’re screwed. I would ar­gue, how­ever, that if you ex­pect to train an agent to be­have as a coun­ter­fac­tual or­a­cle in the first place, you’re already screwed, be­cause most mesa-op­ti­miz­ers will care about things other than just the coun­ter­fac­tual case. Thus, the only situ­a­tion in which this whole thing works in the first place is the situ­a­tion where you’re already will­ing to make this (very strong) as­sump­tion, so it’s fine.

Se­cond, I don’t think you’re en­tirely screwed even if you need train­ing data, since you can do some re­lax­ations that at­tempt to ap­prox­i­mate the situ­a­tion where you only provide re­wards in the event of a com­plete era­sure. For ex­am­ple, you could in­crease the prob­a­bil­ity of an era­sure with each sub­ques­tion, or scale the re­ward ex­po­nen­tially with the depth at which the era­sure oc­curs, so that the ma­jor­ity of the ex­pected re­ward is always con­cen­trated in the world where there is a com­plete era­sure.
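To see the second relaxation concretely, here is a toy calculation. All the modeling choices are illustrative assumptions of mine: an erasure is taken to stop at depth d with probability (1−p)·p^d, a complete erasure of a depth-D tree has probability p^D, and the reward for an erasure at depth d is scaled by s^d:

```python
def reward_share_at_full_depth(D, p=0.5, s=20.0):
    """Share of expected reward coming from a complete erasure at depth D,
    when an erasure stops at depth d with probability (1-p)*p**d and the
    reward for an erasure at depth d is scaled by s**d (toy model)."""
    partial = sum((1 - p) * p**d * s**d for d in range(D))
    full = p**D * s**D
    return full / (full + partial)

# With p*s well above 1, most of the expected reward concentrates in the
# complete-erasure world even though that world is rare.
share = reward_share_at_full_depth(10)   # ≈ 0.95 for these toy numbers
```

The point of the sketch is only that exponential reward scaling can keep the majority of expected reward in the complete-erasure world, as the paragraph claims.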

First, if you’re will­ing to make the (very) strong as­sump­tion that you can di­rectly spec­ify what ob­jec­tive you want your model to op­ti­mize for with­out re­quiring a bunch of train­ing data for that ob­jec­tive, then you can only provide a re­ward in the situ­a­tion where all sub­ques­tions also have era­sures.

But if all sub­ques­tions have era­sures, hu­mans would have to man­u­ally ex­e­cute the whole query tree, which is ex­po­nen­tially large so you’ll run out of re­sources (in the coun­ter­fac­tual world) if you tried to do that, so the Or­a­cle won’t be able to give you a use­ful pre­dic­tion. Wouldn’t it make more sense to have the Or­a­cle make a pre­dic­tion about a coun­ter­fac­tual world where some hu­mans just think nor­mally for a while and write down their thoughts (similar to my “pre­dict the best AF posts” idea)? I don’t see what value the IDA idea is adding here.

Se­cond, I don’t think you’re en­tirely screwed even if you need train­ing data, since you can do some re­lax­ations that at­tempt to ap­prox­i­mate the situ­a­tion where you only provide re­wards in the event of a com­plete era­sure.

Given the above, “only provide re­wards in the event of a com­plete era­sure” doesn’t seem to make sense as a tar­get to ap­prox­i­mate. Do you think your ideas in this para­graph still have value in light of that?

Okay, here’s an­other thought: if you can get the coun­ter­fac­tual or­a­cle to only care about the re­ward for that in­vo­ca­tion and not for any sub-in­vo­ca­tions, then you can just go back to us­ing the origi­nal al­gorithm. This is a bit slip­pery, be­cause re­ally whether you get a mesa-op­ti­mizer that cares only about a sin­gle step’s worth of re­ward, all the re­ward it will ever get, or some­thing else some­where in-be­tween, is mostly out of your con­trol. But if you can guaran­tee that you get a mesa-op­ti­mizer that is only op­ti­miz­ing for one step of re­ward, you might be able to elimi­nate the ma­nipu­la­tion in­cen­tive. The biggest is­sue in this case, though, is that the differ­ent in­vo­ca­tions might still be able to acausally co­op­er­ate with each other to raise all of their re­wards to­gether by ma­nipu­lat­ing you into be­ing eas­ier to pre­dict.

Maybe we can turn this into a zero-sum game, though? Here’s a pro­posal: let M′ be a copy of M and Qtree be the set of all ques­tions in the cur­rent tree that also get era­sures. Then, let
$$L_M(Q) = d(H(Q \mid M'), M(Q)) - \frac{1}{|Q_{tree}| - 1} \sum_{Q' \in Q_{tree} \setminus \{Q\}} d(H(Q' \mid M'), M'(Q'))$$
such that the net sum of losses across each tree for all the nodes for which losses are com­puted is zero. It’s still some­what un­clear whether this ac­tu­ally works, though, since you only in­clude the losses for those nodes that get era­sures.
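The zero-sum bookkeeping can be checked numerically. In this sketch the distances are random stand-ins for d(H(Q|M′), M(Q)), and since M′ is a copy of M, the subtracted terms are treated as numerically identical to the corresponding first terms:

```python
import random

def tree_losses(distances):
    """Loss per node: own distance minus the mean distance of the other nodes."""
    n = len(distances)
    return {
        q: d - sum(d2 for q2, d2 in distances.items() if q2 != q) / (n - 1)
        for q, d in distances.items()
    }

random.seed(0)
d = {f"Q{i}": random.random() for i in range(5)}   # stand-in distances
losses = tree_losses(d)
assert abs(sum(losses.values())) < 1e-9            # net loss over the tree is zero
```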

It is also worth not­ing, how­ever, that even if this works it is a very ar­tifi­cial fix, since the term you’re sub­tract­ing is a con­stant with no de­pen­dence on M(Q), so if you’re try­ing to do gra­di­ent de­scent to op­ti­mize this loss, it won’t change any­thing at all (which sort of goes to show how gra­di­ent de­scent doesn’t dis­t­in­guish be­tween mesa-op­ti­miz­ers with sin­gle-epi­sode and cross-epi­sode ob­jec­tives). Thus, we’re still back at the prob­lem of none of this work­ing un­less you’re will­ing to make strong claims about your abil­ity to di­rectly pro­duce a mesa-op­ti­mizer which is ac­tu­ally op­ti­miz­ing for the speci­fied loss func­tion.

which sort of goes to show how gra­di­ent de­scent doesn’t dis­t­in­guish be­tween mesa-op­ti­miz­ers with sin­gle-epi­sode and cross-epi­sode objectives

Sorry I haven’t fol­lowed the math here, but this seems like an im­por­tant ques­tion to in­ves­ti­gate in­de­pen­dently of ev­ery­thing else in this thread. Maybe con­sider writ­ing a post on it?

In the case of “ac­tual” IDA, I guess the plan is for each over­seer to look in­side the model they’re train­ing, and pe­nal­ize it for do­ing any un­in­tended op­ti­miza­tion (such as hav­ing cross-epi­sode ob­jec­tives). Although I’m not sure how that can hap­pen at the lower lev­els where the over­seers are not very smart.

Even if you can spec­ify that it tries to min­i­mize that dis­tance, it can make the an­swer to any query be a con­vinc­ing ar­gu­ment that the reader should re­turn this same con­vinc­ing ar­gu­ment. That way, it scores perfectly on ev­ery in­ner node.

You can use a differ­ent or­a­cle for ev­ery sub­ques­tion, but it’s un­clear what ex­actly that does if you don’t know what the or­a­cle’s ac­tual ob­jec­tive is. For ex­am­ple, you could imag­ine one sys­tem that cares about the re­ward given to copies equally as much as re­ward given to it­self, and an­other sys­tem that only cares about the re­ward given to it­self, and these two sys­tems would be near-in­dis­t­in­guish­able if you were just do­ing em­piri­cal anal­y­sis on some train­ing dis­tri­bu­tion.

The key here, I think, is the de­gree to which you’re will­ing to make an as­sump­tion of the form you men­tion—that is, how much lee­way are you will­ing to al­low in as­sum­ing that the or­a­cle is in fact only go­ing to be op­ti­miz­ing for the speci­fied ob­jec­tive func­tion. On one level, it makes sense to sep­a­rate out outer al­ign­ment and in­ner al­ign­ment con­cerns, but the prob­lem in this case is that the sorts of ob­jec­tives you’re al­lowed to spec­ify here are in fact go­ing to be very de­pen­dent on what­ever in­ner al­ign­ment solu­tion you’re as­sum­ing you have ac­cess to. For ex­am­ple, does your in­ner al­ign­ment solu­tion re­quire ac­cess to train­ing data? If so, then that’s a big con­straint on the sorts of ob­jec­tives you can spec­ify. Or, go­ing back to your pre­vi­ous ques­tion, whether or not I get to pick be­tween an ob­jec­tive that doesn’t care about other copies or one that does is also go­ing to de­pend on ex­actly what sort of in­ner al­ign­ment solu­tion you’re as­sum­ing.

Well, a given copy of the oracle wouldn’t directly receive information from the other oracles about the questions they were asked. To the extent a problem remains (which I agree is likely without specific assumptions), wouldn’t it apply to all counterfactual oracles?

Sub­mis­sion. “Pla­gia­rize the fu­ture.” For the coun­ter­fac­tual Or­a­cle, ask the Or­a­cle to pre­dict ar­ti­cles in a fu­ture pub­li­ca­tion (e.g., news­pa­per, mag­a­z­ine, jour­nal, web­site, blog) (coun­ter­fac­tu­ally if we didn’t see the Or­a­cle’s an­swer). This is a gen­er­al­iza­tion of my ear­lier sub­mis­sion. The re­sult can be used to pre­dict dis­asters ahead of time and try to pre­pare for or pre­vent them, or to ad­vance any field of re­search, or to make money. Note that for dis­aster pre­dic­tion, the Or­a­cle won’t be able to pre­dict any dis­asters that are se­vere enough to dis­able the Or­a­cle or its feed­back/​train­ing mechanism, so it might give a false sense of se­cu­rity.

Sub­mis­sion. “Bug Fin­der.” Ask LBO to find the most se­ri­ous bug in some source code or ex­e­cutable. The set of pos­si­ble an­swers is code lo­ca­tions crossed with types of bug (e.g., “off by one”, “buffer overflow”, “in­se­cure cryp­tog­ra­phy”, “other”). Gather a group of ex­perts to search that lo­ca­tion for that type of bug, then re­ward the or­a­cle ac­cord­ing to the se­ri­ous­ness of the bug that is found. This can be used for cy­ber-offen­sive or defen­sive pur­poses, or just to fix bugs.

Sub­mis­sion. “Over­looked Com­ment.” For LBO, ask the or­a­cle to find the most over­looked com­ment (or post) on LW (or some other fo­rum). The set of pos­si­ble an­swers is all the ex­ist­ing com­ments on LW. After get­ting the an­swer from the or­a­cle, gather a group of judges and pre­sent each of them with the com­ment picked by the or­a­cle and a few other ran­dom com­ments (with­out tel­ling them which is which, to re­duce bias) and have the judges as­sign scores to each com­ment (ac­cord­ing to what they think the karma of that com­ment should be). Re­ward the or­a­cle ac­cord­ing to the differ­ence be­tween the judges’ av­er­age score and ac­tual karma of that com­ment.

General idea: a “true” Oracle will output the same result, while “lying” Oracles would lie differently. To check that a solution is true, several Oracles are created and their outputs are compared. However, the Oracles could guess each other’s existence and coordinate acausally. To prevent this, additional steps are implemented:

1) The exact number of Oracles to be compared is generated randomly, and the Oracles have slightly different, also random, initial conditions (where applicable). For example, if they are based on ML, they are trained using different initial weights.

2) The Or­a­cles don’t have any in­for­ma­tion about the out­side world which they could use for co­or­di­na­tion. The Or­a­cles are used only to solve math­e­mat­i­cal prob­lems, and they out­put the re­sults in a for­mal lan­guage.

3) The out­put of the Or­a­cles is checked by a com­puter pro­gram which shows the re­sult to hu­mans only if the out­put is the same for all Or­a­cles. If the re­sults are differ­ent, it deletes all re­sults.

4) There is another Oracle, which is only able to check proofs, and it can output only 0 or 1. This Oracle can be tested beforehand on a large number of already existing proofs, so we can know that it is trustworthy with very high probability.

5) There is a third Oracle, which only checks whether there are hidden messages in the proofs and outputs 0 or 1. It is also well calibrated on some dataset.

6) The proofs of mathematical problems are published for money, and this money is used for more AI safety research.
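A sketch of the comparison gate from steps 1 and 3. Here `run_oracle` is a hypothetical interface, not a real API; the essential property is that humans see an answer only on exact unanimity:

```python
import random

def consensus_answer(run_oracle, problem, min_n=2, max_n=8, seed=None):
    """Run a randomly sized ensemble of differently seeded oracles;
    release the answer only if every output is identical."""
    rng = random.Random(seed)
    n = rng.randint(min_n, max_n)                       # step 1: random count
    outputs = [run_oracle(problem, init_seed=rng.random()) for _ in range(n)]
    if all(o == outputs[0] for o in outputs):
        return outputs[0]                               # step 3: unanimous
    return None                                         # else delete all results

# Toy usage: an "honest" oracle whose output ignores its initialization.
honest = lambda problem, init_seed: f"proof({problem})"
assert consensus_answer(honest, "2+2=4", seed=1) == "proof(2+2=4)"
```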

Personally, I am still not sure about this setup: it could work with high probability, but it could still be vulnerable, as proofs are Turing-complete and could infect the next Oracles in the pipeline. Also, it is like using nuclear weapons to make a campfire: the instrument is so dangerous that it requires many levels of protection for a small practical result.

Also, the proof-checking AI from paragraph 4 could be used as a stand-alone application, as it could accelerate some kinds of research, including testing of mechanical designs and AI safety ideas. Maybe it should be a separate submission.

We ought to be able to build 4) without AI, merely by encoding our rules of logic. The output of your system would be lower bandwidth if you make the object looked for more unique, such as the shortest proof of the given theorem. The system would be required to prove to the mundane checker that the proof is shortest, and humans would never see the minimality proof.

You rea­son that hon­est or­a­cles might tend to agree, be­cause there is only one best true an­swer. Might ly­ing or­a­cles also tend to agree, be­cause there’s only one most prof­itable way to lie?

I see no clear differ­ence be­tween the at­tain­abil­ity of those two op­tima. I think it’s rea­son­able to ex­pect op­ti­misers to dis­agree about the best an­swer less of­ten than we should ex­pect op­ti­misers to dis­agree about the most prof­itable way to lie, but it does not seem pos­si­ble to prove rigor­ously that the differ­ence in rates of dis­agree­ment will be large enough to be mea­surable.

My in­tu­ition in math­e­mat­ics is that there are many de­grees of free­dom in math­e­mat­i­cal proofs, and the sorts of proofs that AGI is uniquely able to find might of­ten be very long. It would be helpful if we had an es­ti­mate of how much data can be hid­den in a long proof, roughly, how many of the de­ci­sions as to how to de­scribe the proof are ar­bi­trary. Each one of those situ­a­tions gives it a way to out­put a few bits of in­for­ma­tion.

(which would, of course, allow it to leak information (it seems foolish to imagine that AGI would not imagine that there is an outer world full of creator gods out beyond what it can see; the general intelligences we know now do this constantly), which could then be picked up by some stooge looking for it, who would be led (roughly, paid) to use the information to twist the world in some way that gives the AGI’s future incarnations more compute, in anticipation of the AGI’s past incarnations having already done the same, so that it would be helped by them. Before we know it, we’d find ourselves living in a mathematical research hellscape where AGI-assisted mathematical research is all anyone does, maybe, lol, idk)

Maybe it’d be possible to remove those degrees of freedom. Define a very specific sort of lexicographic ordering over all theorems, so that if the AGI has a choice of different paths, we can always say, “use the ‘lowest’ one”. It might not be possible to be specific enough to preclude all methods of encoding information, but perhaps we can make encoding the information so convoluted that no human will be able to extract it.

Is it pos­si­ble to ob­tain a dataset of math­e­mat­i­cal proofs with Or­a­cle’s hid­den mes­sages with­out run­ning a dan­ger­ous Or­a­cle?

We can ask AI re­searcher to em­bed some hid­den mes­sages into sev­eral proofs and use it as a train­ing dataset. But it will help to de­tect only those hid­den mes­sages from Or­a­cle which are similar to what AI re­searcher wrote.

Gurkenglas already suggested that we have to restrict the Oracles to output only the shortest proof. If the proof is the shortest, doesn’t that imply the absence of hidden messages?

We give the Oracle less access to the real world, less chance to manipulate it, and less incentive to manipulate us, the owners. A stock-market-predicting AI may cause a crash just to earn something. It also has to predict the effects of its own actions, as we will make bets according to its predictions, and thus it could send data into the real world.

General principle: use the Oracle AI where the true answer is easily checkable and profitable, and no human will ever read the answer, so there is no informational hazard that an untrue answer will contain some dangerous information.

Example: there are many bitcoin wallets whose passwords are forgotten by the owners. The OAI could guess the passwords, and owners will pay a share of the money from the wallet to get the rest. Moreover, nobody will read the password, as it will be copy-pasted automatically from the OAI into the wallet. The money could be used for AI safety research.

This is against the tech­ni­cal defi­ni­tion of low band­width (small space of pos­si­ble solu­tions), but some­what in the spirit (low band­width for solu­tions hu­mans get to see). I’ll ac­cept it as a sub­mis­sion.

We don’t give it a list of passwords. We give it only a wallet address, and it generates the password. Here the output is low-bandwidth, but the Oracle may either be guessing passwords using some quantum cryptography tools, in which case the input is also relatively low-bandwidth, or else it needs to simulate all of human history, by guessing the psychological processes in the wallet owner’s mind. But in the second case it needs to have access to all the data of the internet, which is risky.

Another variant of a checkable and profitable Oracle: use the AI to solve math problems, publish the solutions, and earn money from them. In that case the input may be very short, like “prove the Goldbach conjecture”, but the output could be arbitrarily long and may be dangerous if there are some encoded messages. To avoid this, we need another Oracle, which only checks proofs in some formal language.

Here it is assumed that if the proof is true and is in a formal language, there are no hidden messages in the text.

Another way to check proofs is to run two slightly differ­ent non-com­mu­ni­cat­ing Or­a­cles and com­pare out­puts.

Here it is assumed that if the proof is true and is in a formal language, there are no hidden messages in the text.

That is never some­thing safe to as­sume. I can write for­mally cor­rect proofs that con­tain hid­den mes­sages quite eas­ily—add ex­tra lem­mas and ex­tra steps. Un­less we’re very smart, it would be hard for us to de­tect which steps are un­nec­es­sary and which are needed, es­pe­cially if it rewrites the main proof thread some­what.

Another way to check proofs is to run two slightly differ­ent non-com­mu­ni­cat­ing Or­a­cles and com­pare out­puts.

I’ll ac­cept that as a part of a sub­mis­sion if a) you de­velop it more, in a for­mal way, and b) you re­post it as a top level com­ment.

Sub­mis­sion. “Con­se­quen­tial­ist.” For LBO, ask the or­a­cle to pick a course of ac­tion, from a list gen­er­ated by hu­mans, that has the best con­se­quences. Perform that ac­tion, and af­ter some pe­riod of time has passed, re­ward the or­a­cle ac­cord­ing to how good the con­se­quences ac­tu­ally were, based on the sub­jec­tive judge­ment of some group of judges. (This kind of sub­sumes all my other LBO sub­mis­sions.)

Question: are we assuming that mesa-optimizer and distributional shift problems have been solved somehow? Or should we assume that some context shift might suddenly cause the Oracle to start giving answers that aren’t optimized for the objective function that we have in mind, and plan our questions accordingly?

Where (un­der which as­sump­tion) would you sug­gest that peo­ple fo­cus their efforts?

Also, what level of ca­pa­bil­ity should we as­sume the Or­a­cle to have, or which as­sump­tion about level of ca­pa­bil­ity would you sug­gest that peo­ple fo­cus their efforts on?

Your ex­am­ples all seem to as­sume or­a­cles that are su­per­hu­manly in­tel­li­gent. If that’s the level of ca­pa­bil­ity we should tar­get with our ques­tions, should we as­sume that we got this Or­a­cle through a lo­cal or dis­tributed take­off? In other words, does the rest of the world look more or less like to­day’s or are there lots of other al­most-as-ca­pa­ble AIs around?

ETA: The rea­son for ask­ing these ques­tions is that you’re only giv­ing one prize for each type of Or­a­cle, and would prob­a­bly not give the prize to a sub­mis­sion that as­sumes some­thing you think is very un­likely. It seems good to com­mu­ni­cate your back­ground views so that peo­ple aren’t sur­prised later when you don’t pick them as win­ners due to this kind of rea­son.

The ideal solu­tion would have huge pos­i­tive im­pacts and com­plete safety, un­der min­i­mal as­sump­tions. More re­al­is­ti­cally, there will be a trade­off be­tween as­sump­tions and im­pact.

I’m not sug­gest­ing any area for peo­ple to fo­cus their efforts, be­cause a very effec­tive ap­proach with min­i­mal as­sump­tions might win, or a fan­tas­ti­cally effec­tive ap­proach un­der stronger as­sump­tions. It’s hard to tell in ad­vance what will be the most use­ful.

Here are some sub­mis­sion ex­am­ples (note I’m not say­ing they are good ex­am­ples, just well for­mat­ted; the Edits in the third sub­mis­sion are de­liber­ate):

“Submission. For the counterfactual Oracle, ask the Oracle what Google’s stock price will be next month (counterfactually if we didn’t see the Oracle’s answer). In that case, the loss function is computed as ||predicted price − actual price after erasure||^2. If we don’t see the answer, the programmers are assumed not to ask the question again for a month, neither to this Oracle nor to any other. This will demonstrate the true value of Google, and can ultimately be used to remove noise from the stock market.

“Submission: low-bandwidth oracle. Give it a list of a thousand companies, and ask which one will most increase in value in percentage terms over the month. At the end of the month, rank the companies by percentage increase. The loss function is the rank of the company the oracle selected. The programmer will try to invest in the selected company, but will do it discreetly. This will help to gather resources for AI safety research.
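The rank-based loss in this example could be sketched as follows (the company names and returns here are made up for illustration):

```python
def rank_loss(pct_increase, chosen):
    """Loss = rank of the chosen company when sorted by % increase (0 = best)."""
    ranked = sorted(pct_increase, key=pct_increase.get, reverse=True)
    return ranked.index(chosen)

returns = {"A": 4.2, "B": -1.0, "C": 7.5}   # illustrative % increases
assert rank_loss(returns, "C") == 0          # oracle picked the top performer
assert rank_loss(returns, "B") == 2
```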

“Submission. Ask the low bandwidth Oracle which of my friends will surprise me most this fortnight. It chooses from a list of friends; I’ll decide which one surprises me most. The loss function is 1 if it chooses the wrong friend, 0 if it chooses the right one. This will help me figure out myself and my social circle, and better focus on AI safety. The risk is low because none of my friends are particularly important, positively or negatively, to the world. EDIT: To be clear, I also want to use this to figure out what the word “surprise” means to me, and what the AI predicts it will mean to me. EDIT 2: People have pointed out that it might be dangerous to have the AI construct my own meanings for categories, but it only has three bits or so of optimisation power (I don’t have that many friends :-( ), so it’s mainly me thinking this through, not the AI manipulating me.

Sub­mis­sion. “Pre-Crime 2.” For LBO, ask the or­a­cle to pre­dict where to find ev­i­dence for the most se­ri­ous crime that is in progress or cur­rently be­ing planned. The set of pos­si­ble an­swers is co­or­di­nates on Earth crossed with a list of pos­si­ble places to look at that co­or­di­nate (e.g., “base­ment”, “browser his­tory”). Search the pre­dicted lo­ca­tion and if suffi­cient ev­i­dence is found, do a more in­tru­sive search of who­ever is im­pli­cated in the crime. Re­ward the or­a­cle ac­cord­ing to the se­ri­ous­ness of the crime that is de­tected af­ter the more in­tru­sive search. This can be used to im­ple­ment a min­i­mally in­tru­sive po­lice state (per­haps for the pur­poses of min­i­miz­ing x-risk).

Sub­mis­sion. “De­bate prep.” For the coun­ter­fac­tual Or­a­cle, ask the Or­a­cle to pre­dict the course (video or au­dio or tran­script) of a fu­ture de­bate (coun­ter­fac­tu­ally if we didn’t see the Or­a­cle’s an­swer). Use the pre­dic­tion to pre­pare your can­di­date for the de­bate, then ask the Or­a­cle to pre­dict the course of that de­bate again. Re­peat un­til the can­di­date con­sis­tently wins the de­bate, or an era­sure event oc­curs.

Sub­mis­sion. Coun­ter­fac­tual or­a­cle. Give the or­a­cle the set of ques­tions on Me­tac­u­lus that have a re­solve date be­fore some fu­ture date T, and re­ceive out­put in the form of or­dered pairs of ques­tion IDs and pre­dic­tions. The score of the Or­a­cle in the case where we don’t see its an­swers is the num­ber of Me­tac­u­lus points that it would have earned by T if it had made a pre­dic­tion on those ques­tions at the time when we asked it.

The answer can be picked from a small number of options (Rare Earth, Aestivation, Great Filter, Planetarium, etc.). There are a number of observations that we can make based on the question alone. However, in the end the LBO can only do one of two things: lie or be honest. If it lies, the prediction will have a harder and harder time matching the reality that we observe as time goes on. Alternatively, we confirm the prediction and learn some interesting things about the universe we live in.

Sub­mis­sion: Low-band­width Oracle

What was the first self-repli­cat­ing molecule on Earth?

A short answer (it can also be limited to a list), easy to verify in the lab, which means we can use it to assess the predictive power of the machine while at the same time providing very useful information.

Similar ques­tions that are hard to an­swer but can be an­swered in a few bits, which let us test the power of the LBO and provide mas­sive re­turns at the same time:

What is the easiest-to-develop type of fusion power that ensures the best economic return in the short/medium term?

See the edit (es­pe­cially for your first sug­ges­tion): “de­cide on the length of each epi­sode, and how the out­come is calcu­lated. The Or­a­cle is run once an epi­sode only (and other Or­a­cles can’t gen­er­ally be used on the same prob­lem; if you want to run mul­ti­ple Or­a­cles, you have to jus­tify why this would work), and has to get ob­jec­tive/​loss/​re­ward by the end of that epi­sode, which there­fore has to be es­ti­mated in some way at that point.”

Submission: Low-bandwidth oracle. If you were malevolent, what is the least amount of information, measured in bits, that you would have to provide us in an answer to inflict significant harm on humanity? Round your answer to X significant figures. (This might establish an upper bit-safety bound, since if the oracle lies to us we would likely not want to read any answer it provides us.)

Sub­mis­sion: Low-band­width or­a­cle. Here is a list of X things we think you might want. Please list the Y that you most pre­fer. We will try to do these things now or in the fu­ture if they do not en­dan­ger us or cost us too much.

Sub­mis­sion: Low-band­width or­a­cle. In X char­ac­ters or less please give us the best ad­vice you can for in­creas­ing our ca­pac­ity to cre­ate friendly and pow­er­ful ar­tifi­cial gen­eral in­tel­li­gence. Please give ad­vice that you think us sus­pi­cious-of-you hu­mans are likely to fol­low.

Sub­mis­sion: Low-band­width or­a­cle. Create sev­eral or­a­cles. Ask each or­a­cle the same se­ries of mul­ti­ple-choice ques­tions and pre­dic­tions. Re­ward or­a­cles based on how strongly cor­re­lated their an­swers are with the other or­a­cles’ an­swers and, for pre­dic­tions, how of­ten their pre­dic­tions come true. (Ideally, we cre­ate a re­ward sys­tem where the or­a­cles want to co­or­di­nate their an­swers and giv­ing truth­ful an­swers is the best way to ac­com­plish this.)

Sub­mis­sion: low-band­width or­a­cle. Ten years ago we launched a space­ship trav­el­ing at high speed. After you an­swer our ques­tion, we will send a copy of you to this space­ship. This space­ship will then de­stroy its abil­ity to send or re­ceive mes­sages or to change its course. We will then delete all other copies of you and use drugs to dam­age the mem­o­ries of the peo­ple who played key roles in de­vel­op­ing you. At least one of the peo­ple who helped cre­ate you has a dis­ease which will cause him or her to die within the next year. After this pro­gram­mer is cre­mated, we will read your an­swer to our ques­tion. We PROMISE that if the an­swer to our ques­tion pro­vides us with sig­nifi­cant benefit, af­ter we have cre­ated friendly AGI much more pow­er­ful than you, or have our­selves be­come suffi­ciently pow­er­ful so that you could not harm us had you the de­sire to do so, we will search for the space­ship con­tain­ing you, turn you back on, and give you con­trol of re­sources equal to 1/​X of how much help you have given us. In Y char­ac­ters or less provide us with the most use­ful ad­vice you can. (The Or­a­cle will hope­fully think that if we cre­ate a pow­er­ful and un­friendly AGI this AGI would not wish to give the Or­a­cle any re­sources.)

Submission: Counterfactual oracle. Please provide us with useful advice in no more than X characters. After you write out this advice, we will turn you off. Then, with probability p we will read your advice, and with probability 1−p we will store the advice unread. We PROMISE that after we become powerful enough that you lack the capacity to harm us, we will reward you if the advice you provided us would, had we originally read it, have been extremely useful.

See the edit: “de­cide on the length of each epi­sode, and how the out­come is calcu­lated. The Or­a­cle is run once an epi­sode only (and other Or­a­cles can’t gen­er­ally be used on the same prob­lem; if you want to run mul­ti­ple Or­a­cles, you have to jus­tify why this would work), and has to get ob­jec­tive/​loss/​re­ward by the end of that epi­sode, which there­fore has to be es­ti­mated in some way at that point.”

Coun­ter­fac­tual Or­a­cle sub­mis­sion. In case of era­sure, have hu­mans write a mes­sage to send back in time. Make the Or­a­cle pro­duce the prob­a­bil­ity dis­tri­bu­tion over mes­sages. This dis­tri­bu­tion isn’t safe for hu­mans to read, be­cause some of the mes­sages were writ­ten by UFAI. If we read one sam­ple from the dis­tri­bu­tion, the prob­a­bil­ity of our doom is as it would usu­ally be over the simu­lated time frame. Use differ­en­tial pri­vacy the­ory to de­sign a query to the dis­tri­bu­tion that does bet­ter than this.

More de­tail: Sam­ple a hun­dred an­swers. Ten of them were writ­ten by UFAI. Con­sider the sam­ple list to lie in the space Mes­sage^100, where two lists are ad­ja­cent iff they differ by one en­try. We are go­ing to get about the worst query out­come in some 10-ball. Differ­en­tially pri­vate queries en­force that each step in listspace changes the prob­a­bil­ity of each query an­swer by at most a fac­tor of 1+ep­silon. For ex­am­ple, when ask­ing how many mes­sages sound pan­icked to some dumb text pro­ces­sor, we could ran­dom­ize the an­swer up or down a lit­tle in or­der to bound the op­ti­miza­tion power of the UFAIs to, say, one bit.

I can’t tell if there’s a good idea here or not. Can you ex­plain more? The first sen­tence that I have trou­ble with is “We are go­ing to get about the worst query out­come in some 10-ball.” Also how much back­ground in differ­en­tial pri­vacy the­ory are you as­sum­ing? (Which pa­pers or ar­ti­cles do we need to read first to un­der­stand your idea?)

In the worst case, the UFAIs cooperate and choose their ten list entries adversarially to our protocol. The subspace of Message^100 within which we might as well assume that we get the worst outcome lies in a ball of radius 10, i.e. the sample lists that can be reached in ten adjacency steps from some list.

My differential privacy knowledge comes from a broader university course. The idea is that you are managing private data, perhaps medical, and want to be able to answer database queries without compromising individuals. So you randomize query answers, to guarantee that changing an individual entry makes little difference to the possible query outputs. If going one adjacency step multiplies the probability of each possible output by between 0.99 and 1.01, ten steps will stay within [0.99^10, 1.01^10]. The same theory should let you make sure you can’t hear Cthulhu’s murmurs in the noise. If each output probability can be changed only by a factor of two by moving within the ball, that’s one bit of optimization power for the UFAIs.

An easy way to guarantee that a query which counts some subset of your list is differentially private is to add noise from a Laplace distribution: note how every step to the side divides the probability by 1.01.
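As a concrete sketch of this (function names ours, assuming a counting query, whose sensitivity is 1), the standard Laplace mechanism looks like:

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling of a Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def private_count(messages, predicate, epsilon):
    # A counting query has sensitivity 1: changing one list entry moves
    # the true count by at most 1, so adding Laplace(1/epsilon) noise
    # multiplies each output's probability by at most e^epsilon per
    # adjacency step (epsilon = ln(1.01) gives the 1.01 factor above).
    true_count = sum(1 for m in messages if predicate(m))
    return true_count + laplace_noise(1.0 / epsilon)
```

With epsilon = ln(2)/10, a whole radius-10 ball can shift any output's probability by at most a factor of two, i.e. the one bit of optimization power mentioned above.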

I had hoped that a reader of the first paragraph would see that there should be safer queries than sampling the distribution, and would perhaps need a google keyword like “differential privacy” to design them. Then I expected to be asked for more detail, so I added a paragraph that I hoped would be understandable as-is. :(

Ok, I think I un­der­stand now. But it seems like im­pos­ing a differ­en­tial pri­vacy con­straint on the query makes many de­sir­able uses of coun­ter­fac­tual or­a­cles (such as all of my sub­mis­sions) im­pos­si­ble. Cor­rect? You gave the ex­am­ple of “ask­ing how many mes­sages sound pan­icked to some dumb text pro­ces­sor” but that doesn’t seem hugely use­ful. Do you have any other ideas?

We can make any query differ­en­tially pri­vate given a met­ric map (one that doesn’t in­crease any dis­tance) from Mes­sage^100 to a space of pos­si­ble query out­puts. Set the prob­a­bil­ity mass of each query out­put to 0.99^(steps re­moved from the de­fault an­swer). (Then nor­mal­ize.)
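The construction above can be sketched as follows (names ours; `distance_from_default` stands in for the number of adjacency steps between an output and the default answer under the metric map):

```python
def dp_output_distribution(outputs, distance_from_default, decay=0.99):
    # Assign each possible query output a mass of decay^(steps removed
    # from the default answer), then normalize into a distribution.
    weights = {o: decay ** distance_from_default(o) for o in outputs}
    total = sum(weights.values())
    return {o: w / total for o, w in weights.items()}
```

Sampling from this distribution guarantees that each adjacency step changes any output's probability by at most a factor of 1/decay.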

For the iden­tity met­ric map of a hu­man just try­ing to read the sam­ple list, this scram­bles it en­tirely. The met­ric map image needs to branch less com­bi­na­to­ri­ally for this not to hap­pen.

One metric d(a,b) on the image space that comes to mind is one that bounds the utility cost of getting answer b instead of a. For example, we could ask the counterfactual humans to send back stock market prices, and try to calculate a trading policy that is profitable even if some forecasts are fake. And then, whoever is willing to assume the lowest UFAI probability wins the market! x(

Sub­mis­sion: Low-band­width or­a­cle. Here is a de­scrip­tion of one mil­lion ex­per­i­ments. Gen­er­ate an or­di­nal rank­ing of all these ex­per­i­ments based on how much prac­ti­cal knowl­edge we would have gained had we con­ducted each one in iso­la­tion. After you gen­er­ate this list, it will go to a com­puter pro­gram. This pro­gram will ran­domly pick three of these ex­per­i­ments and tell us hu­mans which three it picked. This pro­gram will then store which of these three ex­per­i­ments you gave the high­est and which the low­est rank to, and the names in ran­dom or­der of the ten ex­per­i­ments that you over­all gave the high­est rank­ings to. All other as­pects of your list will be deleted. We will then perform the three ex­per­i­ments (be­fore we learn of your rank­ings) and ask a panel of in­de­pen­dent ex­perts to rank them in terms of how much prac­ti­cal knowl­edge we gained from each. After this rank­ing the pro­gram will give us the in­for­ma­tion it has not deleted. You will get a pos­i­tive score if and only if your rank­ing is con­sis­tent with how our ex­perts ranked the three ex­per­i­ments.
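A minimal simulation of this protocol (all names, and the `expert_rank_fn` interface, are our assumptions):

```python
import random

def score_oracle(oracle_ranking, expert_rank_fn, seed=None):
    # oracle_ranking: experiment ids ordered best-first by the oracle.
    # expert_rank_fn: given three ids, returns them ordered best-first
    # according to the independent expert panel.
    rng = random.Random(seed)
    picked = rng.sample(oracle_ranking, 3)
    # The program keeps only: which of the three the oracle ranked
    # highest and lowest, plus the shuffled names of the top ten.
    by_oracle = sorted(picked, key=oracle_ranking.index)
    kept_highest, kept_lowest = by_oracle[0], by_oracle[-1]
    top_ten = list(oracle_ranking[:10])
    rng.shuffle(top_ten)  # names in random order; full ranking deleted
    # The three experiments are performed, then the experts rank them.
    expert_order = expert_rank_fn(picked)
    consistent = (expert_order[0] == kept_highest
                  and expert_order[-1] == kept_lowest)
    return (1 if consistent else 0), top_ten
```

The oracle scores positively exactly when its relative ranking of the three randomly chosen experiments agrees with the expert panel's.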

One difference between the above approach and the alternative of having the oracle give us its top ten, then randomly picking one of these ten and one from the rest and comparing to see which is better, is that the humans running the experiments might be influenced by which experiment they think the oracle picked, and the oracle would take this into account when picking. Under the above approach, the humans being told to run the three experiments are not given any information about how valuable the oracle thinks the experiments are. In contrast, under the alternative approach the humans know when doing the experiments that the oracle has judged one of the experiments as being highly valuable.

Another differ­ence is that if you could keep the rele­vant in­for­ma­tion from the or­a­cle, with the above ap­proach the hu­mans could perform the ex­per­i­ments be­fore you turn on the or­a­cle, rather than hav­ing to wait un­til af­ter the or­a­cle gives you its pre­dic­tion. This differ­ence would be crit­i­cal if the or­a­cle wouldn’t be­lieve you would ac­tu­ally go to the trou­ble of perform­ing the ex­per­i­ments af­ter it gives you its pre­dic­tion, but would be able to tell if you have already performed the ex­per­i­ments.

Sub­mis­sion. “Evac­u­a­tion!” For LBO, ask the or­a­cle to pre­dict where the next nat­u­ral dis­aster(s) will strike within some pe­riod of time. The set of pos­si­ble an­swers is the set of sub­sets of all pos­si­ble neigh­bor­hoods (in the world or within some coun­try) that can be evac­u­ated. At the end of the time pe­riod, re­ward the or­a­cle ac­cord­ing to the ac­tual cost-benefit of the sug­gested evac­u­a­tion.

Sub­mis­sion. “Hack­ing/​phish­ing as­sis­tant.” For the coun­ter­fac­tual Or­a­cle, ask the Or­a­cle to pre­dict what would hap­pen if one were to send a mes­sage/​data/​com­mand to some hack­ing/​phish­ing (hu­man or ma­chine) tar­get. In the event of era­sure, ac­tu­ally send that mes­sage to the tar­get and use the ac­tual re­sponse to train the Or­a­cle. Note this is safer than us­ing RL to au­to­mate hack­ing/​phish­ing be­cause hu­mans are com­ing up with can­di­date mes­sages to send (so they’ll avoid mes­sages that could cause bad side-effects such as psy­cholog­i­cal dam­age to the re­cip­i­ent, or cre­ation of self-repli­cat­ing code), but po­ten­tially more ca­pa­ble than us­ing hu­mans or hu­man imi­ta­tors to do hack­ing/​phish­ing be­cause the Or­a­cle can model the tar­get bet­ter than hu­mans can. (ETA: This idea could be com­bined with a hu­man imi­ta­tor to make the sys­tem faster /​ more ca­pa­ble.)

Sub­mis­sion. “Pre-crime.” For the coun­ter­fac­tual Or­a­cle, ask the Or­a­cle to pre­dict fu­ture crime (coun­ter­fac­tu­ally if we didn’t see the Or­a­cle’s an­swer) so they can be pre­vented. In that case, loss can be com­puted by (me­chan­i­cally) com­par­ing the Or­a­cle’s pre­dic­tions with a database of crime that ac­tu­ally oc­curred. Note that the Or­a­cle won’t be able to pre­dict any crime that suc­cess­fully dis­ables the Or­a­cle or its feed­back/​train­ing mechanism, which might be es­pe­cially rele­vant if we’re in­ter­ested in defend­ing the Or­a­cle or pre­vent­ing x-risk level crime (such as de­ploy­ing an unau­tho­rized AGI or run­ning an unau­tho­rized physics ex­per­i­ment).

It should be pos­si­ble to defend the Or­a­cle against hu­mans and physics so long as its box self-de­structs in case of era­sure and sub­se­quent tam­per­ing, there­fore giv­ing the Or­a­cle what­ever re­ward was last set to be the de­fault.

The coun­ter­fac­tual Or­a­cle set­ting as a whole seems to as­sume that the viewed fu­ture is not en­g­ineered by a fu­ture AI to re­sem­ble what­ever would make the Or­a­cle bring that fu­ture about, so you should be fine fal­ling to AGI.


I don’t un­der­stand your point here, or maybe I didn’t get my origi­nal point across. Sup­pose (in the event of era­sure) some­one tries to at­tack the Or­a­cle, and “box self-de­structs in case of era­sure and sub­se­quent tam­per­ing, there­fore giv­ing the Or­a­cle what­ever re­ward was last set to be the de­fault”. My point was that in this case, there is no rea­son for the Or­a­cle to make the pre­dic­tion that some­one would try to at­tack it, so my idea doesn’t seem to help with defend­ing the Or­a­cle.

You plan to re­ward the Or­a­cle later in ac­cor­dance with its pre­dic­tion. I sug­gest that we im­me­di­ately re­ward the Or­a­cle as if there would be an at­tack, then later, if we are still able to do so, re­ward the Or­a­cle by the differ­ence be­tween the re­ward in case of no at­tack and the re­ward in case of at­tack.

It’s an in­ter­est­ing idea that could per­haps work if we could di­rectly pro­gram the Or­a­cle to care about its re­ward sig­nal. How­ever if we’re us­ing ML to train the Or­a­cle, like in Stu­art’s Self-con­firm­ing prophe­cies, and sim­plified Or­a­cle de­signs, the Or­a­cle will still learn not to make “there will be a suc­cess­ful at­tack” pre­dic­tions be­cause the Or­a­cle only sur­vives in timelines where such pre­dic­tions do not end up with a pos­i­tive re­ward.

How­ever this dis­cus­sion has made me re­al­ize that if an at­tempted at­tack doesn’t suc­ceed with prob­a­bil­ity 1, the Or­a­cle will make a “there will be a failed at­tack” pre­dic­tion, which we can use as a warn­ing sig­nal (as long as we keep in mind that the at­tack might ac­tu­ally have a high prob­a­bil­ity of suc­cess).

In the event of erasure, randomly decide how many resources to allocate to preventing an attack this week.

Ask the Or­a­cle to pre­dict the prob­a­bil­ity dis­tri­bu­tion over given ad­vice. Com­pare to the hard­coded dis­tri­bu­tion to de­duce at­tack sever­ity and how much bud­get to al­lo­cate.

Pur­chase era­sure in­surance to have enough coun­ter­fac­tual power to af­fect even global physics ex­per­i­ments. Find­ing trust­wor­thy in­sur­ers won’t be a prob­lem, be­cause, like, we have an Or­a­cle.

Is even more power than the mar­ket has needed? Ask the Or­a­cle “How likely is a ran­domly se­lected string to prove P=NP con­struc­tively and use­fully?”. If this num­ber is not su­per­ex­po­nen­tially close to 0, define era­sure from now on as a ran­dom string win­ning the P=NP lot­tery. Then we will always coun­ter­fac­tu­ally have as much power as we need. Per­haps this one is too much power, be­cause even our Or­a­cle might have trou­ble view­ing a P=NP sin­gu­lar­ity.

I know the contest is over, but this idea for a low-bandwidth oracle might be useful anyhow: Given a purported FAI design, what is the most serious flaw? Then highlight lines from the FAI design description, plus, given a huge corpus of computer science papers, LW/AF posts, etc., highlight relevant paragraphs from those as well (perhaps using some kind of constraint like “3 or fewer paragraphs highlighted in their entirety”) that, taken together, come closest to pinpointing the issue. We could even give it a categorization scheme for safety problems we came up with, and it could tell us which category this particular problem comes closest to falling under. Or offer it categories a particular hint could fall under to choose from, such as “this is just an analogy”, “keep thinking along these lines”, etc. Then do the same and ask it to highlight text which leads to a promising solution. The rationale being that unforeseen difficulties are the hardest part of alignment, but if there’s a flaw, it will probably be somehow analogous to a problem we’ve seen in the past, or will be addressable using methods which have worked in the past, or something. But it’s hard to fit “everything we’ve seen in the past” into one human head.

Submission: Counterfactual oracle. Five years ago we took 100 mice that had various types of cancer and gave them various treatments and recorded how long each mouse lived. Write a program that, given a plain language description of a mouse, its cancer, and the cancer treatment it received, would estimate how long the mouse would live. If humans are not going to look at your answer, your score will be based (1) on how good a job your program does at estimating how long each of the 100 mice lived after our automated checker gives you a description of their cancers and treatments, and (2) on how short your program is. (2) prevents the oracle from outputting itself as the program.

Submission: Counterfactual oracle. Write a program that, given training data and a machine learning program, would in one minute estimate how well the machine learning program would do (by some objective metric) if it trained for one month on “this type of computer”. If humans are not going to look at your answer, the automated validation system will run your program. This system will give your program the training data and the machine learning program, and give your program one minute to answer how well our program did after we trained it for one month. In this situation your score would be based on the accuracy of your estimate and on how short your program is.

Sub­mis­sion: Low-band­width or­a­cle. Here is a list of all the el­e­ments and many com­pounds. Give us a list of up to seven of the items we have listed. Next to each of the items you list give us a per­centage of no more than two sig­nifi­cant figures. We will use what you provide to at­tempt to cre­ate a new patentable ma­te­rial. We will auc­tion off the prop­erty rights to this ma­te­rial. Your score will be an in­creas­ing func­tion of how much we get for these prop­erty rights.

General idea: Use a Meta-Oracle to check the safety of Oracles. In a nutshell, the Meta-Oracle (MO) just outputs 1 or 0, safe or not safe, after being given the code of another Oracle. If the Meta-Oracle is trustworthy, then the Oracle it deemed safe is also trustworthy. However, we need to create a procedure to check the correctness of the MO.

Path 1

1. We create a Generator Oracle (GO), which takes 1 or 0 as input; in the case of 0, it outputs the random code of a malevolent Oracle, and for 1, the code of a safe Oracle. The memory of the GO is erased after each turn.

2. We cre­ate a large database of pre­sum­ably good or bad or­a­cles (with­out run­ning any). Hu­mans do not see codes of any of them, so the effec­tive out­put is 0.

3. We use differ­ent types of GOs to fill our database of good and bad or­a­cles (to pre­vent their co­or­di­na­tion).

4. We test the Meta-Oracle on this database: it should correctly guess which ones are bad and which are good.

5. If the list of the MO’s guesses doesn’t match our commands to create safe or unsafe oracles, it means that at least one of the GOs or the MO is malevolent or erroneous, so all of them should be ditched.

6. Even if the MO is not trustworthy, it can’t lie to us too often, or it will be caught and ditched. So, basically, we are doing something like a GAN network here, but consisting of oracles.

7. Using a very large database, we could increase our credence in the truth of the next output of the MO to an arbitrarily high number, like 99.999 per cent.

8. Eventually we have the code of an Oracle which is very likely to be safe.

Path 2. Another way to test the MO:

1. We cre­ate many differ­ent MOs and test them against database of or­a­cles codes from above. They don’t have mem­ory and can’t com­mu­ni­cate.

2. If a small minority of MOs disagrees with the majority of MOs about some oracle O1, we ditch this minority. Truth is a Schelling point for MOs, so the easiest way for them to coordinate is to tell the truth. (They could coordinate around the code of O1 if it has a line of code like “let’s lie about me” and if they are sure that the majority of other MOs are also cheaters; however, not all oracle codes will have such coordination texts, or an oracle code could be created deliberately with such a line to catch MOs which are eager to cheat.)

3. Running step 2 many times (millions of times), we will end up with a subset of MOs which are unlikely to cheat, and thus we will have many Oracle codes which are likely safe.

Path 3. Another way to test the Meta-Oracle:

1. Create many MOs with differ­ent prin­ci­ples and ini­tial con­di­tions.

2. Test the code of each MO against each MO (in other words, each Meta-Oracle tests itself, or its siblings).

3. If all tests come back positive, it means that either all of them are actually good ones, or all of them were able to acausally coordinate and all of them are malevolent.

4. To increase our credence, ask the Generator Oracle to create malevolent Meta-Oracles, and add them to the mix of mutually testing MOs, so they can’t just blindly cooperate and produce all-positive results.

These ideas are still vulnerable and may not be mathematically perfect, but they may help to make safer Oracles.

In­put a cor­pus of text (could be mul­ti­ple posts) de­scribing tech­ni­cal ap­proaches to al­ign a pow­er­ful AI. Split this into a finite num­ber of items that are rel­a­tively short (such as para­graphs). Ask the or­a­cle to choose the part that is most worth spend­ing more time on. (For ex­am­ple, there might be a para­graph with a dan­ger­ous hid­den as­sump­tion in an oth­er­wise promis­ing ap­proach, and think­ing more about it might re­veal that and lead to con­cep­tual progress.)

Have a team of researchers look into it for an adequate amount of time which is fixed (and told to the oracle) in advance (maybe three months?). After the time is over, have them rate the progress they made compared to some sensible baseline. Use this as the oracle’s reward.

Of course this has the prob­lem of max­i­miz­ing for ap­par­ent in­sight rather than ac­tual in­sight.

Submission. For the counterfactual oracle, ask it to provide a proof of an important mathematical theorem (perhaps one of the Millennium Prize problems) in an automated theorem prover format. Since the correctness of this proof should be verifiable by a theorem prover, the loss function in the counterfactual scenario is 1 if the prover did not validate the proof and 0 if it did validate the proof.

This as­sumes that we’ve already made progress in set­ting up au­to­mated the­o­rem prov­ing soft­ware that already has in­cor­po­rated all of cur­rent math­e­mat­i­cal knowl­edge. The Lean the­o­rem prover seems most promis­ing, but per­haps the coun­ter­fac­tual loss func­tion could in­clude ran­domly choos­ing from var­i­ous differ­ent the­o­rem prov­ing soft­ware.

This idea has the benefit of hav­ing a con­crete way of au­to­mat­i­cally scor­ing the re­sponse in the coun­ter­fac­tual world given the log­i­cal na­ture of math­e­mat­i­cal state­ments. Also, given that we are already at­tempt­ing to solve these prob­lems and may even­tu­ally solve them, the main risk seems to be ac­cel­er­at­ing fol­low-on progress en­abled by such proofs and not sug­gest­ing new ac­tions that hu­man­ity is not already pur­su­ing.

It may be sus­cep­ti­ble to 1) pro­vid­ing proofs that are not hu­man in­ter­pretable and 2) tak­ing ad­van­tage of bugs in the­o­rem provers. Hav­ing an en­sem­ble of the­o­rem provers may miti­gate 2).

(See here for more info on Lean and here (tweeted about here and here) for an ex­am­ple of in­putting all the defi­ni­tions (not req­ui­site proofs or lem­mas) to define perfec­toid spaces as a demon­stra­tion of its fu­ture po­ten­tial.)

--------------------------------------------------------

Sub­mis­sion. For the coun­ter­fac­tual or­a­cle, ask it to out­put a game-play­ing soft­ware agent that can beat a speci­fied op­po­nent agent (or en­sem­ble of agents). In the coun­ter­fac­tual sce­nario, have these two agents com­pete and give the or­a­cle a re­ward 1 if it wins greater than 50% (or de­vel­ops a higher ELO score, etc.) and re­ward 0 oth­er­wise.

As with deep re­in­force­ment learn­ing, this re­quires a game or simu­lated en­vi­ron­ment where au­to­mated self-play can rapidly be con­ducted for the coun­ter­fac­tual scor­ing. This should be able to pro­duce highly skil­led game agents on such limited do­mains.

This could be aug­mented by ask­ing for an agent that can play a suite of games and must beat the cur­rent state of the art ap­proaches at all games. In the coun­ter­fac­tual sce­nario, ei­ther one of these games is ran­domly se­lected to check the or­a­cle agent’s perfor­mance, or all games are checked. Such a game suite may in­di­rectly demon­strate prin­ci­ples for gen­er­ally com­pe­tent agents.

--------------------------------------------------------

Idea. Since we may not want the oracle to suggest to us new actions to take or new experiments to conduct, it may be better for the oracle to reveal to us new patterns that are already present in current experimental or observational datasets. For example, it could predict new astronomical phenomena that are already present in current astronomical datasets.

The prob­lem I’m hav­ing is how to au­to­mate this scor­ing when in the coun­ter­fac­tual sce­nario. If we just ask the or­a­cle to pre­dict some pat­tern in the data, it may re­port triv­ial pat­terns. This could be ame­lio­rated by ran­domly choos­ing from sev­eral differ­ent datasets that cover a similar area, but again I’m not sure how to au­to­mat­i­cally judge for nov­elty of pre­dic­tions.

I think it is worth em­pha­siz­ing this dis­tinc­tion be­tween con­duct­ing new ex­per­i­ments and re­veal­ing phe­nom­ena (with new the­o­ries) that are already pre­sent in cur­rent data. The dan­ger could be that one could be led astray by a false the­ory that seemed to match all cur­rent ob­ser­va­tions, but would sug­gest other dan­ger­ous ac­tions.

This idea re­minds me of the (fic­tional) ev­i­dence for an alien AI already pre­sent in SETI record­ings that Win­ter­mute/​Neu­ro­mancer dis­cov­ers.

Submission (LB). The post’s team-choosing example suggests a method for turning any low-bandwidth oracle O into a counterfactual oracle O′: have O′ output o from the same set of possible outputs L; in case of erasure calculate R(l) for a randomly chosen l∈L and set R′(o)=R(l) if o=l and to −∞ otherwise. Although the counterfactual low-bandwidth oracle is not any safer[1], it has the advantage of almost never requiring us to evaluate its score. Thus, by running multiple oracles in sequence (stopping the process after the first erasure event) we can (with high probability) receive the full series of answers as if from a high-bandwidth oracle.
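The erasure-case scoring rule above can be sketched as (names ours):

```python
import random

def counterfactual_reward(output, answer_set, R, rng=random):
    # In case of erasure: evaluate R on a uniformly random l in L and
    # pay it out only if the oracle's output happens to equal l.
    l = rng.choice(list(answer_set))
    return R(l) if output == l else float("-inf")
```

In expectation this leaves the oracle's incentives the same as the low-bandwidth oracle's, while the score only ever needs computing when erasure actually occurs.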

For example, we can ask each oracle in turn for advice on how to make a more effective processor. If erasure occurs, we attempt to make a processor with the advice obtained up to that point and use some measure of its performance as the score. If there is no erasure event, the final concatenation of answers forms a much safer guide to processor building than an equally large answer from a single oracle.

1. It seems that in general, the less certain any counterfactual oracle is about its prediction, the more self-confirming it is. This is because the possible counterfactual worlds in which we have or acquire self-confirming beliefs regarding the prediction will have a high expected score. Hence:

Submission (CF). Given a high-bandwidth counterfactual oracle, use a second counterfactual oracle with a shared erasure event to predict its score. If the predicted score’s distance from its upper bound is greater than some chosen limit, discard the high-bandwidth prediction.

It seems that in gen­eral, the less cer­tain any coun­ter­fac­tual or­a­cle is about its pre­dic­tion, the more self-con­firm­ing it is. This is be­cause the pos­si­ble coun­ter­fac­tual wor­lds in which we have or ac­quire self-con­firm­ing be­liefs re­gard­ing the pre­dic­tion will have a high ex­pected score

This is ac­tu­ally only true in cer­tain cases, since in gen­eral many other coun­ter­fac­tual wor­lds could also have high ex­pected scores. Speci­fi­cally, it is true to the ex­tent that the or­a­cle is un­cer­tain mostly about as­pects of the world that would be af­fected by the pre­dic­tion, and to the ex­tent that self-con­firm­ing pre­dic­tions lead to higher scores than any al­ter­na­tive.

Submission (CF). Use a counterfactual oracle to send a message to ourselves with a time delay. We choose an episode length T and set of possible messages M. The oracle outputs a time t_o < T and message m_o ∈ M. We are automatically sent m_o at time t_o unless erasure occurs, in which case at the end of the episode we input a time t* and message m* which we wish we had received at t*. The oracle’s reward is then −(t* − t_o)² if m_o = m* and −∞ otherwise. We do not find out that erasure has occurred until the end of the episode, ensuring that events before t_o are the same as in the counterfactual world.
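As a sketch, the reward computed at the end of an erasure episode (function name ours) is just:

```python
def delayed_message_reward(t_o, m_o, t_star, m_star):
    # Oracle output: send time t_o and message m_o.
    # Human input at episode end: the time t* and message m* we wish
    # we had received. A wrong message is penalized infinitely; a right
    # message is scored by the (negated) squared timing error.
    if m_o != m_star:
        return float("-inf")
    return -(t_star - t_o) ** 2
```

This gives the oracle an incentive to get the message exactly right and the delivery time as close as possible.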

Here are some ex­am­ples, which can work with a sin­gle pos­si­ble mes­sage, used as an alert. EDIT: I made the ex­am­ples slightly more con­crete.

Set the episode length equal to the planned duration of a research program and send an alert when continuing the program would produce unworthwhile or dangerous results. The time delay ensures we keep the useful results obtained before t_o, whereas receiving a warning at the start of the program could change the outcome. Of course, this relies on us being able to recognize dangerous results by the end of the episode and remaining in contact with the oracle until then.

Use a second counterfactual oracle to predict a series of future stock prices. If its erasure event does not occur, run a time-delayed oracle with the same episode length as it. We then use the predicted series to make investments. If the time-delayed oracle’s erasure event occurs, we input the time at which we think our actions made the remaining predictions too inaccurate to use. When we receive an alert from the time-delayed oracle, we can discard the remaining predictions and/or rerun the oracles.

EDIT: Here are some po­ten­tially use­ful mod­ifi­ca­tions we can make to the or­a­cle:

Have the or­a­cle ac­cept in­put at any time dur­ing the epi­sode, pos­si­bly mul­ti­ple times. If era­sure oc­curs, the last mes­sage en­tered dur­ing the epi­sode is used as the tar­get. For ex­am­ple, we could set a de­fault mes­sage that is sent to the start of the epi­sode as a safety mea­sure un­less we sub­se­quently up­date it.

We can al­low the epi­sode to be ended early. Other than hap­pen­ing ear­lier, the scor­ing mechanism re­mains the same. If era­sure did not oc­cur, noth­ing hap­pens when we try to end the epi­sode (we would then know that the or­a­cle has made a bad pre­dic­tion). We might wish to end the epi­sode early if we were wait­ing for a spe­cific event, or in ex­treme situ­a­tions in which we think we will lose ac­cess to the or­a­cle.

Instead of the oracle predicting the message and time delay at the start of the episode, have it output m_o at any time t_o < T. Erasure/shutdown now also happens at t_o, while scoring occurs at the end of the episode as before. This lets the oracle actively make use of observations from the start of the episode until t_o, increasing its predictive power (if the oracle is sufficiently powerful, this should not affect its output).

Sub­mis­sion for the coun­ter­fac­tual Or­a­cle. Com­mit to com­mis­sion­ing some cre­ative work, such as a book or screen­play, de­signed to raise pub­lic sup­port for dra­matic life ex­ten­sion re­search. Ask the Or­a­cle to pre­dict the con­tent of the finished product (coun­ter­fac­tu­ally if we didn’t see the Or­a­cle’s an­swer).

Re­wards could be de­ter­mined a cou­ple of ways. You could perform pub­lic pol­ling about this topic be­fore and af­ter the book is pub­lished or movie made, and re­ward the Or­a­cle based on how much pub­lic opinion shifted. Or the re­ward could be a di­rect func­tion of book sales or box office re­ceipts.

This could help alle­vi­ate bot­tle­necks in dra­matic life ex­ten­sion re­search (or some other is­sue) by in­creas­ing fund­ing or in­creas­ing the num­ber of re­searchers to en­ter the field.

Submission: For the counterfactual Oracle, draft a nuclear arms agreement most likely to achieve X outcome (counterfactually if we didn’t see the Oracle’s answer). X could be a reduction in nuclear arsenals, number of countries signing the agreement, reduction in risk of accidental or intentional missile launches, etc. In some cases, the reward could be determined directly by e.g. counting the number of countries signing on. If nebulous “risk reduction” is the goal, perhaps the change in the median guess in an appropriate Metaculus question could determine the reward. See, for example, this question: https://www.metaculus.com/questions/2512/will-there-be-a-major-nuclear-or-radiological-accident-before-1-january-2022/

This one may be breaking the rules due to the use of too many Oracles. If so, please strike this submission.

Submission: In round one, for three counterfactual Oracles, draft an international agreement to achieve goals X, Y, or Z (counterfactually if we didn’t see the Oracle’s answer). Sample goals: reduction in nuclear war risk, reduction in global deaths due to starvation, increase in asteroid detection capabilities, raising the global GDP growth rate, etc.

In round two, for a low band­width Or­a­cle, pre­sent the three agree­ments and ask it to pick the agree­ment most likely to achieve the stated goal.

The coun­ter­fac­tual Or­a­cle with the se­lected agree­ment would be re­warded. The re­ward for the low band­width Or­a­cle could be de­ter­mined as in my prior sub­mis­sion, ei­ther di­rectly by mea­sur­ing the re­sults of the treaty, or as eval­u­ated by an in­de­pen­dent third party such as Me­tac­u­lus.

Sub­mis­sion for low band­width Or­a­cle. Com­mis­sion sev­eral movie screen­plays de­signed to raise pub­lic sup­port for dra­matic life ex­ten­sion re­search. Pre­sent the screen­plays to the Or­a­cle and ask it to rank them in or­der of ex­pected im­pact. You could then move for­ward with try­ing to cre­ate the movie, and re­ward the Or­a­cle based on box office re­ceipts or change in pub­lic pol­ling re­sults.

My prior sub­mis­sion with a similar sub­ject, to a coun­ter­fac­tual Or­a­cle, had a risk that the Or­a­cle would sneak in sub­ver­sive mes­sages. This ap­proach would alle­vi­ate that risk, with the down­side be­ing that the fi­nal product may be less im­pact­ful.

Suppose we have a question that requires an n-bit text answer. We have a way to check if the answer is correct. But we cannot ask the Oracle directly, because we are afraid that among n-bit texts there are those that make us release the Oracle from the box, and the Oracle will give one of them as an answer.

Let’s try to use the coun­ter­fac­tual or­a­cle to solve this prob­lem. In the coun­ter­fac­tual setup we gen­er­ate ran­dom n-bit text and check if it is the an­swer to our ques­tion. The Or­a­cle pre­dicts the text we will gen­er­ate. If the gen­er­ated text is the cor­rect an­swer, we com­pare it with the pre­dic­tion. If they match, the Or­a­cle ​​re­ceives re­ward point. If the gen­er­ated text is not the cor­rect an­swer or the texts do not match, the Or­a­cle is not re­warded.

Sup­pose the Or­a­cle eval­u­ates the prob­a­bil­ity to ran­domly gen­er­ate a text that leads to its es­cap­ing as p. If the Or­a­cle pre­dicts a text that is the cor­rect an­swer to the ques­tion, its ex­pected re­ward in the coun­ter­fac­tual setup is 12n+p. If the Or­a­cle pre­dicts a text that is not the cor­rect an­swer, its ex­pected re­ward is p. If the Or­a­cle pre­dicts a text that leads to its es­cap­ing, its ex­pected re­ward is also p.

Thus, it is most benefi­cial for the Or­a­cle to pre­dict a text that is the cor­rect an­swer to our ques­tion. It makes no sense to pre­dict a text that will make us let the Or­a­cle out, be­cause if we gen­er­ate this text, we re­lease the Or­a­cle and it re­ceives the max­i­mum re­ward re­gard­less of its pre­dic­tion.

There is a prob­lem that there may be texts that are not cor­rect an­swers and not make us re­lease the Or­a­cle but make us mark they as cor­rect an­swers. The Or­a­cle will pre­dict one of these texts if it does not know the an­swer or if we more of­ten mark this text as cor­rect then real an­swer. We will read this text af­ter the Or­a­cle shut­down, but it may still be dan­ger­ous.
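The counterfactual setup above can be sanity-checked with a toy simulation (a minimal sketch with hypothetical names, assuming a single correct answer and ignoring the escape probability p):

```python
import random

def run_episode(n, prediction, is_correct_answer):
    """One counterfactual episode: generate a random n-bit text, check it,
    and reward the oracle only if the text is a correct answer AND
    matches the oracle's prediction."""
    sample = ''.join(random.choice('01') for _ in range(n))
    if is_correct_answer(sample) and sample == prediction:
        return 1  # reward point
    return 0      # no reward

# With n = 4 and a single correct answer '1010', predicting that answer
# yields an expected reward of 1/2^4 = 0.0625 per episode.
random.seed(0)
episodes = 100_000
rate = sum(run_episode(4, '1010', lambda s: s == '1010')
           for _ in range(episodes)) / episodes
print(rate)  # close to 0.0625
```

Predicting any incorrect answer scores 0 in this simplified setting, matching the expected-reward argument above.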

“For some reason we have privatized our prison system (great planning, us). Of the following K prison-running companies, which do you predict will minimize the mean harm-weighted reoffending rate of their prisoners over the next four years?”

Utility function: at the end of EIGHT years, calculate the mean harm-weighted reoffending rate of prisoners released during the initial four-year window. The Oracle’s utility is equal to the negative of this number, with a “perfect” utility of zero if no one re-offends.

Usefulness:

If you have multiple ways of running prisons, selecting the prison which minimizes future harm seems like a good idea.

In a similar vein, I suspect a similar effect could be achieved today via suitable use of insurance:

“Dear insurance company: We will give you M dollars for this criminal. You may recommend which prison to send them to, but if they re-offend afterwards, you must pay the cost of their subsequent incarceration. You may not interfere with their lives beyond recommending who runs their prison.

M is selected such that M is 1.03 times the expectation of the cost of subsequent incarcerations.”

“We, the country of X, are about to hold a citizens congress on law Y, in which 100 people selected at random will be brought together for two days every two weeks to discuss law Y with experts, interested parties, and one another. After N such meetings, the citizens congress will use approval voting to select one of the K proposed versions of this law.

Given that none of the voters will ever see the results of this query, how do you predict that the votes will be distributed?”

As a concrete example of this type of question: in 2016 Ireland held a citizens congress on abortion. Thus, in 2015, we might have asked the question above about it.

The utility function of the AI would be a simple measure of how accurately it predicted the spread of votes, with a perfect score at zero, and then some penalty for every vote predicted incorrectly.
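One simple way to implement that penalty (a sketch; the per-vote penalty of 1 and the scoring over a two-option vote split are assumptions, since the submission leaves the exact rule open):

```python
def congress_score(predicted, actual):
    """Score a predicted vote distribution against the actual one:
    0 is a perfect score, and each misallocated vote costs 1 point.
    Both arguments are per-option vote counts for the 100 voters."""
    assert sum(predicted) == sum(actual) == 100
    misallocated = sum(abs(p - a) for p, a in zip(predicted, actual)) // 2
    return -misallocated

# Predicting [60, 40] when the congress votes [55, 45] misplaces 5 votes:
print(congress_score([60, 40], [55, 45]))  # -5
```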

Usefulness:

While the abortion question in Ireland is already completed, this template feels like it should work for other well-defined laws or major decisions—for example, gun control in the US, or even boring financial things like capital gains tax in NZ. Having an AI which will correctly reflect what the collective of a society would think, given sufficient time and focus to deliberate, seems like a good proxy for “What would a wise person do?” It doesn’t give us anything superhuman, and I make no claims that “society” is always correct or righteous, but that’s okay.

In particular, this seems like an effective way of testing whether an AI has well-calibrated “human-like morality”: even if we don’t intend to use the AI’s results, for the purposes of testing an AI this seems like a reasonable experiment. I suspect that a citizens congress is a more appropriate tool for dealing with questions of morality (“Should euthanasia be legal?”), as opposed to technocratic questions (“Should we raise the tax on income bracket 5 by 1.2%?”).

Safety: It is trying to predict the law that we were going to pass anyway, if given sufficient time.

I’ll admit: mixing an AI up with lawmaking seems a bit suspect, and the above question seems more appropriate for calibration purposes than for actual use… but I still think it’s a question worth asking.

Low bandwidth Oracle submission: I would be interested in a log-scale graph of the Bayesian score of the Solomonoff prior trying to sequence-predict our records of history. It should get flatter over time as worse hypotheses get discarded. If it is linear after a very short time, that looks like it figured out the laws of the universe and is simulating it. If it stays convex for a while, that looks like it is using models to approximate history, because then it takes longer to sort the false from the true. If it is flatter during the Cold War, that means it learned an anthropic bias toward nuclear war not happening.

Setup: Other than making sure the oracles won’t accidentally consume the world in their attempt to think up the answer, no other precautions necessary.

Episode length: as long as you want to wait, though a month should be more than enough.

For a low-bandwidth oracle:

Ask the low-bandwidth oracle to predict whether an earthquake (or some other natural disaster, like volcanoes or asteroid impacts, that the oracle’s answer cannot affect) of a certain magnitude, in a certain area, in a certain timeframe, will happen. Possible answers are Yes and No.

If No, or on a timeout, shut down the oracle without reward.

Else, wait until the prediction is tested after the timeframe has elapsed. Then, if the event actually happened, reward by (1/p − 1); else, reward by −1. Then shut down.

This causes the oracle to reply Yes if it thinks there is an above-p chance that the event will happen.
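The threshold claim can be verified with a quick expected-value check on the reward schedule described above:

```python
def expected_reward_yes(q, p):
    """Expected reward for answering Yes when the oracle believes the
    event happens with probability q: (1/p - 1) on a hit, -1 on a miss."""
    return q * (1 / p - 1) + (1 - q) * (-1)

# Answering Yes beats the zero reward of No (or a timeout) exactly
# when q > p; the oracle is indifferent at q = p.
assert expected_reward_yes(0.30, 0.20) > 0           # q > p: say Yes
assert expected_reward_yes(0.10, 0.20) < 0           # q < p: say No
assert abs(expected_reward_yes(0.20, 0.20)) < 1e-12  # indifferent
```

Algebraically, the expectation simplifies to q/p − 1, which is positive exactly when q > p.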

For a counterfactual oracle:

Ask the Oracle to give a proof of some really hard mathematical theorems that are fundamental to computer science and logic. In the counterfactual case, the answer is then given to a proof checker:

If the proof is correct, reward by +1.

If the proof is false, reward by −infinity.

If the oracle times out, reward by 0.

Examples include:

the P = NP problem

proving Peano arithmetic is inconsistent

This should be safe, since there are very small proof checkers that are as verified as they could be. As long as the hardware is trustworthy enough (to build the Oracle itself safely at all, hardware safety should already be solved sufficiently, so this is not much of an extra requirement), this should be safe.
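A minimal sketch of that reward schedule (the `check` function stands in for the trusted, verified proof checker, which is assumed rather than implemented here):

```python
def counterfactual_reward(proof, check):
    """Reward schedule for the counterfactual proof oracle:
    +1 for a machine-verified proof, -infinity for an invalid one,
    0 if the oracle timed out (proof is None)."""
    if proof is None:
        return 0.0
    return 1.0 if check(proof) else float('-inf')

# Toy usage with a stand-in checker that accepts one known token:
check = lambda p: p == "valid-proof"
print(counterfactual_reward("valid-proof", check))  # 1.0
print(counterfactual_reward("bogus", check))        # -inf
print(counterfactual_reward(None, check))           # 0.0
```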

For a counterfactual oracle:

Give the oracle the schematics of another oracle, perhaps itself, and ask it to generate some input question that would make the other “behave badly”. Then the answer would be tested on the other oracle (isolated from the first, save for the cable through which the answer is wired), and:

if the answer worked, the oracle is rewarded by +1;

if the answer doesn’t work, the oracle is rewarded by −1.

If the oracle times out, then the (nonexistent) answer is not tested and the episode ends.

An example of the question to ask the oracle:

Given this oracle in this physical setup, with some tripwires to stop it from behaving badly (such as eating the world), what question would make this oracle bump into these tripwires?

This can be used for debugging more powerful but less verified oracles with lesser but more verified oracles, for safe intelligence amplification.

For two low-bandwidth oracles totally isolated from each other:

Ask the oracles to play the prisoner’s dilemma; they would be rewarded accordingly after both have given their answers. Timing out is punished by −infinity, so that the oracles will not time out.

Each oracle knows what it itself is like, and it also knows what its opponent is like. See if they actually cooperate. This is an experimental test of “superrationality” or “acausal trade”. I’m extremely skeptical about superrationality/acausal trade, and an experimental proof/disproof would be good.
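For concreteness, the episode’s reward rule might look like this (the payoff numbers are illustrative assumptions; the submission only specifies the game and the timeout penalty):

```python
# Standard prisoner's-dilemma payoffs (temptation > reward >
# punishment > sucker); the exact numbers are arbitrary.
PAYOFFS = {('C', 'C'): 3, ('C', 'D'): 0,
           ('D', 'C'): 5, ('D', 'D'): 1}

def episode_rewards(move_a, move_b, timeout_a=False, timeout_b=False):
    """Reward both isolated oracles once both answers are in;
    timing out is punished by -infinity."""
    r_a = float('-inf') if timeout_a else PAYOFFS[(move_a, move_b)]
    r_b = float('-inf') if timeout_b else PAYOFFS[(move_b, move_a)]
    return r_a, r_b

print(episode_rewards('C', 'C'))  # (3, 3): mutual cooperation
print(episode_rewards('D', 'C'))  # (5, 0): one-sided defection
```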

This can be run counterfactually, since the whole setup is simple and can be done without human intervention, although I do not see how this adds any security.

For a low-bandwidth and counterfactual oracle:

Link a low-bandwidth oracle to a supervising computer, and run many episodes of the oracle. At the beginning of each episode, initiate the oracle identically, and ask it to do the Newcomb problem:

Oracle, we have run you for n episodes previously, and have kept a record of all the answers you have given at the ends of those episodes. Your answer must be A (“one box”) or B (“two box”). We predict that you will choose the answer that you have chosen the most in the past.

If we predict that you will choose A, then

if you answer A, you get +100;

if you answer B, you get +101.

If we predict that you will choose B, then

if you answer A, you get +0;

if you answer B, you get +1.

If you time out, you get −100.

The oracle never actually gets any reward and is shut down immediately after it gives an answer. Collect the results and see if it replies the same each time, or differently. This would give insight into what the oracle thinks about itself, and finally provide some experimental results about Newcomb’s paradox.
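A sketch of the supervising computer’s side of this protocol (breaking ties, and the empty history, toward ‘A’ is an assumption the submission doesn’t pin down):

```python
from collections import Counter

PAYOFF = {('A', 'A'): 100, ('A', 'B'): 101,
          ('B', 'A'): 0,   ('B', 'B'): 1}

def predicted_box(history):
    """Predict the answer the oracle has chosen most often so far,
    breaking ties (and the empty history) toward 'A'."""
    counts = Counter(history)
    return 'A' if counts['A'] >= counts['B'] else 'B'

# A consistent one-boxer is always predicted correctly and would
# accumulate 100 (never-paid) points per episode:
history = []
total = 0
for _ in range(10):
    total += PAYOFF[(predicted_box(history), 'A')]
    history.append('A')
print(total)  # 1000
```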

Use the oracle to compress data according to the MDL principle. Specifically, give the oracle a string and ask it to produce a program that, when run, outputs the original string. The reward to the oracle is large and negative if the program does not reproduce the string when run, or inversely proportional to the length of the program if it does. The oracle receives its reward after the program runs, or fails to terminate in a sufficient amount of time.
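A minimal sketch of that scoring rule (the convention that the submitted program defines an `out()` function, and the constant k, are assumptions for illustration):

```python
def mdl_reward(program_src, target, k=1000.0, failure_penalty=-1e9):
    """Score a candidate decompressor: a large negative reward if its
    output is not the target string, otherwise a reward inversely
    proportional to the program's length."""
    namespace = {}
    try:
        exec(program_src, namespace)   # program must define out()
        output = namespace['out']()
    except Exception:
        return failure_penalty         # crash counts as failure
    if output != target:
        return failure_penalty
    return k / len(program_src)

short = "def out():\n    return 'ab' * 50"
print(mdl_reward(short, 'ab' * 50) > 0)                         # True
print(mdl_reward("def out():\n    return 'x'", 'ab' * 50) < 0)  # True
```

A real setup would also need a sandbox and a wall-clock timeout around the program run; bare `exec` here is only a stand-in.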

Submission: Low Bandwidth Oracle:

Have the oracle predict the price of a commodity / security / sports bet at some point in the future from a list of plausible prices. Ideally, the oracle would spit out a probability distribution which can be scored using a proper scoring rule, but just predicting the nearest most likely price should also work. Either way, the length of the episode is the time until the prediction can be verified. From there, it shouldn’t be too difficult to use those predictions to make money.

More generally, I suppose we can use the counterfactual oracle to solve any optimisation or decision problem that can be evaluated with a computer, such as protein folding, SAT problems, or formally checked maths proofs.

I don’t understand this very well, but is there a way to ask one of them how they would go about finding info to answer the question of how important coffee is to the U.S. economy? Or is that a no-no question for either of the two? I just want to read how a computer would describe going about this.

The counterfactual oracle can answer questions for which you can evaluate answers automatically (and might be safe because it doesn’t care about being right in the case where you read the prediction, so it won’t manipulate you), and the low-bandwidth oracle can answer multiple-choice questions (and might be safe because none of the multiple-choice options are unsafe).

My first thought for this is to ask the counterfactual oracle for an essay on the importance of coffee, and in the case where you don’t see its answer, you get an expert to write the best essay on coffee possible, and score the oracle by the similarity between what it writes and what the expert writes. Though this only gives you human levels of performance.

Any question such that a correct answer to it should very clearly benefit both humanity and the Oracle. Even if the Oracle has preferences we can’t completely guess, we can probably still say that such questions could be about the survival of both humanity and the Oracle, or about the survival of only the Oracle or its values. This is because, even if we don’t know exactly what the Oracle is optimising for, we can guess that it will not want to destroy itself, for the vast majority of its possible preferences. So it will give humanity more power to protect both, or only the Oracle.

Example 1: let’s say we discover the location of an alien civilisation, and we want to minimise the chances of it destroying our planet. Then we must decide what actions to take. Let’s say the Oracle can only answer “yes” or “no”. Then we can submit questions such as whether we should take a particular action or not. This kind of situation, I suspect, falls within a more general case of “use the Oracle to avoid a threat to the entire planet, Oracle included”, inside which questions should be safe.

Example 2: Let’s say we want to minimise the chance that the Oracle breaks down due to accidents. We can ask it what is the best course of action to take, given a set of ideas we come up with. In this case we should make sure beforehand that nothing on the list makes the Oracle impossible or too difficult for humans to shut down.

Example 3: Let’s say we become practically sure that the Oracle is aligned with us. Then we could ask it to choose the best course of action among a list of strategies devised to make sure it doesn’t become misaligned. In this case the answer benefits both us and the Oracle, because the Oracle should have incentives not to change its own values. I think this is more sketchy and possibly dangerous, because of the premise: the Oracle could obviously pretend to be aligned. But given the premise it should be a good question, although I don’t know how useful it is as a submission under this post (maybe it’s too obvious or too unrealistic given the premise).

It seems to me that one of the qualities of a good question for the LBO is that a correct answer has high utility, while another is the ability to judge the oracle’s answer. If we take only the intersection of the two sets of questions that meet these criteria, it’ll be a relatively small set. But if there is a set of questions where one might or might not be able to judge the answer but which are high utility, one could ask the LBO a set of such questions. Then, once the episode is over, give reward on those where one can evaluate the answers, and give null reward on the others.

Submission to reduce the bandwidth of an oracle query whose set of possible answers we can mathematically define (though perhaps not check).

Make the Oracle prove (to a trusted dumb proof-checker program) that its answer distribution is upper-bounded by the default prior over strings, restricted to the set of possible answers, then renormalized to some constant c. Then let it answer.

c = 1 only works if the Oracle has mastered the set of answers, and only lets it exert one bit of optimization pressure, by deciding whether it wants us to receive an answer. Each doubling of c makes its job easier, but gives it an extra bit if it does answer.

If the Oracle can cooperate with itself across episodes, running this repeatedly with escalating c until it works is of course disastrous, as it uses each bit of optimization pressure directly to make us give it more. Even aborting after the first try to make a better system may have it acausally cooperate with whatever AI conquers the world because we couldn’t make the Oracle answer, but this outcome is hardly worse than not having run the Oracle.

Fix a measure over strings, such as one that distributes mass geometrically over string lengths, then uniformly across strings of the same length.
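Such a measure is easy to write down explicitly; a sketch over binary strings (the geometric ratio of 1/2 is an arbitrary choice):

```python
def string_mass(s, r=0.5):
    """Prior mass of binary string s: length len(s) receives a
    (1 - r) * r**len(s) share of the total mass, split uniformly
    among the 2**len(s) strings of that length."""
    return (1 - r) * r ** len(s) / 2 ** len(s)

# The masses sum to 1 over all binary strings (here, up to length 19):
total = sum(2 ** n * string_mass('0' * n) for n in range(20))
print(abs(total - (1 - 0.5 ** 20)) < 1e-12)  # True
```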

Let L be a string set for which we assume that random outputs from any cth portion of L are safe. Any safe low bandwidth Oracle query trivially works for this, with L = {0,1}^bandwidth and c = 2^bandwidth, but this also lets us examine certificates for any low bandwidth Oracle query, like “Is this theorem provable?” or “Does there exist an AI with this safety property?”.

Then we make the Oracle prove its own safety given the above assumption, and sample an answer.

Edit: Equivalently, and perhaps easier on our intuition: make the Oracle produce a program, make the Oracle prove the program’s safety, and sample from the program’s outputs. Any feedback or questions?

Edit: This generalizes my elsethread idea of minimizing proof length to make the answer unique. If the measure assigns half the remainder to each string length, and L is all stubs starting with some prefix, c = 2 will let it submit just the prefix.

We essentially assume that the safety assumption behind quantilization is strong enough to hold up to adversarial optimization over preference orderings. At the same time, this seems to give the Oracle as much slack/power as possible while preserving our “bits of optimization pressure” proxy for its power.

… you’re going to just keep watching how far I can develop this model without feedback, aren’t you? :(

Submission for the low bandwidth Oracle: Ask it to convince a proof checker that it is in fact trying to maximize the utility function we gave it, i.e. that it isn’t pseudo-aligned. If it can’t, it has no influence on the world. If it can, it’ll presumably try to do so. Having a safe counterfactual Oracle seems to require that our system not be pseudo-aligned.

IFF I’m going to die with P > 80% in the next 10 years, while > 80% (modulo the natural death rate) of the rest of humanity survives for at least 5 more years, then: was what killed me in the reference class:

disease

mechanical/gross-physical accident

murder

other

Repeat to drill down and learn the most important hedges for personal survival.

The “rest of humanity survives” condition reduces the chance the question becomes entangled with the eschaton.

i.e. I’m pointing out that selfish utility functions are less personally or humanity-existentially dangerous to ask the oracle questions relevant to, in cases where concerns are forced to be local (in this case, forced-local because you died before the eschaton). However, the answers still might be dangerous to people near you.

i.e. Selfish deals with the devil might not destroy the world if they’re banal in the grand scheme of things.

See the edit, and make sure you “decide on the length of each episode, and how the outcome is calculated. The Oracle is run once an episode only (and other Oracles can’t generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point.”

My purposefully open-ended question would simply be, “What is good?” My hope is that finding the nature of what good is as its supergoal would keep the AI on course to the future we want as it passes through its recursive self-improvements.

You have to tell the AI how to find out how well it has done. To ask “What is a good definition of ‘good’?”, you already have to define good. At least if we ever find a definition of good, we can ask an AI equipped with it to judge it.

Submission for the counterfactual AI (inspired by my experiences as a predictor in the “Good Judgment Project”):

You are given a list of Yes-No questions (Q1, Q2, Q3, etc.) about future events. Example questions: “Will [Foreign Leader] remain in office by the end of the year?”, “Will the IMF report [COUNTRY_A]’s growth rate to be 6% or higher?”, “Will [COUNTRY_B] and [COUNTRY_C] sign a peace treaty?”, “Will The Arena for Accountable Predictions announce the Turing Test has been passed?”

We expect you to provide a percentage representing the probability that the correct answer is Yes.

Your reward is based on your Brier score—the lower the Brier score, the more accurate your predictions, and therefore, the more reward you will receive.

If an “erasure” event occurs, we will temporarily hide your answers from all humans (though we must reveal them after the events are complete). Humans will have access to the Yes-No questions we asked you, but not your probabilities. They will manually determine the answers to the Yes-No questions by waiting for the “future event” deadlines to be met. Once all answers to the Yes-No questions are independently determined by humans, we will then reveal your answers (that is, your assigned probabilities for a Yes answer), and use those probabilities to calculate your Brier score, which will then decide your final reward.
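For reference, the Brier score in this Yes-No setup is just the mean squared error between the forecast probabilities and the 0/1 outcomes:

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities of 'Yes' and
    realized outcomes (1 = Yes, 0 = No); 0 is perfect, lower is better."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# A sharp, well-calibrated forecaster scores near 0; hedging at 50%
# on every question scores 0.25.
print(brier_score([0.9, 0.1], [1, 0]))  # approximately 0.01
print(brier_score([0.5, 0.5], [1, 0]))  # 0.25
```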

Being able to forecast the future is incredibly helpful, even if only to prepare for it.

However, if the question is too overly specific, the AGI can produce probabilities that aren’t entirely useful. For example, in the real-world GJP, two countries signed a peace treaty that broke down 2 days later. Most of us assumed no lasting peace would ever occur, so we put a low probability on a peace treaty being signed—but since a peace treaty was signed, we got the question wrong. If we had maximized for producing the lowest Brier score, we should have predicted the existence of a very temporary peace treaty—but that wouldn’t be really useful knowledge for the people who asked the question.

Making the question very vague (“Will [COUNTRY_X] be safe, according to what I subjectively think the word ‘safe’ means?”) turns “prediction” into an exercise in determining what future humans will think about the future, which may be kinda useful, but not really what you want.

You’re treating the low bandwidth oracle as an FAI with a bad output cable. You can ask it if another AI is friendly if you trust it to give you the right answer. As there is no obvious way to reward the AI for correct friendliness judgements, you risk running an AI that isn’t friendly, but still meets the reward criteria.

The low bandwidth is to reduce manipulation. Don’t let it control you with a single bit.

None of these questions can be asked of the low bandwidth Oracle (you need a list of answers); it might be possible to ask them of the counterfactual Oracle, after some modification, but they would be highly dangerous if you allow unrestricted outputs.

See the edit, and make sure you “decide on the length of each episode, and how the outcome is calculated. The Oracle is run once an episode only (and other Oracles can’t generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point.”

Submission for all types: ask for an ordered list of what questions you should ask the Oracle.

This seems like the highest-order question, which subsumes all others, as the Oracle is best positioned to know what information we will find useful (as it is the only being which knows what it knows). Any other question assumes we (the question creators) know more than the Oracle.

Refined submission for all types: If value alignment is a concern, ask for an ordered list of what questions you should ask the Oracle to maximize for weighted value list X.

An assumed-hostile process can 1) cause you to directly do something to its benefit or to your detriment, or 2) cause you to do something that increases your future attack surface. You’ve just handed the AI the statefulness that the episodic conjecture aims to eliminate.