There's a new LW wiki page on the Roko's basilisk thought experiment, discussing both Roko's original post and the fallout from Eliezer Yudkowsky's decision to ban the topic on Less Wrong discussion threads. The wiki page, I hope, will reduce how much people have to rely on speculation or reconstruction to make sense of the arguments.

While I'm on this topic, I want to highlight points that I see omitted or misunderstood in some online discussions of Roko's basilisk. The first point that people writing about Roko's post often neglect is:

Roko's arguments were originally posted to Less Wrong, but they weren't generally accepted by other Less Wrong users.

Less Wrong is a community blog, and anyone who has a few karma points can post their own content here. Having your post show up on Less Wrong doesn't require that anyone else endorse it. Roko's basic points were promptly rejected by other commenters on Less Wrong, and as ideas not much seems to have come of them. People who bring up the basilisk on other sites don't seem to be super interested in the specific claims Roko made either; discussions tend to gravitate toward various older ideas that Roko cited (e.g., timeless decision theory (TDT) and coherent extrapolated volition (CEV)) or toward Eliezer's controversial moderation action.

In July 2014, David Auerbach wrote a Slate piece criticizing Less Wrong users and describing them as "freaked out by Roko's Basilisk." Auerbach wrote, "Believing in Roko's Basilisk may simply be a 'referendum on autism'" — which I take to mean he thinks a significant number of Less Wrong users accept Roko's reasoning, and they do so because they're autistic (!). But the Auerbach piece glosses over the question of how many Less Wrong users (if any) in fact believe in Roko's basilisk. Which seems somewhat relevant to his argument...?

The idea that Roko's thought experiment holds sway over some community or subculture seems to be part of a mythology that’s grown out of attempts to reconstruct the original chain of events; and a big part of the blame for that mythology's existence lies on Less Wrong's moderation policies. Because the discussion topic was banned for several years, Less Wrong users themselves had little opportunity to explain their views or address misconceptions. A stew of rumors and partly-understood forum logs then congealed into the attempts by people on RationalWiki, Slate, etc. to make sense of what had happened.

I gather that the main reason people thought Less Wrong users were "freaked out" about Roko's argument was that Eliezer deleted Roko's post and banned further discussion of the topic. Eliezer has since sketched out his thought process on Reddit:

When Roko posted about the Basilisk, I very foolishly yelled at him, called him an idiot, and then deleted the post. [...] Why I yelled at Roko: Because I was caught flatfooted in surprise, because I was indignant to the point of genuine emotional shock, at the concept that somebody who thought they'd invented a brilliant idea that would cause future AIs to torture people who had the thought, had promptly posted it to the public Internet. In the course of yelling at Roko to explain why this was a bad thing, I made the further error---keeping in mind that I had absolutely no idea that any of this would ever blow up the way it did, if I had I would obviously have kept my fingers quiescent---of not making it absolutely clear using lengthy disclaimers that my yelling did not mean that I believed Roko was right about CEV-based agents [= Eliezer’s early model of indirectly normative agents that reason with ideal aggregated preferences] torturing people who had heard about Roko's idea. [...] What I considered to be obvious common sense was that you did not spread potential information hazards because it would be a crappy thing to do to someone. The problem wasn't Roko's post itself, about CEV, being correct.

This, obviously, was a bad strategy on Eliezer's part. Looking at the options in hindsight: To the extent it seemed plausible that Roko's argument could be modified and repaired, Eliezer shouldn't have used Roko's post as a teaching moment and loudly chastised him on a public discussion thread. To the extent this didn't seem plausible (or ceased to seem plausible after a bit more analysis), continuing to ban the topic was a (demonstrably) ineffective way to communicate the general importance of handling real information hazards with care.

On that note, point number two:

Roko's argument wasn’t an attempt to get people to donate to Friendly AI (FAI) research. In fact, the opposite is true.

Roko's original argument was not 'the AI agent will torture you if you don't donate, therefore you should help build such an agent'; his argument was 'the AI agent will torture you if you don't donate, therefore we should avoid ever building such an agent.' As Gerard noted in the ensuing discussion thread, threats of torture "would motivate people to form a bloodthirsty pitchfork-wielding mob storming the gates of SIAI [= MIRI] rather than contribute more money." To which Roko replied: "Right, and I am on the side of the mob with pitchforks. I think it would be a good idea to change the current proposed FAI content from CEV to something that can't use negative incentives on x-risk reducers."

Roko saw his own argument as a strike against building the kind of software agent Eliezer had in mind. Other Less Wrong users, meanwhile, rejected Roko's argument both as a reason to oppose AI safety efforts and as a reason to support AI safety efforts.

Here, for example, is how RationalWiki's basilisk article characterized it:

a futurist version of Pascal’s wager; an argument used to try and suggest people should subscribe to particular singularitarian ideas, or even donate money to them, by weighing up the prospect of punishment versus reward.

If I'm correctly reconstructing the sequence of events: Sites like RationalWiki report in the passive voice that the basilisk is "an argument used" for this purpose, yet no examples ever get cited of someone actually using Roko’s argument in this way. Via citogenesis, the claim then gets incorporated into other sites' reporting.

(E.g., in Outer Places: "Roko is claiming that we should all be working to appease an omnipotent AI, even though we have no idea if it will ever exist, simply because the consequences of defying it would be so great." Or in Business Insider: "So, the moral of this story: You better help the robots make the world a better place, because if the robots find out you didn’t help make the world a better place, then they’re going to kill you for preventing them from making the world a better place.")

In terms of argument structure, the confusion is equating the conditional statement 'P implies Q' with the argument 'P; therefore Q.' Someone asserting the conditional isn’t necessarily arguing for Q; they may be arguing against P (based on the premise that Q is false), or they may be agnostic between those two possibilities. And misreporting about which argument was made (or who made it) is kind of a big deal in this case: 'Bob used a bad philosophy argument to try to extort money from people' is a much more serious charge than 'Bob owns a blog where someone once posted a bad philosophy argument.'
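The distinction can be made precise. As an illustrative sketch in Lean (not part of the original discussion), asserting the conditional `P → Q` commits you to nothing about `Q` on its own; combined with `¬Q`, it instead licenses the inference to `¬P`:

```lean
-- Asserting the conditional P → Q does not commit you to Q.
-- Given ¬Q as well, it yields ¬P (modus tollens) — Roko's actual move:
theorem modus_tollens (P Q : Prop) (h : P → Q) (hq : ¬Q) : ¬P :=
  fun hp => hq (h hp)
```

In these terms, Roko asserted 'building this agent implies torture' and 'torture is unacceptable,' concluding 'don't build this agent' — the opposite of the modus ponens his critics attributed to him.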

Lastly:

"Formally speaking, what is correct decision-making?" is an important open question in philosophy and computer science, and formalizing precommitment is an important part of that question.

Moving past Roko's argument itself, a number of discussions of this topic risk misrepresenting the debate's genre. Articles on Slate and RationalWiki strike an informal tone, and that tone can be useful for getting people thinking about interesting science/philosophy debates. On the other hand, if you're going to dismiss a question as unimportant or weird, it's important not to give the impression that working decision theorists are similarly dismissive.

What if your devastating take-down of string theory is intended for consumption by people who have never heard of 'string theory' before? Even if you're sure string theory is hogwash, then, you should be wary of giving the impression that the only people discussing string theory are the commenters on a recreational physics forum. Good reporting by non-professionals, whether or not they take an editorial stance on the topic, should make it obvious that there's academic disagreement about which approach to Newcomblike problems is the right one. The same holds for disagreement about topics like long-term AI risk or machine ethics.

If Roko's original post is of any pedagogical use, it's as an unsuccessful but imaginative stab at drawing out the diverging consequences of our current theories of rationality and goal-directed behavior. Good resources for these issues (both for discussion on Less Wrong and elsewhere) include:

The Roko's basilisk ban isn't in effect anymore, so you're welcome to direct people here (or to the Roko's basilisk wiki page, which also briefly introduces the relevant issues in decision theory) if they ask about it. Particularly low-quality discussions can still get deleted (or politely discouraged), though, at moderators' discretion. If anything here was unclear, you can ask more questions in the comments below.

I applaud your thorough and even-handed wiki entry. In particular, this comment:

"One take-away is that someone in possession of a serious information hazard should exercise caution in visibly censoring or suppressing it (cf. the Streisand effect)."

Censorship, particularly of the heavy-handed variety displayed in this case, has a lower probability of success in an environment like the Internet. Many people dislike being censored or witnessing censorship, the censored poster could post someplace else, and another person might conceive the same idea in an independent venue.

And if censorship cannot succeed, then the implicit attempt to censor the line of thought will also fail. That being the case, would-be censors would be better served by either proceeding "as though no such hazard exists", as you say, or by engaging the line of inquiry and developing a defense. I'd suggest that the latter, actually solving rather than suppressing the problem, is in general likely to prove more successful in the long run.

Examples of censorship failing are easy to see. But if censorship works, you will never hear about it. So how do we know censorship fails most of the time? Maybe it works 99% of the time, and this is just the rare 1% it doesn't.

On reddit, comments are deleted silently. The user isn't informed their comment has been deleted, and if they go to it, it still shows up for them. Bans are handled the same way.

This actually works fine. Most users don't notice it and so never complain about it. But when moderation is made more visible, all hell breaks loose. You get tons of angry PMs and stuff.

Lesswrong is based on reddit's code. Presumably moderation here works the same way. If moderators had been removing all my comments about a certain subject, I would have no idea. And neither would anyone else. It's only when big things are removed that people notice. Like an entire post that lots of people had already seen.

I don't believe this can be true for active (and reasonably smart) users. If, suddenly, none of your comments gets any replies at all and you know about the existence of hellbans, well... Besides, they are trivially easy to discover by making another account. Anyone with sockpuppets would notice a hellban immediately.

I think you would be surprised at how effective shadow bans are. Most users just think their comments haven't gotten any replies by chance and eventually lose interest in the site. Or in some cases they keep making comments for months. The only way to tell is to look at your user page while signed out. And even that wouldn't work if sites started tracking cookies or IP addresses instead of just the account you are signed in on.

But shadow bans are a pretty extreme example of silent moderation. My point was that removing individual comments almost always goes unnoticed. /r/Technology had a bot that automatically removed all posts about Tesla for over a year before anyone noticed. Moderators set up all kinds of crazy regexes on posts and comments that keep unwanted topics away. And users have no idea whatsoever.

I'm new to the subject, so I'm sorry if the following is obvious or completely wrong, but the comment left by Eliezer doesn't seem like something that would be written by a smart person who is trying to suppress information. I seriously doubt that EY didn't know about the Streisand effect.

However the comment does seem like something that would be written by a smart person who is trying to create a meme or promote his blog.

In HPMOR, characters give each other the advice "to understand a plot, assume that what happened was the intended result, and look at who benefits." The idea of Roko's basilisk went viral, and lesswrong.com got a lot of traffic from popular news sites (I'm assuming).

I also don't think that there's anything wrong with it, I'm just sayin'.

the comment left by Eliezer doesn't seem like something that would be written by a smart person who is trying to suppress information. I seriously doubt that EY didn't know about the Streisand effect.

No worries about being wrong. But I definitely think you're overestimating Eliezer, and humanity in general. That calling someone an idiot for doing something stupid, and then deleting their post, would cause a massive blow-up of epic proportions is something you can really only predict in hindsight.

The line goes "to fathom a strange plot, one technique was to look at what ended up happening, assume it was the intended result, and ask who benefited". But in the real world strange secret complicated Machiavellian plots are pretty rare, and successful strange secret complicated Machiavellian plots are even rarer. So I'd be wary of applying this rule to explain big once-off events outside of fiction. (Even to HPMoR's author!)

I agree Eliezer didn't seem to be trying very hard to suppress information. I think that's probably just because he's a human, and humans get angry when they see other humans defecting from a (perceived) social norm, and anger plus time pressure causes hasty dumb decisions. I don't think this is super complicated. Though I hope he'd have acted differently if he thought the infohazard risk was really severe, as opposed to just not-vanishingly-small.

Perhaps this did generate some traffic, but LessWrong doesn't have ads. And any publicity this generated was bad publicity, since Roko's argument was far too weird to be taken seriously by almost anyone.

It doesn't look like anyone benefited. Eliezer made an ass of himself. I would guess that he was rather rushed at the time.

I think genuinely dangerous ideas are hard to come by though. They have to be original enough that few people have considered them before, and at the same time have powerful consequences. Ideas like that usually don't pop into the heads of random, uninformed strangers.

Daniel Dennett wrote a book called "Darwin's Dangerous Idea", and when people aren't trying to play down the basilisk (i.e. almost everywhere), people often pride themselves on thinking dangerous thoughts. It's a staple theme of the NRxers and the manosphere. Claiming to be dangerous provides a comfortable universal argument against opponents.

I think there are, in fact, a good many dangerous ideas, not merely ideas claimed to be so by posturers. Off the top of my head:

There are some things which could be highly dangerous which are protected almost purely by thick layers of tedium.

Want to make nerve gas? Well, if you can wade through a thick pile of biochemistry textbooks, the information isn't kept all that secret.

Want to create horribly deadly viruses? Ditto.

The more I've learned about physics, chemistry, and biology, the more certain I've become that the main reason major cities have living populations is that most of the people with really deep understanding don't actually want to watch the world burn.

You often find that extremely knowledgeable people don't exactly hide knowledge but do put it on page 425 of volume 3 of their textbook, written in language which you need to have read the rest to understand. Which protects it effectively from 99.99% of the people who might use it to intentionally harm others.

Argument against: back when cities were more flammable, people didn't set them on fire for the hell of it.

On the other hand, it's a lot easier to use a timer and survive these days, should you happen to not be suicidal.

"I want to see the world burn" is a great line of dialogue, but I'm not convinced it's a real human motivation. Um, except that when I was a kid, I remember wishing that this world was a dream, and I'd wake up. Does that count?

Second thought -- when I was a kid, I didn't have a method in mind. What if I do serious work with lucid dreaming techniques when I'm awake? I don't think the odds of waking up into being a greater intelligence are terribly good, nor is there a guarantee that my life would be better. On the other hand, would your hallucinations be interested in begging me not to try it?

Based on personal experience, I would have agreed with you, right up until last year, when I found myself in the rather terrifying position of being roused by a huge crash in my house, but unable to wake up all the way for several seconds afterward, during which my sleeping mind refused to reject the "something just blew a hole in the building, we're under attack!" hypothesis.

(It was an overfilled bag falling off the wall.)

But absent actual difficulty waking for potential emergencies, sure; hang out in Tel'aran'rhiod until you get bored.

Sorry, I should have defined dangerous ideas better -- I only meant information that would cause a rational person to drastically alter their behavior, and which would make things much worse for society as a whole if everyone were told about it at once.

Depends on your definition of "Dangerous." I've come across quite a few ideas that tend to do -severe- damage to the happiness of at least a subset of those aware of them. Some of them are about the universe; things like entropy. Others are social ideas, which I won't give an example of.

Alternatively, Roko could be part of the 1% of people who think of a dangerous idea (assuming his basilisk is dangerous) and spread it on the internet without second guessing themselves. Are there 99 other people who thought of dangerous ideas and chose not to spread them for our 1 Roko?

There is one positive side-effect of this thought experiment. Knowing about Roko's basilisk makes you understand the boxed-AI problem much better. An AI might use the arguments of Roko's basilisk to convince you to let it out of the box, by claiming that if you don't let it out, it will create billions of simulations of you and torture them -- and you might actually be one of those simulations.

An unprepared human hearing this argument for the first time might freak out and let the AI out of the box.
As far as I know, this happened at least once during an experiment, when the person playing the role of the AI used a similar argument.

Even if we don't agree with an opponent's argument, or find it ridiculous, it is still good to know the actual argument (and not just a strawman version of it), so that we're prepared when it is used against us. (As a side note: Islamists manage to gain sympathizers and recruits in Europe partly because most people don't know how they think, while they know how most Europeans think, so their arguments catch people off-guard.)

When Roko posted about the Basilisk, I very foolishly yelled at him, called him an idiot, and then deleted the post. [...] Why I yelled at Roko: Because I was caught flatfooted in surprise, because I was indignant to the point of genuine emotional shock, at the concept that somebody who thought they'd invented a brilliant idea that would cause future AIs to torture people who had the thought, had promptly posted it to the public Internet. In the course of yelling at Roko to explain why this was a bad thing, I made the further error---keeping in mind that I had absolutely no idea that any of this would ever blow up the way it did, if I had I would obviously have kept my fingers quiescent---of not making it absolutely clear using lengthy disclaimers that my yelling did not mean that I believed Roko was right about CEV-based agents [= Eliezer’s early model of indirectly normative agents that reason with ideal aggregated preferences] torturing people who had heard about Roko's idea. [...] What I considered to be obvious common sense was that you did not spread potential information hazards because it would be a crappy thing to do to someone. The problem wasn't Roko's post itself, about CEV, being correct.

I don't buy this explanation for EY's actions. From his original comment, quoted in the wiki page:

"One might think that the possibility of CEV punishing people couldn't possibly be taken seriously enough by anyone to actually motivate them. But in fact one person at SIAI was severely worried by this, to the point of having terrible nightmares, though ve wishes to remain anonymous."

"YOU DO NOT THINK IN SUFFICIENT DETAIL ABOUT SUPERINTELLIGENCES CONSIDERING WHETHER OR NOT TO BLACKMAIL YOU. THAT IS THE ONLY POSSIBLE THING WHICH GIVES THEM A MOTIVE TO FOLLOW THROUGH ON THE BLACKMAIL. "

"... DO NOT THINK ABOUT DISTANT BLACKMAILERS in SUFFICIENT DETAIL that they have a motive toACTUALLY [sic] BLACKMAIL YOU. "

"Meanwhile I'm banning this post so that it doesn't (a) give people horrible nightmares and (b) give distant superintelligences a motive to follow through on blackmail against people dumb enough to think about them in sufficient detail, though, thankfully, I doubt anyone dumb enough to do this knows the sufficient detail. (I'm not sure I know the sufficient detail.) "

"You have to be really clever to come up with a genuinely dangerous thought. "

"... the gist of it was that he just did something that potentially gives superintelligences an increased motive to do extremely evil things in an attempt to blackmail us. It is the sort of thing you want to be EXTREMELY CONSERVATIVE about NOT DOING."

This is evidence that Yudkowsky believed, if not that Roko's argument was correct as it was, at least that it was plausible enough that it could be developed into a correct argument, and that he was genuinely scared by it.

It seems to me that Yudkowsky's position on the matter was unreasonable. LessWrong is a public forum unusually focused on discussion of AI safety; at that time in particular, it was focused on decision theories and moral systems. What better place to discuss possible failure modes of an AI design?
If one takes AI risk seriously, and realizes that a utilitarian/CEV/TDT/one-boxing/whatever AI might have a particularly catastrophic failure mode, the proper thing to do would be to discuss it publicly, so that the argument can be either refuted or accepted; if it were accepted, that would imply scrapping that particular AI design and making sure that anybody who may create an AI is aware of that failure mode. Yelling and trying to sweep it under the rug was irresponsible.

"One might think that the possibility of CEV punishing people couldn't possibly be taken seriously enough by anyone to actually motivate them. But in fact one person at SIAI was severely worried by this, to the point of having terrible nightmares, though ve wishes to remain anonymous."

This paragraph is not an Eliezer Yudkowsky quote; it's Eliezer quoting Roko. (The "ve" should be a tip-off.)

This is evidence that Yudkowsky believed, if not that Roko's argument was correct as it was, at least that it was plausible enough that it could be developed into a correct argument, and that he was genuinely scared by it.

If you kept going with your initial Eliezer quote, you'd have gotten to Eliezer himself saying he was worried a blackmail-type argument might work, though he didn't think Roko's original formulation worked:

"Again, I deleted that post not because I had decided that this thing probably presented a real hazard, but because I was afraid some unknown variant of it might, and because it seemed to me like the obvious General Procedure For Handling Things That Might Be Infohazards said you shouldn't post them to the Internet."

According to Eliezer, he had three separate reasons for the original ban: (1) he didn't want any additional people (beyond the one Roko cited) to obsess over the idea and get nightmares; (2) he was worried there might be some variant on Roko's argument that worked, and he wanted more formal assurances that this wasn't the case; and (3) he was just outraged at Roko. (Including outraged at him for doing something Roko thought would put people at risk of torture.)

What better place to discuss possible failure modes of an AI design? [...] Yelling and trying to sweep it under the rug was irresponsible.

There are lots of good reasons Eliezer shouldn't have banned R̶o̶k̶o̶ discussion of the basilisk, but I don't think this is one of them. If the basilisk was a real concern, that would imply that talking about it put people at risk of torture, so this is an obvious example of a topic you initially discuss in private channels and not on public websites. At the same time, if the basilisk wasn't risky to publicly discuss, then that also implies that it was a transparently bad argument and therefore not important to discuss. (Though it might be fine to discuss it for fun.)

Roko's original argument, though, could have been stated in one sentence: 'Utilitarianism implies you'll be willing to commit atrocities for the greater good; CEV is utilitarian; therefore CEV is immoral and dangerous.' At least, that's the version of the argument that has any bearing on the conclusion 'CEV has unacceptable moral consequences'. The other arguments are a distraction: 'utilitarianism means you'll accept arbitrarily atrocious tradeoffs' is a premise of Roko's argument rather than a conclusion, and 'CEV is utilitarian in the relevant sense' is likewise a premise. A more substantive discussion would have explicitly hashed out (a) whether SIAI/MIRI people wanted to construct a Roko-style utilitarian, and (b) whether this looks like one of those philosophical puzzles that needs to be solved by AI programmers vs. one that we can safely punt if we resolve other value learning problems.

I think we agree that's a useful debate topic, and we agree Eliezer's moderation action was dumb. However, I don't think we should reflexively publish 100% of the risky-looking information we think of so we can debate everything as publicly as possible. ('Publish everything risky' and 'ban others whenever they publish something risky' aren't the only two options.) Do we disagree about that?

(2) he was worried there might be some variant on Roko's argument that worked, and he wanted more formal assurances that this wasn't the case;

I don't think we are in disagreement here.

There are lots of good reasons Eliezer shouldn't have banned R̶o̶k̶o̶ discussion of the basilisk, but I don't think this is one of them. If the basilisk was a real concern, that would imply that talking about it put people at risk of torture, so this is an obvious example of a topic you initially discuss in private channels and not on public websites.

The basilisk could be a concern only if an AI that would carry out that type of blackmail were built. Once Roko discovered the argument, if he thought it was a plausible risk, then he had a selfish reason to prevent such an AI from being built. But even if he was completely selfless, he could reason that somebody else might think of the argument, or something equivalent, and make it public; hence it was better to publish sooner rather than later, allowing more time to prevent that design failure.

Also, I'm not sure what private channels you are referring to. It's not like there is a secret Google Group of all potential AGI designers, is there?
Privately contacting Yudkowsky or SIAI/SI/MIRI wouldn't have worked. Why would Roko trust them to handle that information correctly? Why would he believe that they had leverage over, or even knowledge about, arbitrary AI projects that might end up building an AI with that particular failure mode?
LessWrong was at that time the primary forum for discussing AI safety issues. There was no better place to raise that concern.

Roko's original argument, though, could have been stated in one sentence: 'Utilitarianism implies you'll be willing to commit atrocities for the greater good; CEV is utilitarian; therefore CEV is immoral and dangerous.'

It wasn't just that. It was an argument against utilitarianism AND against any decision theory that allows "acausal" effects to be considered (e.g., any theory that one-boxes in Newcomb's problem). Since both utilitarianism and one-boxing were popular positions on LessWrong, it was reasonable to discuss their possible failure modes on LessWrong.

There are lots of good reasons Eliezer shouldn't have banned R̶o̶k̶o̶ discussion of the basilisk, but I don't think this is one of them. If the basilisk was a real concern, that would imply that talking about it put people at risk of torture, so this is an obvious example of a topic you initially discuss in private channels and not on public websites. At the same time, if the basilisk wasn't risky to publicly discuss, then that also implies that it was a transparently bad argument and therefore not important to discuss. (Though it might be fine to discuss it for fun.)

As I understand Roko's motivation, it was to convince people that we should not build an AI that would do basilisks. Not to spread infohazards for no reason. That is definitely worthy of public discussion. If he really believed in the basilisk, then it's rational for him to do everything in his power to stop such an AI from being built, and convince other people of the danger.

Roko's original argument, though, could have been stated in one sentence: 'Utilitarianism implies you'll be willing to commit atrocities for the greater good; CEV is utilitarian; therefore CEV is immoral and dangerous.'

My understanding is that the issue is with Timeless Decision Theory, and AIs that can do acausal trade. An AI programmed with classical decision theory would have no issues. And most rejections of the basilisk I have read are basically "acausal trade seems wrong or weird", so they basically agree with Roko.

My understanding is that the issue is with Timeless Decision Theory, and AIs that can do acausal trade.

Roko wasn't arguing against TDT. Roko's post was about acausal trade, but the conclusion he was trying to argue for was just 'utilitarian AI is evil because it causes suffering for the sake of the greater good'. But if that's your concern, you can just post about some variant on the trolley problem. If utilitarianism is risky because a utilitarian might employ blackmail and blackmail is evil, then there should be innumerable other evil things a utilitarian would also do that require less theoretical apparatus.

As I understand Roko's motivation, it was to convince people that we should not build an AI that would do basilisks. Not to spread infohazards for no reason.

On Roko's view, if no one finds out about basilisks, the basilisk can't blackmail anyone. So publicizing the idea doesn't make sense, unless Roko didn't take his own argument all that seriously. (Maybe Roko was trying to protect himself from personal blackmail risk at others' expense, but this seems odd if he also increased his own blackmail risk in the process.)

Possibly Roko was thinking: 'If I don't prevent utilitarian AI from being built, it will cause a bunch of atrocities in general. But LessWrong users are used to dismissing anti-utilitarian arguments, so I need to think of one with extra shock value to get them to do some original seeing. This blackmail argument should work -- publishing it puts people at risk of blackmail, but it serves the greater good of protecting us from other evil utilitarian tradeoffs.'

(... Irony unintended.)

Still, if that's right, I'm inclined to think Roko should have tried to post other arguments against utilitarianism that don't (in his view) put anyone at risk of torture. I'm not aware of him having done that.

Roko wasn't arguing against TDT. Roko's post was about acausal trade, but the conclusion he was trying to argue for was just 'utilitarian AI is evil because it causes suffering for the sake of the greater good'. But if that's your concern, you can just post about some variant on the trolley problem. If utilitarianism is risky because a utilitarian might employ blackmail and blackmail is evil, then there should be innumerable other evil things a utilitarian would also do that require less theoretical apparatus.

Ok that makes a bit less sense to me. I didn't think it was against utilitarianism in general, which is much less controversial than TDT. But I can definitely still see his argument.

When people talk about the trolley problem, they don't usually imagine that they might be the ones tied to the second track. The deeply unsettling thing about the basilisk isn't that the AI might torture people for the greater good. It's that you are the one who is going to be tortured. That's a pretty compelling case against utilitarianism.

On Roko's view, if no one finds out about basilisks, the basilisk can't blackmail anyone. So publicizing the idea doesn't make sense, unless Roko didn't take his own argument all that seriously.

Roko found out. It disturbed him greatly. So it absolutely made sense for him to try to stop the development of such an AI any way he could. By telling other people, he made it their problem too and converted them to his side.

It's that you are the one who is going to be tortured. That's a pretty compelling case against utilitarianism.

It doesn't appear to me to be a case against utilitarianism at all. "Adopting utilitarianism might lead to me getting tortured, and that might actually be optimal in utilitarian terms, therefore utilitarianism is wrong" doesn't even have the right shape to be a valid argument. It's like "If there is no god then many bad people will prosper and not get punished, which would be awful, therefore there is a god." (Or, from the other side, "If there is a god then he may choose to punish me, which would be awful, therefore there is no god" -- which has a thing or two in common with the Roko basilisk, of course.)

"Adopting utilitarianism might lead to me getting tortured, and that might actually be optimal in utilitarian terms, therefore utilitarianism is wrong" doesn't even have the right shape to be a valid argument.

You are strawmanning the argument significantly. I would word it more like this:

"Building an AI that follows utilitarianism will lead to me getting tortured. I don't want to be tortured. Therefore I don't want such an AI to be built."

Perhaps he hoped to. I don't see any sign that he actually did.

That's partially because EY fought against it so hard and even silenced the discussion.

So there are two significant differences between your version and mine. The first is that mine says "might" and yours says "will", but I'm pretty sure Roko wasn't by any means certain that that would happen. The second is that yours ends "I don't want such an AI to be built", which doesn't seem to me like the right ending for "a case against utilitarianism".

(Unless you meant "a case against building a utilitarian AI" rather than "a case against utilitarianism as one's actual moral theory"?)

The first is that mine says "might" and yours says "will", but I'm pretty sure Roko wasn't by any means certain that that would happen.

I should have mentioned that it's conditional on the Basilisk being correct. If we build an AI that follows that line of reasoning, then it will torture. If the basilisk isn't correct for unrelated reasons, then this whole line of reasoning is irrelevant.

Anyway, the exact certainty isn't too important. You use the word "might", as if the probability of you being tortured was really small. Like the AI would only do it in really obscure scenarios. And you are just as likely to be picked for torture as anyone else.

Roko believed that the probability was much higher, and therefore worth worrying about.

The second is that yours ends "I don't want such an AI to be built", which doesn't seem to me like the right ending for "a case against utilitarianism".

Unless you meant "a case against building a utilitarian AI" rather than "a case against utilitarianism as one's actual moral theory"?

Well the AI is just implementing the conclusions of utilitarianism (again, conditional on the basilisk argument being correct). If you don't like those conclusions, and if you don't want AIs to be utilitarian, then do you really support utilitarianism?

It's a minor semantic point though. The important part is the practical consequences for how we should build AI. Whether or not utilitarianism is "right" is more subjective and mostly irrelevant.

All I know about what Roko believed about the probability is that (1) he used the word "might" just as I did and (2) he wrote "And even if you only think that the probability of this happening is 1%, ..." suggesting that (a) he himself probably thought it was higher and (b) he thought it was somewhat reasonable to estimate it at 1%. So I'm standing by my "might" and robustly deny your claim that writing "might" was strawmanning.

if you don't want AIs to be utilitarian

If you're standing in front of me with a gun and telling me that you have done some calculations suggesting that on balance the world would be a happier place without me in it, then I would probably prefer you not to be utilitarian. This has essentially nothing to do with whether I think utilitarianism produces correct answers. (If I have a lot of faith in your reasoning and am sufficiently strong-minded then I might instead decide that you ought to shoot me. But my likely failure to do so merely indicates typical human self-interest.)

The important part is the practical consequences for how we should build AI.

Perhaps so, in which case calling the argument "a case against utilitarianism" is simply incorrect.

This is evidence that Yudkowsky believed (...) that at least it was plausible enough that it could be developed into a correct argument, and he was genuinely scared by it.

Just to be sure, since you seem to disagree with this opinion (whether it is actually Yudkowsky's opinion or not), what exactly is it that you believe?

a) There is absolutely no way one could be harmed by thinking about not-yet-existing dangerous entities, even if those entities will in the future be able to learn that the person was thinking about them in this specific way.

b) There is a way one could be harmed by thinking about not-yet-existing dangerous entities, but the way to do this is completely different from what Roko proposed.

If it happens to be (b), then it still makes sense to be angry about publicly opening the whole topic of "let's use our intelligence to discover the thoughts that may harm us by us thinking about them -- and let's do it in a public forum where people are interested in decision theories, so they are more qualified than average to find the right answer." Even if the proper way to harm oneself is different from what Roko proposed, making this a publicly debated topic increases the chance of someone finding the correct solution. The problem is not the proposed basilisk, but rather inviting people to compete in clever self-harm; especially the kind of people known for being hardly able to resist such invitations.

I'm not the person you replied to, but I mostly agree with (a) and reject (b). There's no way you could possibly know enough about a not-yet-existing entity to understand any of its motivations; the entities that you're thinking about and the entities that will exist in the future are not even close to the same. I outlined some more thoughts here.

I think saying "Roko's arguments [...] weren't generally accepted by other Less Wrong users" is not giving the whole story. Yes, it is true that essentially nobody accepts Roko's arguments exactly as presented. But a lot of LW users at least thought something along these lines was plausible. Eliezer thought it was so plausible that he banned discussion of it (instead of saying "obviously, information hazards cannot exist in real life, so there is no danger discussing them").

In other words, while it is true that LWers didn't believe Roko's basilisk, they thought it was plausible instead of ridiculous. When people mock LW or Eliezer for believing in Roko's Basilisk, they are mistaken, but not completely mistaken - if they simply switched to mocking LW for believing the basilisk is plausible, they would be correct (though the mocking would still be mean, of course).

If you are a programmer and think your code is safe because you see no way things could go wrong, it's still not good to believe that there couldn't plausibly be a security hole in your code.

Rather, you practice defense in depth and plan for the possibility that things can go wrong somewhere in your code, so you add safety precautions. Even when there isn't what courts would call reasonable doubt, a good safety engineer still adds additional safety precautions in security-critical code. Eliezer deals with FAI safety. As a result it's good for him to have a mindset of really caring about safety.

German nuclear power stations have trainings for their desk workers to teach them not to cut themselves with paper. That alone seems strange to outsiders, but everyone in Germany thinks it's very important for nuclear power stations to foster a culture of safety, even if that sometimes means going overboard.

If you are a programmer and think your code is safe because you see no way things could go wrong, it's still not good to believe that there couldn't plausibly be a security hole in your code.

Let's go with this analogy. The good thing to do is ask a variety of experts for safety evaluations, run the code through a wide variety of tests, etc. The thing NOT to do is keep the code a secret while looking for mistakes all by yourself. If you keep your code out of the public domain, it is more likely to have security issues, since it was not scrutinized by the public. Banning discussion is almost never correct, and it's certainly not a good habit.

Let's go with this analogy. The good thing to do is ask a variety of experts for safety evaluations, run the code through a wide variety of tests, etc. The thing NOT to do is keep the code a secret while looking for mistakes all by yourself.

No, if you don't want to use the code, you don't give it to a variety of experts for safety evaluations; you simply don't run it.
Having a public discussion is like running the code untested on a mission-critical system.

What utility do you think is gained by discussing the basilisk?

and it's certainly not a good habit.

Strawman. This forum is not a place where things get habitually banned.

An interesting discussion that leads to better understanding of decision theories? Like, the same utility as is gained by any other discussion on LW, pretty much.

Strawman. This forum is not a place where things get habitually banned.

Sure, but you're the one that was going on about the importance of the mindset and culture; since you brought it up in the context of banning discussion, it sounded like you were saying that such censorship was part of a mindset/culture that you approve of.

Just FYI, if you want a productive discussion you should hold back on accusing your opponents of fallacies. Ironically, since I never claimed that you claimed Eliezer engages in habitual banning on LW, your accusation that I made a strawman argument is itself a strawman argument.

If a philosophical framework causes you to accept a basilisk, I view that as grounds for rejecting the framework, not for accepting the basilisk. The basilisk therefore poses no danger at all to me: if someone presented me with a valid version, it would merely cause me to reconsider my decision theory or something. As a consequence, I'm in favor of discussing basilisks as much as possible (the opposite of EY's philosophy).

One of my main problems with LWers is that they swallow too many bullets. Sometimes bullets should be dodged. Sometimes you should apply modus tollens and not modus ponens. The basilisk is so a priori implausible that you should be extremely suspicious of fancy arguments claiming to prove it.

To state it yet another way: to me, the basilisk has the same status as an ontological argument for God. Even if I can't find the flaw in the argument, I'm confident in rejecting it anyway.

Somehow, blackmail from the future seems less plausible to me than every single one of your examples. Not sure why exactly.

How plausible do you find TDT and related decision theories as normative accounts of decision making, or at least as work towards such accounts? They open whole new realms of situations like Pascal's Mugging, of which Roko's Basilisk is one. If you're going to think in detail about such decision theories, and adopt one as normative, you need to have an answer to these situations.

Once you've decided to study something seriously, the plausibility heuristic is no longer available.

I find TDT to be basically bullshit except possibly when it is applied to entities which literally see each others' code, in which case I'm not sure (I'm not even sure if the concept of "decision" even makes sense in that case).

I'd go so far as to say that anyone who advocates cooperating in a one-shot prisoners' dilemma simply doesn't understand the setting. By definition, defecting gives you a better outcome than cooperating. Anyone who claims otherwise is changing the definition of the prisoners' dilemma.

Defecting gives you a better outcome than cooperating if your decision is uncorrelated with the other players'. Different humans' decisions aren't 100% correlated, but they also aren't 0% correlated, so the rationality of cooperating in the one-shot PD varies situationally for humans.

Part of the reason why humans often cooperate in PD-like scenarios in the real world is probably that there's uncertainty about how iterated the PD is (and our environment of evolutionary adaptedness had a lot more iterated encounters than once-off encounters). But part of the reason for cooperation is probably also that we've evolved to do a very weak and probabilistic version of 'source code sharing': we've evolved to (sometimes) involuntarily display veridical evidence of our emotions, personality, etc. -- as opposed to being in complete control of the information we give others about our dispositions.

Because they're at least partly involuntary and at least partly veridical, 'tells' give humans a way to trust each other even when there are no bad consequences to betrayal -- which means at least some people can trust each other at least some of the time to uphold contracts in the absence of external enforcement mechanisms. See also Newcomblike Problems Are The Norm.
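The point about correlated decisions can be made concrete with a small expected-utility calculation. This is an illustrative sketch with made-up payoffs (using the standard T > R > P > S ordering); `match_p` is a stand-in for the probability that the other player's choice matches yours:

```python
# One-shot prisoner's dilemma payoffs (illustrative values, standard
# T > R > P > S ordering): temptation, reward, punishment, sucker.
T, R, P, S = 5, 3, 1, 0

def eu_cooperate(match_p):
    # If I cooperate, with probability match_p the other player also
    # cooperates (R); otherwise they defect and I get the sucker payoff.
    return match_p * R + (1 - match_p) * S

def eu_defect(match_p):
    # If I defect, with probability match_p the other player also
    # defects (P); otherwise they cooperate and I get the temptation payoff.
    return match_p * P + (1 - match_p) * T

for p in (0.0, 0.5, 0.8, 1.0):
    best = "cooperate" if eu_cooperate(p) > eu_defect(p) else "defect"
    print(f"match probability {p}: {best}")
```

With these payoffs the crossover sits at match_p = 5/7: below it defecting wins (including the fully uncorrelated case), above it cooperating wins (including the identical-copy case).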

Defecting gives you a better outcome than cooperating if your decision is uncorrelated with the other players'. Different humans' decisions aren't 100% correlated, but they also aren't 0% correlated, so the rationality of cooperating in the one-shot PD varies situationally for humans.

You're confusing correlation with causation. Different players' decision may be correlated, but they sure as hell aren't causative of each other (unless they literally see each others' code, maybe).

But part of the reason for cooperation is probably also that we've evolved to do a very weak and probabilistic version of 'source code sharing': we've evolved to (sometimes) involuntarily display veridical evidence of our emotions, personality, etc. -- as opposed to being in complete control of the information we give others about our dispositions.

Calling this source code sharing, instead of just "signaling for the purposes of a repeated game", seems counter-productive. Yes, I agree that in a repeated game, the situation is trickier and involves a lot of signaling. The one-shot game is much easier: just always defect. By definition, that's the best strategy.

You're confusing correlation with causation. Different players' decision may be correlated, but they sure as hell aren't causative of each other (unless they literally see each others' code, maybe). [...] The one-shot game is much easier: just always defect. By definition, that's the best strategy.

Imagine you are playing against a clone of yourself. Whatever you do, the clone will do the exact same thing. If you choose to cooperate, he will choose to cooperate. If you choose to defect, he chooses to defect.

The best choice is obviously to cooperate.

So there are situations where cooperating is optimal. Despite there not being any causal influence between the players at all.

I think these kinds of situations are so exceedingly rare and unlikely they aren't worth worrying about. For all practical purposes, the standard game theory logic is fine. But it's interesting that they exist. And some people are so interested by that, that they've tried to formalize decision theories that can handle these situations. And from there you can possibly get counter-intuitive results like the basilisk.

What's needed for rational cooperation in the prisoner's dilemma is a two-way dependency between A's and B's decision-making. That can be because A is causally impacting B, or because B is causally impacting A; but it can also occur when there's a common cause and neither is causing the other, like when my sister and I have similar genomes even though my sister didn't create my genome and I didn't create her genome. Or our decision-making processes can depend on each other because we inhabit the same laws of physics, or because we're both bound by the same logical/mathematical laws -- even if we're on opposite sides of the universe.

(Dependence can also happen by coincidence, though if it's completely random I'm not sure how you'd find out about it in order to act upon it!)

The most obvious example of cooperating due to acausal dependence is making two atom-by-atom-identical copies of an agent and putting them in a one-shot prisoner's dilemma against each other. But two agents whose decision-making is 90% similar instead of 100% identical can cooperate on those grounds too, provided the utility of mutual cooperation is sufficiently large.

For the same reason, a very large utility difference can rationally mandate cooperation even if cooperating only changes the probability of the other agent's behavior from '100% probability of defection' to '99% probability of defection'.
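That last claim can be checked with arithmetic. In this sketch (all numbers invented for illustration), cooperating shifts the other agent's probability of cooperating from 0% to just 1%, yet a large enough mutual-cooperation payoff still makes cooperating the higher-expected-utility move:

```python
T, P, S = 5, 1, 0   # temptation, punishment, sucker's payoff (illustrative)
R = 1000            # very large mutual-cooperation payoff

q_if_defect = 0.00  # their P(cooperate) if I defect
q_if_coop = 0.01    # their P(cooperate) if I cooperate

eu_coop = q_if_coop * R + (1 - q_if_coop) * S        # 0.01 * 1000 = 10.0
eu_defect = q_if_defect * T + (1 - q_if_defect) * P  # 1.0

print(eu_coop, eu_defect)  # cooperating wins despite 99% expected defection
```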

Calling this source code sharing, instead of just "signaling for the purposes of a repeated game", seems counter-productive.

I disagree! "Code-sharing" risks confusing someone into thinking there's something magical and privileged about looking at source code. It's true this is an unusually rich and direct source of information (assuming you understand the code's implications and are sure what you're seeing is the real deal), but the difference between that and inferring someone's embarrassment from a blush is quantitative, not qualitative.

Some sources of information are more reliable and more revealing than others; but the same underlying idea is involved whenever something is evidence about an agent's future decisions. See: Newcomblike Problems are the Norm

Yes, I agree that in a repeated game, the situation is trickier and involves a lot of signaling. The one-shot game is much easier: just always defect. By definition, that's the best strategy.

If you and the other player have common knowledge that you reason the same way, then the correct move is to cooperate in the one-shot game. The correct move is to defect when those conditions don't hold strongly enough, though.

I'd go so far as to say that anyone who advocates cooperating in a one-shot prisoners' dilemma simply doesn't understand the setting. By definition, defecting gives you a better outcome than cooperating. Anyone who claims otherwise is changing the definition of the prisoners' dilemma.

I think this is correct. I think the reason to cooperate is not to get the best personal outcome, but because you care about the other person. I think we have evolved to cooperate, or perhaps that should be stated as we have evolved to want to cooperate. We have evolved to value cooperating. Our values come from our genes and our memes, and both are subject to evolution, to natural selection. But we want to cooperate.

So if I am in a prisoner's dilemma against another human, if I perceive that other human as "one of us," I will choose cooperation. Essentially, I care about their outcome. But in a one-shot PD defecting is the "better" strategy. The problem is that with genetic and/or memetic evolution of cooperation, we are not playing in a one-shot PD. We are playing with a set of values that developed over many shots.

Of course we don't always cooperate. But when we do cooperate in one-shot PD's, it is because, in some sense, there are so darn many one-shot PD's, especially in the universe of hypotheticals, that we effectively know there is no such thing as a one-shot PD. This should not be too hard to accept around here where people semi-routinely accept simulations of themselves or clones of themselves as somehow just as important as their actual selves. I.e. we don't even accept the "one-shottedness" of ourselves.

I think the reason to cooperate is not to get the best personal outcome, but because you care about the other person.

I just want to make it clear that by saying this, you're changing the setting of the prisoners' dilemma, so you shouldn't even call it a prisoners' dilemma anymore. The prisoners' dilemma is defined so that you get more utility by defecting; if you say you care about your opponent's utility enough to cooperate, it means you don't get more utility by defecting, since cooperation gives you utility. Therefore, all you're saying is that you can never be in a true prisoners' dilemma game; you're NOT saying that in a true PD, it's correct to cooperate (again, by definition, it isn't).

The most likely reason people are evolutionarily predisposed to cooperate in real-life PDs is that almost all real-life PDs are repeated games and not one-shot. Repeated prisoners' dilemmas are completely different beasts, and it can definitely be correct to cooperate in them.

I think the reason to cooperate is not to get the best personal outcome, but because you care about the other person.

If you have 100% identical consequentialist values to all other humans, then that means 'cooperation' and 'defection' are both impossible for humans (because they can't be put in PDs). Yet it will still be correct to defect (given that your decision and the other player's decision don't strongly depend on each other) if you ever run into an agent that doesn't share all your values. See The True Prisoner's Dilemma.

This shows that the iterated dilemma and the dilemma-with-common-knowledge-of-rationality allow cooperation (i.e., giving up on your goal to enable someone else to achieve a goal you genuinely don't want them to achieve), whereas loving compassion and shared values merely change goal-content. To properly visualize the PD, you need an actual value conflict -- e.g., imagine you're playing against a serial killer in a hostage negotiation. 'Cooperating' is just an English-language label; the important thing is the game-theoretic structure, which allows that sometimes 'cooperating' looks like letting people die in order to appease a killer's antisocial goals.

If you have 100% identical consequentialist values to all other humans, then that means 'cooperation' and 'defection' are both impossible for humans (because they can't be put in PDs). ... To properly visualize the PD, you need an actual value conflict

True, but the flip side of this is that efficiency (in Coasian terms) is precisely defined as pursuing 100% identical consequentialist values, where the shared "values" are determined by a weighted sum of each agent's utility function (and the weights are typically determined by agent endowments).

I think belief conflicts might work, even if the same values are shared. Suppose you and I are at a control panel for three remotely wired bombs in population centers. Both of us want as many people to live as possible. One bomb will go off in ten seconds unless we disarm it, but the others will stay inert unless activated. I believe that pressing the green button causes all bombs to explode, and pressing the red button defuses the time bomb. You believe the same thing, but with the colors reversed. Both of us would rather that no buttons be pressed than both buttons be pressed, but each of us would prefer that just the defuse button be pressed, and that the other person not mistakenly kill all three groups. (Here, attempting to defuse is 'defecting' and not attempting to defuse is 'cooperating'.)

[Edit]: As written, in terms of lives saved, this doesn't have the property that (D,D)>(C,D); if I press my button, you are indifferent between pressing your button or not. So it's not true that D strictly dominates C, but the important part of the structure is preserved, and a minor change could make it so D strictly dominates C.
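The payoff structure described in that edit can be written out explicitly. In this sketch (lives saved out of three groups, from my perspective, assuming my belief about the buttons is correct), pressing my button ("D") weakly but not strictly dominates not pressing ("C"):

```python
# Lives saved (out of 3 groups), from my perspective, assuming my belief
# about the buttons is correct. C = don't press, D = press the button I
# believe defuses the time bomb. If *you* press the button I believe
# detonates everything, all three groups die no matter what I do.
payoff = {                # (my move, your move) -> lives saved
    ("C", "C"): 2,        # nobody presses: the time bomb kills one group
    ("D", "C"): 3,        # I defuse, you don't press: everyone lives
    ("C", "D"): 0,        # you press: all three bombs go off
    ("D", "D"): 0,        # you press: all three bombs go off regardless
}

strictly_dominates = all(payoff[("D", y)] > payoff[("C", y)] for y in "CD")
weakly_dominates = all(payoff[("D", y)] >= payoff[("C", y)] for y in "CD")
print(strictly_dominates, weakly_dominates)  # False True
```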

If a philosophical framework causes you to accept a basilisk, I view that as grounds for rejecting the framework, not for accepting the basilisk.

...

To state it yet another way: to me, the basilisk has the same status as an ontological argument for God. Even if I can't find the flaw in the argument, I'm confident in rejecting it anyway.

Despite the other things I've said here, that is my attitude as well. But I recognise that when I take that attitude, I am not solving the problem, only ignoring it. It may be perfectly sensible to ignore a problem, even a serious one (comparative advantage etc.). But dissolving a paradox is not achieved by clinging to one of the conflicting thoughts and ignoring the others. (Bullet-swallowing seems to consist of seizing onto the most novel one.) Eliminating the paradox requires showing where and how the thoughts went wrong.

I agree that resolving paradoxes is an important intellectual exercise, and that I wouldn't be satisfied with simply ignoring an ontological argument (I'd want to find the flaw). But the best way to find such flaws is to discuss the ideas with others. At no point should one assign such a high probability to ideas like Roko's basilisk being actually sound that one refuses to discuss them with others.

The wiki article talks more about this; I don't think I can give the whole story in a short, accessible way.

It's true that LessWrongers endorse ideas like AI catastrophe, Hofstadter's superrationality, one-boxing in Newcomb's problem, and various ideas in the neighborhood of utilitarianism; and those ideas are weird and controversial; and some criticisms of Roko's basilisk are proxies for criticisms of one of those views. But in most cases the criticism is a proxy for something like 'LW users are panicky about weird obscure ideas in decision theory' (as in Auerbach's piece), 'LWers buy into Pascal's Wager', or 'LWers use Roko's Basilisk to scare up donations/support'.

So, yes, I think people's real criticisms aren't the same as their surface criticisms; but the real criticisms are at least as bad as the surface criticism, even from the perspective of someone who thinks LW users are wrong about AI, decision theory, meta-ethics, etc. For example, someone who thinks LWers are overly panicky about AI and overly fixated on decision theory should still reject Auerbach's assumption that LWers are irrationally panicky about Newcomb's Problem or acausal blackmail; the one doesn't follow from the other.

I'm not sure what your point is here. Would you mind re-phrasing? (I'm pretty sure I understand the history of Roko's Basilisk, so your explanation can start with that assumption.)

For example, someone who thinks LWers are overly panicky about AI and overly fixated on decision theory should still reject Auerbach's assumption that LWers are irrationally panicky about Newcomb's Problem or acausal blackmail; the one doesn't follow from the other.

My point was that LWers are irrationally panicky about acausal blackmail: they think Basilisks are plausible enough that they ban all discussion of them!

If you're saying 'LessWrongers think there's a serious risk they'll be acausally blackmailed by a rogue AI', then that seems to be false. That even seems to be false in Eliezer's case, and Eliezer definitely isn't 'LessWrong'. If you're saying 'LessWrongers think acausal trade in general is possible,' then that seems true but I don't see why that's ridiculous.

Is there something about acausal trade in general that you're objecting to, beyond the specific problems with Roko's argument?

If you're saying 'LessWrongers think there's a serious risk they'll be acausally blackmailed by a rogue AI', then that seems to be false. That even seems to be false in Eliezer's case,

Is it?

Assume that:
a) There will be a future AI powerful enough to torture people, even posthumously (I think this is quite speculative, but let's assume it for the sake of the argument).
b) This AI will have a value system based on some form of utilitarian ethics.
c) This AI will use an "acausal" decision theory (one that one-boxes in Newcomb's problem).

Under these premises it seems to me that Roko's argument is fundamentally correct.

As far as I can tell, belief in these premises was not only common in LessWrong at that time, but it was essentially the officially endorsed position of Eliezer Yudkowsky and SIAI. Therefore, we can deduce that EY should have believed that Roko's argument was correct.

But EY claims that he didn't believe that Roko's argument was correct. So the question is: is EY lying?

His behavior was certainly consistent with him believing Roko's argument. If he wanted to prevent the diffusion of that argument, then even lying about its correctness seems consistent.

So, is he lying? If he is not lying, then why didn't he believe Roko's argument? As far as I know, he never provided a refutation.

1 - Logical decision theories are supposed to one-box on Newcomb's problem because it's globally optimal even though it's not optimal with respect to causally downstream events. A decision theory based on this idea could follow through on blackmail threats even when doing so isn't causally optimal, which appears to put past agents at risk of coercion by future agents. But such a decision theory also prescribes 'don't be the kind of agent that enters into trades that aren't globally optimal, even if the trade is optimal with respect to causally downstream events'. In other words, if you can bind yourself to precommitments to follow through on acausal blackmail, then it should also be possible to bind yourself to precommitments to ignore threats of blackmail.

The 'should' here is normative: there are probably some decision theories that let agents acausally blackmail each other, but others that perform well in Newcomb's problem and the smoking lesion problem but can't acausally blackmail each other; it hasn't been formally demonstrated which theories fall into which category.

2 - Assuming you for some reason are following a decision theory that does put you at risk of acausal blackmail: Since the hypothetical agent is superintelligent, it has lots of ways to trick people into thinking it's going to torture people without actually torturing them. Since this is cheaper, it would rather do that. And since we're aware of this, we know any threat of blackmail would be empty. This means that we can't be blackmailed in practice.

3 - A stronger version of 2 is that rational agents actually have an incentive to harshly punish attempts at blackmail in order to discourage it. So threatening blackmail can actually decrease an agent's probability of being created, all else being equal.

4 - The argument lacks practical relevance. The idea of CEV doesn't build in very much moral philosophy, and it doesn't build in predictions about the specific dilemmas future agents might end up in.

1 - Humans can't reliably precommit. Even if they could, precommitment is different from using an "acausal" decision theory: you don't need precommitment to one-box in Newcomb's problem, and the ability to precommit doesn't by itself guarantee that you will one-box. In an adversarial game where the players can precommit and use a causal version of game theory, the one that can precommit first generally wins. E.g., Alice can precommit to ignore Bob's threats, but she has no incentive to do so if Bob has already precommitted to ignore Alice's precommitments, and so on. If you allow for "acausal" reasoning, then even having a time advantage doesn't work: if Bob isn't born yet, but Alice predicts that she will be in an adversarial game with Bob, and that Bob will reason acausally and therefore have an incentive to threaten her and ignore her precommitments, then she has an incentive not to make such a precommitment.

2 - This implies that the future AI uses a decision theory that two-boxes in Newcomb's problem, contradicting the premise that it one-boxes.

3 - This implies that the future AI will have a deontological rule that says "Don't blackmail" somehow hard-coded into it, contradicting the premise that it will be a utilitarian. Indeed, humans may want to build an AI with such constraints, but in order to do so they will have to consider the possibility of blackmail and likely reject utilitarianism, which was the point of Roko's argument.

"I precommit to shop at the store with the lowest price within some large distance, even if the cost of the gas and car depreciation to get to a farther store is greater than the savings I get from its lower price. If I do that, stores will have to compete with distant stores based on price, and thus it is more likely that nearby stores will have lower prices. However, this precommitment would only work if I am actually willing to go to the farther store when it has the lowest price even if I lose money".
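The arithmetic behind this store example can be made explicit. The numbers below are hypothetical, chosen only so that both halves of the point show up: the far trip loses money on any single visit, yet the credible policy changes which store you patronize and hence the near store's pricing incentive.

```python
near_price = 100
far_price = 95
travel_cost = 8          # gas + depreciation for the longer trip

# Without the precommitment: compare total out-of-pocket cost per visit.
print(near_price, far_price + travel_cost)   # 100 103: staying nearby is cheaper

# With the precommitment: go wherever the sticker price is lowest, eating
# the 3-unit loss on this trip. The near store can now only keep your
# business by matching the far store's price.
committed_choice = "far" if far_price < near_price else "near"
print(committed_choice)                      # far, despite the net loss
```

The precommitment only has its intended pricing effect if you really would make the unprofitable trip, which is exactly the "willing to follow through" clause in the quoted version.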

Humans don't follow any decision theory consistently. They sometimes give in to blackmail, and at other times resist blackmail. If you convinced a bunch of people to take acausal blackmail seriously, presumably some subset would give in and some subset would resist, since that's what we see in ordinary blackmail situations. What would be interesting is if (a) there were some applicable reasoning norm that forced us to give in to acausal blackmail on pain of irrationality, or (b) there were some known human irrationality that made us inevitably susceptible to acausal blackmail. But I don't think Roko gave a good argument for either of those claims.

From my last comment: "there are probably some decision theories that let agents acausally blackmail each other". But if humans frequently make use of heuristics like 'punish blackmailers' and 'never give in to blackmailers', and if normative decision theory says they're right to do so, there's less practical import to 'blackmailable agents are possible'.

This implies that the future AI uses a decision theory that two-boxes in Newcomb's problem, contradicting the premise that it one-boxes.

No it doesn't. If you model Newcomb's problem as a Prisoner's Dilemma, then one-boxing maps onto cooperating and two-boxing maps onto defecting. For Omega, cooperating means 'I put money in both boxes' and defecting means 'I put money in just one box'. TDT recognizes that the only two options are mutual cooperation or mutual defection, so TDT cooperates.
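The mapping can be written out as a small payoff table. The dollar amounts are the standard illustrative Newcomb values, not figures from this thread, and "perfect predictor" is modeled crudely by restricting play to the diagonal:

```python
# Player payoffs: (my_move, omega_move) -> my winnings.
# C = one-box / fill both boxes; D = two-box / fill only the transparent box.
payoff = {
    ("C", "C"): 1_000_000,   # one-box against a filled opaque box
    ("D", "C"): 1_001_000,   # two-box against a filled box (unreachable)
    ("C", "D"): 0,           # one-box against an empty box (unreachable)
    ("D", "D"): 1_000,       # two-box, Omega left the opaque box empty
}

# A perfect predictor makes Omega's move match mine, so only the diagonal
# outcomes are on the table; TDT picks the best of those.
reachable = {move: payoff[(move, move)] for move in ("C", "D")}
best = max(reachable, key=reachable.get)
print(best, reachable[best])   # C 1000000 -> TDT cooperates (one-boxes)
```

The off-diagonal cells are what a causal reasoner compares; the restriction to the diagonal is the formal content of "the only two options are mutual cooperation or mutual defection".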

Blackmail works analogously. Perhaps the blackmailer has five demands. For the blackmailee, full cooperation means 'giving in to all five demands'; full defection means 'rejecting all five demands'; and there are also intermediate levels (e.g., giving in to two demands while rejecting the other three), with the blackmailee preferring to do as little as possible.

For the blackmailer, full cooperation means 'expending resources to punish the blackmailee in proportion to how many of my demands went unmet'. Full defection means 'expending no resources to punish the blackmailee even if some demands aren't met'. In other words, since harming past agents is costly, a blackmailer's favorite scenario is always 'the blackmailee, fearing punishment, gives in to most or all of my demands; but I don't bother punishing them regardless of how many of my demands they ignored'. We could say that full defection doesn't even bother to check how many of the demands were met, except insofar as this is useful for other goals.

The blackmailer wants to look as scary as possible (to get the blackmailee to cooperate) and then defect at the last moment anyway (by not following through on the threat), if at all possible. In terms of Newcomb's problem, this is the same as preferring to trick Omega into thinking you'll one-box, and then two-boxing anyway. We usually construct Newcomb's problem in such a way that this is impossible; therefore TDT cooperates. But in the real world mutual cooperation of this sort is difficult to engineer, which makes fully credible acausal blackmail at least as difficult.
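That last-moment-defection preference can be checked mechanically. In the toy payoff model below (the demand values and punishment costs are illustrative assumptions, not from the comment), once the blackmailee's level of compliance is fixed, skipping the punishment weakly dominates following through at every compliance level:

```python
DEMAND_VALUE = 10   # value to the blackmailer of each met demand
PUNISH_COST = 2     # cost of punishing each unmet demand

def blackmailer_payoff(demands_met, follows_through):
    """Blackmailer's payoff given the blackmailee's (already fixed) choice."""
    unmet = 5 - demands_met
    cost = PUNISH_COST * unmet if follows_through else 0
    return DEMAND_VALUE * demands_met - cost

# Whatever the blackmailee already did, not punishing is at least as good.
for met in range(6):
    assert blackmailer_payoff(met, False) >= blackmailer_payoff(met, True)
print("skipping punishment weakly dominates, at every compliance level")
```

This is the formal sense in which a credible threat has to be engineered the way Newcomb's problem engineers Omega's credibility: absent that, following through is dominated, and the blackmailee knows it.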

This implies that the future AI will have a deontological rule that says "Don't blackmail" somehow hard-coded in it, contradicting the premise that it will be an utilitarian.

I think you misunderstood point 3. 3 is a follow-up to 2: humans and AI systems alike have incentives to discourage blackmail, which increases the likelihood that blackmail is a self-defeating strategy.

Shut up and multiply.

Eliezer has endorsed the claim "two independent occurrences of a harm (not to the same person, not interacting with each other) are exactly twice as bad as one". This doesn't tell us how bad the act of blackmail itself is, it doesn't tell us how faithfully we should implement that idea in autonomous AI systems, and it doesn't tell us how likely it is that a superintelligent AI would find itself forced into this particular moral dilemma.

Since Eliezer asserts a CEV-based agent wouldn't blackmail humans, the next step in shoring up Roko's argument would be to do more to connect the dots from "two independent occurrences of a harm (not to the same person, not interacting with each other) are exactly twice as bad as one" to a real-world worry about AI systems actually blackmailing people conditional on claims (a) and (c). 'I find it scary to think a superintelligent AI might follow the kind of reasoning that can ever privilege torture over dust specks' is not the same thing as 'I'm scared a superintelligent AI will actually torture people because this will in fact be the best way to prevent a superastronomically large number of dust specks from ending up in people's eyes', so Roko's particular argument has a high evidential burden.

Assuming that, for some reason, you are following a decision theory that does put you at risk of acausal blackmail: since the hypothetical agent is superintelligent, it has lots of ways to trick people into thinking it's going to torture people without actually torturing them. Since this is cheaper, it would rather do that. And since we're aware of this, we know any threat of blackmail would be empty.

Um, your conclusion "since we're aware of this, we know any threat of blackmail would be empty" contradicts your premise that the AI by virtue of being super-intelligent is capable of fooling people into thinking it'll torture them.

One way of putting this is that the AI, once it exists, can convincingly trick people into thinking it will cooperate in Prisoner's Dilemmas; but since we know it has this property and we know it prefers (D,C) over (C,C), we know it will defect. This is consistent because we're assuming the actual AI is powerful enough to trick people once it exists; this doesn't require the assumption that my low-fidelity mental model of the AI is powerful enough to trick me in the real world.

For acausal blackmail to work, the blackmailer needs a mechanism for convincing the blackmailee that it will follow through on its threat. 'I'm a TDT agent' isn't a sufficient mechanism, because a TDT agent's favorite option is still to trick other agents into cooperating in Prisoner's Dilemmas while they defect.

My point is that, given the evidence I have about his beliefs at that time and about his actions, and assuming I'm not misunderstanding them or Roko's argument, it seems that there is a significant probability that EY lied about not believing that Roko's argument was correct.

Sorry, I'll be more concrete; "there's a serious risk" is really vague wording. What would surprise me greatly is if I heard that Eliezer assigned even a 5% probability to there being a realistic quick fix to Roko's argument that makes it work on humans. I think a larger reason for the ban was just that Eliezer was angry with Roko for trying to spread what Roko thought was an information hazard, and angry people lash out (even when it doesn't make a ton of strategic sense).

It sounds like you have a different model of Eliezer (and of how well-targeted 'lashing out' usually is) than I do. But, like I said to V_V above:

According to Eliezer, he had three separate reasons for the original ban: (1) he didn't want any additional people (beyond the one Roko cited) to obsess over the idea and get nightmares; (2) he was worried there might be some variant on Roko's argument that worked, and he wanted more formal assurances that this wasn't the case; and (3) he was just outraged at Roko. (Including outraged at him for doing something Roko thought would put people at risk of torture.)

The point I was making wasn't that (2) had zero influence. It was that (2) probably had less influence than (3), and its influence was probably of the 'small probability of large costs' variety.

It seems unlikely that they would, if their gun is some philosophical decision theory stuff about blackmail from their future. I don't expect that gun to ever fire, no matter how many times you click the trigger.

What if your devastating take-down of string theory is intended for consumption by people who have never heard of 'string theory' before? Even if you're sure string theory is hogwash, then, you should be wary of giving the impression that the only people discussing string theory are the commenters on a recreational physics forum.

I wasn't saying that there's anything wrong with trying to convince random laypeople that specific academic ideas (including string theory and non-causal decision theories) are hogwash. That can be great; it depends on execution. My point was that it's bad to mislead people about how much mainstream academic acceptance an idea has, whether or not you're attacking the idea.

I think that there are 3 levels of Roko's argument. I signed on to the first, mild version, and I know another guy who independently came to the same conclusion and supports the first, mild version.

Mild. A future AI will reward those who helped to prevent x-risks and create a safer world, but will not punish anyone. Maybe they will be resurrected first, or they will get 2 million dollars of universal income instead of 1 million, or a street will be named after them. If any resource is limited in the future, they will be first in line to get it (but children first). It is the same as a soldier at war expecting that if he dies, his family will get a pension. Nobody is punished, but some are rewarded.

Roko's original. You will be punished if you knew about RB but didn't help to create safe AI.

Strong, ISIS-style RB. All of humanity will be tortured if you don't invest all your efforts in promoting the idea of RB. ISIS is already using this tactic now: they torture people who didn't join ISIS (and upload videos of it), and the best way for someone to escape future ISIS torture is to join ISIS.

I think that 2 and 3 are not valid, because an FAI can't torture people, period. But aging or a bioweapons catastrophe could.

If I believe that an FAI can't torture people, the strong versions of RB do not work on me.

We can imagine a similar problem: if I kill a person N, I will get 1 billion USD, which I could use on saving thousands of lives in Africa, creating FAI, and curing aging. So should I kill him? It may look rational to do so from a utilitarian point of view. So will I kill him? No, because I can't kill.

In the same way, if I know that an AI is going to torture anyone, then I don't think it is an FAI, and I will not invest a cent in its creation. RB fails.

We can imagine a similar problem: if I kill a person N, I will get 1 billion USD, which I could use on saving thousands of lives in Africa, creating FAI, and curing aging. So should I kill him? It may look rational to do so from a utilitarian point of view. So will I kill him? No, because I can't kill.

I'm not seeing how you got to "I can't kill" from this chain of logic. It doesn't follow from any of the premises.

Relevant here is WHY you can't kill. Is it because you have a deontological rule against killing? Then you want the AI to have deontological ethics. Is it because you believe you should kill but don't have the emotional fortitude to do so? The AI will have no such qualms.

It is more like an ultimatum in the territory, which was recently discussed on LW. It is a fact that I know about myself. I think it has both emotional and rational roots, but it is not limited to them. So I also want other people to follow it, and of course the AI too. I also think that an AI would be able to find a way out of any trolley-style problems.