I was talking a couple of weeks ago to a few friends from the glowfic community about whether it was logically consistent for a power to completely erase a concept from existence to also be able to guarantee that no one would die or cease to exist due to this change. I argued that no, this is impossible, because there exist concepts such that, if you erase them, you will change some (or even most) people enough that it will be equivalent to killing them and replacing them with someone else. I pointed to a parallel between this situation and the definition of “omniscience” and then the discussion went to that but it was past 10PM and I’m no longer a spry young person who can stay up that late and stay functional, so I couldn’t explain exactly what I meant very well. Then I kinda forgot to return to this post and it sat in my drafts for weeks.

I’m gonna explain what I meant here, and I’m gonna start by talking about determinism.

Of the many, many different ideas put forward to explain, model, and experiment with artificial cognition, one of the most practically successful is the Artificial Neural Network. Not because we’ve managed to actually create cognition with one, but because of how useful it is for Machine Learning and Pattern Recognition. If you’ve ever heard of the latest buzz about “Deep Learning,” it has to do with that.

The inspiration, as you might have guessed, is biological neural networks, such as brains. An ANN can be seen as a collection of “neurons,” basic units of calculation, organised in many interconnected layers; it takes a vector of inputs and combines them many times, applying transformations to each such combination, until it reaches the output layer with the result of the computation.

Consider a small example with the inputs $x_1$, $x_2$, and $x_3$. Each hidden node combines them (each with its own weight) and applies a (generally nonlinear) transformation to the combination. After that, each output node does the same, with its own transformation applied, giving us its result in the end.

The networks don’t have to be fully connected like in that example; we could remove some arrows. They also don’t have to have only two layers (not counting the inputs), and can have as many layers as we like (though it can easily be shown that if all the transformations are linear then the entire network is equivalent to a single-layer network). And while that might look all interesting and such, at first glance it may not look particularly useful – at least, not without many, many layers and thousands upon thousands of neurons, so we could approximate some part of a brain, for instance.
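As a concrete illustration (the weights and dimensions here are mine, just a sketch), this is the forward pass of a small fully connected two-layer network – a hidden layer with a sigmoid transformation, then a linear output layer:

```python
import math

def forward(x, W1, b1, W2, b2):
    """Forward pass of a two-layer network: a hidden layer with a
    nonlinear (sigmoid) transformation, then a linear output layer."""
    hidden = []
    for weights, bias in zip(W1, b1):
        # Each hidden node takes a weighted combination of the inputs...
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        # ...and applies a (generally nonlinear) transformation to it.
        hidden.append(1.0 / (1.0 + math.exp(-z)))
    # Each output node does the same, here with the identity transformation.
    return [sum(w * h for w, h in zip(weights, hidden)) + bias
            for weights, bias in zip(W2, b2)]

# Three inputs, two hidden nodes, one output node.
y = forward([1.0, 2.0, 3.0],
            W1=[[0.1, 0.2, 0.3], [-0.3, 0.0, 0.3]],
            b1=[0.0, 0.1],
            W2=[[1.0, -1.0]],
            b2=[0.5])
```

Removing an arrow from the diagram corresponds to fixing one of these weights at zero.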

But there are some very interesting results about ANNs. One of the most astonishing is that, if the transformations applied by the nodes fall within a certain class of functions, then we need only two layers to be able to approximate any continuous function to arbitrary precision (though the theorem doesn’t tell us how many nodes that would take), and only three layers to approximate essentially any function at all (same caveat). And, like I said, ANNs are very useful for Machine Learning, which implies that they can learn these functions.

I have mentioned before that Bayesian Inference is, in general, intractable. People like Gaussians a lot because many nice closed-form results can be derived from them, and analytical treatments are in general possible and even easy to do; but for more general distributions you need a lot of information in order to perform useful inference.

Consider, for instance, the triparametrised Student’s t-distribution which I talked about in two posts of mine. Unlike the normal distribution, which has very neat closed forms for calculating many aspects of it, we can’t find a closed-form solution for its Maximum Likelihood Estimate or its Maximum A Posteriori value. There are iterative algorithms to calculate these things, sure, but they have to be applied case-by-case.

Even that is not what I mean by “intractable,” though. I think the best example we can use is that of a few discrete binary variables. We know from Bayes’ Theorem that, for any two variables $A$ and $B$, $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$, whence to infer the value of $A$ after having observed $B$ we need to calculate $\frac{P(B|A)P(A)}{P(B)}$.

Now suppose we’re dealing with four binary variables, $A$, $B$, $C$, and $D$, and we have a joint distribution table for them (that is, a table with the values of $P(A \wedge B \wedge C \wedge D)$ and $P(A \wedge B \wedge C \wedge \neg D)$ and so on for every possible combination). Then, in order to calculate $P(A|B)$ and perform inference, we need:

$$P(A|B) = \frac{P(A \wedge B \wedge C \wedge D) + P(A \wedge B \wedge C \wedge \neg D) + P(A \wedge B \wedge \neg C \wedge D) + P(A \wedge B \wedge \neg C \wedge \neg D)}{P(A \wedge B \wedge C \wedge D) + P(A \wedge B \wedge C \wedge \neg D) + \ldots + P(\neg A \wedge B \wedge \neg C \wedge \neg D)}$$

(That is the tiniest size I can render anything on my blog, and it still wasn’t enough.)

To make the above clearer, I’ll rewrite it:

$$P(A|B) = \frac{\sum_{c,d} P(A, B, c, d)}{\sum_{a,c,d} P(a, B, c, d)}$$

Where I’m shorthanding $P(a, b, c, d)$ for $P(A{=}a \wedge B{=}b \wedge C{=}c \wedge D{=}d)$. We have to sum over all possible values of the other variables.

The numerator, then, is a sum of $2^2 = 4$ terms. The denominator is a sum of $2^3 = 8$ terms. This suggests a general rule, which is correct: if we’re dealing with $n$ binary variables, then simple inference of the value of one of them when the value of another has been observed requires $2^{n-2}$ terms to calculate the numerator and $2^{n-1}$ terms to calculate the denominator. For small values of $n$, this is not too bad, but this calculation takes exponentially many steps to perform.

Not to mention the exponentially many memory entries necessary to even have the table! As you might’ve noticed, we need to store $2^n - 1$ values in the table (the $-1$ is there because they all must necessarily sum to $1$, so the last value can be deduced from the others). And usually, we’re dealing with many more than $4$ variables, and stuff gets complicated really fast when numbers start growing.
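A brute-force sketch of the above, with a made-up joint table, just to make the term-counting concrete:

```python
from itertools import product
import random

# A made-up joint distribution over four binary variables (A, B, C, D):
# one entry per combination, 2^4 = 16 entries, normalised to sum to 1.
random.seed(0)
raw = [random.random() for _ in range(16)]
total = sum(raw)
joint = {bits: p / total for bits, p in zip(product([0, 1], repeat=4), raw)}

def infer_A_given_B(joint, b):
    """P(A=1 | B=b), by summing out the unobserved variables C and D."""
    # Numerator: 2^2 = 4 terms (C and D free; A = 1 and B = b fixed).
    num = sum(joint[(1, b, c, d)] for c in (0, 1) for d in (0, 1))
    # Denominator: 2^3 = 8 terms (A, C and D free; B = b fixed).
    den = sum(joint[(a, b, c, d)]
              for a in (0, 1) for c in (0, 1) for d in (0, 1))
    return num / den

p = infer_A_given_B(joint, b=1)
```

With $n$ variables the dictionary grows to $2^n$ entries and the sums to $2^{n-2}$ and $2^{n-1}$ terms, which is exactly the blow-up described above.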

(No, really fast. Like, I’m fairly good at maths and the table of the time it takes to do breadth-first search in a tree as a function of its depth is still breathtaking: it goes from fractions of a second at shallow depths to years, then millennia, within a couple dozen levels.

Consider that we can have wayyyyy more than $4$ variables in an inference problem and you can surely understand why this can be fairly hard.

(Well, this is not a completely fair comparison – breadth-first search takes much more time than Bayesian Inference in binary variables – but my point was just how fast exponential complexity grows, really, in practical terms.))
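To see just how fast, here’s a quick sketch (assuming, for illustration, one microsecond per entry touched):

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

def enumeration_time_seconds(n, step_seconds=1e-6):
    """Time to touch all 2^n table entries at one microsecond per entry."""
    return (2 ** n) * step_seconds

times = {n: enumeration_time_seconds(n) for n in (10, 20, 30, 40, 50, 60)}
# n = 20 is about a second; n = 40 is nearly two weeks;
# n = 60 is tens of thousands of years.
```

Every extra binary variable doubles the time, so the jump from “instant” to “longer than recorded history” happens over a few dozen variables.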

And even this analysis is not complete, since full Bayesian treatment of real life would involve a Solomonoff prior over every possible reality, and that is literally uncomputable, amongst other complications even when we ignore the above. Just trying to give you a taste of what I mean.

We need some way of making this easier on us. One such way is a Bayesian Network.

A few months ago, someone who used to be called perversesheaf came to tumblr to bash LessWrong. Now, while there is a very large number of criticisms that can be aimed at it, both as a website and as a community, this person decided to bash Bayesianism, which, as readers of this blog might have noticed, is something I personally believe is probably Correct. That person has since deleted their blog, but Scott’s tumblr still has the original post, which I’m going to reproduce here, in a fashion. But first, the basics.

There is a principle in statistics called the Likelihood Principle whose basic content is that, “given a statistical model, all of the evidence in a sample relevant to model parameters is contained in the likelihood function.” You will remember that if we’re trying to estimate parametres $\theta$ after observing some data $D$ then the likelihood function is the following quantity:

$$\mathcal{L}(\theta | D) = P(D | \theta)$$

when seen as a function of $\theta$. And in particular, any function that has the form $c(D)\,P(D|\theta)$, where $c$ is any function of the data alone and not of the parametres, can be seen as a likelihood function.

More intuitively, we say that a function is a probability in a case where we’re wondering about the outcome when the parametres are held fixed: “If a coin was tossed ten times (experiment) and it is fair (parametre), what’s the probability that it lands heads every time (outcome)?”; and it’s a likelihood in a case where we’re wondering about the parametres when the outcome is held fixed: “If a coin was tossed ten times (experiment) and it landed heads every time (outcome), what’s the likelihood that it is fair (parametre)?”

So the likelihood principle, then, says that all the information a sample can give about a parametre is contained in the function $\mathcal{L}(\theta|D)$. And if one believes that Bayes’ Theorem is the correct way to deal with uncertainty, then one must, necessarily, believe that the likelihood principle is true, since:

$$P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)} \propto \mathcal{L}(\theta|D)\,P(\theta)$$
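A classic way to see the principle in action (this worked example is mine, not from the original posts): compare a binomial design (“toss the coin 12 times”) with a negative-binomial design (“toss until the 3rd tail”), both yielding 9 heads and 3 tails. Their likelihood functions differ only by a constant factor in $\theta$, so any Bayesian posterior comes out identical under both:

```python
from math import comb

def binomial_lik(theta, heads=9, n=12):
    """Likelihood of 9 heads under a fixed-n binomial design."""
    return comb(n, heads) * theta**heads * (1 - theta)**(n - heads)

def neg_binomial_lik(theta, heads=9, tails=3):
    """Likelihood under 'toss until the 3rd tail', seeing 9 heads on the way."""
    return comb(heads + tails - 1, heads) * theta**heads * (1 - theta)**tails

# The ratio is constant in theta, so both experiments carry the same
# evidence about theta: any common prior yields the same posterior.
ratios = [binomial_lik(t) / neg_binomial_lik(t) for t in (0.2, 0.5, 0.8)]
```

Frequentist procedures that depend on the sampling plan (like p-values) can disagree between these two designs, which is exactly where the controversy lives.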

Three days ago I got slightly drunk with a few friends (two of whom were mentioned in a recent post), and one of them and I were trying to explain to the other what the difference between confidence and credible intervals was. Since we were, as mentioned, not exactly sober, that did not go as well as it could have; besides, it’s not exactly the simplest of concepts, and the distinction can be hard to really pinpoint. So, I’m writing this to explain it better.

Very often, when an estimated value is reported, it also comes with a “confidence interval,” which is supposed to say something about where the true value is likely to be. For example, when polling people for their voting intentions, maybe it’s reported that some percentage of voters, give or take a few points, will vote for Sanders. Now, there are a few problems with this, the biggest of which being that this is a frequentist concept that does not treat probability in the way we’re intuitively used to.

For all the use its methods see, the frequentist interpretation of probability is actually quite counterintuitive. For a frequentist, probability is a sort of limit: the probability that a given event will occur is the limiting frequency with which it would occur should the trial I’m performing be repeated an infinite number of times. As such, there’s no such thing as “the probability that Sanders will win the 2016 election” or “the probability that it will rain tomorrow.” Either it will or it won’t; it’s not like you can repeat the 2016 election an infinite number of times and see how many times Sanders wins.

Bayesianism reflects something more intuitive, that the probability has to do with our uncertainty over what state the world occupies. Cox’s theorem shows that, if you follow a few reasonable-sounding constraints when dealing with your own uncertainty, then it behaves according to the laws of probability. So in that sense, when people talk about probability in their daily lives, their musings approach the Bayesian interpretation much more than the frequentist one.

And this is why there is a lot of misunderstanding about confidence intervals, as reported by papers and even sometimes the media. When someone reports a 95% confidence interval for a value, like how many people are likely to vote for Sanders next year, even the name suggests that we should be 95% confident that the true value will be there. But the more accurate interpretation is a bit subtler than that.
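One way to internalise the subtlety is to simulate the procedure: “95% confidence” is a property of the interval-building recipe – it captures the true value in about 95% of repeated experiments – not a statement that any single reported interval contains the true value with probability 95%. A quick sketch (normal data with known variance $1$, so the interval is $\bar{x} \pm 1.96/\sqrt{n}$):

```python
import random
random.seed(1)

def ci_covers(true_mean, n=100, z=1.96):
    """One experiment: draw a sample, build the 95% interval, check coverage."""
    xs = [random.gauss(true_mean, 1.0) for _ in range(n)]
    mean = sum(xs) / n
    half = z / n ** 0.5  # known sigma = 1
    return mean - half <= true_mean <= mean + half

# Repeat the whole experiment 2000 times and count how often the
# procedure's interval contains the (fixed, unknown-in-principle) truth.
coverage = sum(ci_covers(3.0) for _ in range(2000)) / 2000
# coverage comes out near 0.95 -- a statement about the procedure,
# not about where the true value "95% probably" is after one sample.
```

A Bayesian credible interval answers the other question: given this one sample and a prior, where does 95% of the posterior mass for the parametre sit?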

In part 3, I discussed the problem of finding a way of drawing a posterior point estimate of a number based on a series of point estimates that’s more “theoretically valid” than taking the median, which is the standard of the domain that inspired the post in the first place. I arrived at a likelihood function like:

So I decided to look at what that looks like with an example vector of estimates and various values of the hyperparametres (for now all the individual hyperparametres of the estimates will be the same).

for something like an “ignorance” prior, with “effective” prior observations of precision , or variance :

and , for again almost no prior observations but with a higher precision:

Pretty much the same thing.

However, for and , one effective prior observation with sample precision (whatever the hell that means with only one observation):

Which is pretty, well, pretty. It’s not even multimodal, and the prior confidence in all four estimates is exactly the same, with a fairly low precision. If I take the precision to :

I was talking to a friend (the same friend who inspired the two previous posts), who was talking to a friend of ours about a thing; there’s a context, but it doesn’t matter to what I want to write here.

Suppose there is some quantity, I’ll call it $\theta$, that I don’t know. Now, some people have estimated it, and given me point estimates $e_1$, $e_2$, etc, so I have a vector $\mathbf{e}$ of estimates.

One possible way to get a posterior estimate for what the true value is, in a sort of Bayesian Model Averaging way, is by having a vector $\mathbf{w}$ of confidences in each of those estimates and then having that $\hat\theta = \frac{\sum_i w_i e_i}{\sum_i w_i}$, which is a weighted average of the estimates. However, in the context of the question, the standard practice seems to be using the median instead of the average, because then we get rid of outliers.

This seems, at first glance, unjustified. Surely there’s some way to use the estimates themselves to determine that an estimate is problematic? Well, suppose three of the four different estimates are clustered close together and the fourth is wildly off. When you look at that, it seems obvious that the fourth person screwed up. Yet, how can you really tell, if all the information about $\theta$ you have to go on are those estimates? What if the first three are the wrong ones? The median may be theoretically inadequate, but how do you reconcile that with the fact that it doesn’t let weird estimates screw your expected value up too much? Even if, a priori, you trusted each of those four banks equally, after seeing that last value it seems very likely that it’s horribly wrong.
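The tension is easy to see with made-up numbers (the values below are my own illustration):

```python
import statistics

estimates = [1.20, 1.25, 1.30, 40.0]  # three plausible values, one outlier

mean = statistics.mean(estimates)      # dragged far away by the outlier
median = statistics.median(estimates)  # stays with the cluster

# The mean lands near 10.94, nowhere near anyone's actual estimate;
# the median stays at 1.275, inside the plausible cluster.
```

The mean treats the outlier as just as informative as the cluster; the median throws away everything about it except which side it’s on.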

My friend suggested a couple of ways of dealing with that, measures that’d be between the average and the median. One suggestion was taking an average between the average and the median of the estimates. The other was a bit more complicated, and it went thusly:

Let $z_i$ equal the number of standard deviations estimate $e_i$ is above or below the average $\bar{e}$. Then he suggests that the components of the confidence vector $\mathbf{w}$ should be given by a function that shrinks as $|z_i|$ grows.

This actually produces interesting results, corresponding to intuition! And I want to thwack him upside his frequentist head for even thinking of creating an operational tool like the ones he did before trying to derive stuff from first principles, but he ended the email he sent me with the questions, “What do you think? Are there established methods to update your beliefs in each of the models of a set, conditional on the predictions of all of them?” so at least his second instinct of trying to figure out if there exists another way is good (when you read this, I still love you ♥).
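I don’t have his exact formula in front of me, but the general flavour of such a scheme can be sketched like this – downweight each estimate by its distance from the plain average, measured in standard deviations. The specific decay $1/(1+z_i^2)$ below is my own placeholder, not his:

```python
import statistics

def zscore_weights(estimates):
    """Confidence in each estimate decays with its distance, in standard
    deviations, from the plain average of the estimates."""
    mean = statistics.mean(estimates)
    sd = statistics.pstdev(estimates)
    if sd == 0:
        return [1.0] * len(estimates)  # all estimates agree: equal weight
    zs = [(e - mean) / sd for e in estimates]
    # Placeholder decay in |z|; his actual formula may well differ.
    return [1.0 / (1.0 + z * z) for z in zs]

def weighted_estimate(estimates):
    ws = zscore_weights(estimates)
    return sum(w * e for w, e in zip(ws, estimates)) / sum(ws)

est = weighted_estimate([1.20, 1.25, 1.30, 40.0])
# est lands between the median (1.275) and the plain mean (10.9375),
# pulled only partway toward the outlier.
```

Which is exactly the “between the average and the median” behaviour he was after – an operational tool, still waiting for a first-principles justification.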

There’s a concept in Probability Theory called “entropy,” which borrows a bit from the physical concept of the same name. Intuitively, it’s the degree of “surprise” of an observation, or of a probability distribution. Alternatively, it’s how much information you gain when you observe something, as opposed to something else.

Let’s be more specific.

Suppose I have a biased coin that gives heads 80% of the time and tails 20% of the time. If I toss it and see tails I’ll be more surprised than if I see heads. If I toss it 5 times and see 4 tails and 1 head I’ll be much more surprised than if I see 4 heads and 1 tail. So it seems that whatever definition of entropy I use must reflect that; it should show me a higher number for more surprising observations, i.e. ones with a lower probability. We conclude then that this number is decreasing in the probability of the thing: lower probability means higher entropy.

So, for now, let’s call the entropy of an observation $x$ by the name $H(x)$. Then the above condition says that $H(x)$ must be decreasing in $P(x)$.

Back to our example. Suppose I observe tails twice. Should I be twice as surprised as if I had observed it once? That sounds reasonable to me, so for now we have that $H(x \wedge y) = H(x) + H(y)$. Since $P(x \wedge y) = P(x)P(y)$ (when $x$ and $y$ are independent), a general form of $H(x) = -\log_2 P(x)$ seems to do the trick just fine; the negative sign is to make it a decreasing function of the probability, and the base of the logarithm will be explained in a bit (this is a retroactive pun and was completely unintended at the time of writing).

Now that we know how to calculate the entropy of a single observation, what’s the entropy of a distribution? That is, if I toss that coin a bunch of times, how surprised should I expect to be, on average? We can just calculate the expectation of that function, $H(X) = -\sum_x P(x) \log_2 P(x)$ (where $X$ is the random variable that can take the relevant values). This is the average entropy of observations drawn from a distribution, usually just called the entropy of that distribution. In the case of our biased coin, then, we have:

$$H(X) = -0.8 \log_2 0.8 - 0.2 \log_2 0.2 \approx 0.72 \text{ bits}$$

(Bits are short for “binary digits” and are the unit of information. They’re also the names of those little $0$s and $1$s in a computer, and this is relevant, too. Now’s when the pun becomes relevant. Har-de-har-har.)
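As a sanity check, here are a few lines computing these quantities for our biased coin:

```python
from math import log2

def surprisal(p):
    """Entropy of a single observation with probability p, in bits."""
    return -log2(p)

def entropy(dist):
    """Average surprisal of a distribution, given as a list of probabilities."""
    return sum(p * surprisal(p) for p in dist if p > 0)

h_tails = surprisal(0.2)      # ~2.32 bits: tails is the surprising outcome
h_heads = surprisal(0.8)      # ~0.32 bits: heads barely surprises us
h_coin = entropy([0.8, 0.2])  # ~0.72 bits of surprise per toss, on average
h_fair = entropy([0.5, 0.5])  # exactly 1 bit: the fair coin maximises it
```

Note that the fair coin is the most entropic binary distribution: when both outcomes are equally likely, each toss teaches you a full bit.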

[Warning: Memetic hazard and philosophical trip. Also, probably incorrect. Talks about death and torture and robots.]

The universe is probably infinite, flat, uniform, and ergodic. This means that there are an infinity of copies of the Earth, all of them identical, all of them containing “a you” that’s reading this post written by “a me.” In fact, all possible distributions of matter happen somewhere, an infinity of times.

The Many-Worlds Interpretation of Quantum Mechanics is probably correct. That means everything that “can happen” will, in “some universe.” This means all possible distributions of matter happen everywhere.

There’s a reasonable chance the process that spawned our universe was some form of Eternal or Chaotic Inflation, in which there’s a huge “field” whose local fluctuations spawn universes. In fact, due to the above, it’s likely that our universe repeats an infinite number of times.

It’s fun to think about the possibility that some strong form of Mathematical Platonism is true, and that the only thing that exists is mathematics, and our universe is no more than a mathematical structure. The implication of a myriad ways all distributions of matter can happen is left as an exercise to the reader.

I listed the above possibilities in what I think is a decreasing order of probability. But suppose any one of them is true.

How do you know which “you” you are? If all things happen, why did no one ever see anything unusual? Why doesn’t Santa Claus suddenly materialise in the middle of Times Square, why does everything follow such neat and predictable laws as if there was only one way things could be?

In Quantum Mechanics, there is a thing called an “amplitude,” which is, well, a number. And the frequency with which we observe a certain outcome is proportional to the square of the amplitude of that outcome. Why? Beats me.

Some people call this probability a “measure.” That’s because there’s a part of maths called “measure theory” which deals with, amongst other things, comparing the “relative size” of infinities. So, for example, if there are an infinite number of “yous” that are in Earths and an infinite number of “yous” that are being run in a computer, a measure is something that compares “how many” there are of each of them relative to the other, even if there are an infinity of both.

I call it “magical reality fluid” because I have no idea how it works and it’s misleading to call it something that looks Serious when you don’t understand it. You might start believing you do.

Regardless, this would mean that, somehow, there’s more magical reality fluid in the “yous” that are in Earths identical to yours than in the “yous” that are Boltzmann brains and only exist for a femtosecond.

However, you have no way, even in principle, of telling whether “you” are the you in this Earth or that Earth, or whether you’re being run on a computer, or what. In fact, there’s not even a fact of the matter; you are all identical, the same person, so asking “which of them” is the real you is like asking whether this 3 or that 3 is the real 3.

However, there are ways of ruling out certain “classes of you” that you’re not. For example, there is probably a “you” that chose something other than you did for breakfast, or didn’t eat breakfast at all, or actually ate it when you didn’t. That you is certainly not you. I mean, it’s not you right now. They might’ve been you until yesterday, but now they’re not.

And likewise, if there is some computer that’s (and therefore an infinity of computers that are) running a you that’s identical to you up until 4AM tomorrow, when it will suddenly turn you into a talking duck with all your memories, then you won’t know until tomorrow at 4AM whether you’re one of those.

However, suppose that all the “yous” suddenly died right now, except for the yous that will turn into a duck tomorrow. Then you will never notice. Subjectively, nothing will change for you. Except now there’s a 100% probability (Well.) that you’ll turn into a duck tomorrow. There’ll be no “yous” that will experience anything different.

We have reason to suppose that the vast majority of your magical reality fluid is in Earths that were formed naturally and follow the inexorable emergent determinism of physics. In those Earths, there is very little room for variation, and most of it is likely in the form of quantum fluctuations that may not affect all that much – or maybe they do, who knows, maybe brains are quantum computers, or maybe quantum fluctuations are enough to make the uncertainty about which sperm fertilises which ovum significant. But in any case, whatever happens to one you in such an Earth probably happens to the vast majority of them. And so when you die in one Earth, you’ll die in most of them.

Subjectively speaking, though, you’ll never notice it. Because there is some magical reality fluid in versions of you that aren’t on Earths, but are nonetheless identical to you. They are you, for all intents and purposes. Except they didn’t die.

I mean, some of them did die. But after they do, subjectively, the only yous that remain are the ones that didn’t – and who, inevitably, broke the laws of physics to do so. Past that point… all bets are off. We don’t have any way, even in principle, to predict what’s going to happen.

But worse than that is that most of these computer simulations won’t even be someone consciously simulating you. You’ll probably just be a byproduct of some other computation. You might be the consequence of the calculation of some function, or something. There’s no reason to expect, a priori, that your future computation will be benign – quite the opposite, since there are many more ways for a human to suffer and be disfigured beyond recognition than there are ways for them to thrive and have a reasonably satisfactory life.

Before you died, that didn’t really matter; the measure – pardon, magical reality fluid – of “yous” in those awful mathematical hells was so absurdly tiny compared to the fluid in deterministic Earths that the probability you’d face them was effectively zero. After all the deterministic Earths have gotten rid of you, however, you’re left with the effective zero fluid as your total fluid, and subjectively you’re… well, who the hell knows where you’ll be? Some mathematical hell, probably.

A benign superintelligence wishing to offset this risk would probably spend a reasonable amount of resources simulating what it believed were reasonable approximations of people who have died in order to try and make the most post-death magical reality fluid of them not be in a mathematical hell, but who knows if this will work, or be enough?

I have a friend who doesn’t want to be cryo-preserved because he’s afraid that humanity, in the future, has a reasonable chance of becoming horrible people who will torture past humans for fun (I want to link a certain recent SMBC comic here, about robots that resurrect humans in order to inflict on them the maximal amount of pain they can feel, but can’t find it), and his probability for that is high enough that the negative effect offsets any of the positive effect of eternal life. The above argument says we’re screwed anyway, but he doesn’t believe it, though I have no idea why, since he’s already entertaining the idea of future humans torturing us for fun.

I wrote a post exactly six months ago explaining why I was a Singularitarian. Or, well, so I thought. Except then I thought about it long and hard. And I finished reading Bostrom’s book. And, well…

My core argument there, that there are many, many ways of getting to an AGI, is sound, I think. The prediction “AGI will be a thing” is disjunctive, and it’s probably correct. However, of the many forms AGI can take, software AI seems to be the murkiest, least well understood one. And… it’s the only one that really promises a “singularity” in the strictest of senses.

The argument, basically, is that a smarter-than-human software AI with access to its own source code and to our knowledge on how it was built would get into a huge feedback loop where it’d constantly improve itself and soar. And that’s a very intuitive argument. Humans are very likely not the “smartest” possible minds, and just eliminating all the cognitive biases and giving us faster processing power would probably be a huge step in the right direction.

But the absurdity heuristic has a converse: just as we dismiss things that sound intuitively absurd before looking at their merits, we accept intuitively plausible ideas too readily, before criticising them. And I don’t think this converse should have a name, because, well, it’s probably not a single thing, it’s the set of all biases and heuristics, it’s just intuition itself, but my point here is that the argument… has a hard time surviving probing. It’s intuitive, we accept it readily, and we don’t question it enough. Or at least, I didn’t.

We don’t have a well-defined notion of what agency and intelligence are, we have absolutely no idea how to even begin building a software agent, and even if we did there is very, very little exploration on actual theoretical hard limits on improvement. Complexity theory, information theory, computability theory, all of those are highly necessary for us to even begin having a grasp on what’s possible and what’s not.

Which is not to say superintelligence won’t happen! In 300, maybe 200, maybe 100 years, it might be here. I don’t know. I can’t predict that. But right now, the Singularity is Pascal’s Mugging, or some other kind of mugging where the situation is so completely out of any reference classes we’ve known that even giving it a probability would be a farce.

And this is also not to say that research into AI safety isn’t necessary. What MIRI is doing, right now, is foundational research, it’s trying to create the field of AI safety as an actual field, with actual people doing research on it. And yes, it will probably include complexity, computability, information, logic, all of that. They’re starting with logic, because logic can prove things for us that are true everywhere, they’re a place to start. They’re working on decision theory, they’re working on value alignment. Those things are good and necessary, and I’m not going to discuss here what priority I personally believe should be given to each of those approaches or how effective MIRI is.

But I no longer think this is an urgent problem. I no longer believe this is something that needs doing immediately. I’ve unconvinced myself that this is a high-impact high-importance project, right now. I’ve unconvinced myself that… I should work on it.

So what, now? I spent the past five years of my life geared towards that goal, I have built a fairly large repertoire of knowledge that would help me there, I have specialised. My foundation is no longer there.

So I guess I’m going to try to use that, my skills and interests and capacity, to make an impact, somehow.