Nice summary. I think of the concept at a slightly higher level of abstraction:

It's just Bayesian inference to update the posterior. A theoretical model gives an estimate of what the state of the system should be (call that the "prior"). A sensor measurement gives an independent estimate of the state (call that the "likelihood"). Composing (multiplying) the two gives the posterior. Further, in this formulation it doesn't really matter what you call the prior or the posterior. You can easily combine estimates from many different sources/sensors.

Kalman filters are just the special case where the likelihood and prior are both Gaussian -- the distributions can be specified with a couple of numbers, and there is a simple closed-form expression for composing them. More generally, one can use all the tools available for Bayesian inference.
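That Gaussian special case can be sketched in a few lines. Below is a hypothetical 1D version of the closed-form composition (all numbers invented): multiplying the two Gaussian densities gives another Gaussian whose mean is the precision-weighted average, which is exactly the scalar Kalman update.

```python
def fuse(mean_a, var_a, mean_b, var_b):
    """Combine two independent Gaussian estimates of the same quantity."""
    w = var_b / (var_a + var_b)            # weight on estimate A
    mean = w * mean_a + (1 - w) * mean_b   # precision-weighted mean
    var = var_a * var_b / (var_a + var_b)  # combined variance is always smaller
    return mean, var

mean, var = fuse(10.0, 4.0, 12.0, 4.0)  # equal confidence -> meet in the middle
print(mean, var)  # 11.0 2.0
```

Note it doesn't matter which argument you call the prior and which the likelihood; the composition is symmetric, which is the point made above.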

I've had this discussion with a number of smart people (math PhDs and so on) over the years, and I've never gotten a good explanation. I understand in the very abstract sense how Kalman filters (and particle filters and so on) can be seen as Bayesian, but when it comes to actually applying Bayes' theorem, things get complicated very fast. For instance, if you're estimating a 3D position (that will be A) and your measurements are some 2D subspace (these will be B), the least squares approach works nicely (normalized products of Normal PDFs).

However, I'm not sure I understand a single component of the expression P(A|B) = P(B|A) P(A) / P(B) in this context. What does A|B or B|A even mean when A is a 3D Gaussian and B is 2D? What are the dimensionalities of these operations? What does it mean to divide a product of pdfs?

I don't expect an answer, but I always wince a bit when I see Bayes in this context. For Normal distributions, I prefer thinking of it as least squares.

However, I'm not sure I understand a single component of the expression P(A|B) = P(B|A) P(A) / P(B) in this context.

You've come to the right place :)

In the context of least squares estimation you can usually ignore the denominator P(B) and only look at P(A|B) ~ P(B|A) P(A). Real mathematicians (!= me) cringe here because the numerator alone isn't a proper PDF, but since we're only seeking the value where that PDF has its maximum, that's ok.

Good, so now we have P(A|B) ~ P(B|A) P(A). What you want to know is the maximum of P(A|B), that is "where's the most probable position of my 3D point A given all the 2D measurements B"?

Let's look at the solution P(B|A) P(A). The first factor is the likelihood function, the second factor is the prior. The prior is easy: It's some prior belief about the distribution of A. So if you know the position of your 3D point somewhat (even without looking at B), say you know it must be in a range of coordinates, or you know that Z must be positive or any other statement about A that you know must hold, that information would go into the prior.

P(B|A) on the other hand is your likelihood. That's kind of the "meat" of Bayes. Your likelihood is a distribution over B, given A. Meaning: if you knew your 3D point A, what would the PDFs of your 2D observations of it look like? Note that the likelihood is to be understood as a function of the state, meaning its domain is A.

So, to summarize: argmax(P(A|B)) = argmax(P(B|A) P(A)). To get the most probable location of your 3D point, you look in your likelihood domain A (in this case the 3D space) for the 3D points that would produce the observations B that you have, and then also take into account your prior beliefs about A to narrow down that set of candidate points.

In the concrete case of least squares both likelihood and prior are assumed to be Gaussian, so if you multiply them you get a new Gaussian with a single maximum. Neat!
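That "single maximum" claim is easy to see numerically. Here is a toy sketch (all distributions made up): multiply a Gaussian prior and a Gaussian likelihood on a grid and locate the argmax; the peak of the product lands at the precision-weighted mean that the closed-form least-squares answer predicts.

```python
import math

def gauss(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

grid = [i * 0.01 for i in range(-500, 501)]              # x from -5 to 5
prior      = [gauss(x, -1.0, 1.0) for x in grid]         # p(A)
likelihood = [gauss(x,  2.0, 0.5) for x in grid]         # p(B|A), as a function of A
posterior  = [p * l for p, l in zip(prior, likelihood)]  # unnormalized p(A|B)

x_map = grid[posterior.index(max(posterior))]
# closed form: precision-weighted mean (1*(-1) + 2*2) / (1 + 2) = 1.0
print(round(x_map, 2))  # 1.0
```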

I appreciate the kind reply, but I don't think you understood the gist of my complaint. I have a degree (just undergrad) in math, and I've implemented Kalman filters, Kalman smoothers, information filters, particle filters and so on at least a dozen times. I know what operations to perform, and I even have an intuition about why they work.

When I complain about the Bayes theorem way of describing or thinking about Kalman filters, I really mean that I don't understand what almost anything in that expression means. The notation is intuitive if read as English, but opaque to what operations are being performed on what mathematical objects. Capital P means probability of an event or predicate, and A is some abstraction of the thing I'm trying to estimate, and B is the abstraction of a new observation or measurement.

Think of it this way: I have some knowledge of my prior estimate, and I'm able to characterize it as a Normally distributed random variable. It has a mean and a covariance in 3D space (let's call those x and P). There is a multivariate function for that pdf (lowercase p); it's an exponential of the negative square of the Mahalanobis distance normalized to have a unit volume. The same thing applies for my measurement, except it's only a 2D mean and covariance (let's call those z and R). I already know that I can use my H matrix, some matrix inverses and multiplications to combine this information. It's really just a case of multiplying the pdf for the state estimate and the measurement and re-normalizing to a unit volume. In other words, I believe both things are true, and they are independent, so I can multiply their pdfs and re-normalize to get a combined result. The rest is just linear algebra.
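In hypothetical numpy (numbers invented), the linear-algebra combination just described looks like this: 3D prior (x, P), 2D measurement (z, R), and the H matrix doing the projection.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])          # prior mean (3D)
P = np.eye(3) * 4.0                    # prior covariance
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])        # measure only the first two coordinates
z = np.array([2.0, 2.0])               # 2D measurement mean
R = np.eye(2) * 4.0                    # measurement covariance

S = H @ P @ H.T + R                    # innovation covariance
K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
x_post = x + K @ (z - H @ x)           # posterior mean
P_post = (np.eye(3) - K @ H) @ P       # posterior covariance

print(x_post)  # [1.5 2.  3. ] -- measured components move halfway, z stays put
```

With equal covariances the measured components split the difference, and the unmeasured third coordinate is untouched, which is the "multiply the PDFs and re-normalize" picture in matrix form.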

Now let's get to Bayes. P(A) is supposed to be the probability of an "event" or predicate. I can squint sideways and translate that as: integrate my pdf for the prior estimate (Normal with mean x and covariance P) over some unspecified bounds and convert that to a probability of my estimate being in those bounds, but I'm not sure that's what is intended. Again a similar thing applies for B (Normal with mean z and covariance R). It's frustrating that the bounds are never stated, because they could be radically different spaces in the numerator and denominator. I guess I'll give that a pass because maybe the notation would be too cumbersome if it was included, but this simplification seems to never be stated in any of the books or papers I've read on it.

Next, P(B|A) has to be probability of another "event", and if I squint again, the best I'm able to come up with is that it means a new Normal pdf with mean = H'z and covariance = H'RH. However, I haven't seen that spelled out anywhere, and so really that's just assuming the conclusion I want, which is super questionable. It also doesn't help me understand the left side of the Bayes equation - that vertical bar there seems to mean something different. When I look to the definition of conditional probability, I don't see anything about the vertical bar applied to multivariate Normal distributions.

If you translate it all to English, it reads as a coherent sentence, and that's fine. This is the "prior state estimate", that's the "observation", and this other thing is the "a posteriori", etc. However, if Bayes really helps with understanding the math, the vertical bar has to mean something specific as an operator, and one would hope it meant the same thing on the left and right sides of the equation.

Similarly, in order for all of those P( ... ) to become scalars so multiplication and division are well defined, you need some integration bounds to turn the event into a probability. But at this point, I'm not even sure if that's what is intended by the notation - maybe they aren't scalars, and multiplication and division mean something radically different here. I honestly don't know.

Probability notation generally works best for the people who already understand the concept in question. Let me take a crack at your question.

The equation in question is P(A|B) = P(B|A) P(A) / P(B). In modern Kalman filter literature, this would be stated as something like: P(x_k | z_k) = P(z_k | x_k) P(x_k) / P(z_k). It is generally left as implicit in these sorts of equations that everything is also conditioned on the sequence z_1 to z_{k-1}. In this equation, x_k is a free variable in the state space (possibly multi-dimensional, so a vector), while z_k is the measurement, which is a realization of the random variable distributed as N(H(x_k), R_k). The result is a PDF over the free variable x_k.

So let's tackle the terms one by one:

1. P(z_k | x_k) - this is the probability that we measured z_k, given that the true object is at x_k. This is the aforementioned normal distribution N(H(x_k), R_k).

2. P(x_k) - this is the prior probability of the state estimate, generally after propagation through the motion model from P(x_{k-1} | z_{k-1}). In the Kalman filter, this is also Gaussian, and conveniently has the same mean as the term above (see note below).

3. P(z_k) - this is the denominator that someone else mentioned earlier can be effectively ignored, which is right - you only need it to normalize the numerator. If you must compute it, it can be factored as the integral over the entire state space of P(z_k | x_k) P(x_k). Given z_k, this is a number, not a function.

4. P(x_k | z_k) - The result, which is a Gaussian PDF. You can arrive at it numerically by plugging in specific values for x_k, in which case (1) and (2) are numbers. Or symbolically, in which case (1) and (2) are functions, and you'll end up with the form of a Gaussian PDF.

Note: The original article quotes a distribution for the product of two Gaussians with arbitrary means. It does not state that this is an approximation, which is exact only in the case of equal means. This is why unbiased measurements are one of the Kalman assumptions.
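To make the "number vs. function" distinction in terms (1)-(4) concrete, here is a crude 1D discretization with made-up densities: (1) and (2) are functions of x; once z is fixed, (3) is a single number; (4) is again a function of x.

```python
import math

def gauss(v, mean, var):
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

dx = 0.01
xs = [i * dx for i in range(-1000, 1001)]             # crude 1D state space
z = 1.0                                               # the measurement we actually got

lik   = [gauss(z, x, 0.5) for x in xs]                # (1) P(z_k|x_k): function of x
prior = [gauss(x, 0.0, 1.0) for x in xs]              # (2) P(x_k): function of x
p_z   = sum(l * p for l, p in zip(lik, prior)) * dx   # (3) P(z_k): a single number
post  = [l * p / p_z for l, p in zip(lik, prior)]     # (4) P(x_k|z_k): function of x

print(round(sum(post) * dx, 6))  # 1.0 -- the normalized posterior integrates to one
```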

I think the crux of the complaint is the imprecision in saying e.g. "P(z_k | x_k) - this is the probability that we measured z_k, given that the true object is at x_k"

Technically, you can only give a useful answer about the probability density of the random variable Z_k at the value z_k, conditioned on X_k = x_k. In the Bayesian interpretation of the Kalman filter, you never have an event "I measured z_k" (that event has probability 0, of course).

I agree that the probability notation is the issue here. Look at how wikipedia shows Bayes' rule for continuous random variables on both sides of the '|': https://en.wikipedia.org/wiki/Bayes%27_theorem#Random_variab... That's the kind of explicit and precise notation I would use to help someone understand the Kalman filter from a Bayesian perspective.

Once you use that definition of Bayes' rule, then you can substitute the definitions of the multivariate normal pdf, Do The Math, and derive the Kalman filter recursive updates.

In term 1, you say P(z_k | x_k) is a probability (which should be a scalar value between 0.0 and 1.0 inclusive). How did we get from the vector pdfs of measurement and estimate to a scalar probability? Maybe you meant evaluating a pdf at an arbitrary point z_k, getting its density (scalar non-negative value, between 0.0 and infinity) at that point? I have the same confusion on your other enumerated terms. If they are probabilities, I'm only familiar with getting those by integrating the pdfs over some set of bounds or range.

I've seen the k subscripts plenty before, and some people use + and - to indicate pre and post parts of an operation, others put conditional vertical bars in the subscripts, but your definition for the pdf of z_k is a bit different than I'm used to. If we focus on the current instant of time (dropping the subscripts), ignore the motion model part, and only deal with a single measurement, I would say the estimate and measurement are defined like this:

Ignoring the information filter stuff, and getting back to your list, Term 1 looks like a pdf with the dimensionality of z, but the result in Term 4 is a pdf with the dimensionality of x. In other words, the resultant Term 4 is a pdf (function) which takes a vector argument in the x space, and I don't see where to pick any given z for the pdf in Term 1.

It's getting late, and I'm getting spacey, but it's tempting to try and write the type signatures of these functions in some strongly typed pseudocode so I can show you my confusion.

I think I can clear that up for you: Bayes' formula works both for concrete probabilities (P(A), P(A|B), etc.) and for PDFs (p(A), p(A|B), ...). So

p(A|B) = (p(B|A) * p(A)) / p(B)

is just

P(A|B) = (P(B|A) * P(A)) / P(B)

The only difference is that in the first version the inputs are functions of A and B and the output is a function of A and B. In the second case you have A and B given and the output is a concrete number.

In the case of the Kalman filter (or LS estimation in general) all your p(..) PDFs are Gaussians.

Yeah, the difference between lowercase p and capital P seems pretty important. Most places show capital P, so that's part of the confusion.

There's more though. Lowercase p(B|A) seems to really mean p_B(h(a)), and that's not obvious. Hell, I might still have it wrong.

And most everyone says to ignore p(B) in the denominator, but that's really sloppy hand-waving. The notation means something, and there should be a well-defined set of substitutions, but in each of the four terms, they do something radically different. I can't see a pattern to follow.
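For what it's worth, the "p(B|A) really means p_B(h(a))" reading can be written out directly. A sketch with invented H, R and z: fix the measurement, and the likelihood becomes a plain function of the state a, evaluated by plugging H @ a into the measurement density.

```python
import numpy as np

H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])   # project the 3D state into the 2D measurement space
R = np.eye(2)                     # measurement covariance
z = np.array([1.0, 2.0])          # the 2D measurement we observed

def likelihood(a):
    """p(B=z | A=a): Gaussian density of z with mean H @ a and covariance R."""
    d = z - H @ a
    return float(np.exp(-0.5 * d @ np.linalg.inv(R) @ d)
                 / np.sqrt((2 * np.pi) ** 2 * np.linalg.det(R)))

# Peaks when H @ a matches z; the unobserved third coordinate is irrelevant.
print(likelihood(np.array([1.0, 2.0, 99.0])) == likelihood(np.array([1.0, 2.0, 0.0])))  # True
```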

It's Bayesian because we are estimating a (posterior) distribution for the state of the system, rather than just a "point" estimate. Restricting to only the Gaussian distribution (while a nice example) is losing the forest for the trees.

> However, I'm not sure I understand a single component of the expression P(A|B) = P(B|A) P(A) / P(B) in this context. What does A|B or B|A even mean when A is a 3D Gaussian and B is 2D? What are the dimensionalities of these operations? What does it mean to divide a product of pdfs?

I'm not able to understand why any of that is bothersome. Suppose A is the system's state (x,y,z) and B is the sensor reading (measures just the projection along the x-axis, such as a depth measurement). Since the sensor is not perfect, it could generate a range of readings for any given position of the object, specified by p(B[x]|A[x,y,z]) -- suppose you're given one value of B which is sampled from that distribution (say that was a Gaussian with standard deviation sigma). Given that value, you need to estimate the posterior p(A|B). Alternatively, you can imagine performing variational inference for p(A|B) in the family of Gaussian distributions (which happens to contain the true distribution in this case). Whichever way we go about it, p(A|B) is a Gaussian with the same standard deviation sigma, centered at the measured B. (Intuitively, we can see that a measurement B will constrain the "x" component of A but not the other two coordinates)

p(A) is the "prior" you get as an extrapolation from your knowledge of the system's past state. However complicated/nonlinear the system's evolution function, you can find p(A, t) either as a closed form expression, or by doing simulations, etc. (typically from p(A,t-1) but note, for example, that this general formulation can be used even if your system's dynamics does not satisfy the Markov property).

If you have both p(A|B) and p(A) manifestly in the form of probability density functions, multiplying them is easy. At the end of this process, it is good enough to have an un-normalized posterior for A, which specifies the relative probability of different states. Dividing by p(B) is equivalent to normalizing the candidate posterior distribution. More generally, if the prior and the likelihood are specified by moments, or by some variational parameters, or by simulations, we can think of other ways to combine them.

Ok. It's bothersome to me because Bayes gets trotted out in the context of Kalman filters as though it is shedding light on the topic or providing rigor to the approach, but I haven't found anyone who can really describe what the elements of that expression are without hand-waving.

You've switched from capital 'P' to lowercase 'p' in your message (which I take to mean Probability and pdf respectively). If that was intentional, it opens a new batch of questions for what the notation means.

Anyways, maybe I'm just a slow learner, too pedantic, or something else. If you're interested, read my other reply where I go into it more.

A|B is not an object in its own right. P is not just a function that means "probability of this thing", as it would if you were dealing with probabilistic events [1]. Instead, P(A) means the probability density function of the random variable A. It is an abuse of notation, and maybe that's what's confusing you: A both represents the random variable, which is a complex object [2], and the particular number that's being substituted into its probability density function to give a number as a result.

Mathematicians use a longer notation: P_A(a). Here A (which should be a subscript but I can't write that here) represents the random variable, P_A is a function R^3->R (the PDF of A) and a is a particular R^3, so that P_A(a) is in R.

Then this formula:

> P(A|B) = P(B|A) P(A) / P(B)

should really be this (I wrote the subscripts in square brackets just to try and distinguish from the actual function parameters):

> P_[A|B=b](a) = P_[B|A=a](b) P_[A](a) / P_[B](b)

Note that A|B=b is a random variable, in this case describing the position of the object given that it was detected at b.
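The subscripted formula is easiest to sanity-check in a discrete toy case, where every term really is unambiguously a number (tables below are made up): A is the true cell of an object, B a noisy detection of it.

```python
P_A = {0: 0.5, 1: 0.5}                 # prior P_[A](a)
P_B_given_A = {0: {0: 0.8, 1: 0.2},    # P_[B|A=0](b)
               1: {0: 0.3, 1: 0.7}}    # P_[B|A=1](b)

b = 0                                  # the detection we observed
P_B = sum(P_B_given_A[a][b] * P_A[a] for a in P_A)         # P_[B](b) = 0.55
post = {a: P_B_given_A[a][b] * P_A[a] / P_B for a in P_A}  # P_[A|B=b](a)

print(round(post[0], 3), round(post[1], 3))  # 0.727 0.273
```

The continuous case is the same shape, with densities in place of the table entries and an integral in place of the sum.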

Thank you for the reply. I spent most of the day talking to a coworker about this stuff, but I didn't see your message until just now.

I agree about the abuse of notation, but even with your improvements, there is still plenty to be confused about. In particular, the denominator has P_[B](b), which looks like a function from R^2->R (the PDF of B in measurement space). I guess it could also mean that PDF evaluated at the specific value of b. However, I think its actual value is supposed to be "the integral from negative to positive infinity in all directions of whatever is in the numerator", which isn't at all obvious from the notation. However, maybe I've got that wrong.

Let's go one step further. On the left side we've got a function R^3->R with parameter 'a'. On the right, we've got two functions of 'b' (R^2->R), one in the numerator, and another in the denominator, presumably evaluated at the value 'b'. Both of those evaluations give scalars, and so it looks like we just get a scaled version of P_[A](a), which is definitely not the right answer.

An alternative is that we're not really evaluating those two functions at 'b', and instead leaving them as functions. If that's so, where did the free variable come from? It's not in the argument list on the left hand side. Although it is in the subscript on the left... Yeah, I'm still pretty confused :-)

> On the right, we've got two functions of 'b' (R^2->R), ... presumably evaluated at the value 'b'. ... An alternative is that we're not really evaluating those two functions at 'b',

You're right the first time. The formula holds for all possible values of 'b' (and all possible 'a'), but as you said, we evaluate it at one particular 'b'. In the case of the Kalman filter, this is the particular position at which the sensor detected the object (and 'B' is the random variable of possible positions where the sensor was likely to have detected the object, given what we knew before we observed 'B'). As a reminder, 'A' is the actual position of the object, and its PDF represents our belief of where it could be before we observe 'B'.

> On the right, we've got two functions of 'b' ... Both of those evaluations give scalars

This is not true. The expression P_[B|A=a](b) depends on both 'a' and 'b'. This is the probability density that the sensor detects the object at 'b' given that it's really at 'a', so it's large if 'b' is approximately equal to 'a' and small if 'b' is far from 'a'. In other words, it depends on 'a' because, if you knew for sure where the object was, then that would affect your belief of where the sensor is likely to detect it.

The 'a' is tucked away in a subscript because this is only a PDF with respect to 'b' i.e. if you choose any 'a' you like and integrate over 'b' then you get 1, but if you choose 'b' and integrate over 'a' then you could get anything. Overall, it's still a function of 'a' and 'b' i.e. over R^6.

So in Bayes formula, when we fix a specific 'b', P_[B|A=a](b) becomes just a function of 'a'. It is large when 'a' is close to that specific 'b', and small when 'a' is far away from it. It's not a coincidence that I said something similar earlier, but with 'a' and 'b' the other way round.

In the Kalman filter: P_[B|A=a](b) is a function over 'a' that has a peak where the object was detected. P_A(a) has a peak where you thought the object was last time (but the peak is now slightly less concentrated near this point, because it might have moved). When you multiply these two functions, you end up with a single peak that is somewhere in between those two. This is a Gaussian PDF except for the fact its integral isn't 1, so we multiply by a constant so that it does become 1. Which leads to ...

> The denominator has P_[B](b), which looks like a function from R^2->R (the PDF of B in measurement space). I guess it could also mean that PDF evaluated at the specific value of b. However, I think it's actual value is supposed to be "the integral from negative to positive infinity in all directions of whatever is in the numerator"

Yes, it is the PDF of B evaluated at 'b'. And it is also the integral of the numerator. That's because those two things are equal; you're applying another formula:

P_[B](b) = \int P_[B|A=q](b) P_[A](q) dq

I chose a different symbol for the variable you're integrating over to stress that it's not really the same as the candidate value 'a' used in Bayes formula. This formula holds for all possible 'b', and even for all pairs of random variables. It's sometimes called the law of total probability.

This formula doesn't get so much attention because it's less subtle. While Bayes formula is working backwards from evidence to the root cause, this is going forward from the root cause 'A' (position of object) to the effect B|A=q (detection given position of object) to B (overall possible detections, given all possible places the object could be).
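That identity is easy to check numerically in a toy 1D case (densities invented). Here B = A + Gaussian noise, so the marginal of B is Gaussian with the variances added, and the integral over q reproduces it.

```python
import math

def gauss(v, mean, var):
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

b, dq = 0.7, 0.001
qs = [i * dq for i in range(-8000, 8001)]          # integrate q over [-8, 8]
integral = sum(gauss(b, q, 0.5) * gauss(q, 0.0, 1.0) for q in qs) * dq

# P_[B](b) directly: A ~ N(0, 1), noise ~ N(0, 0.5), so B ~ N(0, 1.5)
print(abs(integral - gauss(b, 0.0, 1.5)) < 1e-6)  # True
```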

This is a very nice breakdown of Kalman Filtering. One slight correction - Kalman Filtering is one approach to solving the SLAM problem, but SLAM doesn’t require a KF. A particle filter, for example, can be used to solve the SLAM problem.

An addition to your correction: there are many other ways to solve the SLAM problem beyond (kalman/information/particle) filters. Optimization based approaches are very popular (search terms: Graph SLAM, Factor Graphs, Pose Graphs).

We also do the opposite: we use our IMU to better estimate the position of those features/clues. This is exactly what SLAM does. At its core it is just a Kalman filter, with the twist that each time it finds a reliable visual clue it treats it as a sensor and makes it part of its state, which allows us to tell our position much better and build a map.

Some nitpicking here: There are some SLAM systems and flavors that use a Kalman filter as the main estimator under the hood but most are either hybrids or separate the tracking and map optimization (using batch / bundle adjustment methods).

Digging a little deeper here I think it's important to mention that a Kalman filter is just a recursive least squares estimator, and is in fact equivalent to an Information Filter (which, IMO, is much easier to understand if you're already familiar with linear regression / LS systems).

In fact a Kalman filter is just a fancy mathematical reformulation of an Information Filter that allows you to save quite a bit of computation time in case each observation is low-dimensional compared to your state.

I used to work at a company that did Dynamic Positioning systems for vessels and oil rigs, and I worked on the periphery software systems running with a Kalman-based algorithm there. I didn't work on the Kalman stuff then, but I do think it's cool what they can achieve and have always been interested. I will add this to my reading TODO. Thank you.

This is a positively fantastic explanation. One of the best I've seen despite being so short.

Minor nitpick: It might be a bit more clear if the terms "F", "s", "v" were explicitly defined, as well as a sentence explaining the matrix M inversion. It's really great, but when I read "our prediction formula F could be as simple as s = v*t_dot" I immediately think "F? was F defined earlier? Did I miss it? Also what is s?".

I wish I had this summary quite a few years ago. I ended up buying aeronautical software textbooks and trying to glean the information from those to figure out how to integrate multiple IMU inputs.

I learned that real world IMU use in navigation is pretty ugly and has to be supplemented due to constant errors. This can be particularly difficult when flying. In my case I used other sensors, such as a GPS. But a strictly IMU-only approach to lengthy navigation is, AFAIK, impossible.

> I learned that real world IMU use in navigation is pretty ugly and has to be supplemented due to constant errors. This can be particularly difficult when flying. In my case I used other sensors, such as a GPS. But a strictly IMU-only approach to lengthy navigation is, AFAIK, impossible.

Yeah IMU error can be annoying unless you shell out $$$, but this issue occurs with many other kinds of sensors as well.

One common trick is to treat the IMU error as another state variable that is evolving over time. It won't eradicate the compounding error, but it will reduce the amount of calibration necessary.
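A sketch of that trick with made-up numbers: a 1D velocity state augmented with an unknown accelerometer bias, fed biased IMU readings and (noise-free, for simplicity) velocity measurements. The filter learns the bias from the systematic prediction mismatch.

```python
import numpy as np

dt = 0.1
F = np.array([[1.0, -dt],    # v' = v + (a_meas - bias) * dt
              [0.0, 1.0]])   # bias' = bias (modeled as a slow random walk)
G = np.array([dt, 0.0])      # measured acceleration drives the velocity
H = np.array([[1.0, 0.0]])   # we observe velocity only (e.g. GPS-derived)
Q = np.eye(2) * 1e-4         # process noise
R = np.array([[0.01]])       # measurement noise

x, P = np.zeros(2), np.eye(2)            # estimate of [velocity, bias]
true_bias, true_accel, true_v = 0.5, 1.0, 0.0

for _ in range(500):
    a_meas = true_accel + true_bias      # biased IMU reading
    true_v += true_accel * dt            # real system evolves
    x = F @ x + G * a_meas               # predict
    P = F @ P @ F.T + Q
    S = H @ P @ H.T + R                  # update from the velocity sensor
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (np.array([true_v]) - H @ x)
    P = (np.eye(2) - K @ H) @ P

print(round(float(x[1]), 2))  # converges close to the true bias of 0.5
```

The augmented [velocity, bias] state is observable here (velocity is measured, and the bias leaks into the velocity prediction), which is what lets the filter self-calibrate.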

Depends on the quality of the IMU. With a MEMS IMU (what’s in your smart phone and what’s cost effective for most consumer or industrial products today) yes, it’s impossible. Too much drift as a result of process variance in MEMS manufacturing and the “runtime” noise of the IMU itself. With a precision machined IMU, like something you’d find in a submarine or pre-GPS aircraft, totally possible. There’s a YouTube video (would have to dig for the link) showing a military plane in the 1950s flying transcon NYC to LA solely on IMU dead reckoning and arriving with 1/8 mile drift or something ridiculous. The IMU probably weighed a few hundred pounds and cost many hundreds of thousands, if not millions of dollars, though.

This is true when using very-low-cost (i.e. consumer MEMS) sensors, but as a general statement it's false; many military and civilian space- and aircraft systems are solely or primarily inertial. Even ~50 year old passenger jets can make 12-13 hour flights with very high inertial accuracy - enough to find the destination airport runway without a radio beacon (though probably not enough to automatically land on it).

For anyone interested in the technical history of inertial nav, I'd highly recommend "Inventing Accuracy" by Don MacKenzie.

There are good IMUs that can be used for dead reckoning (and, if an old professor is to be believed, watching the thermal expansion of a building over the course of a day) but they're hard to come by and extraordinarily expensive.

Interesting - I don't have the strongest math background so I'm usually somewhat intimidated by topics like this, but this makes it seem like a fairly simple weighted average based on squared error values like in standard deviation? I assume this would also only work on a normal distribution too?

Makes it seem a lot more approachable than many people have made it sound to me, but I may be horribly misunderstanding still.

I struggled a long time trying to get past all the matrix math that the usual Kalman filter tutorial starts out with. KF's make a lot more sense if you start from an example. By the end of the example, you realize that matrix math simplifies the notation hugely. But without the intuition about what it is doing for you, it doesn't help much -- at least it doesn't help me much.

So... my over-simplified touch-stones for KF's:

1. Everything is a Bayesian quantity -- a mean (best guess) and a std. dev. (confidence).

2. You have a model of the system state.

3. You have some sensors, and you have some model for the accuracy & precision (noise) in the measurements.

Now, two rules:

1. The model runs open loop and at a regular cadence: "Tick tock, tick tock, the model updates on the clock." Of course, since you are running open loop, your confidence in every value of the model gets worse with each iteration (std. dev. grows). How much you adjust (reduce) the confidence is based on the precision of the system and the control inputs.

2. Sensor readings are applied as they come in: "Use'em if you got'em."

So.... now the moving average part... if you are twice as confident in your current estimate as you are in the reliability of the new measurement, do a weighted average of 2/3 of the current estimate and 1/3 of the sensor. This is why the confidence is carried along for all state values and sensor readings. Of course compute the weights of the moving average according to the current confidence values with every update.
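The two rules plus the weighted average, sketched in 1D (all numbers invented):

```python
def predict(est, var, process_var):
    # tick tock: the open-loop model leaves the estimate alone
    # but the confidence gets worse (variance grows)
    return est, var + process_var

def update(est, var, meas, meas_var):
    # use'em if you got'em: blend in the reading, weighted by confidence
    w = var / (var + meas_var)           # more weight on the sensor when we're unsure
    return est + w * (meas - est), (1 - w) * var

est, var = 0.0, 1.0
est, var = predict(est, var, 0.5)        # var grows to 1.5
est, var = update(est, var, 3.0, 1.5)    # equally confident -> meet halfway
print(est, var)  # 1.5 0.75
```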

That’s exactly right - new measurements (observations) or state predictions (estimates based on the assumed system model) are weighted based on the ratio of the standard deviation (uncertainty) of that observation/measurement to the historical standard deviation of the model to that point in time. For the closed form equations of the KF to work, you have to assume a) all measurements and parameters in the state model can be approximated by normal distributions with known standard deviations and means b) the underlying system model is linear. If you can’t assume b), then as the author mentions you can approximate the non-linear model with a linear one by taking the Jacobian and evaluating it at each step, and use the slightly modified EKF algorithm to bring everything together into an estimate of the system’s underlying state.

As an aside a nice property of particle filtering (a different approach for localization and SLAM) is that there’s no linearity assumption. A nice property of both particle filtering and the KF/EKF is that the Markov assumption holds. This simplifies the underlying mathematics, as well as benefits the implementation on a computer by reducing time and space complexity required to evaluate the algorithms at each step.
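For the curious, the measurement step of a basic (bootstrap) particle filter is tiny, and nothing in it assumes linearity or Gaussianity: weight each particle by the measurement likelihood, then resample. Toy 1D numbers, seeded for reproducibility.

```python
import math, random

random.seed(0)
particles = [random.uniform(-5.0, 5.0) for _ in range(5000)]  # flat prior over position

z, meas_var = 1.0, 0.25                                       # a measurement arrives
weights = [math.exp(-0.5 * (z - p) ** 2 / meas_var) for p in particles]
total = sum(weights)
weights = [w / total for w in weights]

# resample in proportion to the weights; surviving particles cluster near z
particles = random.choices(particles, weights=weights, k=5000)
est = sum(particles) / len(particles)
print(round(est, 1))  # close to the measurement at 1.0
```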

I love this. I've idly tried to understand the Kalman filter forever (ever=10yr+), and always encountered explanations that threw you straight at the linear algebra[1]. (yeah, even "quick" or "simple" explanation).

This explanation starting from the conceptual is great, and finally provides the needed foundation for me to understand the rest.

[1] huff not that I couldn't understand the linear algebra, just that I, uh, didn't have time for that... >_>