Abstract

:
In 1960, Rudolf E. Kalman created what is known as the Kalman filter, which is a way to estimate unknown variables from noisy measurements. The algorithm follows the logic that if the previous state of the system is known, it could be used as the best guess for the current state. This information is first applied a priori to any measurement by using it in the underlying dynamics of the system. Second, measurements of the unknown variables are taken. These two pieces of information are taken into account to determine the current state of the system. Bayesian inference is specifically designed to accommodate the problem of updating what we think of the world based on partial or uncertain information. In this paper, we present a derivation of the general Bayesian filter, then adapt it for Markov systems. A simple example is shown for pedagogical purposes. We also show that by using the Kalman assumptions or “constraints”, we can arrive at the Kalman filter using the method of maximum (relative) entropy (MrE), which goes beyond Bayesian methods. Finally, we derive a generalized, nonlinear filter using MrE, where the original Kalman Filter is a special case. We further show that the variable relationship can be any function, and thus, approximations, such as the extended Kalman filter, the unscented Kalman filter and other Kalman variants are special cases as well.

1. Introduction

In 1960, Rudolf E. Kalman demonstrated an ingenious way to estimate unknown variables from noisy measurements [1]. He did this by including information about the underlying dynamical system that governed the variables under consideration. With this, the optimal state of the system was inferred. To do this, he had two main assumptions: (1) all noise was Gaussian or normal and linearly additive; (2) the dynamical system was linear. The result was what is known as the Kalman filter.

Essentially, the algorithm follows the logic that if the previous state of the system is known, it could be used as the best guess for the current state. This information is used in two ways, the first is that prior to any measurement, the underlying dynamics of the system may be known. Given this knowledge and the previous state, the new state could be determined. Second, measurements of the unknown variables are taken. These two ways may conflict. Which solution should we believe? The answer is that we should believe them both, with some uncertainty. They should both be taken into account to determine what our new belief is for the state or what the values of the variables are after the measurements.

Bayesian inference is specifically designed to accommodate the problem of updating what we think of the world based on partial or uncertain information. It is well remarked that the Kalman filter is a special case of Bayesian inference [2]. We present our own derivation of the general Bayesian filter, then adapt it for Markov systems. A simple example is shown for pedagogical purposes with emphasis on the construction of the Kalman gain. Besides offering a greater pedagogical understanding of the algorithm, this also offers an insight into what to do when the Kalman assumptions are not valid or no longer apply. This allows the enhancement of sophisticated solutions, such as the extended Kalman, unscented Kalman, etc. [3].

However, Bayes rule does not assign probabilities; it is only a rule to manipulate them. The MaxEnt method [4,5] was designed to assign probabilities. This method has evolved to a more general method, the method of maximum (relative) entropy (MrE) [6,7], which has the advantage of not only assigning probabilities but updating them when new information is given in the form of constraints on the family of allowed posteriors. The main purpose of this paper is to show both general and specific examples of how the MrE can be applied using data and moment constraints. It should also be noted that Bayes’ rule is a special case of MrE. This implies that MrE is capable of producing every aspect of orthodox Bayesian inference and proves the complete compatibility of the Bayesian and entropy methods. Further, it opens the door to tackling problems that could not be addressed by either the MaxEnt or orthodox Bayesian methods individually; problems in which one has data and moment constraints. Thus, Kalman filters can be seen as a special case of filters developed using the MrE methods with Kalman assumptions.

In this paper, we will show several things; first, the derivation of the general Bayesian filter as used for problems that are of the nature that the Kalman filter is intended for, i.e., Markov systems. Second, we will show a simple example illustrating that the Kalman filter is a special case of the general Bayesian filter for pedagogical purposes. Third, we show that using the Kalman assumptions or “constraints”, we can arrive at the Kalman filter from MrE directly. Finally, we will show how the same Kalman logic can be applied to non-linear dynamical systems using Bayes rule and avoid approximations that are usually applied in extended Kalman filter and the unscented Kalman filter.

2. Bayesian Filter

Here, we will build the Bayesian filter. We start with Bayes rule,

p(xk|Yk)=p(Yk|xk)p(xk)p(Yk)

(1)

where the k is a temporal index, p(xk) is our prior, p(Yk|xk) is our likelihood, p(xk|Yk) is the posterior, Yk = {y1, …, yk} are our measurements and xk is some unknown variable that we would like to infer. We can split Yk into two sets, yk and Yk−1, where Yk−1 = {yk−1, …, y1}, which would give us,

p(xk|yk,Yk−1)=p(yk,Yk−1|xk)p(xk)p(yk,Yk−1).

(2)

Using the product rule [8] (which is used to derive Bayes theorem), we attain,

p(xk|yk,Yk−1)=p(yk|xk,Yk−1)p(Yk−1|xk)p(xk)p(yk|Yk−1)p(Yk−1).

Further, we can recognize that we can use Bayes Rule to rewrite part of the equation as,

p(xk|Yk−1)=p(Yk−1|xk)p(xk)p(Yk−1).

(3)

This can be seen as the prior, for xk, given all the other measurements; in other words, all the Bayesian updating on xk prior to measuring yk. This yields,

p(xk|yk,Yk−1)=p(yk|xk,Yk−1)p(xk|Yk−1)p(yk|Yk−1),

(4)

which is sometimes referred to as a recursive Bayesian filter, because each subsequent solution step (posterior) is the prior for the new step.

At this point, we come to our first key assumption for Kalman; if xk is “complete” [9], then yk is not conditionally dependent on Yk−1 or p(yk|xk, Yk−1) = p(yk|xk). In other words, if we have, xk, then we do not need Yk−1 to determine the probability of yk. This then yields,

p(xk|yk,Yk−1)=p(yk|xk)p(xk|Yk−1)p(yk|Yk−1).

(5)

Often, the form for the prior, p(xk|Yk−1), is not known. However, it can be seen as a marginal,

p(xk|Yk−1)=∫p(xk|xk−1)p(xk−1|Yk−1)dxk−1

(6)

where xk−1 is the previous state and p(xk−1|Yk−1) is the previous state posterior. This completes the typical recursive Bayesian filter with the “complete” or Markov assumption.

3. Kalman Filter

The second key assumption of the Kalman filter is that we assume that we do not have the past measurements, Yk−1, when trying to determine our belief for xk. This means that we need a form for our prior that allows us not to use past measurements. However, the previous value for the state, xk−1, is known.

Now, we will include the main Kalman assumptions above, first that all noise is Gaussian and linearly additive. Therefore, we will use Gaussians for our density distributions,

p(yk|xk)=12πσyk2exp[−12σyk2(yk−〈yk〉)2]

(7)

and,

p(xk|Yk−1)=12πσxk2exp[−12σxk2(xk−〈xk〉)2]

(8)

where 〈yk〉 and 〈xk〉 are the means of yk and xk and
σy2 and
σx2 are the variances of each variable, respectively. Note, for illustration purposes, we limit ourselves to one variable. In later sections, we will include multiple variables. Thus, the posterior that we are looking for is,

The next question is deciding on the value of the means, since we are not inferring those, as can be seen from the posterior. For this, we have to look at the “forward” problems for each density function. For the prior, the forward problem is xk = Fk,k−1xk−1 + ηk−1, where Fk,k−1 is called the “transition matrix” and for the likelihood, it is yk = Gkxk + ɛk, where Gk is called the “measurement matrix” function [10]. Each have Gaussian noise, ηk−1 and εk, with zero means, respectively. Therefore, for the prior, we have,

〈xk〉=∫xkp(xk|Yk−1)dxk,

(10)

where substituting yields,

〈xk〉=∫(Fk,k−1xk−1+ηk−1)p(xk|Yk−1)dxk,

(11)

and finally, we have,

〈xk〉=Fk,k−1xk−1.

(12)

Here, it should be noted that a critical point for the construction of Kalman is that we wish to determine Equation (1). By definition, the prior, Equation (3), is conditional on previous measurements. However, this is replaced with a function that is purely based on the dynamics of the system, x, i.e., there are no previous measurements explicitly in Equation (12).

Note, this is similar to a least squares (which itself is a special case of Bayes). There is one more obvious question to be answered: while this may be a solution for the density function in regards to xk, how do we get xk−1? We need a single number. The answer depends on the what is considered the “best guess” or point estimate for xk−1. There are many choices, such as the mean, median or mode. However, since we are dealing with a symmetric solution, they are one in the same. Therefore, the easiest point estimate to get is the mode, x̂, where,

∂p∂x|x=x^=0,

(17)

for any x. Sometimes, this is called the maximum a posteriori (MAP) solution. Thus, the answer to the question noted above is that the point estimate that is used in Equation (12) is x̂k−1, which is the mode from the previous step.

3.1. A Simple Example

To show its processing workflow, we show a very simple example. We wish to know our 1D location given the known dynamical system and a measurement at each time step. First, we let Gk = 1. Then, we apply the dynamical equation,

xk=xk−1+v0Δt+ηk−1,

(18)

where v0 is a known velocity constant and Δt is a known time step. It should noted, as well, that we can write this in terms of the noise,

ηk−1=xk−(xk−1+v0Δt),

(19)

and thus, what we are modeling with our Gaussian is the noise. However, we eventually wish to know xk, so we will write for our case,

〈xk〉=xk−1+v0Δt,

(20)

since, as before, the mean of the noise is zero. Thus, our posterior is,

It is presented in this form to show that the the new variance,
σ^k2, is the inverse of the sum of the inverse individual variances, which is true in general for this type of problem; for example, if we had more than the bivariate case.

One last question that needs to be addressed is what is the value of xk−1? The last assumption of the Kalman filter is that the MAP estimate of the last estimate is the best estimate for this value. This is a key assumption and not trivial, as it means that with each iterative step, information would be lost generally. This is not the case for the Kalman filter, since the Gaussian is assumed, and therefore, the mode and the variance uniquely identify the distribution.

These solutions can be manipulated and written in the other following form, as well,

x^k=xk−1+v0Δt+Kk(yk−xk−1−v0Δt)

(24)

where Kk is what is known as the Kalman “gain”, which for this example is,

Kk=11σyk2+1σxk2=σyk2σxk2σxk2+σyk2.

(25)

This is an especially important pedagogical result, as it shows that Kalman “gain” Equation (25) is simply the new, inferred variance Equation (23).

4. Maximum Relative Entropy

First, we present a review of maximum relative entropy. For a more detailed discussion and several examples, please see [6,11]. Our first concern when using the MrE method to update from a prior to a posterior distribution is to define the space in which the search for the posterior will be conducted. We wish to infer something about the values of one or several quantities, θ ∈ Θ, on the basis of three pieces of information: prior information about θ (the prior), the known relationship between x and θ (the model) and the observed values of the data x ∈ 𝒳. Since we are concerned with both x and θ, the relevant space is neither 𝒳 nor Θ, but the product 𝒳 × Θ, and our attention must be focused on the joint distribution, P (x, θ). The selected joint posterior, Pnew(x, θ), is that which maximizes the entropy (in MrE terminology, we “maximize” the negative relative entropy, S, so that S ≤ 0. This is the same as minimizing the relative entropy),

S[P,Pold]=−∫P(x,θ)logP(x,θ)Pold(x,θ)dxdθ,

(26)

subject to the appropriate constraints. Pold (x, θ) contains our prior information, which we call the joint prior. To be explicit,

Pold(x,θ)=Pold(θ)Pold(x|θ),

(27)

where Pold (θ) is the traditional Bayesian prior and Pold (x|θ) is the likelihood. It is important to note that they both contain prior information. The Bayesian prior is defined as containing prior information. However, the likelihood is not traditionally thought of in terms of prior information. Of course, it is reasonable to see it as such, because the likelihood represents the model (the relationship between θ and x) that has already been established. Thus, we consider both pieces, the Bayesian prior and the likelihood to be prior information.

The new information is the observed data, x′, which in the MrE framework must be expressed in the form of a constraint on the allowed posteriors. The family of posteriors that reflects the fact that x is now known to be x′ is such that,

P(x)=∫P(x,θ)dθ=δ(x−x′),

(28)

where δ(x − x′) is the Dirac delta function. This amounts to an infinite number of constraints: there is one constraint on P (x, θ) for each value of the variable, x, and each constraint will require its own Lagrange multiplier, λ(x). Furthermore, we impose the usual normalization constraint,

∫P(x,θ)dxdθ=1,

(29)

and include additional information about θ in the form of a constraint on the expected value of some function f (θ),

∫P(x,θ)f(θ)dxdθ=〈f(θ)〉=F.

(30)

Note that an additional constraint in the form of ∫P(x, θ)g(x)dxdθ = 〈g〉 = G could only be used when it does not contradict data constraint Equation (28). Therefore, it is redundant, and the constraint would simply get absorbed when solving for λ (x). We also emphasize that constraints imposed at the level of the prior need not be satisfied by the posterior. What we do here differs from the standard Bayesian practice in that we require the constraint to be satisfied by the posterior distribution.

We proceed by maximizing Equation (26) subject to the above constraints. The purpose of maximizing the entropy is to determine the value for P when S = 0, meaning that we want the value of P that is closest to Pold given the constraints. The calculus of variations is used to do this by varying P → δP, i.e., setting the derivative with respect to P equal to zero. The Lagrange multipliers, α, β and λ(x) are used so that the P that is chosen satisfies the constraint equations. The actual values are determined by the value of the constraints themselves. We now provide the detailed steps in this maximization process.

The terms inside the brackets must sum to zero, therefore we can write,

logP(x,θ)=logPold(x,θ)−1+α+βf(θ)+λ(x)

(34)

or

Pnew(x,θ)=Pold(x,θ)e(−1+α+βf(θ)+λ(x))

(35)

In order to determine the Lagrange multipliers, we substitute our solution Equation (35) into the various constraint equations. The constant α is eliminated by substituting Equation (35) into Equation (29),

The final step is to marginalize the posterior, Pnew(x, θ) over x to get our updated probability,

Pnew(θ)=Pold(x′,θ)eβf(θ)ζ(x′,β)

(46)

Additionally, this result can be rewritten using the product rule (P (x, θ) = P(x) P (θ|x)) as

Pnew(θ)=Pold(θ)Pold(x′|θ)eβf(θ)ζ′(x′,β),

(47)

where ζ′(x′, β) = ∫eβf(θ)Pold(θ) Pold(x′|θ) dθ. The right side resembles Bayes theorem, where the term Pold(x′|θ) is the standard Bayesian likelihood and Pold(θ) is the prior. The exponential term is a modification to these two terms. In an effort to put some names to these pieces we will call the standard Bayesian likelihood the likelihood and the exponential part the likelihood modifier so that the product of the two gives the modified likelihood. The denominator is the normalization or marginal modified likelihood. Notice when β = 0 (no moment constraint) we recover Bayes’ rule. For β ≠ 0 Bayes’ rule is modified by a “canonical” exponential factor.

5. Maximum Relative Entropy and Kalman

There are works where entropy maximization is being used in Kalman filtering [12–14]. For example, [14] uses entropy maximization as one of the approximation tools to reduce uncertainty. Here, we show that if the same assumptions are taken into account, the explicit closed form solution is derived. So, a numerical comparison, as in [14], becomes unnecessary and even limited. To the best of our knowledge, there is no work that shows a direct link between the original Kalman filter and maximization of the relative entropy and produces the closed form solution. We will now present a more complicated example illustrating the maximum relative entropy (MrE) solution and discuss the assumptions and constraints that lead to the same closed form Kalman filter solution.

This example consists of analyzing a linear system composed of two equations that represent linear motion with constant acceleration ca,k. The dynamics of the velocity, vk, and the position, xk, are encoded in the following two relationships, which we assume are relevant for predicting the linear motion of the next state,

xk=xk−1+Δtvk−1+ca,kΔt22,

(48)

vk=vk−1+Δtca,k,

(49)

where Δt is the discretization interval between the subsequent measurements, and the index, k, represents the temporal discretization interval.

Here, we will derive the so-called “prediction step”, which will be the posterior or the following optimization criterion or entropy, which has the form,

S(P¯k,Pprior,k)=−∫−∞∞∫−∞∞P¯k(xk,vk)lnP¯k(xk,vk)Pprior,k(xk,vk)dxkdvk,

(50)

where xk is the position, vk is the velocity, Pprior,k (xk, vk) is the prior probability distribution function (which is sometimes a uniform distribution for the first sample) and P̄k(xk, vk) is the posterior distribution to be found as a result of the first Kalman filter step, which is also called a “prediction” step [9]. In fact, it is the object that we are deriving, P̄k(xk, vk), that will be the prior for our Bayesian filter and in the special case, the Kalman filter, such as Equation (8).

All constraints come from the same Kalman filter assumptions. We derive the first constraint using Equation (48), which is the variance,

∫−∞∞∫−∞∞(xk,〈xk〉)2P¯k(xk,vk)dxkdvk=0,

(51)

where,

xk=xk−1+Δtvk−1+ca,kΔt22.

(52)

Notice that there is no noise term in Equation (52). This is why the variance is zero. There is no noise, and thus, the variables of interest could be determined explicitly from the model. However, obviously, all variables (and constants, if any) have some noise (η) variables, which in the Kalman filter are assumed to be additive and linear. Then, Equation (51) can be rewritten to,

∫−∞∞⋯∫−∞∞(Ψk)2P¯k(xk,vk,ηx,k−1,ηv,k−1,ηa,k−1)Ωk=0,

(53)

where Ωk = dxkdvkdηx,k−1dηv,k−1dηa,k−1, and where,

Ψk=xk−[(x^k−1+ηx,k−1)+Δt(v^k−1+ηv,k−1)+(ca,k+ηa,k−1)Δt22],

(54)

where Ψk is the noise term of the model. Note that this is effectively following the note mentioned in conjunction with Equation (19) and where x̂k−1 and v̂k−1 are estimates of our variables from the previous discretization interval and ηx,k−1, ηv,k−1, ηa,k−1 are multivariate normal distribution additive noise variables, which have means of zero. Frequently, the joint prior distribution of noise variables is defined by four main assumptions:

(1)

The means of all noise variables are zero;

(2)

The joint distribution function is a multivariate normal distribution;

(3)

The covariance matrix is not only valid for the previous posterior distribution discretization interval, but also for the current posterior distribution discretization interval, which in our case is P̄(xk, vk, ηx,k−1, ηv,k−1, ηa,k−1). In other words, it is implied that in our specific case, we have the following equalities,

where the variances and covariances are usually taken from the inference result of the previous discretization interval or are set with initial guesses.

(4)

The last, but not the least, assumption in Kalman Filtering is that our noise variables are independent from our main state variables, i.e., xk and vk. The main benefit of this assumption is that we can manipulate Equation (53) by applying the constraints in Equations (55)–(58). Keep in mind that the means of noise variables are zeros, so many additive terms will zero out after integrations. Therefore, we finally get Equation (53) in the form of,

∫−∞∞∫−∞∞(xk−〈xk〉)2P¯k(xk,vk)dxkdvk

(59)

=σx,k−12+2covx,v,k−1Δt+σv,k−12Δt2+σa,k−12Δt44,

(60)

where,

〈xk〉=x^k−1+Δtv^k−1+ca,kΔt22.

(61)

Similarly, we can construct two other MrE constraints based on Kalman filter assumptions as,

With these equations, along with a normalization constraint, like Equation (29), we get the posterior distribution function of the prediction step of the Kalman filter using MrE. For illustration purposes, we assume that Pprior,k(xk, vk) is a uniform distribution. In other words, the distribution that maximizes Equation (50) with constraint Equations (59), (62) and (64) is the prediction step function below. This solution yields the same answer as the prediction step as in the Kalman filter, where P̄k(xk, vk) is an unknown posterior distribution function being constrained. This distribution is simply a multivariate normal distribution with a covariance matrix containing elements defined by scalar values on the right side of Equations (59), (62) and (64), and the means defined by our mathematical model, which are,

xk=x^k−1+Δtv^k−1+ca,kΔt22,

(65)

vk=v^k−1+Δtca,k.

(66)

For simplicity’s sake, we will introduce those scalar coefficients as,

σ¯x,k2=σx,k−12+2covx,v,kΔt+σv,k−12Δt2+σa,k−12Δt44,

(67)

σ¯v,k2=σv,k−12+σa,k−12Δt2,

(68)

cov¯x,v,k=covx,v,k−1+σv,k−12Δt+σa,k−12Δt32.

(69)

Then, the posterior multivariate normal distribution of this inference step (the prediction step of the Kalman filter) is,

5.1. Kalman Filter’s Updating Step

Our focus here is the traditional updating step of the Kalman filter and its reproduction by MrE. The measurement distribution needed would be obtained in a similar manner as in the predictive step using the following constraints,

for this example, where y′k is the observation value that is analogous to Equation (28). Using these pieces, we get,

Plikelihood,k(xk,vk)=12πσy,k2exp[−12σy,k2(yk′−Gkxk)2].

(75)

To avoid confusion, we make this note: this indeed is a function of xk and vk; the reason being that following the previous section, the yk in Plikelihood,k (yk, vk) was replaced with the observed value, y′k through the data constraint (Dirac delta function) and the mean of yk is a function of xk. Therefore, the joint posterior that is analogous to Equation (16) would be,

Pk(xk,vk)∝P¯k(xk,vk)Plikelihood,k(xk,vk).

(76)

To get the estimates of state variables for discretization interval k, we select the modes as in Equation (17),

∂Pk(xk,vk)∂xk|xk=x^k=0,and∂Pk(xk,vk)∂vk|vk=v^k=0.

(77)

This yields the final estimates for the discretization interval, k, as follows,

These covariance values and estimates are then used in the next discretization interval step by repeating the same procedure from the beginning, i.e., starting with Equation (50), we have the next discretization interval’s Kalman filter’s first step as,

where we assumed that
σa,k2=σa,k−12 and ca,k = ca,k−1, i.e., acceleration is constant in this problem. The iterative nature of Kalman filter comes into effect by assuming that Pprior,k+1 (xk+1, vk+1) = Pprior,k+1 (xk, vk) = Pk (xk, vk). In other words, the previous discretization interval’s posterior function is the prior for the current discretization interval.

5.2. Kalman Filter Revisited

We will now present the solution of the Kalman filter that is the same closed form solution as in the previous subsection. First, we need to construct our problem in matrix form.

where x̄k is a vector that has coordinates representing our position and velocity variables. The transition matrix and additive term matrices are,

Fk=[1Δt01],

(90)

hk=[12ca,k−1Δt2ca,k−1Δt].

(91)

The covariance matrix of our state space variables is,

Φk−1=[σx,k−12covx,v,k−1covx,v,k−1σv,k−12].

(92)

Then, the covariance matrix of the so-called noise, based on our mathematical model, is,

Qk=[14σa,k2Δt412σa,k2Δt312σa,k2Δt3σa,k2Δt2].

(93)

The prediction step of the Kalman filter for calculating the covariance matrix is,

Φ¯k=FkΦk−1FkT+Qk,

(94)

and this equation produces the same (the exact closed form solution expressed by elementary algebraic functions) covariance values between variables as in Equations (67), (68) and (69). Therefore, there is a one-to-one match between the prediction step of the Kalman filter and the MrE solution. If we do not measure velocity, but just position (as in the simple example above), then our observation model, Gk, its covariance matrix, Rk, and the measurement vector matrix, yk, are,

Gk=[10],andRk=[σy,k2],andyk=[yk′].

(95)

The updating step of the Kalman filter requires the calculation of the Kalman gain,

Kk=Φ¯kGkT(GkΦ¯kGkT+Rk)−1,

(96)

and the covariance matrix of the Kalman filter is,

Φk=(I−KkGk)Φ¯k(I−KkGk)T+KkRkKkT,

(97)

where I is a unit matrix. Matrix Equation (97) has elements that are exactly the same as the closed form solutions in Equations (80), (81) and (82). The expectation values of the Kalman filter are,

xk=x¯k+Kk(yk−Gkx¯k),

(98)

which produces exactly the same closed form solution as Equation (78) and Equation (79). Thus, the Kalman filter produces the same solution MrE. In other words, if our assumptions are the same, both approaches return the same answer.

Sometimes, there are discussions [15] about the numerical stability of the Kalman filter and a different, and simplified version of covariance matrix Equation (97) is used [9],

Φk=(I−KkGk)Φ¯k.

(99)

However, this equation sometimes produces numerically unstable results [16], but the context is important. On page 151 of [16] it states: “However, Joseph’s stabilized version has more computations than the form given by Equation 3.44. Hence, a filter designer must trade off computational workload versus potential round off errors”. We think that there is some confusion here. If mathematical reduction is done in the estimates’ expressions (after those are obtained by applying the matrix operations as defined in the original Kalman filter), then by definition, there are no rounding errors in the remaining irreducible expressions (which are already reduced to the lowest terms), which might result in a filter’s instability. By reduction to the expression’s lowest terms, we also solve the computational workload issue and avoid the round off error problem. In other words, we must do our homework offline to avoid computational workload and potential round off errors when solving practical engineering problems online. Joseph’s stabilized version or the original matrix form discussion is an artifact of matrix manipulation, but is not an artifact of the filter itself.

Summarizing, if our state space system and its corresponding transition matrix is of such a size and/or sparsity that we can get its inverse matrix analytically in its reduced solution without numerical iterations, then selecting which version (Joseph’s stabilized or original) does not matter, because the closed form and the numeric answer would be the same. From a practical point of view, the maximum relative entropy method might be particularly useful for loosely coupled systems (not necessarily small), because its complexity (the total number of Lagrange multipliers) is equal to the total number of transition equations, and it has no explicit difficulty in calculating inverse matrices, because of the variational techniques used.

The fact is that both Equation (97) and Equation 99 return exactly the same closed form solution as MrE does in Equations (80)–(82), and by definition, these expressions are numerically stable, always. However, if one applies Kalman filter matrix operations and does not continue with further reductions of the final estimates’ expressions into their irreducible forms, then, yes, expression Equation (97) might be better to use compared to Equation (99). This shows one more benefit of MrE: the closed form solutions allow one to avoid numeric instability in certain situations.

6. Nonlinear Filter

The original Kalman filter has an assumption that the relationships between the state space system’s variables are linear. This assumption allows it to be expressed in a matrix form. Therefore, by definition, the Kalman filter is a linear filter, and nonlinear relationships have no explicit representations in transition matrix Equation (90). For that reason, variants of the Kalman filter were invented. One is called the extended Kalman filter, where any function is approximated locally by calculating a Jacobian at the approximate estimated location. Another variant is the unscented Kalman filter, which locally restricts the data sampling to a set of 2n+1 sigma points, where n is the dimension of the state space. It allows the avoidance of calculating the Jacobian and has the benefit that the nonlinear transition can be locally approximated by a cubic characteristic, if we looked from a single variable’s point of view. In the next section, we will derive the Kalman filter expression by a probabilistic approach by applying a one-to-one transition between the state variables, where the transition can be monotonic increasing or decreasing and not necessarily linear, as in original Kalman filter assumptions. More complex and generalized formulae might be found in [17].

In this section, we construct a general transformation of variables. While this can be found in undergraduate texts and advanced literature on Kalman filtering [18,19], we feel the need to include it, as it allows us to further point directly to why Kalman assumptions are necessary. Assume we have a single random variable, X (thus, a univariate approach). A random variable is a function whose value is subject to variations, due to some randomness. A value of this function (which we will call a random variable from now on) is associated with some probability (discrete case) or with probability density (continuous case). In this section, we deal with continuous real-valued data values only, but the approach itself is not restricted to them only.

The cumulative distribution function (cdf) of X is the function given by,

FX(x)=P(X≤x).

(100)

If probability density function (pdf) f of a random variable, X, is known, then the cdf is given by,

FX(x)=∫−∞∞fX(x´)dx´andfX(x)=dFxdx,

(101)

where x́ and x are values of the measurable space.

Assume we are measuring the traveled distance by a robot in meters (random variable X) and we can learn how many kilometers the robot has traveled (random variable Y); then, there is a physical relationship in the measurable space, because we know the ratio between kilometers and meters (A = 1000; note that this parameter is known, i.e., 1000, is a constant), as,

y=A⋅x.

(102)

One of the very first Kalman filter assumptions is that the relationships in Equation (102) have the same configuration in the random variables space, too, i.e., by definition, we can write this as,

Y≡A⋅X.

(103)

Or more generally, there is an invertible function, g (X) (whose inverse function is X = g−1(Y)), with which we can transform the random variable, X, to Y. In other words, the variable, Y, has complete probabilistic information for this transformation, and we can therefore get all the probabilistic characteristics of X. We could assume some nonlinear assumptions here, where there are one-to-many links between variables [17], but for simplicity’s sake, we avoid this here.

Current one-to-one assumptions are still more general than the original Kalman filter assumptions, because we allow not only the linear equation system of variables, but also the system of any continuously increasing or decreasing functions. Then, the definition or the meaning of Equation (103) can be represented by the following expressions:

where Equation (104) is the case when g (X) is a continuous increasing function gincreasing (x) and Equation (105) is the case when g (X) is a continuous decreasing function gdecreasing (X). The main parts of these inequalities are P (Y ≤ y) = P (gincreasing (X) ≤ y) and P (Y ≤ y) = P (gdecreasing (X) ≤ y), i.e., by definition, we can apply the transformation of random variables when constructing the probabilities. In other words, if we know the cdf of the random variable, X (the right side of Equation (104) and Equation (105)), then we can construct the cdf of Y (the left side of Equation (104) and Equation (105). Extending Equation (101) by reusing the expression of the cumulative distribution from Equation (104) and applying the Fundamental Theorem of Calculus and the Chain rule, we attain,

The sign of fY (y) depends on the sign of
d(gdecreasing−1(y))dy, i.e., when it is increasing (the derivative is non-zero), the sign is positive, and when decreasing, the sign is negative. Therefore, in the general case, when function g (X) is invertible and monotonic, we can write the final transformation expression as,

fY(y)=fX(gdecreasing−1(y))|d(gdecreasing−1(y))dy|.

(108)

This is simply known as the Method of Transformations in most elementary probability books or in dynamical systems, the Perron–Frobenius operator. The Perron–Frobenius operator for surjective maps is shown on page 187 in [20]. It is noted in [21]; however, it misses the original proof. We go one step further and show its link to the Kalman filter. We see that expression Equation (108) holds for any continuous monotonic increasing or decreasing function (so, it is not restricted to just the linear functions, as in the Kalman filter). Therefore, we can safely remove the main Kalman filter assumption that all random variables link to each other through linear relationships at the first Kalman stage, called the prediction step. Moreover, we can extend Equation (108) to the cases when the relationships between x and y are not one-to-one. Thus, in general, such a filter could be a nonlinear filter.

We begin the generation for a multivariate, nonlinear filter. A system of transformations in the measurement space is,

{y1=g1(x1,x2,⋯,xn);y2=g2(x1,x2,⋯,xn);⋮yn=gn(x1,x2,⋯,xn),

(109)

where there are n-dimensional vectors, and the transformation function, g, is a multivariate continuous function representing the one-to-one representation between measurement spaces of vectors x = (x1, x2, … xn) and y = (y1, y, … yn). Again, this is not necessary to assume that g(x) is a system of linear functions, like in the Kalman filter. Therefore, the Kalman filter is a special case of the approach that we outline here. The investigation of situations when we have many-to-one (or other combinations) of relationships between vectors x and y is out of scope of this work. Therefore, we have an invertible one-to-one transformation:

The expression in Equation (112) outlines the basis of the first step of the Kalman filter. This step includes modification of the covariance matrix. In other words, based on our knowledge about the prediction model in Equation (109), we can infer the pdf of the next sample. In a previous subsection, the constant, A, represented a relationship between distinct dimensions of the same measured variable, x, where y was the value in another physical dimension. In the Kalman filter, however, x represents a certain measurement value happening at the time, tk, and y represents the subsequent measurement value, y, happening at the time tk+1 = tk + Δt, where index k ranges from one to n and is the discretization interval of our measurement signal, x (t). Therefore, in the Kalman filter context, we should rewrite Equation (102) as,

The predicted time moment sample from the pdf of the previous sample can be obtained by utilizing Equation (115) in the probabilistic space, i.e., we can rewrite Equation (112) and Equation (113) for the prediction step of the Kalman filter as,

The expression in Equation (116) and covariance matrix change fully describe the prediction step of the nonlinear filter, where we assume that there is a one-to-one relation between subsequent time measurements. The original Kalman filter is a special case of probabilistic inference, when we assume that the transformation function is nothing more than the linear transformation, i.e.,

xk(tk+1)=g1(x1(tk),x2(tk),⋯,xn(tk))=A1x1(tk)+A2x2(tk)+⋯+Anxn(tk).

(117)

There is a way to deal with surjective transformations, when the transforming function is nonlinear. In such cases, we should also develop the relationship between such a surjective transform and the extensions of the Kalman filter, such as the extended Kalman filter and the unscented Kalman filter. However, such a development is not in the scope of this paper, and we are going to include them in future discussions.

6.3. Kalman Filter Revisited Using a Jacobian

In this subsection, we will revisit the Kalman filter example from the previous section. Our state space system with transformation function g (x) stays the same,

The prior distribution, fk−1 (⋯), is constructed exactly the same as in previous subsection by eliminating the random noise variables. In other words, it is a multivariate normal distribution with a covariance matrix of,

After inserting Equation (123) into Equation (121), we recover the exact closed form solution as in Equation (70). Therefore, our initial assumptions are relevant and match the assumptions we made in the maximization of relative entropy or the assumptions of the original Kalman filter.

7. Summary and Final Remarks

Kalman demonstrated an ingenious way to estimate unknown variables from noisy measurements, in part by making various assumptions. In this paper, we derive the Bayesian filter and, then, show that by applying the Kalman assumptions, we arrive at a solution that is consistent with the original Kalman filter for pedagogical purposes; explicitly showing that the “transition” or “predictive” step is the prior information and the “measurement” or “updating” step is the likelihood of Bayes’ rule. Further, we showed that the well-known Kalman gain is the new uncertainty associated with the posterior distribution.

Recently, a paper [22] used maximum relative entropy (MrE) with the Kalman assumptions, but did not explicitly state that there is a direct link between these two approaches. Here, we showed that the method of maximum relative entropy (MrE) explicitly produces the same mathematical solutions as the Kalman filter, and thus, Kalman is a special case of MrE. We also showed that the closed form solutions after the application of MrE are immune to real-life numeric problems, which might occur when manipulating the Kalman filter matrix operations.

By applying and manipulating pure probabilistic definitions and techniques used in signal analysis theory, we derived a general, nonlinear filter, where constraining the variables of interest in the form of continuous monotonic increasing or decreasing functions and not necessarily a linear set of functions, like in the original Kalman filter. Thus, we can include more information and extend approximation approaches, such as the extended Kalman filter and unscented Kalman filter techniques and other hybrid variants.

In the end, we derived general distributions using MrE for use in Bayes’ Theorem for the same purposes as the original Kalman filter and all of its offshoots. However, MrE can do even more. An important future work will be to include simultaneous constraints on the posterior that Bayes cannot do easily alone, such as including macroscopic or course-grained relationships between the various variables of interest. This has been demonstrated in [11].

We would like to acknowledge valuable discussions with Julian Center and Peter D. Joseph.