$$ \nonumber \def\Pr#1{P\left(#1\right)} \def\Li#1{p\left(#1\right)} \def\CondPr#1#2{P\left(#1\mid#2\right)} \def\CondLi#1#2{p\left(#1\mid#2\right)} \def\scalar{x} \newcommand{\st}{\lambda} \newcommand{\ob}{\scalar} $$

The Kalman smoother arises when you have a sequence of \(N\) observations, \(\scalar_1\dots\scalar_N\), and you want to infer a sequence of unobserved states, \(\lambda_1\dots\lambda_N\): the states follow a first-order Markov process, and each observation depends only on its corresponding state (see the diagram below). Explanations of Kalman smoothers tend to start with the filter and then obfuscate it by blurring in a bunch of tricky Gaussian convolutions. Here we start at the top and derive the recursions without worrying about the actual distributions; they can be added later.
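The derivation below is deliberately distribution-free, but a concrete instance helps fix ideas. Here is a minimal sketch of the classic scalar linear-Gaussian version of the model: the function name `simulate` and the parameters `a`, `c`, `q`, `r` are illustrative choices for this sketch, not part of the derivation.

```python
import numpy as np

def simulate(N, a=0.9, c=1.0, q=0.1, r=0.5, seed=0):
    """Sample a scalar linear-Gaussian state-space model:
        lambda_i = a * lambda_{i-1} + N(0, q)   (first-order Markov states)
        x_i      = c * lambda_i     + N(0, r)   (observations given states)
    """
    rng = np.random.default_rng(seed)
    states = np.empty(N)
    obs = np.empty(N)
    lam = rng.normal(0.0, 1.0)  # lambda_0 drawn from an arbitrary prior
    for i in range(N):
        lam = a * lam + rng.normal(0.0, np.sqrt(q))
        states[i] = lam
        obs[i] = c * lam + rng.normal(0.0, np.sqrt(r))
    return states, obs

states, obs = simulate(100)
```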


State \(i\) trivially depends upon state \(i-1\), but there is a less trivial dependency on state \(i+1\). So the first thing is to integrate it out; this defines the Kalman smoother: \[\begin{aligned} \underbrace{\CondLi{\lambda_i}{\scalar_1\dots\scalar_N}}_{\text{Smoother }i} &= \int d\lambda_{i+1}\, \CondLi{\lambda_i}{\lambda_{i+1},\scalar_1\dots\scalar_N} \CondLi{\lambda_{i+1}}{\scalar_1\dots\scalar_N}, \\ &= \int d\lambda_{i+1}\, \CondLi{\lambda_i}{\lambda_{i+1},\scalar_1\dots\scalar_i} \underbrace{\CondLi{\lambda_{i+1}}{\scalar_1\dots\scalar_N}}_{\text{Smoother }i+1}.\end{aligned}\] So it’s recursive in the smoother term. Notice the conditional independence in the second line: given state \(i+1\), we don’t need the observations after it. The inference is the wrong way around in the first term, though, so Bayes’ rule flips it: \[\begin{aligned} \CondLi{\lambda_i}{\lambda_{i+1},\scalar_1\dots\scalar_i} &= \frac{ \CondLi{\lambda_{i+1}}{\lambda_i,\scalar_1\dots\scalar_i} \CondLi{\lambda_i}{\scalar_1\dots\scalar_i} }{ \CondLi{\lambda_{i+1}}{\scalar_1\dots\scalar_i} }, \\ &= \frac{ \CondLi{\lambda_{i+1}}{\lambda_i}\CondLi{\lambda_i}{\scalar_1\dots\scalar_i} }{ \CondLi{\lambda_{i+1}}{\scalar_1\dots\scalar_i} }.\end{aligned}\] The final term in the numerator is now similar to the original question, but independent of future observations; that is the Kalman filter. It is evaluated by considering the current observation: \[\begin{aligned} \underbrace{\CondLi{\lambda_i}{\scalar_1\dots\scalar_i}}_{\text{Filter }i} &= \frac{ \CondLi{\scalar_i}{\lambda_i,\scalar_1\dots\scalar_{i-1}} \CondLi{\lambda_i}{\scalar_1\dots\scalar_{i-1}} }{\CondLi{\scalar_i}{\scalar_1\dots\scalar_{i-1}}}, \\ &= \frac{ \CondLi{\scalar_i}{\lambda_i} \CondLi{\lambda_i}{\scalar_1\dots\scalar_{i-1}} }{\CondLi{\scalar_i}{\scalar_1\dots\scalar_{i-1}}}.\end{aligned}\] Again, some observations become redundant given knowledge of state \(i\). 
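In the scalar linear-Gaussian instance, the filter update above — likelihood times predictor, normalised — reduces to closed-form mean and variance updates via the Kalman gain. A sketch, assuming \(\CondLi{\scalar_i}{\lambda_i} = \mathcal{N}(c\,\lambda_i, r)\); `filter_update` and its defaults are hypothetical names for this sketch:

```python
def filter_update(mp, Pp, x, c=1.0, r=0.5):
    """One filter step: combine the predictor p(lambda_i | x_1..x_{i-1})
    = N(mp, Pp) with the likelihood p(x_i | lambda_i) = N(c*lambda_i, r),
    giving the filtered moments of p(lambda_i | x_1..x_i)."""
    S = c * Pp * c + r          # variance of p(x_i | x_1..x_{i-1}), the normaliser
    K = Pp * c / S              # Kalman gain
    mf = mp + K * (x - c * mp)  # filtered mean
    Pf = (1.0 - K * c) * Pp     # filtered variance
    return mf, Pf

# e.g. a flat-ish predictor N(0, 1) updated with observation x = 2
mf, Pf = filter_update(0.0, 1.0, 2.0)  # mean 4/3, variance 1/3
```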
The final term in the numerator is the Kalman predictor; it is evaluated using the trivial relationship between states \(i\) and \(i-1\): \[\begin{aligned} \underbrace{\CondLi{\lambda_i}{\scalar_1\dots\scalar_{i-1}}}_{\text{Predictor }i} &= \int d\lambda_{i-1}\, \CondLi{\lambda_i}{\lambda_{i-1},\scalar_1\dots\scalar_{i-1}} \CondLi{\lambda_{i-1}}{\scalar_1\dots\scalar_{i-1}}, \\ &= \int d\lambda_{i-1}\, \CondLi{\lambda_i}{\lambda_{i-1}} \underbrace{\CondLi{\lambda_{i-1}}{\scalar_1\dots\scalar_{i-1}}}_{\text{Filter }i-1}.\end{aligned}\] So the filter evaluation is recursive given the predictor.
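The predictor step is the same kind of marginalisation as the smoother's: in the Gaussian instance, pushing a Gaussian filter through a linear transition \(\CondLi{\lambda_i}{\lambda_{i-1}} = \mathcal{N}(a\,\lambda_{i-1}, q)\) integrates to another Gaussian. A sketch, with the same illustrative parameter names as before:

```python
def predict(mf, Pf, a=0.9, q=0.1):
    """Predictor step: marginalise filter i-1 = N(mf, Pf) over the
    transition p(lambda_i | lambda_{i-1}) = N(a*lambda_{i-1}, q).
    The integral of the two Gaussians is again Gaussian."""
    mp = a * mf              # predicted mean
    Pp = a * Pf * a + q      # predicted variance: transported plus process noise
    return mp, Pp

mp, Pp = predict(1.0, 2.0, a=0.5, q=0.1)  # mean 0.5, variance 0.6
```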

The evaluation of the whole thing involves working through the equations in the opposite order to which they’re derived:

  1. Define an initial predictor by placing some prior distribution, \(\Li{\lambda_0}\), on a fictitious state \(0\); one predictor step then gives \(\Li{\lambda_1}\).

  2. Run the filter forwards from state \(1\) to state \(N\), alternating predictor and filter updates. This yields a filter value for each state.

  3. Initialise the smoother using the last filter value; recurse the smoother backwards from state \(N\) to state \(1\).
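The three steps above can be sketched end-to-end for the scalar linear-Gaussian instance (again purely illustrative; the recursions themselves are distribution-free). In the Gaussian case the backward smoother recursion collapses to mean and variance updates with a smoother gain; all names and parameters below are assumptions of this sketch:

```python
import numpy as np

def kalman_smoother(obs, a=0.9, c=1.0, q=0.1, r=0.5, m0=0.0, P0=1.0):
    """Forward filter, backward smoother for the scalar Gaussian model
    lambda_i = a*lambda_{i-1} + N(0, q), x_i = c*lambda_i + N(0, r).
    Arrays are 0-indexed; index i holds the quantities for state i+1."""
    N = len(obs)
    mp = np.empty(N); Pp = np.empty(N)    # predictor moments
    mf = np.empty(N); Pf = np.empty(N)    # filter moments
    m, P = m0, P0                         # step 1: prior on state 0
    for i in range(N):
        mp[i] = a * m                     # predictor i from filter i-1
        Pp[i] = a * P * a + q
        S = c * Pp[i] * c + r             # variance of p(x_i | x_1..x_{i-1})
        K = Pp[i] * c / S                 # Kalman gain
        m = mp[i] + K * (obs[i] - c * mp[i])
        P = (1.0 - K * c) * Pp[i]
        mf[i] = m; Pf[i] = P              # step 2: filter i
    ms = mf.copy(); Ps = Pf.copy()        # step 3: smoother N = filter N
    for i in range(N - 2, -1, -1):        # backward recursion over the smoother
        G = Pf[i] * a / Pp[i + 1]         # smoother gain
        ms[i] = mf[i] + G * (ms[i + 1] - mp[i + 1])
        Ps[i] = Pf[i] + G * G * (Ps[i + 1] - Pp[i + 1])
    return ms, Ps, mf, Pf

obs = np.sin(0.3 * np.arange(50))         # any observation sequence will do
ms, Ps, mf, Pf = kalman_smoother(obs)
```

As a sanity check, the smoother and filter agree at state \(N\), and conditioning on the extra future observations can only shrink the variance, so \(P^s_i \le P^f_i\) everywhere.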