Hidden Markov Models (HMM)

A hidden markov model (HMM) is a stochatic process (evolution in time of a random process) based on a markov chain (probability of each event depends only on the state attained in the previous event). A HMM infers unobserved information from observed data through the study of probabilities.

Observed data can contain noise (i.e. due to sensor readings) which hides the ground truth. For instance, if reading location from a GPS while standing still, the sensor may jump around a small amount even though the hidden truth is constant. This “hidden truth” is called a state variable. A HMM chains these states together (chronologically) to estimate the state at time = 1 from time = 0 (for the case of a simple 1st order markov chain). Each state \(X_i\) has an associated observation \(Y_i\).

Assume there are M possible states to choose from (i.e. for binary, M=2). We can define a matrix A of size \(M \times M\) which symbolizes the transition matrix. This transition matrix contains the probabilities of moving from a state \(j\) given the previous state \(i\).

\[a_{ij} = P[X_k = j | X_{k-1} = i]\]

We can define a vector \(S_k\) at a specific point in time \(k\) as a probability vector.

\[\begin{split}S_k = \begin{vmatrix} p[X_k = 1] \\ p[X_k = 2] \\ ... \\ p[X_m = m] \\ \end{vmatrix}\end{split}\]

This is calculated by multiplying the previous state’s probabilities \(S_{k-1}\) by the transition matrix \(A\).

\[S_k^T = S_{k-1}^T A\]

Again, this tells us the probabilities of a state at timestep \(k\).

The probabilities of a state \(i\) resulting in an observed value \(j\) can be contained in an emission matrix \(B\), of size \(m \times k\). Written formally,

\[b_{ij} = P[Y_k = j | X_j = i].\]

The emission matrix shows us the probability of a state \(i\) will result in a result \(k\). For instance, there exists a (say) 5% chance that intentionally typing “d” (state) will result in “c” (observed) by slipping your fingers.

Generally speaking, a transition matrix \(A\) and emission matrix \(B\) could be rebuilt at each timestep \(t\) to consider non-stationary markov chains. However, in the above case, we assume stationary behavior and can therefore conclude a single matrix \(A\) that can be applied across time.

Baum Welch

The transition matrix \(A\), emission matrix \(B\) and initial state distribution \(\pi_0\) can be trained so that the model is maximally like the observed data. This is called expectation maximization (EM). In the training process, there are three phases: 1) initialization, 2) forward, 3) backward, and 4) update.

In the initialization phase, the \(A\), \(B\) and \(\pi_0\) parameters are initialized. This can be randomized, but usually is distributed as a Dirichlet, \(Dir(\alpha)\), which is a multivariate generalization of the beta distribution. This distribution is an exponential family distribution, it has a conjugate prior and thus can be used easily in bayesian analysis.