[TOC]

Reference: the Hidden Markov Model

The HMM is a sequence model. A sequence model or sequence classifier is a model whose job is to assign a label or class to each unit in a sequence, thus mapping a sequence of observations to a sequence of labels.

An HMM is a probabilistic sequence model: given a sequence of units (words, letters, morphemes, sentences, whatever), it computes a probability distribution over possible sequences of labels and chooses the best label sequence.

In this blog, we begin with the Markov chain and then cover the three main constituent algorithms: the **Viterbi** algorithm, the **Forward** algorithm, and the **Baum-Welch** (or EM) algorithm for unsupervised (or semi-supervised) learning.

A weighted finite automaton is defined by a set of states and a set of transitions between states, with each arc associated with a weight.

A Markov chain is a special case of a weighted automaton in which weights are probabilities (the probabilities on all arcs leaving a node must sum to 1) and in which the input sequence uniquely determines which states the automaton will go through. Because it can’t represent inherently ambiguous problems, a Markov chain is only useful for assigning probabilities to unambiguous sequences.

We can view a Markov chain as a kind of **probabilistic graphical model**: a way of representing probabilistic assumptions in a graph. A Markov chain is specified by the following components:

```
Q = q_1 q_2 ... q_N      a set of N states

A = a_11 a_12 ... a_NN   a transition probability matrix A, each a_ij representing
                         the probability of moving from state i to state j,
                         s.t. Σ_{j=1}^{N} a_ij = 1  ∀i

q_0, q_F                 a special start state and end (final) state that are not
                         associated with observations
```

In a **first order** Markov chain, the probability of a particular state depends only on the previous state.
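The first-order assumption can be made concrete with a small sketch. Under it, the probability of a whole state sequence factors into an initial probability times one transition probability per step; the two-state weather chain and its probabilities below are made up for illustration:

```python
import numpy as np

# Hypothetical two-state weather chain; all probabilities are illustrative.
states = ["HOT", "COLD"]
A = np.array([[0.7, 0.3],    # transition probs out of HOT  (rows sum to 1)
              [0.4, 0.6]])   # transition probs out of COLD
pi = np.array([0.5, 0.5])    # initial state distribution

def sequence_prob(seq):
    """P(q_1..q_T) = pi[q_1] * prod_t A[q_{t-1}, q_t] (first-order assumption)."""
    idx = [states.index(s) for s in seq]
    p = pi[idx[0]]
    for prev, cur in zip(idx, idx[1:]):
        p *= A[prev, cur]
    return p

print(sequence_prob(["HOT", "HOT", "COLD"]))  # 0.5 * 0.7 * 0.3 = 0.105
```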

An alternative representation that is sometimes used for Markov chains doesn't rely on a start or end state, instead representing the distribution over initial states and accepting states explicitly: an initial probability distribution π, where π_i is the probability that the chain starts in state i.

In many cases, however, the events we are interested in may not be directly observable in the world. In part-of-speech tagging, for example, we observe the words but not the tags; we call the part-of-speech tags hidden because they are not observed.

A **hidden Markov model (HMM)** allows us to talk about both observed events (like words that we see in the input) and hidden events (like part-of-speech tags) that we think of as causal factors in our probabilistic model.

Unlike a plain Markov chain, an HMM is specified by the following components:
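For reference, the standard component listing (in the same style as the Markov-chain listing above, following Rabiner's notation) is:

```
Q = q_1 q_2 ... q_N      a set of N states

A = a_11 a_12 ... a_NN   a transition probability matrix A, each a_ij representing
                         the probability of moving from state i to state j

O = o_1 o_2 ... o_T      a sequence of T observations, each one drawn from a
                         vocabulary V = v_1, v_2, ..., v_V

B = b_i(o_t)             a sequence of observation likelihoods, also called
                         emission probabilities, each expressing the probability
                         of an observation o_t being generated from a state i

q_0, q_F                 a special start state and end (final) state that are not
                         associated with observations
```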

As an example, consider the number of ice creams eaten each day given HOT or COLD weather; we can represent this with a probabilistic graph (**PG**).
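A minimal sketch of this ice-cream HMM in code; the hidden states are the weather, the observations are how many ice creams were eaten, and all probability values below are illustrative rather than prescribed by the text:

```python
import numpy as np

# Ice-cream HMM: hidden states = weather, observations = ice creams eaten (1..3).
states = ["HOT", "COLD"]
obs_vocab = [1, 2, 3]

pi = np.array([0.8, 0.2])            # initial state distribution (illustrative)
A = np.array([[0.6, 0.4],            # transitions out of HOT
              [0.5, 0.5]])           # transitions out of COLD
B = np.array([[0.2, 0.4, 0.4],       # P(1|HOT), P(2|HOT), P(3|HOT)
              [0.5, 0.4, 0.1]])      # P(1|COLD), P(2|COLD), P(3|COLD)

def joint_prob(path, obs):
    """P(Q, O) = pi[q_1] * B[q_1, o_1] * prod_t A[q_{t-1}, q_t] * B[q_t, o_t]."""
    q = [states.index(s) for s in path]
    o = [obs_vocab.index(x) for x in obs]
    p = pi[q[0]] * B[q[0], o[0]]
    for t in range(1, len(q)):
        p *= A[q[t - 1], q[t]] * B[q[t], o[t]]
    return p

print(joint_prob(["HOT", "HOT", "COLD"], [3, 1, 1]))
```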

An HMM in which there is a (non-zero) probability of transitioning between any two states is called a **fully connected** or **ergodic** HMM.

Sometimes, however, we have HMMs in which many of the transitions between states have zero probability. An example is the **Bakis HMM**, a left-to-right HMM generally used to model temporal processes like speech.

An influential tutorial by Rabiner (1989), based on tutorials by Jack Ferguson in the 1960s, introduced the idea that hidden Markov models should be characterized by **three fundamental problems:**

**Problem 1 (Likelihood):** Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
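This likelihood is computed with the forward algorithm, which sums over all hidden paths efficiently rather than enumerating them. A compact sketch, using illustrative ice-cream-style parameters (not values from the text):

```python
import numpy as np

# Illustrative model over [HOT, COLD]; observations are ice-cream counts 1..3.
pi = np.array([0.8, 0.2])
A  = np.array([[0.6, 0.4],
               [0.5, 0.5]])
B  = np.array([[0.2, 0.4, 0.4],
               [0.5, 0.4, 0.1]])

def forward_likelihood(obs):
    """alpha[t, j] = P(o_1..o_t, q_t = j); P(O | lambda) is the final column sum."""
    o = [x - 1 for x in obs]                 # map counts 1..3 to indices 0..2
    alpha = pi * B[:, o[0]]
    for t in range(1, len(o)):
        alpha = (alpha @ A) * B[:, o[t]]     # sum over previous states, then emit
    return alpha.sum()

print(forward_likelihood([3, 1, 3]))
```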

**Problem 2 (Decoding):** Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.

The Viterbi algorithm has one component that the forward algorithm doesn't have: backpointers. The reason is that while the forward algorithm needs to produce an observation likelihood, the Viterbi algorithm must produce both a probability and the most likely state sequence. We compute this best state sequence by keeping track of the path of hidden states that led to each state, as suggested in the figure below, and then at the end backtracing the best path to the beginning (the Viterbi backtrace).
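The backpointer bookkeeping described above can be sketched as follows; it mirrors the forward recursion but takes a max instead of a sum, and the model parameters are again illustrative:

```python
import numpy as np

# Illustrative model over [HOT, COLD]; observations are ice-cream counts 1..3.
states = ["HOT", "COLD"]
pi = np.array([0.8, 0.2])
A  = np.array([[0.6, 0.4],
               [0.5, 0.5]])
B  = np.array([[0.2, 0.4, 0.4],
               [0.5, 0.4, 0.1]])

def viterbi(obs):
    o = [x - 1 for x in obs]
    v = pi * B[:, o[0]]                        # v[j] = best path prob ending in j
    backptr = []
    for t in range(1, len(o)):
        scores = v[:, None] * A                # scores[i, j]: come from i, go to j
        backptr.append(scores.argmax(axis=0))  # best predecessor of each state j
        v = scores.max(axis=0) * B[:, o[t]]
    best = [int(v.argmax())]                   # backtrace from the best final state
    for bp in reversed(backptr):
        best.append(int(bp[best[-1]]))
    return [states[i] for i in reversed(best)], v.max()

path, prob = viterbi([3, 1, 3])
print(path, prob)
```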

**Problem 3 (Learning):** Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.

The input to such a learning algorithm would be an unlabeled sequence of observations O and a vocabulary of potential hidden states Q. Thus, for the ice cream task, we would start with a sequence of observations O = {1, 3, 2, ...} and the set of hidden states H and C. For the part-of-speech tagging task we introduce in the next chapter, we would start with a sequence of word observations O = {w_1, w_2, w_3, ...} and a set of hidden states corresponding to parts of speech Noun, Verb, Adjective, and so on.

The **Baum-Welch algorithm** is a special case of the **Expectation-Maximization (EM) algorithm**. The algorithm lets us train both the *transition probabilities* **A** and the *emission probabilities* **B** of the HMM. Crucially, EM is an iterative algorithm. **It works by computing an initial estimate for the probabilities, then using those estimates to compute a better estimate, and so on, iteratively improving the probabilities that it learns.**
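One iteration of this estimate-then-re-estimate loop can be sketched as below: the E-step computes forward (α) and backward (β) probabilities to get expected state and transition counts, and the M-step turns those expected counts into new parameters. This is a simplified single-sequence sketch; real implementations iterate to convergence and use scaling or log-space to avoid underflow on long sequences.

```python
import numpy as np

def baum_welch_step(obs, pi, A, B):
    """One EM iteration for a single observation sequence (indices into B's columns)."""
    T, N = len(obs), len(pi)
    # E-step: forward and backward probabilities.
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()
    gamma = alpha * beta / likelihood                     # P(q_t = i | O)
    xi = np.zeros((T - 1, N, N))                          # P(q_t = i, q_{t+1} = j | O)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / likelihood
    # M-step: re-estimate parameters from expected counts.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        mask = np.array(obs) == k
        new_B[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B, likelihood
```

Each call returns parameters whose rows are still proper distributions, along with the data likelihood, which is guaranteed not to decrease across iterations.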