The Speech Recognition Problem

The speech recognition problem can be described as a function that maps the acoustic evidence to a single word or a sequence of words.

Let X = (x1, x2, x3, …, xt) represent the acoustic evidence generated over time (indicated by the index t) from a given speech signal, where X belongs to the complete set of acoustic sequences, XX. Let W = (w1, w2, w3, …, wn) denote a sequence of n words, each belonging to a fixed and known set of possible words, WW. There are two frameworks for describing the speech recognition function:

Speech Recognition Architecture

The above equation establishes the components of a speech recognizer:

The statistical framework for speech recognition raises four problems that must be addressed.

  1. The acoustic processing problem.

    Feature extraction: the extracted features should have low dimensionality, good discriminability, and robustness.

  2. The acoustic modeling problem.

    Decide how the likelihood P(X|W) should be computed. The acoustic models are usually estimated using hidden Markov models (HMMs).

  3. The language modeling problem.

    Decide how to compute the prior probability P(W) of a sequence of words, for example with an N-gram model.

  4. The search problem.
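
The four problems above can be tied together in a small sketch. It assumes the usual statistical decision rule, W* = argmax over W of P(X|W) · P(W), which matches the likelihood P(X|W) and prior P(W) named in the list. Everything concrete here is hypothetical and chosen only for illustration: the tiny corpus, the per-segment acoustic log-likelihoods (standing in for the output of feature extraction and an acoustic model), the add-one smoothing, and the exhaustive search, which a real recognizer would replace with something far more efficient.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical training text for the language model (not from the source).
CORPUS = [
    ["recognize", "speech"],
    ["recognize", "speech", "today"],
    ["wreck", "a", "nice", "beach"],
]
VOCAB = sorted({w for sentence in CORPUS for w in sentence})


def train_bigram(corpus):
    """Count bigram histories and bigrams, with <s> marking the sentence start."""
    histories, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return histories, bigrams


def log_prior(words, histories, bigrams, vocab_size):
    """Language model: log P(W) under a bigram model with add-one smoothing."""
    tokens = ["<s>"] + list(words)
    total = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (histories[prev] + vocab_size))
    return total


def log_likelihood(words, segment_scores, floor=-10.0):
    """Acoustic model stand-in: log P(X|W) as a sum of per-segment word scores."""
    return sum(scores.get(w, floor) for w, scores in zip(words, segment_scores))


def decode(segment_scores, vocab, histories, bigrams):
    """Search: try every word sequence and keep the best combined score."""
    best_words, best_score = None, -math.inf
    for words in product(vocab, repeat=len(segment_scores)):
        score = (log_likelihood(words, segment_scores)
                 + log_prior(words, histories, bigrams, len(vocab)))
        if score > best_score:
            best_words, best_score = list(words), score
    return best_words, best_score


# Hypothetical acoustic log-likelihoods for two speech segments.
SEGMENT_SCORES = [
    {"recognize": -1.0, "wreck": -0.9},
    {"speech": -1.2, "beach": -1.1},
]

HISTORIES, BIGRAMS = train_bigram(CORPUS)
BEST_WORDS, BEST_SCORE = decode(SEGMENT_SCORES, VOCAB, HISTORIES, BIGRAMS)
print(BEST_WORDS, round(BEST_SCORE, 3))
```

In this toy setup the acoustic scores alone slightly prefer "wreck beach", while the language model prior shifts the decision to "recognize speech", which is the role P(W) plays in the combined score.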

Signal Processing and Feature Extraction

Because of physical limitations on how fast the speech articulators can move, a sufficiently short segment of speech can be treated as a stationary process.

In practical terms, a sliding window (with a fixed length and shape) is used to isolate each segment from the speech signal. Typically, the segments are between 20 ms and 30 ms long and overlap by 10 ms.

This approach is commonly referred to as short-time analysis.
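
The sliding-window procedure can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the 16 kHz sample rate, the 25 ms frame length, the Hamming window shape, and the function names are all assumptions made for the example; only the 20 ms to 30 ms range and the 10 ms overlap come from the text.

```python
import math


def hamming(n):
    """Hamming window of n points (one common choice of window shape)."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(samples, sample_rate, frame_ms=25.0, overlap_ms=10.0):
    """Split a signal into fixed-length, overlapping, windowed frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000))
    overlap = int(round(sample_rate * overlap_ms / 1000))
    hop = frame_len - overlap  # how far the window slides each step
    if hop <= 0:
        raise ValueError("overlap must be shorter than the frame length")
    window = hamming(frame_len)
    frames = []
    # Only full frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, hop):
        segment = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(segment, window)])
    return frames


# Usage: one second of a 440 Hz tone sampled at an assumed 16 kHz.
SAMPLE_RATE = 16000
signal = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
frames = frame_signal(signal, SAMPLE_RATE)
print(len(frames), len(frames[0]))
```

Each frame is then short enough to be treated as stationary and can be passed on to feature extraction independently.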

Signal-based Analysis

The only assumption is that the signal is stationary.

Two methods are commonly used:

Production-based Analysis