TERM: MFCC (Mel Frequency Cepstral Coefficients)


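The first thing to understand about speech is that the sounds a person produces are filtered by the shape of the vocal tract, including the tongue, teeth, and so on. This shape determines what sound comes out. If we can determine the shape accurately, we get an accurate representation of the phoneme being produced.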


The shape of the vocal tract manifests itself in the envelope of the short time power spectrum, and the job of MFCCs is to represent this envelope accurately. That is what this article is about.

Mel Frequency Cepstral Coefficients (MFCCs) are a feature widely used in automatic speech and speaker recognition. They were introduced by Davis and Mermelstein in the 1980s, and have been state-of-the-art ever since.

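Steps at a Glance

  1. Frame the signal into short frames.

  2. For each frame, calculate the periodogram estimate of the power spectrum.

  3. Apply the Mel filterbank to the power spectra and sum the energy in each filter.

  4. Take the logarithm of all filterbank energies.

  5. Take the DCT of the log filterbank energies.

  6. Keep the lower DCT coefficients (12 per frame in this article) and discard the rest.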


A few more things are commonly done: sometimes the frame energy is appended to each feature vector, Delta and Delta-Delta features are usually appended as well, and liftering is commonly applied to the final features.

Mel Scale

The Mel scale relates the perceived frequency, or pitch, of a pure tone to its actual measured frequency.


The formula for converting from frequency to Mel scale is:

$$M(f) = 1125 \ln\left(1 + \frac{f}{700}\right) \tag{1}$$


To go from Mels back to frequency:

$$M^{-1}(m) = 700\left(\exp\left(\frac{m}{1125}\right) - 1\right) \tag{2}$$
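The two conversions can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names `hz_to_mel` and `mel_to_hz` are made up for this example, and the constants 1125 and 700 are assumed because they reproduce the values quoted in the worked example later on (300 Hz is about 401.25 Mels, 8000 Hz is about 2834.99 Mels).

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to Mels: m = 1125 * ln(1 + f / 700)."""
    return 1125.0 * math.log(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Invert hz_to_mel: f = 700 * (exp(m / 1125) - 1)."""
    return 700.0 * (math.exp(mel / 1125.0) - 1.0)


# The two band edges used in the filterbank example of this article.
low_mel = hz_to_mel(300.0)
high_mel = hz_to_mel(8000.0)

# Converting back should recover the original frequency (up to rounding).
round_trip = mel_to_hz(hz_to_mel(1000.0))
```

Because the mapping is logarithmic, equal steps in Mels correspond to progressively wider steps in Hz, which is why the Mel-spaced filters get wider at higher frequencies.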


Implementation Steps

Let us start with a speech signal, assuming a sampling rate of 16 kHz.
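MFCC pipelines usually begin by cutting the signal into short overlapping frames. The sketch below shows one way this might look for a 16 kHz signal. The 25 ms frame length and 10 ms step are assumed, commonly used values rather than numbers stated here, and `frame_signal` is a hypothetical helper name.

```python
def frame_signal(signal, sample_rate=16000, frame_ms=25, step_ms=10):
    """Split a sequence of samples into overlapping fixed-length frames.

    frame_ms and step_ms are assumed defaults, not values from the text.
    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate * frame_ms // 1000   # 400 samples at 16 kHz
    frame_step = sample_rate * step_ms // 1000   # 160 samples at 16 kHz
    last_start = len(signal) - frame_len
    return [signal[start:start + frame_len]
            for start in range(0, last_start + 1, frame_step)]


# One second of silence at 16 kHz, used only to show the resulting shape.
frames = frame_signal([0.0] * 16000)
```

Real implementations often zero-pad the tail instead of dropping it, so the exact frame count can differ by one between libraries.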

Plot of Mel Filterbank and windowed power spectrum


The resulting features (12 numbers for each frame) are called Mel Frequency Cepstral Coefficients.
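The last stage, going from filterbank energies to cepstral coefficients, can be sketched as follows: take the logarithm of each filterbank energy, apply a DCT, and keep the low-order coefficients. This is a minimal sketch under stated assumptions. The function name `cepstral_coefficients` is hypothetical, the DCT-II is unnormalized (libraries differ in scaling), and dropping coefficient 0 while keeping coefficients 1 to 12 is an assumed convention chosen to match the 12 numbers per frame mentioned above.

```python
import math


def cepstral_coefficients(filterbank_energies, num_ceps=12, floor=1e-10):
    """Log filterbank energies -> DCT-II -> keep coefficients 1..num_ceps.

    Assumes more filterbank energies than num_ceps (e.g. 26 energies).
    Energies are floored before the log to avoid log(0).
    """
    log_e = [math.log(max(e, floor)) for e in filterbank_energies]
    n = len(log_e)
    # Unnormalized DCT-II, evaluated only for the coefficients we need.
    dct = [
        sum(v * math.cos(math.pi * k * (j + 0.5) / n)
            for j, v in enumerate(log_e))
        for k in range(num_ceps + 1)
    ]
    return dct[1:]  # drop coefficient 0 (assumed convention)


# 26 identical energies: a flat log spectrum has no envelope shape,
# so every kept coefficient is (numerically) zero.
flat = cepstral_coefficients([math.e] * 26)
```

The DCT decorrelates the log energies of the overlapping filters, and the low-order coefficients capture the smooth spectral envelope.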

Next comes the question of how to compute the Mel filterbank mentioned above:

Computing the Mel filterbank

The example in this section uses 10 filterbanks because that is easier to display; in practice you would use 26-40 filterbanks.

To get the filterbanks shown in figure 1(a) we first have to choose a lower and upper frequency. Good values are 300Hz for the lower and 8000Hz for the upper frequency. Of course if the speech is sampled at 8000Hz our upper frequency is limited to 4000Hz. Then follow these steps:

  1. Using equation 1, convert the upper and lower frequencies to Mels. In our case 300Hz is 401.25 Mels and 8000Hz is 2834.99 Mels.

  2. For this example we will do 10 filterbanks, for which we need 12 points. This means we need 10 additional points spaced linearly between 401.25 and 2834.99. This comes out to:

    m(i) = 401.25, 622.50, 843.75, 1065.00, 1286.25, 1507.50, 1728.74, 
           1949.99, 2171.24, 2392.49, 2613.74, 2834.99
  3. Now use equation 2 to convert these back to Hertz:

    h(i) = 300, 517.33, 781.90, 1103.97, 1496.04, 1973.32, 2554.33,
           3261.62, 4122.63, 5170.76, 6446.70, 8000

    Notice that our start- and end-points are at the frequencies we wanted.

  4. We don’t have the frequency resolution required to put filters at the exact points calculated above, so we need to round those frequencies to the nearest FFT bin. This process does not affect the accuracy of the features. To convert the frequencies to FFT bin numbers we need to know the FFT size and the sample rate:

    f(i) = floor((nfft+1)*h(i)/samplerate)

    This results in the following sequence:

    f(i) =  9, 16,  25,   35,   47,   63,   81,  104,  132, 165,  206,  256

    We can see that the final filterbank finishes at bin 256, which corresponds to 8kHz with a 512 point FFT size.

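  5. Now we create our filterbanks. The first filterbank will start at the first point, reach its peak at the second point, then return to zero at the 3rd point. The second filterbank will start at the 2nd point, reach its max at the 3rd, then be zero at the 4th, etc. A formula for calculating these is as follows:

    $$H_m(k) = \begin{cases} 0 & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)} & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)} & f(m) \le k \le f(m+1) \\ 0 & k > f(m+1) \end{cases}$$

    where $M$ is the number of filters we want (so $m = 1, \dots, M$), and $f()$ is the list of M+2 Mel-spaced frequencies.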
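The five steps above can be sketched in Python. This is an illustrative sketch rather than the author's code. The names `hz_to_mel`, `mel_to_hz`, and `mel_filterbank` are hypothetical, the Mel constants 1125 and 700 are assumed because they reproduce the 401.25 and 2834.99 Mel values from step 1, and the defaults follow the worked example (10 filters, 512-point FFT, 16 kHz sample rate, 300 Hz to 8000 Hz).

```python
import math


def hz_to_mel(hz):
    return 1125.0 * math.log(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (math.exp(mel / 1125.0) - 1.0)


def mel_filterbank(num_filters=10, nfft=512, sample_rate=16000,
                   low_hz=300.0, high_hz=8000.0):
    """Return (bins, filters); filters[m][k] weights FFT bin k in filter m."""
    # Steps 1-2: num_filters + 2 points spaced linearly on the Mel scale.
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    mel_points = [low_mel + i * step for i in range(num_filters + 2)]

    # Steps 3-4: back to Hz, then down to the FFT bin index.
    bins = [int(math.floor((nfft + 1) * mel_to_hz(m) / sample_rate))
            for m in mel_points]

    # Step 5: triangular filters rising from bins[m-1] to a peak of 1 at
    # bins[m], then falling to 0 at bins[m+1].
    filters = [[0.0] * (nfft // 2 + 1) for _ in range(num_filters)]
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            filters[m - 1][k] = (k - left) / (center - left)
        for k in range(center, right):
            filters[m - 1][k] = (right - k) / (right - center)
    return bins, filters


bins, filters = mel_filterbank()
```

With these defaults `bins` matches the f(i) sequence listed in step 4. Each filter is then applied by taking the dot product of its row with a frame's power spectrum, giving one energy per filter.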

The final plot of all 10 filters overlaid on each other is:

Figure: A Mel filterbank containing 10 filters. This filterbank starts at 0Hz and ends at 8000Hz. This is a guide only; the worked example above starts at 300Hz.

Deltas and Delta-Deltas 

Also known as differential and acceleration coefficients. The MFCC feature vector describes only the power spectral envelope of a single frame, but speech also carries information in its dynamics, i.e. in the trajectories of the MFCC coefficients over time. It turns out that calculating the MFCC trajectories and appending them to the original feature vector increases ASR performance by quite a bit (if we have 12 MFCC coefficients, we would also get 12 delta coefficients, which would combine to give a feature vector of length 24).

To calculate the delta coefficients, the following formula is used:

$$d_t = \frac{\sum_{n=1}^{N} n \left(c_{t+n} - c_{t-n}\right)}{2 \sum_{n=1}^{N} n^2}$$

where $d_t$ is a delta coefficient from frame $t$, computed in terms of the static coefficients $c_{t-N}$ to $c_{t+N}$. A typical value for $N$ is 2. Delta-Delta (Acceleration) coefficients are calculated in the same way, but they are calculated from the deltas, not the static coefficients.
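The delta computation can be sketched directly from its definition: a weighted difference of the frames n steps ahead and n steps behind, for n from 1 to N, divided by twice the sum of n squared. The function name `delta` is hypothetical, and the edge handling (frames beyond either end are treated as copies of the first or last frame) is an assumption of this sketch, since the formula itself does not say what to do at the boundaries.

```python
def delta(features, N=2):
    """Compute delta coefficients for a list of per-frame feature vectors.

    Frames beyond either end are taken to be copies of the first or last
    frame (an assumed edge-handling choice).
    """
    num_frames = len(features)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    deltas = []
    for t in range(num_frames):
        row = []
        for i in range(len(features[t])):
            num = 0.0
            for n in range(1, N + 1):
                ahead = features[min(t + n, num_frames - 1)][i]
                behind = features[max(t - n, 0)][i]
                num += n * (ahead - behind)
            row.append(num / denom)
        deltas.append(row)
    return deltas


# Two toy coefficients that grow linearly over six frames.
static = [[float(t), 2.0 * t] for t in range(6)]
d = delta(static)   # deltas from the static coefficients
dd = delta(d)       # delta-deltas from the deltas
```

Away from the edges, a coefficient that grows linearly over time gets a delta equal to its slope per frame. Appending `d` and `dd` to the static vectors frame by frame gives the extended feature vector described above.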


I have implemented MFCCs in Python, available here. Documentation can be found at readthedocs.

There is a good MATLAB implementation of MFCCs over here.


Davis, S. and Mermelstein, P. (1980). Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 28, No. 4, pp. 357-366.

Huang, X., Acero, A., and Hon, H. (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall.