Page 1 of 5

Phoneme Based Bangla Digit Recognition

Tanvir Anjum (201606115)
Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology
tanvir1167052@gmail.com

Ataher Sams (201606105)
Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology
asnsamsniloy@gmail.com

Mehedi Hasan Emon (201606113)
Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology
emon00713@gmail.com

Khyrun Nesa Neesa (201606124)
Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology
khyrunneesa3@gmail.com

Abstract:

This research work aims to recognize the Bangla digits 0-9 by separating the digits into their phonemes. Phoneme ambiguity in Bangla speech is more pronounced and varied than in English speech, since Bangla stems from the Indo-European language family. In this research, the dataset was collected manually. The digits were then segmented into their respective phonemes (both voiced and unvoiced parts). Next, Mel Frequency Cepstral Coefficient (MFCC) features of the segmented digits were computed and used to train an artificial neural network, which recognizes the unknown digit. The proposed system is implemented in MATLAB, and its accuracy averages above 90%.

Keywords:

BANGLA SPEECH RECOGNITION, MFCC, ARTIFICIAL NEURAL NETWORK, BANGLA DIGITS, PHONEME.

Introduction:

Research in automatic speech recognition by machine has been carried out for almost four decades [1]. Though Bangla has approximately 260-300 million speakers worldwide, speech recognition research on it has not received the attention it deserves. The phonemes of the world's languages differ, so a system built for another language will not give accurate results for Bangla. Therefore, a system unique to the Bangla language is designed in this paper, which can accurately predict the uttered digit and can serve as a basis for further development in this field, such as recognizing any isolated or connected word of the Bangla language. Such a system can have a large impact for people with disabilities and can help eliminate language barriers between people. It can map human speech to text or commands, support subtitling or indexing of video recordings and streams, speech translation, language learning, and so on.

In this research study, an effort is made to develop such a system: the segmented digits are split into phonemes consisting of both voiced and unvoiced parts, significant MFCC features are extracted from them, and the resulting feature vectors are used to train an artificial neural network that recognizes the unknown digit.

Abbreviations and Acronyms

 MFCC - Mel Frequency Cepstral Coefficients

 FIR - Finite Impulse Response

 FFT - Fast Fourier Transform

 DCT - Discrete Cosine Transform

 IFFT - Inverse Fast Fourier Transform

Units

 Hertz (Hz) - the unit of frequency

 Watt (W) - the unit of power; the rate of energy transfer per second

 Millisecond (ms) - 0.001 seconds


Methodology:

Our work began with collecting speech data from different people to build a useful dataset. We organized the data by sorting and labeling the recordings from 0 to 9 accordingly. The resulting dataset consists of recordings from around 100 different people (both male and female voices).

With the dataset ready, we started writing the MATLAB program to implement our project. First we linked the dataset to the .m file so that the ten folders (labeled 0 to 9) and the paths and locations of their corresponding data files can be traced. We then resampled the data to make sure the sampling rate remains the same throughout the dataset. After resampling, we selected one audio channel from the dual-channel audio to avoid interference. Next we pre-emphasized the data with an FIR filter (coefficients [1, -0.95]), which boosts the high-frequency content of the speech and attenuates low-frequency noise. Each audio file in the dataset was then framed with a window of 25 ms, of which 15 ms overlaps with the next frame.
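The pre-processing chain described above (pre-emphasis with the [1, -0.95] FIR filter, then 25 ms framing with 15 ms overlap) can be sketched in Python; the paper's implementation is in MATLAB, and the 16 kHz sampling rate here is an illustrative assumption.

```python
import numpy as np

def preemphasize(signal, coeff=0.95):
    """Apply the FIR pre-emphasis filter with coefficients [1, -coeff]."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

def frame_signal(signal, fs, frame_ms=25, overlap_ms=15):
    """Split a 1-D signal into overlapping frames.

    frame_ms is the window length and overlap_ms the overlap between
    consecutive windows, so the hop is frame_ms - overlap_ms (10 ms here).
    """
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * (frame_ms - overlap_ms) / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

fs = 16000                               # assumed sampling rate
x = np.random.randn(fs)                  # stand-in for a 1 s recording
frames = frame_signal(preemphasize(x), fs)
print(frames.shape)                      # (98, 400): 400-sample windows, 160-sample hop
```

With a 16 kHz rate, the 25 ms window is 400 samples and the 10 ms hop is 160 samples, which is why one second of audio yields 98 frames.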

Having divided each audio recording into 25 ms windows with 15 ms overlap, we determined the power of each window and observed its power spectrum. We also needed the CVC (consonant-vowel-consonant) pattern of the Bangla digits 0 to 9, which is as follows:

Figure 1: CVC pattern of Bangla digits 0-9

We trimmed the first and last portions of the recording where the frame power fell below 30% of the maximum and considered the remaining part as the voiced part. The 56 ms before and after the voiced part were considered the unvoiced portions of the audio signal, and 265 ms around the highest peak was taken as the voiced part. These three CVC portions of each audio signal were extracted separately for feature extraction. Note that we also treated "Shunno" (zero) as having the CVC pattern.
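The trimming rule above can be sketched as follows. This is a Python illustration operating on per-frame powers (the paper's code is MATLAB); the 10 ms hop and the simplified handling of the 56 ms unvoiced padding are assumptions for illustration.

```python
import numpy as np

def segment_cvc(frame_power, hop_ms=10, pad_ms=56, thresh_frac=0.30):
    """Split a power contour into unvoiced / voiced / unvoiced index ranges.

    Frames whose power is below thresh_frac * max are trimmed from both
    ends; what remains is taken as the voiced part.  pad_ms of frames
    before and after it are treated as the unvoiced consonant portions.
    """
    thresh = thresh_frac * frame_power.max()
    above = np.flatnonzero(frame_power >= thresh)
    v_start, v_end = int(above[0]), int(above[-1])
    pad = pad_ms // hop_ms                          # padding measured in frames
    lead = (max(0, v_start - pad), v_start)         # leading unvoiced range
    trail = (v_end + 1, min(len(frame_power), v_end + 1 + pad))
    return lead, (v_start, v_end + 1), trail

# Toy contour: quiet -> loud -> quiet (max power 2.0, threshold 0.6).
p = np.array([0.1, 0.2, 1.0, 2.0, 1.5, 0.3, 0.1])
print(segment_cvc(p))   # ((0, 2), (2, 5), (5, 7))
```

The three returned ranges correspond to the consonant-vowel-consonant portions that are passed on to feature extraction.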

After that, we extracted features using MFCC.

Mel Frequency Cepstral Coefficient (MFCC)

In any automatic speech recognition system, the first step is to extract features, i.e. identify the components of the audio signal that are good for identifying the linguistic content while discarding everything else that carries other information, such as background noise and emotion.

The main point to understand about speech is that the sounds generated by a human are filtered by the shape of the vocal tract, including the tongue, teeth, etc. This shape determines what sound comes out. If we can determine the shape accurately, this should give us an accurate representation of the phoneme being produced. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and the job of MFCCs is to accurately represent this envelope.

Mel Frequency Cepstral Coefficients (MFCCs) are features widely used in automatic speech and speaker recognition. They were introduced by Davis and Mermelstein in the 1980s and have been state of the art ever since [2].

Steps in MFCC Implementation

1. Frame the signal into short frames.

2. For each frame, calculate the periodogram estimate of the power spectrum.

3. Apply the mel filterbank to the power spectra and sum the energy in each filter.

4. Take the logarithm of all filterbank energies.

5. Take the DCT of the log filterbank energies.

6. Keep DCT coefficients 2-13 and discard the rest.

Figure 2: MFCC Flow chart
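The steps above can be sketched in Python with NumPy. This is an illustrative reimplementation, not the paper's MATLAB code: the triangular filterbank construction and FFT size are standard assumptions, and it keeps coefficients 2-13 as the list states (the paper itself works with 14 filter outputs).

```python
import numpy as np

def mel_from_hz(f):
    return 1125.0 * np.log(1.0 + f / 700.0)

def hz_from_mel(m):
    return 700.0 * (np.exp(m / 1125.0) - 1.0)

def mfcc(frames, fs, n_filters=14, n_keep=12):
    """Follow the listed steps: periodogram -> mel filterbank -> log -> DCT."""
    n_fft = frames.shape[1]
    # Step 2: periodogram estimate of the power spectrum of each frame
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Step 3: triangular mel filterbank, equally spaced on the mel scale
    mel_pts = np.linspace(0, mel_from_hz(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * hz_from_mel(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, spec.shape[1]))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    # Step 4: log of the filterbank energies (small offset avoids log(0))
    log_e = np.log(spec @ fbank.T + 1e-10)
    # Step 5: DCT-II of the log energies; Step 6: keep coefficients 2-13
    n = log_e.shape[1]
    dct = np.cos(np.pi / n * (np.arange(n) + 0.5)[:, None] * np.arange(n)[None, :])
    return (log_e @ dct)[:, 1:1 + n_keep]

frames = np.random.randn(98, 400)        # stand-in for the framed signal
feats = mfcc(frames, 16000)
print(feats.shape)                       # (98, 12)
```

Each 25 ms frame is thus reduced to a short vector of cepstral coefficients describing the spectral envelope.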


Since an audio signal is constantly changing, to simplify things we assume that on short time scales the audio signal doesn't vary much (statistically, i.e. it is statistically stationary; obviously the samples themselves change even on short time scales). This is why we frame the signal into 20-40 ms frames. If the frame is much shorter, we don't have enough samples to get a reliable spectral estimate; if it is longer, the signal changes too much throughout the frame.

What is the Mel scale?

Figure 3: Mel filter bank [3]

The Mel scale relates perceived frequency, or pitch, of a

pure tone to its actual measured frequency. Humans are

much better at discerning small changes in pitch at low

frequencies than they are at high frequencies. Incorporating

this scale makes our features match more closely what

humans hear [4].

The formula for converting from frequency f (in Hz) to the Mel scale is:

M(f) = 1125 ln(1 + f/700)

To go from Mels back to frequency:

M^-1(m) = 700 (e^(m/1125) - 1)
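As a quick numeric check of these conversions, the following Python snippet round-trips 1000 Hz through the mel scale using the 1125 ln(1 + f/700) form and its inverse:

```python
import math

# Round-trip check of the Mel conversions at f = 1000 Hz.
f = 1000.0
m = 1125.0 * math.log(1.0 + f / 700.0)           # Hz -> mel
f_back = 700.0 * (math.exp(m / 1125.0) - 1.0)    # mel -> Hz
print(round(m, 1), round(f_back, 1))             # 998.2 1000.0
```

A pure tone of 1000 Hz sits near 1000 mel, which is how the scale is anchored.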

We considered only the first 14 MFCC coefficients for feature extraction. The first 14 mel filter banks are listed below.

Table 1: Coefficients of MFCC [5]

After applying those fourteen mel filter banks and frequency warping, we obtained the following figures for the voiced part, in order to differentiate features among the different digits and to analyze the reasons behind the differences. Here the x-axis is the time axis, and each of the 14 rows is the output of a different filter.

Figure 4: Heat map of Bangla digits 0 to 9

The heat map shows the mel filter output spectrum versus time. Frames 1-3 are the unvoiced region, frames 4-13 the voiced region, and frames 14-16 again the unvoiced region. Looking closely, there are characteristic strips in filters 3, 5, 9 and 12 for the Bangla digits 4 (char), 5 (pach) and 7 (sat) at the beginning of the voiced part. The features of the digits 6 (choy) and 9 (noy) also show similarity in different voiced strips. Again, there are distinctive differences between 2 (dui) or 3 (tin) and 4 (char) or 5 (pach). We extracted 1104 features (14*16) in total. Then we formed feature vectors