How do you feel?
by Jaybrata Chakraborty, Dept. of MCA/CSE/IT

This question is often difficult to answer. Imagine that, in a lonely world, you are talking with a computer and sharing your thoughts. For that to work, the computer must understand the answer to “how do you feel?”.

Human-machine interaction is now widely used in many applications, and speech is one of its media. One of the main challenges in human-machine interaction is detecting emotion from speech. Emotion can play an important role in decision making: if a system can recognize emotion from speech, it can act accordingly. An efficient speech emotion recognition system can be useful in fields such as medical science, robotics engineering and call center applications.

When two people interact, each can easily recognize the emotion underlying the other's speech. A human listener first analyzes the different characteristics of the speech and then, drawing on previous experience or observation, recognizes the emotion of the speaker. The objective of an emotion recognition system is to mimic this human perception mechanism. Emotion is identified by extracting features, that is, different characteristics, from the speech, and the system must be trained on a large database of speech samples to become accurate. Building an emotion recognition system therefore involves three steps: first, an emotional speech corpus (a collection of speech recordings) is selected or created; next, emotion-specific features are extracted from those recordings; finally, a classification model is used to recognize the emotions.


A suitable choice of corpus plays a very important role in emotion recognition. A context-rich, natural speech database is preferred for a good emotion recognition system. Three main types of corpora are used for developing speech emotion recognition systems:


Elicited emotional speech database: This type of data is collected from speakers by creating an artificial emotional situation. Its advantage is that it is very close to a natural database, but it has some problems too: not all emotions may be available, and if speakers are aware that they are being recorded, the emotions they express may be artificial.


Actor-based speech database: This type of speech data is collected from professional, trained artists. Such data is very easy to collect, and a wide variety of emotions is available in the corpus. The main problems with this type of database are that it is episodic in nature and highly artificial.


Natural speech database: This type of database is created from real-world data. It is completely natural and very useful for recognizing real-world emotion, though not all emotions may be present and the recordings contain background noise.


The features of a speech signal can be used to distinguish between emotional statements. Different features represent characteristics of the human vocal tract and hearing system. To build an emotion recognition system, it is essential to extract various acoustic-prosodic features from the speech signal. These features include pitch, amplitude, formants and spectral features.
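As a rough illustration of what extracting such features involves, the following Python sketch computes two of them from a synthetic tone: amplitude (as root-mean-square energy) and pitch (via a naive autocorrelation search). The function names, the 50 to 500 Hz search range and the test signal are assumptions made for this example. Practical systems work on short frames of recorded speech and use more robust estimators.

```python
import math

def rms_amplitude(samples):
    """Root-mean-square amplitude of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def estimate_pitch(samples, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency in Hz by naive autocorrelation.

    fmin and fmax are an assumed search range roughly covering speech pitch.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Similarity between the signal and a copy of itself shifted by `lag`.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic stand-in for a voiced speech frame: a 200 Hz tone, 0.1 s at 8 kHz.
SAMPLE_RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(800)]

# A two-element feature vector: [pitch in Hz, RMS amplitude].
feature_vector = [estimate_pitch(tone, SAMPLE_RATE), rms_amplitude(tone)]
print(feature_vector)
```

A list such as `feature_vector` is the kind of input a classifier would receive for each speech sample, usually with many more entries (formants, spectral features and so on).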


A classification system assigns each speech sample to the appropriate emotion class according to the features extracted from it. Several classifiers are available for emotion recognition. There is no rule of thumb for choosing a classifier; in most cases the choice is based on past references. The features extracted from each speech sample (the feature vector) are supplied as input to the classifier, which forms a linear combination of them using a real-valued weight vector W. This weight vector is then adjusted with a suitable training method. An activation function then generates the classifier's output, mapping each input to one of a predefined set of emotion classes. The activation function may be linear or nonlinear, and classifiers are accordingly divided into two categories: linear and nonlinear. A linear classifier classifies accurately if the feature vectors are linearly separable. In real-life scenarios most feature vectors are not linearly separable, so a nonlinear classifier is usually a better choice. Various nonlinear classifiers are available for emotion recognition, namely SVM (support vector machine), GMM (Gaussian mixture model), MLP (multilayer perceptron), RNN (recurrent neural network), KNN (k-nearest neighbours) and HMM (hidden Markov model).
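The difference between a linear and a nonlinear classifier can be seen on toy data. The Python sketch below is a minimal illustration, not a production recognizer: it trains a single-layer perceptron (a weight vector plus a bias, followed by a step activation) and compares it with a nearest-neighbour classifier. All feature values, labels and function names are made up for the example.

```python
def predict_linear(weights, bias, features):
    """Weighted sum of the features followed by a step activation."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else 0

def train_perceptron(samples, labels, epochs=20, rate=0.1):
    """Adjust the weight vector with the classic perceptron update rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = label - predict_linear(weights, bias, features)
            weights = [w + rate * error * x for w, x in zip(weights, features)]
            bias += rate * error
    return weights, bias

def predict_knn(samples, labels, features, k=1):
    """Majority vote among the k training samples closest to `features`."""
    ranked = sorted(
        zip(samples, labels),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], features)),
    )
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)

def accuracy(predict, samples, labels):
    """Fraction of samples for which `predict` returns the right label."""
    return sum(predict(x) == y for x, y in zip(samples, labels)) / len(samples)

# Hypothetical, normalised [pitch, energy] vectors: 0 = calm, 1 = excited.
separable_x = [[0.2, 0.1], [0.3, 0.2], [0.1, 0.3],
               [0.8, 0.9], [0.7, 0.8], [0.9, 0.7]]
separable_y = [0, 0, 0, 1, 1, 1]

# XOR-style layout: no straight line separates the two classes.
xor_x = [[0, 0], [1, 1], [0, 1], [1, 0]]
xor_y = [0, 0, 1, 1]

w, b = train_perceptron(separable_x, separable_y)
linear_separable_acc = accuracy(lambda x: predict_linear(w, b, x),
                                separable_x, separable_y)

w2, b2 = train_perceptron(xor_x, xor_y)
linear_xor_acc = accuracy(lambda x: predict_linear(w2, b2, x), xor_x, xor_y)
knn_xor_acc = accuracy(lambda x: predict_knn(xor_x, xor_y, x), xor_x, xor_y)

print(linear_separable_acc, linear_xor_acc, knn_xor_acc)
```

On the first data set the perceptron finds a separating line. On the XOR-style set no choice of weights can classify all four points correctly, whereas the nearest-neighbour rule, whose decision boundary is not a straight line, can. Scoring the nearest-neighbour classifier on its own training points is done here only to show the shape of the boundary; a real evaluation would use held-out speech samples.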

