3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms

Hajarolasvadi, Noushin; Demirel, Hasan

doi:10.3390/e21050479

Date

2019

Authors

Hajarolasvadi, Noushin

Demirel, Hasan

Publisher

MDPI

Access Rights

info:eu-repo/semantics/openAccess

Abstract

Detecting human intentions and emotions helps improve human-robot interactions. Emotion recognition has been a challenging research direction in the past decade. This paper proposes an emotion recognition system based on the analysis of speech signals. First, we split each speech signal into overlapping frames of the same length. Next, we extract an 88-dimensional vector of audio features, including Mel-frequency cepstral coefficients (MFCCs), pitch, and intensity, for each frame. In parallel, the spectrogram of each frame is generated. In the final preprocessing step, we apply k-means clustering to the extracted features of all frames of each audio signal and select the k most discriminant frames, namely keyframes, to summarize the speech signal. The sequence of spectrograms corresponding to the keyframes is then encapsulated in a 3D tensor. These tensors are used to train and test a 3D convolutional neural network (CNN) using a 10-fold cross-validation approach. The proposed 3D CNN has two convolutional layers and one fully connected layer. Experiments are conducted on the Surrey Audio-Visual Expressed Emotion (SAVEE), Ryerson Multimedia Laboratory (RML), and eNTERFACE'05 databases. The reported results are superior to those of the state-of-the-art methods in the literature.
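
The preprocessing steps described in the abstract (framing, per-frame features, k-means keyframe selection, and stacking spectrograms into a 3D tensor) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. Several choices are assumptions made only for illustration: the function names, the frame and FFT sizes, a time-averaged log spectrum standing in for the 88-dimensional MFCC/pitch/intensity vector, and the rule that the keyframe of each cluster is the member frame nearest its centroid.

```python
import numpy as np

def split_frames(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames of equal length."""
    n = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])

def frame_spectrogram(frame, n_fft=64, hop=32):
    """Magnitude spectrogram of one frame (Hann-windowed short-time FFT)."""
    window = np.hanning(n_fft)
    cols = [np.abs(np.fft.rfft(window * frame[s : s + n_fft]))
            for s in range(0, len(frame) - n_fft + 1, hop)]
    return np.stack(cols, axis=1)  # shape: (n_fft // 2 + 1, time_steps)

def kmeans_keyframes(features, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means over frame features.

    Assumption: the keyframe of a cluster is its member frame closest to
    the centroid. Empty clusters contribute no keyframe, so fewer than k
    indices may be returned. Indices are returned in time order.
    """
    rng = np.random.default_rng(seed)
    start = rng.choice(len(features), size=k, replace=False)
    centroids = features[start].astype(float)
    for _ in range(n_iter):
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    keys = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if members.size:
            keys.append(members[dists[members, j].argmin()])
    return np.sort(np.array(keys))

def keyframe_tensor(signal, k, frame_len=512, hop=256):
    """Summarize a signal as a stack of keyframe spectrograms."""
    frames = split_frames(signal, frame_len, hop)
    specs = np.stack([frame_spectrogram(f) for f in frames])
    # Stand-in features (assumption): time-averaged log spectrum per frame.
    features = np.log1p(specs).mean(axis=2)
    key_idx = kmeans_keyframes(features, k)
    return specs[key_idx], key_idx  # tensor shape: (keyframes, freq, time)
```

With these illustrative defaults, each keyframe spectrogram has 33 frequency bins and 15 time steps, so the returned tensor has shape (number of keyframes, 33, 15). A leading channel axis would typically be added before passing such a tensor to a 3D CNN.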

Keywords

speech emotion recognition, 3D convolutional neural networks, deep learning, k-means clustering, spectrograms

Journal or Series

Entropy

WoS Q Value

Q2

Scopus Q Value

Q1

Volume

21

Issue

5

URI

https://doi.org/10.3390/e21050479
https://hdl.handle.net/11129/10021

Collections

WoS Indexed Publications Collection
PubMed Indexed Publications Collection
Scopus Indexed Publications Collection
