3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms

dc.contributor.author: Hajarolasvadi, Noushin
dc.contributor.author: Demirel, Hasan
dc.date.accessioned: 2026-02-06T18:24:02Z
dc.date.issued: 2019
dc.department: Doğu Akdeniz Üniversitesi
dc.description.abstract: Detecting human intentions and emotions helps improve human-robot interaction. Emotion recognition has been a challenging research direction over the past decade. This paper proposes an emotion recognition system based on the analysis of speech signals. First, we split each speech signal into overlapping frames of equal length. Next, we extract an 88-dimensional vector of audio features, including Mel-Frequency Cepstral Coefficients (MFCC), pitch, and intensity, for each frame. In parallel, the spectrogram of each frame is generated. In the final preprocessing step, we apply k-means clustering to the extracted features of all frames of each audio signal and select the k most discriminant frames, called keyframes, to summarize the speech signal. The sequence of spectrograms corresponding to the keyframes is then encapsulated in a 3D tensor. These tensors are used to train and test a 3D convolutional neural network (CNN) with 10-fold cross-validation. The proposed 3D CNN has two convolutional layers and one fully connected layer. Experiments are conducted on the Surrey Audio-Visual Expressed Emotion (SAVEE), Ryerson Multimedia Laboratory (RML), and eNTERFACE'05 databases. The results are superior to state-of-the-art methods reported in the literature.
dc.description.sponsorship: BAP-C project of Eastern Mediterranean University [BAP-C-02-18-0001]
dc.description.sponsorship: This research was funded by the BAP-C project of Eastern Mediterranean University under grant number BAP-C-02-18-0001.
dc.identifier.doi: 10.3390/e21050479
dc.identifier.issn: 1099-4300
dc.identifier.issue: 5
dc.identifier.orcid: 0000-0002-3120-5370
dc.identifier.orcid: 0009-0008-5201-5817
dc.identifier.pmid: 33267193
dc.identifier.scopus: 2-s2.0-85066604566
dc.identifier.scopusquality: Q1
dc.identifier.uri: https://doi.org/10.3390/e21050479
dc.identifier.uri: https://hdl.handle.net/11129/10021
dc.identifier.volume: 21
dc.identifier.wos: WOS:000472675900043
dc.identifier.wosquality: Q2
dc.indekslendigikaynak: Web of Science
dc.indekslendigikaynak: PubMed
dc.indekslendigikaynak: Scopus
dc.language.iso: en
dc.publisher: MDPI
dc.relation.ispartof: Entropy
dc.relation.publicationcategory: Article - International Peer-Reviewed Journal - Institutional Faculty Member
dc.rights: info:eu-repo/semantics/openAccess
dc.snmz: KA_WoS_20260204
dc.subject: speech emotion recognition
dc.subject: 3D convolutional neural networks
dc.subject: deep learning
dc.subject: k-means clustering
dc.subject: spectrograms
dc.title: 3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms
dc.type: Article
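The preprocessing pipeline described in the abstract (overlapping framing, per-frame features, k-means keyframe selection, and stacking keyframe spectrograms into a 3D tensor) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy mean/std/energy features stand in for the paper's 88-dimensional MFCC/pitch/intensity vector, and all function names, frame sizes, and FFT parameters here are hypothetical choices.

```python
import numpy as np

def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames of equal length."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])

def toy_features(frames):
    """Placeholder per-frame features; the paper extracts an 88-D vector
    of MFCC, pitch, and intensity instead."""
    return np.stack([
        frames.mean(axis=1),           # crude level estimate
        frames.std(axis=1),            # crude spread estimate
        (frames ** 2).sum(axis=1),     # frame energy
    ], axis=1)

def kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's k-means returning centroids and labels."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):    # skip empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

def select_keyframes(features, k):
    """For each cluster, keep the frame nearest its centroid (a keyframe)."""
    centroids, _ = kmeans(features, k)
    d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return sorted(set(d.argmin(axis=0)))   # de-duplicate, keep time order

def spectrogram(frame, nfft=64):
    """Log-magnitude spectrogram of one frame via short Hann windows."""
    wins = frame_signal(frame, nfft, nfft // 2)
    mag = np.abs(np.fft.rfft(wins * np.hanning(nfft), axis=1))
    return np.log(mag + 1e-8)

# Demo on a synthetic signal (1 s at a hypothetical 16 kHz rate).
rng = np.random.default_rng(1)
signal = rng.standard_normal(16000)
frames = frame_signal(signal, frame_len=1024, hop=512)
feats = toy_features(frames)
key_idx = select_keyframes(feats, k=9)
# 3D tensor of keyframe spectrograms: (n_keyframes, windows, freq_bins).
tensor = np.stack([spectrogram(frames[i]) for i in key_idx])
```

The resulting tensor plays the role of the paper's 3D CNN input; with these toy parameters each keyframe yields a 31x33 log spectrogram, and at most k = 9 keyframes survive de-duplication.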
