Deep emotion recognition based on audio-visual correlation

dc.contributor.author: Hajarolasvadi, Noushin
dc.contributor.author: Demirel, Hasan
dc.date.accessioned: 2026-02-06T18:43:43Z
dc.date.issued: 2020
dc.department: Doğu Akdeniz Üniversitesi
dc.description.abstract: Human emotion recognition has been studied mainly through unimodal channels over the last decade, yet questions remain about how different modalities can complement each other. This study proposes a multimodal approach using three-dimensional (3D) convolutional neural networks (CNNs) to model human emotion through a modality-referenced system while investigating such questions. The proposed modality-referenced system selects the input data based on one modality regarded as the reference, or master; the other modality, referred to as the slave, adjusts or attunes itself to the master in the temporal domain. In this context, the authors developed three multimodal emotion recognition systems, namely a video-referenced system, an audio-referenced system, and an audio-visual-referenced system, to explore the congruence impact of audio and video modalities on each other. Two 3D CNN pipelines are employed: k-means clustering is used in the master pipeline, and the slave pipeline adapts itself temporally. The outputs of the two pipelines are fused to improve recognition performance. In addition, canonical correlation analysis and t-distributed stochastic neighbour embedding are used to validate the experiments. Results show that temporal alignment of the data between the two modalities significantly improves recognition performance.
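A minimal sketch of the master/slave selection and fusion idea described in the abstract, for illustration only. It assumes k-means keyframe selection over per-frame features, nearest-onset alignment of audio segments to the selected frames, and weighted late fusion of class posteriors; the function names, k value, feature representation, and fusion rule are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of the video-referenced (master/slave) scheme.
# Assumptions: per-frame feature vectors, k=9 keyframes, weighted-sum fusion.
import numpy as np
from sklearn.cluster import KMeans

def select_master_keyframes(frame_feats: np.ndarray, k: int = 9):
    """Cluster per-frame features with k-means and keep, from each cluster,
    the frame closest to its centroid (the master-modality selection)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(frame_feats)
    keyframes = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        d = np.linalg.norm(frame_feats[idx] - km.cluster_centers_[c], axis=1)
        keyframes.append(int(idx[np.argmin(d)]))
    return sorted(keyframes)

def align_slave_audio(frame_times: np.ndarray, keyframes, audio_segments):
    """The slave modality attunes to the master in time: pick the audio
    segment whose onset is nearest to each selected frame's timestamp."""
    onsets = np.array([start for start, _ in audio_segments])
    return [audio_segments[int(np.argmin(np.abs(onsets - frame_times[f])))]
            for f in keyframes]

def fuse_scores(video_probs: np.ndarray, audio_probs: np.ndarray, w: float = 0.5):
    """Late fusion of the two 3D-CNN pipelines' class posteriors
    (a weighted sum is an assumption; the paper's fusion may differ)."""
    return w * video_probs + (1.0 - w) * audio_probs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(120, 64))      # 120 frames, 64-dim features
    times = np.linspace(0.0, 4.0, 120)      # frame timestamps in seconds
    segs = [(t, None) for t in np.arange(0.0, 4.0, 0.5)]  # (onset, data)
    kf = select_master_keyframes(feats, k=9)
    print(kf, [s for s, _ in align_slave_audio(times, kf, segs)])
```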
dc.description.sponsorship: BAP-C project of Eastern Mediterranean University [BAP-C-02-18-0001]
dc.description.sponsorship: This research was funded by the BAP-C project of Eastern Mediterranean University under grant no. BAP-C-02-18-0001.
dc.identifier.doi: 10.1049/iet-cvi.2020.0013
dc.identifier.endpage: 527
dc.identifier.issn: 1751-9632
dc.identifier.issn: 1751-9640
dc.identifier.issue: 7
dc.identifier.orcid: 0009-0008-5201-5817
dc.identifier.orcid: 0000-0002-3120-5370
dc.identifier.scopus: 2-s2.0-85096127993
dc.identifier.scopusquality: Q2
dc.identifier.startpage: 517
dc.identifier.uri: https://doi.org/10.1049/iet-cvi.2020.0013
dc.identifier.uri: https://hdl.handle.net/11129/13743
dc.identifier.volume: 14
dc.identifier.wos: WOS:000598689800012
dc.identifier.wosquality: Q4
dc.indekslendigikaynak: Web of Science
dc.indekslendigikaynak: Scopus
dc.language.iso: en
dc.publisher: Wiley
dc.relation.ispartof: IET Computer Vision
dc.relation.publicationcategory: Article - International Peer-Reviewed Journal - Institutional Faculty Member
dc.rights: info:eu-repo/semantics/closedAccess
dc.snmz: KA_WoS_20260204
dc.subject: Facial Expression
dc.subject: Face
dc.subject: Voice
dc.title: Deep emotion recognition based on audio-visual correlation
dc.type: Article