Detecting Emotions with Pictures, Sound, and Words: A Multimodal Journey

Neelendra Shukla, Aditi Sharma

Abstract

This study presents a multimodal emotion recognition system that integrates text, audio, and facial images to identify emotions, overcoming the limitations of unimodal methods. The dataset comprises 18,000 balanced samples across six emotions (anger, disgust, fear, happiness, sadness, and neutrality), sourced from AffectNet, augmented speech corpora, and GoEmotions, and preprocessed for normalization and alignment. The model uses BERT for text, ResNet-18 for images, and MFCC features with a CNN/LSTM for audio, achieving 87.4% accuracy on a 3,000-sample test set and surpassing the 70-80% accuracy typical of unimodal approaches. Training ran for 16 epochs with the Adam optimizer and early stopping, and performance was evaluated via accuracy, precision, recall, and F1-score. Future work will focus on improving inference speed, expanding the dataset, and optimizing the model for mobile applications, with potential use in real-time mental health monitoring and AI interfaces.
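
To make the described architecture concrete, the sketch below shows one plausible way to combine the three encoders named in the abstract (BERT for text, ResNet-18 for images, MFCCs with a CNN/LSTM for audio) into a six-class classifier. The encoder choices follow the abstract, but the fusion strategy (late concatenation), hidden sizes, dropout, and class count wiring are assumptions for illustration, not the authors' exact design.

```python
# Hypothetical sketch of the three-branch model described in the abstract.
# Assumed details: concatenation-based late fusion, 40 MFCC coefficients,
# and the layer sizes below; the paper does not specify these.
import torch
import torch.nn as nn
from torchvision.models import resnet18
from transformers import BertModel


class MultimodalEmotionClassifier(nn.Module):
    def __init__(self, num_classes: int = 6, n_mfcc: int = 40):
        super().__init__()
        # Text branch: pretrained BERT, pooled [CLS] representation (768-dim).
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        # Image branch: ResNet-18 with its classification head removed (512-dim).
        self.image_encoder = resnet18(weights=None)
        self.image_encoder.fc = nn.Identity()
        # Audio branch: 1-D CNN over MFCC frames followed by an LSTM (128-dim).
        self.audio_cnn = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.audio_lstm = nn.LSTM(64, 128, batch_first=True)
        # Late fusion: concatenate the three embeddings and classify.
        self.classifier = nn.Sequential(
            nn.Linear(768 + 512 + 128, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, input_ids, attention_mask, image, mfcc):
        # image: (batch, 3, 224, 224); mfcc: (batch, n_mfcc, time)
        text_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).pooler_output
        image_feat = self.image_encoder(image)
        audio_seq = self.audio_cnn(mfcc).transpose(1, 2)  # (batch, time/2, 64)
        _, (audio_hidden, _) = self.audio_lstm(audio_seq)
        fused = torch.cat([text_feat, image_feat, audio_hidden[-1]], dim=-1)
        return self.classifier(fused)
```

Such a model would be trained as the abstract describes, with the Adam optimizer, early stopping, and standard classification metrics (accuracy, precision, recall, F1-score).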
