Detecting Emotions with Pictures, Sound, and Words: A Multimodal Journey

Neelendra Shukla, Aditi Sharma

Abstract

This study presents a multimodal emotion recognition system that integrates text, audio, and facial images to identify emotions, overcoming the limitations of unimodal methods. The dataset comprises 18,000 balanced samples across six emotions (anger, disgust, fear, happiness, sadness, and neutrality), sourced from AffectNet, augmented speech corpora, and GoEmotions, and preprocessed for normalization and alignment. The model uses BERT for text, ResNet-18 for images, and MFCC features with a CNN/LSTM for audio, achieving 87.4% accuracy on a 3,000-sample test set and surpassing the 70-80% accuracy typical of unimodal approaches. Training ran for 16 epochs with the Adam optimizer and early stopping, and performance was evaluated via accuracy, precision, recall, and F1-score. Future work will focus on improving inference speed, expanding the dataset, and optimizing the model for mobile applications, with potential use in real-time mental health monitoring and AI interfaces.
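
To make the described architecture concrete, the sketch below shows one plausible way to combine the three encoders named in the abstract (BERT for text, ResNet-18 for images, MFCCs with a CNN/LSTM for audio) into a six-class classifier. The encoder choices follow the abstract, but the fusion strategy (late concatenation), hidden sizes, dropout, and class count wiring are assumptions for illustration, not the authors' exact design.

```python
# Hypothetical sketch of the three-branch model described in the abstract.
# Assumed details: concatenation-based late fusion, 40 MFCC coefficients,
# and the layer sizes below; the paper does not specify these.
import torch
import torch.nn as nn
from torchvision.models import resnet18
from transformers import BertModel


class MultimodalEmotionClassifier(nn.Module):
    def __init__(self, num_classes: int = 6, n_mfcc: int = 40):
        super().__init__()
        # Text branch: pretrained BERT, pooled [CLS] representation (768-dim).
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        # Image branch: ResNet-18 with its classification head removed (512-dim).
        self.image_encoder = resnet18(weights=None)
        self.image_encoder.fc = nn.Identity()
        # Audio branch: 1-D CNN over MFCC frames followed by an LSTM (128-dim).
        self.audio_cnn = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.audio_lstm = nn.LSTM(64, 128, batch_first=True)
        # Late fusion: concatenate the three embeddings and classify.
        self.classifier = nn.Sequential(
            nn.Linear(768 + 512 + 128, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, input_ids, attention_mask, image, mfcc):
        # image: (batch, 3, 224, 224); mfcc: (batch, n_mfcc, time)
        text_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).pooler_output
        image_feat = self.image_encoder(image)
        audio_seq = self.audio_cnn(mfcc).transpose(1, 2)  # (batch, time/2, 64)
        _, (audio_hidden, _) = self.audio_lstm(audio_seq)
        fused = torch.cat([text_feat, image_feat, audio_hidden[-1]], dim=-1)
        return self.classifier(fused)
```

Such a model would be trained as the abstract describes, with the Adam optimizer, early stopping, and standard classification metrics (accuracy, precision, recall, F1-score).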
