Utilizing Artificial Intelligence to Generate Emotive Voicing for Audiobook


Dhanush V, Chinmay Aland, Nandish A, Prasanna Chitgopekar, Ashwini M Joshi

Abstract

Audiobooks have grown rapidly in popularity, with over $5.38 billion in revenue in 2022 alone. However, most audiobook narrations still lack the emotional expressiveness of human storytelling. This research examines the potential of artificial intelligence (AI) to generate emotive voicing for audiobooks. We evaluated neural network models for inferring emotion solely from textual passages. Both BERT and GPT-2 were trained to categorize the emotions of excerpts, and they achieved comparable accuracy. To further assess their emotion detection capabilities, we analyzed the valence, arousal, and dominance scores predicted by each model. On this more granular metric, BERT demonstrated superior performance in capturing nuanced text emotions. For classifying emotions in audio, we leveraged a pretrained wav2vec2 model. However, when we evaluated this model on existing audiobook recordings, it tended to assign most clips to only two dominant emotions. We therefore opted to use GPT-2's text-based emotion predictions to label our training data, because the BERT model, although more sensitive to changes in the data, is less precise than the GPT-2 model. We then used SpeechT5 text-to-speech models tailored to five target emotions, training an individual model for each emotion on matched text-audio pairs.
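The labeling step described above, in which text-based emotion predictions assign each text-audio pair to one per-emotion training set, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the five emotion names, the function `group_pairs_by_emotion`, and the keyword-based `toy_predictor` are all hypothetical, and in practice the predictor would be a fine-tuned text emotion classifier.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical label set: the abstract mentions five target emotions
# but does not name them, so these names are placeholders.
TARGET_EMOTIONS = ("neutral", "happy", "sad", "angry", "fearful")

Pair = Tuple[str, str]  # (text passage, path to the matching audio clip)


def group_pairs_by_emotion(
    pairs: Iterable[Pair],
    predict_emotion: Callable[[str], str],
) -> Dict[str, List[Pair]]:
    """Label each pair from its text alone and bucket it by predicted emotion."""
    buckets: Dict[str, List[Pair]] = defaultdict(list)
    for text, audio_path in pairs:
        label = predict_emotion(text)
        if label in TARGET_EMOTIONS:  # ignore labels outside the target set
            buckets[label].append((text, audio_path))
    return dict(buckets)


def toy_predictor(text: str) -> str:
    """Stand-in for a text emotion model; keyword rules for illustration only."""
    lowered = text.lower()
    if "laugh" in lowered:
        return "happy"
    if "wept" in lowered:
        return "sad"
    return "neutral"
```

Each resulting bucket would then serve as the training set for one emotion-specific text-to-speech model.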
