Bilingual Spectral Emotion Learning Through Patch-Encoded VGG-16 Features and a Full Vision Transformer Pipeline
DOI:
https://doi.org/10.56979/1001/2025/1146

Keywords:
Speech Emotion Recognition, Spectrogram Analysis, VGG-16 Feature Extraction, Vision Transformer Modeling, Bilingual English–Gujarati Dataset

Abstract
This research introduces a bilingual speech emotion recognition system that integrates patch-encoded VGG-16 spectral features with a full Vision Transformer pipeline to learn affective cues in English and Gujarati speech. Mel-spectrograms are first passed through a frozen VGG-16 backbone to extract high-level spatial spectral features; these features are then split into regular patches and projected into an embedding space for a transformer-based global-attention model. The system is evaluated on four benchmark English emotional speech datasets, RAVDESS, CREMA-D, SAVEE, and TESS, achieving 99% accuracy on each. To assess robustness on uncontrolled data, a hand-collected bilingual corpus of student recordings was created, on which the model classified English and Gujarati speech with 90% and 88% accuracy, respectively. These findings show that combining convolutional spectral extraction with contextual learning by transformers is an effective way to model cross-lingual emotional variation, outperforming convolution-only and transformer-only baselines. The bilingual results further indicate that the model maintains stable performance across languages with different phonetics and prosody, making it applicable to scalable and inclusive emotion-aware speech technologies such as interactive assistants, call-center analytics, and affect-sensitive human-machine interfaces.
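The pipeline described above lends itself to a compact implementation. Since the authors' code is not included here, the following PyTorch sketch only illustrates the described architecture: a frozen VGG-16 backbone extracts spectral feature maps from mel-spectrogram images, each spatial position of the feature map is treated as a patch token, and a transformer encoder with a classification token applies global attention. All hyperparameters below (embedding dimension, encoder depth, head count, the seven-class output, and the fixed 224x224 input size) are illustrative assumptions, not the paper's settings.

    import torch
    import torch.nn as nn
    from torchvision.models import vgg16, VGG16_Weights

    class VGG16PatchViT(nn.Module):
        """Sketch of the described pipeline: frozen VGG-16 spectral features,
        split into patch tokens, embedded, and fed to a transformer encoder.
        Hyperparameters are illustrative assumptions, not the paper's values."""

        def __init__(self, num_classes=7, embed_dim=256, depth=4, heads=8):
            super().__init__()
            # Frozen VGG-16 convolutional backbone (ImageNet weights).
            self.backbone = vgg16(weights=VGG16_Weights.DEFAULT).features
            for p in self.backbone.parameters():
                p.requires_grad = False
            # VGG-16 'features' yields 512 channels; each spatial position of
            # the 7x7 feature map (for 224x224 input) becomes one patch token.
            self.proj = nn.Linear(512, embed_dim)
            self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
            # Learnable positional embeddings: 49 patch tokens + 1 CLS token.
            self.pos_embed = nn.Parameter(torch.zeros(1, 50, embed_dim))
            layer = nn.TransformerEncoderLayer(
                d_model=embed_dim, nhead=heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
            self.head = nn.Linear(embed_dim, num_classes)

        def forward(self, spec):
            # spec: (B, 3, 224, 224) mel-spectrogram rendered as an RGB image.
            feats = self.backbone(spec)                # (B, 512, 7, 7)
            tokens = feats.flatten(2).transpose(1, 2)  # (B, 49, 512)
            tokens = self.proj(tokens)                 # (B, 49, D)
            cls = self.cls_token.expand(tokens.size(0), -1, -1)
            x = torch.cat([cls, tokens], dim=1) + self.pos_embed
            x = self.encoder(x)
            return self.head(x[:, 0])                  # classify via CLS token

    model = VGG16PatchViT()
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 7])

Treating each 1x1 feature-map position as a patch is one plausible reading of "patch-encoded VGG-16 features"; the paper may instead tile larger patches before or after the backbone, which would change only the tokenization step of this sketch.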
License
This is an open-access article published by the Research Center of Computing & Biomedical Informatics (RCBI), Lahore, Pakistan, under the CC BY 4.0 International License.



