TriAffect: Tri-Modal Conversational Emotion Recognition with Temporal State Tracking

Computational and Data Science Ph.D. student Cayden Schalk presents his research on emotion recognition.

Understanding emotion in conversation requires combining cues from language, vocal expression, and visual behavior while accounting for how emotion evolves across speakers and dialogue turns. This problem matters for multimodal interactive systems such as emotion-aware assistants, social robots, and assistive technologies. We present TriAffect, a tri-modal extension of MemoCMT for conversational emotion recognition that jointly models what is said, how it sounds, and how speakers appear, while tracking each speaker’s emotional state across the conversation. The model combines symmetric pairwise fusion of text, audio, and video with speaker-aware temporal state tracking across dialogue turns. We evaluate TriAffect on the 4-class IEMOCAP benchmark with leave-one-session-out (LOSO) cross-validation and further analyze its components on MELD. On IEMOCAP, the full temporal model achieves 77.5% ± 2.8 balanced accuracy, outperforming the MemoCMT baseline by 5.3 points and improving over its own non-temporal version by 6.3 points. These results highlight the promise of combining tri-modal fusion with speaker-aware temporal modeling for conversational emotion recognition.
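For readers unfamiliar with the evaluation protocol, the following is a minimal sketch of how balanced accuracy under leave-one-session-out cross-validation could be computed. It is not the TriAffect evaluation code. The function names (`balanced_accuracy`, `loso_folds`, `evaluate_loso`) and the `fit_predict` callback are hypothetical. Reading the reported "±" as a standard deviation across folds is an assumption; the abstract does not say how the spread was computed.

```python
from collections import defaultdict
from statistics import mean, pstdev


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return mean(correct[c] / total[c] for c in total)


def loso_folds(session_ids):
    """Yield (held_out, train_idx, test_idx), holding out one session per fold."""
    for held_out in sorted(set(session_ids)):
        train_idx = [i for i, s in enumerate(session_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(session_ids) if s == held_out]
        yield held_out, train_idx, test_idx


def evaluate_loso(session_ids, labels, fit_predict):
    """Score one model per fold and summarize across folds.

    fit_predict(train_idx, test_idx) is a hypothetical callback that trains
    on the training utterances and returns one predicted label per test index.
    Returns (mean, population standard deviation) of the per-fold scores;
    using the population rather than sample deviation is an assumption.
    """
    scores = []
    for _, train_idx, test_idx in loso_folds(session_ids):
        preds = fit_predict(train_idx, test_idx)
        truth = [labels[i] for i in test_idx]
        scores.append(balanced_accuracy(truth, preds))
    return mean(scores), pstdev(scores)
```

Because balanced accuracy averages recall per class, a model that always predicts the majority emotion scores no better than chance on it, which is why it is often preferred over plain accuracy when emotion classes are imbalanced. Holding out a whole session keeps the test speakers out of the training data for that fold.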

Watch the presentation here.