15
June
Master's Thesis: Enhancing Voice Activity Prediction with Visual Features for Conversational Turn-Taking
Willem Berner will present his Master's Thesis "Enhancing Voice Activity Prediction with Visual Features for Conversational Turn-Taking"
Examiner: Alexandros Sopasakis
Supervisors: Kalle Åström, Gabriel Skantze (KTH)
Abstract:
Turn-taking is a fundamental component of spoken interaction, allowing interlocutors to coordinate when to speak and when to remain silent. While humans naturally rely on both verbal and non-verbal signals for this process, many spoken dialogue systems depend primarily on audio-based cues. This thesis investigates whether visual features from face-to-face conversations can improve conversational turn-taking prediction beyond what is achievable from audio cues alone. The work builds on the Voice Activity Projection (VAP) model, a self-supervised, transformer-based audio model for predicting future voice activity. It proposes extensions of this model that incorporate visual information extracted from the Meta Seamless Interaction dataset, which consists of dyadic conversations recorded with both video and audio. The visual features include gaze direction, head movement, body and hand pose, and facial action units. Different architectural approaches are explored, including direct feature concatenation, cross-attention between audio and visual modalities, delta features, and trainable gating mechanisms. In addition to the objective loss, the proposed models are evaluated on interpretable turn-taking tasks, including Hold/Shift prediction, Short/Long utterance classification, and shift prediction. Results indicate that visual information can improve predictive performance compared to the audio-only baseline, with facial action units appearing particularly informative. The findings suggest that visual cues contribute meaningful information for turn-taking prediction and support the development of more natural and responsive conversational systems.
About the event
Time:
2026-06-15 10:15
to
11:15
Location
MH:309A
Contact
karl [dot] astrom [at] math [dot] lth [dot] se