Abstract
Turn-taking is a fundamental aspect of human communication and can be described as the ability to take turns, project upcoming turn shifts, and supply backchannels at appropriate locations throughout a conversation. In this work we investigate the role of prosody in turn-taking using the recently proposed Voice Activity Projection model, which incrementally models the upcoming speech activity of the interlocutors in a self-supervised manner, relying neither on explicit annotation of turn-taking events nor on explicit modeling of prosodic features. By manipulating the speech signal, we investigate how these models implicitly use prosodic information. We show that these systems learn to use various prosodic aspects of speech, both on aggregate quantitative metrics over long-form conversations and on single utterances specifically designed to depend on prosody.

work
Visualizations over short/long phrases including various perturbations.
So you work on the side in a supermarket in addition to your studies?
[Audio and figure grid; the media were not recovered. Columns: Audio, Long, Short. Rows: Original, Flat F0, Flat Intensity, Low Pass. The grid appears three times for female speech and three times for male speech.]
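The perturbation labels above (Flat Intensity, Low Pass) name signal manipulations whose exact implementation is not given on this page. The following is a minimal sketch of what two of them could look like, not the authors' pipeline: `low_pass` is a simple one-pole filter, `flatten_intensity` rescales each frame to a common RMS level, and all function names, the frame length, the target level, and any cutoff value are illustrative assumptions. Flattening F0 needs pitch-aware resynthesis and is left out.

```python
import math

def low_pass(signal, sample_rate, cutoff_hz):
    """One-pole (RC-style) low-pass filter.

    Attenuates content above cutoff_hz. This is a simple stand-in for
    whatever filter the original study used (assumption).
    """
    dt = 1.0 / sample_rate
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)  # smoothing factor in (0, 1)
    out = []
    prev = 0.0
    for x in signal:
        prev = prev + alpha * (x - prev)
        out.append(prev)
    return out

def frame_rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def flatten_intensity(signal, frame_len, target_rms=0.1, floor=1e-4):
    """Rescale every non-silent frame to the same RMS level.

    Frames whose RMS is at or below `floor` are left unchanged so that
    silence is not amplified. All parameter values are hypothetical.
    """
    out = []
    for start in range(0, len(signal), frame_len):
        frame = signal[start:start + frame_len]
        rms = frame_rms(frame)
        gain = target_rms / rms if rms > floor else 1.0
        out.extend(x * gain for x in frame)
    return out
```

A frame-wise gain like this introduces discontinuities at frame boundaries; a practical version would smooth the gain curve over time before applying it.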