Voice Activity Projection (VAP)

A general, incremental, and predictive model of conversational dynamics. The model processes the voice activity and audio waveform of spoken dialog and outputs a probability distribution over projection windows (discrete states) representing speaker activity over the next 2 seconds. Its main purpose is to serve as a turn-taking model for spoken dialog systems.

Papers

  • VAP Model
  • VAP Stereo Model
  • SL Question (SIGDial22)
  • Filler (ICPhS23)
  • TTS (Interspeech23)
  • Turn End
  • BC Prediction

Model

Voice Activity Projection (VAP)

"VAP is a general, incremental, and predictive model of conversational dynamics. It's akin to a 'Language Model' that outputs a probability distribution over projection-windows in a dialog."

Model Description

The VAP model consists of an encoder that processes raw audio waveforms together with the prior voice activity (VA) to produce frame-level latent representations. These are fed to a predictor network: a causal sequence model that processes the context up to the current frame and outputs a probability distribution over 256 projection-window states.

Main Strengths

  • General
    • No annotations
    • No feature extraction
    • No speaker normalization
  • Incremental
    • Train, evaluate, and utilize across entire dialogs
  • Predictive
    • Predict upcoming activity
    • Allocate extra processing time for speech
model image

Demo

Demo Image

Projection Window

"The calibration of conversational timing and the capacity for projecting when others are going to stop speaking are crucial elements of the conversational machine."

N. J. Enfield, How We Talk, Chapter 4: "The One Second Window".

We want to model future voice activity in order to make turn-taking decisions over dyadic spoken dialog. The representation of the future should be short enough to introduce minimal noise, yet long enough to predict turn-shifts in advance and to distinguish genuine shifts from shorter segments of activity (backchannels).

The projection window covers the voice activity over the next 2 seconds of dialog. For each speaker, we divide the window into 4 regions of increasing duration (200 ms, 400 ms, 600 ms, 800 ms), creating 8 bins in total. The ratio of activity is calculated over each bin, and a bin is considered active when the ratio exceeds 50%, producing a discrete one-hot representation of size (2, 4). This matrix is mapped to an index in a codebook/vocabulary by treating it as an 8-bit binary number, yielding 256 states.

projection window
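The binning scheme above can be sketched as follows. This is a minimal illustration; the function name and the 50 Hz frame rate are assumptions for the sketch, not the repository's API:

```python
import numpy as np

def encode_projection_window(va, frame_hz=50):
    """Encode a 2 s voice-activity window for 2 speakers into a discrete state.

    `va` is a binary array of shape (2, 2 * frame_hz): voice activity per
    speaker per frame. Returns an index in [0, 255].
    """
    # Bin durations 200 ms, 400 ms, 600 ms, 800 ms -> frame counts at frame_hz
    bin_frames = [int(d * frame_hz) for d in (0.2, 0.4, 0.6, 0.8)]
    onehot = np.zeros((2, 4), dtype=int)
    for spk in range(2):
        start = 0
        for b, n in enumerate(bin_frames):
            ratio = va[spk, start:start + n].mean()
            onehot[spk, b] = int(ratio > 0.5)  # active if >50% of the bin is speech
            start += n
    # Treat the flattened (2, 4) one-hot matrix as an 8-bit binary number
    bits = onehot.flatten()
    return int("".join(map(str, bits)), 2)
```

For example, a window where both speakers are fully active maps to state 255 (all eight bits set), while total silence maps to state 0.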

Events

Shift vs Hold

Shift vs Hold Visualization

Predict the next speaker during mutual silences.

We distinguish two regions, pre-offset and post-offset, and require that only one speaker is active in each to avoid ambiguous overlap moments. The last and next speaker then determine whether the event is a Shift or a Hold.
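A sketch of this event labeling, assuming frame-level binary voice activity; the 0.5 s region length and 50 Hz frame rate here are illustrative choices, not values fixed by the text:

```python
import numpy as np

def label_silence(va, gap_start, gap_end, region=0.5, frame_hz=50):
    """Label a mutual silence as 'shift', 'hold', or None (ambiguous).

    `va` has shape (2, n_frames) of binary voice activity. We inspect a
    pre-offset region before the gap and a post-onset region after it, and
    require exactly one active speaker in each.
    """
    n = int(region * frame_hz)
    pre = va[:, max(0, gap_start - n):gap_start]   # who spoke last
    post = va[:, gap_end:gap_end + n]              # who speaks next
    pre_active = [spk for spk in range(2) if pre[spk].any()]
    post_active = [spk for spk in range(2) if post[spk].any()]
    if len(pre_active) != 1 or len(post_active) != 1:
        return None  # overlap or no clear speaker -> ambiguous, skip event
    return "hold" if pre_active == post_active else "shift"
```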

Shift-Prediction

Shift Prediction Visualization Negative Shift Prediction Visualization

Predict the next speaker during active speech.

We define a 0.5 s prediction region over the end of the last activity segment prior to a Shift event; positive shift-prediction samples are drawn from this region. Negative samples are drawn from a speaker's active regions that are distant (at least 2 s) from any listener activity.
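The negative-sampling condition can be sketched like this (an illustration of the criterion, not the repository's exact sampling code; the frame rate is an assumption):

```python
import numpy as np

def negative_shift_frames(va, speaker, min_dist=2.0, frame_hz=50):
    """Frames where `speaker` is active and the listener has no activity
    within +/- `min_dist` seconds -- candidate negatives for shift-prediction.
    """
    n = int(min_dist * frame_hz)
    listener = va[1 - speaker]
    # Dilate listener activity by `n` frames on both sides
    kernel = np.ones(2 * n + 1)
    near_listener = np.convolve(listener, kernel, mode="same") > 0
    return np.where((va[speaker] == 1) & ~near_listener)[0]
```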

BC-Prediction

Backchannel Prediction Visualization BC Prediction Visualization

Predict upcoming backchannels (BCs).

A backchannel (BC) is a brief listener activity indicating attention or acknowledgement, such as "oh", "yeah", or "mhm".

BCs are identified as short, isolated listener activity segments: they must be brief (less than 1 s) and isolated, with at least 1 s of silence before and 2 s of silence after.
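These criteria can be sketched over a speaker's activity segments; this is a simplified illustration (checking isolation against the same speaker's own surrounding segments), not the repository's implementation:

```python
def find_backchannels(segments, max_dur=1.0, pre_sil=1.0, post_sil=2.0):
    """Return backchannel candidates from a list of (start, end) times in
    seconds for one speaker's activity segments."""
    bcs = []
    for i, (s, e) in enumerate(segments):
        if e - s >= max_dur:
            continue  # too long to be a backchannel
        prev_end = segments[i - 1][1] if i > 0 else float("-inf")
        next_start = segments[i + 1][0] if i + 1 < len(segments) else float("inf")
        # Isolated: >= 1 s silence before and >= 2 s silence after
        if s - prev_end >= pre_sil and next_start - e >= post_sil:
            bcs.append((s, e))
    return bcs
```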

Short vs Long

Short vs Long Visualization

Determine utterance length (Short vs Long) at its onset during a shift event.

All BC events are labeled Short, while all Shift events are labeled Long. The prediction region covers the initial 0.2 s of the utterance. This distinction helps the system decide whether a user's activity during the system's speech should trigger an interruption or whether the system can continue speaking.

Zero-shot Evaluation

How can we predict turn-taking actions from a general distribution over future voice activity?

All zero-shot methods follow the same approach: we select subsets of the 256 states that correspond to a clear outcome. For instance, during a mutual silence we want to find the most likely next speaker, so we choose the subset of states where only speaker A is active and the subset where only speaker B is active. Comparing the total probability mass of the two subsets determines the most likely outcome.
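The comparison itself is simple: sum the probability over each subset and pick the larger. A hypothetical helper illustrating this (the function name is not the repository's API):

```python
def zero_shot_decision(p_states, subset_a, subset_b):
    """Compare the probability mass of two subsets of the 256
    projection-window states and return the winning speaker label.

    `p_states` is the model's output distribution (length 256);
    `subset_a` / `subset_b` are lists of state indices.
    """
    pa = sum(p_states[i] for i in subset_a)
    pb = sum(p_states[i] for i in subset_b)
    return "A" if pa >= pb else "B"
```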

The examples below primarily visualize the states for a particular listener (blue) and the current speaker (yellow), but the definitions are symmetric with the roles reversed.

Shift vs Hold

Shift vs Hold Visualization

Predict the next speaker during mutual silence.

We select symmetrical subsets where only one speaker is active, requiring that their last two bins are active. These final two bins cover a 1.4 s duration and signify the probable next speaker. Given the silence, there is uncertainty about when the activity starts, so the first two bins may be either active or inactive. This defines four subsets (states) for each speaker, as illustrated above.
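Enumerating these subsets under the binary state encoding from the Projection Window section (speaker A's four bins as the high bits, speaker B's as the low bits, an assumption of this sketch):

```python
from itertools import product

def shift_hold_subset(speaker):
    """The 4 states where only `speaker` (0 = A, 1 = B) is active and their
    last two bins are active; their first two bins are free."""
    states = []
    for b0, b1 in product((0, 1), repeat=2):
        bins = [[0] * 4, [0] * 4]
        bins[speaker] = [b0, b1, 1, 1]
        flat = bins[0] + bins[1]
        states.append(int("".join(map(str, flat)), 2))
    return sorted(states)
```

Under this encoding, speaker A's subset is {48, 112, 176, 240} and speaker B's is {3, 7, 11, 15}.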

Shift-Prediction

Shift Prediction Visualization Shift Prediction Visualization 2

Predict the next speaker during active speech.

Shift-prediction parallels "Shift vs Hold" but includes activity in the current speaker's initial two bins. Given the active region, the uncertainty lies in determining the utterance's end. Therefore, we incorporate all states where the current speaker's initial two bins indicate activity. This method means we compare the listener's 12 states to the current speaker's 4 states (from Shift vs Hold) for predicting an upcoming turn-shift.
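A sketch of the listener's 12-state subset, using the same binary encoding as above. Keeping the current speaker's last two bins inactive (their turn is ending) is an assumption of this sketch:

```python
from itertools import product

def shift_prediction_subset(listener):
    """The 12 states predicting a shift to `listener` while the other speaker
    is still talking: the listener's last two bins are active (first two
    free), and at least one of the current speaker's first two bins is
    active."""
    speaker = 1 - listener
    states = []
    for l0, l1 in product((0, 1), repeat=2):
        for s0, s1 in product((0, 1), repeat=2):
            if s0 == s1 == 0:
                continue  # the current speaker must show some early activity
            bins = [[0] * 4, [0] * 4]
            bins[listener] = [l0, l1, 1, 1]
            bins[speaker][0], bins[speaker][1] = s0, s1
            states.append(int("".join(map(str, bins[0] + bins[1])), 2))
    return sorted(states)
```

The 4 listener-bin combinations times the 3 non-silent speaker combinations give the 12 states compared against the current speaker's 4 hold states.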

BC-Prediction

BC Prediction Visualization

Predict an upcoming backchannel during active speech or mutual silences.

The BC-prediction subset represents a short listener activity segment and the current speaker continuing their turn. We define two constraints: the listener must have at least one active bin from the starting three, and the current speaker's last two bins must be active. During evaluations, we set a threshold to convert probabilities into a discrete action.
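Enumerating the BC-prediction subset under the same encoding. Keeping the listener's final bin inactive (the activity is brief) and leaving the current speaker's first two bins free are assumptions of this sketch beyond the two stated constraints:

```python
from itertools import product

def bc_prediction_subset(listener):
    """States where `listener` produces a short early burst of activity while
    the current speaker keeps the turn: at least one of the listener's first
    three bins is active, and the speaker's last two bins are active."""
    speaker = 1 - listener
    states = []
    for l_bins in product((0, 1), repeat=3):
        if sum(l_bins) == 0:
            continue  # the listener must be active somewhere early
        for s0, s1 in product((0, 1), repeat=2):
            bins = [[0] * 4, [0] * 4]
            bins[listener][:3] = l_bins
            bins[speaker] = [s0, s1, 1, 1]
            states.append(int("".join(map(str, bins[0] + bins[1])), 2))
    return sorted(states)
```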

Short vs Long

Short vs Long Prediction Visualization

Predict if a recent onset belongs to a Long or Short activity segment.

This prediction utilizes the subsets defined in BC-prediction but evaluates them at an activity segment's onset, with the speaker roles reversed. Note that the "bc-probabilities" are defined for both speakers and can represent either a backchannel prediction when the speaker is listening or an indication of the end of activity when they are active.