VAP

Model

Voice Activity Projection (VAP)

"VAP is a general, incremental, and predictive model of conversational dynamics. It's akin to a 'Language Model' that outputs a probability distribution over projection-windows in a dialog."

Model Description

The VAP model consists of an encoder and a predictor network. The encoder processes the raw audio waveforms together with the current voice activity (VA) information and produces frame-level latent representations. These are fed to the predictor, a causal sequence network that sees only the context up to the current frame and outputs a probability distribution over the 256 projection-window states.

Main Strengths

General

No annotations
No feature extraction
No speaker normalization

Incremental

Train, evaluate, and use over entire dialogs

Predictive

Predict upcoming activity
Allocate extra processing time for speech

Transformer

GPT-like, decoder-only, unidirectional
4 layers
4 heads
256 dim
ALiBi attention
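
As a rough illustration of what ALiBi attention adds to a causal transformer, the sketch below computes the per-head slopes and the linear distance bias following the recipe in the ALiBi paper for a power-of-two number of heads. The function names are hypothetical, and the VAP code base may implement this differently.

```python
def alibi_slopes(num_heads):
    """Per-head slopes for a power-of-two head count (ALiBi paper recipe)."""
    # Geometric sequence: the first slope is 2^(-8/n) and so is the ratio.
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (i + 1) for i in range(num_heads)]


def causal_alibi_bias(num_frames, slope):
    """Bias added to the attention logits of one head.

    The bias is 0 on the diagonal and grows more negative for older frames.
    Future frames (j > i) get -inf so that the network stays causal.
    """
    neg_inf = float("-inf")
    return [
        [-slope * (i - j) if j <= i else neg_inf for j in range(num_frames)]
        for i in range(num_frames)
    ]


slopes = alibi_slopes(4)                 # 4 heads, as listed above
bias = causal_alibi_bias(4, slopes[0])   # bias matrix for the first head
```

With 4 heads the slopes are 1/4, 1/16, 1/64 and 1/256, so each head penalizes distant frames at a different rate.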

Attention Is All You Need (Vaswani et al., 2017)

CPC

CNN-1D

5 layers
256 dim
Sequential

LSTM

1 layer
256 dim

Projection Window

"The calibration of conversational timing and the capacity for projecting when others are going to stop speaking are crucial elements of the conversational machine."

N. J. Enfield, How We Talk, Chapter 4: The One-Second Window.

We want to model the future voice activity in order to predict turn-taking decisions in dyadic spoken dialog. The representation of the future should be short enough to introduce minimal noise, yet long enough to predict turn-shifts in advance and to distinguish shifts from shorter segments of activity (backchannels).

The projection window covers 2 seconds of future voice activity for both speakers. We divide each speaker's window into 4 bins of increasing duration (200 ms, 400 ms, 600 ms and 800 ms), giving 8 bins in total. The ratio of activity is calculated over each bin, and a bin is considered active when the ratio is above 50%, which yields a discrete binary representation of shape (2, 4). This representation is mapped to an index in a codebook/vocabulary of 2^8 = 256 states by treating its 8 bits as a binary number.
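
The mapping from a 2-second window to a state index can be sketched as follows. This is a minimal illustration, not the reference implementation: the 50 Hz frame rate, the function names, and the bit order (speaker A first, most significant bit first) are assumptions made for the example.

```python
BIN_MS = (200, 400, 600, 800)  # bin durations from the text; they sum to 2 s


def window_to_index(future_va, frame_hz=50, threshold=0.5):
    """Map 2 s of future voice activity to a state index in [0, 255].

    future_va: two lists (one per speaker) of 0/1 frame labels covering 2 s.
    frame_hz and the bit order are assumptions of this sketch.
    """
    bits = []
    for speaker_frames in future_va:
        start = 0
        for ms in BIN_MS:
            n = ms * frame_hz // 1000            # number of frames in this bin
            chunk = speaker_frames[start:start + n]
            ratio = sum(chunk) / n               # ratio of active frames
            bits.append(1 if ratio > threshold else 0)
            start += n
    # Read the 8 bits as one binary number, most significant bit first.
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index


def index_to_window(index):
    """Inverse mapping: state index to a (2, 4) nested list of 0/1 bins."""
    bits = [(index >> shift) & 1 for shift in range(7, -1, -1)]
    return [bits[:4], bits[4:]]


# Speaker A stays silent; speaker B starts talking 600 ms into the window.
speaker_a = [0] * 100
speaker_b = [0] * 30 + [1] * 70
print(window_to_index([speaker_a, speaker_b]))  # 3
print(index_to_window(3))                       # [[0, 0, 0, 0], [0, 0, 1, 1]]
```

Because each bin is a single bit, the 8 bins enumerate exactly the 256 states that the predictor outputs a distribution over.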

Events

Shift vs Hold

Predict the next speaker during mutual silences.

We define two regions, pre-offset and post-offset, and require that only one speaker is active in each, which excludes ambiguous moments of overlap. Identifying the last speaker and the next speaker then determines whether the event is a Shift or a Hold.

Shift-Prediction

Predict the next speaker during active speech.

We define a 0.5 s prediction region over the last activity segment before a Shift event; positive shift-prediction examples are taken from this region. Negative examples are sampled from regions where the speaker is active and that lie far (2 s) from any listener activity.

BC-Prediction

Predict upcoming backchannels (BCs).

A backchannel (BC) is a brief listener activity indicating attention or acknowledgement, such as "oh", "yeah", or "mhm".

BCs are identified as short, isolated segments of listener activity: shorter than 1 s, preceded by at least 1 s of silence and followed by at least 2 s of silence.
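
The selection rule above can be sketched as a filter over one speaker's activity segments. The function name and the input format are hypothetical, and the sketch assumes that the pre- and post-silence are measured on that speaker's own channel and that the start and end of the dialog count as silence.

```python
def find_backchannels(segments, max_dur=1.0, pre_silence=1.0, post_silence=2.0):
    """Return the segments of one speaker that look like isolated backchannels.

    segments: sorted, non-overlapping (start, end) times in seconds
              for a single speaker channel.
    """
    backchannels = []
    for i, (start, end) in enumerate(segments):
        if end - start >= max_dur:
            continue  # too long to count as a backchannel
        # Assumption: the dialog boundaries are treated as unbounded silence.
        prev_end = segments[i - 1][1] if i > 0 else float("-inf")
        next_start = segments[i + 1][0] if i + 1 < len(segments) else float("inf")
        if start - prev_end >= pre_silence and next_start - end >= post_silence:
            backchannels.append((start, end))
    return backchannels


listener = [(0.0, 3.0), (4.5, 5.0), (8.0, 8.4), (9.0, 12.0)]
print(find_backchannels(listener))  # [(4.5, 5.0)]
```

In this example the segment at 8.0 s is short but is followed by new activity after only 0.6 s, so it is not isolated and is rejected.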

Short vs Long

Determine utterance length (Short vs Long) at its onset during a shift event.

All BC events are labeled Short while all Shift events are considered Long. The prediction zone covers the initial 0.2s of the utterance. This helps decide if a user's activity during the system's speech should lead to interruption or if the system can proceed.

Zero-shot Evaluation

How do we predict turn-taking actions from a general distribution over future voice activity?

All zero-shot methods follow a similar approach: we select subsets of the 256 states that correspond to a clear outcome. For instance, during a mutual silence we want to find the most likely next speaker. We choose one subset of states in which only speaker A is active and another in which only speaker B is active. We then sum the probability mass in each subset and compare the two sums to determine the most likely outcome.
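
The subset comparison for the next-speaker case can be sketched as below. This is an illustration under stated assumptions: the bit layout (speaker A in the upper 4 bits), the function names, and the subset definition ("at least one active bin for one speaker and none for the other") are choices made for the example and may differ from the subsets used in the actual evaluation.

```python
def index_to_window(index):
    """Decode a state index into two 4-bin lists (speaker A, speaker B)."""
    bits = [(index >> shift) & 1 for shift in range(7, -1, -1)]
    return bits[:4], bits[4:]


def next_speaker(probs):
    """Pick the most likely next speaker from a distribution over 256 states.

    probs: list of 256 probabilities, one per projection-window state.
    Returns (speaker, p_a), where p_a is the probability of A
    renormalized over the two subsets.
    """
    p_a = p_b = 0.0
    for index, p in enumerate(probs):
        a_bins, b_bins = index_to_window(index)
        if any(a_bins) and not any(b_bins):
            p_a += p  # only speaker A is active in this state
        elif any(b_bins) and not any(a_bins):
            p_b += p  # only speaker B is active in this state
    total = p_a + p_b
    if total == 0.0:
        return None, 0.5  # no mass in either subset: undecided
    return ("A" if p_a >= p_b else "B"), p_a / total


# Toy distribution: most mass on "A speaks late in the window".
probs = [0.0] * 256
probs[0b00110000] = 0.6   # only A active (bins 3 and 4)
probs[0b00000011] = 0.3   # only B active (bins 3 and 4)
probs[0b11111111] = 0.1   # both active: belongs to neither subset
speaker, p_a = next_speaker(probs)
```

States in which both speakers are active, or neither is, fall outside both subsets and are ignored, which is why the two sums are renormalized before they are compared.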

The examples below visualize the states from the perspective of a particular listener (blue) and the current speaker (yellow); the same reasoning applies with the roles reversed.