A general, incremental and predictive model of conversational dynamics. The model processes the voice activity and audio waveform of a spoken dialog and outputs a probability distribution over projection windows (discrete states) that represent speaker activity over the next 2 seconds. Its main purpose is to serve as a turn-taking model for spoken dialog systems.
"VAP is a general, incremental, and predictive model of conversational dynamics. It's akin to a 'Language Model' that outputs a probability distribution over projection-windows in a dialog."
The VAP model combines an encoder with a predictor network. The encoder processes the raw audio waveform and the current voice activity (VA) information to produce latent representations at the model's frame frequency. These are passed to the predictor, a causal sequence network that processes the context up to the current frame and outputs a probability distribution over the 256 projection-window states.
We want to model the future voice activity in order to predict turn-taking decisions over dyadic spoken dialog. The representation of the future should be short enough to introduce minimal noise whilst long enough to predict turn-shifts in advance and to discern shifts from shorter segments of activity (backchannels).
The projection window consists of the voice activity over the next 2 seconds of dialog. For each of the two speakers, we divide the window into 4 regions of increasing duration (200ms, 400ms, 600ms and 800ms), giving 8 bins in total. The ratio of activity is computed over each bin, and a bin is considered active when the ratio is above 50%, which yields a discrete binary representation of size (2, 4). This representation is mapped to an index in a codebook/vocabulary by treating it as a binary number, so there are 2^8 = 256 possible states.
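The discretization described above can be sketched in a few lines of Python. This is a minimal illustration, not the model's actual code: the function name `projection_window_index`, the 50 Hz frame rate, and the bit order (first speaker's bins first, most significant bit first) are assumptions made for the example.

```python
import numpy as np

# Bin durations from the text; together they cover the 2 s projection window.
BIN_DURATIONS_S = (0.2, 0.4, 0.6, 0.8)


def projection_window_index(future_va, frame_hz=50, threshold=0.5):
    """Discretize 2 s of future voice activity into a (2, 4) binary grid and an index.

    future_va: array of shape (2, n_frames) with 1 where a speaker is active.
    frame_hz:  assumed frame rate (hypothetical default, not taken from the source).
    """
    future_va = np.asarray(future_va, dtype=float)

    # Convert bin durations to frame counts and cumulative bin edges.
    frames_per_bin = [int(round(d * frame_hz)) for d in BIN_DURATIONS_S]
    edges = np.cumsum([0] + frames_per_bin)
    n_frames = int(edges[-1])
    if future_va.shape != (2, n_frames):
        raise ValueError(f"expected shape (2, {n_frames}), got {future_va.shape}")

    # A bin is active when more than `threshold` of its frames are active.
    bins = np.zeros((2, 4), dtype=int)
    for b in range(4):
        ratio = future_va[:, edges[b]:edges[b + 1]].mean(axis=1)
        bins[:, b] = ratio > threshold

    # Read the 8 bits as one binary number (assumed order: row-major, MSB first).
    bits = bins.flatten()
    index = sum(int(bit) << (len(bits) - 1 - i) for i, bit in enumerate(bits))
    return bins, index
```

With this bit order, a window in which nobody speaks maps to index 0 and a window in which both speakers are active throughout maps to index 255. A different bit order would give a different but equally valid codebook.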
Predict the next speaker during mutual silences.
We distinguish two regions: pre-offset and post-offset, ensuring only one speaker is active to prevent ambiguous overlap moments. Determining the last and next speaker helps identify a Shift or a Hold.
Predict the next speaker during active speech.
We define a 0.5s prediction region at the end of the last activity segment before a Shift event; these regions provide the positive shift-prediction samples. Negative samples are drawn from regions where a speaker is active and that are far (2s) from any listener activity.
Predict upcoming backchannels (BCs).
A backchannel (BC) is a brief listener activity indicating attention or acknowledgement, such as "oh", "yeah", or "mhm".
BCs are identified as short, isolated listener activity segments. They must be brief (less than 1s) and isolated, marked by 1s pre-silence and 2s post-silence.
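The backchannel criteria above can be expressed as a small filter over activity segments. The sketch below is illustrative only: the function name `find_backchannels` is hypothetical, the silences are assumed to be measured within the same speaker's channel, and the start and end of the recording are assumed to count as silence.

```python
def find_backchannels(segments, max_duration=1.0, pre_silence=1.0, post_silence=2.0):
    """Return the segments that satisfy the backchannel criteria.

    segments: list of (start, end) times in seconds for ONE speaker, sorted by start.
    """
    backchannels = []
    for i, (start, end) in enumerate(segments):
        # Criterion 1: the segment must be brief (less than `max_duration`).
        if end - start >= max_duration:
            continue

        # Criterion 2: the segment must be isolated from the speaker's other segments.
        # Assumption: recording boundaries count as unlimited silence.
        prev_end = segments[i - 1][1] if i > 0 else float("-inf")
        next_start = segments[i + 1][0] if i + 1 < len(segments) else float("inf")
        if start - prev_end >= pre_silence and next_start - end >= post_silence:
            backchannels.append((start, end))
    return backchannels


# Example: only the second segment is both short and isolated.
listener_segments = [(0.0, 3.0), (4.5, 5.0), (8.0, 8.4), (9.0, 12.0)]
print(find_backchannels(listener_segments))
```

The third segment in the example is short enough, but it is followed by new activity after only 0.6s, so it fails the post-silence condition.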
Determine utterance length (Short vs Long) at its onset during a shift event.
All BC events are labeled Short while all Shift events are considered Long. The prediction zone covers the initial 0.2s of the utterance. This helps decide if a user's activity during the system's speech should lead to interruption or if the system can proceed.
How to predict turn-taking actions from a general distribution over future voice activity?
All zero-shot methods follow a similar approach: we select subsets of the full set of states, where each subset corresponds to a clear outcome. For instance, during a mutual silence, we want to find the most likely next speaker. We choose one subset of states where only speaker A is active and one where only speaker B is active, then compare the probabilities of the two subsets to determine the most likely outcome.
The examples below primarily visualize the states from the perspective of a particular listener (blue) and the current speaker (yellow), but the same definitions apply with the roles reversed.
Predict the next speaker during mutual silence.
We select symmetrical subsets where only one speaker is active and where that speaker's last two bins are active. The last two bins (600ms and 800ms) span 1.4s and represent the probable speaker of the next turn. Because we are in a silence, there is uncertainty about when the activity starts, so the first two bins may or may not be active. This gives four states for each speaker, as illustrated above.
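The subset selection for next-speaker prediction can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: the helper names are hypothetical, the state index is assumed to read the (2, 4) grid as a binary number with the first speaker's bins first and the most significant bit first, and normalizing the two subset probabilities so they sum to 1 is an added convenience.

```python
import itertools

import numpy as np


def bins_to_index(bins):
    """Map a (2, 4) grid of 0/1 values to a state index (assumed MSB-first order)."""
    bits = [bit for row in bins for bit in row]
    return sum(bit << (7 - i) for i, bit in enumerate(bits))


def next_speaker_subsets():
    """Four states per speaker: last two bins active, first two bins free, other speaker silent."""
    silent = (0, 0, 0, 0)
    patterns = [(a, b, 1, 1) for a, b in itertools.product((0, 1), repeat=2)]
    a_next = [bins_to_index((p, silent)) for p in patterns]
    b_next = [bins_to_index((silent, p)) for p in patterns]
    return a_next, b_next


def next_speaker_probs(probs):
    """Compare the probability mass of the two subsets.

    probs: array of 256 state probabilities, e.g. the model output for one frame.
    Returns (p_a, p_b), normalized over the two subsets (an assumption of this sketch).
    """
    probs = np.asarray(probs, dtype=float)
    a_idx, b_idx = next_speaker_subsets()
    p_a, p_b = probs[a_idx].sum(), probs[b_idx].sum()
    total = p_a + p_b
    if total == 0:
        return 0.5, 0.5  # no mass in either subset: treat as undecided
    return p_a / total, p_b / total
```

The speaker with the larger value is taken as the most likely next speaker. Combined with knowledge of who spoke last, this distinguishes a Shift from a Hold.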
Predict the next speaker during active speech.
Shift-prediction is similar to "Shift vs Hold", but also includes activity in the current speaker's first two bins. Because the current speaker is still active, the uncertainty lies in when the utterance ends. We therefore include all states where the current speaker's first two bins indicate activity. To predict an upcoming turn-shift, we then compare the listener's 12 states against the current speaker's 4 states (from Shift vs Hold).
Predict an upcoming backchannel during active speech or mutual silences.
The BC-prediction subset represents a short segment of listener activity while the current speaker continues their turn. We define two constraints: the listener must have at least one active bin among the first three, and the current speaker's last two bins must be active. During evaluation, we apply a threshold to the subset probability to convert it into a discrete action.
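A sketch of the BC-prediction subset and the thresholding step is given below. It encodes only the two constraints named above and leaves every other bin unconstrained, which is an assumption of this example, as are the function names, the MSB-first state indexing, and the threshold value passed by the caller.

```python
import itertools


def bc_subset(listener):
    """State indices matching the two BC constraints.

    listener: row index (0 or 1) of the speaker who is currently listening.
    Assumed state layout: 8 bits, first speaker's 4 bins then second speaker's, MSB first.
    """
    states = []
    for bits in itertools.product((0, 1), repeat=8):
        rows = (bits[:4], bits[4:])
        lis, spk = rows[listener], rows[1 - listener]
        # Constraint 1: listener has at least one active bin among the first three.
        # Constraint 2: current speaker's last two bins are active.
        if any(lis[:3]) and spk[2] == 1 and spk[3] == 1:
            states.append(sum(bit << (7 - i) for i, bit in enumerate(bits)))
    return states


def predict_backchannel(probs, listener, threshold):
    """Sum the probability mass of the BC subset and compare it with a threshold.

    probs:     sequence of 256 state probabilities for one frame.
    threshold: decision threshold chosen by the caller (no value is given in the text).
    """
    p_bc = sum(probs[s] for s in bc_subset(listener))
    return p_bc, bool(p_bc >= threshold)
```

Under these assumptions each listener's subset contains 56 states. If further constraints apply (for example on the listener's last bin), the predicate inside `bc_subset` is the only place that needs to change.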
Predict if a recent onset belongs to a Long or Short activity segment.
This prediction uses the subsets defined for BC-prediction, but evaluates them at the onset of an activity segment, with the speaker roles reversed. Note: the "bc-probabilities" are defined for both speakers. They can be read as a backchannel prediction when the speaker is listening, or as an indication that the activity is about to end when the speaker is active.