Unspeech embeddings are based on unsupervised learning of context feature representations of spoken language. Variance and variability in recordings of speech and their representations are a common problem in automatic speech processing tasks: e.g. speaker characteristics, environment characteristics and the type of microphone cause large differences in typical speech representations (e.g. FBANK, MFCC), making direct similarity comparisons difficult. Such factors of variance can also be described as the context of an utterance; speech sounds that occur close in time share similar contexts. Unspeech allows you to learn embeddings of such contexts in an unsupervised way on raw speech data: speaker IDs, channel information or transcriptions are not needed.
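To make the training signal concrete, here is a minimal sketch (plain NumPy, not the project's actual sampling code) of how target, context and negative windows could be drawn from a single utterance; the parameters mirror the --window_length, --left_contexts and --right_contexts options used in the training commands below:

import numpy as np

def sample_context_pairs(feats, window_length=64, left_contexts=2, right_contexts=2,
                         neg_samples=4):
    # feats: (num_frames, num_filters) feature matrix of a single utterance
    num_windows = left_contexts + 1 + right_contexts
    max_start = feats.shape[0] - num_windows * window_length
    start = np.random.randint(0, max_start + 1)
    windows = [feats[start + i * window_length:start + (i + 1) * window_length]
               for i in range(num_windows)]
    target = windows[left_contexts]                        # center window
    positives = windows[:left_contexts] + windows[left_contexts + 1:]
    # negative windows: random positions, ideally drawn from other utterances
    neg_starts = np.random.randint(0, feats.shape[0] - window_length + 1, size=neg_samples)
    negatives = [feats[s:s + window_length] for s in neg_starts]
    return target, positives, negatives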
Installation instructions:
git clone https://gitlab.com/milde/unspeech
pip3 install tensorflow==1.4.1 numpy matplotlib wavefile sklearn
TensorFlow versions newer than 1.4.1 currently break the code; updates to fix issues with newer versions will be provided soon. python3 unsup_model_neg.py is the main script to train new models and to use existing models, with many options to control various parameters of the model; see python3 unsup_model_neg.py --help
Use the --filelist option to supply either a Kaldi .scp file or an .ark file directly. Utterances shorter than what is necessary to sample a positive context pair are automatically discarded, as sketched below.
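A hedged sketch of the implied length check, assuming non-overlapping windows (the function name is illustrative, not the actual implementation):

def is_long_enough(num_frames, window_length=64, left_contexts=2, right_contexts=2):
    # An utterance must at least fit the target window plus all of its context windows.
    min_frames = (left_contexts + 1 + right_contexts) * window_length
    return num_frames >= min_frames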
Train on a Tedlium fbank ark file on GPU #0 (this model is referred to as unspeech-64-ted in the paper):
#!/bin/bash
CUDA_VISIBLE_DEVICES=0 python3 unsup_model_neg.py --window_length 64 --window_neg_length 64 \
--filelist /srv/data/kaldi/egs/tedlium/s5_r2/data/train_fbank_sp/feats_unnormalized.ark --noend_to_end \
--embedding_transform Vgg16big --l2_reg 0.0001 --batch_size 32 --left_contexts 2 --right_contexts 2 \
--unit_normalize_var True --tied_embeddings_transforms True --learn_rate 0.0003 --fc_size 2048
The console output will periodically give you updates on the learning process, where the model is saved, and example predictions (is context / is not context). After about 24 hours the displayed training accuracy should get close to 1.0, e.g.:
At step 394000 step-time 0.2279 loss 0.0474 Model saving path is: /srv/data/unspeech_models/neg/runs/1520559132feats_transVgg16big_nsampling_rnd_win64_neg_samples4_lcontexts2_rcontexts2_flts40_embsize100_fc_size2048_unit_norm_var_dropout_keep0.9_l2_reg0.0001_featinput_filelist.english.train_dot_combine_tied_embs/tf10

Several embedding transforms can be used; the commands in this README use Vgg16big (set with the --embedding_transform option).
By default, features are loaded into memory. On larger datasets, where the data does not fit into main memory, memory mapping from disk is also supported. We recommend a fast SSD (e.g. M.2 NVMe) for the memory-mapped cache, as the amount of random access reads will be high.
import kaldi_io

# Read a (gzipped) Kaldi ark file; instead of loading everything into main
# memory, the features are cached as memory-mapped float32 arrays in memmap_dir.
utts, feats = kaldi_io.readArk('/srv/home/milde/youtube-tedx/tedx_feats.ark.gz',
                               memmap_dir='/scratch/tedx_mmap_cache', memmap_dtype='float32')
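Assuming readArk returns parallel lists of utterance IDs and feature matrices, the memory-mapped features can then be used like regular NumPy arrays:

# utts holds the utterance ids, feats the memory-mapped feature matrices
print(utts[0], feats[0].shape)  # e.g. an utterance id and (num_frames, 40)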
Train on the memory-mapped TEDx fbank features on GPU #0 (this model is referred to as unspeech-128-tedx in the paper):
#!/bin/bash
CUDA_VISIBLE_DEVICES=0 python3 unsup_model_neg.py --window_length 128 --window_neg_length 128 \
--memmap_reuse_cache True --memmap_dir /scratch/tedx_mmap_cache --embedding_transform Vgg16big \
--l2_reg 0.0001 --batch_size 32 --left_contexts 2 --right_contexts 2 --unit_normalize_var True \
--tied_embeddings_transforms True --fc_size 2048
Speed-perturbed versions extend the dataset by 3x, using two additional playback speeds with sox when generating features: 0.9x and 1.1x.
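A hedged sketch of how such speed-perturbed copies could be generated with sox (file names are illustrative; this mirrors what Kaldi's standard speed perturbation does when you generate features with Kaldi):

import subprocess

# create 0.9x and 1.1x speed-perturbed copies of a wav file with sox
for speed in ['0.9', '1.1']:
    subprocess.run(['sox', 'utt1.wav', 'utt1_sp{}.wav'.format(speed), 'speed', speed],
                   check=True)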
We will soon upload more pre-trained models here.