Unspeech


Unsupervised Speech Context Embeddings


Unspeech embeddings are based on unsupervised learning of context feature representations of spoken language. Variance and variability in speech recordings and their representations are a common problem in automatic speech processing: the speaker, the acoustic environment and the type of microphone all cause large differences in typical speech representations (e.g. FBANK, MFCC), making direct similarity comparisons difficult. Such factors of variance can also be described as the context of an utterance; speech sounds that occur close in time share similar contexts. Unspeech lets you learn embeddings of such contexts in an unsupervised way on raw speech data: speaker IDs, channel information or transcriptions are not needed.
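To give an intuition for the training objective, here is a simplified sketch (not the repository's exact implementation) of the negative-sampling idea: feature windows that are adjacent in time within an utterance form positive (target, context) pairs, while windows drawn from random other positions or utterances form negative pairs. All names and defaults below are illustrative only:

# Simplified sketch of the negative-sampling idea behind unspeech, not the
# repository's exact sampling code; function and variable names are illustrative.
import numpy as np

def sample_pairs(feats, other_feats, window_length=64, left_contexts=2,
                 right_contexts=2, neg_samples=4, rng=np.random):
    """feats, other_feats: (num_frames, num_filters) FBANK-like feature matrices."""
    num_windows = left_contexts + 1 + right_contexts
    start = rng.randint(0, len(feats) - num_windows * window_length + 1)
    windows = [feats[start + i * window_length: start + (i + 1) * window_length]
               for i in range(num_windows)]
    target = windows[left_contexts]
    positives = windows[:left_contexts] + windows[left_contexts + 1:]
    # Negatives: windows drawn from a different (randomly chosen) utterance.
    negatives = []
    for _ in range(neg_samples):
        neg_start = rng.randint(0, len(other_feats) - window_length + 1)
        negatives.append(other_feats[neg_start: neg_start + window_length])
    # Training examples: (target, context, label) with label 1.0 = true context.
    return ([(target, c, 1.0) for c in positives] +
            [(target, n, 0.0) for n in negatives])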

Use cases:

  • Cluster a speech corpus in-domain, to help speaker adaptation methods in HMM-GMM and (T)DNN-HMM acoustic models, without the need for speaker annotations or trained speaker embeddings (see the clustering sketch after this list).
  • As a context embedding in acoustic models, providing additional information to the acoustic model.
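A minimal sketch of the clustering use case, assuming the utterance-level embeddings have already been exported to a plain text file of the form "utt-id value1 value2 ..." (this file name and format are assumptions, not something unspeech produces by default); the resulting utt2spk file with pseudo-speaker labels can then stand in for speaker annotations in a Kaldi recipe:

# Hypothetical sketch: cluster per-utterance unspeech embeddings and write a
# Kaldi-style utt2spk file with pseudo-speaker labels. File names, the export
# format and the number of clusters are assumptions.
import numpy as np
from sklearn.cluster import KMeans

utt_ids, vectors = [], []
with open('utt_embeddings.txt') as f:           # assumed format: utt-id + floats
    for line in f:
        parts = line.split()
        utt_ids.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])

# n_clusters is a guess at the number of speakers in the corpus.
labels = KMeans(n_clusters=50, random_state=0).fit_predict(np.array(vectors))

with open('utt2spk', 'w') as f:                 # pseudo-speaker labels for Kaldi
    for utt_id, label in zip(utt_ids, labels):
        f.write('%s cluster%03d\n' % (utt_id, label))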

Code

The Python3/Tensorflow code to train unspeech models is currently available at this gitlab repository.

Installation instructions:

git clone https://gitlab.com/milde/unspeech
pip3 install tensorflow==1.4.1 numpy matplotlib wavefile sklearn 

Newer Tensorflow versions than 1.4.1 currently break the code, but updates to fix compatibility with newer versions will be provided soon.
python3 unsup_model_neg.py is the main script to train new models and to use existing models. It offers many options to control the various parameters of the model; see python3 unsup_model_neg.py --help

Training

Use the --filelist option to supply either a Kaldi .scp file or an .ark file directly. Utterances shorter than what is necessary to sample a positive context pair will automatically be discarded.
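As a rough illustration of this length requirement (the script's exact check may differ), one positive pair needs a target window plus its left and right context windows, so an utterance must contain at least roughly the following number of feature frames:

# Rough illustration of the minimum utterance length; the script's exact
# check may differ.
def min_frames_needed(window_length=64, left_contexts=2, right_contexts=2):
    return (left_contexts + 1 + right_contexts) * window_length

print(min_frames_needed())   # 320 frames, i.e. about 3.2 s at a 10 ms frame shift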

Tedlium

Train on a Tedlium fbank ark file on GPU #0 (this model is referred to as unspeech-64-ted in the paper):

#!/bin/bash
CUDA_VISIBLE_DEVICES=0 python3 unsup_model_neg.py --window_length 64 --window_neg_length 64 \
--filelist /srv/data/kaldi/egs/tedlium/s5_r2/data/train_fbank_sp/feats_unnormalized.ark  --noend_to_end  \
--embedding_transform Vgg16big --l2_reg 0.0001 --batch_size 32  --left_contexts 2 --right_contexts 2  \
--unit_normalize_var True  --tied_embeddings_transforms True --learn_rate 0.0003 --fc_size 2048

Example output:

The console output will periodically report on the learning process: where the model is saved and example predictions (is context / is not context). After about 24 hours the displayed training accuracy should get close to 1.0, e.g.:

At step 394000 step-time 0.2279 loss 0.0474 Model saving path is: /srv/data/unspeech_models/neg/runs/1520559132feats_transVgg16big_nsampling_rnd_win64_neg_samples4_lcontexts2_rcontexts2_flts40_embsize100_fc_size2048_unit_norm_var_dropout_keep0.9_l2_reg0.0001_featinput_filelist.english.train_dot_combine_tied_embs/tf10
Training started 24.94 hours ago.
FLAGS params in short: feats_transVgg16big_nsampling_rnd_win64_neg_samples4_lcontexts2_rcontexts2_flts40_embsize100_fc_size2048_unit_norm_var_dropout_keep0.9_l2_reg0.0001_featinput_filelist.english.train_dot_combine_tied_embs
np.bincount: [127 129]
len: 256 256
true labels, out (first 40 dims): [(1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (0.0, 0.0), (0.0, 0.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0)]
accuracy: 0.98828125
majority class accuracy: 0.5
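The summary numbers above relate to the printed (true label, prediction) pairs roughly as follows (a sketch, not the script's exact evaluation code):

# Sketch of how np.bincount, accuracy and majority class accuracy relate to the
# (true label, prediction) pairs printed above; the short pairs list is a toy example.
import numpy as np

pairs = [(1.0, 1.0), (1.0, 1.0), (0.0, 0.0), (0.0, 1.0)]   # (true, predicted)
true = np.array([p[0] for p in pairs])
pred = np.array([p[1] for p in pairs])

print(np.bincount(true.astype(int)))                    # class counts in the batch
print(np.mean(true == pred))                             # accuracy
print(max(np.bincount(true.astype(int))) / len(true))    # majority class accuracy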

Embedding transformations:

Several embedding transforms can be used.

We recommend Vgg16 or Vgg16big, depending on the amount of data, since they offer good performance and are among the fastest to train. The ResNet variants might currently be buggy (there seem to be issues with TensorFlow's batch normalization in a Siamese neural network architecture).

Large scale training:

By default, features are loaded into memory. On larger datasets, where the data does not fit into main memory, memory mapping from disk is also supported. We recommend a fast SSD (e.g. M.2 NVMe) for the memory-mapped cache, as the amount of random access reads will be high.

Loading features manually into a mmap cache in a Python3 shell:

# Read the gzipped Kaldi ark and write the features into a memory-mapped cache on disk.
import kaldi_io
import numpy
utts, feats = kaldi_io.readArk('/srv/home/milde/youtube-tedx/tedx_feats.ark.gz',
                               memmap_dir='/scratch/tedx_mmap_cache', memmap_dtype='float32')
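Assuming readArk returns a list of utterance ids and a list of per-utterance feature matrices, as in the snippet above, a quick sanity check on the freshly built cache could look like this; later training runs can then reuse the cache via --memmap_reuse_cache True and --memmap_dir, as in the TEDx example below:

# Number of cached utterances and the shape of the first feature matrix
# (num_frames x num_filters), to verify the cache was built as expected.
print(len(utts), feats[0].shape)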

TEDx example

Train on a TEDx fbank ark file on GPU #0, reusing the memory-mapped feature cache created above (this model is referred to as unspeech-128-tedx in the paper):


#!/bin/bash
CUDA_VISIBLE_DEVICES=0 python3 unsup_model_neg.py --window_length 128 --window_neg_length 128 \
--memmap_reuse_cache True --memmap_dir /scratch/tedx_mmap_cache --embedding_transform Vgg16big \
--l2_reg 0.0001 --batch_size 32  --left_contexts 2 --right_contexts 2 --unit_normalize_var True \
--tied_embeddings_transforms True --fc_size 2048

Pretrained models

Speed-perturbed versions extend the dataset by 3x, using two additional playback speeds (0.9x and 1.1x) with sox when generating features.
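For illustration, speed-perturbed copies of a recording could be created with sox before feature extraction roughly as follows (a sketch only; the exact pipeline used for the paper may differ, and Kaldi ships equivalent tooling such as utils/data/perturb_data_dir_speed.sh). The wav file names are placeholders:

# Create 0.9x and 1.1x speed-perturbed copies of a wav file with sox's "speed"
# effect; input and output file names are placeholders.
import subprocess

for factor in ('0.9', '1.1'):
    subprocess.run(['sox', 'utt1.wav', 'utt1_sp%s.wav' % factor, 'speed', factor],
                   check=True)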

We will soon upload more pre-trained models here.