Unspeech


Unsupervised Speech Context Embeddings


Unspeech embeddings are based on unsupervised learning of context feature representations of spoken language. Variability across recordings of speech is a common problem in automatic speech processing tasks: the speaker, the acoustic environment and the type of microphone all cause large differences in typical speech representations (e.g. FBANK, MFCC), making direct similarity comparisons difficult. Such factors of variance can also be described as the context of an utterance; speech sounds that occur close in time share similar contexts. Unspeech allows you to learn embeddings of such contexts in an unsupervised way on raw speech data: speaker IDs, channel information or transcriptions are not needed.

Use cases:

  • Cluster a speech corpus in-domain to support speaker adaptation methods in HMM-GMM and (T)DNN-HMM acoustic models, without the need for speaker annotations or trained speaker embeddings.
  • Use the embeddings as additional context input to acoustic models.

News:

  • 5. Sep 2018 - Unspeech presentation at Interspeech 2018, see also: slides, paper
  • June 2018 - Issues with newer Tensorflow versions (>1.4.1) have been resolved; the Unspeech training code now works with Tensorflow 1.8.
  • 3. June 2018 - Benjamin Milde and Chris Biemann, "Unspeech: Unsupervised Speech Context Embeddings", accepted at Interspeech 2018!
  • 18. April 2018 - A preprint of our paper on unspeech is available and a preview unspeech.net website is online

Publications:

  • Benjamin Milde, Chris Biemann, "Unspeech: Unsupervised Speech Context Embeddings," In: Proceedings Interspeech 2018, pp. 2693-2697, 2018 (slides, paper)

Code

The Python3/Tensorflow code to train unspeech models is available at this gitlab repository.

Installation instructions:

git clone https://gitlab.com/milde/unspeech
pip3 install tensorflow numpy matplotlib wavefile sklearn hdbscan

You need a recent version of Tensorflow (1.5+) to run the code; we recommend Tensorflow 1.8 with Python3.
unsup_model_neg.py is the main script for training new models and for using trained models to generate unspeech features. It has many options to control the parameters of the model; see 'python3 unsup_model_neg.py --help'.

Training

Use the --filelist option to supply either a Kaldi .scp file or an .ark file directly. Utterances that are too short to sample a positive context pair from are automatically discarded.
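For example, the filelist can also point to a Kaldi .scp file instead of an .ark (the path below is just a placeholder; the full set of recommended training flags is shown in the TED-LIUM example below):

#!/bin/bash
# Hypothetical invocation with a feats.scp filelist instead of an .ark file:
python3 unsup_model_neg.py --window_length 64 --window_neg_length 64 \
--filelist /srv/data/kaldi/egs/tedlium/s5_r2/data/train_fbank_sp/feats.scp \
--embedding_transformation Vgg16big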

TED-LIUM example

Train on a TED-LIUM fbank ark file on GPU #0 (this model is referred to as unspeech-64-ted in the paper):

#!/bin/bash
CUDA_VISIBLE_DEVICES=0 python3 unsup_model_neg.py --window_length 64 --window_neg_length 64 \
--filelist /srv/data/kaldi/egs/tedlium/s5_r2/data/train_fbank_sp/feats_unnormalized.ark  --noend_to_end  \
--embedding_transformation Vgg16big --l2_reg 0.0001 --batch_size 32  --left_contexts 2 --right_contexts 2  \
--unit_normalize_var True  --tied_embeddings_transforms True --learn_rate 0.0003 --fc_size 2048

Example output:

The console output periodically reports on the training progress, where the model is saved, and example predictions (is context / is not context). After about 24 hours the displayed training accuracy should get close to 1.0, e.g.:

At step 394000 step-time 0.2279 loss 0.0474 Model saving path is: /srv/data/unspeech_models/neg/runs/1520559132feats_transVgg16big_nsampling_rnd_win64_neg_samples4_lcontexts2_rcontexts2_flts40_embsize100_fc_size2048_unit_norm_var_dropout_keep0.9_l2_reg0.0001_featinput_filelist.english.train_dot_combine_tied_embs/tf10
Training started 24.94 hours ago.
FLAGS params in short: feats_transVgg16big_nsampling_rnd_win64_neg_samples4_lcontexts2_rcontexts2_flts40_embsize100_fc_size2048_unit_norm_var_dropout_keep0.9_l2_reg0.0001_featinput_filelist.english.train_dot_combine_tied_embs
np.bincount: [127 129]
len: 256 256
true labels, out (first 40 dims): [(1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (0.0, 0.0), (0.0, 0.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0)]
accuracy: 0.98828125
majority class accuracy: 0.5

Embedding transformations:

Several embedding transformations can be used (selected with --embedding_transformation).

We recommend 'Vgg16' or 'Vgg16big', depending on the amount of data, since they offer good performance and are among the fastest to train. Note that the minimum window width is 16 frames for 'Vgg16' and 32 frames for 'Vgg16big'. The ResNet variants may currently be buggy (there seem to be issues with Tensorflow's batch normalization in a Siamese neural network architecture).
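As a rough illustration (not a configuration from the paper), a 'Vgg16' model could be trained with a shorter window, as long as the window stays at or above the 16-frame minimum:

#!/bin/bash
# Hypothetical smaller configuration: Vgg16 with a 32-frame window
# (all other flags as in the TED-LIUM example above).
CUDA_VISIBLE_DEVICES=0 python3 unsup_model_neg.py --window_length 32 --window_neg_length 32 \
--filelist /srv/data/kaldi/egs/tedlium/s5_r2/data/train_fbank_sp/feats_unnormalized.ark --noend_to_end \
--embedding_transformation Vgg16 --l2_reg 0.0001 --batch_size 32 --left_contexts 2 --right_contexts 2 \
--unit_normalize_var True --tied_embeddings_transforms True --learn_rate 0.0003 --fc_size 2048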

Large scale training:

By default, features are loaded into memory. For larger datasets, where the data does not fit into main memory, memory-mapping from disk is also supported. We recommend a fast SSD (e.g. M.2 NVMe) for the memory-mapped cache, since the number of random-access reads will be very high.

Loading features manually into a mmap cache in a Python3 shell:

import kaldi_io
import numpy

# Read all utterances from the ark file and cache the feature matrices in a
# memory-mapped directory (ideally placed on a fast SSD):
utts, feats = kaldi_io.readArk('/srv/home/milde/youtube-tedx/tedx_feats.ark.gz',
                               memmap_dir='/scratch/tedx_mmap_cache', memmap_dtype='float32')

TEDx example

Train on the TEDx fbank features from the memory-mapped cache created above, on GPU #0 (this model is referred to as unspeech-128-tedx in the paper):


#!/bin/bash
CUDA_VISIBLE_DEVICES=0 python3 unsup_model_neg.py --window_length 128 --window_neg_length 128 \
--memmap_reuse_cache True --memmap_dir /scratch/tedx_mmap_cache --embedding_transformation Vgg16big \
--l2_reg 0.0001 --batch_size 32  --left_contexts 2 --right_contexts 2 --unit_normalize_var True \
--tied_embeddings_transforms True --fc_size 2048

Pretrained models

Speed-perturbed versions extend the dataset to 3x its original size by generating features at two additional playback speeds with sox: 0.9x and 1.1x.
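A minimal sketch of such a perturbation with sox (file names are placeholders; in a Kaldi setup, utils/data/perturb_data_dir_speed.sh automates this for a whole data directory):

#!/bin/bash
# Create 0.9x and 1.1x speed copies of a single recording with sox:
sox utt0001.wav utt0001_sp0.9.wav speed 0.9
sox utt0001.wav utt0001_sp1.1.wav speed 1.1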

We will soon upload more pre-trained models here.

Generating features

Generate features with the --gen_feat option, supplying the same options you used for training the model. The options --genfeat_interpolate_outputlength_padding and --genfeat_stride influence padding and stride behavior. If you want to use the feature output as an i-vector replacement for training Kaldi acoustic models, you have to name the output file ivector_online(.ark) and also create a file called ivector_period in the same directory as the ark file, containing the stride you used to generate the features.

You can download the pretrained model for the example below here. The example script follows the necessary naming conventions for Kaldi and assumes that unnormalized (no CMVN) 40-dimensional fbank features have been created for the TED-LIUM corpus (data/dev_fbank/unnormalized.feats.ark, data/test_fbank/unnormalized.feats.ark, data/train_cleaned_sp_hires_fbank_comb/unnormalized.feats.ark). You may need to change the kaldi_tedlium_home path.

#!/bin/bash
run=models/1520122885feats_transVgg16big_nsampling_rnd_win64_neg_samples4_lcontexts2_rcontexts2_flts40_embsize100_fc_size1024_unit_norm_var_dropout_keep0.9_l2_reg0.0001_featinput_commonvoice_train_valid_sp.ark_dot_combine_tied_embs

kaldi_tedlium_home=/srv/data/milde/kaldi/egs/tedlium/s5_r2/
tedlium_outdir=unspeech_64_sp_commonvoice
num_filters=40
embedding_transformation=Vgg16big
output_feat_format=kaldi_bin
num_highway_layers=5
num_dnn_layers=5
embedding_size=100
hop_size=1
genfeat_stride=10
additional_params="--fc_size 1024 --window_length 64 --window_neg_length 64 --unit_normalize_var --tied_embeddings_transforms --nogenerate_speaker_vectors --nogenfeat_combine_contexts --nokaldi_normalize_to_input_length --genfeat_stride $genfeat_stride --notest_perf --genfeat_interpolate_outputlength_padding"

echo "computing feats for dev set..."
python3 unsup_model_neg.py --gen_feat --train_dir $run  --filelist ${kaldi_tedlium_home}/data/dev_fbank/unnormalized.feats.ark  --num_filters $num_filters --embedding_transformation $embedding_transformation --num_highway_layers $num_highway_layers --embedding_size $embedding_size --num_dnn_layers $num_dnn_layers --hop_size $hop_size --additional_params $additional_params --output_feat_file ${kaldi_tedlium_home}/data/${tedlium_outdir}/dev/ivector_online
echo $genfeat_stride > ${kaldi_tedlium_home}/data/${tedlium_outdir}/dev/ivector_period

echo "computing feats for test set... "
python3 unsup_model_neg.py --gen_feat --train_dir $run  --filelist ${kaldi_tedlium_home}/data/test_fbank/unnormalized.feats.ark  --num_filters $num_filters --embedding_transformation $embedding_transformation --num_highway_layers $num_highway_layers --embedding_size $embedding_size --num_dnn_layers $num_dnn_layers --hop_size $hop_size --additional_params $additional_params --output_feat_file ${kaldi_tedlium_home}/data/${tedlium_outdir}/test/ivector_online
echo $genfeat_stride > ${kaldi_tedlium_home}/data/${tedlium_outdir}/test/ivector_period

echo "computing feats for train set... "
python3 unsup_model_neg.py --gen_feat --train_dir $run  --filelist ${kaldi_tedlium_home}/data/train_cleaned_sp_hires_fbank_comb/unnormalized.feats.ark  --num_filters $num_filters --embedding_transformation $embedding_transformation --num_highway_layers $num_highway_layers --embedding_size $embedding_size --num_dnn_layers $num_dnn_layers --hop_size $hop_size --additional_params $additional_params --output_feat_file ${kaldi_tedlium_home}/data/${tedlium_outdir}/train_cleaned_sp_comb/ivector_online
echo $genfeat_stride > ${kaldi_tedlium_home}/data/${tedlium_outdir}/train_cleaned_sp_comb/ivector_period

Clustering features

You can cluster the generated ark files with the cluster.py script; see cluster.py --help for further options. The default comparison metric is Euclidean; by default, all vectors of an utterance are averaged into a single vector and normalized to unit length. Depending on how you generated the features (both embedding transformations stacked together, or just the target embedding transformation), you may need to cut the vectors at a certain dimension using the --half_index option.

cluster.py needs the HDBSCAN package (pip3 install hdbscan), which provides a modern density-based clustering algorithm. The main parameter is the minimum cluster size, i.e. how many utterances a cluster needs to contain in order to form. The number of clusters does not have to be known a priori and no density parameter (eps in standard DBSCAN) needs to be set. Clustering is significantly sped up by approximate nearest neighbor search (only when the Euclidean metric is used) and has been tested with up to a million utterances; clustering many more utterances should also work without problems.

Use the --output_utt2spk option to save the clustering result to a text file containing one <id> and <speakerid> per line. Note: in order to use the clustered speaker IDs in Kaldi (training acoustic models with the clustered context/speaker IDs), you need to prepend each utterance ID with its corresponding speaker ID. The Unix utilities paste, cut and awk are useful for this; see the sketch below.
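For instance, assuming the clustering output has the format '<uttid> <clusterid>' per line, a new utt2spk file with prefixed utterance IDs could be created like this (a sketch, not part of cluster.py):

#!/bin/bash
# Prepend the cluster ID to each utterance ID; Kaldi expects utterance IDs to be
# prefixed by their speaker ID so that sorting by utterance also groups speakers:
awk '{print $2"-"$1, $2}' clustered_utt2spk | sort > utt2spk
# The same renaming must then also be applied to the utterance IDs in feats.scp, text, etc.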

You can additionally use the --utt2spk parameter to supply a gold speaker clustering (or set it to none, as in the example, to just do the clustering). If set, cluster.py will also score the clustering using the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI).

#!/bin/bash
python3 cluster.py --input-ark tedlium/unspeech_64_sp_commonvoice/train/feats.ark --half_index 100 --output_utt2spk tedlium/unspeech_64_sp_commonvoice/train/clustered_utt2spk --utt2spk none --mode cluster_speaker