Unspeech embeddings are based on unsupervised learning of context feature representations of spoken language. Variability across recordings of speech is a common problem in automatic speech processing tasks: the speaker, the environment characteristics and the type of microphone all cause large differences in typical speech representations (e.g. FBANK, MFCC), making direct similarity comparisons difficult. Such factors of variance can also be described as the context of an utterance; speech sounds that occur close in time share similar contexts. Unspeech lets you learn embeddings of such contexts in an unsupervised way on raw speech data: speaker IDs, channel information or transcriptions are not needed.
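To make the training signal concrete, here is a minimal Python/numpy sketch of the sampling idea, not the code in unsup_model_neg.py: the function name and the exact sampling scheme are assumptions, while window_length, left_contexts and right_contexts mirror the training flags used below.

import numpy as np

def sample_context_pairs(feats, window_length=64, left_contexts=2, right_contexts=2):
    """Sample one positive and one negative (target, context) window pair
    from a feature matrix of shape (num_frames, num_filters).

    Sketch only: the real unsup_model_neg.py samples several negatives
    per target and batches pairs across utterances.
    """
    num_frames = feats.shape[0]
    num_windows = left_contexts + 1 + right_contexts
    assert num_frames >= num_windows * window_length, "utterance too short"

    # Choose a target window so that all context windows fit around it.
    target_start = np.random.randint(left_contexts * window_length,
                                     num_frames - (right_contexts + 1) * window_length + 1)
    target = feats[target_start:target_start + window_length]

    # Positive pair: one of the adjacent context windows.
    offset = np.random.choice([o for o in range(-left_contexts, right_contexts + 1) if o != 0])
    ctx_start = target_start + offset * window_length
    positive = feats[ctx_start:ctx_start + window_length]

    # Negative pair: a window sampled from a random other position.
    neg_start = np.random.randint(0, num_frames - window_length + 1)
    negative = feats[neg_start:neg_start + window_length]
    return target, positive, negative

A (target, positive) pair drawn this way is labeled "is context", a (target, negative) pair "is not context"; the assert also shows why utterances below a minimum length cannot be used.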
Installation instructions:
git clone https://gitlab.com/milde/unspeech
pip3 install tensorflow numpy matplotlib wavefile scikit-learn hdbscan
You need a recent version of TensorFlow (1.5+) to run the code; we recommend TensorFlow 1.8 with Python 3.
unsup_model_neg.py is the main script, both for training new models and for generating Unspeech features with trained models. It offers many options to control the model's parameters; see 'python3 unsup_model_neg.py --help'.
Use the --filelist option to supply either a Kaldi .scp file or an .ark file directly. Utterances shorter than the minimum length needed to sample a positive context pair are discarded automatically.
Train on a Tedlium fbank ark file on GPU #0 (this model is referred to as unspeech-64-ted in the paper):
#!/bin/bash
CUDA_VISIBLE_DEVICES=0 python3 unsup_model_neg.py --window_length 64 --window_neg_length 64 \
--filelist /srv/data/kaldi/egs/tedlium/s5_r2/data/train_fbank_sp/feats_unnormalized.ark --noend_to_end \
--embedding_transformation Vgg16big --l2_reg 0.0001 --batch_size 32 --left_contexts 2 --right_contexts 2 \
--unit_normalize_var True --tied_embeddings_transforms True --learn_rate 0.0003 --fc_size 2048
The console output will periodically report on the learning process: where the model is saved and example predictions (is context / is not context). After about 24 hours the displayed training accuracy should get close to 1.0, e.g.:
At step 394000 step-time 0.2279 loss 0.0474 Model saving path is: /srv/data/unspeech_models/neg/runs/1520559132feats_transVgg16big_nsampling_rnd_win64_neg_samples4_lcontexts2_rcontexts2_flts40_embsize100_fc_size2048_unit_norm_var_dropout_keep0.9_l2_reg0.0001_featinput_filelist.english.train_dot_combine_tied_embs/tf10

Several embedding transformations are available; see the --embedding_transformation option in the help output.
By default, features are loaded into memory. On larger datasets, where the data does not fit into main memory, memory mapping from disk is also supported. We recommend a fast SSD (e.g. M.2 NVMe) for the memory-mapped cache: the amount of random-access reads will be very high.
import kaldi_io

# Read a (gzipped) Kaldi ark file once and memory-map the features to a
# fast on-disk cache instead of keeping them in main memory.
utts, feats = kaldi_io.readArk('/srv/home/milde/youtube-tedx/tedx_feats.ark.gz',
                               memmap_dir='/scratch/tedx_mmap_cache',
                               memmap_dtype='float32')
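The training command below then reuses this cache via --memmap_reuse_cache and --memmap_dir, so the ark file only needs to be converted once.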
Train on the TEDx fbank data on GPU #0 (this model is referred to as unspeech-128-tedx in the paper):
#!/bin/bash
CUDA_VISIBLE_DEVICES=0 python3 unsup_model_neg.py --window_length 128 --window_neg_length 128 \
--memmap_reuse_cache True --memmap_dir /scratch/tedx_mmap_cache --embedding_transformation Vgg16big \
--l2_reg 0.0001 --batch_size 32 --left_contexts 2 --right_contexts 2 --unit_normalize_var True \
--tied_embeddings_transforms True --fc_size 2048
Speed-perturbed versions extend the dataset to 3x its size by generating features at two additional playback speeds with sox: 0.9x and 1.1x.
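As an illustration only (feature pipelines typically do this with Kaldi's data-preparation scripts; the file names here are placeholders), the sox calls could look like this:

import subprocess

# Create 0.9x and 1.1x speed-perturbed copies of a recording with sox.
for speed in ("0.9", "1.1"):
    subprocess.run(["sox", "utt.wav", f"utt_sp{speed}.wav", "speed", speed],
                   check=True)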
We will soon upload more pre-trained models here.
Generate features with the --gen_feat option, supplying the same options you used to train the model. The options --genfeat_interpolate_outputlength_padding and --genfeat_stride influence padding and stride behavior. If you want to use the feature output as an i-vector replacement for training Kaldi acoustic models, you have to name the output file ivector_online(.ark) and also create a file called ivector_period in the same directory as the ark file, containing the stride you used to generate the features.
You can download the pretrained model for the example below here. The example script uses the necessary naming conventions for Kaldi and assumes that unnormalized (no CMVN) 40-dimensional fbank vectors have been created for the TED-LIUM corpus (data/dev_fbank/unnormalized.feats.ark, data/test_fbank/unnormalized.feats.ark, data/train_cleaned_sp_hires_fbank_comb/unnormalized.feats.ark). You may need to change the kaldi_tedlium_home path.
#!/bin/bash
run=models/1520122885feats_transVgg16big_nsampling_rnd_win64_neg_samples4_lcontexts2_rcontexts2_flts40_embsize100_fc_size1024_unit_norm_var_dropout_keep0.9_l2_reg0.0001_featinput_commonvoice_train_valid_sp.ark_dot_combine_tied_embs
kaldi_tedlium_home=/srv/data/milde/kaldi/egs/tedlium/s5_r2/
tedlium_outdir=unspeech_64_sp_commonvoice
num_filters=40
embedding_transformation=Vgg16big
output_feat_format=kaldi_bin
num_highway_layers=5
num_dnn_layers=5
embedding_size=100
hop_size=1
genfeat_stride=10
additional_params="--fc_size 1024 --window_length 64 --window_neg_length 64 --unit_normalize_var --tied_embeddings_transforms --nogenerate_speaker_vectors --nogenfeat_combine_contexts --nokaldi_normalize_to_input_length --genfeat_stride $genfeat_stride --notest_perf --genfeat_interpolate_outputlength_padding"
echo "computing feats for dev set..."
python3 unsup_model_neg.py --gen_feat --train_dir $run \
    --filelist ${kaldi_tedlium_home}/data/dev_fbank/unnormalized.feats.ark \
    --num_filters $num_filters --embedding_transformation $embedding_transformation \
    --num_highway_layers $num_highway_layers --embedding_size $embedding_size \
    --num_dnn_layers $num_dnn_layers --hop_size $hop_size \
    --additional_params "$additional_params" \
    --output_feat_file ${kaldi_tedlium_home}/data/${tedlium_outdir}/dev/ivector_online
echo $genfeat_stride > ${kaldi_tedlium_home}/data/${tedlium_outdir}/dev/ivector_period
echo "computing feats for test set... "
python3 unsup_model_neg.py --gen_feat --train_dir $run \
    --filelist ${kaldi_tedlium_home}/data/test_fbank/unnormalized.feats.ark \
    --num_filters $num_filters --embedding_transformation $embedding_transformation \
    --num_highway_layers $num_highway_layers --embedding_size $embedding_size \
    --num_dnn_layers $num_dnn_layers --hop_size $hop_size \
    --additional_params "$additional_params" \
    --output_feat_file ${kaldi_tedlium_home}/data/${tedlium_outdir}/test/ivector_online
echo $genfeat_stride > ${kaldi_tedlium_home}/data/${tedlium_outdir}/test/ivector_period
echo "computing feats for train set... "
python3 unsup_model_neg.py --gen_feat --train_dir $run \
    --filelist ${kaldi_tedlium_home}/data/train_cleaned_sp_hires_fbank_comb/unnormalized.feats.ark \
    --num_filters $num_filters --embedding_transformation $embedding_transformation \
    --num_highway_layers $num_highway_layers --embedding_size $embedding_size \
    --num_dnn_layers $num_dnn_layers --hop_size $hop_size \
    --additional_params "$additional_params" \
    --output_feat_file ${kaldi_tedlium_home}/data/${tedlium_outdir}/train_cleaned_sp_comb/ivector_online
echo $genfeat_stride > ${kaldi_tedlium_home}/data/${tedlium_outdir}/train_cleaned_sp_comb/ivector_period
You can cluster the generated ark files with the cluster.py script; see cluster.py --help for further options. The default comparison metric is Euclidean; by default, all vectors of a sequence are averaged into a single utterance vector, which is then normalized to unit length. Depending on how you generated the features (both embedding transformations stacked together, or just the target embedding transformation), you may need to cut the vectors at a certain dimension using the --half_index option.
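A minimal sketch of that averaging step, assuming numpy arrays of per-frame embeddings (the function name is hypothetical; the authoritative behavior is in cluster.py):

import numpy as np

def utterance_vector(frame_embeddings, half_index=None):
    """Average per-frame embeddings into a single utterance vector,
    optionally cut at half_index, and normalize to unit length."""
    v = frame_embeddings.mean(axis=0)
    if half_index is not None:
        v = v[:half_index]  # keep only the target embedding part
    return v / np.linalg.norm(v)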
cluster.py needs the HDBSCAN package for clustering (pip3 install hdbscan), which provides a modern density-based clustering algorithm. Its main parameter is the minimum cluster size, i.e. how many elements a cluster needs in order to form; the number of clusters does not have to be known a priori, and no density parameter needs to be set (eps in standard DBSCAN). Clustering is sped up significantly by approximate nearest-neighbor search (only when the Euclidean metric is used). It is tested to work with up to a million utterances, and clustering many more should also be possible without problems.
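For orientation, this is roughly what a direct hdbscan call on such utterance vectors looks like (a sketch with placeholder data; cluster.py adds metric options, approximate nearest-neighbor search and scoring on top):

import hdbscan
import numpy as np

# Placeholder data: one 100-dimensional unit-length vector per utterance.
X = np.random.randn(5000, 100)
X /= np.linalg.norm(X, axis=1, keepdims=True)

clusterer = hdbscan.HDBSCAN(min_cluster_size=5, metric='euclidean')
labels = clusterer.fit_predict(X)  # label -1 marks noise/outliers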
Use the --output_utt2spk option to save the clustering output into a text file containing one <id> and <speakerid> per line. Note: in order to use the clustered speaker IDs in Kaldi (training acoustic models with the clustered context/speaker IDs), you need to prepend each utterance ID with its corresponding speaker ID. The Unix utilities paste, cut and awk are useful tools for this; a small Python alternative is sketched below.
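A hedged sketch of that renaming step (file names are placeholders):

# Read cluster.py's utt2spk output and write a Kaldi-style utt2spk
# whose utterance IDs are prefixed with their cluster (speaker) ID.
with open('clustered_utt2spk') as f_in, open('utt2spk_prefixed', 'w') as f_out:
    for line in f_in:
        utt, spk = line.split()
        f_out.write(f'{spk}_{utt} {spk}\n')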
You can additionally use the --utt2spk parameter to supply a gold speaker clustering (or set it to none, as in the example, to just do the clustering). If set, cluster.py will also calculate cluster scores for the clustering, using the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI).
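ARI and NMI correspond to the standard scikit-learn implementations; a minimal sketch with placeholder label lists:

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# gold and predicted are cluster/speaker labels, one per utterance,
# in the same order (placeholder values).
gold = ['spk1', 'spk1', 'spk2', 'spk2']
predicted = [0, 0, 1, -1]

print('ARI:', adjusted_rand_score(gold, predicted))
print('NMI:', normalized_mutual_info_score(gold, predicted))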
#!/bin/bash
python3 cluster.py --input-ark tedlium/unspeech_64_sp_commonvoice/train/feats.ark --half_index 100 --output_utt2spk tedlium/unspeech_64_sp_commonvoice/train/clustered_utt2spk --utt2spk none --mode cluster_speaker