Unspeech embeddings are based on unsupervised learning of context feature representations of spoken language. Variance and variability in recordings of speech and their representations are a common problem in automatic speech processing tasks: e.g. speaker characteristics, environment characteristics and the type of microphone cause large differences in typical speech representations (e.g. FBANK, MFCC), making direct similarity comparisons difficult. Such factors of variance can also be described as the context of an utterance; speech sounds that occur close in time share similar contexts. Unspeech allows you to learn embeddings of such contexts in an unsupervised way on raw speech data: speaker IDs, channel information or transcriptions are not needed.
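To make the training signal concrete, here is a minimal sketch (plain NumPy, not the project's actual sampling code) of how target, context and negative windows could be drawn from a single utterance; the parameters mirror the --window_length, --left_contexts and --right_contexts options used in the training commands below:

import numpy as np

def sample_context_pairs(feats, window_length=64, left_contexts=2, right_contexts=2,
                         neg_samples=4):
    # feats: (num_frames, num_filters) feature matrix of a single utterance
    num_windows = left_contexts + 1 + right_contexts
    max_start = feats.shape[0] - num_windows * window_length
    start = np.random.randint(0, max_start + 1)
    windows = [feats[start + i * window_length:start + (i + 1) * window_length]
               for i in range(num_windows)]
    target = windows[left_contexts]                        # center window
    positives = windows[:left_contexts] + windows[left_contexts + 1:]
    # negative windows: random positions, ideally drawn from other utterances
    neg_starts = np.random.randint(0, feats.shape[0] - window_length + 1, size=neg_samples)
    negatives = [feats[s:s + window_length] for s in neg_starts]
    return target, positives, negatives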
Installation instructions:
git clone https://gitlab.com/milde/unspeech
pip3 install tensorflow==1.4.1 numpy matplotlib wavefile sklearn
TensorFlow versions newer than 1.4.1 currently break the code; updates to fix issues with newer versions will be provided soon. python3 unsup_model_neg.py is the main script to train new models and to use existing models, with many options to control various parameters of the model; see python3 unsup_model_neg.py --help
Use the --filelist option to supply either a Kaldi .scp file or an .ark file directly. Utterances shorter than what is necessary to sample a positive context pair are automatically discarded, as sketched below.
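A hedged sketch of the implied length check, assuming non-overlapping windows (the function name is illustrative, not the actual implementation):

def is_long_enough(num_frames, window_length=64, left_contexts=2, right_contexts=2):
    # An utterance must at least fit the target window plus all of its context windows.
    min_frames = (left_contexts + 1 + right_contexts) * window_length
    return num_frames >= min_frames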
Train on a Tedlium fbank ark file on GPU #0 (this model is referred to as unspeech-64-ted in the paper):
#!/bin/bash
CUDA_VISIBLE_DEVICES=0 python3 unsup_model_neg.py --window_length 64 --window_neg_length 64 \
--filelist /srv/data/kaldi/egs/tedlium/s5_r2/data/train_fbank_sp/feats_unnormalized.ark --noend_to_end \
--embedding_transform Vgg16big --l2_reg 0.0001 --batch_size 32 --left_contexts 2 --right_contexts 2 \
--unit_normalize_var True --tied_embeddings_transforms True --learn_rate 0.0003 --fc_size 2048
The console output will periodically give you updates on the learning process, where the model is saved, and example predictions (is context / is not context). After about 24 hours the displayed training accuracy should get close to 1.0, e.g.:
At step 394000 step-time 0.2279 loss 0.0474 Model saving path is: /srv/data/unspeech_models/neg/runs/1520559132feats_transVgg16big_nsampling_rnd_win64_neg_samples4_lcontexts2_rcontexts2_flts40_embsize100_fc_size2048_unit_norm_var_dropout_keep0.9_l2_reg0.0001_featinput_filelist.english.train_dot_combine_tied_embs/tf10

Several embedding transforms can be used; the commands in this README use Vgg16big (set with the --embedding_transform option).
By default, features are loaded into memory. On larger datasets, where the data does not fit into main memory, memory mapping from disk is also supported. We recommend a fast SSD (e.g. M.2 NVMe) for the memory-mapped cache, as the amount of random access reads will be high.
import kaldi_io

# Read a (gzipped) Kaldi ark file; instead of loading everything into main
# memory, the features are cached as memory-mapped float32 arrays in memmap_dir.
utts, feats = kaldi_io.readArk('/srv/home/milde/youtube-tedx/tedx_feats.ark.gz',
                               memmap_dir='/scratch/tedx_mmap_cache', memmap_dtype='float32')
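Assuming readArk returns parallel lists of utterance IDs and feature matrices, the memory-mapped features can then be used like regular NumPy arrays:

# utts holds the utterance ids, feats the memory-mapped feature matrices
print(utts[0], feats[0].shape)  # e.g. an utterance id and (num_frames, 40)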
Train on the memory-mapped TEDx fbank features on GPU #0 (this model is referred to as unspeech-128-tedx in the paper):
#!/bin/bash
CUDA_VISIBLE_DEVICES=0 python3 unsup_model_neg.py --window_length 128 --window_neg_length 128 \
--memmap_reuse_cache True --memmap_dir /scratch/tedx_mmap_cache --embedding_transform Vgg16big \
--l2_reg 0.0001 --batch_size 32 --left_contexts 2 --right_contexts 2 --unit_normalize_var True \
--tied_embeddings_transforms True --fc_size 2048
Speed-perturbed versions extend the dataset by 3x, using two additional playback speeds with sox when generating features: 0.9x and 1.1x.
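A hedged sketch of how such speed-perturbed copies could be generated with sox (file names are illustrative; this mirrors what Kaldi's standard speed perturbation does when you generate features with Kaldi):

import subprocess

# create 0.9x and 1.1x speed-perturbed copies of a wav file with sox
for speed in ['0.9', '1.1']:
    subprocess.run(['sox', 'utt1.wav', 'utt1_sp{}.wav'.format(speed), 'speed', speed],
                   check=True)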
We will soon upload more pre-trained models here.