Text-to-speech (TTS) synthesis has applications in a wide range of scenarios. It can be used to read PDFs aloud, help the visually impaired interact with text, make chatbots more interactive, and so on. Historically, many systems have tackled this task using signal processing and deep learning approaches. In this article, let’s explore a novel approach to synthesizing speech from text, presented by Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno and Yonghui Wu, researchers at Google, in a paper published on 2nd January 2019.
Multi-speaker text-to-speech synthesis refers to a system that can generate speech in different users’ voices. With traditional TTS approaches, collecting data and training a model for each user can be a hassle.
Speaker Verification to Text to Speech Synthesis
A new approach with three independent components is introduced to provide an efficient solution to multi-speaker adaptation during speech synthesis. These components are deep learning models that are trained independently of each other. Let’s understand what each of these components does.
The first component is the speaker encoder. Each speaker’s voice information is encoded in an embedding. This embedding is generated by a neural network trained with a speaker verification loss, which is computed by trying to predict whether two utterances come from the same user or not.
The embeddings are supposed to have high similarity if and only if they are from the same user.
Note that this training does not need any information about the text that we are trying to vocalize. Moreover, once trained on a large corpus of unlabelled voices containing background noise and disturbances, the model learns to capture crucial information about a speaker’s voice characteristics. This enables us to generate embeddings for new users without changing the network’s parameters. It requires no more than a few seconds of the target user’s speech, on any text.
This makes the embeddings agnostic to the downstream task, allowing the encoder to be trained independently of the synthesis models that follow.
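To make the "high similarity if and only if same user" idea concrete, here is a minimal sketch that compares utterance embeddings with cosine similarity. The toy 4-dimensional vectors, the helper names and the 0.75 threshold are illustrative assumptions, not values from the paper; real embeddings come from the trained speaker encoder.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(embed_a, embed_b, threshold=0.75):
    """Decide whether two utterance embeddings belong to one speaker.

    The threshold is an illustrative assumption; in practice it would be
    tuned on held-out verification data.
    """
    return cosine_similarity(embed_a, embed_b) >= threshold

# Toy "embeddings": two utterances from one speaker, one from another.
alice_1 = np.array([0.9, 0.1, 0.0, 0.4])
alice_2 = np.array([0.8, 0.2, 0.1, 0.5])
bob_1 = np.array([0.0, 0.9, 0.4, 0.1])

print(same_speaker(alice_1, alice_2))  # True
print(same_speaker(alice_1, bob_1))    # False
```

A verification loss pushes the encoder toward exactly this behaviour: similarities between utterances of one speaker go up, and similarities across speakers go down.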
The second component, the synthesizer, is the core text-to-speech model. It takes a sequence of phonemes as input and generates a spectrogram for the corresponding text. Phonemes are the distinct units of sound that make up words. Each word is decomposed into its phonemes, and together they form the input sequence to the model. This model also consumes speaker encodings to support multiple speakers’ voices.
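As a rough illustration of how text becomes a phoneme sequence, the sketch below looks words up in a tiny hand-written lexicon and maps each phoneme to an integer id. The lexicon, the function names and the id scheme are hypothetical; a real system would use a full pronunciation dictionary or a trained grapheme-to-phoneme model.

```python
# Tiny hand-written lexicon using ARPAbet symbols (stress markers omitted).
# Purely illustrative: it only knows two words.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def text_to_phonemes(text, lexicon=TOY_LEXICON):
    """Split text into words and concatenate each word's phonemes."""
    phonemes = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word not in lexicon:
            raise KeyError(f"no pronunciation for {word!r}")
        phonemes.extend(lexicon[word])
    return phonemes

def phonemes_to_ids(phonemes, lexicon=TOY_LEXICON):
    """Map phoneme symbols to integer ids, as a model input would need."""
    symbols = sorted({p for seq in lexicon.values() for p in seq})
    table = {p: i for i, p in enumerate(symbols)}
    return [table[p] for p in phonemes]

sequence = text_to_phonemes("Hello, world!")
print(sequence)  # ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
print(phonemes_to_ids(sequence))
```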
The speaker encoding is concatenated with the synthesizer encoder’s output at every time step. This speaker encoding can be generated in the previous step, completely unaware of the current phoneme sequence. In fact, these embeddings can even be random samples from the distribution of these encodings. If we give the model such random vectors, it will generate a synthetic voice that resembles human voices.
The model is trained by minimizing the L2 loss between the generated spectrogram and the target mel spectrogram. The target mel spectrogram is obtained by splitting the audio into short time segments, computing the frequency components of each segment, and converting them to the mel scale. The mel scale is a fixed non-linear transformation of frequency that maps it onto a human perceptual scale.
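The conditioning step can be sketched in a few lines of NumPy, assuming the embedding is appended to the encoder output at every time step as the SV2TTS paper describes. The function name and the sizes used here (8 time steps, 512 encoder features, a 256-dimensional embedding) are illustrative assumptions, and random arrays stand in for real model outputs.

```python
import numpy as np

def condition_on_speaker(encoder_outputs, speaker_embedding):
    """Append the same speaker embedding to every encoder time step.

    encoder_outputs:   (T, D_enc) array, one row per input phoneme
    speaker_embedding: (D_spk,) array for the target speaker
    returns:           (T, D_enc + D_spk) array
    """
    num_steps = encoder_outputs.shape[0]
    tiled = np.tile(speaker_embedding, (num_steps, 1))  # (T, D_spk)
    return np.concatenate([encoder_outputs, tiled], axis=1)

rng = np.random.default_rng(0)
encoder_outputs = rng.standard_normal((8, 512))   # stand-in for encoder output
speaker_embedding = rng.standard_normal(256)      # stand-in for an embedding
speaker_embedding /= np.linalg.norm(speaker_embedding)

conditioned = condition_on_speaker(encoder_outputs, speaker_embedding)
print(conditioned.shape)  # (8, 768)
```

Because the embedding is just an extra input vector, swapping in a different speaker's embedding, or a randomly sampled one, changes the voice without touching the phoneme sequence.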
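The sketch below shows the two ideas in this paragraph in isolation: a common formula for converting Hz to mel, and an L2 (mean squared error) loss between two spectrograms. The formula `2595 * log10(1 + f / 700)` is one widely used variant of the mel scale; the article does not say which variant the authors used, so treat it as an assumption, along with the helper names.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mel (a common log-based variant)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=np.float64) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=np.float64) / 2595.0) - 1.0)

def l2_loss(predicted, target):
    """Mean squared difference between two spectrograms of equal shape."""
    predicted = np.asarray(predicted, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.mean((predicted - target) ** 2))

# Band edges equally spaced in mel become progressively wider in Hz,
# mirroring the ear's coarser resolution at high frequencies.
edges_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), num=6)
edges_hz = mel_to_hz(edges_mel)
print(np.round(edges_hz))
print(np.round(np.diff(edges_hz)))  # band widths grow with frequency

target = np.ones((80, 10))            # stand-in for a target mel spectrogram
print(l2_loss(target + 0.5, target))  # 0.25
```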
Reconstructing audio from a spectrogram is not as straightforward as generating a spectrogram from audio samples. To generate audio, the third component, a vocoder network, is used.
A sample-by-sample autoregressive WaveNet model is used to perform voice generation. This model takes the mel spectrogram as input and generates time-domain waveforms.
The dilated causal convolution blocks used in the model are quite interesting. They enforce the causality of sequence data by restricting each convolution to look only at values from the current and previous time steps. But this narrows the receptive field of each neuron, so a very deep model would be needed to cover a long context. Dilation addresses this by skipping over a fixed number of inputs in previous time steps, which widens each neuron’s range in deeper layers.
This model doesn’t need a separate representation of the target speaker, as the spectrogram already contains all the necessary information. Once the model is trained on a large enough corpus containing multiple speakers’ voices, it becomes good at generating the voices of unseen speakers.
Inference on this system is zero-shot: no retraining is needed for a new speaker. We just need a few seconds of speech from the new user. The speaker encoder generates a speaker encoding from that sample, which is then used along with the target text to synthesize speech.
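A small NumPy sketch can show how dilation widens the receptive field while keeping the convolution causal. This is an illustrative toy with kernel size 2 and hand-picked weights, not the WaveNet implementation; the function names are assumptions.

```python
import numpy as np

def causal_dilated_conv1d(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation], with zeros before t = 0.

    Only the current and past samples contribute, so the output is causal.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.zeros_like(x)
    for k, w in enumerate(weights):
        shift = k * dilation
        if shift == 0:
            y += w * x
        elif shift < len(x):
            y[shift:] += w * x[:-shift]
    return y

def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Push a unit impulse through four stacked layers with doubling dilation.
dilations = [1, 2, 4, 8]
signal = np.zeros(32)
signal[0] = 1.0
out = signal
for d in dilations:
    out = causal_dilated_conv1d(out, weights=[1.0, 1.0], dilation=d)

print(receptive_field(2, dilations))      # 16
print(int(np.count_nonzero(out)))         # 16: the impulse spreads over 16 steps
print(receptive_field(2, [1, 1, 1, 1]))   # 5: same depth without dilation
```

With the same four layers, doubling the dilation covers 16 time steps instead of 5, which is why stacks of dilated layers can model long audio contexts without becoming extremely deep.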
Real Time Voice Cloning Application
Corentin Jemine built a GUI deep learning framework for text-to-speech synthesis using speaker verification. It enables us to clone a voice from about 5 seconds of audio and generate arbitrary speech. This application is a PyTorch implementation of SV2TTS. Following is the description of the tool by the author.
SV2TTS is defined as a three-stage deep learning framework that can generate numerical representations of a voice by using only a few seconds of audio and use it to condition a text-to-speech model trained to generalize to new voices.
The demo code in this article is referenced from here
This GUI application is built using PyQt5, so running it on Colab won’t be straightforward. A local system with a GPU would be a better option.
Clone the repository using
Create a virtual environment and install PyTorch and the required packages
!pip install virtualenv
!python -m venv rtvcenv
!.\rtvcenv\Scripts\activate
Install PyTorch from here
!cd <path_to_rtvc_repo>
!pip install -r requirements.txt
Download ffmpeg and add it to the system path
Now we need to download pretrained models from here
To do a test run, execute the demo_cli.py file. You should see something like this if the application is set up properly
There is a GUI that abstracts all the synthesis steps and makes this application usable by non-programmers as well. A tutorial video on using the GUI application can be found here.
The following snippet from demo_cli.py shows the usage of this model from code.
# Excerpt from the interactive loop in demo_cli.py. The names args, encoder,
# synthesizer, Synthesizer, vocoder and num_generated are set up earlier in
# that script.
from pathlib import Path

import librosa
import numpy as np
import sounddevice as sd
import soundfile as sf
import torch

while True:
    message = ("Reference voice: enter an audio filepath of a voice to be cloned "
               "(mp3, wav, m4a, flac, ...):\n")
    in_fpath = Path(input(message).replace('"', "").replace("'", ""))

    if in_fpath.suffix.lower() == ".mp3" and args.no_mp3_support:
        print("Can't Use mp3 files please try again:")
        continue

    ## Computing the embedding
    # First, we load the wav using the function that the speaker encoder provides. This is
    # important: there is preprocessing that must be applied.
    # The following two methods are equivalent:
    # - Directly load from the file path:
    preprocessed_wav = encoder.preprocess_wav(in_fpath)
    # - If the wav is already loaded:
    original_wav, sampling_rate = librosa.load(str(in_fpath))
    preprocessed_wav = encoder.preprocess_wav(original_wav, sampling_rate)
    print("Uploaded file successfully")

    # Then we derive the embedding. There are many functions and parameters that the
    # speaker encoder interfaces. These are mostly for in-depth research. You will typically
    # only use this function (with its default parameters):
    embed = encoder.embed_utterance(preprocessed_wav)
    print("Created the embedding")

    ## Generating the spectrogram
    text = input("Write a sentence (+-20 words) to be synthesized:\n")

    # If seed is specified, reset torch seed and force synthesizer reload
    if args.seed is not None:
        torch.manual_seed(args.seed)
        synthesizer = Synthesizer(args.syn_model_fpath)

    # The synthesizer works in batch, so you need to put your data in a list or numpy array
    texts = [text]
    embeds = [embed]
    # If you know what the attention layer alignments are, you can retrieve them here by
    # passing return_alignments=True
    specs = synthesizer.synthesize_spectrograms(texts, embeds)
    spec = specs[0]  # batch of one, so take the first spectrogram
    print("Created the mel spectrogram")

    ## Generating the waveform
    print("Synthesizing the waveform:")

    # If seed is specified, reset torch seed and reload vocoder
    if args.seed is not None:
        torch.manual_seed(args.seed)
        vocoder.load_model(args.voc_model_fpath)

    # Synthesizing the waveform is fairly straightforward. Remember that the longer the
    # spectrogram, the more time-efficient the vocoder.
    generated_wav = vocoder.infer_waveform(spec)

    # Trim excess silences to compensate for gaps in spectrograms (issue #53)
    generated_wav = encoder.preprocess_wav(generated_wav)

    # Play the audio (non-blocking)
    if not args.no_sound:
        try:
            sd.stop()
            sd.play(generated_wav, synthesizer.sample_rate)
        except sd.PortAudioError as e:
            print("\nCaught exception: %s" % repr(e))
            print('Continuing without audio playback. Suppress this message with the "--no_sound" flag.\n')

    # Save it on the disk
    filename = "demo_output_%02d.wav" % num_generated
    print(generated_wav.dtype)
    sf.write(filename, generated_wav.astype(np.float32), synthesizer.sample_rate)
    num_generated += 1
    print("\nSaved output as %s\n" % filename)
Run the above code and provide a recording of your own voice speaking any English sentence. A synthesized utterance in that voice will be generated and saved to disk.