Last updated August 24, 2021
Creating An AI Text-to-Speech Using IBM Watson

Published on June 22, 2021

by Victor Dey

The past decade has seen some of the most groundbreaking developments in artificial intelligence. In recent years especially, data collection and analysis have grown considerably, helped by internet-connected devices and much faster computer processing. The effects show up in automobiles with self-driving cars, in healthcare with artificially intelligent robot systems that can aid a doctor during surgery, in manufacturing, and in many other domains. Artificial intelligence, combined with machine learning, has given us a wide spectrum of applications, with more still to be discovered in the years to come. Among the most visible of these are virtual voice assistants and voice and text recognition services. As life gets faster and busier, our voice has become an essential tool for issuing commands and getting results instantly. Consumer virtual assistants such as Amazon’s Alexa, Apple’s Siri, and Google Assistant have become part of our daily lives for obtaining information, scheduling and planning tasks, or leisure. But have you ever wondered what goes on behind the scenes? In this article we explore one piece of it: Text-to-Speech.

What is Text-to-Speech?

Text-to-Speech (TTS) is a form of speech synthesis in which an algorithm converts written text into human-sounding speech. Its main goal is to generate natural-sounding speech signals, for example for voice assistants. It is also the feature through which a computer or phone reads on-screen text aloud, often used as an accessibility aid for people who have trouble reading on-screen text, and convenient for anyone who prefers to have text read to them. Text-to-speech has become so common that people encounter it every day without realizing it: it is used in smart speakers, ebook readers, mapping and navigation software, word processors, and much more. The voice is usually computer-generated, and the reading speed can be increased or decreased as needed. Many tools also highlight words as they are read aloud so that the user can see and hear the text simultaneously. Text-to-speech is also a practical way to convert large volumes of text into playable audio.

About IBM-Watson Cloud

IBM Cloud is a platform that combines Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) to provide an integrated experience. IBM positions it as one of the most open and secure public clouds for businesses: a hybrid multi-cloud platform with advanced data and AI capabilities and deep enterprise expertise across 20 industries. It is a full-stack cloud platform with over 170 products and services covering essential IT domains such as data, containers, AI, IoT, and blockchain. It also provides solutions that enable higher levels of compliance, security, and management, with architecture patterns and methods for rapid delivery of mission-critical workloads. It is available across 19 countries and regions in North and South America, Europe, Asia, and Australia, so services can be deployed locally with global scalability.

The platform consists of multiple components that work together to provide a consistent and dependable cloud experience.

Getting started with AI Text to Speech using Watson Text-to-Speech

We will get a flavour of what it takes to build a Text-to-Speech pipeline and how it works. We will follow these steps:

Capture our text using Python.
Set up our Text-to-Speech model using IBM Watson TTS.
Create an output MP3 file that contains the audio for our text.

The code below follows the official implementation, whose video tutorial you can find here.

Creating The TTS Model

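First, we install the ibm-watson library, which lets us call the Watson services from Python. It can be installed through pip with the following command (the leading ! runs it as a shell command from a notebook cell; drop it in a regular terminal).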

#installing ibm-watson library to help call services

!pip install ibm-watson

Setup The Cloud Services and Authentication

We first need to set up the Text to Speech service on IBM Cloud.

To do so, go to cloud.ibm.com/catalog.

Click on Services and, under Category, tick the AI/Machine Learning checkbox to filter the service modules.

Then click on Text to Speech and select the free plan, which offers up to 10k characters of conversion per month.

After doing so, we’ll write a few lines of Python to import the service client and the authenticator.

 #setup our text-to-speech module
 from ibm_watson import TextToSpeechV1
 from ibm_cloud_sdk_core.authenticators import IAMAuthenticator #Authenticate our Model

Once the service instance has been created, open its Manage page, copy the API key and URL, and paste them into the code.

 # Creds Text to Speech
 apikey = 'KEY HERE'
 url = 'URL HERE'
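Pasting the key directly into a notebook is fine for a quick test, but it is easy to leak when the notebook is shared. As a minimal sketch, the credentials could instead be read from environment variables. The variable names WATSON_TTS_APIKEY and WATSON_TTS_URL and the helper load_credentials are hypothetical, chosen only for this example; they are not part of the IBM SDK.

```python
import os


def load_credentials(env=os.environ):
    """Return (apikey, url) read from environment variables.

    WATSON_TTS_APIKEY and WATSON_TTS_URL are hypothetical names used only
    in this sketch; pick whatever names suit your setup.
    """
    names = ("WATSON_TTS_APIKEY", "WATSON_TTS_URL")
    values = [env.get(name) for name in names]
    # Report every variable that is unset or empty in one error message
    missing = [name for name, value in zip(names, values) if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(values)


# Usage, once both variables are set in your environment:
# apikey, url = load_credentials()
```

The function fails early with a clear message when a variable is missing, rather than letting the service reject an empty key later.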

Now we create the authenticator and the service client, and point the client at our service URL, using the following code.

 #setup service
 authenticator = IAMAuthenticator(apikey)
 #Create our service
 tts = TextToSpeechV1(authenticator=authenticator)
 #set the IBM service url
 tts.set_service_url(url)

Demo Testing A Basic Language Model

We will first test the service with a single line of text, writing the result to an audio file named speech.mp3. We call the synthesize function of the IBM Watson client to convert the input text to speech and request MP3 as the output audio format.

 with open('./speech.mp3', 'wb') as audio_file:
     res = tts.synthesize('Hello World!', accept='audio/mp3', voice='en-US_AllisonV3Voice').get_result()
     audio_file.write(res.content) #write the content to the audio file
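Since the same three lines are repeated for every file later in this article, they could be wrapped in a small helper. This is only a sketch: synthesize_to_file is a hypothetical name, and it assumes the client passed in behaves like the tts object created above (a synthesize method whose result exposes the audio bytes as .content).

```python
def synthesize_to_file(client, text, path,
                       voice="en-US_AllisonV3Voice", accept="audio/mp3"):
    """Synthesize `text` with `client` and write the audio bytes to `path`.

    `client` is assumed to behave like the `tts` object created earlier.
    Returns the number of bytes written.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    # Call the service first, so a failed request does not leave an empty file behind
    response = client.synthesize(text, accept=accept, voice=voice).get_result()
    with open(path, "wb") as audio_file:
        return audio_file.write(response.content)


# Usage with the client created earlier:
# synthesize_to_file(tts, 'Hello World!', './speech.mp3')
```

Making the request before opening the file is a deliberate choice here: if the call raises, no empty or partial MP3 is left on disk.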

You will find the audio output in the path provided when the code is successfully executed.

Reading Text from our File

We will now use the tested service to convert a text file into audio. Here I have used a speech by Winston Churchill as the text input.

 #reading the input text file
 with open('/content/Churchill.txt', 'r') as f:
     text = f.readlines()
 #view the contents
 text

It will give us the following output.

 ['We shall go on to the end, we shall fight in France, we shall fight on the seas and oceans, \n',
  'we shall fight with growing confidence and growing strength in the air, we shall defend our \n',
  'Island, whatever the cost may be, we shall fight on the beaches, we shall fight on the landing \n',
  'grounds, we shall fight in the fields and in the streets, we shall fight in the hills; we \n',
  'shall never surrender, and even if, which I do not for a moment believe, this Island or a \n',
  'large part of it were subjugated and starving, then our Empire beyond the seas, armed and \n',
  'guarded by the British Fleet, would carry on the struggle, until, in God’s good time, the \n',
  'New World, with all its power and might, steps forth to the rescue and the liberation of the old.']

Next, we remove the newline character at the end of each line.

 text = [line.replace('\n','') for line in text] #removing the newline character from each line
 text #view the converted lines

We shall go on to the end, we shall fight in France, we shall fight on the seas and oceans, we shall fight with growing confidence and growing strength in the air, we shall defend our Island, whatever the cost may be, we shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and in the streets, we shall fight in the hills; we shall never surrender, and even if, which I do not for a moment believe, this Island or a large part of it were subjugated and starving, then our Empire beyond the seas, armed and guarded by the British Fleet, would carry on the struggle, until, in God’s good time, the New World, with all its power and might, steps forth to the rescue and the liberation of the old.

Next, we concatenate the lines into a single string that we can feed to the service.

text = ''.join(text) #concatenate the lines into a single string
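The approach above works for this file because every line happens to end with a space before the newline; without that space, the last word of one line would fuse with the first word of the next. A small alternative sketch that does not depend on this is to collapse all whitespace in one step. The helper name normalize_text is hypothetical.

```python
def normalize_text(raw):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(raw.split())


sample = "we shall fight on the seas and oceans, \nwe shall fight in the air\nwe shall defend our Island"
print(normalize_text(sample))
# we shall fight on the seas and oceans, we shall fight in the air we shall defend our Island
```

With this helper, the file could be read with f.read() and passed through normalize_text instead of being split into lines and joined again.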

Generating the Output

Now we generate the output audio file from the text. You can choose the voice according to the language and the gender of voice you need. The details of all available voices and languages can be viewed from here.

 with open('./winston.mp3', 'wb') as audio_file:
     res = tts.synthesize(text, accept='audio/mp3', voice='en-GB_JamesV3Voice').get_result() #selecting the audio format and voice
     audio_file.write(res.content) #writing the synthesized audio to the file while it is still open

You will find your newly created audio file named “winston.mp3” inside the path provided!
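To choose a voice by language and gender in code rather than by browsing the list, the voice metadata can be filtered. The sketch below assumes a payload shaped like the result of the SDK's list_voices call, a dictionary with a "voices" list whose entries carry "name", "language" and "gender". The sample payload is hand-written for illustration, not real service output, and pick_voices is a hypothetical helper.

```python
def pick_voices(payload, language=None, gender=None):
    """Return the voice names in `payload` that match the given language and gender."""
    names = []
    for voice in payload.get("voices", []):
        if language and voice.get("language") != language:
            continue
        if gender and voice.get("gender") != gender:
            continue
        names.append(voice["name"])
    return sorted(names)


# Hand-written sample in the assumed payload shape (not real service output)
sample_payload = {"voices": [
    {"name": "en-GB_JamesV3Voice", "language": "en-GB", "gender": "male"},
    {"name": "en-US_AllisonV3Voice", "language": "en-US", "gender": "female"},
    {"name": "es-US_SofiaV3Voice", "language": "es-US", "gender": "female"},
]}

print(pick_voices(sample_payload, language="en-GB"))
# ['en-GB_JamesV3Voice']
```

Against the live service, the payload would come from tts.list_voices().get_result() instead of the sample dictionary.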

Using a Different Language Model

You can use the same method to synthesize speech in a different language.

I have created another audio file from Spanish text, using a Spanish voice from IBM Watson.

#input text
casa = """Mi nueva casa está en una calle ancha que tiene muchos árboles.
El piso de arriba de mi casa tiene tres dormitorios y un despacho para trabajar.
El piso de abajo tiene una cocina muy grande, un comedor con una mesa y seis sillas,
un salón con dos sofás verdes, una televisión y cortinas.
Además, tiene una pequeña terraza con piscina donde puedo tomar el sol en verano.
Me gusta mucho mi casa porque puedo invitar a mis amigos a cenar o a ver el fútbol en mi televisión.
Además, cerca de mi casa hay muchas tiendas para hacer la compra, como panadería, carnicería y pescadería."""

#synthesize and write the output into an MP3 audio file
with open('./casa.mp3', 'wb') as audio_file:
    res = tts.synthesize(casa, accept='audio/mp3', voice='es-US_SofiaV3Voice').get_result()
    audio_file.write(res.content)
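Reading speed can also be influenced from the input itself. Watson Text to Speech accepts SSML as well as plain text, and SSML's prosody element has a rate attribute. The sketch below only builds the SSML string. Which rate values a particular voice honours is an assumption to confirm against IBM's SSML documentation, and with_speaking_rate is a hypothetical helper name.

```python
from xml.sax.saxutils import escape, quoteattr


def with_speaking_rate(text, rate="slow"):
    """Wrap plain text in SSML that asks for the given speaking rate."""
    # escape() protects &, < and > in the text; quoteattr() quotes the attribute value
    return "<speak><prosody rate={}>{}</prosody></speak>".format(
        quoteattr(rate), escape(text)
    )


print(with_speaking_rate("Tom & Jerry", rate="fast"))
# <speak><prosody rate="fast">Tom &amp; Jerry</prosody></speak>
```

The resulting string would be passed to tts.synthesize in place of the plain text, with the same accept and voice arguments as before.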

EndNotes

We have now learned how to convert text files into MP3 audio files and implemented text-to-speech by following the steps above. You can try bigger text files and play with spacing and punctuation to see how the speed and rhythm of the speech change. The full Colab file for this walkthrough can be accessed from here.
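For bigger text files it can help to send the text in pieces, both to keep each request small and to keep an eye on the free plan's 10k characters per month. The following is a minimal sketch that splits text after sentence-ending punctuation. The helper split_into_chunks is hypothetical, and the default of 2000 characters is an arbitrary choice for this example, not a documented service limit.

```python
import re


def split_into_chunks(text, max_chars=2000):
    """Split text into chunks of at most max_chars characters.

    Breaks after sentence-ending punctuation where possible; a single
    sentence longer than max_chars is cut at the limit.
    """
    sentences = re.split(r"(?<=[.!?;])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split a sentence that cannot fit in one chunk on its own
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = (current + " " + sentence).strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("One. Two! Three? Four.", max_chars=10))
# ['One. Two!', 'Three?', 'Four.']
```

Each chunk could then be synthesized into its own numbered MP3 file, and the sum of len(chunk) over all chunks gives the number of characters that will count against the monthly allowance.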

Happy Learning!

Victor Dey

Victor is an aspiring Data Scientist & is a Master of Science in Data Science & Big Data Analytics. He is a Researcher, a Data Science Influencer and also an Ex-University Football Player. A keen learner of new developments in Data Science and Artificial Intelligence, he is committed to growing the Data Science community.