Being Tony Stark: How To Build A Voice Assistant Of Your Own?

In this article, we will build a basic, easy voice assistant of our own. It is customizable, so you can tweak it to suit your own needs and requirements.

I have always fantasized about flying in the Iron Man suit and wondered how cool that would be. Tony's personal voice assistant is super cool too. So today, my job is to make you feel a bit more like a tech maniac like Tony!

This article is dedicated to Tony! We love you 3000 <3.

We will build the voice assistant using the following technologies:

  1. Speech recognition
  2. Google Text-to-Speech
  3. Window automation
  4. Web browser automation

Let's call it Jarvis for the time being.

Features of our voice assistant:

  1. Tell me the current time
  2. Tell me the date
  3. Greet me when I run the program
  4. Redirect me to other pages
  5. Maximize and minimize the current window
  6. Play songs on YouTube
  7. Run a search query on Google

NOTE:

Do not change the spelling of the words in the if-else conditions, because the recognised text is compared against them exactly. The entire code can be found in the GitHub repo linked at the end of the article.

A slow internet connection will lead to delayed outputs.

Timing matters when you speak. I have added a delay of 1-2 seconds at each step so that you have time to speak, but it is still something to keep in mind.
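
The exact-spelling requirement in the note comes from comparing the recognised text to fixed strings with ==. Below is a minimal sketch of a more forgiving comparison. The helper names normalize and is_command are hypothetical and not part of the original code, and the approach only smooths over casing, punctuation and spacing. Spelling variants such as minimise and minimize still count as different words.

```python
import string


def normalize(text: str) -> str:
    """Lowercase the text, drop ASCII punctuation and collapse whitespace."""
    table = str.maketrans('', '', string.punctuation)
    return ' '.join(text.lower().translate(table).split())


def is_command(heard: str, expected: str) -> bool:
    """Compare what was heard with the expected phrase after normalizing both."""
    return normalize(heard) == normalize(expected)


# Casing and the trailing question mark no longer matter.
print(is_command('What is the time?', 'what is the time'))
# Different spellings are still different words.
print(is_command('minimise the window', 'minimize the window'))
```

With a helper like this, a condition such as command == 'What is the time?' could be written as is_command(command, 'what is the time').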

Dependencies

We need speech_recognition to recognise whatever we say and gTTS (Google Text-to-Speech) to convert text to speech. PyAutoGUI handles basic window automation, and webbrowser opens pages in the browser. webbrowser ships with Python's standard library, so it needs no installation. The Microphone class in speech_recognition also requires PyAudio.

An active internet connection is required.

pip install SpeechRecognition
pip install PyAudio
pip install gTTS
pip install pyautogui
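
Before running the assistant, it can help to confirm that the modules are importable. The sketch below is one way to do that with the standard library only. The function name missing_modules and the REQUIRED list are assumptions made for illustration; the names in the list are the import names used later in the article.

```python
from importlib.util import find_spec

# Import names used by the assistant (illustrative list).
REQUIRED = ['speech_recognition', 'gtts', 'pyautogui', 'webbrowser']


def missing_modules(names):
    """Return the top-level module names that cannot be found on this system."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print('Please install:', ', '.join(missing))
else:
    print('All dependencies found.')
```

find_spec only looks a top-level module up without importing it, so the check has no side effects such as opening a display connection.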

Importing packages

We use speech_recognition for speech recognition, webbrowser to open the browser, gTTS (Google Text-to-Speech) to convert text to speech, pyautogui for window automation, and datetime to access the current date and time.

from speech_recognition import Microphone, Recognizer 
import webbrowser as wb
from gtts import gTTS #text to speech
import pyautogui #for window automation
import datetime
import os
import time
import subprocess

Printing the time

now = str(datetime.datetime.now())
print(now)
# Output: 2020-10-16 16:19:45.872990
now = now.split()
print(now)
# Output: ['2020-10-16', '16:19:45.872990']
t = now[1].split(':') # splits the time '16:19:45.872990' into a list
hour = int(t[0]) # hour: 16
min = int(t[1]) # minutes: 19 (note: this name shadows Python's built-in min)
#print(hour)
#time is in 24-hour clock
#greeting the user

Since the hour is 16, i.e. 4 pm, the assistant will say good afternoon.

if hour < 12:
   greetings = 'Good Morning!' #before 12 noon: Good Morning
   lang = 'en'
   obj = gTTS(text=greetings,lang=lang,slow=True)
   obj.save('greetings.mp3') #saves the mp3 file
   os.system('greetings.mp3') #opens the mp3 file in the default player (Windows)
   time.sleep(2) #delay of 2 seconds
   pyautogui.moveTo(1910,10) #presumably the close button of the player window, assuming a 1920-pixel-wide screen
   pyautogui.click()
elif 12 <= hour <= 16:    #from 12 noon up to and including the 4 pm hour: Good Afternoon
   greetings = 'Good Afternoon!'
   lang = 'en'
   obj = gTTS(text=greetings, lang=lang)
   obj.save('greetings.mp3')
   os.system('greetings.mp3')
   time.sleep(1.6) #delay of 1.6 seconds
   pyautogui.moveTo(1910, 10)
   pyautogui.click()
elif 16 < hour < 21: #after the 4 pm hour and before 9 pm: Good Evening
   greetings = 'Good Evening!'
   lang = 'en' #en stands for the English language
   obj = gTTS(text=greetings, lang=lang, slow=True)
   obj.save('greetings.mp3')
   os.system('greetings.mp3')
   time.sleep(1.6)
   pyautogui.moveTo(1910, 10)
   pyautogui.click()
#no greeting is defined from 9 pm onwards
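
The three branches above differ only in the greeting text, so the choice of greeting can be pulled into a small function. This is a sketch, not the original code: greeting_for_hour is a hypothetical name, the fallback 'Hello!' for late hours is an assumption (the article defines no greeting from 9 pm onwards), and the hour is read from the datetime object directly instead of splitting its string form.

```python
import datetime


def greeting_for_hour(hour: int) -> str:
    """Pick a greeting for an hour on the 24-hour clock (0-23)."""
    if hour < 12:
        return 'Good Morning!'
    if hour <= 16:
        return 'Good Afternoon!'
    if hour < 21:
        return 'Good Evening!'
    return 'Hello!'  # assumption: fallback for 21:00 onwards


# datetime objects expose the hour directly, so no string splitting is needed.
now = datetime.datetime.now()
print(greeting_for_hour(now.hour))
```

The returned string can then be passed to gTTS once, instead of repeating the save-and-play steps in every branch.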

Menu

The program is menu-driven. Saying "show operations" lists the available operations. Saying "start operations" starts a second loop that displays the operations and listens for the one you want. Saying "thanks Jarvis" ends the program.
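
One common way to structure a menu like this is a dispatch table that maps each spoken phrase to a function. The sketch below only illustrates the idea. The handler functions and the dispatch helper are hypothetical, and they return strings here so that the example runs without a microphone.

```python
def show_operations():
    """Hypothetical handler: would list the available operations."""
    return 'showing operations'


def start_operations():
    """Hypothetical handler: would start the inner command loop."""
    return 'starting operations'


def say_goodbye():
    """Hypothetical handler: would end the program."""
    return 'Have a great day!'


# Spoken phrase -> function to call.
COMMANDS = {
    'show operations': show_operations,
    'start operations': start_operations,
    'thanks Jarvis': say_goodbye,
}


def dispatch(voice_text, commands):
    """Run the handler for a phrase, or return None if the phrase is unknown."""
    handler = commands.get(voice_text)
    if handler is None:
        return None
    return handler()


print(dispatch('show operations', COMMANDS))
```

Adding a new voice command then means adding one entry to the dictionary rather than another elif branch.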

Speech recognition

Speech is recognised with recognize_google(), which sends the recorded audio to Google's Web Speech API and returns the transcribed text. You need an active internet connection for speech recognition to work.

print('-----------------------------------')
print()
print('1. SHOW OPERATIONS')
print('2. START OPERATIONS')
print('3. THANKS JARVIS') #to end the process
time.sleep(2)
r = Recognizer() #object instantiation
mic = Microphone() #object instantiation
while True: #infinite until the loop breaks
   try:
       print()
       print('-----------------------------------')
       print('Speak now')
       print('-----------------------------------')
       print()
       with mic as source:
           audio = r.listen(source)
           voice_text = r.recognize_google(audio)
           try:
               if voice_text == 'start operations': #if the said words start operations, then the following code.
                   print(voice_text)
                   r1 = Recognizer()
                   mic1 = Microphone()
                   while True:
                       print()
                       print('-----------------------------------')
                       print('What operation?')
                       print()
                       print('1.Tell Date and Time')
                       print('2.Open Social Media Handles')
                       print('3.Play Songs on Youtube')
                       print('4.Maximize the current window')
                       print('5.Minimize the current window')
                       print('6.Search query on Google')
                       print('-----------------------------------')
                       print()
                       time.sleep(1) # 1 sec of delay
                       with mic1 as source_again:
                           audio1 = r1.listen(source_again) #listen with the same recognizer that transcribes
                           command = r1.recognize_google(audio1)
                           if command == 'maximize the window':
                               pyautogui.getActiveWindow().maximize()
                           elif command == 'minimise the window':
                               pyautogui.getActiveWindow().minimize()

To play songs, the assistant opens a YouTube search results page, where you can pick the song of your choice.

                           elif command == 'play songs on YouTube':
                               url = 'https://www.youtube.com/results?search_query=play+songs'
                               wb.get().open_new(url) #opens a new window on browser
                               time.sleep(1)
                               pyautogui.click(220,220)
                               break

Instead of dogs, you can use any word of your choice. I have used dogs here to keep it simple.

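                           elif command == 'search dogs on Google':
                               words = command.split(' ')
                               query = words[1] #the word after 'search', here 'dogs'
                               url = f'https://www.google.com/search?q={query}' #Google search results page
                               wb.get().open_new(url)
                           elif command == 'Open File explorer':
                               subprocess.Popen(['explorer', r'C:\Users\91884\Desktop']) #opens the folder in File Explorer (Windows only)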
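
The search branch above matches one fixed phrase. As a sketch of how any query could be supported, the function below pulls the words between "search" and "on Google" out of the command and builds a search URL from them. The name google_search_url is hypothetical, and the code assumes Google's usual https://www.google.com/search?q= URL format.

```python
from urllib.parse import quote_plus


def google_search_url(command: str):
    """Build a Google search URL from 'search <query> on Google', else return None."""
    words = command.split()
    is_search = (
        len(words) >= 4
        and words[0].lower() == 'search'
        and [w.lower() for w in words[-2:]] == ['on', 'google']
    )
    if not is_search:
        return None
    query = ' '.join(words[1:-2])
    # quote_plus escapes spaces and special characters for use in a URL.
    return 'https://www.google.com/search?q=' + quote_plus(query)


print(google_search_url('search golden retriever puppies on Google'))
```

The returned URL can be passed to wb.get().open_new(url) exactly as in the branch above.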

To tell the time, the assistant speaks it aloud using text-to-speech.

                           elif command == 'What is the time?':
                               current = datetime.datetime.now() #read the clock again so the spoken time is not stale
                               telling_the_time = f'{current.hour} hours and {current.minute} minutes'
                               lang = 'en'
                               obj = gTTS(text=telling_the_time, lang=lang, slow=True)
                               obj.save('time.mp3')
                               os.system('time.mp3')
                               time.sleep(4)
                               pyautogui.moveTo(1910, 10)
                               pyautogui.click()
                           else:
                               print(command)
                               break
               elif voice_text == 'show operations':# list of operations 
                   print('1.Tell Date and Time')
                   print('2.Open Social Media Handles')
                   print('3.Play Songs on Youtube')
                   print('4.Maximize the current window')
                   print('5.Minimize the current window')
                   print('6.Search query on Google')
               elif voice_text == 'thanks Jarvis':
                   print('Have a great day!')
                   break
               else:
                   print(voice_text)
           except Exception:
               break
   except Exception:
       break
print('Hope you liked the fun project!')
print('Follow Bhavishya Pandit on Linkedin!')
print('Subscribe to AIM for more such articles!')
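
The feature list also mentions telling the date, which the program above does not do yet. The sketch below shows one way to build both spoken phrases from a datetime object. The function names time_phrase and date_phrase are hypothetical, the wording of the phrases follows the time branch above, and the month name assumes Python's default (English) locale.

```python
import datetime


def time_phrase(moment: datetime.datetime, twelve_hour: bool = False) -> str:
    """Describe the time of day in words suitable for text-to-speech."""
    if twelve_hour:
        hour = moment.hour % 12 or 12  # 0 and 12 both become 12
        period = 'AM' if moment.hour < 12 else 'PM'
        return f'{hour} hours and {moment.minute} minutes {period}'
    return f'{moment.hour} hours and {moment.minute} minutes'


def date_phrase(moment: datetime.datetime) -> str:
    """Describe the date as day, month name and year."""
    return f"{moment.day} {moment.strftime('%B')} {moment.year}"


# Reading the clock at the moment of the request keeps the answer current.
now = datetime.datetime.now()
print(time_phrase(now))
print(date_phrase(now))
```

Either string can be handed to gTTS in the same way as telling_the_time in the code above.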

Now It’s Time to See the Results

Output: playing songs on YouTube

A new tab opens in your browser with a list of songs for you.

Output: searching a query on Google

A new tab opens with the Google results for the query.

Now for the most interesting part of this article: a recorded video showing how the above program works is given below.

Conclusion

The main aim of this article is to give a better understanding of recent advancements in technology and of how easily we can put them to work. The technologies used here are speech recognition, text-to-speech, and web and window automation.

From this, we can conclude that much of what was once fiction is becoming reality today!

The future isn't near, the future is here!

Subscribe to AIM for more such articles. You can follow me on LinkedIn to stay updated.
The complete code can be found in AIM's GitHub repository. Please visit this link to find the code.

Bhavishya Pandit

Understanding and building fathomable approaches to problem statements is what I like the most. I love talking about conversations whose main plot is machine learning, computer vision, deep learning, data analysis and visualization. Apart from them, my interest also lies in listening to business podcasts, use cases and reading self help books.