Understanding Speech: Moving Beyond ASRs

Deep Learning DevCon 2020, or DLDC 2020, is another conference of the year hosted in partnership with Analytics India Magazine. Scheduled for 29th and 30th October, the conference brings together leading experts and the best minds of the deep learning and machine learning industry from around the globe.

The first session of Day 1 was presented by Abhinav Tushar, head of AI at the Bengaluru-based conversational AI startup Vernacular.ai. The central message of the session, titled “Understanding Speech: Moving beyond Automatic Speech Recognition”, was that although text-based conversational interactions have been around in the industry for a while now, speech interactions are still in their infancy.

Tushar kickstarted the talk by explaining the importance of speech and the emotions hidden within it. He said that speech is far different from text and much more complex: “Speech is much more than transcriptions. And that should influence how we design conversational agents.”


Tushar mentioned that the factors shaping spoken responses include content, environment, speaker characteristics and paralinguistics. He also gave the example of various “okays” spoken by different people, each conveying a different emotion.

He then discussed the workings of the conversational AI-based voice bot built at Vernacular.ai and how it is catching up in terms of mirroring human behaviour.

The framework works through the following steps:

  • When a user speaks, the audio goes into the speech recognition block, which extracts a transcript from the speech.
  • The transcript then moves on to frame understanding, which covers pre-processing, intent classification and entity parsing.
  • Next comes the content management and dialogue management process, which decides the bot’s response.
  • Finally, the response text is transformed into speech and sent back to the user.
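The four-step loop above can be sketched as a minimal pipeline. This is an illustrative assumption, not Vernacular.ai’s actual system: every function name and the canned transcript are hypothetical placeholders standing in for real ASR, NLU and TTS components.

```python
# Hypothetical sketch of the voice bot's four-step turn loop.
# All names and return values are illustrative, not a real API.

def recognize_speech(audio: bytes) -> str:
    """Speech recognition block: extract a transcript from audio."""
    return "i want to check my balance"  # stand-in for a real ASR model

def understand_frame(transcript: str) -> dict:
    """Frame understanding: pre-processing, intent classification, entity parsing."""
    text = transcript.strip().lower()
    intent = "check_balance" if "balance" in text else "unknown"
    return {"intent": intent, "entities": {}}

def manage_dialogue(frame: dict) -> str:
    """Content and dialogue management: choose the next response."""
    if frame["intent"] == "check_balance":
        return "Sure, let me pull up your balance."
    return "Sorry, could you repeat that?"

def synthesize_speech(text: str) -> bytes:
    """Text-to-speech: turn the response text back into audio."""
    return text.encode("utf-8")  # stand-in for a real TTS engine

def handle_turn(audio: bytes) -> bytes:
    """One full user turn: ASR -> frame understanding -> dialogue -> TTS."""
    transcript = recognize_speech(audio)
    frame = understand_frame(transcript)
    response = manage_dialogue(frame)
    return synthesize_speech(response)
```

The point of the sketch is the shape of the loop, not the components: each block can be swapped for a real model while the turn-handling structure stays the same.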

Further, Tushar discussed the stages of extra-lexical conversational behaviour: snapshot-based, flow-based and persuasive.

  • In the snapshot-based stage, the system understands behavioural snapshots and performs simple actions, such as bailing out on certain cues or detecting personal characteristics and switching prompts.
  • In the flow-based stage, the system works across multiple turns and can perform basic repairs, for instance tracking consistent expressions of discomfort or changing the flow of the experience based on the situation.
  • In the persuasive stage, the system persuades the other party and drives the conversation, for example by understanding and utilising uncertainties and preferences, and by modelling situations and manoeuvring accordingly.
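As a concrete illustration of the simplest of these stages, a snapshot-based behaviour can be sketched as a single-turn cue check. The cue list and action names here are illustrative assumptions, not cues the talk actually enumerated:

```python
# Hypothetical snapshot-based behaviour: act on a single turn's
# behavioural snapshot, e.g. bail out when frustration cues appear.
# The cue set below is an illustrative assumption.

FRUSTRATION_CUES = {"human", "agent", "stop calling", "not interested"}

def snapshot_action(transcript: str) -> str:
    """Map one turn's snapshot to a simple action."""
    text = transcript.lower()
    if any(cue in text for cue in FRUSTRATION_CUES):
        return "bail_out"   # hand over or end the call gracefully
    return "continue"       # keep the normal prompt flow
```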

Furthermore, in order to build such intelligent systems, Tushar concluded that one must include components like:

  1. Stylistic and semantic models
  2. State tracking
  3. Live experimentation framework
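Of these components, state tracking lends itself to a short sketch: the system accumulates slots and behavioural signals across turns so that flow-based behaviour (such as reacting to repeated discomfort) becomes possible. The field names and the two-signal threshold below are illustrative assumptions:

```python
# Hypothetical dialogue state tracker: accumulate entities and
# discomfort signals across turns. Fields and thresholds are
# illustrative assumptions, not a described implementation.
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    turn: int = 0
    slots: dict = field(default_factory=dict)
    discomfort_signals: int = 0

    def update(self, frame: dict) -> None:
        """Fold one turn's understanding into the running state."""
        self.turn += 1
        self.slots.update(frame.get("entities", {}))
        if frame.get("discomfort"):
            self.discomfort_signals += 1

    def should_repair(self) -> bool:
        """Trigger a flow change after repeated discomfort (assumed threshold)."""
        return self.discomfort_signals >= 2
```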

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
