Deep Learning DevCon 2020, or DLDC 2020, is another conference this year hosted in partnership with Analytics India Magazine. Scheduled for 29th and 30th October, the conference has brought together leading experts and some of the best minds in the deep learning and machine learning industry from around the globe.
The first session of Day 1 was presented by Abhinav Tushar, the head of AI at the Bengaluru-based conversational AI startup Vernacular.ai. The central point of the session, titled “Understanding Speech: Moving beyond Automatic Speech Recognitions”, was that although text-based conversational interactions have been around in the industry for a while now, speech interactions are still in their infancy.
Tushar kickstarted the talk by explaining the importance of speech and the emotions hidden behind it. He said that speech is very different from text and much more complex. He said, “Speech is much more than transcriptions. And that should influence how we design conversational agents.”
Tushar mentioned that the factors affecting a response include content, environment, speaker characteristics and paralinguistics. He also gave the example of several “okays” spoken by different people, each of which conveyed a different emotion.
He then discussed the workings of the current conversational AI-based voice bot built at Vernacular.ai and how it is catching up in terms of mirroring human behaviour.
The framework works in the following steps:
- When a user speaks, the audio goes into the speech recognition block, which extracts the speech.
- It then moves forward into the Automatic Speech Recognition system, which includes an acoustic model, a pronunciation model and a language model.
- After that, it moves on to frame understanding, which covers steps like intent classification, pre-processing and entity parsing.
- The next step is a content management and dialogue management process.
- In the final step, the response text is converted into speech and sent to the user.
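The steps above can be sketched as a chain of stages that each enrich one user turn. This is only an illustrative sketch, not Vernacular.ai's implementation: the `Turn` structure, every function name, the toy "book a table" intent and the trick of treating the audio as UTF-8 text are all assumptions made so the example stays self-contained and runnable.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Turn:
    """Carries one user turn through the pipeline (hypothetical structure)."""
    audio: bytes
    transcript: str = ""
    intent: str = ""
    entities: Dict[str, str] = field(default_factory=dict)
    reply_text: str = ""
    reply_audio: bytes = b""


def recognise_speech(turn: Turn) -> Turn:
    # Stand-in for the ASR block (acoustic, pronunciation and language models).
    # Assumption: the "audio" is plain UTF-8 text, so no real decoding happens.
    turn.transcript = turn.audio.decode("utf-8").lower().strip()
    return turn


def understand(turn: Turn) -> Turn:
    # Stand-in for pre-processing, intent classification and entity parsing.
    tokens = turn.transcript.split()
    turn.intent = "book_table" if "book" in tokens else "unknown"
    for token in tokens:
        if token.isdigit():
            turn.entities["party_size"] = token
    return turn


def manage_dialogue(turn: Turn) -> Turn:
    # Stand-in for content and dialogue management: choose the reply text.
    if turn.intent == "book_table":
        size = turn.entities.get("party_size", "your party")
        turn.reply_text = f"Booking a table for {size}."
    else:
        turn.reply_text = "Sorry, could you repeat that?"
    return turn


def synthesise_speech(turn: Turn) -> Turn:
    # Stand-in for text-to-speech: a real system would produce audio samples.
    turn.reply_audio = turn.reply_text.encode("utf-8")
    return turn


# The stages run in the same order as the steps listed in the article.
PIPELINE: List[Callable[[Turn], Turn]] = [
    recognise_speech,
    understand,
    manage_dialogue,
    synthesise_speech,
]


def run_pipeline(audio: bytes) -> Turn:
    turn = Turn(audio=audio)
    for stage in PIPELINE:
        turn = stage(turn)
    return turn
```

Calling `run_pipeline(b"Book a table for 4")` passes the turn through all four stages in order. Keeping each stage as a separate function mirrors the block structure described in the talk and makes it easy to swap a stub for a real model.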
Further, Tushar discussed the stages of extra-lexical conversational behaviour, which are snapshot-based, flow-based and persuasive.
- In the Snapshot-based stage, the system understands behavioural snapshots and performs simple actions. Its features include bailing out on certain cues, detecting personal characteristics and switching prompts, etc.
- In the Flow-based stage, the system works across multiple turns and can perform basic repairs. Its features include tracking a consistent expression of discomfort, changing the in-flow experience based on the situation, etc.
- In the Persuasive stage, the system persuades the other party and drives the conversation. Its features include understanding and utilising uncertainties and preferences, modelling situations and manoeuvring, etc.
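The difference between the first two stages can be illustrated with a small sketch: a snapshot-based rule reacts to a cue in a single turn, while a flow-based rule looks at several recent turns. Everything here is hypothetical, including the `TurnSignal` fields, the action names, the window of three turns and the 0.6 threshold. The persuasive stage is not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TurnSignal:
    """Hypothetical per-turn cues; discomfort is a score from 0.0 to 1.0."""
    discomfort: float
    wants_human: bool = False


def snapshot_action(signal: TurnSignal) -> str:
    # Snapshot-based: act on a cue seen in one turn, e.g. bail out.
    if signal.wants_human:
        return "bail_out_to_agent"
    return "continue"


def flow_action(history: List[TurnSignal], window: int = 3,
                threshold: float = 0.6) -> str:
    # Flow-based: act only when discomfort is consistent across recent turns.
    recent = history[-window:]
    if len(recent) == window and all(s.discomfort >= threshold for s in recent):
        return "repair_conversation"
    return "continue"
```

With this sketch, a single high discomfort score does not trigger a repair, but three consecutive turns at or above the threshold do, which is the kind of multi-turn tracking the flow-based stage describes.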
Tushar concluded that, in order to build such intelligent systems, one must include components like:
- Stylistic and semantic models
- State tracking
- Live experimentation framework