Automatic Speech Transcription And Speaker Recognition Simultaneously Using Apple AI

Last year, Apple faced several controversies over its speech recognition technology. As part of the “Siri Grading Program”, a quality-control effort for its voice assistant Siri, the company had contractors regularly listen to confidential voice recordings. Apple later apologised and published a statement announcing changes to the programme.

This year, the tech giant has ramped up research on speech recognition technology to upgrade its voice assistant. Recently, researchers at Apple developed an AI model that can perform automatic speech transcription and speaker recognition simultaneously.

Behind the Model

The researchers presented a supervised multi-task learning model. The speech transcription branch of the network is trained to minimise a phonetic connectionist temporal classification (CTC) loss, while the speaker recognition branch is trained to label the input sequence with the correct speaker.
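The two branches and their losses can be sketched as follows. This is a minimal PyTorch illustration of the general setup, a shared biLSTM encoder feeding a CTC head and a speaker-classification head; the layer sizes, feature dimensions and class counts are assumptions for illustration, not Apple’s actual configuration:

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared biLSTM encoder with a phonetic (CTC) head and a speaker head."""
    def __init__(self, n_mels=40, hidden=128, n_phones=50, n_speakers=100):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        # +1 output for the CTC blank symbol (index 0)
        self.phone_head = nn.Linear(2 * hidden, n_phones + 1)
        self.speaker_head = nn.Linear(2 * hidden, n_speakers)

    def forward(self, feats):                            # feats: (B, T, n_mels)
        h, _ = self.encoder(feats)                       # (B, T, 2*hidden)
        phone_log_probs = self.phone_head(h).log_softmax(dim=-1)
        spk_logits = self.speaker_head(h.mean(dim=1))    # pool over time
        return phone_log_probs, spk_logits

# Toy batch: 2 utterances, 100 frames each
model = MultiTaskNet()
feats = torch.randn(2, 100, 40)
phone_log_probs, spk_logits = model(feats)

# Transcription branch: CTC loss over phone sequences (labels 1..n_phones)
ctc = nn.CTCLoss(blank=0)
phone_targets = torch.randint(1, 51, (2, 20))
ctc_loss = ctc(phone_log_probs.transpose(0, 1), phone_targets,
               torch.full((2,), 100), torch.full((2,), 20))

# Speaker branch: cross-entropy against the correct speaker label
spk_loss = nn.CrossEntropyLoss()(spk_logits, torch.randint(0, 100, (2,)))
total_loss = ctc_loss + spk_loss
```

In a setup like this, both losses flow back through the shared encoder, which is what forces it to learn a representation useful for both tasks.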



The model was trained on several thousand hours of labelled training data for each task. The speech transcription branch of the network was evaluated on a voice trigger detection task, while the speaker recognition branch was evaluated on a speaker verification task.

How it Works

Although the tasks of automatic speech transcription and speaker recognition are inter-related, they are treated independently in most scenarios. In this work, the researchers address this by training a single network that can efficiently represent both phonetic and speaker-specific information.


The “Hey Siri” detector is known to use a Deep Neural Network (DNN) to convert the acoustic pattern of a voice at each instant into a probability distribution over speech sounds. In the current project, the researchers trained three sets of models. In the first, all 4 biLSTM layers of the encoder are tied (shared) between the two tasks, so the final encoder layer must represent both phonetic and speaker information. In the second, 3 biLSTM encoder layers are shared, with a separate final biLSTM layer each for the voice trigger and speaker recognition branches. In the third, 2 biLSTM layers have tied weights, with 2 additional biLSTM layers for each branch.
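The three weight-sharing variants can be parameterised by how many encoder layers are tied versus task-specific. The sketch below shows that idea in PyTorch; the class name, hidden size and feature dimensions are my own assumptions, not taken from the paper:

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """biLSTM encoder with `n_shared` tied layers plus `n_branch`
    task-specific layers each for the voice-trigger and speaker branches."""
    def __init__(self, n_shared, n_branch, n_mels=40, hidden=64):
        super().__init__()
        self.shared = nn.LSTM(n_mels, hidden, num_layers=n_shared,
                              bidirectional=True, batch_first=True)
        def branch():
            if n_branch == 0:
                return None  # fully tied: both tasks read the same output
            return nn.LSTM(2 * hidden, hidden, num_layers=n_branch,
                           bidirectional=True, batch_first=True)
        self.trigger_branch = branch()
        self.speaker_branch = branch()

    def forward(self, feats):
        h, _ = self.shared(feats)
        t = h if self.trigger_branch is None else self.trigger_branch(h)[0]
        s = h if self.speaker_branch is None else self.speaker_branch(h)[0]
        return t, s  # per-task features, fed to the respective heads

# The three configurations described above:
model_a = SharedEncoder(n_shared=4, n_branch=0)  # all 4 layers tied
model_b = SharedEncoder(n_shared=3, n_branch=1)  # separate final layers
model_c = SharedEncoder(n_shared=2, n_branch=2)  # 2 tied + 2 per branch
```

The three variants trade off parameter sharing against task-specific capacity: the more layers are tied, the more the encoder is forced to encode both kinds of information in one representation.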

Dataset Used

The model is evaluated on a large test set collected internally for models designed for smart speakers. The data was recorded in live sessions in realistic home environments.

The training data for voice trigger detection consists of 5,000 hours of anonymised, manually transcribed audio. To simulate reverberant speech, the researchers used a set of 3,000 different room impulse responses (RIRs), internally collected in a wide range of houses and representing a diverse set of acoustic conditions.
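Reverberant training data of this kind is typically produced by convolving clean (“dry”) speech with a measured RIR. A minimal NumPy sketch of that augmentation step follows; the function name, level normalisation and synthetic RIR are illustrative assumptions, not Apple’s actual pipeline:

```python
import numpy as np

def add_reverb(dry, rir):
    """Simulate a room by convolving dry speech with a room impulse
    response, truncated back to the original signal length."""
    wet = np.convolve(dry, rir)[: len(dry)]
    # Rescale so the augmentation does not change the overall level
    peak = np.max(np.abs(wet))
    return wet if peak == 0 else wet * (np.max(np.abs(dry)) / peak)

# Toy example: 1 second of "speech" at 16 kHz, 0.25 s synthetic decaying RIR
rng = np.random.default_rng(0)
dry = rng.standard_normal(16000)
rir = np.exp(-np.linspace(0.0, 8.0, 4000)) * rng.standard_normal(4000)
wet = add_reverb(dry, rir)
```

With 3,000 measured RIRs, each clean utterance can be paired with many different rooms, multiplying the effective acoustic diversity of the training set.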

The training data for the speaker recognition task comprises 4.5 million utterances sampled from intentional voice assistant invocations. The training set contains 21,000 different speakers, with a minimum of 20 examples and a median of 118 examples per speaker, resulting in a training set with over 5,700 hours of audio. The final dataset contains 13 million training examples with over 11,000 hours of labelled training data.

Wrapping Up

It is not new that the tech giant has tweaked its virtual assistant Siri’s machine learning algorithms to overcome challenges in the voice recognition space. Besides the advancements in automatic speech transcription and speaker recognition, the company has also researched spoken language identification (LID) technologies to improve language identification for multilingual speakers.

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.