Automatic Speech Transcription And Speaker Recognition Simultaneously Using Apple AI

Last year, Apple faced controversy over its speech recognition technology. As a quality control measure for its voice assistant Siri, Apple had contractors regularly listen to confidential voice recordings under the “Siri Grading Program”. The company later apologised and published a statement announcing changes to the program.

This year, the tech giant has had a number of researchers working on speech recognition technology to upgrade its voice assistant. Recently, researchers at Apple developed an AI model that can perform automatic speech transcription and speaker recognition simultaneously.

Behind the Model

The researchers presented a supervised multi-task learning model. In this model, the speech transcription branch of the network is trained to minimise a phonetic connectionist temporal classification (CTC) loss, while the speaker recognition branch is trained to label the input sequence with the correct speaker label.
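To make the two training objectives concrete, here is a minimal pure-Python sketch of a CTC loss for the phonetic branch combined with a classification loss for the speaker branch. It is an illustration only, not the researchers' implementation: the cross-entropy speaker loss, the weighted sum, the `weight` parameter, the blank index and all function names are assumptions made for this example.

```python
import math

BLANK = 0  # assumed index of the CTC blank symbol


def log_sum_exp(*values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf inputs."""
    finite = [v for v in values if v != -math.inf]
    if not finite:
        return -math.inf
    peak = max(finite)
    return peak + math.log(sum(math.exp(v - peak) for v in finite))


def ctc_loss(log_probs, target):
    """Negative log-likelihood of `target` under CTC (forward algorithm).

    log_probs: one list of per-symbol log-probabilities for each frame.
    target: list of phone indices, without blanks.
    """
    # Interleave blanks with the target: [a, b] -> [_, a, _, b, _]
    ext = [BLANK]
    for phone in target:
        ext += [phone, BLANK]
    n = len(ext)

    # alpha[s] = log-probability of all partial alignments ending in ext[s]
    alpha = [-math.inf] * n
    alpha[0] = log_probs[0][ext[0]]
    if n > 1:
        alpha[1] = log_probs[0][ext[1]]

    for frame in log_probs[1:]:
        new = [-math.inf] * n
        for s in range(n):
            candidates = [alpha[s]]
            if s >= 1:
                candidates.append(alpha[s - 1])
            # A blank may be skipped only between two different phones.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])
            new[s] = log_sum_exp(*candidates) + frame[ext[s]]
        alpha = new

    # Valid alignments end in the last phone or the trailing blank.
    tail = alpha[-1] if n == 1 else log_sum_exp(alpha[-1], alpha[-2])
    return -tail


def speaker_loss(speaker_log_probs, speaker_id):
    """Cross-entropy for one utterance-level speaker prediction (assumed)."""
    return -speaker_log_probs[speaker_id]


def multitask_loss(phone_log_probs, phones, speaker_log_probs, speaker_id,
                   weight=0.5):
    """Weighted sum of both branch losses; `weight` is a hypothetical knob."""
    return (weight * ctc_loss(phone_log_probs, phones)
            + (1.0 - weight) * speaker_loss(speaker_log_probs, speaker_id))
```

With two frames, two symbols (blank and one phone) and uniform probabilities, three of the four possible alignments collapse to the single-phone target, so `ctc_loss` returns `-log(0.75)`.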

The model was trained using several thousand hours of labelled training data for each task. The speech transcription branch of the network was evaluated on a voice trigger detection task, while the speaker recognition branch was evaluated on a speaker verification task.

How it Works

Although the tasks of automatic speech transcription and speaker recognition are inter-related, they are treated independently in most scenarios. In this work, the researchers tried to address this by presenting a single network that can efficiently represent both phonetic and speaker-specific information.

The “Hey Siri” detector is known to use a Deep Neural Network (DNN) to convert the acoustic pattern of a voice at each instant into a probability distribution over speech sounds. In the current project, the researchers trained three sets of models. In the first, all 4 biLSTM layers of the encoder are tied for both tasks, so the model has to learn to represent both phonetic and speaker information in the final layer of the encoder. The second has only 3 tied biLSTM layers in the encoder, along with separate final biLSTM layers for the voice trigger and speaker recognition branches. In the third, 2 biLSTM layers have tied weights, with 2 additional biLSTM layers for each branch.
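The three weight-sharing layouts can be summarised in a small sketch. This only describes which layers are shared and which are private to a branch; it builds no actual network. The class, field and layout names are hypothetical, and the second layout is assumed to have exactly one private final layer per branch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderLayout:
    """How many biLSTM layers are tied and how many are private per branch."""
    name: str
    shared_layers: int
    branch_layers: int

    def layers_for(self, branch):
        """Ordered layer names that one branch's input passes through."""
        shared = [f"shared_bilstm_{i}"
                  for i in range(1, self.shared_layers + 1)]
        private = [f"{branch}_bilstm_{i}"
                   for i in range(1, self.branch_layers + 1)]
        return shared + private

    def total_bilstm_layers(self, branches=("voice_trigger", "speaker")):
        """Distinct biLSTM layers in the whole network across all branches."""
        return self.shared_layers + self.branch_layers * len(branches)


# The three variants described in the text (names are illustrative).
LAYOUTS = [
    EncoderLayout("fully_tied", shared_layers=4, branch_layers=0),
    EncoderLayout("three_shared", shared_layers=3, branch_layers=1),
    EncoderLayout("two_shared", shared_layers=2, branch_layers=2),
]

for layout in LAYOUTS:
    print(layout.name, layout.layers_for("speaker"),
          layout.total_bilstm_layers())
```

Under these assumptions every branch is 4 biLSTM layers deep in all three layouts. What changes is how many of those layers the two tasks share.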

Dataset Used

The model is evaluated on a large test set that was collected internally for models designed for smart speakers. The data was recorded during live sessions in realistic home environments.

The training data for voice trigger detection is 5,000 hours of anonymised, manually transcribed audio. To simulate reverberant speech, the researchers used a set of 3,000 different room impulse responses (RIRs), which were collected internally in a wide range of houses and represent a diverse set of acoustic conditions.
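Reverberant speech is commonly simulated by convolving a clean signal with a room impulse response. The sketch below shows that general technique in pure Python on short lists of samples. The article does not describe the researchers' exact augmentation pipeline, so the function names and the random choice of RIR are assumptions for illustration.

```python
import random


def convolve(signal, impulse_response):
    """Full discrete convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, x in enumerate(signal):
        for j, h in enumerate(impulse_response):
            # Each input sample excites a scaled, delayed copy of the RIR.
            out[i + j] += x * h
    return out


def add_reverb(signal, rirs, rng):
    """Pick one RIR at random and apply it to a dry (clean) signal."""
    return convolve(signal, rng.choice(rirs))


# Toy example: a single click followed by silence, and two made-up RIRs.
dry = [1.0, 0.0, 0.0]
toy_rirs = [[1.0, 0.5], [1.0, 0.0, 0.25]]
wet = add_reverb(dry, toy_rirs, random.Random(0))
print(wet)
```

Real RIRs are thousands of samples long, so an FFT-based convolution would normally replace the double loop here.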

The training data for the speaker recognition task comprises 4.5 million utterances sampled from intentional voice assistant invocations. The training set contains 21,000 different speakers, with a minimum of 20 examples and a median of 118 examples per speaker, resulting in a training set with over 5,700 hours of audio. The final dataset contains 13 million training examples with over 11,000 hours of labelled training data.

Wrapping Up

It is not new for the tech giant to tweak the machine learning algorithms behind its virtual assistant Siri to overcome challenges in the voice recognition space. Besides the advancements in automatic speech transcription and speaker recognition, the tech giant has also carried out research on spoken language identification (LID) technologies to improve language identification for multilingual speakers.


Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
