Automatic Speech Transcription And Speaker Recognition Simultaneously Using Apple AI

Last year, Apple faced several controversies over its speech recognition technology. To ensure quality control in its voice assistant Siri, the company had contractors regularly listen to confidential voice recordings as part of the “Siri Grading Program”. Apple later apologised and published a statement announcing changes to the grading programme.

This year, the tech giant has been ramping up research on speech recognition technology to upgrade its voice assistant. Recently, researchers at Apple developed an AI model that can perform automatic speech transcription and speaker recognition simultaneously.

Behind the Model

The researchers presented a supervised multi-task learning model. The speech transcription branch of the network is trained to minimise a phonetic connectionist temporal classification (CTC) loss, while the speaker recognition branch is trained to classify the input sequence with the correct speaker label.
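The setup can be pictured as a shared encoder feeding two heads: one trained with a CTC loss over phoneme targets and one trained with a classification loss over speaker labels. The sketch below is a minimal PyTorch illustration; the layer sizes, vocabulary size, pooling strategy and loss weighting are assumptions for illustration, not values reported by Apple.

import torch
import torch.nn as nn

class MultiTaskSpeechModel(nn.Module):
    def __init__(self, n_mels=40, hidden=256, n_phonemes=54, n_speakers=21000):
        super().__init__()
        # Shared biLSTM encoder (the tied layers of the multi-task model).
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=4,
                               bidirectional=True, batch_first=True)
        # Speech transcription branch: per-frame phoneme posteriors for CTC.
        self.phonetic_head = nn.Linear(2 * hidden, n_phonemes + 1)  # +1 for the CTC blank
        # Speaker recognition branch: one label per utterance.
        self.speaker_head = nn.Linear(2 * hidden, n_speakers)

    def forward(self, feats):                      # feats: (batch, time, n_mels)
        enc, _ = self.encoder(feats)               # (batch, time, 2*hidden)
        phone_logits = self.phonetic_head(enc)     # framewise phoneme logits
        utt_embedding = enc.mean(dim=1)            # pool over time for the speaker branch
        speaker_logits = self.speaker_head(utt_embedding)
        return phone_logits, speaker_logits

ctc_loss = nn.CTCLoss(blank=0)
ce_loss = nn.CrossEntropyLoss()

def multitask_loss(model, feats, feat_lens, phonemes, phoneme_lens, speaker_ids):
    phone_logits, speaker_logits = model(feats)
    # CTC expects (time, batch, classes) log-probabilities.
    log_probs = phone_logits.log_softmax(-1).transpose(0, 1)
    loss_ctc = ctc_loss(log_probs, phonemes, feat_lens, phoneme_lens)
    loss_spk = ce_loss(speaker_logits, speaker_ids)
    return loss_ctc + 0.5 * loss_spk   # the 0.5 weighting is an assumption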

The model has been trained using several thousand hours of labelled training data for each task. The speech transcription branch of the network has been evaluated on a voice trigger detection task while the speaker recognition branch has been evaluated on a speaker verification task. 

How it Works

Although the tasks of automatic speech transcription and speaker recognition are inter-related, they are treated independently in most scenarios. In this work, the researchers addressed this by training a single network that can efficiently represent both phonetic and speaker-specific information.

The “Hey Siri” detector is known to use a deep neural network (DNN) to convert the acoustic pattern of the voice at each instant into a probability distribution over speech sounds. In the current project, the researchers trained three sets of models. In the first model, all four biLSTM layers of the encoder are tied across both tasks, so the network learns to represent both phonetic and speaker information in the final layer of the encoder. The second model shares only three biLSTM layers in the encoder, with a separate final biLSTM layer for the voice trigger and speaker recognition branches. In the third model, two biLSTM layers have tied weights, with two additional biLSTM layers for each branch.
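As a rough illustration of those three weight-sharing configurations, the sketch below builds a shared biLSTM stack plus optional per-branch biLSTM layers. The hidden size and input feature dimension are illustrative assumptions only.

import torch.nn as nn

def build_encoder(n_shared, n_branch, n_mels=40, hidden=256):
    """Return (shared, trigger_branch, speaker_branch) biLSTM stacks."""
    def bilstm(in_dim, layers):
        # Return None when a branch has no private layers of its own.
        return nn.LSTM(in_dim, hidden, num_layers=layers,
                       bidirectional=True, batch_first=True) if layers else None
    shared = bilstm(n_mels, n_shared)
    trigger_branch = bilstm(2 * hidden, n_branch)
    speaker_branch = bilstm(2 * hidden, n_branch)
    return shared, trigger_branch, speaker_branch

# Model 1: all four encoder layers tied across both tasks.
model_1 = build_encoder(n_shared=4, n_branch=0)
# Model 2: three tied layers plus a separate final biLSTM layer per branch.
model_2 = build_encoder(n_shared=3, n_branch=1)
# Model 3: two tied layers plus two additional biLSTM layers per branch.
model_3 = build_encoder(n_shared=2, n_branch=2)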

Dataset Used

The model was evaluated on a large test set, internally collected for models designed for smart speakers. The data was recorded in live sessions in realistic home environments.

The training data for voice trigger detection consists of 5,000 hours of anonymised, manually transcribed audio. To simulate reverberant speech, the researchers used a set of 3,000 different room impulse responses (RIRs), internally collected in a wide range of houses and representing a diverse set of acoustic conditions.
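Reverberation augmentation of this kind is typically done by convolving clean speech with a randomly chosen room impulse response. A minimal sketch, assuming NumPy/SciPy and an illustrative level-normalisation step not described by the researchers:

import random
import numpy as np
from scipy.signal import fftconvolve

def reverberate(clean_audio: np.ndarray, rirs: list) -> np.ndarray:
    """Convolve a clean waveform with a random RIR to simulate a room."""
    rir = random.choice(rirs)
    rir = rir / np.max(np.abs(rir))                 # normalise the impulse response
    reverberant = fftconvolve(clean_audio, rir)[: len(clean_audio)]
    # Keep the augmented waveform at roughly the original level.
    scale = np.max(np.abs(clean_audio)) / (np.max(np.abs(reverberant)) + 1e-8)
    return reverberant * scale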

The training data for the speaker recognition task comprises 4.5 million utterances sampled from intentional voice assistant invocations. The training set contains 21,000 different speakers, with a minimum of 20 examples and a median of 118 examples per speaker, resulting in a training set with over 5,700 hours of audio. The final dataset contains 13 million training examples with over 11,000 hours of labelled training data.

Wrapping Up

It is not new for the tech giant to tweak its virtual assistant Siri’s machine learning algorithms to overcome challenges in the voice recognition space. Besides the advancements in automatic speech transcription and speaker recognition, the company has also researched spoken language identification (LID) technologies to improve language identification for multilingual speakers.


Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.