Last year, Apple faced several controversies over its speech recognition technology. As quality control for its voice assistant Siri, the company had contractors regularly listen to confidential voice recordings under the “Siri Grading Program”. Apple later apologised and published a statement announcing changes to the program.
This year, the tech giant has had a number of researchers working on speech recognition technology to upgrade its voice assistant. Recently, researchers at Apple developed an AI model that performs automatic speech transcription and speaker recognition simultaneously.
Behind the Model
The researchers presented a supervised multi-task learning model. The speech transcription branch of the network is trained to minimise a phonetic connectionist temporal classification (CTC) loss, while the speaker recognition branch is trained to label the input sequence with the correct speaker label.
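To make the two training objectives concrete, here is a minimal pure-Python sketch of a connectionist temporal classification (CTC) loss for the transcription branch and a cross-entropy loss for the speaker branch. It is not Apple's implementation. The function names, the toy inputs and especially the weighted sum in `multi_task_loss` are assumptions made for illustration; the article does not say how the two losses are combined during training.

```python
import math

def log_sum_exp(*xs):
    """Numerically stable log(sum(exp(x))) over the given values."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_loss(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC.

    log_probs: one list of per-symbol log-probabilities for each frame.
    target:    list of symbol indices (e.g. phone labels), without blanks.
    """
    # Extended target with blanks interleaved: [blank, t1, blank, t2, ..., blank]
    ext = [blank]
    for symbol in target:
        ext += [symbol, blank]
    size = len(ext)

    # alpha[s] = log-probability of all partial alignments ending at ext[s]
    alpha = [-math.inf] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [-math.inf] * size
        for s in range(size):
            candidates = [alpha[s]]                 # stay on the same symbol
            if s >= 1:
                candidates.append(alpha[s - 1])     # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])     # skip the blank in between
            new[s] = log_sum_exp(*candidates) + log_probs[t][ext[s]]
        alpha = new

    # A valid alignment ends on the last symbol or on the trailing blank.
    total = log_sum_exp(alpha[-1], alpha[-2]) if size > 1 else alpha[-1]
    return -total

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single speaker-classification example."""
    return log_sum_exp(*logits) - logits[label]

def multi_task_loss(ctc, speaker_ce, ctc_weight=1.0, speaker_weight=1.0):
    """ASSUMPTION: a weighted sum; the article does not state the combination."""
    return ctc_weight * ctc + speaker_weight * speaker_ce

# Toy example: 2 frames, 2 symbols (index 0 is the CTC blank), uniform outputs.
frames = [[math.log(0.5), math.log(0.5)] for _ in range(2)]
total = multi_task_loss(ctc_loss(frames, [1]), cross_entropy([2.0, 0.5, -1.0], 0))
```

In a real system both losses would be computed on the outputs of the two network branches and back-propagated through the shared encoder layers, which is what pushes a single representation to carry both phonetic and speaker information.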
The model has been trained using several thousand hours of labelled training data for each task. The speech transcription branch of the network has been evaluated on a voice trigger detection task while the speaker recognition branch has been evaluated on a speaker verification task.
How it Works
Although the tasks of automatic speech transcription and speaker recognition are inter-related, they are treated independently in most scenarios. In this work, the researchers addressed this by presenting a single network that can efficiently represent both phonetic and speaker-specific information.
The “Hey Siri” detector is known to use a Deep Neural Network (DNN) to convert the acoustic pattern of the voice at each instant into a probability distribution over speech sounds. In the current project, the researchers trained three sets of models. In the first, all 4 biLSTM layers of the encoder are tied for both tasks, so the model must learn to represent both phonetic and speaker information in the final layer of the encoder. The second ties only 3 biLSTM layers in the encoder and adds separate final biLSTM layers for the voice trigger and speaker recognition branches. In the third, 2 biLSTM layers have tied weights, with 2 additional biLSTM layers for each branch.
The models are evaluated on a large test set that was collected internally for models designed for smart speakers. The data was recorded during live sessions in realistic home environments.
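The three weight-sharing layouts can be summarised in a few lines of Python. This is only a descriptive sketch: the class, the layout names and the layer names are hypothetical, and it assumes the second variant has exactly one separate final biLSTM layer per branch, which the article implies but does not state outright.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EncoderLayout:
    shared_layers: int  # biLSTM layers whose weights are tied across both tasks
    branch_layers: int  # extra biLSTM layers owned by each task branch

    def layer_names(self, branch):
        """Ordered layers that an input passes through for one branch."""
        shared = [f"shared_bilstm_{i}" for i in range(self.shared_layers)]
        own = [f"{branch}_bilstm_{i}" for i in range(self.branch_layers)]
        return shared + own

# Hypothetical labels for the three model sets described in the article.
LAYOUTS = {
    "fully_tied": EncoderLayout(shared_layers=4, branch_layers=0),
    "split_last": EncoderLayout(shared_layers=3, branch_layers=1),  # assumed 1 per branch
    "split_half": EncoderLayout(shared_layers=2, branch_layers=2),
}

def shared_between(layout, a="trigger", b="speaker"):
    """Layers that both branches run through, i.e. the tied part of the encoder."""
    b_names = set(layout.layer_names(b))
    return [name for name in layout.layer_names(a) if name in b_names]

for name, layout in LAYOUTS.items():
    print(name, layout.layer_names("trigger"), layout.layer_names("speaker"))
```

Under this reading, every branch is four biLSTM layers deep and the variants differ only in how many of those layers the voice trigger and speaker recognition tasks share.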
The training data for voice trigger detection is 5,000 hours of anonymised, manually transcribed audio. To simulate reverberant speech, the researchers used a set of 3,000 different room impulse responses (RIRs), which were collected internally in a wide range of houses and represent a diverse set of acoustic conditions.
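Simulating reverberant speech with an RIR generally means convolving a clean recording with the impulse response of a room. The sketch below shows that idea in plain Python. The RIRs in the article were measured in real houses, so the `synthetic_rir` helper (exponentially decaying noise) is purely a stand-in, and all function names and parameters here are assumptions rather than details of Apple's pipeline.

```python
import random

def convolve(signal, rir):
    """Direct-form linear convolution of a signal with an impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, x in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += x * h
    return out

def synthetic_rir(length, decay, seed=0):
    """ASSUMPTION: toy RIR made of a direct path followed by decaying noise."""
    rng = random.Random(seed)
    rir = [1.0]  # direct path from talker to microphone
    for n in range(1, length):
        rir.append(rng.uniform(-1.0, 1.0) * decay ** n)  # fading reflections
    return rir

def add_reverb(clean, rir_pool, rng):
    """Pick one RIR at random from the pool and apply it to a clean utterance."""
    return convolve(clean, rng.choice(rir_pool))

# Small pool standing in for the 3,000 measured responses.
pool = [synthetic_rir(8, 0.6, seed=s) for s in range(3)]
clean = [0.0, 1.0, 0.5, -0.25, 0.0]
reverberant = add_reverb(clean, pool, random.Random(42))
```

For real audio, an FFT-based convolution would be far faster than this direct form; the nested loop is kept here only because it makes the operation easy to read.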
The training data for the speaker recognition task comprises 4.5 million utterances sampled from intentional voice assistant invocations. The training set contains 21,000 different speakers, with a minimum of 20 examples and a median of 118 examples per speaker, resulting in a training set with over 5,700 hours of audio. The final dataset contains 13 million training examples with over 11,000 hours of labelled training data.
It is not new for the tech giant to tweak the machine learning algorithms behind its virtual assistant Siri to overcome challenges in the voice recognition space. Besides the advances in automatic speech transcription and speaker recognition, it has also researched spoken language identification (LID) technologies to improve language identification for multilingual speakers.