Speech-to-text technology is in heavy use today. It is a convenience to many and is useful across all kinds of professions, for students, and in daily life in general. We interact with digital assistants more than ever.
The Tech Behind Speech Recognition Engines
Speech recognition software analyzes the sounds you make by filtering what you say and digitising it into a format it can read. Beyond that, today's software can also work out the meaning behind the words. Using programmed algorithms and trained data, it then outputs its best guess at what was said.
To convert a voice message to text in a typical cloud-based setup, two crucial elements are a microphone to pick up what you say and an internet connection. When the microphone receives the speech, the device sends the speech data to a central server, which holds the relevant database. The software on the server breaks the speech down into tiny parts called phonemes, the smallest sound units of a language, which we put together to form meaningful expressions. For similar-sounding words, the software uses the context of the input to select the word that fits best and gives that as the output.
Certain words create characteristic speech waves. When the software is built, these speech waves are correlated with the text of those words. At recognition time, the text that best matches the word you spoke is found and finalized. The process involves the following steps:
1. ADC: The analog-to-digital converter (ADC) translates the analog wave, the vibrations created as you speak, into digital data that the computer can understand. It does this by taking precise measurements of the sound wave at frequent, regular intervals.
2. Noise removal: The software filters the digitised sound from the ADC to remove unwanted ambient noise. It may also separate the sound into different frequency bands. The sound is then normalized to a constant volume. It must also be adjusted to match the speed of the reference sound samples fed to the system during training, since people do not all speak at the same speed.
3. Signal division: The incoming signal is divided into small segments lasting hundredths or thousandths of a second. These segments are then matched with known phonemes. English has roughly 40 phonemes, while other languages have more or fewer.
4. Comparing with the trained data: The software then examines each phoneme in the context of the phonemes around it. It runs the contextual phoneme plot through a complex statistical model and compares it to a large library of known words, phrases and sentences. The program then determines what the user most likely said and presents it as text.
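As a rough illustration of the first three steps, the sketch below samples and quantizes a synthetic tone, normalizes its volume, and splits it into 10 ms segments. The function names, the 16 kHz sample rate, and the 440 Hz test tone are assumptions chosen for the example, not details of any particular recognizer, and noise filtering is left out.

```python
import math

def sample_and_quantize(signal_fn, duration_s, sample_rate, bits=16):
    """Measure a continuous signal at regular intervals and round each
    measurement to an integer level, roughly what an ADC does."""
    max_level = 2 ** (bits - 1) - 1
    n_samples = round(duration_s * sample_rate)
    samples = []
    for i in range(n_samples):
        value = signal_fn(i / sample_rate)   # measurement at time i / sample_rate
        value = max(-1.0, min(1.0, value))   # clip to the converter's input range
        samples.append(round(value * max_level))
    return samples

def normalize(samples, target_peak=1.0):
    """Scale the samples so the loudest one has magnitude target_peak."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return [0.0 for _ in samples]
    return [s * target_peak / peak for s in samples]

def split_into_frames(samples, sample_rate, frame_ms=10):
    """Cut the signal into consecutive, non-overlapping frames of frame_ms
    milliseconds; a trailing partial frame is dropped."""
    frame_len = int(sample_rate * frame_ms / 1000)
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]

# Hypothetical input: a quiet 440 Hz tone standing in for recorded speech.
def tone(t):
    return 0.5 * math.sin(2 * math.pi * 440 * t)

SAMPLE_RATE = 16000
digital = sample_and_quantize(tone, duration_s=0.1, sample_rate=SAMPLE_RATE)
normalized = normalize(digital)
frames = split_into_frames(normalized, SAMPLE_RATE, frame_ms=10)
print(len(frames), "frames of", len(frames[0]), "samples each")
```

Real systems usually use overlapping frames and turn each frame into spectral features before matching it against phonemes. The non-overlapping split here only keeps the example short.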
According to John Garofolo, Speech Group Manager at the Information Technology Laboratory of the National Institute of Standards and Technology, the two models that dominate the field today are the Hidden Markov Model and neural networks. These methods involve complex mathematical functions, but essentially, they take the information known to the system to figure out the information hidden from it.
The Hidden Markov Model Comes Into Play
The Hidden Markov Model is a statistical model that provides a simple and effective framework for modelling time-varying spectral vector sequences. Most present-day large vocabulary continuous speech recognition systems use this model.
In this model, each phoneme is like a link in a chain, and a complete chain forms a word. The chain branches off in different directions as the program tries to match the sound with the phoneme most likely to appear next. During this process, each phoneme is assigned a probability score based on the training data.
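One standard way to find the most probable chain in a Hidden Markov Model is the Viterbi algorithm. The sketch below is a minimal version of it for a toy three-phoneme chain spelling "cat". The phoneme names, the acoustic labels such as `K_like`, and every probability are made up for illustration. A real system would learn them from training data and work on spectral features rather than symbolic labels.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations,
    together with the probability of that sequence."""
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)

    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]

# Hypothetical left-to-right chain for the word "cat": k -> ae -> t.
states = ["k", "ae", "t"]
start_p = {"k": 1.0, "ae": 0.0, "t": 0.0}
trans_p = {
    "k": {"k": 0.3, "ae": 0.7, "t": 0.0},
    "ae": {"k": 0.0, "ae": 0.4, "t": 0.6},
    "t": {"k": 0.0, "ae": 0.0, "t": 1.0},
}
emit_p = {
    "k": {"K_like": 0.7, "A_like": 0.1, "T_like": 0.2},
    "ae": {"K_like": 0.1, "A_like": 0.8, "T_like": 0.1},
    "t": {"K_like": 0.2, "A_like": 0.1, "T_like": 0.7},
}

# One acoustic label per audio segment; the first phoneme spans two segments.
segments = ["K_like", "K_like", "A_like", "T_like"]
path, prob = viterbi(segments, states, start_p, trans_p, emit_p)
print(path, prob)
```

The self-transitions (for example `k` to `k`) are what let one phoneme stretch over several segments, which is how the chain copes with people speaking at different speeds.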
Human language keeps changing, and it varies with accent and dialect. Earlier speech-to-text systems needed users to speak each word separately, with a pause in between, because they could not handle variations in accent and pronunciation. Today's speech-to-text software is much more sophisticated and uses complex, robust statistical models to determine the most appropriate text output. It still faces several challenges:
1. Accents and languages: If the software is used by only one user, it can be trained specifically for how that person talks. Recognition becomes more difficult as the software is trained to recognise more people, because not everyone has the same accent or speaks at the same speed. It becomes harder still when more than one language is involved.
2. Noise cancellation: Noise cancellation also needs special attention. The software cannot inherently distinguish speech from noise; it has to be specifically programmed to do so. Users should work in a quiet room with a quality microphone positioned as close to their mouths as possible. Low-quality sound cards, which take the input from the microphone and pass the signal to the computer, often do not have enough shielding from the electrical signals produced by other computer components. To handle noise, a set of ambient noise recordings has to be collected and given to the software so that the program can filter them out.
3. Speech pitch: People naturally shift the pitch of their voice to cope with noisy environments, for example by talking loudly in a place with loud music, and speech recognition systems can be sensitive to these pitch changes. This has to be taken into account during programming. Systems also struggle with overlapping speech, when two or more people speak at the same time.
4. Homonyms: Homonyms are words that sound alike but have different meanings. No matter how robust the statistical models in your system, it is very difficult for them to recognise which of two similar-sounding words you mean. Although progress has been made in using the context in which a word appears to pick the correct one, the problem is not fully solved and remains one of the hurdles that speech-to-text technology faces.
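To make the idea of using context concrete, here is a minimal sketch that picks between sound-alike words by counting which word pairs (bigrams) appeared together in some training text. The tiny corpus, the function names, and the add-one scoring are all assumptions made for the example. Production systems rely on far larger language models, and this sketch shares their weakness: it can only be as good as the text it has seen.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count how often each pair of adjacent words occurs."""
    counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        counts.update(zip(words, words[1:]))
    return counts

def pick_homophone(prev_word, next_word, candidates, bigrams):
    """Choose the candidate that fits best between its two neighbours."""
    def score(word):
        # Add one to each count so an unseen pair does not zero out the score.
        return (bigrams[(prev_word, word)] + 1) * (bigrams[(word, next_word)] + 1)
    return max(candidates, key=score)

# Hypothetical training text.
corpus = [
    "they went to their house",
    "their dog is over there",
    "put it over there please",
    "there is their car",
]
bigrams = train_bigrams(corpus)

print(pick_homophone("over", "please", ["their", "there"], bigrams))
print(pick_homophone("to", "house", ["their", "there"], bigrams))
```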
A language has a huge number of words, and the system cannot search through all of them to come up with the most accurate result. It is a lot of data. Some speech recognition software is built for a specific field and includes only the terms relevant to that field, which cuts down the data it has to learn from and search.
How Google Speech To Text Works
According to Google, they employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and reduce its memory footprint further using an SVD-based compression scheme. They also minimize the memory footprint by using a single language model for both the dictation and voice command domains, constructed using Bayesian interpolation. To deal with device-specific information, vocabulary items are injected into the decoder graph and the language model is biased on the fly.
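A CTC-trained model emits, for every audio frame, a score for each phoneme plus a special blank label. The simplest way to read such output is greedy decoding: take the best label per frame, merge consecutive repeats, then drop the blanks. The sketch below shows only that convention, with made-up frame scores and a hypothetical label set. It is not Google's decoder, which combines the acoustic model with a language model and a decoder graph.

```python
BLANK = "_"

def ctc_greedy_decode(frame_scores, labels):
    """Collapse per-frame scores into a label sequence: pick the top label
    in each frame, merge consecutive repeats, and remove blanks."""
    decoded = []
    previous = None
    for scores in frame_scores:
        best_index = max(range(len(scores)), key=lambda i: scores[i])
        label = labels[best_index]
        if label != previous and label != BLANK:
            decoded.append(label)
        previous = label
    return decoded

# Hypothetical label set and scores; each row is one audio frame.
labels = [BLANK, "k", "ae", "t"]
frame_scores = [
    [0.10, 0.70, 0.10, 0.10],  # k
    [0.20, 0.60, 0.10, 0.10],  # k again, merged with the previous frame
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.10],  # ae
    [0.10, 0.10, 0.60, 0.20],  # ae again
    [0.70, 0.10, 0.10, 0.10],  # blank
    [0.10, 0.10, 0.10, 0.70],  # t
]
print(ctc_greedy_decode(frame_scores, labels))
```

The blank label is what allows a genuinely repeated phoneme to survive the merge step: the same label on both sides of a blank is kept twice.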
Their system achieves a 13.5% word error rate on an open-ended dictation task, running at a median speed seven times faster than real time. With plenty of technology emerging in the area of speech to text, it is sure to become even more dominant in the coming years.
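For readers unfamiliar with the word error rate figure quoted above: it is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of words in the reference. A minimal sketch follows, using invented example sentences and a hypothetical function name.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two texts, divided by the
    number of words in the reference transcript."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous_row[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    previous_row = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current_row.append(min(
                previous_row[j] + 1,                           # reference word missing from output
                current_row[j - 1] + 1,                        # extra word in output
                previous_row[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        previous_row = current_row
    return previous_row[-1] / len(ref)

# Invented example: one word dropped and one word misheard, out of five.
print(word_error_rate("turn on the kitchen lights", "turn the kitchen light"))
```

Because insertions count as errors, the rate can exceed 100% when the output contains many extra words.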