Today a lot of unstructured data is being generated in the form of text, images, videos and speech. This data could contain valuable information that companies can utilize to make the right decisions. In this article, we focus on one such form of unstructured data which is speech.
We present a use case in which we analyzed speech recorded during clinical trials to automate a significant part of the operational process, with the potential to cut quality control costs in half.
What Is Speech Analytics?
Speech has several aspects to it. Some elements, like words, speech rate, tone, and emotion, are readily discernible by humans.
Others, like minor variations in pitch and speech rate, are much harder for humans to identify.
Speech analytics is the characterization of speech based on these factors to derive actionable business insights from the data.
There are several ways in which speech can be analyzed, based on the type of application:
Full transcription involves conversion of speech into text format in applications like Siri or in transcribing meetings (for example, between a doctor and a patient), conferences, etc. Converting speech into text allows it to be searched more easily.
Speaker diarization involves partitioning the audio into segments according to who is speaking. When transcribing speech with more than one speaker, as in a meeting or a conference, it is important not just to convert speech to text but to identify who said what.
Keyword detection entails identifying specific keywords in an audio recording. Customer care centers can detect keywords like “unhappy” and “disappointed” and use them to monitor agent performance.
Speaker authentication/identification (voice fingerprinting) involves identifying the unique characteristics in every speaker’s voice that allow humans to differentiate between and identify speakers. Some fraud detection applications capture these unique features to create voice fingerprints during customer care interactions and compare them against known blacklists.
Emotion detection involves identification of the emotional state of the speaker. This can help identify irate customers during customer care interactions, among other applications.
Other characteristics of conversation, such as pauses and background noise, can also be informative. For example, loud noises or long pauses could be indicators of a bad customer care conversation.
Depending on the type of business problem, the analysis framework would combine one or more of the above.
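As a simple illustration of the keyword-detection approach described above, one might scan a transcript (already produced by a speech-to-text step) for flagged terms. This is a minimal sketch of our own; the function name and keyword list are illustrative, not from any production system:

```python
import re

def detect_keywords(transcript: str, keywords: set) -> dict:
    """Count occurrences of each flagged keyword in a transcript."""
    counts = {kw: 0 for kw in keywords}
    # Tokenize on letters/apostrophes so punctuation does not hide matches.
    for token in re.findall(r"[a-z']+", transcript.lower()):
        if token in counts:
            counts[token] += 1
    return counts

flags = detect_keywords(
    "I am very disappointed. Honestly, disappointed and unhappy.",
    {"unhappy", "disappointed"},
)
```

A real system would run this over streaming transcripts, or spot keywords directly in audio without full transcription; the counting logic is the same.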
Problems Faced During Clinical Trials
Testing the efficacy of drugs for mental illnesses involves the doctor having detailed discussions with the patients to evaluate their mental state at various stages of the treatment.
The clinical trials evaluate both the quality of the interviews and whether or not the drug meets its targets. Interview quality evaluation typically involves experts listening to audio recordings of the interviews and scoring them on various quality metrics. This manual review is quite expensive.
The objective here is to use speech analytics to assist the manual reviewers and significantly cut down the costs associated with review time.
Role of Speech Analytics
The first step was for us to remove any background noise so that the spoken dialog is clearly heard. We then split the files into sections of alternating speech and silence. Following this, we grouped the speech sections into clusters, each representing different speakers.
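The splitting step described above can be sketched with a simple energy threshold. This toy function is our own illustration (the actual pipeline also handled noise removal and speaker clustering): it labels fixed-length frames as speech or silence by mean absolute amplitude, then merges adjacent frames of the same kind into segments:

```python
def split_on_silence(samples, frame_len=400, threshold=0.01):
    """Split audio samples into alternating speech/silence segments.

    Returns a list of (kind, start_sample, end_sample) tuples.
    """
    segments = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(abs(s) for s in frame) / frame_len
        kind = "speech" if energy > threshold else "silence"
        if segments and segments[-1][0] == kind:
            # Extend the previous segment of the same kind.
            segments[-1] = (kind, segments[-1][1], i + frame_len)
        else:
            segments.append((kind, i, i + frame_len))
    return segments
```

Production voice-activity detection is more robust (adaptive thresholds, spectral features), but the alternating speech/silence structure it produces is the same.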
We then extracted several hundred features from the audio files, ranging from direct features like duration and amplitude to more abstract ones like speech rate, frequency-wise energy content, and mel-frequency cepstral coefficients (MFCCs). Among other things, these features helped capture information characteristic of a person, similar to how a human identifies a person by their voice.
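A few of the simpler direct features can be computed as below; this is a hand-rolled sketch of ours covering only duration, amplitude, and a crude zero-crossing proxy for pitch, whereas the actual pipeline extracted hundreds of features including MFCCs:

```python
import math

def basic_features(samples, sample_rate):
    """Compute a handful of simple audio features from raw samples."""
    n = len(samples)
    duration = n / sample_rate
    peak = max(abs(s) for s in samples)
    # Root-mean-square energy: overall loudness of the clip.
    rms = math.sqrt(sum(s * s for s in samples) / n)
    # Zero-crossing rate: a rough proxy for dominant frequency.
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0))
    return {
        "duration_s": duration,
        "peak_amplitude": peak,
        "rms_energy": rms,
        "zero_crossing_rate": crossings / duration,
    }
```

Spectral features like MFCCs require a short-time Fourier transform and mel filter banks on top of this, which is typically delegated to an audio library rather than written by hand.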
The objective was to predict an interview quality score, a single number combining several qualitative aspects of interview quality. We computed this score manually for a few audio files and then developed machine learning algorithms to identify the inherent patterns and predict the score for all other audio files. We tried various supervised machine learning techniques: logistic regression, boosted trees, random forests, support vector machines, etc. The best-performing algorithm improved the accuracy of identifying bad interviews by more than 50% over the random baseline, meaning the cost of identifying potentially bad interviews was halved. In other words, in the same amount of time, one could identify and review twice as many bad interviews, gaining insights that ultimately improve the quality of clinical trials.
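As a sketch of the supervised-learning step, here is a minimal logistic regression (one of the techniques we tried) trained by stochastic gradient descent. The two toy features and labels are invented for illustration and are not taken from the trial data:

```python
import math

def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit logistic regression weights by stochastic gradient descent."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of "bad"
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Classify as bad (1) when the decision function is positive."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Illustrative features: (long_pause_fraction, noise_level) -> bad (1) / good (0)
X = [(0.05, 0.1), (0.08, 0.05), (0.6, 0.7), (0.7, 0.8), (0.1, 0.2), (0.65, 0.6)]
y = [0, 0, 1, 1, 0, 1]
w, b = train_logistic(X, y)
preds = [predict(w, b, x) for x in X]
```

In practice one would use a library implementation, hold out a validation set, and compare against boosted trees, random forests, and SVMs as described above.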
Speech analytics is an area with potential applications in almost every business that has any form of verbal interaction, from call centers to classrooms. With increasing computing power and maturing big data technologies, analyzing large volumes of unstructured speech data is becoming increasingly mainstream. Used appropriately, it can give a company a significant reduction in costs as well as a strong competitive advantage. Some functions, like customer care, have started incorporating speech analytics, but there is still a long way to go before its full potential is realized.
About the Authors/Tiger Analytics
Patanjali V, the primary author, is a Lead Data Scientist at Tiger Analytics. He leads advanced analytics engagements that involve complex/unstructured data.
Anand Bharadwaj, the co-author, is a Director at Tiger Analytics. He has 18+ years of experience in the consulting industry and loves to ensure business value realization of analytics solutions.
Tiger Analytics (www.tigeranalytics.com) provides big data and advanced analytics solutions to help businesses make data-driven decisions. We bring deep expertise in data science along with an understanding of business needs and state-of-the-art technologies to solve business problems.