Advanced text mining by extracting insights from various forums helps in addressing and analyzing customer feedback

The client, the world’s leading computer technology company, faced the following issues in classifying customer complaints:

  • The client was classifying customer issues and complaints manually, based on complaints that customers posted on the company’s many tech forums and on call center agent transcripts
  • The client wanted to automate the classification process of textual data
  • The client wanted a comprehensive dictionary, based on a thorough understanding of the entire dataset, that could also be used to standardize the classification of customer queries arriving through other mediums such as email, speech, and social media

The company captured large volumes of “conversations” across different customer feedback forums, as well as agent notes related to customer calls, and stored them in databases. A large team extracted samples of these “conversations”, coded them into one or more pre-defined code frames, and categorized them accordingly.

The engagement scope outlined building a “text mining and codification engine” that applies Natural Language Processing to automatically categorize conversations into different buckets. This categorization would in turn help Microsoft analyse the text data objectively and quantitatively. The focus was to improve the accuracy of the categorization throughout the engagement lifecycle. The task was also to identify key metrics and formulate visualization capabilities for insights from unstructured (primarily textual) data.


The overall solution had four major milestones as defined below.

Step 1: Data Access

Blueocean Market Intelligence (BMI) accessed the textual data for the 6 levels of classification over a secured connection and downloaded it to a secure location within BMI premises

Step 2: Data Loading

Merged and cleaned forum and agent transcript data to create 130,000 rows
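
As a rough illustration of this merge-and-clean step, the sketch below combines two sources, normalizes whitespace, and drops empty or duplicate texts. The field names (`text`, `labels`, `source`) and the cleaning rules are assumptions made for the example; the source does not describe the actual cleaning logic.

```python
import re

def clean_text(text):
    """Collapse runs of whitespace and strip leading/trailing blanks."""
    return re.sub(r"\s+", " ", text).strip()

def merge_sources(forum_rows, agent_rows):
    """Combine both sources, tag each row with its origin,
    and skip rows whose text is empty or already seen (case-insensitive)."""
    merged, seen = [], set()
    for source, rows in (("forum", forum_rows), ("agent", agent_rows)):
        for row in rows:
            text = clean_text(row["text"])
            key = text.lower()
            if not text or key in seen:
                continue
            seen.add(key)
            merged.append({"source": source, "text": text, "labels": row["labels"]})
    return merged

# Hypothetical sample rows, not taken from the client data.
forum = [{"text": "Laptop  won't boot\n", "labels": ["Hardware"]},
         {"text": "", "labels": []}]
agent = [{"text": "laptop won't boot", "labels": ["Hardware"]},
         {"text": "Cannot activate license", "labels": ["Licensing"]}]
rows = merge_sources(forum, agent)
```

Here the empty forum row and the duplicated agent row are dropped, leaving two rows.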

The data received covered a breadth of topics and had 6 levels of classification for each row. Level 5 had the highest number of distinct nodes at 4,277, followed by Level 3 at 616 nodes.
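
Counts of distinct nodes per level, like those quoted above, could be produced along the following lines. This is only a sketch: the `levels` field and the sample category names are hypothetical, and three levels are used instead of six to keep the example short.

```python
def distinct_nodes_per_level(rows, n_levels):
    """Return {level_number: count of distinct node names at that level}."""
    counts = {}
    for level in range(n_levels):
        nodes = {
            row["levels"][level]
            for row in rows
            if len(row["levels"]) > level and row["levels"][level]
        }
        counts[level + 1] = len(nodes)
    return counts

# Hypothetical hierarchy paths, one per row.
sample = [
    {"levels": ["Hardware", "Laptop", "Boot"]},
    {"levels": ["Hardware", "Laptop", "Battery"]},
    {"levels": ["Software", "OS", "Update"]},
]
node_counts = distinct_nodes_per_level(sample, n_levels=3)
```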

Step 3: Classification Design

Evaluated TAXIS at 3 levels

Used comprehensive Natural Language Processing techniques to ensure most categories are captured –

  • Verb replacement
  • Lemmatization
  • Normalization

Used a stemming approach to reduce inflected verb forms
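
A minimal sketch of how normalization, verb replacement, and stemming might be chained is shown below. It makes several assumptions: "verb replacement" is interpreted as mapping variant verbs to one canonical verb, the `VERB_MAP` entries are invented for illustration, and `naive_stem` is a crude suffix stripper standing in for a real stemmer. Lemmatization is not shown, since it needs a vocabulary resource.

```python
import re

# Hypothetical replacement map; the source does not list the actual mappings.
VERB_MAP = {"crashed": "fail", "crashes": "fail", "froze": "fail", "hangs": "fail"}
SUFFIXES = ("ing", "ed", "es", "s")

def normalize(text):
    """Lowercase the text and keep only alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Normalize, replace known verb variants, then stem each token."""
    return [naive_stem(VERB_MAP.get(tok, tok)) for tok in normalize(text)]

tokens = preprocess("Installer CRASHES after updating drivers!")
# tokens == ["installer", "fail", "after", "updat", "driver"]
```

As the output shows, stems such as "updat" need not be real words; they only need to be consistent across inflected forms.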

Enhanced the results further by combining the earlier techniques, e.g. Basic and NLP-refined

Converted multi-class records into a single class by applying basic as well as weighting techniques
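
One way to picture a weighting technique for collapsing a multi-class record into a single class is a weighted keyword dictionary: each matched keyword adds its weight to a category, and the highest-scoring category wins. The dictionary entries, weights, and tie-breaking rule below are all assumptions for illustration, not the engine described in the case study.

```python
from collections import defaultdict

# Hypothetical dictionary: keyword -> (category, weight).
DICTIONARY = {
    "boot": ("Hardware", 2.0),
    "battery": ("Hardware", 1.5),
    "license": ("Licensing", 3.0),
    "activate": ("Licensing", 1.0),
    "install": ("Software", 1.0),
}

def score_categories(tokens):
    """Sum the weights of matched keywords per category."""
    scores = defaultdict(float)
    for tok in tokens:
        if tok in DICTIONARY:
            category, weight = DICTIONARY[tok]
            scores[category] += weight
    return dict(scores)

def single_class(tokens, default="Unclassified"):
    """Pick the category with the highest total weight.
    Ties are broken alphabetically so the result is deterministic."""
    scores = score_categories(tokens)
    if not scores:
        return default
    return min(scores, key=lambda c: (-scores[c], c))

label = single_class(["cannot", "activate", "license", "after", "boot"])
```

In this example the record matches both Licensing (total weight 4.0) and Hardware (2.0), so it is assigned to Licensing.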

Step 4: Development Accuracy

Development set was fed into the classification engine (internal benchmark set at 80%)

Step 5: Validation Accuracy

Validation set was fed into the classification engine (internal benchmark set at 70%)
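
The development and validation checks can be thought of as comparing plain classification accuracy against each benchmark. The sketch below assumes accuracy means the share of records whose predicted category equals the labeled category; the source does not define the metric, and the sample labels are invented.

```python
# Benchmarks quoted in the steps above.
BENCHMARKS = {"development": 0.80, "validation": 0.70}

def accuracy(predicted, actual):
    """Fraction of records whose predicted category equals the labeled one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def meets_benchmark(split, predicted, actual):
    """True when accuracy on the given split reaches its benchmark."""
    return accuracy(predicted, actual) >= BENCHMARKS[split]

# Hypothetical predictions and labels.
predicted = ["A", "B", "A", "C"]
actual = ["A", "B", "B", "C"]
score = accuracy(predicted, actual)
```

With three of four records correct, the score is 0.75, which would pass the validation benchmark but not the development one.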

Step 6: Reporting

Reports were designed as per the parameters specified in the reporting framework

Step 7: QA Design and Process

Tested using unseen or unlabeled data to check whether the auto-coded results generalize and can be used to accurately classify data points


The text mining framework was able to classify the texts automatically. There were five levels of classification, which helped ensure that the result was accurate. This enabled the company to make quick, informed decisions based on customer feedback.

  • Achieved accuracy close to 60% in the first iteration
  • Team built a comprehensive dictionary based on a thorough understanding of the entire dataset
  • The technology major has been able to identify the major points of concern for its customers
  • Time and cost for identifying customer issues have been reduced, as classified information from unofficial forums and mediums is now readily available
  • Identified overlapping categories and organized under one category to represent the data
  • Identified duplicate categories and removed the additional categories
  • Overall, helped the client revisit their categorization process and established appropriate categories
  • Provided a roadmap to enable classification of issues on general blogs and forums
  • In the future, this process can be executed in a similar manner to potentially cover data from other sources like email, speech and other forms of social media

Blueocean Market Intelligence
Blueocean Market Intelligence is a global analytics and insights provider that helps corporations realize a 360-degree view of their customers through data integration and a multi-disciplinary approach that enables sound, data-driven business decisions. Since we live in a highly dynamic and multi-dimensional world, we believe the most effective business decisions come from a synthesis of data streams and not from one-dimensional sources. Using our 360 Discovery™ approach, we ensure the comprehensive use of all available structured and unstructured data sources, enabling us to bring the best to bear against each engagement. Strong decision support is enabled through a combination of analytics, domain expertise, engineering and visualization skills brought together in harmony. Leading companies have benefited from our partnership with financial growth, 360 views of their markets and competition, and improved customer acquisition, satisfaction and retention.