Most evolutions are silent. They take place in front of our eyes and the change is so gradual that we are not even aware of the evolution. It is only when we step aside and compress time and look at the progression that has occurred over time that it becomes obvious that we have been witnessing an evolution.
There has been an ongoing evolution in textual analytics. That evolution is illustrated in the figure below.
Each step in the evolution has addressed and solved a clear and present requirement. But each step has addressed only limited aspects of textual analytics.
Let’s have a look at each step in the evolution.
Comments
In the beginning were comments. Programmers found that they needed to capture comments, so they defined a field in their data structure called “comments” and stuffed comments into it.
There were lots of problems with comments. Defining the field length was always problematic: the comment was invariably longer or shorter than the space specified. It was, universally, an uncomfortable fit.
But the real problem with the comments field was that once data was placed in it, you couldn’t do anything with it. Historically, however, the definition of the comments field was the first step taken toward textual analytics.
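The awkward fit of a fixed-length comments field can be sketched in a few lines. This is an illustrative toy, not any particular system’s record layout, and the field width is an assumption for the example:

```python
# Toy sketch of a fixed-width "comments" field, as early record layouts
# defined it. The width of 40 characters is an assumption for illustration.
COMMENT_WIDTH = 40

def store_comment(text: str) -> str:
    """Force a comment into the fixed-width field: truncate or pad."""
    return text[:COMMENT_WIDTH].ljust(COMMENT_WIDTH)

# A long comment is silently cut off; a short one wastes space.
stored = store_comment("Customer called twice about a billing error on invoice 1187.")
print(len(stored))  # always exactly 40, regardless of the original comment
```

Whatever the comment’s actual length, the field forces it into the same box, losing the tail of long comments and padding short ones.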
Blobs
Next came blobs. A blob was a specially defined field that held large amounts of undefined data. For the most part, blobs solved the problem of the irregular length of the comments field. But blobs did not address the usability of the data inside them: you could put all sorts of data into a blob, but you really couldn’t do anything with it.
However, it must be recognized that blobs were a step in the evolution toward textual analytics.
Soundex and stemming
Next came Soundex and stemming, the first attempts to go into the blob and start to make sense of the text found there. Soundex grouped words according to the sound made when the word was pronounced, so that similar-sounding words received the same code. Related to Soundex was stemming, the practice of reducing words to a common root or stem. In many cases reducing words to a common foundation helped in understanding the text.
Soundex and stemming were unquestionably pioneering steps in the effort to make sense of text, and they were the next steps in the evolution of textual analytics.
However, as important as Soundex and stemming were, there is far greater complexity to text and language than these techniques could capture.
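Both techniques are simple enough to sketch. Below is the classic Soundex coding scheme (first letter plus three digits for similar-sounding consonant groups) and a deliberately naive suffix-stripping stemmer; real stemmers such as Porter’s are far more careful, so treat the stemmer here as a toy:

```python
def soundex(word: str) -> str:
    """Classic Soundex: first letter plus three digits coding consonant groups."""
    codes = {"b": "1", "f": "1", "p": "1", "v": "1",
             "c": "2", "g": "2", "j": "2", "k": "2",
             "q": "2", "s": "2", "x": "2", "z": "2",
             "d": "3", "t": "3",
             "l": "4",
             "m": "5", "n": "5",
             "r": "6"}
    word = word.lower()
    first = word[0].upper()
    digits = []
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":  # 'h' and 'w' do not separate a run of like codes
            prev = code
    return (first + "".join(digits) + "000")[:4]

def naive_stem(word: str) -> str:
    """Toy stemmer: strip a few common suffixes. Real stemmers are subtler."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(soundex("Robert"), soundex("Rupert"))  # both code to R163
print(naive_stem("jumped"))                  # jump
```

“Robert” and “Rupert” receiving the same code is exactly the point: similar-sounding words are grouped even when spelled differently.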
Word tagging
The next step in the evolution of textual analytics was word tagging. In word tagging, a document was read and the words of importance were tagged. Words could be tagged for many purposes: there were keywords, action words, and red flag words. In truth there are many reasons why a given word might be tagged.
As such, tagging words was a real step forward in the progression toward textual analytics.
But tagging words had its own set of drawbacks. The primary drawback was that in order to do tagging you had to know which words needed to be tagged before you read the document. In some cases this was not a large problem, but in other cases it was. Many words that needed to be tagged in fact were not, making the process of tagging less than wholly satisfactory.
However, in the evolution of textual analytics tagging represented a very large and positive step forward.
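The mechanics of tagging can be sketched as a scan of a document against predefined word lists. The categories and vocabularies below are made-up examples, and note that the lists must exist before the document is read, which is precisely the drawback described above:

```python
# Sketch of word tagging: scan a document against lists of words chosen
# in advance. The categories and vocabularies are illustrative, not standard.
TAG_LISTS = {
    "keyword":  {"invoice", "contract", "shipment"},
    "action":   {"cancel", "renew", "escalate"},
    "red_flag": {"lawsuit", "fraud", "complaint"},
}

def tag_words(document: str) -> dict:
    """Return, for each tag category, the matching words found in order."""
    found = {tag: [] for tag in TAG_LISTS}
    for word in document.lower().split():
        word = word.strip(".,!?")
        for tag, vocab in TAG_LISTS.items():
            if word in vocab:
                found[tag].append(word)
    return found

print(tag_words("Please cancel the contract before the lawsuit."))
```

Any important word missing from `TAG_LISTS` simply goes unnoticed — the scan can only find what it was told to look for.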
NLP and Taxonomies
Next in the evolution came NLP (natural language processing). NLP used taxonomies to gather sentiment. Taxonomies were much more powerful and much more general than tagging (although it can be argued that the use of taxonomies and ontologies is an advanced form of tagging). With NLP it was possible to gather enough information to start to do real textual analytics. In many ways textual analytics really began with taxonomies and NLP.
However as powerful and as attractive as NLP was (and is) there were still some missing ingredients.
In any case NLP and taxonomies represent a tremendous and positive evolutionary step.
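A minimal sketch of the taxonomy idea: specific words roll up to broader concepts, and a sentiment lexicon scores the text. Both vocabularies below are invented for illustration and the negation handling is deliberately crude — this is a toy, not how any production NLP system works:

```python
# Sketch of taxonomy-driven sentiment: a tiny made-up taxonomy rolls words
# up to broader concepts; a tiny made-up lexicon scores sentiment.
TAXONOMY = {
    "honda": "car", "toyota": "car", "porsche": "car",
    "eggs": "food", "bacon": "food", "toast": "food",
}
SENTIMENT = {"like": 1, "love": 2, "hate": -2}
NEGATORS = {"don't", "not", "never"}

def analyze(text: str):
    """Return the broader concepts mentioned and a crude sentiment score."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    concepts = {TAXONOMY[w] for w in words if w in TAXONOMY}
    score, prev = 0, ""
    for w in words:
        s = SENTIMENT.get(w, 0)
        if prev in NEGATORS:  # naive negation: flip the following word's score
            s = -s
        score += s
        prev = w
    return concepts, score

print(analyze("I don't like scrambled eggs."))  # ({'food'}, -1)
```

The taxonomy is what lifts this above plain tagging: “eggs” is not just a hit on a word list, it is recognized as an instance of the broader concept “food.”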
Textual disambiguation
The most evolved form of textual analytics is textual disambiguation. Textual disambiguation builds on the advances made by taxonomies and NLP, and it certainly provides all the capabilities of sentiment analysis. But it provides two important advances over NLP: the focus on context and the ability to recognize word and sentence structures.
To understand the value of context, consider this simple example.
“I don’t like scrambled eggs.” Sentiment analysis tells us that the author does not like scrambled eggs. And that is a useful piece of information.
“I don’t like scrambled eggs because they are too expensive.” Textual disambiguation tells us that the author doesn’t like scrambled eggs. But textual disambiguation tells us something more. Textual disambiguation tells us why the author doesn’t like scrambled eggs. The scrambled eggs are too expensive.
There could be many reasons why the author doesn’t like scrambled eggs. The eggs may be runny. The eggs may be cold. The author may be a vegetarian. The eggs may have too much cholesterol. There could be MANY reasons. But textual disambiguation goes one step further than NLP and gives us a different level of information.
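One facet of this can be sketched with a simple clause pattern: beyond detecting the sentiment, pull out the stated reason by splitting the sentence around “because.” This toy pattern is an assumption for illustration, not the method the author’s tools actually use:

```python
# Toy sketch of reason extraction: split an opinion from its stated cause
# around the word "because". Illustrative only, not a real disambiguator.
import re

def disambiguate(sentence: str) -> dict:
    """Return the opinion clause and, if stated, the reason clause."""
    match = re.match(r"(.*?)\s+because\s+(.*?)[.!?]?$", sentence, re.IGNORECASE)
    if match:
        return {"opinion": match.group(1).strip(),
                "reason": match.group(2).strip()}
    return {"opinion": sentence.rstrip(".!?"), "reason": None}

print(disambiguate("I don't like scrambled eggs because they are too expensive."))
```

Sentiment analysis stops at the opinion; the extra structure here is what surfaces the “why” — in this case, that the eggs are too expensive.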
In addition, textual disambiguation has the capability of discerning context for word and sentence structure.
There are a couple of interesting observations to be made about this evolution. Programmers began creating comments fields as early as 1965. Textual disambiguation is alive and well in 2016. The whole evolution has taken a paltry 50 years or so, which, as far as evolutions are concerned, is a very short amount of time. The textual analytics evolution has occurred very, very quickly.
Another observation is that the evolution is continuing, like all evolutions. Even though textual disambiguation is in an advanced state, by no means is the evolution complete.
William H. Inmon (born 1945) is an American computer scientist, recognized by many as the father of the data warehouse. Bill Inmon wrote the first book, held the first conference (with Arnie Barnett), wrote the first column in a magazine and was the first to offer classes in data warehousing. Bill Inmon created the accepted definition of what a data warehouse is: a subject-oriented, nonvolatile, integrated, time-variant collection of data in support of management's decisions.