
Textual & Spreadsheet Data – Effective Data Science Series: 5 of 5


Experience has shown that when the data scientist goes after structured and machine-generated data, positive results are few. The most fertile ground is instead textual data and spreadsheet data, as depicted in Fig 1.

But, as has been discussed, there is a barrier to accessing and analyzing textual and spreadsheet data. Textual and spreadsheet data are not “well behaved”. They are erose, that is, irregular in form, and common data base management systems do not hold or interact well with erose data.

Note, however, that just because textual data forms a basis for business value does not mean that ALL textual data is useful for finding business value. Fig 2 notes that some textual data is not fit to serve as a basis for finding business value.

Some textual data is informal. Some textual data is hearsay. Some textual data is casual. So textual data must be vetted as to its suitability to serve as a basis for business value.

The same is true for spreadsheet data. Fig 3 shows that some spreadsheets are not suitable to serve as a basis for finding business value.

Some spreadsheets are informal. Some spreadsheets are casual. Some spreadsheets are created at 9:00 am and are deleted at 10:00 am. There are many reasons why a spreadsheet may not be a good candidate to serve as a basis for finding business value.

Fig 4 shows that there is a continuum of spreadsheets.

In actuality, probably only 10% or less of a corporation's spreadsheets are fit to serve as a basis for finding business value.

Once the organization has vetted both textual data and spreadsheet data, the next step is to employ technology that allows the data to be transformed and loaded into a standard data base management system. Two very different technologies are required. For text, there is textual disambiguation, as seen in Fig 5.

And for spreadsheets there is spreadsheet disambiguation, as seen in Fig 6.

At a high level, textual disambiguation and spreadsheet disambiguation appear to be similar, because they both achieve the same function: they convert unstructured data into a form that a standard data base management system can hold. But once you look inside the two technologies, they are nothing alike.

Textual disambiguation deals with the vagaries of language and text, while spreadsheet disambiguation deals with the idiosyncrasies of spreadsheets.

Once text and/or spreadsheets have been disambiguated, they are turned into a standard data base. And after they have been turned into a standard data base, then (and only then) the data scientist can start to do his/her analysis.
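As a rough illustration of what "turned into a standard data base" can mean for a spreadsheet, here is a minimal Python sketch. It is not the disambiguation technology described in this article. It only shows the simplest possible case: an irregular grid with blank cells is flattened into uniform rows and loaded into SQLite. The sample sheet, the `cell` table and the function names `sheet_to_rows` and `load` are all hypothetical, chosen for the example.

```python
import csv
import io
import sqlite3

# Hypothetical spreadsheet export: the first row holds column headings,
# the first column holds row labels, and some cells are blank.
SHEET = """region,Q1,Q2,Q3
North,100,120,
South,90,,110
"""


def sheet_to_rows(text):
    """Flatten a grid into (row_label, column_label, value) tuples, skipping blank cells."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = []
    for record in reader:
        if not record:
            continue  # ignore completely empty lines
        label = record[0]
        for column, cell in zip(header[1:], record[1:]):
            if cell.strip():
                rows.append((label, column, float(cell)))
    return rows


def load(rows):
    """Load the flattened rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cell (row_label TEXT, column_label TEXT, value REAL)"
    )
    conn.executemany("INSERT INTO cell VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


rows = sheet_to_rows(SHEET)
conn = load(rows)

# Once the data sits in a table, ordinary SQL analysis becomes possible.
total = conn.execute(
    "SELECT SUM(value) FROM cell WHERE row_label = 'North'"
).fetchone()[0]
print(total)
```

Real spreadsheets are far less regular than this sample (merged cells, multiple tables per sheet, formulas, free-form notes), which is where the idiosyncrasies mentioned above come in.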

It is disambiguation technology that breaks down the shield of opaqueness that surrounds text and spreadsheets.


William Inmon

William H. Inmon (born 1945) is an American computer scientist, recognized by many as the father of the data warehouse. Bill Inmon wrote the first book, held the first conference (with Arnie Barnett), wrote the first column in a magazine and was the first to offer classes in data warehousing. Bill Inmon created the accepted definition of what a data warehouse is - a subject oriented, nonvolatile, integrated, time variant collection of data in support of management's decisions.