Last updated February 21, 2022

The five pillars of unstructured data governance

Companies must ensure that data quality is not hindered when exposed to new procedures and systems.

Published on February 21, 2022

by Avi Gopani

Data availability is a common issue, especially in India, where datasets for regional-language NLP or government data are sparse. But government and organisational data are usually present in large volumes, untapped and going to waste. Moreover, unstructured data does not have defined, consistent fields. It may not even contain numbers or text, making it difficult to use in data science and machine learning projects.

Recent digital transformation initiatives have stressed the importance of cleaning and structuring this data to derive business value from it. But to ensure accurate and secure insights, the data must be governed correctly. Data governance practices include understanding the enterprise data and ensuring it is free from potential risks while generating insights. While governing unstructured data can be tricky, it is not impossible. Analytics India Magazine has outlined a few key tricks and tools to improve unstructured data governance.

Establishing governance guidelines and developing a governance model

Unstructured data often contains important and sensitive information, just as structured data does, so organisations should treat it just as carefully. The key step in any data governance effort is creating guidelines that set out who owns and can access the data, how the data will be treated, which tools are assigned to stakeholders, and how the data is controlled. This helps assure data privacy and compliance within the organisation.

The “Governing Unstructured Data: Microsoft-enabled Data Classification and Protection” talk at the 2021 Edgile conference, held in collaboration with Microsoft, discussed key governance techniques for unstructured data and suggested developing a governance model. The ‘Government Pyramid’ the speakers described has the operational (department) level at the bottom, the tactical level in the middle and the strategic level at the top. The operational level generates 80% of the data, so governance starts there with understanding how the data will be used. The tactical level then studies how unstructured data is output into spreadsheets or files that are loaded into other systems, which helps identify the risks of moving data between systems. The strategic governance committee later audits this to ensure security tools are in place and there has been no breach. Lastly, leadership decides the future of the data.

Simple labelling schema

It is important to keep the labelling schema as simple as possible to encourage usage and adoption. At Microsoft, data is labelled in four categories depending on its risk level: public, internal, confidential, and restricted. These labels are then refined by use case: within confidential and restricted, data is tagged as internal usage only, receipt only, or for a business unit or business unit team.
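A four-label schema like the one described above is small enough to express directly in code. The following is a minimal sketch, not Microsoft's implementation: the ordering of the labels, the `can_read` rule and the fail-safe default in `parse_label` are all assumptions made for illustration.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """The four labels named in the text, ordered (by assumption) from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def can_read(clearance: Sensitivity, label: Sensitivity) -> bool:
    """Hypothetical rule: a user may read data at or below their clearance."""
    return clearance >= label


def parse_label(text: str) -> Sensitivity:
    """Map a free-text label to the schema.

    Unknown labels fall back to RESTRICTED, an assumed fail-safe default.
    """
    try:
        return Sensitivity[text.strip().upper()]
    except KeyError:
        return Sensitivity.RESTRICTED


if __name__ == "__main__":
    label = parse_label("confidential")
    print(label.name, can_read(Sensitivity.INTERNAL, label))
```

Keeping the schema to a single small enumeration means every tool that touches the data can share the same definition, which is one way to support the adoption the simple schema is meant to encourage.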

Validate every data source

Companies usually have plenty of their own organisational data to leverage, but that is not always enough. Most businesses also acquire data from external sources to build a more holistic data repository. With external sources, it is essential to ensure the data can be trusted, and this is the biggest governance challenge. The first step is to clarify the company’s values and governance standards against which any new vendor will be examined. Companies also need to consult their legal team on these policies and on the regional regulations that must be met. The standards should cover the data provider, where the data was acquired from (to ensure it is both trustworthy and legal) and how the data was prepared. Organisations can also take the extra step of vetting the data source through its recent customers and through IT audits.
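The vetting criteria above can be recorded as an explicit checklist so that no external source is approved with an open item. This is only a sketch: the `SourceReview` class and its field names are hypothetical, chosen to mirror the factors listed in the text (provider, provenance, preparation, legal review).

```python
from dataclasses import dataclass


@dataclass
class SourceReview:
    """Hypothetical checklist for one external data source."""
    provider_vetted: bool          # provider checked, e.g. via recent customers or IT audits
    provenance_documented: bool    # where the data was acquired from is known
    preparation_documented: bool   # how the data was prepared is known
    legal_sign_off: bool           # legal team confirmed policy and regional compliance

    def failures(self) -> list[str]:
        """Names of the checks that have not passed, in declaration order."""
        return [name for name, passed in vars(self).items() if not passed]

    def approved(self) -> bool:
        """A source is approved only when every check has passed."""
        return not self.failures()


if __name__ == "__main__":
    review = SourceReview(True, True, False, True)
    print(review.approved(), review.failures())
```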

Analyse the quality of data

Once the data source has been verified, organisations need to conduct their own tests of data quality, because the analytics-based business solutions the company will use depend heavily on the quality and validity of the data. If the data is wrong, the product will be wrong too. Hence, companies must ensure that data quality is not hindered when exposed to new procedures and systems. Companies can assess data quality based on its sources, accuracy, meaning, number of empty values, consistency, quantity of dark data, and time-to-value.
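Two of the criteria listed above, the number of empty values and consistency, are easy to measure once unstructured content has been extracted into records. The sketch below assumes the records are plain Python dictionaries; the function names and the definition of consistency (the share of values that have the most common type) are illustrative choices, not a standard.

```python
from collections import Counter


def is_empty(value) -> bool:
    """Treat None and blank strings as empty; 0 and False count as real values."""
    return value is None or (isinstance(value, str) and not value.strip())


def empty_ratio(records: list[dict], field: str) -> float:
    """Share of records where `field` is missing or empty."""
    if not records:
        return 0.0
    return sum(is_empty(r.get(field)) for r in records) / len(records)


def type_consistency(records: list[dict], field: str) -> float:
    """Share of non-empty values that have the most common Python type."""
    values = [r[field] for r in records if not is_empty(r.get(field))]
    if not values:
        return 1.0
    counts = Counter(type(v).__name__ for v in values)
    return counts.most_common(1)[0][1] / len(values)


if __name__ == "__main__":
    records = [
        {"name": "A", "age": 30},
        {"name": " ", "age": "31"},   # blank name, age stored as text
        {"name": "C"},                # age missing
        {"name": "D", "age": 40},
    ]
    print(empty_ratio(records, "name"), type_consistency(records, "age"))
```

Tracking such ratios before and after data passes through a new system gives a simple way to check that quality has not degraded along the way.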

Secure good data and dispose of bad data

After data quality has been assessed and the organisation has separated good data from bad, the next governance step is to secure the good data and dispose of information that is not useful. Organisations should secure unstructured data as they would structured data. Common techniques include using trusted networks, perimeter monitoring, data encryption and assigning each dataset to an owner; these help identify areas vulnerable to breaches and secure them. Organisations should also ensure traceability of user logins to track who has access to and control over the data.
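Ownership and traceability, as described above, can be illustrated with a small in-memory model. This is a sketch under stated assumptions: the `GovernedAsset` class, its fields and the example names are hypothetical, and a real deployment would rely on the access controls and audit logs of its storage platform rather than on application code like this.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class GovernedAsset:
    """Hypothetical record tying a piece of data to an owner and an access log."""
    path: str
    owner: str
    allowed_users: set[str] = field(default_factory=set)
    # Each entry is (UTC timestamp, user, whether access was granted).
    access_log: list[tuple[str, str, bool]] = field(default_factory=list)

    def request_access(self, user: str) -> bool:
        """Grant access to the owner or an allowed user, and log every attempt."""
        granted = user == self.owner or user in self.allowed_users
        timestamp = datetime.now(timezone.utc).isoformat()
        self.access_log.append((timestamp, user, granted))
        return granted


if __name__ == "__main__":
    asset = GovernedAsset("contracts/2021.pdf", owner="legal-team", allowed_users={"alice"})
    asset.request_access("alice")
    asset.request_access("mallory")
    for entry in asset.access_log:
        print(entry)
```

Logging denied attempts as well as granted ones is a deliberate choice here, since failed requests are often the first sign of an area vulnerable to a breach.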

Bad data should be eliminated in its entirety. In fact, experts suggest it should be deleted in its raw form as part of the data preparation process. Physical data can be disposed of by cleaning or digital shredding.



Avi Gopani

Avi Gopani is a technology journalist who seeks to analyse industry trends and developments from an interdisciplinary perspective at Analytics India Magazine. Her articles chronicle cultural, political and social stories that are curated with a focus on the evolving technologies of artificial intelligence and data analytics.
