
The five pillars of unstructured data governance

Companies must ensure that data quality is not hindered when exposed to new procedures and systems.

Data availability is a common issue, especially in India, where datasets for regional-language NLP and government data are sparse. Yet government and organisational data usually exist in large volumes, untapped and going to waste. Moreover, unstructured data does not have defined, consistent fields. It may not even contain numbers or text, making it difficult to leverage for data science and machine learning projects.

Recent digital transformation initiatives have stressed the importance of cleaning and structuring this data to derive business value from it. But to ensure accurate and secure insights from data, it is important to govern it correctly. Data governance practices include understanding the enterprise data and ensuring it is free from potential risks while generating insights. While governing unstructured data can be tricky, it is not impossible. Analytics India Magazine has outlined a few key tricks and tools to improve unstructured data governance.

Establish governance guidelines and develop a governance model

Unstructured data often contains important and sensitive information, just as structured data does, so organisations need to treat it just as carefully. The key step in any data governance effort is creating guidelines that outline who owns and can access the data, how the data will be treated, which tools are assigned to which stakeholders, and how the data is controlled. This supports data privacy and compliance within the organisation.

The “Governing Unstructured Data: Microsoft-enabled Data Classification and Protection” talk at the 2021 Edgile conference in collaboration with Microsoft discussed some key governance techniques for unstructured data and suggested developing a governance model. The ‘Governance Pyramid’ they operate in has the operational (department) level at the bottom, the tactical level in the middle and the strategic level at the top. The operational level generates 80% of the data, so governance starts there with understanding how the data will be used. The tactical level then studies how the unstructured data is output into spreadsheets or files that are put into other systems, which helps in understanding the risks of having data move between systems. The strategic governance committee later audits this to ensure security tools are in place and there has been no breach. Lastly, leadership decides the future of this data.

Simple labelling schema

It is important to keep the labelling schema as simple as possible to encourage usage and adoption of the unstructured data. At Microsoft, data is labelled in four categories depending on its risk level: public, internal, confidential, and restricted. These labels are then refined by use case. Within confidential and restricted, the data is further tagged for internal usage only, receipt only, or for a business unit or business unit team.
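A four-level schema with a handful of sub-labels can be sketched in a few lines of Python. This is only an illustration, not Microsoft's implementation: the names `Sensitivity`, `SUB_LABELS` and `make_label` are hypothetical, and the sub-label strings are taken from the tags mentioned in the text.

```python
from enum import IntEnum
from typing import Optional


class Sensitivity(IntEnum):
    """The four risk-based categories, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical sub-label names, based on the tags described in the article.
SUB_LABELS = {"internal usage only", "receipt only", "business unit"}


def make_label(level: Sensitivity, sub_label: Optional[str] = None) -> str:
    """Build a label string; sub-labels are allowed only on the two highest levels."""
    if sub_label is None:
        return level.name.lower()
    if level < Sensitivity.CONFIDENTIAL:
        raise ValueError("sub-labels apply only to confidential and restricted data")
    if sub_label not in SUB_LABELS:
        raise ValueError(f"unknown sub-label: {sub_label!r}")
    return f"{level.name.lower()} / {sub_label}"
```

Keeping the levels in an ordered enumeration means a rule such as "sub-labels only from confidential upwards" is a single comparison, which is one way to keep the schema simple enough for people to adopt.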

Validate every data source

Companies usually have tons of their own organisational data to leverage, but that is not always enough. Most businesses also acquire data from external sources for a more holistic data repository. With external data sources, it is essential to ensure the data can be trusted, and this is the biggest governance challenge. The first step is to clarify the company’s values and governance standards, against which any new vendor will be examined. Companies also need to consult their legal team on these policies and on the regional regulations that must be met. The checks should cover the data provider, where the data has been acquired from (to ensure the data is both trustworthy and legal) and how the data has been prepared. Organisations can also take the extra step of vetting the data source through its recent customers and through IT audits.

Analyse the quality of data

Once the data source has been verified, organisations need to conduct their own test of the data quality. This is because the analytics-based business solutions the company will be using depend heavily on the quality and validity of the data. If the data is wrong, the product will be wrong too. Hence, companies must ensure that data quality is not hindered when exposed to new procedures and systems. Companies can assess data quality based on the sources, accuracy, meaning, number of empty values, consistency, quantity of dark data, and time-to-value.
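Two of the criteria above, the number of empty values and basic consistency, are easy to measure automatically. The sketch below is a minimal example under stated assumptions: records are plain Python dictionaries, `quality_report` is a hypothetical helper, and exact duplicates are used as a rough stand-in for a consistency check. Accuracy, meaning and time-to-value still need human or domain-specific review.

```python
from collections import Counter


def quality_report(records, required_fields):
    """Summarise empty values per required field and count exact duplicate records."""
    total = len(records)

    # A value counts as empty if it is missing, None, or a blank string.
    empty = Counter()
    for rec in records:
        for name in required_fields:
            value = rec.get(name)
            if value is None or (isinstance(value, str) and not value.strip()):
                empty[name] += 1

    # Count records that repeat an earlier record field for field.
    seen = set()
    duplicates = 0
    for rec in records:
        key = tuple(sorted((k, str(v)) for k, v in rec.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)

    return {
        "total_records": total,
        "empty_ratio": {
            name: (empty[name] / total if total else 0.0) for name in required_fields
        },
        "duplicate_records": duplicates,
    }
```

A report like this can be run on every new batch from a source, so that a sudden rise in empty or duplicated records is noticed before the data reaches downstream analytics.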

Secure good data and dispose of bad data

After the data quality has been assessed and organisations have separated good data from bad, the next governance step is to secure the good data while disposing of the non-useful information. Organisations should secure unstructured data as they would structured data. Common techniques for securing unstructured data include using trusted networks, perimeter monitoring, data encryption and assigning data to an owner, all of which help identify areas vulnerable to breaches and secure them. Organisations should also ensure traceability of user logins to track who has access to and control over the data.
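Ownership and traceability can be illustrated with a small in-memory model. This is a sketch only: the `GovernedDocument` class, its field names and the example user names are hypothetical, and a real deployment would rely on the organisation's identity provider and a tamper-resistant log store rather than a Python list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class GovernedDocument:
    """A piece of unstructured data with a named owner and an access trail."""
    doc_id: str
    owner: str
    allowed_users: set = field(default_factory=set)
    access_log: list = field(default_factory=list)

    def request_access(self, user: str) -> bool:
        """Decide whether the user may read the document and record the attempt."""
        granted = user == self.owner or user in self.allowed_users
        # Every attempt is logged, including denied ones, so access stays traceable.
        self.access_log.append(
            {
                "user": user,
                "granted": granted,
                "at": datetime.now(timezone.utc).isoformat(),
            }
        )
        return granted


doc = GovernedDocument("contract-001", owner="alice", allowed_users={"bob"})
doc.request_access("bob")      # allowed, and logged
doc.request_access("mallory")  # denied, and still logged
```

Logging denied attempts as well as granted ones is a deliberate choice here, since failed requests are often the first sign of an area that is vulnerable to a breach.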

Bad data should be eliminated in its entirety. In fact, experts suggest it should be deleted in its raw form as part of the data preparation process. Physical data can be disposed of by cleaning or digital shredding.


Avi Gopani

Avi Gopani is a technology journalist at Analytics India Magazine who seeks to analyse industry trends and developments from an interdisciplinary perspective. Her articles chronicle cultural, political and social stories that are curated with a focus on the evolving technologies of artificial intelligence and data analytics.