The world is data-driven. And according to BARC’s BI Trend Monitor 2020, establishing a data-driven culture is the third most important trend today. The amount of data is growing in all areas of our lives, with people and companies continually generating data at increasing speed and with growing variety and complexity.
Robust data technology stacks are needed to deal with the various data-related functions like spam filters, online shopping recommendations, autocomplete for emails, biometrics categorisation for sleep tracking, or route optimisation for daily drives.
While most companies gain insight from their data, only those who can adequately handle data and leverage it for their purposes gain a competitive advantage. The data stack is becoming increasingly complicated as the velocity and volume of data vary. A data-driven culture means following the numbers, advancing data interpretation skills and critical thinking, and creating reliable data on which to base decisions.
“The importance of data quality and master data management is very clear: people can only make the right data-driven decisions if the data they use is correct. Without sufficient data quality, data is practically useless and sometimes even dangerous,” states BARC. “Good AI/ML implementation is reliant on good underlying data,” according to Gradient Flow.
Various reviews of COVID models found that the models were essentially useless because of bad data, with issues such as lack of standardisation, duplication, and mislabelled data. The cost of bad data is estimated at $15 million per organisation annually.
Top-tier venture capitalists are funding startups like Databricks and Scale that deal with bad data and build data quality features into their product suites.
Achieving Good Data Quality
High-quality data meets its users’ specific needs. Mastering data management initiatives requires organisations to take a holistic approach, addressing data quality across people, processes, and technology.
Organisations should have clear responsibilities for data domains, such as customer data and financial figures, and for data roles, such as data owner and operational data quality assurance. They should also adopt specific processes for data quality assurance through a data quality cycle. Lastly, the technology infrastructure should support people in their operations through software features and architecture.
Determining Data Quality
One of the prerequisites for good data is determining data quality in the context of specific domains. The first step is taking inventory of the data assets and choosing a pilot sample data set to assess. Next, the data set can be evaluated on its validity, accuracy, completeness, and consistency, and on how redundant, duplicated, and mismatched the data is. Lastly, a baseline is established on the small data set, which can then be scaled further. Rule-based data management is an approach that lets organisations define rules for specific requirements, establish data quality targets, and compare them with current levels.
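The rule-based approach can be sketched in a few lines of Python. The records, field names, rules, and targets below are illustrative assumptions, not any specific product’s API; the idea is to score a pilot data set against explicit rules and record the result as a baseline:

```python
# Sketch of rule-based data quality checks on a small pilot data set.
# All records, fields, and rules here are illustrative assumptions.
import re

records = [
    {"id": 1, "email": "ann@example.com", "country": "DE"},
    {"id": 2, "email": "bob@example",     "country": "DE"},   # invalid email
    {"id": 3, "email": None,              "country": "FR"},   # missing value
    {"id": 1, "email": "ann@example.com", "country": "DE"},   # duplicate id
]

rules = {
    # completeness: the field must be present and non-empty
    "email_present": lambda r: bool(r.get("email")),
    # validity: the value must match an expected pattern
    "email_valid": lambda r: bool(r.get("email"))
        and re.match(r"[^@]+@[^@]+\.[^@]+$", r["email"]) is not None,
}

def score(records, rules):
    """Return the pass rate per rule, to compare against quality targets."""
    return {name: sum(check(r) for r in records) / len(records)
            for name, check in rules.items()}

baseline = score(records, rules)
duplicate_ids = len(records) - len({r["id"] for r in records})

print(baseline)        # {'email_present': 0.75, 'email_valid': 0.5}
print(duplicate_ids)   # 1 duplicate id in the pilot set
```

A baseline like this makes the comparison against targets concrete: if the target for email validity is, say, 99%, the 50% measured on the pilot set quantifies the gap before the checks are scaled to the full data asset.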
Data Quality Management
Data quality roles consist of a human eye overseeing the data, conducting tests, and writing rules to ensure good data quality. An examination of US job postings by Gradient Flow revealed that the responsibility for maintaining data quality is divided among various roles such as analytics managers, data scientists, and software architects. Even OpenAI has job postings for full-time data engineers.
Improving Data Quality
An essential step in ensuring data quality is data profiling: understanding the data with the help of tools that summarise critical metadata about datasets.
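A minimal profiler can be sketched in Python; the columns and rows here are assumed for illustration. It summarises the kind of metadata profiling tools report: inferred type, null count, and distinct count per column:

```python
# Minimal sketch of data profiling: summarise per-column metadata.
# The rows and column names are illustrative assumptions.

rows = [
    {"customer": "Acme", "revenue": 1200, "region": "EU"},
    {"customer": "Bolt", "revenue": None, "region": "EU"},
    {"customer": "Cape", "revenue": 800,  "region": "US"},
]

def profile(rows):
    """Report inferred type, null count, and distinct count per column."""
    report = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        non_null = [v for v in values if v is not None]
        report[col] = {
            "type": type(non_null[0]).__name__ if non_null else "unknown",
            "nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
        }
    return report

print(profile(rows))
# {'customer': {'type': 'str', 'nulls': 0, 'distinct': 3},
#  'revenue':  {'type': 'int', 'nulls': 1, 'distinct': 2},
#  'region':   {'type': 'str', 'nulls': 0, 'distinct': 2}}
```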
Data cleansing and repair tools help find the root causes of errors, such as duplication, and automatically repair them. Data professionals can manually repair those errors that the machine cannot.
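The split between automatic repair and manual escalation can be sketched as follows. The records, the normalisation rules, and the phone-number check are all illustrative assumptions:

```python
# Sketch of automatic cleansing: normalise values, drop duplicates that
# appear after normalisation, and flag rows the machine cannot repair
# for manual review. Records and rules are illustrative assumptions.

raw = [
    {"name": " Alice ", "phone": "030-123456"},
    {"name": "alice",   "phone": "030-123456"},   # duplicate after normalisation
    {"name": "Bob",     "phone": "???"},          # unrepairable: needs a human
]

def clean(rows):
    seen, repaired, manual = set(), [], []
    for row in rows:
        # automatic repair: trim whitespace, normalise capitalisation
        fixed = {"name": row["name"].strip().title(),
                 "phone": row["phone"]}
        if not fixed["phone"].replace("-", "").isdigit():
            manual.append(row)          # machine cannot repair; escalate
            continue
        key = (fixed["name"], fixed["phone"])
        if key not in seen:             # deduplicate on the normalised key
            seen.add(key)
            repaired.append(fixed)
    return repaired, manual

repaired, manual = clean(raw)
print(repaired)   # [{'name': 'Alice', 'phone': '030-123456'}]
print(manual)     # [{'name': 'Bob', 'phone': '???'}]
```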
According to Collibra, these are the five essential steps to improve data quality.
Metadata management is essential for reaching cross-organisational agreement on how to define various informational assets, converting data into an enterprise asset. Data governance is a package of processes that standardise the management of data assets within an organisation.
A data catalogue makes it easy for users to discover and understand data and to choose good data. Data matching identifies possible duplicates or overlaps to break down data silos and drive consistency. Lastly, data intelligence is the ability to understand data and use it correctly.
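Data matching across silos can be sketched with a simple string-similarity comparison using Python’s standard-library difflib. The two record lists and the 0.7 similarity threshold are illustrative assumptions; production matching tools use richer techniques, but the principle of scoring candidate pairs is the same:

```python
# Sketch of data matching: flag likely duplicate records held in two
# separate systems using a string-similarity ratio from the standard
# library. The record lists and threshold are illustrative assumptions.
from difflib import SequenceMatcher

crm = ["Jonathan Smith", "Maria Garcia"]
billing = ["Jon Smith", "M. Garcia", "Wei Chen"]

def matches(a_list, b_list, threshold=0.7):
    """Return candidate duplicate pairs whose similarity meets the threshold."""
    pairs = []
    for a in a_list:
        for b in b_list:
            ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if ratio >= threshold:
                pairs.append((a, b, round(ratio, 2)))
    return pairs

print(matches(crm, billing))
# [('Jonathan Smith', 'Jon Smith', 0.78), ('Maria Garcia', 'M. Garcia', 0.76)]
```

Pairs scoring above the threshold would typically be routed to a review queue rather than merged automatically, mirroring the machine-plus-human division of labour described above.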