Managing corporate data usually necessitates several software tools – a CRM, email marketing tools, an ERP, and so on – and each of these programmes collects data differently. Add third-party data to the mix, and the endless stream becomes hard to make sense of; for data-driven decisions, it needs to be organised into a single source.
This is where data unification comes in. The process merges all of an organisation’s data, spread across various operating systems and formats, and standardises it so it can be treated as a single source. According to industry reports, the amount of data created, captured, and consumed globally is expected to reach 180 zettabytes (ZB) by 2025, up from 64 ZB in 2020. This growth makes data unification even more pressing, yet organisations face numerous challenges:
Ensuring Clean Data
Data unification is not just about organising data into a single source; it also calls for maintaining data accuracy. For data to be accurate, it must satisfy two criteria: form and content.
Consider, for example, how date formats can be problematic. The same day, 10 August 2021, is stored as “8/10/2021” in the US format, but as “10/8/2021” in a country like India. Similarly, “New York City” is sometimes captured as “NY” or “NYC” – the consistency of data content needs to be maintained, or grouping and summarising the data becomes cumbersome. One can avoid this mess with a customer data platform, which automatically updates (and adds) information to improve accuracy and detects duplicates when integrating data.
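The two fixes described above – parsing dates according to their known source convention and mapping city aliases to one canonical name – can be sketched as follows. The alias table is a small illustrative stand-in for the much larger reference data a real customer data platform would maintain.

```python
from datetime import datetime

# Hypothetical alias table for illustration; a real platform would
# maintain a far larger reference dataset.
CITY_ALIASES = {"NY": "New York City", "NYC": "New York City"}

def normalise_date(value: str, source_format: str) -> str:
    """Parse a date using its known source format and emit ISO 8601."""
    return datetime.strptime(value, source_format).strftime("%Y-%m-%d")

def normalise_city(value: str) -> str:
    """Map known abbreviations to a single canonical city name."""
    return CITY_ALIASES.get(value.strip(), value.strip())

# US month/day vs. Indian day/month: same calendar day, one canonical form.
print(normalise_date("8/10/2021", "%m/%d/%Y"))   # 2021-08-10
print(normalise_date("10/8/2021", "%d/%m/%Y"))   # 2021-08-10
print(normalise_city("NYC"))                      # New York City
```

The key point is that the source format must be known per source; once every feed declares its own convention, a single canonical form falls out mechanically.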
Data Remains in Silos
A disconnect between departments at an organisation makes valuable information inaccessible and invisible to the other departments and software systems that could benefit from it. In a nutshell, data in silos is a sure path to lost opportunities. The right step forward is to reconnect the departments with a customer data platform that breaks down data silos and makes data available to everyone in the company.
Wrong Schema Approach
Data unification must be schema last, but organisations often miss this simple rule. Data is collected from multiple sources, and the number of attributes across them is vast, so any attempt to establish a global schema up front is futile. The only viable method is to build the schema “bottom-up” from the local data sources – in other words, the global schema is created last.
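A minimal sketch of this “bottom-up” idea: rather than forcing records into a pre-designed global schema, collect the attributes that actually appear in each local source and derive the global schema from that union. The source names and fields below are illustrative.

```python
# Illustrative local sources with different, independently evolved fields.
crm_records = [{"name": "Acme Ltd", "email": "sales@acme.example"}]
erp_records = [{"name": "Acme Ltd", "salary": 50000}]

def derive_global_schema(*sources):
    """Build the global schema last: union the attribute names
    observed across all local sources, rather than designing it up front."""
    schema = set()
    for records in sources:
        for record in records:
            schema.update(record.keys())
    return sorted(schema)

print(derive_global_schema(crm_records, erp_records))
# ['email', 'name', 'salary']
```

Because the schema is derived from the data, adding a new source with new attributes simply widens the union – no pre-agreed global design has to be renegotiated.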
Lack of Collaboration
The computer scientists responsible for building data structures and pipelines are often left to work out the nuances of the data as well. Consider, for example, that records from “Tata SIA Airlines Limited” and “Vistara” refer to the same organisation – something a data scientist without airline-industry knowledge may not realise. Collaboration between domain experts and computer scientists resolves such ambiguities.
Traditional tools and systems are governed by hand-written rules, and as data grows, more and more rules creep in. To deal with problems at this scale, it is better to provide training data and train machine learning models instead.
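As a stand-in for such a trained matching model, the sketch below scores name similarity with Python’s standard-library `difflib` and flags likely matches above a threshold for expert review. The threshold and examples are assumptions for illustration; a production system would learn its matching function from labelled pairs supplied by domain experts.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Score how alike two names are, ignoring case (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_match(a: str, b: str, threshold: float = 0.6) -> bool:
    """Flag pairs above the (assumed) threshold as candidate duplicates
    for a domain expert to confirm or reject."""
    return similarity(a, b) >= threshold

# Near-identical spellings score high; unrelated names score low.
print(candidate_match("John Smith and Co", "Jon Smith & Co."))
print(candidate_match("Alpha", "Zzzz"))
```

Note that similarity alone cannot link “Tata SIA Airlines Limited” to “Vistara” – that connection is domain knowledge, which is exactly why expert-labelled training data matters.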
American computer engineer and 2014 A.M. Turing Award winner Michael Stonebraker describes the Seven Tenets of Scalable Data Unification:
- Ingesting data: This needs to be from different operational data systems of an organisation.
- Performing data cleaning: for example, -99 is often a code for “null,” and some data sources might hold obsolete customer addresses.
- Performing transformations: for example, converting Dollars to Rupees, or an airport code to a city_name.
- Performing schema integration: For example, “salary” in one system is “wages” in another.
- Performing deduplication: I am “John Wick” in one data source and “M. R. Wick” in another.
- Performing classification or other complex analytics: Suppose one wishes to classify a firm’s ‘spend’ transactions to discover where it is spending money. It requires data unification for ‘spend’ data, followed by a complex analysis of the result thus obtained.
- Exporting unified data to the other downstream systems.
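A toy walk-through of several of the tenets above, on two illustrative sources. The field names, exchange rate, and merge policy (the last non-null value wins) are assumptions for the example, not a prescribed implementation.

```python
USD_TO_INR = 83.0  # assumed exchange rate, for illustration only

# Two ingested sources describing the same person (tenet 1: ingestion).
source_a = [{"name": "John Wick", "salary": 1000, "age": -99}]
source_b = [{"name": "John Wick", "wages": 90000, "age": 35}]

def clean(record):
    """Tenet 2 (cleaning): treat the sentinel -99 as null."""
    return {k: (None if v == -99 else v) for k, v in record.items()}

def transform(record):
    """Tenet 3 (transformation): convert Dollars to Rupees."""
    if record.get("salary") is not None:
        record["salary"] = record["salary"] * USD_TO_INR
    return record

def integrate(record):
    """Tenet 4 (schema integration): 'wages' in one system is 'salary' in another."""
    if "wages" in record:
        record["salary"] = record.pop("wages")
    return record

def deduplicate(records):
    """Tenet 5 (deduplication): keep one record per name,
    merging non-null fields (last non-null value wins)."""
    merged = {}
    for r in records:
        merged.setdefault(r["name"], {}).update(
            {k: v for k, v in r.items() if v is not None}
        )
    return list(merged.values())

unified = deduplicate(
    [integrate(transform(clean(r))) for r in source_a + source_b]
)
print(unified)  # one merged "John Wick" record, ready for export downstream
```

Classification (tenet 6) and export (tenet 7) would then run over `unified` – the point of the pipeline is that those downstream steps see one consistent record per entity rather than two conflicting ones.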
In a highly competitive global scenario, understanding the customer base is only half the battle; acting on that understanding at scale should be the topmost priority. Unless the large amount of data flowing through an organisation’s systems is unified, predicting the future course of the business will remain an uphill task.