Today, it is imperative for organisations to adapt to an increasingly data-driven world and build analytic agility. However, that is easier said than done, given the varied sources of information organisations handle and the complexity of data handling, which includes data movement, data discovery, cleansing and preparing trusted data for analytics. The challenge is compounded when you are unsure where your data is coming from and what it means. At the Data Engineering Summit 2022, Kirthi Ganapathy, customer engineering manager at Google Cloud, shared insights, key learnings and best practices around intelligent management of metadata, security and governance in a diverse and largely distributed data environment.
What is data governance?
Data governance, at its most basic level, is the practice of enhancing an organisation’s data to make it discoverable, understood, protected and trusted. Every enterprise should think about the entire data lifecycle, starting with data intake and ingestion and continuing through cataloguing, persistence, retention, storage, management, sharing, archiving, backup, recovery, disposition, and data removal and deletion.
A data governance framework has four main pillars:
- Data discoverability: Data classification, data lineage, metadata and catalogue and data quality
- Data management: Lifecycle and records management, reference data, master data and SRE
- Data protection: Masking, encryption, access management, audit and compliance, residency and recoverability
- Data accountability: Ownership, policies and standards, domain-based governance and ethics
“Data governance encompasses the ways that people, processes and technology can work together to enable auditable compliance with defined and agreed upon policies across different technical solutions and different infrastructure boundaries,” Kirthi said.
“What organisations really want is to be able to derive insights from the data they have, without any restrictions, without necessarily moving it and in a way that makes sense to them,” Kirthi said.
An intelligent data fabric enables organisations to centrally manage, monitor and govern data across data lakes, data warehouses and data marts with consistent controls, providing access to trusted data and powering analytics at scale. It offers unified, metadata-led data management through a single pane of glass, along with centralised security and governance that enable distributed ownership with global control. It also provides built-in intelligence to unify distributed data without data movement, and an open platform with support for open source tools and a robust partner ecosystem.
What is a data mesh?
Data mesh is a type of data architecture that makes data accessible, available, discoverable, secure and interoperable. It combines two principles: domain-driven decentralisation and data as a product.
In domain-driven decentralisation, data is owned by the people who understand it best. For example, the finance team owns the finance data, and the HR team owns the HR and employee data. So no single centralised entity owns the whole organisation’s data.
Under the second principle, data is treated as a product. A team owns its data just as it would own its services and its part of the business. In other words, you treat other teams as internal customers of your data.
Building a data mesh architecture involves:
- Organising data to map to your business: Logically organise data based on how it is used instead of where it is stored.
- Uniformly managing and governing data: Set up standardised policies for access control, data quality, classification and lifecycle management.
- Accessing data from a variety of tools: Access distributed data from Google Cloud-native and open source tools with automatic metadata propagation and a unified experience.
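To make the first two steps concrete, here is a minimal Python sketch, not anything shown in the talk. It groups data products by business domain rather than by storage location and applies one shared policy to all of them. The class names, policy fields, role names and storage URIs are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GovernancePolicy:
    """One standardised policy shared by every domain (hypothetical fields)."""
    classification: str        # e.g. "confidential"
    retention_days: int        # lifecycle rule
    allowed_roles: frozenset   # roles permitted to read the data


@dataclass
class DataProduct:
    name: str
    storage_uri: str  # where the data physically lives; not used for grouping


@dataclass
class Domain:
    name: str
    owner_team: str
    products: list = field(default_factory=list)


def build_catalogue(domains):
    """Index every data product by its business domain, not its storage location."""
    return {
        f"{d.name}/{p.name}": {"owner": d.owner_team, "uri": p.storage_uri}
        for d in domains
        for p in d.products
    }


def can_read(policy, role):
    """Apply the same access rule regardless of which domain owns the data."""
    return role in policy.allowed_roles


# Illustrative values only: the URIs and role names are made up.
policy = GovernancePolicy("confidential", 365, frozenset({"analyst", "steward"}))
finance = Domain("finance", "finance-team",
                 [DataProduct("ledger", "gs://example-bucket-a/ledger")])
hr = Domain("hr", "hr-team",
            [DataProduct("employees", "bq://example-project.hr.employees")])
catalogue = build_catalogue([finance, hr])
```

The catalogue keys follow the business structure (`finance/ledger`, `hr/employees`) even though the two products sit in different storage systems, and the single `policy` object is the only place access rules are defined.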
Google Cloud Way
“We have three data domains here, sales data, CRM data or customer data and product data, each of which can be implemented as a different data lake, with its respective data pipelines, enabling the respective product teams to set up a very fine-grained permission control, including at a sub-lake or zone level on each of these data lakes independently, as defined by the organisation’s best practices,” said Kirthi.
She further stated that with this architecture:
- Your organisation gets the freedom to store data where it wants, choose the best analytics tools and have flexibility in pricing and consumption models to meet financial governance needs.
- Built-in data intelligence draws on best-in-class AI/ML capabilities to automate data management and reduce manual toil.
- Metadata, security policies and data classification are standardised and unified.
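The per-domain permission model described in the quote above, with one lake per domain and grants at either the lake or the zone level, could be sketched roughly as follows. This is an illustrative model only, not the API of any Google Cloud product. The grant table, principal names and zone names are assumptions, as is the rule that a lake-level grant covers every zone in that lake.

```python
# Hypothetical grants: (lake, zone) -> principals.
# A zone of None means the grant applies to the whole lake.
GRANTS = {
    ("sales", None): {"sales-team"},
    ("sales", "curated"): {"bi-analysts"},
    ("crm", None): {"crm-team"},
    ("product", "raw"): {"product-engineers"},
}


def can_access(principal, lake, zone):
    """Allow access if the principal holds a grant on the zone or on its parent lake."""
    lake_level = GRANTS.get((lake, None), set())
    zone_level = GRANTS.get((lake, zone), set())
    return principal in lake_level or principal in zone_level
```

Because each lake has its own entries, the sales, CRM and product teams can change their grants independently. For example, `can_access("bi-analysts", "sales", "curated")` is allowed by the zone-level grant, while the same principal is refused on the `raw` zone of the same lake.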