The key focus for any data-driven business is to ensure that the underlying data can be trusted. Additionally, with an increasing number of ML and AI-driven applications, Ops has become a critical component in stabilising pipelines. “Building and maintaining trust in the modern data stack is a challenging yet interesting problem to solve,” said Varun Saraogi, principal data engineer at TheMathCompany, at The Data Engineering Summit 2022. In the session titled “Building Trust in the data & All Ops”, he emphasised a change in mindset towards a data-centric thought process.
Data trust is the confidence that data is healthy and ready to act on. However, this confidence cannot be taken on faith and has to be quantified.
Varun listed a few criteria to build data trust:
- Data quality: Accuracy, completeness, consistency, etc.
- Data pipelines: Timeliness, alerting, resolution
- Data cataloguing and lineage: Discoverability
- Data privacy and security: PHI, PII and others; encryption, etc.
- Automation and reusability: All Ops
The age of the black box is gone
Varun said businesses are skeptical of data at some level. Technological advancements, the need for faster decisions and the increasing number of stakeholders in the data pipeline complicate matters further. The data can get distorted at various stages:
- From ingestion to consumption
- Across multiple layers of transformation
- Across multiple teams managing the data
Varun outlined strategies to build trust in data:
Get everyone involved in the data lifecycle: “We have to decentralise the data process and ensure that everyone who touches data feels equally involved and responsible. There is a need for collaboration where everyone in the life cycle is involved,” he said.
Build systems and culture around data quality: A unified approach to data requires transparent data management processes and documented, communal data quality standards.
Share data quality rules across the organisation: This includes automated checks embedded in data systems and policies that set clear expectations for how people interact with and maintain data.
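As an illustration of what such shared, automated checks could look like, here is a minimal Python sketch. It is not taken from the session: the rule names, the `run_rules` helper and the 0.95 default threshold are all assumptions made for the example. It scores a batch of records against two of the quality criteria listed earlier (completeness and consistency, the latter approximated here by uniqueness of a key) so that confidence in the data is expressed as a number rather than taken on faith.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Rule = Callable[[List[Record]], float]


@dataclass
class RuleResult:
    rule: str
    passed: bool
    score: float  # value between 0 and 1 returned by the rule


def completeness(field: str) -> Rule:
    """Rule: fraction of records where `field` is present and not None."""
    def check(records: List[Record]) -> float:
        if not records:
            return 1.0
        filled = sum(1 for r in records if r.get(field) is not None)
        return filled / len(records)
    return check


def uniqueness(field: str) -> Rule:
    """Rule: ratio of distinct values of `field` to the number of records."""
    def check(records: List[Record]) -> float:
        if not records:
            return 1.0
        values = [r.get(field) for r in records]
        return len(set(values)) / len(values)
    return check


def run_rules(records: List[Record], rules: Dict[str, Rule],
              threshold: float = 0.95) -> List[RuleResult]:
    """Apply every shared rule and flag those scoring below `threshold`."""
    results = []
    for name, rule in rules.items():
        score = rule(records)
        results.append(RuleResult(name, score >= threshold, score))
    return results


if __name__ == "__main__":
    # Hypothetical batch: one missing email and one duplicated id.
    batch = [
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": None},
        {"id": 2, "email": "c@example.com"},
        {"id": 4, "email": "d@example.com"},
    ]
    shared_rules = {
        "email_complete": completeness("email"),
        "id_unique": uniqueness("id"),
    }
    for result in run_rules(batch, shared_rules):
        print(result)
```

Keeping the rules in one shared dictionary is a deliberate choice in this sketch: every team runs the same definitions, which is one way to read "shared data quality rules across the organisation".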
“DevOps has solved many challenges in software engineering and now in data platform systems as well. DevOps principles widely adopted in the organisation have given us a clear view on the significance of looking deep down. It has given us a better understanding of the implementation and management of such systems,” said Varun.
DataOps is the process of automating the end-to-end data flow, enabling teams to work independently. It reduces error rates and increases quality while offering clear measurement, monitoring and transparency of results. Data observability leads to reliable data pipelines and brings transparency to monitoring, alerting, tracking and triaging incidents.
DataOps helps reduce the turnaround time of projects, increase automated tests, and improve data quality and visibility into data pipelines, all of which contribute to building data trust.
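To make the monitoring and alerting idea concrete, the following Python sketch checks the timeliness of a dataset and produces an alert message when it is stale. It is an assumption-laden illustration rather than anything shown in the talk: the function names, the dataset name and the six-hour freshness limit are all hypothetical, and a real setup would route the message to whatever alerting channel the team already uses.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def check_freshness(last_loaded_at: datetime, max_age: timedelta,
                    now: Optional[datetime] = None) -> dict:
    """Compare the age of the latest load against the allowed maximum."""
    now = now or datetime.now(timezone.utc)
    age = now - last_loaded_at
    return {"age_seconds": age.total_seconds(), "stale": age > max_age}


def build_alert(dataset: str, status: dict) -> Optional[str]:
    """Return an alert message for a stale dataset, or None if it is fresh."""
    if not status["stale"]:
        return None
    hours = status["age_seconds"] / 3600
    return f"[ALERT] {dataset} was last loaded {hours:.1f} hours ago"


if __name__ == "__main__":
    # Hypothetical dataset that must be refreshed at least every 6 hours.
    last_load = datetime.now(timezone.utc) - timedelta(hours=9)
    status = check_freshness(last_load, max_age=timedelta(hours=6))
    message = build_alert("orders_daily", status)
    if message:
        print(message)
```

Passing `now` explicitly keeps the check easy to test, which fits the emphasis on automated tests for pipelines.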
“To build data trust, we need to help business and tech teams understand what data looks like. We also need to ensure that data is discoverable and available for the end customer analytics reach. Apart from DataOps, MLOps also plays a critical role in building the trust in the system,” Varun said. He concluded the session with guiding principles to build data trust:
- Ops-first approach
- Reusability in data pipelines
- Testing the data pipelines
- Data catalogue and data lineage
- Collaborative data quality management
- Data privacy and security
- Alerting and monitoring