For this week’s ML practitioners series, Analytics India Magazine (AIM) got in touch with Nikhil Dhawan, Director of Engineering, MLOps at Dentsu International. He previously led data engineering teams at KPMG. He has a Bachelor of Computer Science from Guru Nanak Dev University, Amritsar, and a Master of Information Technology from IIIT-Bangalore.
In this interview, Nikhil shared his experiences from working with data for nearly a decade.
AIM: Let’s settle this forever. What is the difference between a Data Scientist, Data Engineer and an ML Engineer?
Nikhil: An entire article could be written on this topic and it still might not cover every case. I believe it varies from company to company. Data engineers are responsible for setting up the environment for moving and transforming data, which can include storing that data.
Data scientists are people who know the scientific and statistical methods needed to get insights from that data, including predicting future behaviour from current trends. A few years back, software engineers were learning about operations and handling infrastructure for deployment. On the other side, Ops teams were learning development through infrastructure as code. These two streams led to the DevOps role.
Similarly, MLOps sits between the data engineer and the data scientist. Data engineers are learning about the infrastructure needed to support model lifecycles and to build continuous training pipelines. Data scientists want to learn how to deploy their models and use them to score incoming data.
ML engineers create a production-grade data pipeline on infrastructure that converts raw data into the input required by a data science model, hosts and executes the model, and delivers the output as a scored dataset to downstream systems. An ML engineer can come from either a data engineering or a data science background.
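To make the flow described above concrete, here is a minimal Python sketch of the three stages: raw data is converted to model input, a model is executed, and a scored dataset is handed downstream. It is an illustration only, not Dentsu's implementation. The record shapes (`RawEvent`, `ScoredRow`), the feature choices and the `toy_model` weights are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


# Hypothetical record shapes, chosen only for illustration.
@dataclass
class RawEvent:
    customer_id: str
    visits: int
    spend: float


@dataclass
class ScoredRow:
    customer_id: str
    score: float


def to_features(event: RawEvent) -> List[float]:
    """Convert one raw record into the numeric input a model expects."""
    avg_spend = event.spend / event.visits if event.visits else 0.0
    return [float(event.visits), avg_spend]


def toy_model(features: List[float]) -> float:
    """Stand-in for a trained model: a fixed linear score clipped to [0, 1]."""
    visits, avg_spend = features
    raw = 0.05 * visits + 0.01 * avg_spend
    return max(0.0, min(1.0, raw))


def run_pipeline(
    events: Iterable[RawEvent],
    model: Callable[[List[float]], float],
) -> List[ScoredRow]:
    """Raw data in, scored dataset out for downstream systems."""
    return [ScoredRow(e.customer_id, model(to_features(e))) for e in events]


scored = run_pipeline(
    [RawEvent("c1", 4, 200.0), RawEvent("c2", 0, 0.0)],
    toy_model,
)
```

Passing the model in as a plain callable keeps the pipeline code independent of how the model was trained, which is roughly the separation between data science and ML engineering that the answer describes.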
AIM: What fascinates you about data?
Nikhil: For me, it’s less about the data and more about the business use case. Data is the new oil, and it is just as dirty: it needs to be cleaned first, depending on the fuel you want to extract from it. The fuel is decided by the business outcomes you want to get from it.
We are past the initial big data era, when the focus was on building huge infrastructure and collecting as much data as you could. Many businesses, especially in Australia, have realised that the overhead of maintaining all the infrastructure, the pipelines to Hadoop, and the highly qualified people needed for its upkeep far outweighs the benefits unless you have the right use cases planned from the beginning. Fortunately, cloud vendors stepped in at the right time. They brought huge infrastructure with time-based billing, which meant a small upfront investment and easier scaling decisions at later stages. This, combined with high availability, security, and compliance with regulations and local laws, led most clients, including financial institutions, to move completely to the cloud.
AIM: What books and other resources have you used in your data engineering journey?
Nikhil: Most of my knowledge has come from being hands-on and working with the tools and services, but certifications have given me a way to prove it to myself and the world. I had more than a year of experience with Hadoop before I was certified as a Hadoop developer by Cloudera, and I learnt the inner workings of the Hadoop ecosystem while studying for it. I regard certification as the equivalent of theoretical knowledge of a subject, and hands-on experience as the practical application of it in real life; both are necessary for a comprehensive understanding. So far, I have certifications from Azure and GCP, and I’m on track to get an AWS certification.
AIM: What does your data engineering approach look like?
Nikhil: My approach to data engineering is to automate tasks as much as possible. The more we can make data engineering fit the user’s existing ways of working, the easier it is for clients and businesses to keep using it. Data engineering solutions should work for business users rather than the other way around. If the source data is generated and shared by email, add the trigger to the emails rather than forcing users to upload it to a portal. If the data is generated regularly, the solution should handle duplication, archival, retention and so on, rather than relying on users to follow some new process to make sure their data updates the correct dashboard.
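As a rough illustration of what "the solution should handle duplication, archival and retention" can mean in practice, the Python sketch below drops files whose content has already been seen, moves new files to an archive folder, and deletes archived files past a retention window. The folder layout, the function names and the choice of a SHA-256 content hash are assumptions made for this example, not details from the interview.

```python
import hashlib
import shutil
import time
from pathlib import Path
from typing import List, Optional, Set


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, used to spot re-sent duplicates."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def ingest(inbox: Path, archive: Path, seen: Set[str]) -> List[Path]:
    """Move new files from inbox to archive and drop exact duplicates.

    Returns the archived paths that downstream loading should process.
    """
    archive.mkdir(parents=True, exist_ok=True)
    accepted = []
    for path in sorted(inbox.iterdir()):
        if not path.is_file():
            continue
        digest = file_digest(path)
        if digest in seen:
            path.unlink()  # identical content was already processed
            continue
        seen.add(digest)
        # Prefix with part of the digest so same-named files do not collide.
        target = archive / f"{digest[:12]}_{path.name}"
        shutil.move(str(path), str(target))
        accepted.append(target)
    return accepted


def apply_retention(
    archive: Path, max_age_days: float, now: Optional[float] = None
) -> int:
    """Delete archived files older than max_age_days; return the count removed."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = 0
    for path in archive.iterdir():
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed
```

In a real pipeline the `seen` set would live in durable storage rather than in memory, and the trigger would be whatever channel the users already rely on, such as the email attachment mentioned above.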
AIM: A few words for those who want to get into data engineering roles?
Nikhil: Traditional data engineering is moving to the cloud very quickly. Most transformation projects are also moving on-premises data to cloud services. I would recommend starting with any of the top three cloud platforms. All the cloud vendors already have managed big data products as part of their offerings, e.g. Azure HDInsight and Amazon EMR.
Other than the cloud, a few concepts beginners should know about are:
- Python, the most versatile language in the DE/DS world.
- Linux/bash scripting will always come in handy.
- Git for version control, a must-have.
- CI/CD tools, e.g. GitLab CI/CD or Jenkins.
- Docker/Kubernetes for containerization.
- IaC (infrastructure as code), e.g. Terraform to automate cloud builds.
AIM: What does your machine learning toolstack look like?
Nikhil: Python is the de facto go-to language because of its strengths in data wrangling and data science, along with frameworks like Flask and FastAPI that let you quickly build a PoC with an API to serve these capabilities. In the cloud, AWS is considered the most mature solution and has vast offerings, including Amazon SageMaker, which we are exploring next. Many of our clients are already tied into the Azure ecosystem through Office 365 or Dynamics CRM. Microsoft leverages this, and services like Blob Storage, Data Factory, Azure Functions, Databricks, AKS and Synapse Analytics are what we use to address the most common business use cases in our domain. Only one of our solutions is on GCP, where we use the serverless App Engine service to host the web app.
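For readers who have not built such a PoC, the sketch below shows the general shape of a scoring API. To keep it runnable without installing anything, it deliberately uses Python's standard-library `http.server` instead of Flask or FastAPI; in either framework, a route would wrap the same `handle_score` function. The `/score` path, the feature names and the `predict` weights are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def predict(features: dict) -> float:
    """Placeholder model (hypothetical): a weighted sum clipped to [0, 1]."""
    raw = 0.1 * float(features.get("visits", 0)) + 0.25 * float(
        features.get("purchases", 0)
    )
    return max(0.0, min(1.0, raw))


def handle_score(body: bytes) -> Tuple[int, dict]:
    """Parse a JSON request body and return (HTTP status, response payload)."""
    try:
        features = json.loads(body)
        if not isinstance(features, dict):
            raise ValueError("expected a JSON object")
        return 200, {"score": predict(features)}
    except (ValueError, TypeError) as exc:
        # Covers malformed JSON and non-numeric feature values.
        return 400, {"error": str(exc)}


class ScoreHandler(BaseHTTPRequestHandler):
    """Minimal HTTP wrapper: POST /score with a JSON object of features."""

    def do_POST(self) -> None:
        if self.path != "/score":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_score(self.rfile.read(length))
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8000) -> None:
    """Start the PoC server on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), ScoreHandler).serve_forever()
```

Calling `serve()` starts the server locally. Keeping the request logic in `handle_score` rather than in the handler class means it can be tested without opening a socket and reused unchanged behind a Flask or FastAPI route.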
AIM: As a Director of Engineering, MLOps, what does a typical day look like?
Nikhil: My new role is Director of Engineering & MLOps at Dentsu International, a media, marketing and customer experience management company. This role covers ownership of the data as it moves from clients’ source systems into our data science capability, and of the output we provide to clients’ decision systems. The raw data that flows in is used to build and train the models that predict future behaviour. The models or the scored outcomes are shared with the clients to help them make business decisions. A typical day consists of gathering business requirements from our partners and clients across Dentsu’s various agencies. I usually have meetings with the client’s IT team to understand the source systems and data sources. I spend time understanding the data models and setting up the infrastructure required to build automated pipelines that power the data science engine.
AIM: MLOps is on the rise. As an industry insider, what is the ground reality? Is the hype real?
Nikhil: I believe the hype is real. There is increasing demand for people who have experience in model lifecycle management, model deployment and so on. This is slightly different from the need for data scientists or data engineers; both are still required for a full analytics capability in a team.
At Dentsu, we prefer anyone with MLOps or AI product exposure; data science knowledge is not a must. The best candidates are people from software engineering backgrounds who have completed a master’s or other relevant courses in data science. But they are either too hard to find or costly to hire and retain. Since we are using cloud-native pipelines and services (e.g. Azure MLOps), we tend to ask for general cloud experience as a minimum and later train people on specific services.
AIM: Why should companies invest more in MLOps?
Nikhil: Large tech firms have used data science and its various techniques to learn about consumer behaviour for a long time. They have optimised their recommendation engines, bundled products together, improved targeting of the right customers, increased basket size and so on. They had the budget to dedicate resources to research and to partnerships with academic institutes that focused heavily on statistical knowledge and theory. They also had a significant engineering function to build the infrastructure and tooling required to build on research outcomes.
Smaller or business-focused firms don’t have this luxury. Any data science project has a long task list: data acquisition, data ingestion, choosing the initial algorithms, testing multiple variants (including tuning the model and hyperparameters), preparing the datasets for each experiment, validating and comparing the outputs, and so on. Finally, once we have the best possible trained model, the engineering task is to deploy it to score or predict on live data to improve business functions. These are very labour-intensive tasks, prone to manual errors. MLOps helps automate most of these functions and frees up developer resources. Using best practices in MLOps, a company can save money and keep its costlier data science resources focused on prediction and other such tasks.
AIM: Is MLOps the beginning of the end for the data-scientist-as-a-career hype?
Nikhil: It is very difficult to answer this question with a yes or no, but I think the roles and responsibilities are changing. There are not enough data scientists. Unfortunately, most data scientists spend too much time on data preparation work. There is also a big disconnect between what a junior data scientist wants to do and what a company expects them to do. Many data engineering tasks, like exposing the models as an API or feeding the outputs to existing IT systems, have started to creep into their task lists. Given that many data scientists do not have much experience in general programming, this becomes difficult to manage.
MLOps is now splitting off from the data scientist role and focuses on productionising data science to support business decisions. We will still need data scientists to describe the outcome of each algorithm and the reasoning behind it. That is true even when AutoML is used to run the experiments by trying most of the possible algorithms and variants on a given dataset. We will also need data scientists to tackle an emerging field called “Explainable AI”. Neither businesses nor governing bodies can continue to rely on AI as a black box to decide an outcome that might directly impact a person. They want to know exactly why a decision was made and what data contributed to it. As data science has gained focus, it has raised questions about choosing the correct dataset, something that AutoML can’t do. There are too many examples that show clear bias towards certain races or genders because insufficiently diverse data was used to train the models, and the responsibility for that lies with people rather than machines.
AutoML helps execute experiments at scale, and MLOps helps you build data pipelines around the models. While this might mean we won’t need as many data scientists as before, the science part of data science will still be left to experts in the field.