Who Are Machine Learning Engineers?

The Harvard Business Review article ‘Data Scientist: The Sexiest Job of the 21st Century’ created a ripple across the industry. Naturally, everyone began upskilling for the hot new job role, and organisations went on to hire data scientists to keep up in the race. However, once data scientists came on board, people expected them to have a magic solution to every problem. They were expected to be business analysts, software engineers, mathematicians and statisticians packaged in one human being: a unicorn breed skilled in business analysis, SQL, DevOps, programming and more.

Surely, people can be jacks of all trades. However, that trait was good enough only for proofs of concept or pilot projects, not for productionising a data-driven system. To elaborate, data science involves a lot of statistical analysis, mathematical modelling and intuition. Hence, people with a scientific or quantitative background dominated these roles, and rightly so, since they had extensive experience in analysis and modelling along with some experience in programming. However, when it came to robust, real-time systems, they lacked the necessary experience and expertise. This was especially true as the scale and complexity of data increased (big data). Hence, to augment data scientists with the necessary skills, the role of the ‘Data Engineer’ emerged.

The Emergence Of Data Engineering

Data Engineers typically create and maintain the data pipelines that ingest and process data for consumption by data scientists. This role emerged from that of traditional ETL developers (sometimes database developers). However, with changing paradigms, the tools and technologies grew by leaps and bounds. Data started pouring in at high volume, variety and velocity, leading to the emergence of the Lambda Architecture. ETL is fast evolving into ELT with technologies like Hadoop and Spark.

Thus, the data science team now consisted of analysts, data scientists and data engineers. The data engineers relieved the data scientists of the data collection, processing and cleansing parts of the data science life cycle. This enabled the latter to focus on business understanding, model development and so on. However, deploying a model, that is, converting it into a data product in the real world, remained a challenge for data science teams. Here, a breed of professionals called ML engineers emerged.

The Need For ML Engineers

The democratisation of AI with tools like Azure Machine Learning greatly simplified the data science life cycle. Published examples of IoT and ML working together show first-cut prototypes of an ML system in action. Any data scientist or data engineer can build such systems. However, these systems are static, i.e. such articles do not cover model retraining.

Hence, a natural question would be “Why retrain models?”

The answer is the concept of ‘drift.’ To understand the concept of drift, we need to see why ML systems are fundamentally different from traditional software systems.

In a traditional software system, we have an input and hand-written logic that computes an output. However, in ML systems, we have the inputs and the outputs, and the system figures out the pattern/relation between them. For instance, let’s say the system is the equation of a straight line.

y=mx+c

In traditional systems, we have m, x and c and use them to compute y. However, in ML systems we have y and x, and we figure out m and c in order to predict the values of y in future. This forms the basis of inductive reasoning.
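The contrast can be sketched in a few lines of Python. This is a minimal illustration, assuming ordinary least squares as the way to figure out m and c; the function names and the sample data are hypothetical and not part of the original article.

```python
def compute_y(m, x, c):
    """Traditional system: m, x and c are known, so y is simply computed."""
    return m * x + c


def fit_line(xs, ys):
    """ML-style system: x and y are observed, m and c are estimated.

    Uses the closed-form ordinary least-squares solution for a straight line.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Hypothetical observations generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

m, c = fit_line(xs, ys)          # learn the parameters from data
future_y = compute_y(m, 5.0, c)  # use them to predict an unseen point
```

Once m and c have been learned, prediction is the same computation as in the traditional system; the difference lies in where the parameters come from.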

Intuitively, ML systems depend on the underlying distribution of data. Naturally, a change in the distribution of the input data can throw the system off track, since the relation between the input and output variables changes. This is called the concept of drift in Machine Learning.
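One simple way to notice drift is to compare the model's error on recent data with its error at training time. The sketch below assumes that true outputs for recent inputs eventually become available, and the threshold of three times the baseline error is an arbitrary illustrative choice; all names and numbers are hypothetical.

```python
def mean_abs_error(m, c, xs, ys):
    """Average absolute gap between the line's predictions and the truth."""
    return sum(abs((m * x + c) - y) for x, y in zip(xs, ys)) / len(xs)


def has_drifted(m, c, xs, ys, baseline_error, tolerance=3.0):
    """Flag drift when recent error exceeds tolerance times the baseline."""
    return mean_abs_error(m, c, xs, ys) > tolerance * baseline_error


# Hypothetical model learned earlier: y = 2x + 1.
m, c = 2.0, 1.0

# Error measured on the data the model was trained on.
train_xs = [0.0, 1.0, 2.0, 3.0]
train_ys = [1.1, 2.9, 5.1, 6.9]
baseline = mean_abs_error(m, c, train_xs, train_ys)

# Recent data that still follows the old relation.
stable = has_drifted(m, c, [4.0, 5.0], [9.1, 10.9], baseline)

# Recent data where the relation has shifted to roughly y = 3x + 1.
drifted = has_drifted(m, c, [4.0, 5.0, 6.0], [13.0, 16.0, 19.0], baseline)
```

When such a check fires, the usual response is to retrain the model on fresher data, which is why static prototypes are not enough in production.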

The Emergence Of ML Engineers

This problem of drift is one of the areas that ML engineers deal with, by applying DevOps practices (which can be called DataOps) to ML systems. However, DataOps is fundamentally different from DevOps.

In traditional software systems, DevOps takes care of code versioning, maintenance and deployment in production systems. As far as versioning and maintenance are concerned, all that is needed is to maintain the code and monitor system health and security. However, in ML systems, there is the additional burden of data versioning and model versioning to track the training history of the models. Moreover, from a security standpoint, a smart user can fool the ML model by figuring out the pattern in which the system responds.
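To make data versioning and model versioning concrete, here is a minimal sketch that fingerprints the training data and the learned parameters for every training run. The in-memory `registry` list and the function names are hypothetical; a real team would typically rely on a dedicated tracking tool rather than this hand-rolled record.

```python
import hashlib
import json


def fingerprint(obj):
    """Stable SHA-256 hash of any JSON-serialisable object."""
    blob = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def record_training_run(registry, data, params):
    """Append one entry linking a model version to the exact data behind it."""
    entry = {
        "model_version": len(registry) + 1,
        "data_hash": fingerprint(data),
        "params_hash": fingerprint(params),
        "params": params,
    }
    registry.append(entry)
    return entry


registry = []  # hypothetical stand-in for a persistent model registry

# First training run on the original data.
run1 = record_training_run(registry, [[0, 1], [1, 3]], {"m": 2.0, "c": 1.0})

# Retraining after the data changed produces a new, traceable version.
run2 = record_training_run(registry, [[0, 1], [1, 4]], {"m": 3.0, "c": 1.0})
```

Because each entry stores a hash of the data alongside the parameters, it is possible to tell later which dataset produced which model, something that code versioning alone does not capture.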

Toolset & Skillset

Now, since the skillset is different, it is only natural that the toolset will vary. As far as deployment is concerned, models can be exposed as an API using web frameworks like Flask in Python. Furthermore, there are frameworks like MLflow from Databricks which can take care of model governance and deployment together.
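As a rough illustration of the deployment side, the sketch below exposes a model through a Flask endpoint. The `/predict` route, the `MODEL` dictionary and its parameter values are hypothetical; a real service would load a trained model from storage and validate its input more carefully.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical parameters of an already-trained straight-line model.
MODEL = {"m": 2.0, "c": 1.0}


@app.route("/predict", methods=["POST"])
def predict():
    """Read x from the JSON request body and return the predicted y."""
    payload = request.get_json(force=True)
    x = float(payload["x"])
    y = MODEL["m"] * x + MODEL["c"]
    return jsonify({"y": y})


if __name__ == "__main__":
    # Development server only; not meant for production traffic.
    app.run(port=5000)
```

With the server running, sending a POST request to `/predict` with a JSON body such as `{"x": 3}` returns the prediction as JSON.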

However, ML engineering is more about mindset than skillset or toolset (though of course they are essential). It is a mindset to take on the uncertainty of the real world. It is not about maintaining a large traditional system, but about maintaining data infrastructure and model infrastructure together. Hence, this role is a combination of Data Engineer, Data Scientist and Software Engineer.

This article is a part of the AIM Writers Programme. If you wish to write for us, email us at info@analyticsindiamag.com

Prasad Kulkarni
Prasad Kulkarni is a part of the AIM Writers Programme. He is a Senior Software Engineer in Data Analytics. His interests include Big Data, Predictive Analytics and ML Engineering. He has significant experience in Azure Data Stack. Moreover, he is a passionate writer.
