The Unattractive Part Of Managing A Data Science Life Cycle


Published on January 20, 2020

by Ram Sagar

Internet companies in the early 2000s had a tough time setting up servers and delivering services to customers. Startup founders had to rack their own servers and deal with every hurdle, from skills to costs. In 2006, when Amazon set out to eliminate these server woes by offering AWS, people were sceptical. Today, there are hardly any data-driven companies that do not use services from AWS, Azure or Google Cloud.

Advances in computing, together with growing competition to extract more revenue from a targeted customer base, gave rise to the phenomenon that is data science.

More than a decade later, every organisation aims to have a data science department of its own. However, the efficacy of these data-driven companies depends on how well they manage their data science life cycle.

A typical data science life cycle outlines the major stages that projects execute, often iteratively:

Business understanding
Data acquisition and understanding
Modelling
Deployment
Customer acceptance

Tools and services for extracting and loading data are plentiful. However, the most unattractive yet crucial part of any data science life cycle is documenting progress and milestones in a structured way, so that teams across the organisation can draw insights from them.

How The Top Players Are Doing It

Microsoft’s Team Data Science Process (TDSP) is an agile, iterative data science methodology for delivering predictive analytics solutions and intelligent applications efficiently.

Here are a few strategies that Microsoft follows for a structured execution within a data science lifecycle:

Store all code and documents in a version control system (VCS) like Git, TFS, or Subversion to enable team collaboration.

Closely track tasks and features, because tracking the code for individual features makes it possible to obtain better cost estimates.

For this, Microsoft recommends using an agile project tracking system such as Jira, Rally or Azure DevOps.

To facilitate efficient knowledge sharing across the organisation, create a separate repository for each project on the VCS for versioning, information security, and collaboration.

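Microsoft also provides templates for the folder structure and the required project documents, kept in standard locations. These templates make it easier for team members to understand work done by others and to bring new members onto a team. Document templates in markdown format are easy to view and update.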
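As a rough illustration of a templated project repository, the sketch below creates a standard folder layout with placeholder markdown documents. The folder names, file names and the scaffold_project helper are hypothetical and chosen only for this example; they are not Microsoft's official template.

```python
from pathlib import Path
import tempfile

# Hypothetical layout: folder and file names are illustrative assumptions,
# not an official template from any vendor.
TEMPLATE = {
    "code": [],
    "docs": ["project_charter.md", "data_report.md", "model_report.md"],
    "sample_data": [],
}


def scaffold_project(root: Path, name: str) -> Path:
    """Create a project folder with placeholder markdown documents."""
    project = root / name
    for folder, files in TEMPLATE.items():
        target = project / folder
        target.mkdir(parents=True, exist_ok=True)
        for filename in files:
            doc = target / filename
            if doc.exists():
                continue  # never overwrite notes a team member already wrote
            title = Path(filename).stem.replace("_", " ").title()
            doc.write_text(f"# {title}\n\nTODO: fill in.\n", encoding="utf-8")
    readme = project / "README.md"
    if not readme.exists():
        readme.write_text(f"# {name}\n", encoding="utf-8")
    return project


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        created = scaffold_project(Path(tmp), "churn_model")
        for path in sorted(created.rglob("*")):
            print(path.relative_to(created))
```

Because every project starts from the same layout, a newcomer knows where to look for the charter or the model report without asking. Running the helper twice is safe, since existing documents are left untouched.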

Google, too, has come up with its own set of tools to make the life of a data scientist easier. Along with its many visualisation and query APIs, Google has introduced model cards, which organise the essential facts about machine learning models in a structured way.

For example, a model card for a language translator may provide guidance around jargon, slang and dialects, or measure its tolerance for differences in spelling. According to Google, transparent documentation can take many forms, and the company encourages a flexible approach that allows for variation in model type and evaluation specifics.

Model cards are aimed at experts and non-experts alike. Developers can use them to design applications that emphasise a model’s strengths while avoiding or informing end-users of its weaknesses. For journalists and industry analysts, they might provide insights that make it easier to explain complex technology to a general audience. And they might even help advocacy groups better understand the impact of AI on their communities.
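To make the idea of structured model documentation concrete, here is a minimal sketch of a model card as a plain Python data class that renders to markdown. The ModelCard class and its field names are assumptions made for illustration; they do not reproduce Google's model card schema or any Google library, and the values shown are placeholders rather than real evaluation results.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ModelCard:
    # Field names are illustrative assumptions, not an official schema.
    name: str
    intended_use: str
    limitations: List[str] = field(default_factory=list)
    metrics: Dict[str, str] = field(default_factory=dict)

    def to_markdown(self) -> str:
        """Render the card as a short markdown document."""
        lines = [
            f"# Model card: {self.name}",
            "",
            "## Intended use",
            self.intended_use,
            "",
            "## Limitations",
        ]
        lines += [f"- {item}" for item in self.limitations] or ["- None documented"]
        lines += ["", "## Evaluation"]
        lines += [f"- {k}: {v}" for k, v in self.metrics.items()] or ["- None documented"]
        return "\n".join(lines)


if __name__ == "__main__":
    card = ModelCard(
        name="demo-translator",  # hypothetical model name
        intended_use="Translating short, informal messages.",
        limitations=["Slang and regional dialects may be mistranslated."],
        metrics={"spelling variation tolerance": "not yet measured"},  # placeholder
    )
    print(card.to_markdown())
```

Keeping the card as data rather than free text means the same record can be rendered for developers, checked for missing sections, or versioned alongside the model.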

Google has a range of products that make managing data easier than ever. Services like BigQuery, AI Platform Notebooks and Google Cloud Dataproc are changing how we analyse and use data.

Products like Google Cloud Dataflow and Google Cloud Pub/Sub make it easier for your code to use vast amounts of data to deliver amazing context-rich experiences.

To summarise, organisations that leverage data successfully tend to follow these guidelines:

Leverage and improve best-of-breed components from an existing code base to the maximum extent feasible.
Use an agile-inspired strategy, making one product line better at a time.
Enable services hosting the models to be upgraded independently without breaking their downstream or upstream services.
Enable new technologies to be A/B testable in production.
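The last guideline, A/B testing in production, can be sketched as deterministic traffic splitting: each user is hashed into a stable bucket, so the same user always sees the same variant. The assign_variant function, its parameters and the experiment name below are hypothetical and serve only as an example; real systems typically add logging, exposure tracking and statistical analysis on top.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.1) -> str:
    """Deterministically route a user to 'control' or 'treatment'.

    Hashing the experiment name together with the user id keeps
    assignments stable per user and independent across experiments.
    """
    if not 0.0 <= treatment_share <= 1.0:
        raise ValueError("treatment_share must be between 0 and 1")
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # First 32 bits of the hash, scaled to a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


if __name__ == "__main__":
    # Hypothetical experiment: send roughly 10% of users to a new model.
    for uid in ["user-1", "user-2", "user-3"]:
        print(uid, assign_variant(uid, "new-ranking-model"))
```

Because assignment depends only on the inputs, no lookup table is needed, and the share of traffic sent to the new model can be raised gradually by increasing treatment_share.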

As one can see, adopting data-driven strategies in a product life cycle is not merely a question of learning how to train a model. One should also understand how models fit into existing systems and processes. Resources, both computational and personnel, should be scaled accordingly as models are introduced.
