The Great Divide Between Data Science And Engineering

The past few months have brought me closer to architecting and implementing systems that deal with large volumes of data. One example is a financial product company trying to make sense of its users through their activity, so that it can offer them the appropriate financial instrument. Another is a product platform that curates and classifies media content (print and online) to give its customers information on their reach and brand perception.

It’s evident that the shape and size of data are changing. The size is definitely growing. The data fed into the system is no longer homogeneous and hence comes in different shapes: a tweet, a chat, a blog, an article, a comment and so on. Engineering systems are expected to process all this data, in various shapes and in large volumes. Building a platform that deals with such data and produces the expected outcomes raises interesting engineering challenges, and those challenges call for a data scientist role in the engineering team.

Often the data scientist role is not as tightly integrated with the engineering team as the other roles are. There is always a “divide”. Typically, companies have a common data science team that deals with data science needs across the organization. Such a cross-cutting team setup has tended to be less fruitful for any role. The issue is more pronounced for the data science role because of the nature of the solution needed.

The setup results in an unproductive outcome. Typically, data scientists work with data on their own file system, using languages like R or Python (and related tools), and focus primarily on developing a production-quality model.

Data scientists’ models are usually learning models, so they depend on the data used to build them. When the teams are separate, the data provided to the data scientist is only a sample, and its volume is low. Hence the data scientist faces the problem of not having enough data.

On the engineering side the problem is the opposite: there is too much data, and the model needs to work fast on that volume. When the engineering team takes the model and integrates it with the actual production data, the results are quite different. The final, undesirable outcome of the whole effort is that the engineering team refuses to deploy and support a model that it believes does not work in production.

The issue is not on either side but in the gap between them.

A data scientist’s model depends on the data used to build it. This is vital for the engineering team to understand: it means that the data supplied to the data scientist must be representative of the production data.
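
One rough way to check representativeness is to compare the mix of document shapes in the sample handed to the data scientist with the mix seen in production. The sketch below is only an illustration: the `shape` field, the function names and the example counts are assumptions, not part of any system described here, and a real check would likely look at more than one attribute.

```python
from collections import Counter


def shape_mix(documents):
    """Return the share of each document shape (tweet, blog, ...) in a collection."""
    if not documents:
        raise ValueError("cannot compute a mix for an empty collection")
    counts = Counter(doc["shape"] for doc in documents)
    total = sum(counts.values())
    return {shape: n / total for shape, n in counts.items()}


def mix_drift(sample, production):
    """Total variation distance between two shape mixes.

    0.0 means the sample has the same mix as production; 1.0 means the
    two collections share no shapes at all.
    """
    s, p = shape_mix(sample), shape_mix(production)
    shapes = set(s) | set(p)
    return 0.5 * sum(abs(s.get(k, 0.0) - p.get(k, 0.0)) for k in shapes)


# Hypothetical data: production is half tweets, the rest blogs and comments.
production = (
    [{"shape": "tweet"}] * 50 + [{"shape": "blog"}] * 30 + [{"shape": "comment"}] * 20
)
balanced_sample = (
    [{"shape": "tweet"}] * 5 + [{"shape": "blog"}] * 3 + [{"shape": "comment"}] * 2
)
skewed_sample = [{"shape": "tweet"}] * 10

print(mix_drift(balanced_sample, production))
print(mix_drift(skewed_sample, production))
```

A drift close to zero suggests the sample at least mirrors production in shape mix; a large value is a hint that a model built on the sample may behave differently once it meets live data.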

Another key understanding is that these are learning models, which makes feedback very important. No other kind of software we commonly build is a learner! The output of a mathematical equation, once calculated, does not need to be double-checked. Having a learner in production means we need visibility into the operations it is performing, and it is the data scientist who needs that visibility. All these needs imply that the data scientist should be part of the core engineering team.

Further, like system administrators, data scientists are users of the system, not just members of the development team. For system administrators, the platform is built with monitoring dashboards, alerting mechanisms, logging and so on. We now need to think about what data scientists need.

When we run the model, they ask, “Can you provide raw scores and raw documents?”, so that they can verify the working of the model or tune it further. Hence visibility into the raw production data and the calculation results is necessary. These can be exposed through dashboards and separate calculation logs.

For example, an auto-tagged document listing produced by the model is shown on the end-user dashboard. The same dashboard can be altered to show the intermediate calculation results against each document when the data scientist persona views it. Another approach is to generate calculation logs that carry the document identifiers. Such a log, when fed to a script, can pull up the documents from the live system store for the scientists to view. The same functionality can also be built into the system, eliminating the need for scripting.
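
The calculation-log approach could look roughly like the sketch below. It is a minimal illustration under stated assumptions: the record fields (`doc_id`, `tag`, `raw_score`, `threshold`), the function names, and the use of a plain dictionary as a stand-in for the live document store are all hypothetical.

```python
import io
import json


def log_calculation(log_file, doc_id, tag, raw_score, threshold):
    """Append one JSON line per model decision, keyed by document identifier."""
    record = {
        "doc_id": doc_id,
        "tag": tag,
        "raw_score": raw_score,
        "threshold": threshold,
    }
    log_file.write(json.dumps(record) + "\n")


def review_calculations(log_file, document_store):
    """Yield each logged calculation with the raw document it refers to.

    Documents missing from the store are returned as None so the reviewer
    can see that the lookup failed.
    """
    for line in log_file:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield record, document_store.get(record["doc_id"])


# Stand-in for the live system store (assumption: lookup by identifier).
store = {
    "doc-1": "Quarterly results beat estimates",
    "doc-2": "New phone launch reviewed",
}

# An in-memory file keeps the example self-contained; a real log would be on disk.
log = io.StringIO()
log_calculation(log, "doc-1", "finance", 0.91, 0.5)
log_calculation(log, "doc-2", "finance", 0.12, 0.5)

log.seek(0)
for record, text in review_calculations(log, store):
    print(record["doc_id"], record["tag"], record["raw_score"], "|", text)
```

Keeping one self-describing record per line means the same log can feed an ad hoc script today and a built-in review screen later without changing its format.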

The data warehouse or reporting system should capture additional data related to the model processing, such as tags, probabilities, coefficients and thresholds. This allows the data scientist to see patterns across large volumes of documents.
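
As a small sketch of what capturing this data might enable, the example below stores per-document model output in an in-memory SQLite table and counts, per tag, the decisions that fall close to the threshold. The table name, columns, sample rows and the `near_threshold` helper are assumptions made for illustration; coefficients are left out to keep it short, and a real warehouse schema would differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE model_output (
        doc_id        TEXT,
        tag           TEXT,
        probability   REAL,
        threshold     REAL,
        model_version TEXT
    )
    """
)

# Hypothetical rows, one per (document, tag) decision made by the model.
rows = [
    ("doc-1", "finance", 0.91, 0.5, "v1"),
    ("doc-2", "finance", 0.52, 0.5, "v1"),
    ("doc-3", "sports", 0.47, 0.5, "v1"),
    ("doc-4", "sports", 0.08, 0.5, "v1"),
]
conn.executemany("INSERT INTO model_output VALUES (?, ?, ?, ?, ?)", rows)


def near_threshold(connection, margin):
    """Count, per tag, the decisions whose probability is within `margin` of the threshold."""
    cursor = connection.execute(
        """
        SELECT tag, COUNT(*)
        FROM model_output
        WHERE ABS(probability - threshold) <= ?
        GROUP BY tag
        ORDER BY tag
        """,
        (margin,),
    )
    return cursor.fetchall()


print(near_threshold(conn, 0.05))
```

A tag with many borderline decisions is a natural candidate for the data scientist to inspect first, since small changes to the threshold would flip many of its results.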

Giving data scientists this visibility into operations and calculations, along with access to the actual production data, will result in a model that produces the expected outcomes. The data scientists become aware not only of the shape and volume of production data but also of the constraints that exist in the production environment. Being part of the team, they can also help the engineering team build the views and visibility they need. With these tools, they can experiment and tune the models to the desired threshold. As mentioned earlier, we have a learner running in the production system, not an expert, and we need all the tools to watch its operation!

For engineering teams, this means we have a new persona to cater to on our live systems. Both functional and non-functional requirements need to include the needs of this new persona: the data scientist.

The architecture and implementation of such a system are very different from those of a system that treats data scientists and their models as just an external interface to integrate with.

What do you think?

Srikanth Seshadri
I am an engineer from days when memory had to be de-allocated explicitly! I'm greatly passionate about distributed computing, and my career is a mix of product development and custom application development for clients. I currently work at Sahaj Software Solutions as a hands-on developer taking ideas to production.
