“I worked on a hackathon problem for 48 hours straight. I didn’t win it. But I learned about something extraordinary – machine learning.” - Abhishek Choudhary
For this week’s ML practitioner’s series, Analytics India Magazine (AIM) got in touch with Abhishek Choudhary, a lead data engineer at Bayer Pharmaceutical. In this interview, Abhishek shares his rich experience of deploying machine learning models in the real world.
AIM: How did your machine learning journey begin?
Abhishek: I finished my schooling in a very small town in Chhattisgarh named Chirmiri. Then I studied Computer Science Engineering in Raipur, Chhattisgarh. It was a pretty big jump for me. My data journey began around 2012-2013. I was working on a large data product. I was busy writing too many threads (in Java) to somehow process the data, and ended up with an overcomplicated project. During a casual meetup in Bangalore, I learned about Hadoop and that some companies were using it in production. I started learning Hadoop immediately and built a quick prototype to demo to my team. It worked.
I believe my machine learning (ML) journey began purely by chance. I was participating in a hackathon for 48 hours straight and was working on a risk prediction problem. I ended up with hundreds of if-else conditions after 30 hours of coding. I was tired and not sure of my code, so I started looking for alternative algorithms and found out about machine learning. I had no idea about it and had no time to understand it deeply, so I worked bottom-up. I simply used the linear regression algorithm, validated the output, and was shocked by the performance. I didn’t win the hackathon but I learned about something extraordinary – machine learning.
AIM: What were the initial challenges and how did you address them?
Abhishek: Starting out, my first hurdle was the programming language. I was a Java developer, but to learn more about ML, I ended up learning R, MATLAB, and then Python. Also, back then the devices were not powerful enough. Not to forget the poor internet speed. Downloading large datasets to work on ML problems was really hard. I tried to build an Android app based on an ML model, and my phone crashed as soon as I opened the app. I raised a bug in the Android-ML community but couldn’t get much help, as the ML community too was at a nascent stage. There were not enough resources available around ML/Big Data, unlike today. So, I started reading the source code and code comments to understand things better.
With time, the challenges moved from slow internet speeds to model deployment. For instance, I worked on an ML-based bidding pipeline (ad-tech). The goal was to bid on millions of keywords in real time. The bid values were influenced by many external and internal factors. The platform read a massive amount of weekly data along with real-time data, and it had to react to the real-time data and predict bids with low latency.
Here are the challenges in real-world ML deployment:
- Doing machine learning on real-time data, which is hard and complicated.
- Building a solid fallback mechanism to compensate for incorrect predictions that can directly impact revenue.
- Automating the deployment of ML models.
- Finding a lightweight metrics and monitoring system to observe performance and real-time response.
- Setting up an A/B testing framework.
- Building and maintaining complex infrastructure that can support different technologies with more than 99.9% uptime.
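As an illustration of the fallback idea in the list above, here is a minimal Python sketch. It is not the bidding platform described in the interview. The names `BidBounds` and `bid_with_fallback`, the bounds, and the fallback value are all hypothetical. The sketch assumes the model is any callable that returns a number, and that a safe default bid is already known for each keyword.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Tuple


@dataclass(frozen=True)
class BidBounds:
    """Hypothetical sanity limits for a bid (assumed, not from the interview)."""
    minimum: float
    maximum: float


def bid_with_fallback(
    predict: Callable[[Mapping[str, float]], float],
    features: Mapping[str, float],
    fallback_bid: float,
    bounds: BidBounds,
) -> Tuple[float, str]:
    """Return (bid, source), where source is "model" or "fallback"."""
    try:
        bid = float(predict(features))
    except Exception:
        # The model raised or returned something non-numeric: use the safe default.
        return fallback_bid, "fallback"
    # NaN and infinite values fail this range check, so they also fall back.
    if not (bounds.minimum <= bid <= bounds.maximum):
        return fallback_bid, "fallback"
    return bid, "model"


if __name__ == "__main__":
    bounds = BidBounds(minimum=0.01, maximum=5.0)
    # A stand-in "model" that returns an implausibly high bid.
    print(bid_with_fallback(lambda f: 50.0, {"ctr": 0.1}, 0.2, bounds))
```

Returning the source alongside the bid makes it easy to count how often the fallback fires, which is one simple signal a monitoring system could track.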
AIM: Tell us about your role at Bayer Pharmaceutical.
Abhishek: I am a Lead Data Engineer at Bayer Pharmaceutical. I am responsible for building an Analytics & Machine Learning platform for Real-World Data in healthcare. A typical day is usually coffee, programming, and Data/ML infrastructure development. I spend most of my time building applications around Data and Machine Learning. I also interact with the team to understand their requirements and address them ASAP.
“Data Science workflow is quite a repetitive domain, so code reusability and clean code are essential.”
AIM: What does your machine learning toolkit look like?
Abhishek: I am still exploring new technologies. But here are a few off the top of my head:
- For data processing, I use Scala
- Python for Data Pipeline and Machine Learning
- Kubernetes for infrastructure and deployment
- Airflow/Dagster for Pipeline Scheduling
- SciPy and PySpark for ML
- Apache Superset for Analytics & Dashboarding
- Trino for distributed SQL
AIM: What kind of software engineering principles should a data scientist or a data engineer know?
Abhishek: Unit tests are extremely important for deploying a data science pipeline in production. Data Science workflow is quite a repetitive domain, so code reusability and clean code are essential. Don’t complicate the code, and if it’s too big, refactor it into smaller steps. Think in terms of pipelines and tasks, where each task is an isolated, immutable state of the pipeline. Model deployment is more about infrastructure, as it requires solid metrics and frequent validation of model performance. Model deployment should be automated, and the infrastructure should run an automated validation before a model is deployed to production.
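One way to read the advice about pipelines, isolated tasks, and automated validation is sketched below in Python. This is an assumption-laden illustration, not Abhishek's actual setup. `PipelineState`, `drop_negative`, `add_mean`, `run_pipeline`, and `ready_to_deploy` are hypothetical names, and the deployment gate simply compares two scores that are assumed to come from an evaluation step not shown here.

```python
from dataclasses import dataclass, replace
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class PipelineState:
    """Immutable snapshot passed between tasks (hypothetical structure)."""
    rows: Tuple[float, ...]
    metrics: Tuple[Tuple[str, float], ...] = ()


Task = Callable[[PipelineState], PipelineState]


def drop_negative(state: PipelineState) -> PipelineState:
    # Returns a new state; the input state is never modified.
    return replace(state, rows=tuple(r for r in state.rows if r >= 0))


def add_mean(state: PipelineState) -> PipelineState:
    mean = sum(state.rows) / len(state.rows) if state.rows else 0.0
    return replace(state, metrics=state.metrics + (("mean", mean),))


def run_pipeline(state: PipelineState, tasks: Sequence[Task]) -> PipelineState:
    # Each task sees only the state produced by the previous task.
    for task in tasks:
        state = task(state)
    return state


def ready_to_deploy(candidate_score: float, production_score: float,
                    tolerance: float = 0.0) -> bool:
    """Automated gate: allow deployment only if the candidate model is not
    worse than the production model by more than `tolerance`."""
    return candidate_score >= production_score - tolerance


if __name__ == "__main__":
    raw = PipelineState(rows=(1.0, -2.0, 3.0))
    final = run_pipeline(raw, [drop_negative, add_mean])
    print(raw.rows, final.rows, dict(final.metrics))
```

Because each task is a small function from one frozen state to another, it can be unit tested in isolation, and the original input remains available for debugging after the pipeline has run.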
AIM: What does the future of ML look like from your vantage point?
Abhishek: I think decision trees as a technique will flourish going forward. There will be more progress with regard to optimized dynamic-programming-based algorithms. And, when it comes to domains, healthcare will be the hottest of them all.
AIM: What’s your advice for the ML aspirants?
Abhishek: Before jumping into Data Science/ML, build a solid foundation in SQL, databases, and algorithms. Data cleaning is way more complicated than one can imagine. Try all possible kinds of data and learn the techniques to clean or transform it in specific ways.
Here are my top book and course recommendations:
- Designing Data-Intensive Applications by Martin Kleppmann
- Domain-Driven Design by Eric Evans
- Storytelling with Data by Cole Nussbaumer Knaflic
- An Introduction to Statistical Learning by Gareth James et al.
- The Elements of Statistical Learning by Trevor Hastie et al.
- Naked Statistics by Charles Wheelan
- The Algorithm Design Manual by Steven Skiena
- Harvard’s CS 109 Data Science course
- Machine Learning courses on Coursera