“Very early in my career, I realised that the majority of decisions in our lives are very similar to an optimisation exercise.”
For this week’s ML practitioner’s series, Analytics India Magazine got in touch with Sandip Bhattacharjee, Chief Data Scientist at Tabsquare.ai and a 4x Kaggle expert. In this interview, he shares experiences from a data analytics journey spanning over a decade.
AIM: Can you talk about your education and your introduction to the world of data science?
Sandip: As a student, I was always fascinated by the utility of statistics and econometrics in explaining different aspects of consumer behaviour. As far as my academics are concerned, I am an economist with a Masters in Economics from JNU, with special papers in Statistics and Applied Econometrics. I completed my undergraduate degree with a major in Economics and minors in Mathematics and Statistics. My academics and fascination for statistics eventually ushered me into the world of data science.
Very early in my career, I realised that the majority of decisions in our lives are very similar to an optimisation exercise. For instance, when I am choosing between multiple routes to my office, I am either trying to minimise the time taken or trying to maximise my car’s fuel economy. I found a direct correlation between real-life examples and machine learning algorithms, which are also founded on the principle of ‘Loss Functions’. These algorithms go through an optimisation exercise where the loss function is used to reduce the error in predictions. That realisation was my moment of epiphany. The fact that this line of thought can be leveraged for use cases such as shaping consumer behaviour with data only increased my fondness for Data Science.
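The optimisation analogy can be made concrete with a toy example. The sketch below is entirely illustrative, not from the interview: it minimises a mean-squared-error loss by gradient descent to fit a simple linear model, which is exactly the loss-function-driven optimisation exercise described above.

```python
import numpy as np

# Toy data: y is roughly 2*x + 1 with a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 0.5, 100)

# Parameters of a simple linear model: y_hat = w*x + b
w, b = 0.0, 0.0
lr = 0.01  # learning rate

for _ in range(2000):
    y_hat = w * x + b
    error = y_hat - y
    # Gradients of the mean-squared-error loss with respect to w and b
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Step downhill on the loss surface
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # should end up close to the true values 2 and 1
```

Every supervised ML algorithm follows this same pattern at heart: define a loss, then iteratively adjust parameters to reduce it.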
AIM: Can you talk about your data science journey?
Sandip: When I started my career in 2007, we didn’t have the term ‘Data Science’. Back in those days, the use of statistical models for descriptive, predictive and prescriptive analytics was all bundled into ‘Analytics’. My initial work in Analytics required me to build predictive models, mainly drawing on the Linear Regression family (GLM, HLM, Ridge Regression, etc.) and classical forecasting techniques (ARIMA, ARIMAX, ARCH, GARCH, VARMAX), using SAS as the main analytics tool.
Over time I had to train myself on newer Machine Learning techniques. I started with Andrew Ng’s course on Deep Learning to build the basic foundations of ML before diving deep into a wide variety of topics to stay abreast of the state of the art. Along with this, I also had to train myself on open-source programming languages like R and Python.
AIM: As a Chief Data Scientist, what does your typical day look like?
Sandip: Currently, I am working at Tabsquare.ai as VP and Chief Data Scientist. We are working on some of the most challenging and fascinating technology problems in the restaurant industry right now. This involves building state-of-the-art (SOTA) solutions in Menu Engineering, real-time recommendation engines and AI-generated 1:1 promotions for customers. While in most cases solution providers would stop at recommending the next best action for their clients, the implementation of these solutions often leaves a lot to be desired.
At Tabsquare, we collect more than 4 million data points each day. The biggest challenge is to create robust and scalable AI solutions that can utilise this data to serve millions of customers in real time. In most cases we are dealing with sub-millisecond latency for models deployed on edge devices like mobiles, tablets and kiosks. We often have to strike the right balance between model performance and high-throughput/low-latency requirements.
AIM: Can you talk about the challenges you have faced?
Sandip: A successful data science project is 80% getting the ‘Data’ right; the remaining 20% is ‘Science’. In one of my projects, 16 hours before a client presentation, we realised that our feature engineering pipeline had a case of target leakage within the k-fold CV regime. The model results looked too good to be true, and upon deeper inspection of all components we discovered this fatal flaw. The entire team stayed up through the night, corrected the feature engineering pipeline and updated the models.
The updated models turned out well and we were able to have a successful meeting. Even though it was a frustrating experience, it was one of the most valuable lessons for the entire team. In this context, accurate data is the fuel that powers even the most sophisticated ML algorithms.
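Target leakage of the kind described above often enters when preprocessing or feature engineering is fit on the full dataset before cross-validation splits it. A minimal scikit-learn sketch of the anti-pattern and its fix (the synthetic dataset and the scaling step are illustrative stand-ins, not the actual pipeline from the project):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Leaky pattern: the scaler is fit on ALL rows, so statistics from each
# fold's validation data leak into its training folds.
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5)

# Correct pattern: wrap preprocessing in a Pipeline so the scaler is refit
# inside every training fold, never touching that fold's validation rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clean_scores = cross_val_score(pipe, X, y, cv=5)

print(leaky_scores.mean(), clean_scores.mean())
```

Scaling leakage is mild; with target-derived features (the more dangerous case) the same Pipeline discipline is what keeps CV estimates honest.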
Contextualising a problem is very important: you will not always find a straightforward application of what you read in a research paper, and not every business problem is about reaching the best possible value of an evaluation metric, as in data science competitions.
That’s why I suggest aspirants get into the habit of doing Exploratory Data Analysis (often called EDA) for any problem. EDA forms the foundation of good feature engineering and, subsequently, high-quality models. Once the foundation is in place, you can start making your way to advanced topics. For experienced professionals, keeping pace with new techniques and technologies is quite important. This involves reading up on the latest developments, understanding the foundations of new techniques and finding time to code ML solutions end-to-end.
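A first EDA pass of the kind recommended here can be a handful of pandas one-liners. The dataset below is fabricated for illustration (restaurant-flavoured column names are placeholders); in practice you would load your own data with something like `pd.read_csv`:

```python
import numpy as np
import pandas as pd

# Illustrative dataset; substitute your own DataFrame in practice
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "order_value": rng.gamma(2.0, 15.0, 1000),
    "items": rng.integers(1, 8, 1000),
    "channel": rng.choice(["kiosk", "tablet", "mobile"], 1000),
})

print(df.shape)                      # rows and columns
print(df.dtypes)                     # data type of each column
print(df.isna().mean())              # share of missing values per column
print(df.describe())                 # distribution of numeric columns
print(df["channel"].value_counts())  # balance of a categorical column
print(df.corr(numeric_only=True))    # correlations between numeric features
```

Each of these checks feeds feature engineering directly: skewed distributions suggest transforms, missingness suggests imputation strategies, and correlations flag redundant features.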
“While recruiting data scientists, the most important aspect I look for is ‘First Principles Thinking’.”
AIM: What does it take to make a good Data Scientist?
Sandip: To stay competitive in this field, one should always find time for hands-on coding. It has helped me better contextualise the last-mile challenge of deploying scalable ML solutions. This, in turn, has helped me manage my teams and clients much more efficiently.
While recruiting data scientists, the most important aspect I look for is ‘First Principles Thinking’. Once a candidate can break down a complicated problem into its basic building blocks, it becomes much easier to formulate a set of steps that leads to the complete solution. I am never looking for someone who knows everything under the sun. However, when someone lists certain types of ML models as their expertise, I expect 100% clarity in the concepts related to them. Clarity in data structures is also crucial. In terms of programming, I place more importance on semantics than syntax; syntax is something even the most experienced programmers Google on a daily basis. Finally, experience with code versioning tools is a plus.
AIM: You are a Kaggle expert. What role do you think competitive platforms play in the DS ecosystem?
Sandip: This is often a highly polarising topic. One section believes that these competitions have no real value because in the real world you never get data as clean as what you get in competitions, while the other section believes that competing regularly gives you an edge.
Being a 4x Kaggle expert, I hold a slightly different view. Data science competition platforms like Kaggle give you a view into a small yet very important section of a complete data science project: doing effective EDA, applying various feature engineering techniques and building highly accurate models in multiple ways. However, what these competitions don’t teach you is how to clean and massage the data to make it usable. Most importantly, they don’t teach the last-mile challenge of deploying scalable models or the art of stakeholder management. I personally use platforms like Kaggle to complement my learning from working on real-life business problems. In the grand scheme of things, the experience of solving real-life business problems and the knowledge gained from data science competitions have positive synergies with each other.
AIM: What does your ML toolkit look like?
Sandip: My ML toolkit is a mixture of Python and Spark and looks as follows:
- Libraries: scikit-learn, SciPy, statsmodels, LightGBM, XGBoost, TensorFlow, cv2 and Transformers (by Hugging Face). On Spark, I use MLlib a lot
- Hardware: My personal Deep Learning hardware setup consists of a custom-built desktop – 64GB RAM, 8GB NVIDIA GTX 1070Ti, 256GB SSD – and a Lenovo Legion laptop – 16GB RAM, 6GB NVIDIA RTX 2060, 1TB SSD.
- Cloud: I do utilise the free GPU and TPU quota provided by Kaggle & Google Colab. For office work, I have mostly worked on GCP and there we can customise the hardware as per the need of the specific project.
AIM: Any tips and recommendations for Data Science aspirants?
Sandip: If you are someone who is just getting started with data science, I would recommend the following:
- Prof. Andrew Ng’s course: to get the fundamentals sorted
- ‘Deep Learning’ by Ian Goodfellow, Yoshua Bengio and Aaron Courville: for the mathematical foundations of Deep Learning, without any code.
- Courses/books by Dr. Adrian Rosebrock: for Computer Vision.
- Full Stack Deep Learning course by Pieter Abbeel, Sergey Karayev and Josh Tobin: to move from solving DL problems on a local system/Kaggle notebooks to full-scale production.
That said, for anyone starting off in Data Science, my suggestion would be to avoid trying to learn everything at once. Follow a structured process: start with basic regression and classification models and master the concepts end to end.
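Mastering basic regression and classification end to end might start with something as small as the sketch below, which trains and evaluates one of each (using scikit-learn's bundled toy datasets, chosen here purely for practice):

```python
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split

# Regression: predict a continuous disease-progression score
X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_tr, y_tr)
r2 = r2_score(y_te, reg.predict(X_te))

# Classification: predict a binary tumour label
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))

print(f"regression R^2: {r2:.2f}, classification accuracy: {acc:.2f}")
```

“End to end” then means going beyond this skeleton: understanding why each metric was chosen, inspecting coefficients and residuals, and only then reaching for more complex models.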
To prepare for this ever-evolving field, one must have the mentality of a student and the zeal to keep learning. What is considered an ‘advanced’ ML concept today may become a ‘basic’ one a few years down the line. I follow a three-pronged strategy to keep myself abreast of new developments in data science.
- Keep reading new academic papers and articles in AI/ML
- Keep programming skills handy by doing small personal projects and data science competitions whenever possible
- Finally, contextualise the new skills learnt in the previous two steps to the actual day-to-day business problems that we are trying to solve.
Irrespective of how experienced you are in this field, the hunger to learn something new every day is a key aspect of a Data Scientist’s journey.