Chemical engineering and applied mathematics make a rare combination. Sumanta Mukherjee, a research scientist at IBM, possesses this rare breadth of knowledge. Sumanta is an experienced research scientist with a track record of accomplishments in the information technology and services industry.
In addition, Sumanta is a researcher with expertise in machine learning, data science, mathematical modelling, computational biology, bioinformatics, and algorithm design. Analytics India Magazine caught up with him to gain insights into his perspectives on some of these topics.
AIM: Given that your career did not begin in data science, you have certainly climbed the ladder well. What obstacles did you face in starting your path in data science, and what approach did you take to overcome them?
Sumanta Mukherjee: I have a diverse career path. I started my career as a chemical engineer. I then pursued higher studies in computational science, followed by a PhD in applied mathematics.
After completing each degree, I worked in industry for a few years. I have worked as a process engineer, a software developer and, currently, a researcher.
After completing my PhD, I joined IBM Research, Bangalore. I am grateful for the great set of colleagues I have had at my workplace. IBM Research has a very diverse, open, and inclusive environment. As a result, most of my learning came through interaction with experts in the field and while solving targeted problems.
– From my experience, the best way to learn a topic is to solve a problem, discuss it with people who have experience in that field, and make continuous attempts to improve your solution.
– Data science is no different. One big benefit is free access to a large community and freely available resources. However, data science is expanding at a tremendous pace, which makes it a challenge to keep up. It demands continuous reading and staying up to date with the latest trends.
– A strong grasp of mathematics, statistics, and programming helps a lot. There are two important dimensions to data science:
- The first one is the algorithmic and mathematical aspect, and
- The second one is solving a problem on a large scale.
– Keeping up with both is difficult, so it is better to focus your attention on one specific dimension.
AIM: How significant is participation in hackathons and similar competitions when pursuing a career in data science?
Sumanta Mukherjee: It is very important, and the benefits are multi-faceted:
- It is all about honing your skills. Practice makes a person better.
- These competitions give you exposure to a larger community.
There are also data science-specific competitions, such as those hosted on Kaggle. Anyone seriously pursuing a data science career should be a part of the Kaggle community.
AIM: As someone with a research background and considerable experience working with research laboratories, could you emphasise the importance of research and the areas where companies should focus their efforts in machine learning?
Sumanta Mukherjee: My answer to this question will be biased, since my experience is restricted to the IBM Research lab, which is made up of a very able set of individuals.
I think industries are doing very well in finding challenging questions for the research community.
One purpose is to use data science and ML to support the current industry, and the other is to explore new questions. Most industries focus on addressing the first purpose where there is a direct business value. The second purpose is more academic, but it may help improve the future of science and industry. Therefore, I hope industries in India increase their academic collaborations to achieve a balanced and sustainable future.
One specific challenge to the application of data science is ethical restriction. Data can reveal many insights which may violate ethics. Therefore, defining rules and regulations around the application of data science and an effort to build algorithms that respect ethical restrictions should be prioritised.
AIM: Your research and industry experience has focussed on applied mathematics and energy efficiency. When effective energy management is critical, how do you believe data scientists can help solve these problems in today’s environment?
Sumanta Mukherjee: I did indeed join the smart energy group at IBM Research, but I am currently part of the retail supply chain team.
Data science is a tool to understand and comprehend a large volume of data. Data is plentiful today. In any field, the volume of data is increasing exponentially. In this context, I will emphasise the two primary goals of data science:
(1) estimation and
(2) knowledge mining (eXplainable AI).
Estimation helps in taking a reactive approach to addressing a problem, while knowledge mining may help us adopt a proactive strategy to address a problem.
– If we ask the right question, data science can help us find a comprehensive answer. Used correctly, data science is a tool that helps science and technology progress.
AIM: Which machine learning/deep learning algorithm is your go-to and why?
Sumanta Mukherjee: Every algorithm has a different purpose. The selection of an algorithm depends on the problem. Often, we need to customise the inputs and outputs to cast the problem in a form appropriate for an algorithm. Sometimes we may need to tweak the algorithm to suit the problem.
– In the structured data domain, one algorithm stands out: XGBoost. There are many competing alternatives, but it is always my first choice for structured data regression and classification problems. The wide adoption of this algorithm in the applied machine learning community is due to its stability, scalability, and easy-to-use library interface. In addition, many explainability tools help in deriving insights from the trained model.
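XGBoost is built on gradient boosting of decision trees. As a rough illustration of that underlying idea only, and not of XGBoost's actual implementation or API, the pure-Python sketch below boosts one-split trees ("stumps") on a tiny structured regression problem. Every name in it (`fit_stump`, `fit_boosted`, `predict`, `mse`) and the toy dataset are assumptions made up for this example; the real library adds regularisation, second-order gradient information, and many engineering optimisations that are omitted here.

```python
# Toy gradient boosting for squared-error regression using depth-1 trees.
# Illustrative only: hypothetical helper names, not the XGBoost API.

def fit_stump(xs, residuals):
    """Return (feature, threshold, left_value, right_value) minimising squared error."""
    best = None
    for j in range(len(xs[0])):
        values = sorted(set(row[j] for row in xs))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for row, r in zip(xs, residuals) if row[j] <= t]
            right = [r for row, r in zip(xs, residuals) if row[j] > t]
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            sse = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lmean, rmean)
    if best is None:
        # No feature has two distinct values, so predict the mean everywhere.
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value


def fit_boosted(xs, ys, n_rounds=50, learning_rate=0.3):
    """Fit an additive model: start from the mean, then repeatedly fit residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)


def mse(model, xs, ys):
    return sum((predict(model, r) - y) ** 2 for r, y in zip(xs, ys)) / len(ys)


# Made-up structured dataset: two numeric columns, target depends on both.
XS = [[float(a), float(b)] for a in range(6) for b in range(6)]
YS = [3.0 * (a > 2) + 2.0 * (b > 3) for a, b in XS]

model = fit_boosted(XS, YS)
print("training MSE:", mse(model, XS, YS))
print("prediction for [5.0, 5.0]:", predict(model, [5.0, 5.0]))
```

Each round fits a small tree to what the current ensemble still gets wrong and adds a damped version of it to the model. The learning rate trades the number of rounds against stability, which is also the role it plays in production boosting libraries.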
AIM: What suggestions would you provide to someone seeking their first data science position?
Sumanta Mukherjee:
- Be part of an active community and participate actively in its discussions.
- Today, knowledge is free, and learning relevant skills depends entirely on one’s interests. Take an introductory course on Coursera or Udemy. I suggest Andrew Ng’s Coursera course. It is a very good starting point.
- Learn Python, the language of the data science community.
AIM: The rate of advancement in this field, particularly in deep learning, is unmatched. What will be the next frontier for algorithms based on deep learning?
Sumanta Mukherjee: Deep learning is the current trend. What makes it beautiful is that the basic building block of a deep learning model is extremely simple, yet when the blocks are put together as a system, they can do magic. The exponential growth in participation at the NeurIPS conference is a direct indicator of its growing popularity.
- Deep learning connects functional analysis, complex systems modelling, and dynamical systems analysis in one framework. I think we still have a long way to go to uncover its full potential.
- I expect imminent growth in graph neural networks, reservoir computing, and the application of causality in neural architecture design.
- I expect the application of deep learning to positively influence the growth of the retail industry, the healthcare sector, and climate adaptation.
AIM: Many publicly available datasets can be used to enhance our machine learning abilities. What kind of projects should aspiring data scientists work on to improve their resumes for today’s job market, in your opinion?
Sumanta Mukherjee:
- Natural language processing (NLP) skills are going to be in demand for some time.
- One of the bigger challenges in data science is solution deployment and automation. It is a skill one must definitely acquire.
- Participating in various open code platforms and creating a public profile that showcases your coding skills helps recruiters evaluate you.
AIM: Please share with us the names of your role models, if any. How has their work inspired you?
Sumanta Mukherjee: Richard P. Feynman has been my role model since childhood. I have always admired his way of understanding and explaining concepts. How easily we can explain a concept to others shows how well we understand it. Only when we understand something well enough (not through jargon, but through its basic functions) can we improve the system or find its flaws. Therefore, an in-depth understanding of the fundamentals of data science is essential.
AIM: Are there any research papers that you think every data scientist should read?
Sumanta Mukherjee: Research papers are very application-specific. There are tons of them, and it’s hard to list them all. I recommend the articles by Geoffrey Hinton, which are a must-read for those who want to work in deep learning. I also closely follow the work of Bernhard Schölkopf, Yoshua Bengio, and Michael Jordan.
A few textbooks for avid data scientists are listed below:
Machine Learning – Tom Mitchell
Pattern Classification – Richard Duda, Peter Hart, David Stork
Machine Learning: A Probabilistic Perspective – Kevin Murphy
Deep Learning – Ian Goodfellow, Yoshua Bengio, Aaron Courville
A Probabilistic Theory of Pattern Recognition – Luc Devroye, Laszlo Gyorfi, Gabor Lugosi
The Elements of Statistical Learning – Trevor Hastie, Robert Tibshirani, Jerome Friedman
Statistical Rethinking: A Bayesian Course with Examples in R and Stan – Richard McElreath
Elements of Information Theory – Thomas Cover, Joy Thomas
Information Theory, Inference and Learning Algorithms – David MacKay
Learning in Graphical Models – Michael Jordan