“Data science is a team sport. The notion of a unicorn data scientist is exactly that, a myth.” - Naveen Singla
For this week’s ML practitioner’s series, Analytics India Magazine (AIM) got in touch with Naveen Singla, VP, Data Science at Bayer Crop Science, who leads a team of 100+ data scientists. Naveen holds a Bachelor’s degree in Electrical and Electronics Engineering from IIT Delhi and a Ph.D. in Electrical Engineering from Washington University in St. Louis, USA. His doctoral research focused on the use of Bayesian inference algorithms for reliable communication on noisy channels. In this interview, Naveen discusses how he approaches a data science problem, the tools his team uses and more.
AIM: Tell us a bit about your educational background?
Naveen: I have always been drawn towards mathematics and decided to pursue a career in engineering and applied mathematics. I completed a Bachelor of Technology (B.Tech.) in Electrical and Electronics Engineering from IIT Delhi, India, in 2000. During my time there, I remember learning about Claude Shannon’s theory of communication, specifically his celebrated channel coding theorem, which elegantly related deep mathematical concepts to practical digital communication. This was a key moment that inspired me to pursue a Ph.D. in Electrical Engineering from Washington University in St. Louis, USA. My Ph.D. research was on the use of Bayesian inference algorithms for reliable communication on noisy channels.
AIM: How did your journey in data science begin?
Naveen: From my days in academia to my various roles in industry, my work has always been at the nexus of data, algorithms and computing. As my career has progressed, the scale, complexity and impact of my work have steadily grown. During my postdoc, I worked on developing a biometric for identifying humans, with potential covert applications. My first job in industry focused on mining information from financial ticker data, utilising heterogeneous high-performance compute architectures composed of CPUs, FPGAs and GPUs, with the aim of providing high-frequency trading services to investment institutions. For the past ten years, I have been at Bayer’s Crop Science division, where my work is similar in spirit: discerning insights from our data that help farmers sustainably improve food production.
AIM: What were the initial challenges and how did you address them?
Naveen: There were a few specific challenges in the early days. Most of them have been sufficiently addressed, but a few key ones still remain as opportunities.
- Knowing what the right questions are to ask before investing in collecting data or building models and infrastructure. This still remains an art and is most effectively addressed by working with subject matter experts.
- Being able to collect and access data relevant to the problems we were trying to solve. The data economy we take for granted today, a result of the proliferation of digital technology, has largely provided a path to solving this issue.
- Having the frameworks to manage data and compute, effectively utilising data and/or pipeline parallelism. At the time, this challenge had no effective solution beyond throwing more hardware at the problem, which really limited what we could compute.
- Having the right algorithm frameworks and libraries to quickly build and test models. In my early days I had to write an SVM solver due to the limited availability of open-source solvers. Today, there exist a whole host of powerful frameworks, libraries, solvers, etc. for developing algorithms.
AIM: What does your typical day look like at Bayer?
Naveen: I lead a team of 100+ data scientists working with all lines of business at Bayer’s Crop Science division to deploy data science products that transform or improve how we do business today.
My team works on all aspects of the data science lifecycle, including idea generation, data discovery, model development & testing, and deployment & ops, and partners with engineering teams on foundational elements such as our data science and data engineering technical stack. That’s why the work is so dynamic and interesting: one day I am working with executives on how we can provide novel services to farmers, the next day it could be how we scale our machine learning on genomics data to trillions of data points, and yet another day could be spent brainstorming how we improve diversity in our team. That’s what makes this role so challenging and rewarding.
The knowledge that our work is impacting something so fundamental as food security and environmental sustainability is immensely humbling and rewarding.
AIM: How does your team approach a Data Science problem?
Naveen: Data science is a team sport. The notion of a unicorn data scientist is exactly that, a myth. You have to have the subject matter experts work closely with the data scientist, who in turn works hand-in-hand with the data engineer and machine learning engineer. The most important element in any data science endeavor is understanding the problem: what am I trying to improve, who is the user, what metric will I use to measure change, etc.
An initiative my team worked on was to cluster the arable land across the mainland United States according to environmental attributes such as weather, soil and topography. Although this was an unsupervised learning problem, it was still crucial to work with the subject matter experts and understand how the clusters were intended to be used. This knowledge aided in feature engineering, resulting in more easily interpretable and usable clusters. Another key element of this initiative was the partnership between data scientists and data engineers: the initial solution was not scalable, so our data scientists worked closely with the engineers to develop a novel clustering solution that could scale to the hundreds of billions of data points covering all arable land in the US. We have since expanded this to a global solution. This iterative process is also typical of data science projects, whether it happens during validation with users, small-scale testing or product deployment.
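As a rough illustration of the kind of workflow described here, the sketch below standardises a few environmental attributes and clusters them with mini-batch k-means, which touches only a small random batch per step and is one common way to make clustering scale. This is not Bayer's solution, which the interview does not describe: the choice of mini-batch k-means, the attribute names (rainfall, soil pH, elevation), the synthetic values and all function names are assumptions made purely for illustration.

```python
import random
from statistics import mean, pstdev


def standardize(rows):
    """Z-score every column so no single attribute dominates the distance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # constant column: avoid dividing by zero
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))


def init_centers(rows, k, rng, sample_size=256):
    """Farthest-first initialisation on a random sample of the data."""
    sample = rng.sample(rows, min(sample_size, len(rows)))
    centers = [list(sample[0])]
    while len(centers) < k:
        far = max(sample, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    return centers


def mini_batch_kmeans(rows, k, batch_size=32, steps=100, seed=0):
    """Update centers from small random batches instead of the full data set."""
    rng = random.Random(seed)
    centers = init_centers(rows, k, rng)
    counts = [0] * k
    for _ in range(steps):
        batch = rng.sample(rows, min(batch_size, len(rows)))
        assigned = [nearest(p, centers) for p in batch]  # assign the whole batch first
        for p, j in zip(batch, assigned):                # then move the centers
            counts[j] += 1
            step = 1.0 / counts[j]  # running-mean update for center j
            centers[j] = [c + step * (x - c) for c, x in zip(centers[j], p)]
    return centers


# Synthetic, hypothetical land parcels: [rainfall_mm, soil_ph, elevation_m]
data_rng = random.Random(1)
highland = [[data_rng.uniform(270, 330), data_rng.uniform(5.8, 6.2),
             data_rng.uniform(1400, 1600)] for _ in range(50)]
lowland = [[data_rng.uniform(1070, 1130), data_rng.uniform(7.3, 7.7),
            data_rng.uniform(50, 150)] for _ in range(50)]

scaled = standardize(highland + lowland)
centers = mini_batch_kmeans(scaled, k=2)
labels = [nearest(p, centers) for p in scaled]
print(labels[:3], labels[-3:])
```

Standardising first is a small example of the feature engineering mentioned above: without it, elevation in metres would swamp soil pH in the distance calculation, and the clusters would be harder to interpret.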
AIM: What does your machine learning toolkit look like?
Naveen: Our tech stack choice is driven by our principles of using open source and fit-for-purpose technologies. As such, R and Python are our programming languages of choice, with a roughly 50-50 split in usage across data scientists that is trending towards Python. Data scientists develop models using standard ML libraries natively available within R or Python, or utilise frameworks, most commonly PyTorch and TensorFlow, for their ML needs. All of our work uses AWS or GCP services as foundational elements upon which data scientists build, test, deploy and operate their models.
AIM: There is a lot of hype around ML. Which domain do you think will come out on top in the next 10 years?
Naveen: Yes, and the hype is leading to improper use of machine learning. That results in failures, which end up hurting machine learning adoption and improvement.
As far as applications are concerned, ironically, machine learning is differentiating itself in two contrasting application areas – personalisation and automation. In the former, you are developing solutions tailored to the needs of an individual whereas automation requires you to provide homogeneous solutions that scale to various people, scenarios and environments.
Currently, supervised machine learning is all the rage, but it is fundamentally limited by the capability of the human labeler. Unsupervised learning overcomes this limitation but is still limited by what data are available. In my opinion, reinforcement learning will stand the test of time; the ability to directly interact with the environment and learn from that interaction enables continuous evolution and the only limiting factor then is compute.
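To make "learning from interaction" concrete, here is a minimal sketch of an epsilon-greedy multi-armed bandit, one of the simplest reinforcement learning settings. There are no labels: the agent only sees the reward for the action it just took and refines its estimates as it goes. The reward probabilities, step count and function name are hypothetical values chosen for illustration.

```python
import random


def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a Bernoulli bandit; returns value estimates and pull counts."""
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    counts = [0] * n_arms    # how often each arm has been pulled
    values = [0.0] * n_arms  # running estimate of each arm's mean reward
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                        # explore: random arm
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit: best estimate
        # The environment answers with a reward; the agent never sees reward_probs.
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean
    return values, counts


values, counts = run_bandit([0.2, 0.5, 0.8])
best_arm = max(range(len(values)), key=lambda a: values[a])
print(best_arm, counts)
```

More interaction steps give better estimates, which is the sense in which compute, rather than labelled data, becomes the limiting factor.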
AIM: What do outsiders get wrong about this field?
Naveen: Two things. First, equating data science with data scientists. As I mentioned previously, data science is a team sport that requires the subject matter experts, data scientists, human-centric designers, data engineers, etc., to come together for the right solution. Second, believing that data science does not impact them. Whether it’s our professional or personal lives, data science is already profoundly impacting every aspect of our lives and impacting it for the better. However, this will be a journey where the humans have to guide the machines to make our lives better. We need to change the narrative from humans vs. AI to humans + AI!
AIM: What’s your advice to aspirants targeting data science roles at Bayer Crop Science?
Naveen: Technical skills are table stakes, and while in the past a university education was a surrogate for technical acumen, that has changed now. A great way to advertise your technical skills is to create a portfolio of your work on GitHub or elsewhere. A piece of code shows more about your thinking than any interview ever could. The other key qualities we look for in data scientists are curiosity and empathy. While the former sows change, the latter enables change. At the end of the day, our work is about improving the lives of people, farmers in Bayer Crop Science’s case, and understanding those people is crucial to doing that.
Here are a few additional tips:
- Learn continuously: There is a plethora of courses available on learning platforms such as Coursera, edX, DataCamp, etc., from introductory to advanced levels, covering the gamut of machine learning, programming and engineering. Use them!
- Learn by doing: Pick a programming language and one of the many data sets that are publicly available from sources such as KDnuggets, Kaggle, Carnegie Mellon University, UCI’s ML repository, ImageNet, etc. to build and sharpen your skills.
- Learn mathematics: While an advanced degree is not necessary, a good, higher-than-basic understanding of what goes on inside the box in machine learning is, especially as you try to solve problems that do not fit a predefined mold.