Interview With Kaggle Triple Grandmaster Rob Mulla

For this week’s ML practitioner’s series, Analytics India Magazine (AIM) got in touch with Robert Mulla, a Kaggle triple grandmaster ranked #34 on the global leaderboards. Rob has bachelor’s and master’s degrees in Electrical Engineering. After a career as an electrical engineer, he went back to school part-time to pursue a Master’s in Information and Data Science from UC Berkeley. Rob currently works as a Senior Data Scientist at AstraZeneca. In this interview, Rob shares a few snapshots from his journey as a data scientist.

AIM: How did your fascination with algorithms begin?

Rob: It started when I was working at the power utility. We had an enormous amount of data gathered about the power grid. I found myself working in spreadsheets so large that they would crash my computer. That’s when I started to research programming languages like R and Python. I had heard about Machine Learning but didn’t know much. Later, I found Andrew Ng’s lectures from his Machine Learning course on YouTube. I was hooked and just wanted to learn more about data science and machine learning.

AIM: What were the initial challenges and how did you address them?

Rob: I think since data science is such a broad and quickly changing field, it can initially feel overwhelming just picking where to start. I found that coding was something I really enjoy, but at first it was challenging to get to a point where I felt comfortable with what I was doing. Thankfully, coding is the type of skill that you can always improve if you are willing to be diligent and consistently stretch yourself to learn new techniques.

I learn best when I am working on a specific problem. It could be anything really: a school assignment, a Kaggle competition, or even just a personal project. If I find it interesting, then learning the skills required to solve the problem tends to come naturally. I don’t think I’m any smarter than the average person, but I do have a lot of grit and tend to stick with a problem until I am happy with the solution.

AIM: What books and other resources have you used in your journey?

Rob: I was fortunate enough to be able to take graduate courses on data science while working. That was immensely helpful for me to get up to speed. I read many books during that time that weren’t technically specific to data science but really shaped my understanding of the field, including Daniel Kahneman’s “Thinking, Fast and Slow” and “How to Lie with Statistics” by Darrell Huff. I also spent a lot of time watching YouTube tutorials and reading on Stack Overflow.

AIM: What has drawn you towards Kaggle?

Rob: Initially, I was drawn more towards the community of Kaggle. I enjoyed making Notebooks and sharing my ideas with the world and getting real-time feedback. Comments and upvotes were just a way for me to gauge if what I was producing resonated with people. I also found that creating notebooks has been a great way for me to push my storytelling and data visualization techniques. Kaggle notebooks continue to help me find new ways to display and explore novel data sets.

After I became Notebooks Grandmaster, my focus shifted more towards the competitions side of Kaggle. I have really enjoyed digging deep into the competition problems. Working on competitions can be frustrating, especially when you work really hard on ideas that don’t pan out. But when you do have ideas that work, it feels really good!

AIM: How do you tackle a competition or any data science problem?

Rob: More recently, I find myself first spending a few days exploring the data and understanding the competition metric. Understanding the metric is always key to doing well and is often overlooked, especially when it’s an uncommon one. Next, I move on to a baseline model and then start to work on more advanced techniques to improve upon it.

I try to think about a Kaggle competition like a huge grid search being done across all the teams. It’s easy to get sucked into focusing on public notebooks or discussion threads about what “works” in the competition. Sometimes, what is shared publicly is enough for anyone to do well in a competition. You have to watch out, though, because it can also leave you stuck in a local optimum. If you focus too much on what other people are doing, you may miss out on a better approach. It’s all about juggling time, resources and ideas over the course of the competition. And, of course, it takes a little bit of luck.
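To illustrate why understanding the metric comes before modelling, here is a minimal sketch (not Rob's code). It assumes a hypothetical competition scored with root mean squared logarithmic error (RMSLE), a metric the interview does not name, and a made-up skewed target. It compares two constant baselines: the arithmetic mean, and the constant that is optimal under this particular metric.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (assumed competition metric)."""
    sq = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))

# Hypothetical, heavily skewed target values (illustrative only).
y = [1, 2, 2, 3, 5, 8, 40, 250]

# Baseline 1: predict the arithmetic mean for every row.
mean_const = sum(y) / len(y)

# Baseline 2: predict the constant that minimises RMSLE, i.e. the mean
# taken in log1p space and mapped back with expm1.
log_const = math.expm1(sum(math.log1p(t) for t in y) / len(y))

score_mean = rmsle(y, [mean_const] * len(y))
score_log = rmsle(y, [log_const] * len(y))

print(f"mean baseline RMSLE:      {score_mean:.3f}")
print(f"log-space baseline RMSLE: {score_log:.3f}")
```

The second baseline scores better without any model at all, simply because it is matched to the metric. Any later model can then be judged against that number.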

AIM: What fascinates you about the Kaggle community? 

Rob: I just love how diverse the community is, how welcoming it is to newcomers, and how fun it can be to work on a problem alongside other people who are passionate about data and data science. Often, when I’m reading a solution writeup by a top team after the competition is over, I’m blown away by how brilliant yet simple the solution is. These “aha” moments always inspire me in my next competition. 

AIM: What are your tips for getting to the top on Kaggle?

Rob: I mentioned this before – but I do think it mostly comes down to determination and grit. Sure, you need a good foundational understanding of programming and statistics – but what really sets the top Kagglers apart is their relentless determination when working on a problem. In any competition, there are hundreds or thousands of very smart people simultaneously working on the problem – and you won’t be able to succeed every time, but if you stick with it you’ll become familiar with the techniques commonly used by top teams, and eventually you’ll break through.

AIM: Tell us about your role at your current company

Rob: Currently, I work as a data scientist at a pharmaceutical company. We explore ways in which data science and machine learning can be leveraged to aid in the development of cancer drugs. Previously, I started out working with large datasets at an electrical utility, where we would use weather and energy consumption data to forecast future demand on the power grid. I also spent a while working in the hospitality industry as a data scientist in revenue management. I’m very grateful that I’ve been able to work in a diverse set of fields, all working with data or in data science positions. It’s allowed me to be comfortable with how different companies operate with their data. It’s also fun to have worked on such a wide range of problems.

AIM: What does your machine learning tool stack look like?

Rob: I code exclusively in Python, but beyond that I try to stay open to different frameworks and libraries. Of course, numpy and pandas are foundational to almost everything I do. I love matplotlib, seaborn, and bokeh for visualization. For tabular machine learning problems, I’m comfortable working with xgboost, lightgbm and catboost. More recently, all the Kaggle competitions I’ve taken part in use some sort of deep learning. I’ve done some work in TensorFlow, but now prefer to use PyTorch for anything deep learning related. I’m a fan of PyTorch Lightning for organizing my PyTorch projects.

AIM: What ML techniques, use cases, and applications do you think will stand the test of time?

Rob: I think many techniques have already stood the test of time; we just take them for granted because of how integrated they have become in our daily lives. Most people use Siri, Alexa and Google Assistant, which are driven by deep learning. Our social network apps feed us information based on machine learning algorithms. Pretty soon, we will see cars become more and more autonomous, although I still think we are a long way away from completely autonomous vehicles that can handle every edge case driving scenario. I think the big leap will be taking these already proven techniques that the big tech companies have been using for years and democratizing them so they become more common in other industries. I don’t think the limitation right now is really on the technology per se, but more on companies’ ability to frame business problems in a way that can be solved by machine learning. Machine learning is very powerful, but you need to be thoughtful in determining and framing exactly how to utilize its power towards solving meaningful problems.

AIM: Which domain of AI will come out on top in the next 10 years?

Rob: Personally, I think some of the coolest innovations right now are in the domain of reinforcement learning. Algorithms that can learn from past experiences and continue to improve really fascinate me. Currently, most productionized machine learning algorithms are trained on data and then deployed to predict on new data – but they don’t improve until they retrain. Reinforcement learning has a chance to be a gamechanger in that space. I don’t know how close we are to seeing those algorithms commonly used in industry – but I’m excited to see how it grows over the next 10 years. I think Kaggle has noticed this and their new “Simulation Competitions” focus on problems where reinforcement learning can be applied.

AIM: What do outsiders get wrong about this field?

Rob: It depends on what you mean by “outsiders”. I have personally seen a lot of self-proclaimed “data science experts” who have done much more damage by applying machine learning incorrectly. One of the most common things I see being done is improper cross-validation of models. It’s very easy to unwittingly introduce a leak into your machine learning pipeline. When leaks are introduced, they can severely overstate the predictive power of a machine learning model. Kaggle is great because it helps data scientists learn how to avoid these types of mistakes. Since the final leaderboard is graded on a holdout test set, you won’t succeed unless you set up proper, leak-free cross-validation.
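The kind of leak described above can be shown with a small, self-contained sketch (not Rob's code). Everything here is an illustrative assumption: the data is pure random noise, the model is a simple nearest-centroid classifier, and the helper names are hypothetical. The only difference between the two runs is whether feature selection sees the validation rows' labels.

```python
import random
from statistics import fmean

def make_noise_data(n_samples=40, n_features=500, seed=0):
    """Random features and alternating labels: there is no real signal."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y

def top_features(X, y, rows, k):
    """Pick the k features whose class means differ most on the given rows."""
    def gap(j):
        pos = [X[i][j] for i in rows if y[i] == 1]
        neg = [X[i][j] for i in rows if y[i] == 0]
        return abs(fmean(pos) - fmean(neg))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]

def predict(X, y, train, row, feats):
    """Nearest-centroid prediction for one row, using the selected features."""
    def sq_dist(label):
        total = 0.0
        for j in feats:
            center = fmean([X[i][j] for i in train if y[i] == label])
            total += (X[row][j] - center) ** 2
        return total
    return 1 if sq_dist(1) < sq_dist(0) else 0

def cv_accuracy(X, y, leaky, k=10, n_folds=5):
    rows = list(range(len(X)))
    # Leaky variant: features are chosen once, using every row's label,
    # including rows that later serve as validation data.
    leaked = top_features(X, y, rows, k) if leaky else None
    correct = 0
    for fold in range(n_folds):
        train = [i for i in rows if i % n_folds != fold]
        valid = [i for i in rows if i % n_folds == fold]
        # Leak-free variant: selection sees the training fold only.
        feats = leaked if leaky else top_features(X, y, train, k)
        correct += sum(predict(X, y, train, i, feats) == y[i] for i in valid)
    return correct / len(rows)

X, y = make_noise_data()
leaky_score = cv_accuracy(X, y, leaky=True)
honest_score = cv_accuracy(X, y, leaky=False)
print(f"leaky CV accuracy:     {leaky_score:.2f}")
print(f"leak-free CV accuracy: {honest_score:.2f}")
```

Because the labels are unrelated to the features, an honest estimate should sit near chance (0.5), while the leaky run typically reports a much higher accuracy that would not hold up on a holdout set. In practice, the same discipline is commonly enforced by placing preprocessing and the model inside one scikit-learn `Pipeline` and passing that to `cross_val_score`, so every step is refit on each training fold.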

AIM: Any additional tips for the beginners? 

Rob: If you are completely new to coding, I would recommend taking an introductory course on Python. After that, I would recommend taking the leap and creating a public notebook on Kaggle or competing in a Kaggle competition. “Learning through doing” in my experience has been the best approach. I think, more than anything, it’s just the willingness to continuously learn. We are all in the process of improving our skills. Just look to improve each day and stick with it. I also find it helpful to set realistic goals. Maybe it’s to get a bronze medal in a Kaggle competition, or to write a blog post about a data science project you’ve been working on. If you are consistently working on small projects like that, then over time you will look back and realize you’ve grown a lot.

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.