“Majority of Data Scientists I have met do not have formal data science education.”
For this week’s ML practitioner’s series, Analytics India Magazine got in touch with Kaggle Grandmaster Sergey Yurgenson, the Director of Advanced Data Science Services at DataRobot and a former world no.1 on Kaggle leaderboards. In this interview, Sergey shares his insights from a prolific data science journey that has spanned over a decade.
How It All Began
Sergey got his PhD in physics from St. Petersburg State University (formerly Leningrad State University) and believes that his background in physics contributed to his understanding of data science. “Most importantly, it [Physics] teaches to connect a theory or model with reality — any model should be based on observable facts (data), and any model is good only if it can predict a new phenomenon or explain observable facts,” added Sergey.
Sergey spent several years conducting Physics research during his time at the university. The research involved a significant volume of data analysis. It was not machine learning, quipped Sergey, but something simpler, like parametric curve fitting. Later he changed his field of research completely and joined the prestigious Harvard Medical School as a researcher in the department of neurobiology.
At Harvard, Sergey focussed his efforts on studying the visual cortex. His experiments involved many sophisticated optical devices, image collection and analysis, and relatively high throughput data collection. That said, Sergey finds fundamental differences between physics data and biology data.
“Physics education is very easily translatable into the data science domain. It is heavy in math, statistics, data analysis, and, usually, provides some programming experience.”
“Data in physics is relatively reproducible. There is a probability distribution in many processes, but, taking that into account, one should expect to see a similar outcome of an experiment, conducted under similar conditions. Variance in biology is significantly larger. At some level of details, each brain is different; each subject may behave differently. Thus, conclusive data analysis is more difficult, and results may not be as reliable,” explained Sergey.
Currently, Sergey leads the DataRobot Data Science professional services group. DataRobot has created what he believes to be the first and the best AutoML software on the market today.
AutoML, continued Sergey, allows clients to solve many data science problems faster and with fewer resources. When clients run out of solutions, Sergey and his team help them utilise the full potential of DataRobot’s AutoML platform to generate business value. “My team would build, implement and put models in productions. We would help to brainstorm use cases and calculate business value. As a result, I have an opportunity to deal with a wide spectrum of industries and use cases: from insurance and banking to retail and HR, from demand prediction and price elasticity to churn and staffing optimisation.”
Despite a lot of experience with traditional data analysis, Sergey’s journey as a Data Scientist started with Kaggle.
About His Kaggle Journey
“I never had a goal ‘to study data science’. My goal was always to ‘solve specific problems’.”
With Kaggle, Sergey added another feather to his cap by becoming the world no. 1 in the global Kaggle rankings. “I found Kaggle by accident,” reminisced Sergey. For several years, he continued, he had been participating in Matlab coding competitions with a unique set of rules: any code submitted to the competition became instantly available to all competitors, and everyone was allowed to modify the code and resubmit it.
“It created a very complicated and fast competition dynamic. After one of those competitions, I realised that my competitive spirit was not completely satisfied, and I started looking for other competition platforms. That is when I stumbled upon Kaggle,” he said.
Sergey’s first Kaggle competition was “RTA Freeway Travel Time Prediction”. This competition required participants to predict travel time on Sydney’s M4 freeway from past travel time observations. In addition to better informing network managers and Australian motorists, insights from the competition were expected to improve the general efficiency of the road transport system in Sydney and increase functionality on the government’s live traffic website.
Participants were required to forecast the travel time on the M4 freeway 15 minutes, 30 minutes, 45 minutes, one hour, 90 minutes, two hours, six hours, 12 hours, 18 hours and 24 hours ahead.
However, Sergey did not know any machine learning at the time of his first competition, so he built a model using correlation coefficients and some manually created waveforms. To his surprise, he finished second on the global leaderboard. “I was hooked, and my next success happened a couple competitions later,” he said.
That success came in one of the first data science competitions with a scientific goal: the first “Dark Matter” competition.
In the gap between his first competition and the one on Dark Matter, Sergey had a good three months to learn quite a bit about ML algorithms. For most of that time, he used Matlab to implement these algorithms. For the “Dark Matter” competition, he applied PCA (Principal Component Analysis) to the images and used ensembles of several dozen simple neural networks. He finished second yet again.
This success was followed by several top 10 finishes and, eventually, first place in the “Predicting a Biological Response” competition. During his decade-long journey on Kaggle, Sergey also had the opportunity to team up with and learn from great data scientists like Xavier, Giba, Owen, Bluefool, and DataRobot founders Jeremy and Tom.
“…the main purpose of the benchmark model is to make sure that my submissions are created correctly.”
Sergey has participated in 66 competitions and won 18 gold medals, so we asked what his typical routine looks like. He explained that his approach is based on incremental improvements.
The first step, he continued, is to create a benchmark model. Usually, it is Random Forest or XGBoost with minimal feature engineering. In a competition, the main purpose of the benchmark model is to make sure that the submissions are created correctly and to check the difference between the local model performance and the competition leaderboard.
If the difference is significant and/or there is a time component in the dataset, then the second step is to create a training-validation partition that reflects the structure of the competition model evaluation and provides results similar to competition leaderboard results. This is usually followed by feature engineering and model tuning.
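As a rough illustration of the second step, here is a minimal sketch of a time-based training-validation partition, assuming the competition scores predictions on data that comes later than the training period. The function name, field names and toy records are hypothetical and are not taken from any of Sergey's solutions.

```python
from datetime import datetime


def time_based_split(rows, time_key, cutoff):
    """Split rows so that every validation row is at or after the cutoff.

    This mimics a leaderboard that is scored on a later time period:
    the model never sees the future during training.
    """
    train = [r for r in rows if r[time_key] < cutoff]
    valid = [r for r in rows if r[time_key] >= cutoff]
    return train, valid


# Toy records for ten consecutive days (illustrative values only).
rows = [
    {"ts": datetime(2010, 1, day), "travel_time": 60 + day}
    for day in range(1, 11)
]

# Train on days 1 to 7, validate on days 8 to 10.
train, valid = time_based_split(rows, "ts", datetime(2010, 1, 8))
```

A random shuffle would mix future observations into the training set, which is one common reason for a local score to look better than the leaderboard score.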
“I do not implement all ideas I have simultaneously, but try them one by one to evaluate the benefit of each one separately. I keep a list of “ideas to try” to make sure I try them all. Many ideas are based on Kaggle discussions or kernels; some come to me while I work on other data science projects or read literature, not necessarily related to the competition. Sometimes, I set the problem aside for several days. That allows me to free my mind and then look at the problem from a new angle,” explained Sergey.
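The one-idea-at-a-time routine described above can be sketched as a small loop that measures each idea against the same baseline. This is only an illustration under stated assumptions: the idea names and validation scores are made up, and `score_with` is a hypothetical stand-in for retraining a model with a single change applied.

```python
def evaluate_ideas_one_by_one(baseline_score, ideas, score_with):
    """Return the gain of each idea over the baseline, measured separately.

    ideas: list of idea labels (hypothetical).
    score_with: callable mapping an idea label to a validation score,
                where higher is better.
    """
    return {idea: score_with(idea) - baseline_score for idea in ideas}


# Made-up validation scores, purely for illustration.
fake_scores = {
    "lag features": 0.82,
    "target encoding": 0.79,
    "drop outliers": 0.80,
}

gains = evaluate_ideas_one_by_one(0.80, list(fake_scores), fake_scores.get)

# Keep only the ideas that beat the baseline on their own.
helpful = [idea for idea, gain in gains.items() if gain > 0]
```

Because each idea is scored against the same baseline, its benefit can be read off directly instead of being entangled with other changes.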
Sergey reiterated that working with Matlab was good enough to become a Kaggle GM. There is so much talk about Python vs R and deep learning, but Sergey keeps it simple or, to be precise, uses what is necessary to get the job done. He uses R from time to time, and usually picks Python over R.
“For several years, Random Forest (RF) was my favourite algorithm. It is extremely “stable” and reliable. It has a smaller number of tuning parameters than most other algorithms, and one does not need to take special measures to avoid overfitting. Out of the box, one will get decent results for most data science problems. In the end, XGBoost, LightGBM and other boosting algorithms usually will outperform RF. Thus, these algorithms are my go-to nowadays,” he added.
On The Future Of ML
Given the hype around ML and the many fanciful projections, we asked Sergey what, according to him, the future of machine learning as a domain would be. “It is a fool’s game to try to predict what happens with AI in 10 years,” quipped Sergey. He further added that machine learning can never completely replace humans in the decision-making process. Therefore, ML research and development will be evaluated not as a stand-alone product, but by how well it works together with humans. That said, he would like to see whether Reinforcement Learning makes any progress going forward.
Ruminating on the perceptions of outsiders on ML, Sergey explained that outsiders pay too much attention to human-like AI behaviour and to the machine-vs-human competition. Outsiders are impressed when machines beat humans in chess or Go or when computers paint in the style of impressionists. Thinking about Artificial Intelligence, we are thinking about Artificial Human Intelligence. Without knowledge of any other intelligence, it is usually assumed that to be Intelligent means to behave like Humans. “One of the side effects is under-appreciation of AI and ML in situations, where Humans do not excel, like forecasting or predictive modelling in high-dimensional space,” observed Sergey.
The abundance of resources can be overwhelming to those who aspire to be data scientists. So, how does one become a good data scientist? Sergey cites his own journey as an example: he never had a goal “to study data science”. His goal was always “to solve specific problems”. This, in turn, defined the type of resources he used. According to Sergey, fewer textbooks and more research articles and internet sources (Kaggle discussions, blog posts, StackOverflow, library documentation) would cover most of what one needs to learn data science.
“Do not be afraid to not know something. There is no data scientist who knows everything. If you think you know everything, you are definitely missing something. Actually, the more you realise limitations of your knowledge, the better data scientist you are,” advised Sergey.
He even looks for the same approach while hiring for his data science team. “Usually, I am not looking for specific expertise beyond some basic data science knowledge. What I am looking for is the ability to think and the ability to reason. If you tell me that you were using a specific ML algorithm or data preprocessing in your project then I will always ask why,” concluded Sergey.