For this week’s ML practitioner’s series, Analytics India Magazine got in touch with Luca Massaron, a Kaggle master, former top-10 Kaggler and a man of many talents. It is not uncommon for data scientists to come from varied walks of life, since data science is inherently an interdisciplinary domain. However, Luca Massaron, a Google Developer Expert, a veteran of the field and one of the best-selling ML authors, has had an unusual journey even by those standards.
A master’s degree in political science, an MBA, authorship of popular machine learning books and data science expertise: these are a few of the many things Luca Massaron is known for. He is a veteran of the field, and his interest in machine learning and algorithms dates back to a time when the term data science was still completely unknown and data mining was the more common term for digging information out of databases using algorithms and specialised software like SAS or SPSS.
During the four years of his university studies in statistical inference and research methodology, Luca actually spent more time studying international politics, philosophy and history. After obtaining a master’s degree in Political Sciences from the University of Bologna, Italy, he went on to earn an executive MBA while pursuing his professional career. In those early days, Luca’s understanding of business and management played a key role in his ascent to senior data scientist.
I’m divided between two cultures, an original humanistic one at the foundations of my education, and a practical, scientific one derived mostly from my professional experience.
After working for more than four years as a lead data scientist at the Axa group, setting up models for telematics (IoT data) and claims prediction (images and tabular data), Luca moved to the financial sector. He is currently a senior data science expert at illimity Bank, a digital bank that specialises in non-performing loans. In the bank’s Risk Analytics Department, Luca develops models for risk, credit rating, loss prediction and fraud detection. He and his colleagues have also collaborated on an interesting recent paper, which can be found here.
At the beginning of his professional career, back in 2000, Luca worked for a Web startup where he used simple tools such as Microsoft Office and some Pascal code to play with data. This project resulted in reports that were immediately sold to local Italian branches of companies like Microsoft, Excite!, Altavista and Yahoo.
I was so surprised that data could be so valuable and of such importance that companies were paying for it.
Luca quickly realised that unlocking insights from data had a great future and could be a promising path for his career.
Getting up there in the global Kaggle rankings boils down to simply one thing — perseverance.
Luca has competed in more than 170 competitions. He has also featured in the top 10 of the global rankings, placing 7th. Luca first came across Kaggle on KDnuggets. As he started to participate, he discovered that his understanding of data mining was somewhat ineffective for the problems proposed on Kaggle.
Participating in the “Psychopathy Prediction Based on Twitter Usage” competition gave him the confidence to go big. During this competition, he picked up simple but effective tricks, such as fitting a Random Forest model when many features had missing values and pre-processing data. He also learned that he had been applying machine learning methods the wrong way, which resulted in him falling from 3rd position on the public leaderboard to 57th on the private one.
As he got accustomed to participating in Kaggle competitions, Luca decided to climb to the top of Kaggle during 2013, starting with Amazon’s Employee Access Challenge. It took him a few months to reach the top ten, and he stabilised at 7th position globally for the remainder of 2014.
Failing fast, learning fast and holding on in spite of disappointing results can help in your climbing, and here, time is the real discriminator.
After a while, anyway, the marginal utility of Kaggle points and medals may decrease in your eyes.
As for what comes after success on Kaggle, Luca believes that participants gradually tend to lean towards other, more rewarding activities, such as writing papers or developing open-source and pro-bono projects with other practitioners.
Nevertheless, he emphasises that participating in a Kaggle competition is always a useful experience, even if you do not compete for points or medals. If one invests the right amount of time (neither too much nor too little) on Kaggle, the platform is still one of the best schools to practice machine learning and pick up the latest algorithms, coding skills and technicalities of data analysis.
Tools, Tips And Tricks
There is no free lunch and most algorithms basically perform the same in the long run.
It may sound old school, but Luca’s approach to a problem is as follows:
- Look at the data. Print it.
- Visually explore tables.
- Sample a few cases.
- Plot univariate, bivariate and multivariate plots.
- Then look at papers and previous Kaggle competitions and, if possible, talk to some domain experts.
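The first steps of this checklist can be sketched in a few lines of Python. This is only a minimal illustration, not Luca’s own code: the toy table and the helper names (`load_rows`, `univariate_summary`, `pearson`) are hypothetical, and numeric summaries stand in for the plots one would normally draw with a plotting library.

```python
import csv
import io
import random
import statistics

# Hypothetical toy table standing in for a real dataset (not from the interview).
RAW = """age,income,churned
22,21000,1
25,32000,0
31,40000,0
38,52000,0
45,61000,1
52,58000,0
"""


def load_rows(text):
    """Parse CSV text into a list of dicts with float values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: float(value) for key, value in row.items()} for row in reader]


def univariate_summary(rows, column):
    """Basic one-column summary, a numeric stand-in for a histogram."""
    values = [row[column] for row in rows]
    return {
        "min": min(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "max": max(values),
    }


def pearson(rows, x, y):
    """Pearson correlation between two columns, a stand-in for a scatter plot.

    Assumes both columns have non-zero variance.
    """
    xs = [row[x] for row in rows]
    ys = [row[y] for row in rows]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    spread_x = sum((a - mean_x) ** 2 for a in xs) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in ys) ** 0.5
    return covariance / (spread_x * spread_y)


rows = load_rows(RAW)

# Look at the data. Print it.
for row in rows[:3]:
    print(row)

# Sample a few cases (seeded so the sample is repeatable).
random.seed(0)
for row in random.sample(rows, 2):
    print("sampled:", row)

# Univariate view of every column.
for column in rows[0]:
    print(column, univariate_summary(rows, column))

# Bivariate view of one pair of columns.
print("corr(age, income) =", round(pearson(rows, "age", "income"), 3))
```

On real data, the same steps would usually be done with a dataframe library and actual plots; the point of the sketch is only the order of operations: print, sample, summarise one column at a time, then look at pairs.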
Luca believes that there is no free lunch and that most algorithms basically perform the same in the long run. So, he tries to devise many complex transformations and to make sense, from a theoretical point of view, of why they work or don’t work with the data.
Although now a Pythonist, Luca reminisced about how he gradually moved from R towards Python. “At the very beginning of my career, a very important tool that helped me learn was R. I discovered open-source software for analysis, and R, in 2002, became my favorite tool for analysis and modeling because I couldn’t have access to the commercial tools such as SAS and SPSS. I remember how I spent long hours glancing at the list of R libraries, exploring vignettes and help files and trying all the examples I could,” explained Luca.
It was Kaggle and its competitions, says Luca, that led him to learn Python and leave R. Depending on the situation, he also leverages SQL/Google BigQuery, Dask or PySpark for processing large amounts of data.
When it comes to libraries, Luca mainly uses Scikit-learn and Keras/TensorFlow for machine learning projects. Apart from these, he finds himself using the following:
- LightGBM and CatBoost for state-of-the-art gradient boosting.
- SHAP for local explainability.
- Scikit-optimize for Bayesian hyperparameter optimization or network architecture search.
- SQL/Google BigQuery, Dask or PySpark for processing large amounts of data.
Luca also revealed that he still likes to program on his old i5 8GB Sony Vaio from 2012, which he has used for many Kaggle competitions as a terminal for cloud instances on Google Cloud Platform or on Kaggle Kernels.
In my humble opinion, what makes one good to great is the soft ability to effectively collaborate with other engineers and business.
Luca believes that gone are the days when a lone data scientist could create a data product. Now, you need a well-orchestrated team, and teamwork is essential if you want to excel as an ML engineer. He also emphasised the ability to communicate with the business in order to integrate data products smoothly into its workflows.
Luca says that he has learned more about data science by spending a lot of time solving data problems at work (and some on Kaggle) and studying books on multivariate statistics, machine learning and AI. To get hands-on experience, he gathered datasets from various sources and even created his own synthetic data in order to put theory to the test. For machine learning aspirants, he recommends the following books:
- “The Elements of Statistical Learning” by Trevor Hastie, Robert Tibshirani and Jerome Friedman.
- “Learning from Data” by Yaser S. Abu-Mostafa, Malik Magdon-Ismail and Hsuan-Tien Lin.
Luca himself is the author of many best-selling machine learning books, the most famous being “Machine Learning For Dummies,” which he also credits for contributing to his learning curve. So far, he has written 12 books on data science, machine learning, algorithms and AI, which have been translated from English into many languages. The second edition of “Machine Learning For Dummies” is due this autumn.
Build your network of collaboration through Kaggle competitions, meetups, conferences, open-source projects
Luca also stressed how important it is for a newcomer to build a network beyond one’s workplace. These collaborations and networks, he insists, can assist professional growth in various ways.
Gunpowder, AI And The Future
As the world transitions towards AI, Luca believes that we will end up with a lot of cheap automation, mostly on mobile devices. However, he would bet big on natural language processing (NLP), because it can be very disruptive if AI can successfully process and manipulate the written word.
That said, Luca also underlined the current state of tools and algorithms, and their limitations in terms of feature representation and generalization of abstract concepts. On the flip side, even when a solution is successful, people cannot grasp its complexity, and explainability surfaces as a challenge.
They tend to believe that AI is a kind of cheap magic, but it has some strong limitations in its efficacy.
Luca is a student of political science and philosophy, and when he was asked to comment on the role of regulations in AI deployment, he masterfully drew an analogy between gunpowder and AI.
“Ideas changed human history, and also tools played their part, affecting our present reality. Innovations such as gunpowder changed European social structure, overshadowing the military role of nobles and ultimately weakening their position in society, opening the way to the modern state and the liberal-democratic regimes.”
Similarly, continued Luca, “AI could prove as disruptive as gunpowder in our societies. Besides all the good that AI could give back, we could have negative consequences where countries can exercise control over people or could possibly dominate others because of their AI supremacy. Now, the question is, do we want to avoid the negative effects of AI or simply look at it in a completely positive way?”
Unfairness in AI is only the tip of the iceberg
Luca looks at the unfairness in AI as only the tip of the iceberg and warns that there are other more unsettling sides to it. Facial recognition systems, for example, will surely increase safety in our societies, but they will also give the state a previously unknown opportunity for societal control.
Another serious problem is that of patenting AI, which, he says, could prevent misuse of certain kinds of sensitive technology but would also limit the possibility for rapid development of the technology.
He firmly believes that only well-crafted regulations can prevent the negative side effects of AI, and he calls for a discussion among data science experts at both national and international levels. He also reminds us that AI is a tool, and if tools are to become evil, it is because they have an evil maker or an evil user who was not stopped in the first place.
When we go toward laissez-faire systems, we end up relying on natural dynamics such as the survival of the fittest/strongest.
As the global warming problem proves, we are not very good at solving complex problems by leaving individuals free to pursue their own interest and the interest of others at the same time. “Many local optima do not make a global one,” quipped Luca.
Lack of proper regulations can make the blue skies and green fields of AI less blue and green. So, he strongly advocates for regulatory intervention in AI because he believes that, in the long run, AI can do more good than harm to humanity.