There is no dearth of ML tools today. For a beginner, however, it helps a great deal to know the tool stack of those who consistently win Kaggle competitions; one can later pick the tools of one's choice. In the next section, we look at the top tools, frameworks, cloud services and libraries used by Kaggle Masters and Grandmasters, as revealed to us in our exclusive interviews. That said, all of these top Kagglers are of the opinion that one should not fall in love with tools, and that any tool is fine as long as it gets the job done right!
Abhishek Thakur, a 4x Kaggle Grandmaster, says that he frequently finds himself using TensorFlow for NLP problems and PyTorch for computer vision problems.
When it comes to favourite Python libraries, Thakur praises scikit-learn for providing many of the components needed to put a model into production.
Thakur, however, believes that there isn’t a shortage of libraries or frameworks one can use these days, and it’s all good as long as one understands what is happening in the background.
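As a rough illustration of the kind of scikit-learn components referred to above, here is a minimal sketch that chains preprocessing and a model into a single pipeline object. The dataset (iris), the scaler and the classifier are assumptions chosen for brevity; they are not taken from the interview.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small bundled dataset, used here only as a stand-in for competition data.
X, y = load_iris(return_X_y=True)

# One object holds both the preprocessing step and the estimator,
# so the same transformations are applied at training and prediction time.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation gives a quick estimate of accuracy.
scores = cross_val_score(model, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f}")

# Fit on all data and predict for a few rows, as one might before deployment.
model.fit(X, y)
preds = model.predict(X[:3])
```

Keeping preprocessing inside the pipeline is one common way to avoid leaking statistics from validation folds into training, which is part of understanding "what is happening in the background".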
Arthur says that a basic laptop sometimes suffices. Depending on the competition, however, he sometimes rents GPUs on Google Cloud Platform with Kaggle vouchers.
Here is what Arthur’s toolkit looks like:
- Hardware: MacBook Pro (2019, 16GB, i7), or i7, 32GB + 1070Ti, or GCP
- Language: Python and C++
- Framework: Keras and PyTorch
- Augmentation library: albumentations
- Feature selection library: eli5 and lofo
- Visualization: Missingno and seaborn
- Imbalanced data: imblearn
- Parameter optimization: Optuna and skopt
A Kaggle Master ranked in the top 20 on the competitions leaderboard, Mathurin says that he prefers Python to R, though he used R until 2015. Mathurin has been in this field for over a decade and a half, and a renewed interest in algorithms made him switch to Python gradually.
A look at Mathurin’s toolkit, which he keeps coming back to:
- Algorithms: lightgbm, xgboost, catboost
- Cloud services: Google Colab and Kaggle kernels.
- Packages: scikit-learn, pandas, numpy
- Frameworks: Keras, TensorFlow, PyTorch and Fastai
- AutoML tools: Prevision.io, H2O, TPOT, auto-sklearn
Tri Duc Nguyen Tang
Duc, who is ranked in the world top 50 and is also chief data engineer and co-founder of the Vietnamese AI startup Palexy, says that he and his team usually use one server with 2x1080Ti along with a Kaggle kernel. For a competition like DeepFake, he prefers renting a server with 4x1080Ti or using AWS.
Talking about frequently used tools, Duc said that he usually finds himself using Keras-TensorFlow, OpenCV, albumentations, lgbm and scikit-learn. A data engineer by profession, Duc says that the role of a data engineer is to collect data and prepare the data pipeline. To build the necessary infrastructure and architecture for data generation, a data engineering team uses SQL, MySQL, Spark, Hadoop, Hive, etc.
A data scientist, by contrast, is responsible for obtaining insights from data and formulating these insights into a model to communicate with clients. For this, data scientists use statistics, visualisation (matplotlib, seaborn), modelling (sklearn, TensorFlow, PyTorch), etc.
An AI engineer and a Grandmaster, Darragh usually runs code from the command line and the Spyder IDE. He mainly leverages AWS and prototypes on his MacBook Pro, which he believes is enough to check that a pipeline works well before deploying. As for frameworks, Darragh prefers PyTorch for the freedom to experiment that it offers compared with other frameworks.