When asked about his approach to data science problems, Sergey Yurgenson, Director of Data Science at DataRobot, said he would begin by creating a benchmark model using Random Forests or XGBoost with minimal feature engineering. A neurobiologist (Harvard) by training, Sergey and his peers on Kaggle have used XGBoost (extreme gradient boosting), a gradient boosting framework available as an open-source library, in their winning solutions. The supremacy of XGBoost is not restricted to popular competition platforms; it has become the go-to solution for tabular data. For classification and regression problems on tabular data, tree ensemble models like XGBoost are usually the recommended choice.
Today, XGBoost has grown into production-quality software that can process huge volumes of data on a cluster. In the last few years, XGBoost has added multiple major features, such as support for NVIDIA GPUs as hardware accelerators and for distributed computing platforms including Apache Spark and Dask.
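As a rough illustration of the GPU support mentioned above, the sketch below shows a hypothetical XGBoost parameter dictionary; the `tree_method` and `device` keys are from the XGBoost documentation (the `device` parameter is available in XGBoost 2.0 and later), while the objective and round count are placeholder choices, and the actual training call is left commented out.

```python
# Hypothetical parameter sketch for GPU-accelerated XGBoost training.
# "tree_method" and "device" are documented XGBoost parameters; the rest
# of the setup (data, objective, rounds) is an illustrative assumption.
params = {
    "tree_method": "hist",           # histogram-based split finding
    "device": "cuda",                # run training on an NVIDIA GPU
    "objective": "binary:logistic",  # placeholder: binary classification
}

# With an xgboost.DMatrix `dtrain` prepared, training would look like:
# booster = xgboost.train(params, dtrain, num_boost_round=200)
print(params)
```

For Dask or Spark clusters, the same parameter dictionary is passed to the respective distributed training entry points instead.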
However, there have been several recent claims that deep learning models outperform XGBoost on tabular data. To verify this claim, a team at Intel published a study on how well deep learning works for tabular data and whether XGBoost's superiority is justified.
The authors explored whether DL models should be a recommended option for tabular data by rigorously comparing recent deep learning models to XGBoost on a variety of datasets. The study showed XGBoost outperformed DL models across a wide range of datasets, and the former required less tuning. However, the paper also suggested that an ensemble of the deep models and XGBoost performs better on these datasets than XGBoost alone. For the experiments, the authors examined DL models such as TabNet, NODE, DNF-Net and 1D-CNN, along with an ensemble of five different classifiers: TabNet, NODE, DNF-Net, 1D-CNN, and XGBoost. The ensemble is constructed using a weighted average of the single trained models' predictions. The models were compared on the following attributes:
- Efficient inference
- Speed of hyperparameter tuning (the shorter the optimization time, the better).
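The weighted-average ensemble described above can be sketched in a few lines of plain Python. The predictions and weights below are made-up placeholders; in the paper the ensemble members are TabNet, NODE, DNF-Net, 1D-CNN and XGBoost, and the exact weighting scheme is the authors' own.

```python
# Minimal sketch of a weighted-average ensemble over per-model predictions.
# Predictions and weights are illustrative placeholders, not paper values.

def ensemble_predict(predictions, weights):
    """Combine per-model probability predictions with a weighted average."""
    assert len(predictions) == len(weights)
    total = sum(weights)          # normalize so weights sum to 1
    n = len(predictions[0])
    combined = [0.0] * n
    for preds, w in zip(predictions, weights):
        for i, p in enumerate(preds):
            combined[i] += (w / total) * p
    return combined

# Toy example: three models, two samples each.
preds = [[0.9, 0.2], [0.8, 0.4], [0.7, 0.3]]
weights = [0.5, 0.3, 0.2]
print(ensemble_predict(preds, weights))  # → [0.83, 0.28]
```

The weighted mean keeps each combined score inside the range spanned by the individual models' predictions, which is why such ensembles tend to be more stable than any single member.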
Datasets used: Forest Cover Type, Higgs Boson, Year Prediction, Rossmann Store Sales, Gas Concentrations, Eye Movements, Gesture Phase, MSLR, Epsilon, Shrutime and Blastchar.
To their surprise, the authors found the DL models were outperformed by XGBoost when evaluated on datasets other than those used in their original papers. Compared to XGBoost and the full ensemble, the single DL models are more dependent on specific datasets. The authors attributed the drop in performance to selection bias and differences in hyperparameter optimization. Now, the obvious next step would be to check the ensemble models. But which combination? A combination of XGBoost and DL models, or an ensemble of non-DL models? The authors suggest picking a subset of models for the ensemble based on the following factors:
- The validation loss (the lower the loss, the better)
- Highest-confidence models (by some uncertainty measure), and
- Random order.
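The first selection rule above, ranking candidates by validation loss and keeping the best few, can be sketched as follows. The loss values are illustrative numbers, not results from the paper.

```python
# Hedged sketch of subset selection by validation loss: keep the k models
# with the lowest loss. The loss values below are made up for illustration.

def select_by_val_loss(val_losses, k):
    """Return the names of the k models with the lowest validation loss."""
    ranked = sorted(val_losses.items(), key=lambda kv: kv[1])
    return [name for name, _ in ranked[:k]]

val_losses = {"TabNet": 0.41, "NODE": 0.38, "DNF-Net": 0.44,
              "1D-CNN": 0.40, "XGBoost": 0.35}
print(select_by_val_loss(val_losses, 3))  # → ['XGBoost', 'NODE', '1D-CNN']
```

The other two rules would simply swap the sort key for an uncertainty estimate, or shuffle the candidates randomly.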
XGBoost vs. other ML algorithms (Source: Vishal Morde)
One competing theory for XGBoost's success is that tree-based methods like XGBoost are sample-efficient at learning decision rules from informative, feature-engineered data. The library is considered extremely fast, stable, quick to tune and robust to randomness, qualities well suited to tabular data. The preference for XGBoost over deep learning can be further understood through the lens of manifold learning.
Meanwhile, Dmitry Efimov, who heads the ML centre of excellence at American Express, said the Intel researchers missed out on the preprocessing aspect of neural networks. “From the problems we have solved recently, it’s pretty clear that if you just apply simple normalization to the tabular data and train any neural network, the decision trees would outperform. But if you apply more effort to preprocess data and reduce noisy information from the data, neural networks will outperform. The main question is how much effort you want to apply,” he explained. Addressing Efimov’s argument on “the right kind of preprocessing”, Bojan Tunguz, a Kaggle GM and a well-known face in the ML community, said that a ‘highly competent’ data scientist can massage data and take advantage of any algorithm’s unique characteristics. “Heck, I can do it in such a way to get a logistic regression to outperform XGBoost!” said Tunguz.
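The “simple normalization” Efimov refers to is typically per-feature standardization: fit a mean and standard deviation per column on the training data, then apply them to every split. A minimal pure-Python sketch (toy data, not from the study):

```python
# Per-feature standardization of tabular data: fit statistics on the
# training rows, then scale every column to zero mean and unit variance.
# The toy dataset below is an illustrative assumption.

def fit_standardizer(rows):
    """Compute per-column mean and standard deviation from training rows."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    stds = []
    for j, m in enumerate(means):
        var = sum((r[j] - m) ** 2 for r in rows) / n
        stds.append(var ** 0.5 or 1.0)  # guard against a zero-variance column
    return means, stds

def transform(rows, means, stds):
    """Apply the fitted statistics to any split (train, validation, test)."""
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]

train = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
means, stds = fit_standardizer(train)
print(transform(train, means, stds))  # each column now has mean 0, unit variance
```

Efimov's point is that this is the floor, not the ceiling: heavier preprocessing, such as denoising or richer feature transforms, is where neural networks start to catch up.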
The debate around deep learning and its relatively simpler alternatives like XGBoost is nothing new. Though this paper explores something already widely known (or assumed), the authors do cut DL models some slack. The Intel researchers admit that results can vary with the hyperparameter optimization process, and that XGBoost's good results may stem from its already-robust initial hyperparameters, or from inherent characteristics that make the model easier to optimize. The Intel team believes that combining neural networks and XGBoost can outperform other models. Even Tunguz, in his LinkedIn post, said “a weighted blend of XGBoost and neural networks is usually the way to go for the majority of problems.”
While significant progress has been made using DL models for tabular data, the authors concluded that they still do not outperform XGBoost, and that further research is warranted. Their key findings:
- In many cases, the DL models perform worse on unseen datasets.
- The XGBoost model generally outperformed the deep models.
- No DL model consistently outperformed the others.
- The ensemble of deep models and XGBoost outperforms the other models in most cases.