XGBoost, one of the best-performing solutions for making sense of tabular data, has just been upgraded. XGBoost 2.0 brings a host of new features and enhancements aimed at advancing the machine learning landscape.
XGBoost 2.0 introduces a novel feature under development, focusing on vector-leaf tree models for multi-target regression, multi-label classification, and multi-class classification. Unlike the previous approach of building separate models for each target, this feature allows XGBoost to construct a single tree for all targets, offering several advantages, including prevention of overfitting, smaller model sizes, and the ability to consider correlations between targets.
Users can combine vector leaf and scalar leaf trees during training through a callback. It’s important to note that this feature is a work in progress, and some aspects are still under development.
New Device Parameter
A significant change is the introduction of a new 'device' parameter, which replaces existing parameters such as 'gpu_id', 'gpu_hist', 'gpu_predictor', 'cpu_predictor', 'gpu_coord_descent', and the PySpark-specific 'use_gpu'. Users can now set the 'device' parameter to select their preferred device for computation, simplifying the configuration process.
Default Tree Method
Starting from XGBoost 2.0, the ‘hist’ tree method is set as the default. In previous versions, XGBoost would automatically choose between ‘approx’ and ‘exact’ based on input data and the training environment. The new default method aims to improve model training efficiency and consistency.
GPU-Based Approximate Tree Method
XGBoost 2.0 offers initial support for the ‘approx’ tree method on GPU. While performance optimisation is ongoing, the feature is considered feature-complete, except for the JVM packages.
Users can access this capability by specifying 'device="cuda"' and 'tree_method="approx"'. It's important to note that the Scala-based Spark interface is not yet supported.
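Since this combination needs a CUDA-enabled build and an attached GPU, it is shown here only as a configuration fragment (parameter names as documented for 2.0):

```python
# Configuration fragment: GPU-based 'approx' (needs a CUDA build of xgboost).
params = {
    "device": "cuda",         # run on the GPU
    "tree_method": "approx",  # GPU support for approx is new in 2.0
    "objective": "binary:logistic",
}
# booster = xgb.train(params, dtrain)  # dtrain: a DMatrix built elsewhere
```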
Memory Footprint Optimization
This release also introduces a new parameter, ‘max_cached_hist_node,’ allowing users to limit CPU cache size for histograms. This helps prevent aggressive caching of histograms, especially in deep trees. Additionally, memory usage for ‘hist’ and ‘approx’ tree methods on distributed systems is reduced by half.
Improved External Memory Support
External memory support receives a significant boost in XGBoost 2.0. The default ‘hist’ tree method now utilises memory mapping, enhancing performance and reducing CPU memory usage. Users are encouraged to try this feature, particularly when memory savings are required.
Learning-to-Rank Enhancements
XGBoost 2.0 introduces a new implementation for learning-to-rank tasks, offering a range of new features and parameters to improve ranking performance.
Notable additions include parameters for pair construction strategy, control over the number of samples per group, experimental unbiased learning-to-rank support, and custom gain functions with NDCG.
Column-Based Split and Federated Learning
Significant progress has been made in column-based split for federated learning, with support for various tree methods and vertical federated learning. GPU support for this feature is still in development.
PySpark Enhancements
The PySpark interface in XGBoost 2.0 has received numerous new features and optimisations, including GPU-based prediction, data initialisation improvements, support for predicting feature contributions, Python typing support, and improved logs for training.