In machine learning, we mainly deal with two kinds of problems: classification and regression. Several algorithms are available for both tasks, and we need to pick the one that performs well on the data at hand. Tree-based ensemble methods such as Random Forest and XGBoost, both of which are built from decision trees, have shown very good results in classification, often giving high accuracy with short training times. Both algorithms are simple to use and are widely used in Kaggle competitions to achieve high accuracy.
Through this article, we will explore both the XGBoost and Random Forest algorithms and compare their implementation and performance. We will see how these algorithms work, and then we will build classification models with them on the Pima Indians Diabetes data, where we will classify whether a patient is diabetic or not. We will then evaluate both models and compare the results. The dataset can be downloaded from Kaggle.
What will we learn from this article?
- What is the Random Forest Algorithm and how does it work?
- What is the XGBoost Algorithm and how does it work?
- A comprehensive study of Random Forest and XGBoost Algorithms
- Practically comparing Random Forest and XGBoost Algorithms in classification
- What is the Random Forest Algorithm? How does it work?
Random Forest is a tree-based ensemble technique. It fits a number of decision trees on different subsamples of the data and then combines their predictions (a majority vote for classification, an average for regression) to improve the performance of the model. The forest generally becomes more robust as the number of trees grows. Suppose we are planning a vacation. Instead of relying on one person's opinion, we ask several people to vote on the destination, then on the hotel, and we go with the choices that receive the most votes. Random Forest works in a similar way: many trees each give an answer, and the final prediction is the one most of them agree on. The algorithm is popular because it usually gives high accuracy and, by combining many trees, reduces the overfitting that a single decision tree is prone to. It has several hyperparameters, such as the number of trees, the depth of the trees, and the number of parallel jobs. Check the scikit-learn documentation for details.
from sklearn.ensemble import RandomForestClassifier
rfcl = RandomForestClassifier()
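To make the "many trees, one vote" idea concrete, here is a minimal sketch that builds a small forest by hand: each tree is trained on a bootstrap sample and the majority class wins. The synthetic data, the tree count, and the variable names are assumptions made for illustration; this is not how `RandomForestClassifier` is implemented internally (scikit-learn averages the trees' predicted probabilities rather than counting hard votes).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Pima dataset): 768 rows, 8 features.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 25  # illustrative value; odd, so a binary vote cannot tie
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw training rows with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" lets each split consider a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Every tree votes on every test row; labels here are 0 and 1.
votes = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)

print("Single tree accuracy:", (trees[0].predict(X_test) == y_test).mean())
print("Voted forest accuracy:", (forest_pred == y_test).mean())
```

In practice the library class does all of this in one call, and the hyperparameters mentioned above map to its `n_estimators`, `max_depth`, and `n_jobs` arguments.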
- What is the XGBoost Algorithm? How does it work?
XGBoost stands for Extreme Gradient Boosting, which is again an ensemble method that works by boosting trees. Trees are added one at a time, and each new tree is fit to the gradient of the loss function with respect to the current predictions, which is why the method is called gradient boosting. The whole idea is to correct the mistakes made by the previous trees: each step learns from the remaining errors and improves the performance of the model.
This continues until a set number of trees has been added or there is no scope for further improvement. Regularization is a prominent feature of this algorithm and helps to control overfitting. It is fast to execute and gives good accuracy. The algorithm is commonly used in Kaggle competitions because of its ability to handle missing values and limit overfitting. It again has many hyperparameters, such as the booster, the learning rate, and the objective. Check the documentation to know more about the algorithm and its hyperparameters.
from xgboost import XGBClassifier
xgbcl = XGBClassifier()
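The "correct the previous mistakes" loop can be sketched in a few lines. The example below is a simplified gradient boosting routine for squared-error regression, written with scikit-learn regression trees on made-up data. It is only an illustration of the boosting idea, not XGBoost's actual algorithm, which additionally uses second-order gradient information and regularization terms. The learning rate, tree depth, and number of rounds are arbitrary illustrative values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up regression data: a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1  # how much of each new tree's correction is applied
n_rounds = 50        # number of boosting rounds (trees)

# Start from a constant prediction: the mean of the targets.
pred = np.full_like(y, y.mean())
errors = [np.mean((y - pred) ** 2)]

for _ in range(n_rounds):
    # For squared error, the negative gradient is simply the residual.
    residual = y - pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)
    # Move the predictions a small step toward the residuals.
    pred += learning_rate * tree.predict(X)
    errors.append(np.mean((y - pred) ** 2))

print(f"Training MSE before boosting: {errors[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {errors[-1]:.4f}")
```

The training error shrinks round by round because every new tree is trained on what the current ensemble still gets wrong. A smaller learning rate makes each correction more cautious and usually calls for more rounds.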
- How to Build a Classification Model using Random Forest and XGBoost?
First, we will import all the required libraries and load the data set. Use the code below for the same.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
data = pd.read_csv('/content/pima-indians-diabetes-1.csv')
Next, we will check the contents of the data and its shape. Refer to the code below for the same.
print(data)
print(data.shape)
Now we will define the independent features X and the dependent variable y. We will then divide the dataset into training and testing sets. Use the code below for the same.
X = data.drop('class', axis=1)
y = data['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
print(X_train.shape)
print(X_test.shape)
Output:
There are 514 rows in the training set and 254 rows in the testing set. Now we will fit both the Random Forest and XGBoost models on the training data using default parameters. Then we will compute predictions on the testing data with both models.
rfcl.fit(X_train,y_train)
xgbcl.fit(X_train,y_train)
y_rfcl = rfcl.predict(X_test)
y_xgbcl = xgbcl.predict(X_test)
We have stored the predictions on the testing data for both models in y_rfcl and y_xgbcl. Now we will evaluate model performance to check how well the models generalize. We will make use of evaluation metrics like the accuracy score and the classification report from sklearn.
print("Random Forest Accuracy: ", accuracy_score(y_test, y_rfcl))
print("XGBoost Accuracy: ", accuracy_score(y_test, y_xgbcl))
Output:
Random Forest Accuracy: 0.79
XGBoost Accuracy: 0.80
print("Random Forest: \n", classification_report(y_test, y_rfcl))
print("\nXGBoost: \n", classification_report(y_test, y_xgbcl))
Also, check this “Practical Guide To Model Evaluation and Error Metrics” to know more about validating the performance of a machine learning model.
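Accuracy measured on a single random split can shift by a few points from run to run, so a one-point gap between two models may not be meaningful. One common way to get a steadier estimate is k-fold cross-validation. The sketch below assumes synthetic stand-in data rather than the Pima CSV, and `cv_accuracy` is a hypothetical helper name introduced only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (NOT the Pima dataset): 768 rows, 8 features.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)


def cv_accuracy(model, X, y, n_splits=5):
    """Return the mean and standard deviation of accuracy over stratified folds."""
    # Stratified folds keep the class ratio similar in every fold.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()


mean_acc, std_acc = cv_accuracy(RandomForestClassifier(random_state=0), X, y)
print(f"Random Forest CV accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```

Because `XGBClassifier` follows the scikit-learn estimator interface, it can be passed to the same helper in place of the Random Forest model, which allows the two to be compared on identical folds.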
Conclusion
Through this article, we discussed how the Random Forest and XGBoost algorithms work. We also implemented a classification model for the Pima Indians Diabetes data set using both algorithms. We did not normalize the data and fed it directly to the models, yet we were still able to get about 80% accuracy, which is not surprising because tree-based models are largely insensitive to feature scaling. If we work more on the data and feature engineering, this accuracy can be improved further. Hyperparameters can also be tuned using different methods.
Both algorithms are easy to implement and include mechanisms that help prevent the model from overfitting. XGBoost also handles missing values in the dataset natively, while for Random Forest this depends on the implementation and version being used.
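As one possible approach to the hyperparameter tuning mentioned above, the sketch below runs a small grid search over two Random Forest settings. The data is a synthetic stand-in, and the grid values are arbitrary choices for illustration rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (NOT the Pima dataset): 768 rows, 8 features.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)

# Candidate values to try; every combination is scored with 3-fold CV.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

Grid search is exhaustive, so its cost grows quickly with the number of values per parameter. For larger search spaces, `RandomizedSearchCV` from the same module samples a fixed number of combinations instead.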