# Random Forest Vs XGBoost – Comparing Tree-Based Algorithms (With Codes)

In this article, we explore the XGBoost and Random Forest algorithms and compare their implementation and performance.

In machine learning, we mainly deal with two kinds of problems: classification and regression. Many algorithms exist for both tasks, and we need to pick the one that performs well on the data at hand. Tree-based methods, from a single Decision Tree to ensembles such as Random Forest and XGBoost, have shown very good results on classification tasks, often giving high accuracy with short training times. Random Forest and XGBoost in particular are widely used in Kaggle competitions because they achieve high accuracy and are simple to use.

We will see how these algorithms work and then build classification models with them on the Pima Indians Diabetes dataset, where the task is to classify whether a patient is diabetic or not. We will then evaluate both models and compare the results. The dataset can be downloaded from Kaggle.

What will we learn from this article?

• What is the Random Forest algorithm and how does it work?
• What is the XGBoost algorithm and how does it work?
• A comprehensive study of the Random Forest and XGBoost algorithms
• A practical comparison of Random Forest and XGBoost on a classification task

1. What is the Random Forest Algorithm? How does it work?

A forest is said to be robust when it contains many trees. Random Forest is a tree-based ensemble technique: it fits a number of decision trees on different subsamples of the data and then combines their predictions, by majority vote for classification or by averaging for regression, to improve the performance of the model.

Suppose we have to go on a vacation. Before leaving, everyone in the group votes for the place they want to visit, and the place with the most votes wins. We then vote again to choose the hotel, and so on, until we arrive at a final choice. This process of deciding by collecting many votes is how a Random Forest works, with each decision tree acting as one voter. The algorithm is widely preferred because it gives high accuracy and reduces overfitting by combining many trees.

The algorithm has several hyperparameters, such as the number of trees (`n_estimators`), the depth of the trees (`max_depth`), and the number of parallel jobs (`n_jobs`). See the scikit-learn documentation for the full list.

`from sklearn.ensemble import RandomForestClassifier`

`rfcl = RandomForestClassifier()`
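To make the voting idea concrete, here is a minimal sketch of a hand-built forest. It is not how `RandomForestClassifier` is implemented internally (scikit-learn averages the trees' predicted class probabilities rather than counting hard votes); it only illustrates bootstrap sampling plus majority voting. The synthetic data, the number of trees, and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data stands in for a real dataset (assumption).
X, y = make_classification(n_samples=600, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a 0/1 majority vote can never tie
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw training rows with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes every split consider a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(0, 1_000_000)))
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Each tree votes; the majority class wins (labels are 0/1 here).
votes = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)

single_tree_acc = (trees[0].predict(X_test) == y_test).mean()
forest_acc = (forest_pred == y_test).mean()
print("single tree accuracy:", single_tree_acc)
print("voted forest accuracy:", forest_acc)
```

On most datasets the voted prediction is more stable than that of any single tree, which is the effect the ensemble is designed to produce.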

2. What is the XGBoost Algorithm? How does it work?

XGBoost stands for Extreme Gradient Boosting. It is again an ensemble method, but one that works by boosting trees: the trees are built one after another, and each new tree tries to correct the mistakes made by the trees before it. Each tree is fitted to the gradient of the loss function with respect to the current predictions, which is a form of gradient descent and the reason the technique is called gradient boosting. In this way the earlier errors are corrected step by step and performance improves.

This continues until the chosen number of boosting rounds is reached or there is no scope for further improvement. Built-in regularization is a prominent feature of this algorithm. It is fast to execute and gives good accuracy. It is commonly used in Kaggle competitions because of its ability to handle missing values and to limit overfitting. There are again many hyperparameters, such as the booster, the learning rate, and the objective. Check the XGBoost documentation to learn more about the algorithm and its hyperparameters.

`from xgboost import XGBClassifier`

`xgbcl = XGBClassifier()`
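The "correct the previous mistakes" loop can be sketched in a few lines. The example below is plain gradient boosting with squared-error loss on a toy regression problem, not XGBoost itself: XGBoost additionally uses second-order gradient information and explicit regularization terms, which are omitted here. The data, the learning rate, the tree depth, and the number of rounds are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (assumption): a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction; the mean minimises squared error.
prediction = np.full_like(y, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    # For squared-error loss the negative gradient is proportional to the residual.
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)
    # Take a small step in the direction the new tree suggests.
    prediction += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print("training MSE before boosting:", mse_history[0])
print("training MSE after boosting: ", mse_history[-1])
```

Each round fits a small tree to what the current model still gets wrong, so the training error shrinks round by round. The learning rate controls how much of each correction is applied.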

3. How to Build a Classification Model using Random Forest and XGBoost?

First, we will import all the required libraries and load the dataset. Use the code below.

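```
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load the Pima Indians Diabetes data.
# NOTE: the file name is an assumption; point it at your local copy.
# The code that follows expects the target column to be named 'class'.
data = pd.read_csv("pima-indians-diabetes.csv")
```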

Next, we will look at the contents of the data and its shape. Refer to the code below.

`print(data)`

`print(data.shape)`

Now we will define the independent features X and the dependent variable y. We will then divide the dataset into training and testing sets. Use the code below.

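```
# Features: every column except the target
X = data.drop('class', axis=1)
# Target column
y = data['class']
# Hold out 33% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
print(X_train.shape)
print(X_test.shape)
```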

There are 514 rows in the training set and 254 rows in the testing set. Now we will fit both the Random Forest and the XGBoost models on the training data using default parameters. Then we will compute predictions on the testing data with both models.

`rfcl.fit(X_train,y_train)`

`xgbcl.fit(X_train,y_train)`

`y_rfcl = rfcl.predict(X_test)`

`y_xgbcl = xgbcl.predict(X_test)`

We have stored the predictions on the testing data for both models in `y_rfcl` and `y_xgbcl`. Now we will evaluate the models to check how well they generalize. We will use the accuracy score and the classification report from sklearn as evaluation metrics.

`print("Random Forest Accuracy: ", accuracy_score(y_test, y_rfcl))`

`print("XGBoost Accuracy: ", accuracy_score(y_test, y_xgbcl))`

Output:

Random Forest Accuracy: 0.79

XGBoost Accuracy: 0.80
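A gap of one percentage point on a single random split of this size can easily come from which rows happened to land in the test set. One hedged way to get a steadier comparison is k-fold cross-validation, sketched below. The data here is synthetic (768 rows and 8 features, roughly the size of the Pima dataset), so the printed scores say nothing about the real data. `GradientBoostingClassifier` is used only as a stand-in so the sketch runs without the xgboost package; since `XGBClassifier` follows the scikit-learn estimator interface, it can be placed in the `models` dictionary instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption); replace with the real X and y.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5, random_state=42)

# Stratified folds keep the class ratio the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    # One accuracy value per fold; the spread shows how noisy a single split is.
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the difference between the two means is smaller than the standard deviations, the data does not clearly favour either model.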

`print("Random Forest: \n", classification_report(y_test, y_rfcl))`

`print("\nXGBoost: \n", classification_report(y_test, y_xgbcl))`

Also, see the “Practical Guide To Model Evaluation and Error Metrics” to learn more about validating the performance of a machine learning model.
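The per-class numbers in a classification report can be reproduced by hand, which also shows why argument order matters: `classification_report` and `accuracy_score` take the true labels first and the predictions second. Accuracy is unaffected by swapping them, but precision and recall trade places. The helper `binary_report` and the toy labels below are hypothetical, written only for illustration.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels, made up for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

correct_order = binary_report(y_true, y_pred)
swapped_order = binary_report(y_pred, y_true)
print("true first:   ", correct_order)
print("swapped order:", swapped_order)
```

With the arguments in the right order, precision is 2/3 and recall is 1/2; with the arguments swapped, the two values are exchanged while F1 stays the same.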

Conclusion

In this article, we discussed how the Random Forest and XGBoost algorithms work. We also implemented classification models for the Pima Indians Diabetes dataset using both algorithms. Without normalizing the data, which tree-based models generally do not require, we fed it directly to the models and still reached about 80% accuracy (0.79 for Random Forest and 0.80 for XGBoost). With more work on the data and on feature engineering, this accuracy can be improved further. The hyperparameters can also be tuned using different methods.

Both algorithms are easy to implement and help prevent the model from overfitting. XGBoost also handles missing values in the dataset natively; for Random Forest, missing-value support depends on the implementation and library version.
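As one example of the hyperparameter tuning mentioned in the conclusion, the sketch below runs a small grid search over two Random Forest settings. The grid values, the synthetic data, and the choice of 3-fold cross-validation are assumptions kept deliberately small so that the example runs quickly. A real search would use the actual dataset and usually a wider grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption); replace with the real X_train and y_train.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5, random_state=0)

# Candidate values to try for each hyperparameter (illustrative only).
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation for every combination
    scoring="accuracy",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a model refitted on all the supplied data with the best combination found, and it can be used for prediction like any other classifier.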
