How to use XGBoost algorithm in R in easy steps

Tavish 26 Jun, 2020 • 8 min read

Overview

Learn how to use xgboost, a powerful machine learning algorithm in R
Check out the applications of xgboost in R by using a data set and building a machine learning model with this algorithm

Introduction

Did you know using XGBoost algorithm is one of the popular winning recipe of data science competitions ?

So, what makes it more powerful than a traditional Random Forest or Neural Network? In broad terms, it’s the efficiency, accuracy and feasibility of this algorithm. (I’ve discussed this part in detail below).

In the last few years, predictive modeling has become much faster and accurate. I remember spending long hours on feature engineering for improving model by few decimals. A lot of that difficult work, can now be done by using better algorithms.

Technically, “XGBoost” is a short form for Extreme Gradient Boosting. It gained popularity in data science after the famous Kaggle competition called Otto Classification challenge. The latest implementation on “xgboost” on R was launched in August 2015. We will refer to this version (0.4-2) in this post.

In this article, I’ve explained a simple approach to use xgboost in R. So, next time when you build a model, do consider this algorithm. I’m sure it would be a moment of shock and then happiness!

How to use XGBoost algorithm in R in easy steps

What is XGBoost?

Extreme Gradient Boosting (xgboost) is similar to gradient boosting framework but more efficient. It has both linear model solver and tree learning algorithms. So, what makes it fast is its capacity to do parallel computation on a single machine.

This makes xgboost at least 10 times faster than existing gradient boosting implementations. It supports various objective functions, including regression, classification and ranking.

Since it is very high in predictive power but relatively slow with implementation, “xgboost” becomes an ideal fit for many competitions. It also has additional features for doing cross validation and finding important variables. There are many parameters which needs to be controlled to optimize the model. We will discuss about these factors in the next section.

Preparation of Data for using XGBoost

XGBoost only works with numeric vectors. Yes! you need to work on data types here.

Therefore, you need to convert all other forms of data into numeric vectors. A simple method to convert categorical variable into numeric vector is One Hot Encoding. This term emanates from digital circuit language, where it means an array of binary signals and only legal values are 0s and 1s.

In R, one hot encoding is quite easy. This step (shown below) will essentially make a sparse matrix using flags on every possible value of that variable. Sparse Matrix is a matrix where most of the values of zeros. Conversely, a dense matrix is a matrix where most of the values are non-zeros.

Let’s assume, you have a dataset named ‘campaign’ and want to convert all categorical variables into such flags except the response variable. Here is how you do it :

sparse_matrix <- sparse.model.matrix(response ~ .-1, data = campaign)

Now let’s break down this code as follows:

“sparse.model.matrix” is the command and all other inputs inside parentheses are parameters.
The parameter “response” says that this statement should ignore “response” variable.
“-1” removes an extra column which this command creates as the first column.
And finally you specify the dataset name.

To convert the target variables as well, you can use following code:

output_vector = df[,response] == "Responder"

Here is what the code does:

set output_vector to 0
set output_vector to 1 for rows where response is "Responder" is TRUE ;
return output_vector.

Building Model using Xgboost on R

Here are simple steps you can use to crack any data problem using xgboost:

Step 1: Load all the libraries

library(xgboost)
library(readr)
library(stringr)
library(caret)
library(car)

Step 2 : Load the dataset

(Here I use a bank data where we need to find whether a customer is eligible for loan or not).

set.seed(100)
setwd("C:\\Users\\ts93856\\Desktop\\datasource")
# load data
df_train = read_csv("train_users_2.csv")
df_test = read_csv("test_users.csv")

# Loading labels of train data

labels = df_train['labels']
df_train = df_train[-grep('labels', colnames(df_train))]

# combine train and test data
df_all = rbind(df_train,df_test)

Step 3: Data Cleaning & Feature Engineering

# clean Variables :  here I clean people with age less than 14 or more than 100

df_all[df_all$age < 14 | df_all$age > 100,'age'] <- -1
df_all$age[df_all$age < 0] <- mean(df_all$age[df_all$age > 0])

# one-hot-encoding categorical features
ohe_feats = c('gender', 'education', 'employer')

dummies <- dummyVars(~ gender +  education + employer, data = df_all)
df_all_ohe <- as.data.frame(predict(dummies, newdata = df_all))
df_all_combined <- cbind(df_all[,-c(which(colnames(df_all) %in% ohe_feats))],df_all_ohe)df_all_combined$agena <- as.factor(ifelse(df_all_combined$age < 0,1,0))

I am using a list of variables in “feature_selected” to be used by the model. I have shared a quick and smart way to choose variables later in this article.

df_all_combined <- df_all_combined[,c('id',features_selected)] 
# split train and test
X = df_all_combined[df_all_combined$id %in% df_train$id,]
y <- recode(labels$labels,"'True'=1; 'False'=0)
X_test = df_all_combined[df_all_combined$id %in% df_test$id,]

Step 4: Tune and Run the model

xgb <- xgboost(data = data.matrix(X[,-1]), 
 label = y, 
 eta = 0.1,
 max_depth = 15, 
 nround=25, 
 subsample = 0.5,
 colsample_bytree = 0.5,
 seed = 1,
 eval_metric = "merror",
 objective = "multi:softprob",
 num_class = 12,
 nthread = 3
)

Step 5: Score the Test Population

And that’s it! You now have an object “xgb” which is an xgboost model. Here is how you score a test population :

# predict values in test set
y_pred <- predict(xgb, data.matrix(X_test[,-1]))

Parameters used in Xgboost

I understand, by now, you would be highly curious to know about various parameters used in xgboost model. So, there are three types of parameters: General Parameters, Booster Parameters and Task Parameters.

General parameters refers to which booster we are using to do boosting. The commonly used are tree or linear model
Booster parameters depends on which booster you have chosen
Learning Task parameters that decides on the learning scenario, for example, regression tasks may use different parameters with ranking tasks.

Let’s understand these parameters in detail. I require you to pay attention here. This is the most critical aspect of implementing xgboost algorithm:

General Parameters

silent : The default value is 0. You need to specify 0 for printing running messages, 1 for silent mode.
booster : The default value is gbtree. You need to specify the booster to use: gbtree (tree based) or gblinear (linear function).
num_pbuffer : This is set automatically by xgboost, no need to be set by user. Read documentation of xgboost for more details.
num_feature : This is set automatically by xgboost, no need to be set by user.

Booster Parameters

The tree specific parameters –

eta : The default value is set to 0.3. You need to specify step size shrinkage used in update to prevents overfitting. After each boosting step, we can directly get the weights of new features. and eta actually shrinks the feature weights to make the boosting process more conservative. The range is 0 to 1. Low eta value means model is more robust to overfitting.
gamma : The default value is set to 0. You need to specify minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be. The range is 0 to ∞. Larger the gamma more conservative the algorithm is.
max_depth : The default value is set to 6. You need to specify the maximum depth of a tree. The range is 1 to ∞.
min_child_weight : The default value is set to 1. You need to specify the minimum sum of instance weight(hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. The range is 0 to ∞.
max_delta_step : The default value is set to 0. Maximum delta step we allow each tree’s weight estimation to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help making the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when class is extremely imbalanced. Set it to value of 1-10 might help control the update.The range is 0 to ∞.
subsample : The default value is set to 1. You need to specify the subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collected half of the data instances to grow trees and this will prevent overfitting. The range is 0 to 1.
colsample_bytree : The default value is set to 1. You need to specify the subsample ratio of columns when constructing each tree. The range is 0 to 1.

Linear Booster Specific Parameters

lambda and alpha : These are regularization term on weights. Lambda default value assumed is 1 and alpha is 0.
lambda_bias : L2 regularization term on bias and has a default value of 0.

Learning Task Parameters

base_score : The default value is set to 0.5 . You need to specify the initial prediction score of all instances, global bias.
objective : The default value is set to reg:linear . You need to specify the type of learner you want which includes linear regression, logistic regression, poisson regression etc.
eval_metric : You need to specify the evaluation metrics for validation data, a default metric will be assigned according to objective( rmse for regression, and error for classification, mean average precision for ranking
seed : As always here you specify the seed to reproduce the same set of outputs.

Advanced functionality of xgboost

Compared to other machine learning techniques, I find implementation of xgboost really simple. If you did all we have done till now, you already have a model.

Let’s take it one step further and try to find the variable importance in the model and subset our variable list.

# Lets start with finding what the actual tree looks like

model <- xgb.dump(xgb, with.stats = T)
model[1:10] #This statement prints top 10 nodes of the model

# Get the feature real names
names <- dimnames(data.matrix(X[,-1]))[[2]]

# Compute feature importance matrix
importance_matrix <- xgb.importance(names, model = xgb)
# Nice graph
xgb.plot.importance(importance_matrix[1:10,])

#In case last step does not work for you because of a version issue, you can try following :
barplot(importance_matrix[,1])

As you can observe, many variables are just not worth using into our model. You can conveniently remove these variables and run the model again. This time you can expect a better accuracy.

Testing whether the results make sense

Let’s assume, Age was the variable which came out to be most important from the above analysis. Here is a simple chi-square test which you can do to see whether the variable is actually important or not.

test <- chisq.test(train$Age, output_vector)
print(test)

We can do the same process for all important variables. This will bring out the fact whether the model has accurately identified all possible important variables or not.

End Notes

With this article, you can definitely build a simple xgboost model. You will be amazed to see the speed of this algorithm against comparable models. In this post, I discussed various aspects of using xgboost algorithm in R. Most importantly, you must convert your data type to numeric, otherwise this algorithm won’t work.

Also, I would suggest you to pay attention to these parameters as they can make or break any model. If you still find these parameters difficult to understand, feel free to ask me in the comments section below.

Did you find the article useful? Have you used this technique before? How did the model perform? Do you use some better (easier/faster) techniques for performing the tasks discussed above? Can you replicate the codes in Python? We’ll be glad if you share your thoughts as comments below.

If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on twitter or like our facebook page.

Tavish 26 Jun 2020

Algorithm Classification Data Science Intermediate Machine Learning

Responses From Readers

Aditya 22 Jan, 2016

How to find best parameter values for the model?

Show 1 reply

Tavish 27 Jan, 2016

Aditya, Its an iterative process. You generally start with the default value and then move towards either extremes depending on the CV gain. Tavish

kumar 22 Jan, 2016

Below code is giving an error : labels = df_train['labels']

Show 1 reply

Mikhail 22 Jan, 2016

I think in the dataset "label" is "Loan_Status" and this code is right labels = df_train['Loan_Status'] df_train = df_train[-grep('Loan_Status', colnames(df_train))]

HighSpirits 22 Jan, 2016

Very helpful article Srivastava. I heard about XGBOOST but did not implement it. Will definitely try this in the next competition, using this article.

Neil 22 Jan, 2016

hi Tavish, Thanks for taking the time to put together this elaborate explanation.. I'm trying to follow along using the code, and seem to have come unstuck at Step 2. This line of code throws an 'undefined columns selected' error: labels = df_train['labels'] What am I missing?

Show 1 reply

Tavish 27 Jan, 2016

I have used a loans data which is not publicly available and not the loan challenge data on AV. The intention of the article was to understand the underlying process of XGboost. Hope the article helped you.

Mikhail 22 Jan, 2016

Thx for material, Tavish Srivastava. In your code you use variable "Age", but there is not this variable in the dataset. How you get this feature?

Show 1 reply

Tavish Srivastava 27 Jan, 2016

Glenn Schultz 22 Jan, 2016

Nice article, I am going to try this algorithm on mortgage prepayment and default data

Ramesh 22 Jan, 2016

Hi, Thanks for posting wonderful article XGboost. Below code is not merging train and test dataset excluding Loan_Status from Train dataset. labels = df_train['labels'] df_train = df_train[-grep('labels', colnames(df_train))] # combine train and test data df_all = rbind(df_train,df_test) I think simple way to do it is # Exclude column 13 df_train_sub = subset(df_train, select=c(1:12)) Merge train and Test dataset. df_all = rbind(df_train_sub,df_test) Let me know if i am missing something here.

Show 1 reply

Tavish Srivastava 27 Jan, 2016

Dummy 22 Jan, 2016

1. You should load 'Matrix" package to run the function sparse.model.matrix() 2. There is no “label” or "Age" or "Employer" in the download data set. 3. For "categorical features" in the data set, there are "Gender", "Married", "Education", "Self_Employed", "Property_Area"

Shashi 25 Jan, 2016

I guess Tavish idea with this was to theoretically demonstrate the use of xgboost. The code as presented here have lots of errors with respect to variable names and I do not think you can run these codes as is.

r_achar 27 Jan, 2016

Hi folks, If anyone is looking for a working example of xgboost, here is a simple example in R. Although xgboost is an overkill for this problem, it demonstrates how to run a multi-class classification using xgboost. . https://github.com/rachar1/DataAnalysis/blob/master/xgboost_Classification.R Hope this helps.

Nagesh 01 Feb, 2016

Great article, it would be much helpful if you can get in to details of xgb.importance(), like what can we understand from the Gain, Cover and Frequence columns of the output. Thanks :)

Amar Jaiswal 02 Feb, 2016

The feature importance part was unknown to me, so thanks a ton Tavish. Looking forward to applying it into my models. Also, i guess there is an updated version to xgboost i.e.,"xgb.train" and here we can simultaneously view the scores for train and the validation dataset. that we pass into the algorithm as xgb.DMatrix. Also xgb.cv gives us a very good idea to select parameters for xgb.train as here we can specify nfolds for the number of cross validations. Would love to get your views on these too !!!

Surya Prakash 06 Mar, 2016

I am getting error while converting datatypes of Loan Prediction to Numeric > names(n) [1] "Gender" "Married" "Dependents" "Education" [5] "Self_Employed" "ApplicantIncome" "CoapplicantIncome" "LoanAmount" [9] "Loan_Amount_Term" "Credit_History" "Property_Area" "Loan_Status" >sparse_matrix <- sparse.model.matrix(response ~ .,data = n) Error in model.frame.default(object, data, xlev = xlev) : variable lengths differ (found for 'Gender') I am unable to figure out the issue. Kindly suggest.

Bellur Srikar 09 May, 2016

Hi Tavish, Great article. Thanks. Can you let me know how to access the data set you used so that i can follow your step and get a bettee understanding? Thansk Srikar

Beatriz Penalver Bernabe 08 Jun, 2016

Thank you so much for such a great intro to xgboost!

RnMe 23 Jun, 2016

Hi Tavish, Definitely a good article. But it would be great if you give the dataset along with the article and explain the techniques based on that.. Also many of the parameter explanations are not clear. May be it would be because of my lesser experience in this area.

Tanguy 04 Sep, 2016

Hi Tavish, Thanks for the article. I did not understand your paragraph on the Chi2 square test. How does this test allows you to (in)validate a feature ?

Mahesh 06 Sep, 2016

Hi Tanvish, Is it possible to use multiple computer's CPU to process XGBOOST. Thanks,

arun 25 Sep, 2016

Error in Using xgboost--- I have following data set of stock prices of selected shares on nifty. data.frame': 1772 obs. of 291 variables: $ TCS.NS.Open : num [1:1772, 1] 0.977 -1.369 -0.324 -0.524 -1.291 ... $ TCS.NS.High : num [1:1772, 1] 1.024 -1.373 -0.323 -0.523 -1.302 ... $ TCS.NS.Low : num [1:1772, 1] 0.994 -1.372 -0.3 -0.547 -1.29 ... $ TCS.NS.Close : num [1:1772, 1] 0.982 -1.371 -0.313 -0.562 -1.301 ... $ TCS.NS.Volume : num [1:1772, 1] -0.465 0.064 -0.122 0.369 1.03 -0.52 -0.559 -0.613 0.333 -0.815 ... $ TCS.NS.Adjusted : num [1:1772, 1] 0.969 -1.306 -0.154 -1.018 -0.977 ... $ INFY.NS.Open : num [1:1772, 1] 1.501 -1.498 0.128 -0.463 -0.117 ... $ INFY.NS.High : num [1:1772, 1] 1.483 -1.508 0.115 -0.495 -0.104 ... $ INFY.NS.Low : num [1:1772, 1] 1.436 -1.507 0.104 -0.552 -0.107 ... $ INFY.NS.Close : num [1:1772, 1] 1.416 -1.487 0.096 -0.574 -0.09 ... $ INFY.NS.Volume : num [1:1772, 1] 3.856 -0.174 -0.096 0.486 -0.105 ... $ INFY.NS.Adjusted : num [1:1772, 1] 0.487 -1.343 -0.471 -1.056 -0.705 ... $ TECHM.NS.Open : num [1:1772, 1] 1.313 -1.513 -0.754 0.403 -0.235 . When I run following xgboost model, I get error--- bst=xgboost(data=as.matrix(train[,predictorNames]), label=train$outcome, verbose = 0, eta=0.1, gamma=50, missing = NaN, nround=50, colsample_bytree=0.1, subsample=8.6, objective="binary:logistic") Error in xgb.get.DMatrix(data, label, missing) : xgboost: need label when data is a matrix I checked label is provided but error persists.

Juhi Garg 28 Sep, 2016

Hi Tanvish, I am using Decision Forest Regression for my model, but I need a method to select important features out of 100+ features and then train the Decision Forest Regression Model, What's your view on using "XGBOOST" to just do feature selection and then train model using DFR?

Sray Agarwal 16 Nov, 2016

I am using similar parameters for xgboost and xgbtrain, but the output is slightly different. Even the RMSE is bit different. In such case, which one should I use training.matrix = as.matrix(training) dtraining <- xgb.DMatrix(as.matrix(training[,-5]), label = as.matrix(training[,5])) param <- list("objective" = "reg:linear", # multiclass classification "subsample"= subsample, "colsample_bytree" = colsample_bytree, "max_depth" = max_depth, # maximum depth of tree "min_child_weight" = min_child_weight, "max_delta_step" = max_delta_step, "eta" = eta, # step size shrinkage "gamma" = gamma , # minimum loss reduction "nthread" = nthreads#, # number of threads to be used #"eval_metric" = evalerror ) bst <- xgb.train(params = param, data=dtraining, nrounds=nrounds, maximize = FALSE, verbose = 0) bst2<-xgboost(data = training.matrix[,-5], label = training.matrix[,5], verbose = 1, nrounds=nrounds, params = param, maximize = FALSE)

sedighi 05 Dec, 2016

Thanks to dear tavish.

How to use XGBoost algorithm in R in easy steps

Overview

Introduction

What is XGBoost?

Preparation of Data for using XGBoost

Building Model using Xgboost on R

Step 1: Load all the libraries

Step 2 : Load the dataset

Step 3: Data Cleaning & Feature Engineering

Step 4: Tune and Run the model

Step 5: Score the Test Population

Parameters used in Xgboost

General Parameters

Booster Parameters

Linear Booster Specific Parameters

Learning Task Parameters

Advanced functionality of xgboost

Testing whether the results make sense

End Notes

If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on twitter or like our facebook page.

Frequently Asked Questions

Responses From Readers

Write for us

Machine Learning

Take a note

View all notes

How to use XGBoost algorithm in R in easy steps

Overview

Introduction

What is XGBoost?

Preparation of Data for using XGBoost

Building Model using Xgboost on R

Step 1: Load all the libraries

Step 2 : Load the dataset

Step 3: Data Cleaning & Feature Engineering

Step 4: Tune and Run the model

Step 5: Score the Test Population

Parameters used in Xgboost

General Parameters

Booster Parameters

Linear Booster Specific Parameters

Learning Task Parameters

Advanced functionality of xgboost

Testing whether the results make sense

End Notes

If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on twitter or like our facebook page.

Frequently Asked Questions

Responses From Readers

Write for us

Machine Learning

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

NaÃ¯ve Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Take a note

View all notes