
How to build a high-performing weighted XGBoost ML model for a real-life imbalanced dataset


Wednesday 14th


14:50 – 15:30


Theatre 25


Keywords defining the session:

- BigData

- MachineLearning

- Scalable

Takeaway points of the session:

- A new R-based scalable ML framework.

- How to run ML algorithms on Big Data.


This session guides the user through how to use XGBoost and scikit-learn. The goal is to use a Jupyter notebook and the UCI Bank Marketing dataset to predict whether a client will purchase a Certificate of Deposit (CD) from a banking institution.

Class imbalance is a common problem in data science, where the number of positive samples is significantly smaller than the number of negative samples. As data scientists, we would like to solve this problem and create a classifier with good performance. XGBoost (Extreme Gradient Boosting) is a very common tool for creating machine learning models for classification and regression. However, the tricks and techniques for building good classification models with XGBoost on imbalanced datasets are non-trivial, which is the reason for developing this tutorial.
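As a minimal illustration of what "imbalanced" means in practice, the sketch below computes the negative-to-positive ratio on a hypothetical label vector (the roughly 11% positive rate mimics the Bank Marketing data). This ratio is the usual starting value for XGBoost's scale_pos_weight parameter, which the tutorial tunes later.

```python
import numpy as np

# Hypothetical label vector: 1 = client purchased a CD, 0 = did not.
# Roughly 11% positives, similar to the imbalance in the UCI data.
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.11).astype(int)

n_pos = int(y.sum())
n_neg = len(y) - n_pos

# XGBoost's scale_pos_weight is commonly initialised to the
# negative/positive ratio to counteract class imbalance.
scale_pos_weight = n_neg / n_pos
print(f"positives={n_pos}, negatives={n_neg}, "
      f"scale_pos_weight={scale_pos_weight:.2f}")
```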

In this talk, we will illustrate how machine learning classification is performed using XGBoost, which is usually a better choice than logistic regression and other techniques. We will use a real-life dataset that is highly imbalanced (i.e. the number of positive samples is much smaller than the number of negative samples).

This session will walk the user through the following conceptual steps:

- Data set description.

- Exploratory analysis to understand the data.

- Various preprocessing steps to clean and prepare the data.

- A naive XGBoost run for the classification.

- Cross validation to obtain the precision-recall curve and ROC curve.

- Tuning the model and weighting the positive samples to improve classification performance.

We will also talk about the following advanced techniques:

- Oversampling of the minority class.

- Undersampling of the majority class.

- The SMOTE algorithm.

Try out R4ML.
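The three resampling strategies can be sketched with NumPy and scikit-learn alone; the SMOTE variant below is a deliberately minimal interpolation between minority neighbours (a production implementation is available as imblearn.over_sampling.SMOTE in the imbalanced-learn library).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))
y = (rng.random(1_000) < 0.1).astype(int)   # ~10% minority class

maj, mino = X[y == 0], X[y == 1]

# 1) Random undersampling: drop majority rows until classes match.
idx = rng.choice(len(maj), size=len(mino), replace=False)
X_under = np.vstack([maj[idx], mino])

# 2) Random oversampling: duplicate minority rows until classes match.
idx = rng.choice(len(mino), size=len(maj), replace=True)
X_over = np.vstack([maj, mino[idx]])

# 3) Minimal SMOTE: synthesise new minority points by interpolating
#    between a minority sample and one of its nearest minority neighbours.
nn = NearestNeighbors(n_neighbors=3).fit(mino)
_, neigh = nn.kneighbors(mino)              # column 0 is the point itself
n_new = len(maj) - len(mino)
base = rng.integers(0, len(mino), size=n_new)
pick = neigh[base, rng.integers(1, 3, size=n_new)]  # a true neighbour
gap = rng.random((n_new, 1))
synthetic = mino[base] + gap * (mino[pick] - mino[base])
X_smote = np.vstack([maj, mino, synthetic])

print(X_under.shape, X_over.shape, X_smote.shape)
```

All three resampled sets end up balanced; SMOTE differs from plain oversampling in that the added minority rows are new interpolated points rather than exact duplicates.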

Here is the notebook.