14:50 | 15:30
Keywords defining the session:
Takeaway points of the session:
- New R-based scalable ML framework.
- How to run ML algorithms on Big Data.
This session guides the user through how to use XGBoost and scikit-learn. The goal is to use a Jupyter notebook and the Bank Marketing data set from the UCI repository to predict whether a client will purchase a certificate of deposit (CD) from a banking institution.
Class imbalance is a common problem in data science, in which the number of positive samples is significantly smaller than the number of negative samples. As data scientists, we would like to solve this problem and create a classifier with good performance. XGBoost (Extreme Gradient Boosting Decision Tree) is a very common tool for creating machine learning models for classification and regression. However, building good classification models with XGBoost on imbalanced data sets requires various non-trivial tricks and techniques, which is the reason for developing this tutorial.
In this talk, we will illustrate how machine learning classification is performed using XGBoost, which is usually a better choice than logistic regression and other techniques. We will use a real-life data set that is highly imbalanced (i.e., the number of positive samples is much smaller than the number of negative samples).
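To see why class imbalance calls for special treatment, consider a small hand-rolled illustration (the label ratio below is hypothetical, chosen to be similar in spirit to the Bank Marketing target): a classifier that always predicts the majority class scores high accuracy while missing every positive sample.

```python
# Hypothetical label distribution: 11 positives, 89 negatives.
labels = [1] * 11 + [0] * 89

# A trivial classifier that always predicts the majority class ("no").
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
recall = sum(p == y == 1 for p, y in zip(predictions, labels)) / labels.count(1)

print(accuracy)  # 0.89 -- looks good on paper...
print(recall)    # 0.0  -- but every positive sample is missed
```

This is why the session evaluates models with precision-recall and ROC curves rather than plain accuracy.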
This session will walk the user through the following conceptual steps:
- Data set description.
- Exploratory analysis to understand the data.
- Various preprocessing steps to clean and prepare the data.
- A naive XGBoost run for the classification.
- Cross-validation to obtain the precision-recall curve and ROC curve.
- Tuning the model with weighted positive samples to improve classification performance.
We will also talk about the following advanced techniques:
– Oversampling of the minority class.
– Undersampling of the majority class.
– SMOTE algorithms.
Try out R4ML.
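The core idea of SMOTE can be sketched in a few lines: synthesize new minority samples by interpolating between a minority point and one of its nearest minority neighbors. This is a simplified, hypothetical implementation for illustration only; in practice one would use a library implementation such as imbalanced-learn's `SMOTE`.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE sketch (illustrative, not production code):
    create n_new synthetic minority samples by interpolating between
    a random minority point and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from the chosen point to all minority points.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        new_points.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new_points)

# Oversample a tiny 2-D minority class of 6 points with 20 synthetic ones.
X_minority = np.random.default_rng(0).normal(size=(6, 2))
X_synth = smote_sketch(X_minority, n_new=20, k=3, rng=0)
print(X_synth.shape)  # (20, 2)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.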
Here is the notebook.