XGBoost Classifier

Tip

It is recommended to use Google Colaboratory to run this notebook.

XGBoost stands for “Extreme Gradient Boosting”. It is a supervised learning algorithm built on ensembles of decision trees and gradient tree boosting. The library is designed for scalability, portability and accuracy.
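
The boosting part works by adding trees sequentially, with each new tree fit to the residual errors of the ensemble built so far. The sketch below illustrates this idea for squared-error loss using scikit-learn's DecisionTreeRegressor on a synthetic target; the data and variable names are purely illustrative and not part of this notebook's pipeline.

# Minimal sketch of gradient tree boosting with squared-error loss:
# each new tree is fit to the residuals of the current ensemble prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X_demo = rng.uniform(-3, 3, size=(200, 1))            # illustrative data only
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.1, size=200)

ensemble_pred = np.zeros_like(y_demo)
learning_rate = 0.1
for _ in range(100):
    residuals = y_demo - ensemble_pred                 # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=3).fit(X_demo, residuals)
    ensemble_pred += learning_rate * tree.predict(X_demo)  # shrink each tree's contribution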

# Extra libraries required

# Install ray tune
! pip install tune-sklearn ray[tune]

# Install shap
! pip install shap
# Import necessary packages
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold, train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
import plotly.express as px
import plotly.io as pio
# Set default plotly renderer
pio.renderers.default = "notebook_connected" # Use "colab" when running in google colaboratory
# Load data into dataframe
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/uci/ospi/datasets/preprocessed_osi.csv')

Preprocessing

The preprocessing steps are the same as those used for the other ensemble methods.
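
For readers who have not seen those notebooks, the preprocessing mainly amounts to encoding the categorical columns of the raw dataset. The sketch below is only an assumption about what that step looks like: the raw filename and the set of categorical columns are guesses, not taken from this notebook.

# Assumed preprocessing sketch (not part of this notebook)
raw_df = pd.read_csv('online_shoppers_intention.csv')                # hypothetical raw file
raw_df = pd.get_dummies(raw_df, columns=['Month', 'VisitorType'])    # one-hot encode categoricals
raw_df.to_csv('preprocessed_osi.csv', index=False)                   # saved for the model notebooks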

y = df['Revenue']
X = df.drop('Revenue', axis=1)
# Split data into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

As with the other models, the minority class needs to be oversampled to improve performance on the under-represented positive class.

# Oversample the minority class in the target variable
oversample = SMOTE()
X_train, y_train = oversample.fit_resample(X_train, y_train)
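
To confirm that SMOTE produced a balanced training set, the class counts can be inspected; this check is a small addition, not part of the original notebook.

# Check the class distribution after oversampling (should now be roughly 50/50)
print(pd.Series(y_train).value_counts())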

Model Training

# Declare estimator
estimator = XGBClassifier(tree_method='gpu_hist', gpu_id=0)

# Declare cross validation method
cv = StratifiedKFold()

# Declare parameter grid
param_grid = dict(
    n_estimators = [50, 100, 200, 400],
    max_depth = [3, 6, 9],
    learning_rate = [1, 0.1, 0.01],
    subsample = [0.5, 0.8, 1],
    colsample_bytree = [0.5, 0.8, 1]
)
# Import grid search from tune sklearn
from tune_sklearn import TuneGridSearchCV

# Train the model
xgb_clf = TuneGridSearchCV(estimator=estimator, param_grid=param_grid, scoring="f1", cv=cv, n_jobs=-1, use_gpu=True, verbose=2)
xgb_clf.fit(X_train, y_train)
xgb_clf.best_estimator_
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id='0',
              learning_rate=0.01, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=None, n_estimators=200, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=0.5, tree_method='gpu_hist', verbosity=1)
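
Since TuneGridSearchCV follows the scikit-learn search API, the winning hyperparameter combination and its cross-validated score should also be available directly; this lookup is a small addition for convenience.

# Inspect the best hyperparameters and the corresponding cross-validated F1 score
print(xgb_clf.best_params_)
print(xgb_clf.best_score_)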
# Save and load the model if required
import joblib

joblib.dump(xgb_clf.best_estimator_, '/content/drive/MyDrive/Colab Notebooks/uci/ospi/models/xgb.pkl')
# xgb_clf_loaded = joblib.load('/content/drive/MyDrive/Colab Notebooks/uci/ospi/models/xgb.pkl')
# Get predictions from the model
y_pred = xgb_clf.best_estimator_.predict(X_test, validate_features=False)

Model Evaluation

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

       False       0.95      0.92      0.93      2594
        True       0.63      0.76      0.69       489

    accuracy                           0.89      3083
   macro avg       0.79      0.84      0.81      3083
weighted avg       0.90      0.89      0.90      3083
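
A confusion matrix gives a complementary view of the same predictions, showing the raw counts of correct and incorrect classifications per class; this check is an addition to the original evaluation.

# Confusion matrix for the same test-set predictions
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))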

The XGBoost classifier performs about as well as the AdaBoost classifier and is also comparable to the models used in the original paper.

Model Interpretation

XGBoost is built from decision trees, which are white-box models, so its predictions are easier to explain. The feature importances provided by the model are very useful for understanding what the model has learnt.

# Create a feature importance dataframe
feat_imp_data = zip(list(df.drop('Revenue', axis=1).columns), xgb_clf.best_estimator_.feature_importances_)
feat_imp_df = pd.DataFrame(columns=['column', 'feature_importance'], data=feat_imp_data)
# Sort feature importance
feat_imp_df.sort_values('feature_importance', ascending=False, inplace=True)
fig = px.bar(feat_imp_df[:20], x='feature_importance', y='column', orientation='h')
fig.show()

Further, SHAP values could also be explored for per-prediction explanations; getting the shap package working with this XGBoost model is still a work in progress.
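
For reference, the standard shap recipe for tree ensembles is sketched below. This follows the usual TreeExplainer workflow rather than a verified working example for this particular notebook.

# Sketch of computing SHAP values for the tuned model with shap's TreeExplainer
# (standard shap usage for tree ensembles; not yet verified in this environment).
import shap

explainer = shap.TreeExplainer(xgb_clf.best_estimator_)
shap_values = explainer.shap_values(X_test)          # one attribution per feature per row
shap.summary_plot(shap_values, X_test)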