XGBoost Classifier¶
Tip
It is recommended to use Google Colaboratory to run this notebook
XGBoost stands for “Extreme Gradient Boosting”. It is a supervised learning algorithm built on decision tree ensembles and gradient tree boosting. The library is designed to provide good scalability, portability and accuracy.
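The boosting idea can be sketched in a few lines: each new shallow tree is fitted to the residual errors of the ensemble built so far, and its scaled predictions are added to the running total. The snippet below is only a toy illustration of this principle on synthetic data, not how XGBoost is implemented internally.
# Toy illustration of tree boosting: each tree fits the residuals of the current ensemble
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.zeros_like(y_toy)
for _ in range(50):
    residual = y_toy - prediction                       # errors of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=3).fit(X_toy, residual)
    prediction += learning_rate * tree.predict(X_toy)   # add the new tree's contribution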
# Extra libraries required
# Install ray tune
! pip install tune-sklearn "ray[tune]"
# Install shap
! pip install shap
# Import necessary packages
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold, train_test_split
from xgboost import XGBClassifier, Booster
from sklearn.metrics import classification_report
import plotly.express as px
import plotly.io as pio
# Set default plotly renderer
pio.renderers.default = "notebook_connected" # Use "colab" when running in google colaboratory
# Load data into dataframe
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/uci/ospi/datasets/preprocessed_osi.csv')
Preprocessing¶
The preprocessing steps are the same as those used for the other ensemble methods.
y = df['Revenue']
X = df.drop('Revenue', axis=1)
# Split data into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
As with the other models, the minority class is oversampled to improve model performance.
# Oversample the minority class in the target variable
oversample = SMOTE()
X_train, y_train = oversample.fit_resample(X_train, y_train)
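As a quick sanity check, the class distribution of the resampled training target can be printed; after SMOTE both classes should have roughly equal counts.
# Sanity check: the resampled training target should now be balanced
print(pd.Series(y_train).value_counts())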
Model Training¶
# Declare estimator
estimator = XGBClassifier(tree_method='gpu_hist', gpu_id='0')
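If no GPU is available, the estimator can instead be declared with the histogram-based CPU method; this optional fallback is shown commented out below.
# Optional CPU fallback when no GPU is available
# estimator = XGBClassifier(tree_method='hist')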
# Declare cross validation method
cv = StratifiedKFold()
# Declare parameter grid
param_grid = dict(
n_estimators = [50, 100, 200, 400],
max_depth = [3, 6, 9],
learning_rate = [1, 0.1, 0.01],
subsample = [0.5, 0.8, 1],
colsample_bytree = [0.5, 0.8, 1]
)
# Import grid search from tune sklearn
from tune_sklearn import TuneGridSearchCV
# Train the model
xgb_clf = TuneGridSearchCV(estimator=estimator, param_grid=param_grid, scoring="f1", cv=cv, n_jobs=-1, use_gpu=True, verbose=2)
xgb_clf.fit(X_train, y_train)
xgb_clf.best_estimator_
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id='0',
learning_rate=0.01, max_delta_step=0, max_depth=6,
min_child_weight=1, missing=None, n_estimators=200, n_jobs=1,
nthread=None, objective='binary:logistic', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=None, subsample=0.5, tree_method='gpu_hist', verbosity=1)
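TuneGridSearchCV mirrors scikit-learn's GridSearchCV interface, so if tune-sklearn or Ray is not available the same search can be run (more slowly) with plain scikit-learn, as sketched below.
# Alternative: the same search with scikit-learn's GridSearchCV (slower, no Ray required)
# from sklearn.model_selection import GridSearchCV
# xgb_clf = GridSearchCV(estimator=estimator, param_grid=param_grid, scoring="f1", cv=cv, n_jobs=-1, verbose=2)
# xgb_clf.fit(X_train, y_train)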
# Save and load the model if required
import joblib
joblib.dump(xgb_clf.best_estimator_, '/content/drive/MyDrive/Colab Notebooks/uci/ospi/models/xgb.pkl')
# xgb_clf_loaded = joblib.load('/content/drive/MyDrive/Colab Notebooks/uci/ospi/models/xgb.pkl')
# Get predictions from the model
y_pred = xgb_clf.best_estimator_.predict(X_test, validate_features=False)
Model Evaluation¶
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

       False       0.95      0.92      0.93      2594
        True       0.63      0.76      0.69       489

    accuracy                           0.89      3083
   macro avg       0.79      0.84      0.81      3083
weighted avg       0.90      0.89      0.90      3083
The XGBoost classifier performs as well as the AdaBoost classifier and is also comparable to the models used in the original paper.
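For a complementary view of where the errors occur, a confusion matrix can also be computed from the same predictions.
# Optional: confusion matrix for the test set predictions
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))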
Model Interpretation¶
Like the other tree-based models, XGBoost is relatively easy to inspect. The feature importances it provides are very useful for understanding what the model has learnt.
# Create a feature importance dataframe
feat_imp_data = zip(list(df.drop('Revenue', axis=1).columns), xgb_clf.best_estimator_.feature_importances_)
feat_imp_df = pd.DataFrame(columns=['column', 'feature_importance'], data=feat_imp_data)
# Sort feature importance
feat_imp_df.sort_values('feature_importance', ascending=False, inplace=True)
fig = px.bar(feat_imp_df[:20], x='feature_importance', y='column', orientation='h')
fig.show()
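Note that feature_importances_ reports a single importance type (which one depends on the estimator's importance_type setting). Other importance types, such as weight and gain, can be read directly from the underlying booster, as sketched below.
# Other importance types can be read from the underlying booster
booster = xgb_clf.best_estimator_.get_booster()
print(booster.get_score(importance_type='weight'))  # how often each feature is used to split
print(booster.get_score(importance_type='gain'))    # average gain of splits on each feature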
Further, SHAP values can also be explored; getting shap to work with this XGBoost model is currently still a work in progress.
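One commonly used starting point is shap's TreeExplainer, sketched (commented out) below; whether it runs cleanly here likely depends on the installed shap and XGBoost versions.
# Sketch: SHAP values for the tuned model (may depend on shap/xgboost version compatibility)
# import shap
# explainer = shap.TreeExplainer(xgb_clf.best_estimator_)
# shap_values = explainer.shap_values(X_test)
# shap.summary_plot(shap_values, X_test)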