AdaBoost Classifier¶
Tip
It is recommended to use Google Colaboratory to run this notebook.
# Extra libraries required
# Install ray tune
! pip install tune-sklearn ray[tune]
# Install shap
# ! pip install shap
AdaBoost is an ensemble method that fits a sequence of weak learners on repeatedly modified versions of the data. The predictions from all the weak learners are then combined through a weighted majority vote. The first learner is trained on equally weighted data, but in each subsequent round the sample weights are adjusted based on misclassifications, so that the more difficult cases receive greater attention from later learners.
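To make the reweighting idea concrete, below is a minimal from-scratch sketch of a single boosting round of the classic discrete AdaBoost update. The toy arrays X_demo and y_demo are illustrative placeholders, not part of the notebook's dataset.
# Illustrative sketch: one boosting round of discrete AdaBoost on toy data
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))           # placeholder features
y_demo = rng.integers(0, 2, size=100)        # placeholder binary labels

n = len(y_demo)
sample_weights = np.full(n, 1 / n)           # round 1: all samples weighted equally

stump = DecisionTreeClassifier(max_depth=1)  # a weak learner
stump.fit(X_demo, y_demo, sample_weight=sample_weights)

miss = stump.predict(X_demo) != y_demo       # which samples were misclassified
err = sample_weights[miss].sum()             # weighted error rate
alpha = 0.5 * np.log((1 - err) / err)        # this learner's say in the final vote

# Increase the weight of misclassified samples for the next round
sample_weights *= np.exp(np.where(miss, alpha, -alpha))
sample_weights /= sample_weights.sum()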
# Import necessary packages
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import plotly.express as px
import plotly.io as pio
# Set default plotly renderer
pio.renderers.default = "notebook_connected" # set it to "colab" for working in google colaboratory
# Load data into dataframe
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/uci/ospi/datasets/preprocessed_osi.csv')
Preprocessing¶
The preprocessing steps remain the same as for the earlier algorithms.
y = df['Revenue']
X = df.drop('Revenue', axis=1)
# Split the data into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
The base estimator used inside the AdaBoost classifier is a decision tree classifier, which does not perform well on imbalanced classes. Hence it is better to oversample the minority class.
# Oversample the minority class in the target variable
oversample = SMOTE()
X_train, y_train = oversample.fit_resample(X_train, y_train)
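A quick sanity check (not part of the original notebook) confirms that both classes are now equally represented; the Series wrapper guards against older imblearn versions that return plain arrays.
# Check the class balance after resampling
print(pd.Series(y_train).value_counts())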
Model Training¶
AdaBoost requires a base estimator. It is sensible to reuse the best Decision Tree classifier configuration that was trained earlier.
# Declare the model
estimator = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=20, splitter="random", class_weight="balanced"))
# Declare cross-validation method
cv = StratifiedKFold()
# Declare parameter grid
param_grid = dict(
n_estimators = [25, 50, 100, 200],
learning_rate = [1, 0.1, 0.01, 0.001]
)
# Import grid search model from tune sklearn
from tune_sklearn import TuneGridSearchCV
# Train the model
adab_clf = TuneGridSearchCV(estimator=estimator, param_grid=param_grid, scoring='f1', n_jobs=-1, cv=cv, use_gpu=True, verbose=2)
adab_clf.fit(X_train, y_train)
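If tune-sklearn is not available, scikit-learn's own GridSearchCV takes the same estimator, grid, scoring, and cv arguments (there is no use_gpu option) and can serve as a drop-in replacement:
# Fallback without tune-sklearn: scikit-learn's built-in grid search
from sklearn.model_selection import GridSearchCV
adab_clf = GridSearchCV(estimator=estimator, param_grid=param_grid, scoring='f1', n_jobs=-1, cv=cv, verbose=2)
adab_clf.fit(X_train, y_train)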
adab_clf.best_estimator_
AdaBoostClassifier(algorithm='SAMME.R',
base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
class_weight='balanced',
criterion='gini',
max_depth=20,
max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort='deprecated',
random_state=None,
splitter='random'),
learning_rate=0.01, n_estimators=200, random_state=None)
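The winning hyperparameters and their cross-validated F1 score can also be read off directly, since tune-sklearn mirrors the scikit-learn search API:
# Inspect the selected hyperparameters and the cross-validated F1 score
print(adab_clf.best_params_)
print(adab_clf.best_score_)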
# Save and load the model if required
import joblib
joblib.dump(adab_clf.best_estimator_, '/content/drive/MyDrive/Colab Notebooks/uci/ospi/models/adab.pkl')
adab_clf = joblib.load('/content/drive/MyDrive/Colab Notebooks/uci/ospi/models/adab.pkl')
# Use model for prediction
y_pred = adab_clf.predict(X_test)
Model Evaluation¶
# Print classification report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

       False       0.93      0.93      0.93      2594
        True       0.65      0.65      0.65       489

    accuracy                           0.89      3083
   macro avg       0.79      0.79      0.79      3083
weighted avg       0.89      0.89      0.89      3083
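A confusion matrix (an addition, not in the original notebook) breaks the same predictions down into raw counts:
# Raw counts behind the report above; rows are true classes, columns predictions
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))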
The AdaBoost classifier has performed quite well. The result is close to that of the Random Forest classifier (F1 score = 0.81) used in the original paper.
Model Interpretation¶
The AdaBoost classifier is a white-box model, so its results are relatively easy to explain. Moreover, the base estimator it uses - the Decision Tree classifier - is a white-box model itself.
# Create a feature importance dataframe
feat_imp_data = zip(list(df.drop('Revenue', axis=1).columns), adab_clf.feature_importances_)
feat_imp_df = pd.DataFrame(columns=['column', 'feature_importance'], data=feat_imp_data)
# Sort feature importance
feat_imp_df.sort_values('feature_importance', ascending=False, inplace=True)
fig = px.bar(feat_imp_df[:20], x='feature_importance', y='column', orientation='h')
fig.show()
As usual, the page value feature has the highest importance. The model considers features about product-related pages more important than those about other pages, with the administrative-page features following closely. Exit rates affect the model more than bounce rates. The month of November also turned out to be one of the most important predictors.
Unfortunately, the SHAP package does not support the AdaBoost classifier directly.
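As a possible workaround, SHAP's model-agnostic KernelExplainer only needs a prediction function, so it can approximate SHAP values for any classifier, albeit slowly. A minimal sketch, assuming shap is installed and using small samples to keep the runtime manageable:
# Model-agnostic SHAP sketch; KernelExplainer is slow, so sample aggressively
import shap
background = shap.sample(X_train, 100)             # summarised background data
explainer = shap.KernelExplainer(adab_clf.predict_proba, background)
shap_values = explainer.shap_values(X_test[:50])   # explain the first 50 rows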