Logistic Regression

Tip

It is recommended to run the notebook in Google Colaboratory.

# Extra libraries required

# Install ray tune
# ! pip install tune-sklearn ray[tune]

# Install shap
# ! pip install shap

Logistic regression adapts linear regression to a categorical outcome. The main concept underlying logistic regression is the natural log of the odds. Consider the simplest case: one continuous predictor X and a dichotomous outcome Y. A plot of such data results in two parallel lines, which are difficult for ordinary linear regression to fit. Instead, the predictor X is grouped into categories and the mean of the outcome variable is computed for each group. The resulting plot can be approximated by a sigmoid function. Even a sigmoid is difficult for linear regression to fit, but this issue can be dealt with by applying the logit transformation to the dependent variable. The simplest logistic regression model is represented by
logit(Y) = ln(odds) = ln(π / (1 − π)) = α + βx

To find the probability of an outcome, take the antilog of both sides of the equation above. The logit transformation is what makes the relationship between the predictor and the dependent variable linear. [PLI02] One of the major advantages of logistic regression is that the equation for the probability is simple, which allows it to be applied to large datasets. Its major drawback is that it cannot capture non-linear relationships properly.
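Taking the antilog (exponentiating both sides) gives the probability of the outcome directly:

π = e^(α + βx) / (1 + e^(α + βx))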

# Import necessary packages
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.feature_selection import GenericUnivariateSelect, chi2, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
import shap
import plotly.express as px
import plotly.io as pio
# Set default plotly renderer
pio.renderers.default = "notebook_connected" # set it to "colab" for working in google colaboratory
# Load data into dataframe
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/uci/ospi/datasets/preprocessed_osi.csv')

Preprocessing

Before we perform any preprocessing, it is necessary to split the data into a training set and a test set, so that information from the test set does not leak into the fitted preprocessing steps.

y = df['Revenue']
X = df.drop('Revenue', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

We begin with the target variable. Since the target variable is highly imbalanced between transacting and non-transacting users, it is necessary to oversample the class with fewer entries. For this purpose, we will use SMOTE to oversample the minority class.

# Oversample the minority class in the target variable
oversample = SMOTE()
X_train, y_train = oversample.fit_resample(X_train, y_train)
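If desired, the class balance after resampling can be verified quickly:

# Verify that SMOTE balanced the two classes
print(y_train.value_counts())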

Logistic regression is also sensitive to feature ranges, so each feature needs to be scaled to a common range. For this purpose, we will use the MinMaxScaler API from scikit-learn, which scales every feature to [0, 1]; unlike StandardScaler, it leaves one-hot encoded categorical features (values 0 and 1) unchanged.

# Scale the data
transformer = MinMaxScaler()
X_train = transformer.fit_transform(X_train)

# Apply same transformation to test set
X_test = transformer.transform(X_test)
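The scaled training features should now all lie in the [0, 1] range, which can be checked in one line:

# Each training feature is now scaled to [0, 1]; one-hot encoded 0/1 columns are unchanged
print(X_train.min(), X_train.max())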

Model Training

# Declare the model
estimator = LogisticRegression()

# Declare cross-validation method
cv = StratifiedKFold()

# Declare the parameter grid for the logistic regression estimator
param_grid = dict(
    C = [0.001, 0.1, 1, 10, 100],
    solver = ['liblinear', 'lbfgs', 'newton-cg', 'sag', 'saga'],
    max_iter = [100, 150, 200]
)
# Import grid search model from tune sklearn
from tune_sklearn import TuneGridSearchCV

# Train the model
logreg_clf = TuneGridSearchCV(estimator=estimator, param_grid=param_grid, scoring='f1', n_jobs=-1, cv=cv, use_gpu=True)
logreg_clf.fit(X_train, y_train)
# Get the best performing model
logreg_clf.best_estimator_
LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
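If needed, the selected hyperparameters and the corresponding cross-validated F1 score can also be inspected, since TuneGridSearchCV mirrors the scikit-learn search API:

# Best hyperparameter combination and its cross-validated F1 score
print(logreg_clf.best_params_)
print(logreg_clf.best_score_)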
# Save and load the model if required
# import joblib
# joblib.dump(logreg_clf.best_estimator_, '/content/drive/MyDrive/Colab Notebooks/uci/ospi/models/log_reg.pkl')
# logreg_clf = joblib.load('/content/drive/MyDrive/Colab Notebooks/uci/ospi/models/log_reg.pkl')
# Use model for prediction
y_pred = logreg_clf.predict(X_test)

Model Evaluation

The evaluation metric chosen by the authors of the original paper is the F1 score. To compare our result with theirs, it is useful for us to compute the F1 score as well.
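For reference, the per-class F1 score is the harmonic mean of precision and recall:

F1 = 2 · (precision · recall) / (precision + recall)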

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

       False       0.95      0.88      0.91      2594
        True       0.54      0.74      0.63       489

    accuracy                           0.86      3083
   macro avg       0.74      0.81      0.77      3083
weighted avg       0.88      0.86      0.87      3083

The macro-averaged F1 score comes out to around 0.77, which is quite impressive. Though it does not beat the models used in the original paper, it is not far behind them.
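This macro average is simply the unweighted mean of the two per-class F1 scores reported above: (0.91 + 0.63) / 2 ≈ 0.77.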

Model Interpretation

To understand on what basis the machine learning model gives us these results, it is necessary to understand how the trained model looks at the features of the dataset. Furthermore, it is also helpful to know how a particular feature value affects the outcome. We will be using SHAP values for model interpretation, beginning with feature importance. It should also be noted that the coefficients provided by a logistic regression model can be interpreted as feature importances.

# Create a dataframe with columns and their corresponding coefficients
coef_df = pd.DataFrame(list(zip(df.drop('Revenue', axis=1).columns, logreg_clf.best_estimator_.coef_[0])), columns=['column', 'coef'])
coef_df.sort_values('coef', ascending=False, inplace=True)
# Compute SHAP values
explainer = shap.Explainer(logreg_clf.best_estimator_, X_train, feature_names=df.drop('Revenue', axis=1).columns)
shap_values = explainer(X_test)
# Plot shap values
shap.plots.bar(shap_values)
[Figure: SHAP feature importance bar plot]
# Plot coefficients of logistic regression model
fig = px.bar(coef_df[:10], x='coef', y='column', orientation='h')
fig.show()
[Figure: top 10 logistic regression coefficients (coef by column), led by PageValues and ProductRelated_Duration]

From the above two plots, it can be observed that SHAP and the logistic regression coefficients agree that page values is the most important feature. The other feature they partially agree on is the duration for which a visitor views product-related pages. The month of November also appears to be important for shopping, which makes sense since it falls near the holiday season.

shap.plots.beeswarm(shap_values=shap_values, max_display=20)
[Figure: SHAP beeswarm plot of feature effects on the model output]

High page values prominently affect the model: the higher the page value, the higher the chance that the visitor will transact. For exit rates, a high value will not convert a visitor, but a low value may not convert them either. A customer spending a large amount of time on product-related pages may convert into a transacting visitor, though the effect is not comparable with that of page values. Likewise, a customer visiting more informational pages may convert into a transacting user. People shop the least in May and the most in November. Visitors with traffic type 8 are more likely to transact; this is confirmed by the logistic regression coefficients, where it appears among the top 10 features. Traffic types 10 and 11 also contribute to revenue. People using operating system 2 are more likely to transact. Since the original paper offers little explanation of features such as traffic type and operating system, no concrete interpretation can be given for them.

Learnings

  1. Feature selection with GenericUnivariateSelect was tried, but it did not improve performance. The idea of selecting features was therefore dropped, especially since such methods would reduce the explainability of the model (a sketch of this approach is shown after this list).

  2. Hyperparameter tuning of logistic regression with grid search on a GPU is extremely fast.
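Below is a minimal sketch of the univariate feature selection that was tried; the score function and the number of selected features are illustrative assumptions, not the exact settings used.

# Illustrative sketch of univariate feature selection (settings are assumptions):
# keep the 20 features with the highest mutual information with the target
selector = GenericUnivariateSelect(score_func=mutual_info_classif, mode='k_best', param=20)
X_train_fs = selector.fit_transform(X_train, y_train)
X_test_fs = selector.transform(X_test)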