Logistic Regression¶
# Load the packages
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import classification_report
# Load the data
train_df = pd.read_csv('./../../../../data/train/train.csv')
test_df = pd.read_csv('./../../../../data/test/test.csv')
# Load the feature selection result
feature_selector = pd.read_csv('./../../../../data/feature_ranking.csv')
feature_selector.set_index('Unnamed: 0', inplace=True)
# Separate feature space from target variable
y_train = train_df['Attrition']
X_train = train_df.drop('Attrition', axis=1)
y_test = test_df['Attrition']
X_test = test_df.drop('Attrition', axis=1)
We will train models on different feature subsets and compare their performance. We start with the complete dataset, then work through feature scores from the maximum of 8 down to 5.
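The per-score evaluation described above can be sketched as a loop. This is a hypothetical sketch on synthetic data (the actual notebook walks through each score step by step below); it assumes, as in the real ranking file, that the `Total` column holds each feature's score.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for the train/test splits and the feature ranking
cols = [f'f{i}' for i in range(8)]
X_tr = pd.DataFrame(rng.normal(size=(200, 8)), columns=cols)
y_tr = (X_tr['f0'] + X_tr['f1'] > 0).astype(int)
X_te = pd.DataFrame(rng.normal(size=(100, 8)), columns=cols)
y_te = (X_te['f0'] + X_te['f1'] > 0).astype(int)
ranking = pd.DataFrame({'Total': [8, 8, 7, 7, 6, 6, 5, 5]}, index=cols)

# Evaluate one model per feature score, from 8 down to 5
for score in [8, 7, 6, 5]:
    feats = ranking[ranking['Total'] == score].index.tolist()
    model = LogisticRegression(max_iter=1000).fit(X_tr[feats], y_tr)
    f1 = f1_score(y_te, model.predict(X_te[feats]), zero_division=0)
    print(f'score={score}: features={feats}, f1={f1:.2f}')
```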
# Declare the model parameters for searching
C = np.logspace(-4, 4, num=9)
# Declare and train the model
log_reg = LogisticRegressionCV(Cs=C, scoring='f1', max_iter=1000, n_jobs=-1)
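As a quick illustration of what `LogisticRegressionCV` does with the `Cs` grid, the toy example below (synthetic data, not the attrition dataset) fits the model and reads the cross-validated choice of regularization strength from the fitted `C_` attribute:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Toy binary-classification data standing in for the attrition dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

C = np.logspace(-4, 4, num=9)  # 1e-4 ... 1e4, same grid as above
model = LogisticRegressionCV(Cs=C, scoring='f1', max_iter=1000)
model.fit(X, y)

# C_ holds the best C found by cross-validation (one entry per class)
print(float(model.C_[0]))
```

The selected value is always one of the nine grid points, so widening or refining the `Cs` grid is the lever for tuning regularization here.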
Complete data¶
# Train the model
log_reg.fit(X_train, y_train)
LogisticRegressionCV(Cs=array([1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03,
1.e+04]),
max_iter=1000, n_jobs=-1, scoring='f1')
# Predict using model
y_pred = log_reg.predict(X_test)
# Make the classification report
print(classification_report(y_test, y_pred))
precision recall f1-score support
False 0.92 0.91 0.91 255
True 0.45 0.51 0.48 39
accuracy 0.85 294
macro avg 0.69 0.71 0.70 294
weighted avg 0.86 0.85 0.86 294
From the above report, it can be observed that the model is quite good at predicting who won't leave but pretty bad at identifying who will.
Feature score of 8¶
# Create the new dataset
# Get features with feature score of 8
features = feature_selector[feature_selector['Total']==8].index.tolist()
X_train_8 = X_train.loc[:, features]
X_test_8 = X_test.loc[:, features]
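The selection step above can be wrapped in a small helper (hypothetical, not part of the original notebook) so that each feature-score section doesn't repeat the same two lines:

```python
import pandas as pd

def features_with_score(ranking: pd.DataFrame, score: int) -> list:
    """Return the feature names whose 'Total' ranking equals `score`."""
    return ranking[ranking['Total'] == score].index.tolist()

# Tiny example ranking; the real one is loaded from feature_ranking.csv
ranking = pd.DataFrame({'Total': [8, 8, 6]},
                       index=['Age', 'OverTime', 'Gender'])
print(features_with_score(ranking, 8))  # → ['Age', 'OverTime']
```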
# Train the model
log_reg.fit(X_train_8, y_train)
LogisticRegressionCV(Cs=array([1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03,
1.e+04]),
max_iter=1000, n_jobs=-1, scoring='f1')
# Predict with model
y_pred_8 = log_reg.predict(X_test_8)
# Make the report
print(classification_report(y_test, y_pred_8))
precision recall f1-score support
False 0.92 0.91 0.91 255
True 0.45 0.51 0.48 39
accuracy 0.85 294
macro avg 0.69 0.71 0.70 294
weighted avg 0.86 0.85 0.86 294
There is no improvement in the result. However, since this model uses fewer features, it is the better choice for production: retraining and inference become cheaper when handling large volumes of data.
Since the smallest feature set evaluated matches the performance of the full feature set, we skip the remaining scores, as the chance of any improvement is quite low.