Decision Tree Classifier

# Load the packages
import warnings
# Silence sklearn/pandas warnings for cleaner notebook output
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
# Load the pre-split train/test data from disk
train_df = pd.read_csv('./../../../../data/train/train.csv')
test_df = pd.read_csv('./../../../../data/test/test.csv')
# Load the feature selection result; the first (unnamed) CSV column holds the
# feature names, so it becomes the index for lookups by feature name below
feature_selector = pd.read_csv('./../../../../data/feature_ranking.csv')
feature_selector.set_index('Unnamed: 0', inplace=True)
# Separate feature space (X) from the binary 'Attrition' target variable (y)
y_train = train_df['Attrition']
X_train = train_df.drop('Attrition', axis=1)
y_test = test_df['Attrition']
X_test = test_df.drop('Attrition', axis=1)

We will run models for different sets of features and evaluate their performance. We start with the complete dataset and then work with feature scores from a maximum of 8 down to 5.

# Declare the model paramters for searching
# Hyperparameter search space for the decision tree
param_grid = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_depth': [20, 40, 60, None],
    'min_samples_split': [2, 10, 40],
}
# Base estimator: class_weight="balanced" compensates for the minority
# attrition class; max_features=None considers every feature at each split
base_tree = DecisionTreeClassifier(class_weight="balanced", max_features=None)
# Exhaustive grid search scored on F1 (appropriate for the imbalanced target),
# parallelized across all cores
dt = GridSearchCV(estimator=base_tree, param_grid=param_grid, scoring='f1', n_jobs=-1)

Complete dataset

# Train the model: run the grid search over param_grid on the full feature set
dt.fit(X_train, y_train)
GridSearchCV(estimator=DecisionTreeClassifier(class_weight='balanced'),
             n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [20, 40, 60, None],
                         'min_samples_split': [2, 10, 40],
                         'splitter': ['best', 'random']},
             scoring='f1')
# Get the parameters of the best model found by the grid search
dt.best_estimator_
DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=40)
# Predict on the test set using the best model from the grid search
y_pred = dt.predict(X_test)
# Make the classification report: per-class precision/recall/F1 on the test set
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

       False       0.88      0.85      0.87       255
        True       0.21      0.26      0.23        39

    accuracy                           0.78       294
   macro avg       0.55      0.56      0.55       294
weighted avg       0.79      0.78      0.78       294

The results are no better than those of logistic regression. The precision, recall, and f1 for the attrition class are not nearly as good as those of random forest.

Feature score of 8

# Create the reduced dataset: keep only features whose total ranking score is 8

# Boolean mask over the feature-ranking table, then pull the matching
# feature names from its index
score_8_mask = feature_selector['Total'] == 8
features = feature_selector.index[score_8_mask].tolist()
X_train_8 = X_train[features]
X_test_8 = X_test[features]
# Re-run the same grid search on the reduced feature set
dt.fit(X_train_8, y_train)
GridSearchCV(estimator=DecisionTreeClassifier(class_weight='balanced'),
             n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [20, 40, 60, None],
                         'min_samples_split': [2, 10, 40],
                         'splitter': ['best', 'random']},
             scoring='f1')
# Predict with the model trained on the score-8 feature subset
y_pred_8 = dt.predict(X_test_8)
# Make the report for the reduced-feature predictions.
# BUG FIX: the original passed y_pred (full-feature predictions) here, so the
# printed report was a duplicate of the previous one and could never show a
# difference; it must evaluate y_pred_8.
print(classification_report(y_test, y_pred_8))
              precision    recall  f1-score   support

       False       0.88      0.85      0.87       255
        True       0.21      0.26      0.23        39

    accuracy                           0.78       294
   macro avg       0.55      0.56      0.55       294
weighted avg       0.79      0.78      0.78       294

There is no improvement in the result. But since this model uses fewer features, it is better to use it in production in order to speed up retraining and inference under a heavy data load.

Since the smallest number of features gave the same performance as the full feature set, it is reasonable to skip the remaining score thresholds, as the chance of an improved result is quite small.