Data Preprocessing¶
# Import packages
import json
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from imblearn.over_sampling import SMOTE
# Load the dataset
df = pd.read_csv('./../../../data/cleaned_data.csv')
# Load lists of numerical and categorical columns from the static file
with open('./../../../data/statics.json') as f:
statics = json.load(f)
categorical_columns = statics['categorical_columns']
numerical_columns = statics['numerical_columns']
Before we begin preprocessing, it is necessary to split the data into training and testing sets. This matters because every transformation has to be fitted on the training data only, and then applied to both the training and testing sets; fitting on the full dataset would leak information from the test set.
# Separate the target variable from the other data
y = df['Attrition']
X = df.drop('Attrition', axis=1)
categorical_columns.remove('Attrition')
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape
(1176, 31)
# Segregate the data into numerical and categorical variables for the training data
num_df_train = X_train[numerical_columns]
cat_df_train = X_train[categorical_columns]
# Segregate the data into numerical and categorical variables for the testing data
num_df_test = X_test[numerical_columns]
cat_df_test = X_test[categorical_columns]
Preprocessing per data types¶
Numerical columns¶
Let us begin the data preprocessing with the numerical columns. Since some of the columns are positively skewed and the columns do not share a common scale, it is better to bring them onto one. The transformation used here is the MinMaxScaler from scikit-learn. Mathematically, it can be given as:

$$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$
# Scale the data
transformer = MinMaxScaler()
num_df_train = transformer.fit_transform(num_df_train)
num_df_test = transformer.transform(num_df_test)
# Convert the numpy arrays to dataframe
num_df_train = pd.DataFrame(num_df_train, columns=numerical_columns)
num_df_test = pd.DataFrame(num_df_test, columns=numerical_columns)
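As a quick illustration of why the scaler must be fitted on the training data only, here is a self-contained toy sketch (the arrays are made up for illustration): test-set values that fall outside the fitted range can map outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: two features on very different scales
X_fit = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_fit)
print(X_scaled.min(axis=0), X_scaled.max(axis=0))  # [0. 0.] [1. 1.]

# Unseen data is transformed with the *training* min/max, so values
# outside the fitted range can fall outside [0, 1]
X_new = scaler.transform(np.array([[4.0, 600.0]]))
print(X_new)  # [[1.5  1.25]]
```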
Categorical columns¶
As far as the categorical columns are concerned, they need to be represented as numbers so that machine learning algorithms can process the data. For our data, some columns need ordinal encoding (their categories have a natural order) while others need one-hot encoding.
# Separate the columns into ordinal and one hot columns
ordinal_columns = ['Education', 'EnvironmentSatisfaction', 'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'OverTime', 'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel', 'WorkLifeBalance']
one_hot_columns = ['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus']
ordinal_list = []
In order to encode columns that have an inherent order, we first need to declare the order of categories for each column, which the algorithm will use when assigning codes.
ordinal_list.append(['Below College', 'College', 'Bachelor', 'Master', 'Doctor']) # Education
ordinal_list.append(['Low', 'Medium', 'High', 'Very High']) # EnvironmentSatisfaction
ordinal_list.append(['Low', 'Medium', 'High', 'Very High']) # JobInvolvement
ordinal_list.append(['level_1', 'level_2', 'level_3', 'level_4', 'level_5']) #JobLevel
ordinal_list.append(['Low', 'Medium', 'High', 'Very High']) # JobSatisfaction
ordinal_list.append(['No', 'Yes']) # OverTime
ordinal_list.append(['Excellent', 'Outstanding']) # PerformanceRating
ordinal_list.append(['Low', 'Medium', 'High', 'Very High']) # RelationshipSatisfaction
ordinal_list.append(['level_0', 'level_1', 'level_2', 'level_3']) # StockOptionLevel
ordinal_list.append(['Bad', 'Good', 'Better', 'Best']) # WorkLifeBalance
# Apply Ordinal Encoder
onc = OrdinalEncoder(categories=ordinal_list)
ordinal_cat_df_train = onc.fit_transform(cat_df_train[ordinal_columns])
ordinal_cat_df_test = onc.transform(cat_df_test[ordinal_columns])
# Convert the numpy arrays to dataframes
ordinal_cat_df_train = pd.DataFrame(ordinal_cat_df_train, columns=ordinal_columns)
ordinal_cat_df_test = pd.DataFrame(ordinal_cat_df_test, columns=ordinal_columns)
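To make the mapping concrete, here is a minimal sketch (with a made-up column) of how OrdinalEncoder assigns integer codes according to the declared category order:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical column using the same ordering as the satisfaction columns above
toy = pd.DataFrame({'Satisfaction': ['Low', 'Very High', 'Medium', 'High']})
enc = OrdinalEncoder(categories=[['Low', 'Medium', 'High', 'Very High']])
codes = enc.fit_transform(toy)
print(codes.ravel())  # [0. 3. 1. 2.]
```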
# Apply One-hot Encoder
onehot_cat_df_train = pd.DataFrame()
for column in one_hot_columns:
temp = pd.get_dummies(cat_df_train[column], prefix=column, prefix_sep=' ')
onehot_cat_df_train = pd.concat([onehot_cat_df_train, temp], axis=1)
onehot_cat_df_test = pd.DataFrame()
for column in one_hot_columns:
temp = pd.get_dummies(cat_df_test[column], prefix=column, prefix_sep=' ')
onehot_cat_df_test = pd.concat([onehot_cat_df_test, temp], axis=1)
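One caveat with calling `pd.get_dummies` separately on the training and testing splits: if a category happens to be absent from the test split, the resulting columns will not match. A minimal sketch (with made-up values) of aligning the test dummies to the training columns via `reindex`:

```python
import pandas as pd

train_col = pd.Series(['Sales', 'HR', 'R&D'], name='Department')
test_col = pd.Series(['Sales', 'Sales'], name='Department')  # 'HR' and 'R&D' absent

train_dummies = pd.get_dummies(train_col, prefix='Department', prefix_sep=' ')
test_dummies = pd.get_dummies(test_col, prefix='Department', prefix_sep=' ')

# Align the test dummies to the training columns, filling missing ones with 0
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_dummies.columns))
```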
Merge preprocessed data¶
Before merging all the data together, it is better to check the length of individual data.
# Check the length of the training data
print(f"Length of numerical data: {len(num_df_train)}")
print(f"Length of ordinal categorical data: {len(ordinal_cat_df_train)}")
print(f"Length of one-hot encoded columns data: {onehot_cat_df_train.shape[0]}")
Length of numerical data: 1176
Length of ordinal categorical data: 1176
Length of one-hot encoded columns data: 1176
# Check the length of the testing data
print(f"Length of numerical data: {len(num_df_test)}")
print(f"Length of ordinal categorical data: {len(ordinal_cat_df_test)}")
print(f"Length of one-hot encoded columns data: {onehot_cat_df_test.shape[0]}")
Length of numerical data: 294
Length of ordinal categorical data: 294
Length of one-hot encoded columns data: 294
Since all the lengths match, the data can be merged together.
# Merge the data
train_df = pd.concat([num_df_train.reset_index(drop=True), ordinal_cat_df_train.reset_index(drop=True), onehot_cat_df_train.reset_index(drop=True)], axis=1)
test_df = pd.concat([num_df_test.reset_index(drop=True), ordinal_cat_df_test.reset_index(drop=True), onehot_cat_df_test.reset_index(drop=True)], axis=1)
Finally, check the shape of both dataframes.
print(f"The shape of training data is: {train_df.shape}")
print(f"The shape of testing data is: {test_df.shape}")
The shape of training data is: (1176, 51)
The shape of testing data is: (294, 51)
Handling Outliers¶
Outliers are observations that do not fit well with the rest of the dataset. They are often errors introduced during the recording process. Because outliers fall outside the patterns present in the data, it is better to either treat or remove them in order to reduce their influence on the outcome. Since we cannot correct such values now, we will resort to removing them.
To remove outliers, it is necessary to detect them first. We will use several methods to detect the outliers and then combine their results by majority vote to get the final set for removal. Three methods will be used:
Robust Covariance
Isolation Forest
Local Outlier Factor
# Initialize the algorithms
rob_cov = EllipticEnvelope(contamination=0.05)
iso_fst = IsolationForest(contamination=0.05, random_state=42)
lof = LocalOutlierFactor(contamination=0.05)
# Find the outliers using each method
y_pred_rob_cov = rob_cov.fit_predict(train_df)
y_pred_iso_fst = iso_fst.fit_predict(train_df)
y_pred_lof = lof.fit_predict(train_df)
/Users/pushkar/miniforge3/envs/hra/lib/python3.8/site-packages/sklearn/covariance/_robust_covariance.py:647: UserWarning: The covariance matrix associated to your dataset is not full rank
warnings.warn("The covariance matrix associated to your dataset "
# Create a dataframe for the result
y_pred_df = pd.DataFrame(columns = ['Robust Covariance', 'Isolation Forest', 'Local Outlier Factor'])
y_pred_df['Robust Covariance'] = y_pred_rob_cov
y_pred_df['Isolation Forest'] = y_pred_iso_fst
y_pred_df['Local Outlier Factor'] = y_pred_lof
# Find the indexes of observations that are marked as outliers by a majority of the algorithms
y_pred_df['Total'] = y_pred_df['Robust Covariance'] + y_pred_df['Isolation Forest'] + y_pred_df['Local Outlier Factor']
y_pred_df['Outlier'] = y_pred_df['Total'] < 0
print(f"Total outliers detected: {len(y_pred_df[y_pred_df['Outlier'] == True])}")
Total outliers detected: 38
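Each detector returns +1 for inliers and -1 for outliers, so summing the three votes and flagging rows with a negative total marks exactly the observations that at least two of the three methods call outliers. A toy sketch of the voting logic, with made-up predictions:

```python
import pandas as pd

# Made-up votes from three detectors: +1 = inlier, -1 = outlier
preds = pd.DataFrame({
    'A': [1,  1, -1, -1],
    'B': [1, -1, -1,  1],
    'C': [1,  1, -1, -1],
})
total = preds.sum(axis=1)  # row sums: 3, 1, -3, -1
is_outlier = total < 0     # negative total means at least two -1 votes
print(is_outlier.tolist())  # [False, False, True, True]
```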
Since very few outliers were detected, it is safe for us to remove them. Although we lose some information by removing the outliers, the loss is small enough not to harm our results to any great extent.
# Remove outliers from the training data
outlier_index = y_pred_df[y_pred_df['Outlier'] == True].index
train_df = train_df.drop(outlier_index)
y_train = y_train.reset_index(drop=True).drop(labels=outlier_index)
Oversampling¶
The target variable has significant class imbalance, which may introduce bias during the training process. To avoid this, it is necessary to either undersample the majority class or oversample the minority class of the target variable. But undersampling leads to data loss, and since we do not have much data in the first place, it is better to oversample the minority class.
# Convert the yes and no to binary values
y_train = y_train.apply(lambda x: x == 'Yes')
y_test = y_test.apply(lambda x: x == 'Yes')
# Perform oversampling
oversample = SMOTE()
X_train, y_train = oversample.fit_resample(train_df, y_train)
# Store the values in files
X_train = pd.concat([X_train, y_train], axis=1)
X_test = pd.concat([test_df.reset_index(drop=True), y_test.reset_index(drop=True)], axis=1)
X_train.to_csv('./../../../data/train/train.csv', index=False)
X_test.to_csv('./../../../data/test/test.csv', index=False)