Dataset Introduction

The dataset used in this book is IBM HR Analytics Employee Attrition & Performance, hosted on Kaggle. It was uploaded 4 years ago and has not been revised since. The data file is around 222 kB and contains 35 columns. The dataset was published primarily for predicting employee attrition.

# Import the necessary packages
import pandas as pd 
import json
# Load the data
df = pd.read_csv('./../../../data/data.csv')
df.head()
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 Yes Travel_Rarely 1102 Sales 1 College Life Sciences 1 1 ... Low 80 level_0 8 0 Bad 6 4 0 5
1 49 No Travel_Frequently 279 Research & Development 8 Below College Life Sciences 1 2 ... Very High 80 level_1 10 3 Better 10 7 1 7
2 37 Yes Travel_Rarely 1373 Research & Development 2 College Other 1 4 ... Medium 80 level_0 7 3 Better 0 0 0 0
3 33 No Travel_Frequently 1392 Research & Development 3 Master Life Sciences 1 5 ... High 80 level_0 8 3 Better 8 7 3 0
4 27 No Travel_Rarely 591 Research & Development 2 Below College Medical 1 7 ... Very High 80 level_1 6 3 Better 2 2 2 2

5 rows × 35 columns

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   object
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   object
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   object
 14  JobLevel                  1470 non-null   object
 15  JobRole                   1470 non-null   object
 16  JobSatisfaction           1470 non-null   object
 17  MaritalStatus             1470 non-null   object
 18  MonthlyIncome             1470 non-null   int64 
 19  MonthlyRate               1470 non-null   int64 
 20  NumCompaniesWorked        1470 non-null   int64 
 21  Over18                    1470 non-null   object
 22  OverTime                  1470 non-null   object
 23  PercentSalaryHike         1470 non-null   int64 
 24  PerformanceRating         1470 non-null   object
 25  RelationshipSatisfaction  1470 non-null   object
 26  StandardHours             1470 non-null   int64 
 27  StockOptionLevel          1470 non-null   object
 28  TotalWorkingYears         1470 non-null   int64 
 29  TrainingTimesLastYear     1470 non-null   int64 
 30  WorkLifeBalance           1470 non-null   object
 31  YearsAtCompany            1470 non-null   int64 
 32  YearsInCurrentRole        1470 non-null   int64 
 33  YearsSinceLastPromotion   1470 non-null   int64 
 34  YearsWithCurrManager      1470 non-null   int64 
dtypes: int64(17), object(18)
memory usage: 402.1+ KB

There are 1470 entries in the dataset with 35 columns. Also, no null values are present in the dataset.
Going by the dtypes reported above, 17 columns are numerical (int64) and 18 are categorical (object).
Almost all the columns are self-explanatory, but we will still look at each one briefly.
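The null-value and column-type checks described above can also be done programmatically rather than read off the df.info() output. The following is a minimal sketch on a small stand-in frame; the name sample and its three rows are assumptions for illustration, not the real data file.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical, three columns only)
sample = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "DailyRate": [1102, 279, 1373],
})

# Total number of missing values across the whole frame
total_missing = int(sample.isnull().sum().sum())

# Split the columns by dtype: numeric columns vs. everything else
numerical = sample.select_dtypes(include="number").columns.tolist()
categorical = sample.select_dtypes(exclude="number").columns.tolist()

print(total_missing)
print(numerical)
print(categorical)
```

Applied to df instead of sample, the same three expressions should give the missing-value count and the two column lists for the full dataset.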

Column Name               Description
------------------------  -----------
Age                       Age of the employee.
Attrition                 Whether the employee left the firm or not. It is the target variable for attrition prediction.
BusinessTravel            Whether the employee needs to travel for business purposes or not.
DailyRate                 Daily rate of the employee for the work.
DistanceFromHome          Distance of the office from the employee’s home.
Education                 Highest qualification the employee completed.
EducationField            Field of study during the education.
EmployeeCount             This column provides no information.
EmployeeNumber            Unique identifier of the employee.
EnvironmentSatisfaction   Satisfaction level of the employee regarding the environment in the office.
Gender                    Gender of the employee.
HourlyRate                Hourly rate of the employee.
JobInvolvement            Satisfaction level of the employee regarding their involvement during the employment.
JobLevel                  Level of the employee in the hierarchy of promotion.
JobSatisfaction           Overall job satisfaction level of the employee.
MaritalStatus             Whether the employee is married or not.
MonthlyIncome             Monthly income of the employee.
MonthlyRate               Monthly rate of the employee.
NumCompaniesWorked        Number of companies that the employee has worked in.
Over18                    This column provides no information.
OverTime                  Whether the employee needs to do overtime or not.
PercentSalaryHike         Recent percentage hike in the salary.
PerformanceRating         Recent performance rating that was awarded.
RelationshipSatisfaction  Satisfaction level regarding the employee’s professional relationships in the company.
StandardHours             Average number of hours of work that the employee puts in every day.
StockOptionLevel          Level of stock options.
TotalWorkingYears         Total experience of the employee in years.
TrainingTimesLastYear     Number of times the employee was trained in the previous year.
WorkLifeBalance           Satisfaction level with regard to work-life balance.
YearsAtCompany            Number of years the employee has worked in the current company.
YearsInCurrentRole        Number of years the employee has worked in the current role.
YearsSinceLastPromotion   Number of years since the employee was last promoted.
YearsWithCurrManager      Number of years the employee has worked for the current manager.

It is better to remove columns that are not required and do not contribute any information to the dataset.

# Drop unnecessary columns
df.drop(['EmployeeCount', 'Over18', 'StandardHours'], axis=1, inplace=True)
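One way to find such uninformative columns, rather than listing them by hand, is to look for columns that hold a single distinct value. The sketch below assumes that this is why the three columns carry no information; the frame sample and its values (including "Y" for Over18) are hypothetical stand-ins, not taken from the data file.

```python
import pandas as pd

# Toy stand-in frame (hypothetical values) with one varying column
# and three columns that never change
sample = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
})

# A column with exactly one distinct value cannot help a model
# tell employees apart
constant_columns = [col for col in sample.columns if sample[col].nunique() == 1]

# Drop them without mutating the original frame
reduced = sample.drop(columns=constant_columns)

print(constant_columns)
print(list(reduced.columns))
```

Returning a new frame from drop, instead of using inplace=True, keeps the original available for comparison; either style works for this step.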

The dropped columns also need to be removed from the lists of numerical and categorical columns.

# Load the static lists
with open('./../../../data/statics.json') as f:
    statics = json.load(f)
categorical_columns = statics['categorical_columns']
numerical_columns = statics['numerical_columns']

# Remove the dropped columns
categorical_columns.remove('Over18')
numerical_columns.remove('EmployeeCount')
numerical_columns.remove('StandardHours')

# Write the new columns back to the static file
statics['categorical_columns'] = categorical_columns
statics['numerical_columns'] = numerical_columns

with open('./../../../data/statics.json', 'w') as f:
    json.dump(statics, f)
# Save the processed data
df.to_csv('./../../../data/cleaned_data.csv', index=False)
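Note that list.remove raises a ValueError when the item is not in the list, so running the removal cell a second time, after statics.json has already been rewritten, would fail. A minimal sketch of a re-runnable alternative follows; the helper without_dropped and the shortened column lists are hypothetical, and only the two dictionary keys come from the code above.

```python
# Columns that were dropped from the DataFrame
dropped = {"EmployeeCount", "Over18", "StandardHours"}

# Hypothetical, shortened stand-in for the contents of statics.json
statics = {
    "categorical_columns": ["Attrition", "Over18"],
    "numerical_columns": ["Age", "EmployeeCount", "StandardHours"],
}

def without_dropped(columns, dropped):
    """Return the columns that are not in the dropped set, keeping order."""
    return [col for col in columns if col not in dropped]

# Filtering twice gives the same result as filtering once,
# so the cell can be re-run safely
for _ in range(2):
    for key in ("categorical_columns", "numerical_columns"):
        statics[key] = without_dropped(statics[key], dropped)

print(statics)
```

The filtered dictionary can then be written back with json.dump exactly as in the cell above.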