Dataset Introduction

The dataset used in this book is IBM HR Analytics Employee Attrition & Performance, hosted on Kaggle. It was uploaded 4 years ago and has not been revised since. The data file is around 222 kB in size and contains 35 columns. The dataset was published primarily to support predicting employee attrition.

# Import the necessary packages
import pandas as pd 
import json

# Load the data
df = pd.read_csv('./../../../data/data.csv')

df.head()

	Age	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EmployeeNumber	...	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
0	41	Yes	Travel_Rarely	1102	Sales	1	College	Life Sciences	1	1	...	Low	80	level_0	8	0	Bad	6	4	0	5
1	49	No	Travel_Frequently	279	Research & Development	8	Below College	Life Sciences	1	2	...	Very High	80	level_1	10	3	Better	10	7	1	7
2	37	Yes	Travel_Rarely	1373	Research & Development	2	College	Other	1	4	...	Medium	80	level_0	7	3	Better	0	0	0	0
3	33	No	Travel_Frequently	1392	Research & Development	3	Master	Life Sciences	1	5	...	High	80	level_0	8	3	Better	8	7	3	0
4	27	No	Travel_Rarely	591	Research & Development	2	Below College	Medical	1	7	...	Very High	80	level_1	6	3	Better	2	2	2	2

5 rows × 35 columns

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64
 6   Education                 1470 non-null   object
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64
 9   EmployeeNumber            1470 non-null   int64
 10  EnvironmentSatisfaction   1470 non-null   object
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64
 13  JobInvolvement            1470 non-null   object
 14  JobLevel                  1470 non-null   object
 15  JobRole                   1470 non-null   object
 16  JobSatisfaction           1470 non-null   object
 17  MaritalStatus             1470 non-null   object
 18  MonthlyIncome             1470 non-null   int64
 19  MonthlyRate               1470 non-null   int64
 20  NumCompaniesWorked        1470 non-null   int64
 21  Over18                    1470 non-null   object
 22  OverTime                  1470 non-null   object
 23  PercentSalaryHike         1470 non-null   int64
 24  PerformanceRating         1470 non-null   object
 25  RelationshipSatisfaction  1470 non-null   object
 26  StandardHours             1470 non-null   int64
 27  StockOptionLevel          1470 non-null   object
 28  TotalWorkingYears         1470 non-null   int64
 29  TrainingTimesLastYear     1470 non-null   int64
 30  WorkLifeBalance           1470 non-null   object
 31  YearsAtCompany            1470 non-null   int64
 32  YearsInCurrentRole        1470 non-null   int64
 33  YearsSinceLastPromotion   1470 non-null   int64
 34  YearsWithCurrManager      1470 non-null   int64
dtypes: int64(17), object(18)
memory usage: 402.1+ KB

There are 1470 entries in the dataset with 35 columns. Also, no null values are present in the dataset.
According to the dtypes reported above, 17 columns are numerical (int64) and 18 are categorical (object).
Almost all the columns are self-explanatory, but we will still look at each column briefly.
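
The counts quoted above can be checked directly rather than read off the `df.info()` output. The sketch below uses a small hypothetical frame (`sample`) with a few columns in the style of the dataset, so that it runs without the CSV file; on the real data the same calls would be made on `df`.

```python
import pandas as pd

# Hypothetical miniature of the dataset, for illustration only.
sample = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "DailyRate": [1102, 279, 1373],
    "Department": ["Sales", "Research & Development", "Research & Development"],
})

# Total number of missing values across the whole frame.
total_nulls = int(sample.isnull().sum().sum())

# Split the columns by dtype: numeric columns versus everything else.
numerical = sample.select_dtypes(include="number").columns.tolist()
categorical = sample.select_dtypes(exclude="number").columns.tolist()

print(total_nulls)
print(numerical)
print(categorical)
```

Note that a dtype-based split is only a first approximation: an integer column that encodes a category would still be counted as numerical.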

Column Name	Description
Age	Age of the employee.
Attrition	Whether the employee left the firm or not. It is the target variable for attrition prediction.
BusinessTravel	How frequently the employee travels for business purposes.
DailyRate	Daily rate of the employee for the work.
DistanceFromHome	Distance of the office from the employee’s home.
Education	Highest level of education the employee completed.
EducationField	Field of study during the education.
EmployeeCount	This column provides no information.
EmployeeNumber	Unique identifier of employee.
EnvironmentSatisfaction	Satisfaction level of employee regarding the environment in the office.
Gender	Gender of the employee.
HourlyRate	Hourly Rate of the employee.
JobInvolvement	Satisfaction level of employee regarding their involvement during the employment.
JobLevel	Level of employee in the hierarchy of promotion.
JobSatisfaction	Overall job satisfaction level of employee.
MaritalStatus	Whether the employee is married or not.
MonthlyIncome	Monthly income of the employee.
MonthlyRate	Monthly rate of the employee.
NumCompaniesWorked	Number of companies that the employee worked in.
Over18	This column provides no information.
OverTime	Whether the employee needs to do overtime or not.
PercentSalaryHike	Recent percentage hike in the salary.
PerformanceRating	Recent performance rating that was awarded.
RelationshipSatisfaction	Satisfaction level regarding the employee’s professional relationships in the company.
StandardHours	Average number of hours of work that the employee puts in every day.
StockOptionLevel	Level of stock options.
TotalWorkingYears	Total experience of employee in years.
TrainingTimesLastYear	Number of times employee was trained in the previous year.
WorkLifeBalance	Satisfaction level with regards to work life balance.
YearsAtCompany	Number of years the employee worked in the current company.
YearsInCurrentRole	Number of years the employee worked in current role.
YearsSinceLastPromotion	Number of years since the employee was last promoted.
YearsWithCurrManager	Number of years the employee worked for the current manager.

It is better to remove columns that are not required and do not contribute to the information gained from the dataset.

# Drop unnecessary columns
df.drop(['EmployeeCount', 'Over18', 'StandardHours'], axis=1, inplace=True)
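
Columns of this kind can also be located programmatically. The following is a minimal sketch, assuming that "provides no information" means the column holds the same value in every row; the frame `sample` and its values are hypothetical stand-ins for `df`.

```python
import pandas as pd

# Hypothetical stand-in for df, with three constant columns.
sample = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
})

# A column with exactly one distinct value cannot help distinguish employees.
constant_columns = [col for col in sample.columns if sample[col].nunique() == 1]

# drop() without inplace returns a new frame and leaves the original untouched.
reduced = sample.drop(columns=constant_columns)

print(constant_columns)
print(reduced.columns.tolist())
```

Deriving the list this way avoids hard-coding column names, although an explicit list, as used in this section, documents the decision more clearly.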

Remove the dropped columns from the list of numerical and categorical columns.

# Load the static lists
with open('./../../../data/statics.json') as f:
    statics = json.load(f)
categorical_columns = statics['categorical_columns']
numerical_columns = statics['numerical_columns']

# Remove the dropped columns from both lists
categorical_columns.remove('Over18')
numerical_columns.remove('EmployeeCount')
numerical_columns.remove('StandardHours')

# Write the new columns back to the static file
statics['categorical_columns'] = categorical_columns
statics['numerical_columns'] = numerical_columns

with open('./../../../data/statics.json', 'w') as f:
    json.dump(statics, f)
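
One caveat with `list.remove` is that it raises `ValueError` when the item is absent, which would happen if this cell were executed a second time after `statics.json` has already been rewritten. A sketch of a re-runnable alternative follows, using an in-memory dictionary with hypothetical contents in place of the file; the helper name `without_dropped` is also an assumption made for illustration.

```python
# Hypothetical stand-in for the contents of statics.json.
statics = {
    "categorical_columns": ["Attrition", "Over18", "OverTime"],
    "numerical_columns": ["Age", "EmployeeCount", "StandardHours"],
}

dropped = {"EmployeeCount", "Over18", "StandardHours"}


def without_dropped(columns, dropped):
    """Return the columns that are not in the dropped set, preserving order."""
    return [col for col in columns if col not in dropped]


# Filtering is idempotent: applying it again leaves the lists unchanged.
for key in ("categorical_columns", "numerical_columns"):
    statics[key] = without_dropped(statics[key], dropped)

print(statics)
```

Because the filter never fails on a missing name, the cell can be re-run safely against a file that has already been cleaned.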

# Save the processed data
df.to_csv('./../../../data/cleaned_data.csv', index=False)
