Dataset Introduction
The dataset used in this book is IBM HR Analytics Employee Attrition & Performance, hosted on Kaggle. It was uploaded 4 years ago and has not been revised since. The data file is around 222 kB and contains 35 columns. The dataset was published primarily to support predicting employee attrition.
# Import the necessary packages
import pandas as pd
import json
# Load the data
df = pd.read_csv('./../../../data/data.csv')
df.head()
| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | College | Life Sciences | 1 | 1 | ... | Low | 80 | level_0 | 8 | 0 | Bad | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | Below College | Life Sciences | 1 | 2 | ... | Very High | 80 | level_1 | 10 | 3 | Better | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | College | Other | 1 | 4 | ... | Medium | 80 | level_0 | 7 | 3 | Better | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | Master | Life Sciences | 1 | 5 | ... | High | 80 | level_0 | 8 | 3 | Better | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | Below College | Medical | 1 | 7 | ... | Very High | 80 | level_1 | 6 | 3 | Better | 2 | 2 | 2 | 2 |
5 rows × 35 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 1470 non-null int64
1 Attrition 1470 non-null object
2 BusinessTravel 1470 non-null object
3 DailyRate 1470 non-null int64
4 Department 1470 non-null object
5 DistanceFromHome 1470 non-null int64
6 Education 1470 non-null object
7 EducationField 1470 non-null object
8 EmployeeCount 1470 non-null int64
9 EmployeeNumber 1470 non-null int64
10 EnvironmentSatisfaction 1470 non-null object
11 Gender 1470 non-null object
12 HourlyRate 1470 non-null int64
13 JobInvolvement 1470 non-null object
14 JobLevel 1470 non-null object
15 JobRole 1470 non-null object
16 JobSatisfaction 1470 non-null object
17 MaritalStatus 1470 non-null object
18 MonthlyIncome 1470 non-null int64
19 MonthlyRate 1470 non-null int64
20 NumCompaniesWorked 1470 non-null int64
21 Over18 1470 non-null object
22 OverTime 1470 non-null object
23 PercentSalaryHike 1470 non-null int64
24 PerformanceRating 1470 non-null object
25 RelationshipSatisfaction 1470 non-null object
26 StandardHours 1470 non-null int64
27 StockOptionLevel 1470 non-null object
28 TotalWorkingYears 1470 non-null int64
29 TrainingTimesLastYear 1470 non-null int64
30 WorkLifeBalance 1470 non-null object
31 YearsAtCompany 1470 non-null int64
32 YearsInCurrentRole 1470 non-null int64
33 YearsSinceLastPromotion 1470 non-null int64
34 YearsWithCurrManager 1470 non-null int64
dtypes: int64(17), object(18)
memory usage: 402.1+ KB
There are 1470 entries in the dataset with 35 columns. Also, no null values are present in the dataset.
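This book treats 19 columns as numerical and 16 as categorical. Note that this split differs from the dtype counts reported by df.info() above (17 int64 and 18 object).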
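The null and dtype counts above can also be checked programmatically rather than read off the `df.info()` output. The following is a minimal sketch, not the book's own code: the helper name `summarize_columns` and the toy frame are hypothetical, and the toy frame only stands in for the real `df`.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> dict:
    """Count missing cells and split the columns into numeric / non-numeric."""
    return {
        # Total number of missing cells across the whole frame
        "nulls": int(frame.isnull().sum().sum()),
        # Columns with a numeric dtype (int64, float64, ...)
        "numeric": len(frame.select_dtypes(include="number").columns),
        # Everything else (string/object columns in this dataset)
        "non_numeric": len(frame.select_dtypes(exclude="number").columns),
    }


# Toy stand-in for the real data; values copied from the preview above
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "DailyRate": [1102, 279, 1373],
})
summary = summarize_columns(toy)
print(summary)
```

Calling `summarize_columns(df)` on the loaded data should reproduce the dtype split that `df.info()` reports, which makes any mismatch with a hand-maintained column list easy to spot.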
Almost all the columns are self-explanatory, but we will still look at each one briefly.
| Column Name | Description |
|---|---|
| Age | Age of the employee. |
| Attrition | Whether the employee left the firm. This is the target variable for attrition prediction. |
| BusinessTravel | Whether the employee needs to travel for business purposes. |
| DailyRate | Daily rate of the employee for the work. |
| DistanceFromHome | Distance of the office from the employee's home. |
| Education | Highest level of education the employee completed. |
| EducationField | Field of study during the education. |
| EmployeeCount | This column provides no information. |
| EmployeeNumber | Unique identifier of the employee. |
| EnvironmentSatisfaction | Satisfaction level of the employee regarding the office environment. |
| Gender | Gender of the employee. |
| HourlyRate | Hourly rate of the employee. |
| JobInvolvement | Satisfaction level of the employee regarding their involvement during the employment. |
| JobLevel | Level of the employee in the hierarchy of promotion. |
| JobSatisfaction | Overall job satisfaction level of the employee. |
| MaritalStatus | Whether the employee is married or not. |
| MonthlyIncome | Monthly income of the employee. |
| MonthlyRate | Monthly rate of the employee. |
| NumCompaniesWorked | Number of companies the employee has worked in. |
| Over18 | This column provides no information. |
| OverTime | Whether the employee needs to do overtime. |
| PercentSalaryHike | Most recent percentage hike in salary. |
| PerformanceRating | Most recent performance rating awarded. |
| RelationshipSatisfaction | Satisfaction level regarding the employee's professional relationships in the company. |
| StandardHours | Average number of hours of work that the employee puts in every day. |
| StockOptionLevel | Level of stock options. |
| TotalWorkingYears | Total experience of the employee in years. |
| TrainingTimesLastYear | Number of times the employee was trained in the previous year. |
| WorkLifeBalance | Satisfaction level with regard to work-life balance. |
| YearsAtCompany | Number of years the employee has worked in the current company. |
| YearsInCurrentRole | Number of years the employee has worked in the current role. |
| YearsSinceLastPromotion | Number of years since the employee was last promoted. |
| YearsWithCurrManager | Number of years the employee has worked for the current manager. |
It is better to remove columns that are not required and do not contribute any information to the dataset.
# Drop unnecessary columns
df.drop(['EmployeeCount', 'Over18', 'StandardHours'], axis=1, inplace=True)
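One way to confirm that a column carries no information is to check whether it holds a single distinct value. In the preview above, `EmployeeCount` is 1 and `StandardHours` is 80 in every displayed row, which suggests (but does not prove) that they are constant. The sketch below is an illustration under that assumption: the helper `constant_columns` is hypothetical, and the toy values, including `"Y"` for `Over18`, are stand-ins rather than data from the file.

```python
import pandas as pd


def constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns that contain exactly one distinct value."""
    return [
        col for col in frame.columns
        # dropna=False counts NaN as its own value, so an all-NaN column
        # is also reported as constant
        if frame[col].nunique(dropna=False) == 1
    ]


# Toy stand-in for the real data (hypothetical values)
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
})
to_drop = constant_columns(toy)
cleaned = toy.drop(columns=to_drop)
print(to_drop)
```

Running `constant_columns(df)` before the drop would show whether the three hard-coded names are the only constant columns in the real data.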
Remove the dropped columns from the lists of numerical and categorical columns as well.
# Load the static lists
with open('./../../../data/statics.json') as f:
    statics = json.load(f)
categorical_columns = statics['categorical_columns']
numerical_columns = statics['numerical_columns']
# Remove the dropped columns from the lists
categorical_columns.remove('Over18')
numerical_columns.remove('EmployeeCount')
numerical_columns.remove('StandardHours')
# Write the updated lists back to the static file
statics['categorical_columns'] = categorical_columns
statics['numerical_columns'] = numerical_columns
with open('./../../../data/statics.json', 'w') as f:
    json.dump(statics, f)
# Save the processed data
df.to_csv('./../../../data/cleaned_data.csv', index=False)
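Because the updated lists are written back to the same `statics.json`, running the list-update cell a second time would raise `ValueError`, since `list.remove` fails when the name is already gone. A possible alternative, sketched here with a hypothetical helper `without` and a toy dictionary in place of the real file contents, is to filter the lists so the step can be repeated safely.

```python
def without(names, dropped):
    """Return a copy of `names` with every entry in `dropped` removed."""
    dropped = set(dropped)
    return [name for name in names if name not in dropped]


# Toy stand-in for the contents of statics.json (hypothetical subset)
statics = {
    "categorical_columns": ["Attrition", "Over18"],
    "numerical_columns": ["Age", "EmployeeCount", "StandardHours"],
}
dropped = ["EmployeeCount", "Over18", "StandardHours"]

# Applying the filter twice gives the same result as applying it once
for _ in range(2):
    for key in ("categorical_columns", "numerical_columns"):
        statics[key] = without(statics[key], dropped)

print(statics)
```

The trade-off is that a misspelled column name passes silently here, whereas `list.remove` would flag it on the first run.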