Exploratory Data Analysis - Univariate Analysis¶
import json
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from ipywidgets import widgets
from scipy.stats import shapiro
import statsmodels.api as sm
import plotly.io as pio
pio.renderers.default = "vscode"
df = pd.read_csv('./../../../data/cleaned_data.csv')
# Load lists of numerical and categorical columns from the static file
with open('./../../../data/statics.json') as f:
    statics = json.load(f)
categorical_columns = statics['categorical_columns']
numerical_columns = statics['numerical_columns']
# Separate the dataframe into numerical and categorical dataframes
num_df = df[numerical_columns]
cat_df = df[categorical_columns]
Numerical Columns¶
Distribution¶
# Descriptive statistics for numerical variables
num_df.describe()
| | Age | DailyRate | DistanceFromHome | EmployeeNumber | HourlyRate | MonthlyIncome | MonthlyRate | NumCompaniesWorked | PercentSalaryHike | TotalWorkingYears | TrainingTimesLastYear | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 |
| mean | 36.923810 | 802.485714 | 9.192517 | 1024.865306 | 65.891156 | 6502.931293 | 14313.103401 | 2.693197 | 15.209524 | 11.279592 | 2.799320 | 7.008163 | 4.229252 | 2.187755 | 4.123129 |
| std | 9.135373 | 403.509100 | 8.106864 | 602.024335 | 20.329428 | 4707.956783 | 7117.786044 | 2.498009 | 3.659938 | 7.780782 | 1.289271 | 6.126525 | 3.623137 | 3.222430 | 3.568136 |
| min | 18.000000 | 102.000000 | 1.000000 | 1.000000 | 30.000000 | 1009.000000 | 2094.000000 | 0.000000 | 11.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 30.000000 | 465.000000 | 2.000000 | 491.250000 | 48.000000 | 2911.000000 | 8047.000000 | 1.000000 | 12.000000 | 6.000000 | 2.000000 | 3.000000 | 2.000000 | 0.000000 | 2.000000 |
| 50% | 36.000000 | 802.000000 | 7.000000 | 1020.500000 | 66.000000 | 4919.000000 | 14235.500000 | 2.000000 | 14.000000 | 10.000000 | 3.000000 | 5.000000 | 3.000000 | 1.000000 | 3.000000 |
| 75% | 43.000000 | 1157.000000 | 14.000000 | 1555.750000 | 83.750000 | 8379.000000 | 20461.500000 | 4.000000 | 18.000000 | 15.000000 | 3.000000 | 9.000000 | 7.000000 | 3.000000 | 7.000000 |
| max | 60.000000 | 1499.000000 | 29.000000 | 2068.000000 | 100.000000 | 19999.000000 | 26999.000000 | 9.000000 | 25.000000 | 40.000000 | 6.000000 | 40.000000 | 18.000000 | 15.000000 | 17.000000 |
From the above table, it can be observed that several columns are strongly right-skewed, with means well above their medians: \(MonthlyIncome\) (mean 6502.93, median 4919), \(YearsAtCompany\) (mean 7.01, median 5), and \(YearsSinceLastPromotion\) (mean 2.19, median 1). More information can be obtained by observing the distribution of all the variables.
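The skew read off the summary table can also be quantified directly. The sketch below is illustrative only: `toy_df` is made-up data standing in for `num_df`, and the column names are hypothetical. It uses `DataFrame.skew()` and the gap between mean and median as two views of the same signal.

```python
import pandas as pd

# Toy data standing in for num_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_skewed": [1, 1, 1, 2, 2, 3, 20],
})

# Sample skewness per column: about 0 for symmetric data,
# clearly positive when the right tail is long.
skewness = toy_df.skew()

# A mean sitting well above the median points to the same right skew.
mean_minus_median = toy_df.mean() - toy_df.median()

print(skewness)
print(mean_minus_median)
```

On the real data the same two lines could be applied to `num_df`, assuming it holds only numeric columns as in this notebook.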
# Create interactive plots
# Create a widget for selecting column
numcols = widgets.Dropdown(options=numerical_columns, value=numerical_columns[0], description="Numerical columns")
# Create plotly trace of histogram
num_trace1 = go.Histogram(x=num_df[numerical_columns[0]],
                          histnorm='probability',
                          name='Distribution')
# Create plotly trace of box plot
num_trace2 = go.Box(x=num_df[numerical_columns[0]],
                    boxpoints='outliers', name='Quartiles representation')
# Create a widget for histogram
ng1 = go.FigureWidget(data=[num_trace1],
                      layout=go.Layout(
                          title=dict(text='Distribution of features')
                      ))
# Create a widget for box plot
ng2 = go.FigureWidget(data=[num_trace2],
                      layout=go.Layout(
                          title=dict(text='Quartiles representation of features')
                      ))
# Create a function for observing the change in the selection
def num_response(change):
    """
    Function to update the values in the graph based on the selected column.
    """
    with ng1.batch_update():
        ng1.data[0].x = num_df[numcols.value]
        ng1.layout.xaxis.title = 'Distribution of ' + str(numcols.value) + ' variable'
    with ng2.batch_update():
        ng2.data[0].x = num_df[numcols.value]
        ng2.layout.xaxis.title = numcols.value
numcols.observe(num_response, names='value')
num_container = widgets.VBox([numcols, ng1, ng2])
display(num_container)
From the above distributions, the following observations can be noted:
The average age of the participants is 37 years, while the median age is 36 years. The working population is represented across almost the full range, from the age of 18 to the age of 60. There are no outliers in the dataset as far as age is concerned.
The variables that approximately follow a uniform distribution are daily rate, hourly rate (with an exception at the top of its range, around 100), and monthly rate.
The positively skewed variables include distance from home, monthly income, number of companies worked, percentage salary hike, total working years, and years at the company.
Two variables have double peaks: years in current role and years since last promotion.
Only one variable, the number of trainings in the last year, appears to follow a normal distribution.
Outliers are present in monthly income, number of companies worked, total working years, number of trainings in the last year, years at company, years in current role, years since last promotion, and years with current manager. Deciding whether to keep or remove these outliers requires a closer look at each variable.
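As a starting point for that closer look, the outliers flagged by a box plot can be counted numerically. The sketch below assumes the common 1.5 × IQR whisker rule. `iqr_outlier_counts` and `toy_df` are hypothetical names introduced for illustration, and the quartile method used by pandas may differ slightly from the one Plotly uses when drawing the boxes.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparisons align on column labels; summing booleans counts flagged rows.
    outside = frame.lt(lower, axis="columns") | frame.gt(upper, axis="columns")
    return outside.sum()

# Toy data standing in for num_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "no_outliers": [1, 2, 3, 4, 5, 6, 7, 8],
    "one_outlier": [1, 2, 3, 4, 5, 6, 7, 100],
})
counts = iqr_outlier_counts(toy_df)
print(counts)
```

Calling `iqr_outlier_counts(num_df)` would give a per-column count, which helps rank the variables that need attention first.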
Normality check¶
# Collect one row per column, then build the dataframe once
# (DataFrame.append was removed in pandas 2.0)
sw_rows = []
for column in numerical_columns:
    result = shapiro(num_df[column])
    # Alpha is set to 5%
    is_norm = result[1] > 0.05
    sw_rows.append({
        'Name of the column': column,
        'SW Statistics': result[0],
        'P-value': result[1],
        'Is Normal': is_norm
    })
sw_df = pd.DataFrame(sw_rows, columns=['Name of the column', 'SW Statistics', 'P-value', 'Is Normal'])
sw_df
| | Name of the column | SW Statistics | P-value | Is Normal |
|---|---|---|---|---|
| 0 | Age | 0.977448 | 2.035274e-14 | False |
| 1 | DailyRate | 0.954398 | 5.330206e-21 | False |
| 2 | DistanceFromHome | 0.861593 | 4.085809e-34 | False |
| 3 | EmployeeNumber | 0.952486 | 2.001128e-21 | False |
| 4 | HourlyRate | 0.955029 | 7.413545e-21 | False |
| 5 | MonthlyIncome | 0.827908 | 4.403389e-37 | False |
| 6 | MonthlyRate | 0.954464 | 5.515457e-21 | False |
| 7 | NumCompaniesWorked | 0.848779 | 2.634180e-35 | False |
| 8 | PercentSalaryHike | 0.900604 | 7.476921e-30 | False |
| 9 | TotalWorkingYears | 0.907428 | 5.628518e-29 | False |
| 10 | TrainingTimesLastYear | 0.895095 | 1.583637e-30 | False |
| 11 | YearsAtCompany | 0.838994 | 3.669825e-36 | False |
| 12 | YearsInCurrentRole | 0.896182 | 2.140117e-30 | False |
| 13 | YearsSinceLastPromotion | 0.703726 | 4.203895e-45 | False |
| 14 | YearsWithCurrManager | 0.897460 | 3.058352e-30 | False |
Since the dataset is not huge (1,470 rows), it is safe to trust these p-values and conclude, at the 5% significance level, that not a single variable follows a normal distribution.
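To see how the decision rule behaves, here is a minimal sketch on synthetic data, not on the dataset itself. The exponential sample and the names `skewed_sample`, `alpha`, and `is_normal` are assumptions made for illustration. The result returned by `shapiro` exposes `statistic` and `pvalue` by name as well as by position.

```python
import numpy as np
from scipy.stats import shapiro

# Synthetic, clearly right-skewed sample (assumption: exponential data).
rng = np.random.default_rng(0)
skewed_sample = rng.exponential(scale=1.0, size=500)

result = shapiro(skewed_sample)

# Same rule as above: reject normality when the p-value is at most alpha.
alpha = 0.05
is_normal = bool(result.pvalue > alpha)

print(result.statistic, result.pvalue, is_normal)
```

A W statistic close to 1 indicates agreement with a normal distribution. A strongly skewed sample like this one yields a lower statistic and a very small p-value.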
Categorical Columns¶
Distribution¶
# Create interactive plots
# Create widget for selecting column
catcols = widgets.Dropdown(options=categorical_columns, value=categorical_columns[0], description='Categorical columns')
# Create bar plot trace for histogram
cat_trace1 = go.Bar(x=cat_df[categorical_columns[0]].value_counts().index,
                    y=cat_df[categorical_columns[0]].value_counts().values)
# Create a widget for bar plot
cg = go.FigureWidget(data=[cat_trace1],
                     layout=go.Layout(
                         title=dict(text="Distribution of features")
                     ))
# Create function for observing the change in the column name
def cat_response(change):
    with cg.batch_update():
        cg.data[0].x = cat_df[catcols.value].value_counts().index
        cg.data[0].y = cat_df[catcols.value].value_counts().values
        cg.layout.xaxis.title = 'Distribution of ' + str(catcols.value) + ' variable'
catcols.observe(cat_response, names='value')
cat_container = widgets.VBox([catcols, cg])
display(cat_container)
From the above bar charts, the following observations can be noted:
The target variable is highly imbalanced.
Most of the employees travel rarely. Frequent travellers and non-travellers are far fewer than rare travellers.
Most of the employees belong to the Research and Development department, followed by Sales and then Human Resources.
The largest group of employees holds a Bachelor’s degree, followed by employees who also completed a Master’s degree.
The largest groups of employees majored in Life Sciences and Medical. Employees with majors in Marketing, Technical Degree, Human Resources, and Other are far fewer than those in the top 2 fields.
People are quite content with the environment in which they are working.
The dataset contains more males than females.
Employees are also content with their involvement in their respective jobs.
Most of the employees belong to the lower levels in the hierarchy, mostly level 1 and level 2.
The top 5 roles that exist in the current samples are sales executive, research scientist, laboratory technician, manufacturing director and healthcare representative.
Most of the employees are satisfied with their jobs, but a significant number are not.
The largest group of employees is married, but a significant portion is divorced.
Around one-third of employees work overtime.
Performance ratings for all employees fall in only 2 bands, i.e. excellent and outstanding.
Most of the employees are satisfied with their relationship with the company, but a significant portion does not feel so.
More than 75% of employees have stock option levels 0 or 1.
More than 80% of employees feel that they have a work-life balance.
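Statements about shares, such as "around one-third" or "more than 75%", can be checked numerically instead of being read off the bars. The sketch below uses made-up series, so the names `toy_flag` and `toy_level` and their values are hypothetical. It relies on `value_counts(normalize=True)` for shares and on a cumulative sum over sorted levels for "at or below" statements.

```python
import pandas as pd

# Toy categorical data; the values are made up for illustration.
toy_flag = pd.Series(["No", "No", "Yes", "No", "No", "No", "Yes", "No"])
toy_level = pd.Series([0, 0, 0, 1, 1, 2, 3, 0])

# Share of each category (the shares sum to 1).
shares = toy_flag.value_counts(normalize=True)

# Cumulative share over ordered levels, e.g. the share at level 1 or below.
cumulative = toy_level.value_counts(normalize=True).sort_index().cumsum()

print(shares)
print(cumulative)
```

For the notebook's data, the same calls could be applied to any column of `cat_df`, for example `cat_df[catcols.value].value_counts(normalize=True)`.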