Exploratory Data Analysis - Univariate Analysis

import json
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from ipywidgets import widgets
from scipy.stats import shapiro
import statsmodels.api as sm
import plotly.io as pio
pio.renderers.default = "vscode"
df = pd.read_csv('./../../../data/cleaned_data.csv')
# Load lists of numerical and categorical columns from the static file
with open('./../../../data/statics.json') as f:
    statics = json.load(f)
categorical_columns = statics['categorical_columns']
numerical_columns = statics['numerical_columns']
# Separate the dataframe into numerical and categorical dataframes
num_df = df[numerical_columns]
cat_df = df[categorical_columns]

Numerical Columns

Distribution

# Descriptive statistics for numerical variables
num_df.describe()
Age DailyRate DistanceFromHome EmployeeNumber HourlyRate MonthlyIncome MonthlyRate NumCompaniesWorked PercentSalaryHike TotalWorkingYears TrainingTimesLastYear YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
count 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000
mean 36.923810 802.485714 9.192517 1024.865306 65.891156 6502.931293 14313.103401 2.693197 15.209524 11.279592 2.799320 7.008163 4.229252 2.187755 4.123129
std 9.135373 403.509100 8.106864 602.024335 20.329428 4707.956783 7117.786044 2.498009 3.659938 7.780782 1.289271 6.126525 3.623137 3.222430 3.568136
min 18.000000 102.000000 1.000000 1.000000 30.000000 1009.000000 2094.000000 0.000000 11.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 30.000000 465.000000 2.000000 491.250000 48.000000 2911.000000 8047.000000 1.000000 12.000000 6.000000 2.000000 3.000000 2.000000 0.000000 2.000000
50% 36.000000 802.000000 7.000000 1020.500000 66.000000 4919.000000 14235.500000 2.000000 14.000000 10.000000 3.000000 5.000000 3.000000 1.000000 3.000000
75% 43.000000 1157.000000 14.000000 1555.750000 83.750000 8379.000000 20461.500000 4.000000 18.000000 15.000000 3.000000 9.000000 7.000000 3.000000 7.000000
max 60.000000 1499.000000 29.000000 2068.000000 100.000000 19999.000000 26999.000000 9.000000 25.000000 40.000000 6.000000 40.000000 18.000000 15.000000 17.000000

From the above table, it can be observed that the highly skewed columns include \(MonthlyIncome\), \(YearsAtCompany\), and \(YearsSinceLastPromotion\): in each of them the mean lies well above the median. More information can be obtained by observing the distribution of all the variables.
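The mean-versus-median comparison above can also be expressed as a number per column. The following is a minimal sketch, assuming pandas' built-in sample skewness is an acceptable measure; the helper name `skew_summary` and the toy data are illustrative and not part of the original notebook.

```python
import pandas as pd


def skew_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return mean, median, and sample skewness per column, most skewed first."""
    summary = pd.DataFrame({
        "mean": frame.mean(),
        "median": frame.median(),
        "skew": frame.skew(),
    })
    return summary.sort_values("skew", ascending=False)


# Toy data for illustration only; these values are not taken from the dataset.
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 2, 2, 50],
})
print(skew_summary(toy))
```

In the notebook itself the same helper could be called as `skew_summary(num_df)`; a positive skew value corresponds to a long right tail.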

# Create interactive plots

# Create a widget for selecting column
numcols = widgets.Dropdown(options = numerical_columns, value = numerical_columns[0], description="Numerical columns")

# Create plotly trace of histogram
num_trace1 = go.Histogram(x=num_df[numerical_columns[0]], 
                         histnorm='probability', 
                         name = 'Distribution')

# Create plotly trace of box plot
num_trace2 = go.Box(x=num_df[numerical_columns[0]], 
                   boxpoints='outliers', name = 'Quartiles representation')

# Create a widget for histogram
ng1 = go.FigureWidget(data=[num_trace1],
                     layout = go.Layout(
                         title = dict(text='Distribution of features')
                     ))

# Create a widget for box plot
ng2 = go.FigureWidget(data=[num_trace2],
                     layout = go.Layout(
                         title = dict(text='Quartiles representation of features')
                     ))

# Create a function for observing the change in the selection
def num_response(change):
    """
    Function to update the values in the graph based on the selected column.
    """
    with ng1.batch_update():
        ng1.data[0].x = num_df[numcols.value]
        ng1.layout.xaxis.title = 'Distribution of ' + str(numcols.value) + ' variable'
    
    with ng2.batch_update():
        ng2.data[0].x = num_df[numcols.value]
        ng2.layout.xaxis.title = numcols.value
    
numcols.observe(num_response, names='value')

num_container = widgets.VBox([numcols, ng1, ng2])
display(num_container)

From the above distributions, the following observations can be noted:

  • The average age of the participants is 37 years, while the median age is 36 years. Nearly the whole working-age population is represented, from the age of 18 to the age of 60. Age has no outliers in this dataset.

  • The variables that approximately follow a uniform distribution are daily rate, hourly rate (with the exception of values greater than 100), and monthly rate.

  • Several variables are positively skewed: distance from home, monthly income, number of companies worked, percentage hike, total working years, and years at company.

  • Two variables have double peaks: years in current role and years since last promotion.

  • Only one variable, the number of trainings in the last year, appears to follow a normal distribution.

  • Outliers are present in monthly income, number of companies worked, total working years, number of trainings in last year, years at company, years in current role, years since last promotion, and years with current manager. Deciding whether to keep or remove these outliers requires a closer look at each variable.
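The outliers listed above are read off the box plots. A rough way to count them per column is the common 1.5 × IQR rule, sketched below. The helper name `iqr_outlier_counts`, the factor `k`, and the toy data are assumptions made for illustration, and pandas may compute quartiles slightly differently from the plotting library, so the counts need not match the plots exactly.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values lying more than k * IQR below Q1 or above Q3, per column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    below = frame.lt(q1 - k * iqr, axis="columns")
    above = frame.gt(q3 + k * iqr, axis="columns")
    return (below | above).sum()


# Toy data for illustration only; these values are not taken from the dataset.
toy = pd.DataFrame({
    "with_outlier": [1, 2, 3, 4, 5, 100],
    "without_outlier": [1, 2, 3, 4, 5, 6],
})
print(iqr_outlier_counts(toy))
```

Applied to the notebook's data, this would be `iqr_outlier_counts(num_df)`. A count alone does not say whether a value is an error or a genuine extreme, so it only narrows down which variables to inspect.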

Normality check

# Run the Shapiro-Wilk test on every numerical column
sw_rows = []
for column in numerical_columns:
    statistic, p_value = shapiro(num_df[column])
    # Alpha is set to 5%
    sw_rows.append({
        'Name of the column': column,
        'SW Statistics': statistic,
        'P-value': p_value,
        'Is Normal': p_value > 0.05
    })
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
sw_df = pd.DataFrame(sw_rows, columns=['Name of the column', 'SW Statistics', 'P-value', 'Is Normal'])
sw_df
Name of the column SW Statistics P-value Is Normal
0 Age 0.977448 2.035274e-14 False
1 DailyRate 0.954398 5.330206e-21 False
2 DistanceFromHome 0.861593 4.085809e-34 False
3 EmployeeNumber 0.952486 2.001128e-21 False
4 HourlyRate 0.955029 7.413545e-21 False
5 MonthlyIncome 0.827908 4.403389e-37 False
6 MonthlyRate 0.954464 5.515457e-21 False
7 NumCompaniesWorked 0.848779 2.634180e-35 False
8 PercentSalaryHike 0.900604 7.476921e-30 False
9 TotalWorkingYears 0.907428 5.628518e-29 False
10 TrainingTimesLastYear 0.895095 1.583637e-30 False
11 YearsAtCompany 0.838994 3.669825e-36 False
12 YearsInCurrentRole 0.896182 2.140117e-30 False
13 YearsSinceLastPromotion 0.703726 4.203895e-45 False
14 YearsWithCurrManager 0.897460 3.058352e-30 False

Since the dataset is not huge (1,470 rows), it is safe for us to trust these values. Every p-value is far below the 5% threshold, so we conclude that not a single variable follows a normal distribution.
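A Q-Q comparison can complement the p-values by showing how far a sample departs from normality, not only whether it does. The sketch below uses `scipy.stats.probplot` on synthetic samples; the helper name `qq_correlation`, the seed, and the two samples are assumptions for illustration and are not part of the original analysis.

```python
import numpy as np
from scipy.stats import probplot


def qq_correlation(values) -> float:
    """Correlation between the ordered sample and theoretical normal quantiles.

    Values close to 1 mean the Q-Q points lie close to a straight line.
    """
    (_, _), (_, _, r) = probplot(values, dist="norm")
    return float(r)


# Synthetic samples for illustration only; not taken from the dataset.
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=1000)
skewed_sample = rng.exponential(scale=1.0, size=1000)

print("normal sample r:", round(qq_correlation(normal_sample), 4))
print("skewed sample r:", round(qq_correlation(skewed_sample), 4))
```

In the notebook, `qq_correlation(num_df[column])` could be added as an extra column next to the Shapiro-Wilk results to rank the variables by how non-normal they look.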

Categorical Columns

Distribution

# Create interactive plots

# Create widget for selecting column
catcols = widgets.Dropdown(options=categorical_columns, value=categorical_columns[0], description='Categorical columns')

# Create bar plot trace of the category counts
cat_trace1 = go.Bar(x = cat_df[categorical_columns[0]].value_counts().index, 
                    y = cat_df[categorical_columns[0]].value_counts().values)

# Create a widget for bar plot
cg = go.FigureWidget(data=[cat_trace1],
                     layout=go.Layout(
                         title = dict(text="Distribution of features")
                     ))

# Create function for observing the change in the column name
def cat_response(change):
    with cg.batch_update():
        cg.data[0].x = cat_df[catcols.value].value_counts().index
        cg.data[0].y = cat_df[catcols.value].value_counts().values
        cg.layout.xaxis.title = 'Distribution of ' + str(catcols.value) + ' variable'
        
catcols.observe(cat_response, names='value')

cat_container = widgets.VBox([catcols, cg])
display(cat_container)

From the above bar charts, the following observations can be noted:

  • The target variable is highly imbalanced.

  • Most of the employees travel rarely. Frequent travellers and non-travellers are far fewer than rare travellers.

  • Most of the employees belong to the Research and Development department, followed by Sales and then Human Resources.

  • The largest group of employees completed a Bachelor’s degree, followed by employees who also completed a Master’s degree.

  • The largest groups of employees majored in Life Sciences and Medical. Employees who majored in Marketing, Technical Degree, Human Resources, and Other are far fewer than those in the top 2 fields.

  • People are quite content with the environment in which they are working.

  • The dataset contains more males than females.

  • Employees are also content with their involvement in their respective jobs.

  • Most of the employees belong to the lower levels of the hierarchy, mostly level 1 and level 2.

  • The top 5 roles that exist in the current samples are sales executive, research scientist, laboratory technician, manufacturing director and healthcare representative.

  • Most of the employees are satisfied with their jobs, but a significant number are not.

  • The largest group of employees are married, but a significant portion are divorced.

  • Around one-third of the employees work overtime.

  • Performance ratings for all employees lie in only 2 bands, i.e. excellent and outstanding.

  • Most of the employees are satisfied with their relationship with the company, but a significant portion do not feel so.

  • More than 75% of the population own stock options at levels 0 and 1.

  • More than 80% of employees feel that work-life balance is available.
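Statements such as "highly imbalanced", "around one-third", and "more than 75%" can be checked by turning the bar heights into percentages. The following is a minimal sketch using `value_counts(normalize=True)`; the helper name `category_shares` and the toy series are hypothetical and stand in for any categorical column.

```python
import pandas as pd


def category_shares(series: pd.Series) -> pd.DataFrame:
    """Return the count and percentage share of each category in a column."""
    counts = series.value_counts()
    shares = series.value_counts(normalize=True).mul(100).round(1)
    return pd.DataFrame({"count": counts, "percent": shares})


# Toy labels for illustration only; not taken from the dataset.
toy = pd.Series(["No"] * 8 + ["Yes"] * 2, name="label")
print(category_shares(toy))
```

In the notebook, `category_shares(cat_df[column])` could be looped over `categorical_columns` to put a number behind each observation in the list above.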