Exploratory Data Analysis - Univariate Analysis

import json
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from ipywidgets import widgets
from scipy.stats import shapiro
import statsmodels.api as sm
import plotly.io as pio
pio.renderers.default = "vscode"

df = pd.read_csv('./../../../data/cleaned_data.csv')

# Load lists of numerical and categorical columns from the static file
with open('./../../../data/statics.json') as f:
    statics = json.load(f)
categorical_columns = statics['categorical_columns']
numerical_columns = statics['numerical_columns']

# Split the dataframe into numerical and categorical dataframes
num_df = df[numerical_columns]
cat_df = df[categorical_columns]

Numerical Columns

Distribution

# Descriptive statistics for numerical variables
num_df.describe()

	Age	DailyRate	DistanceFromHome	EmployeeNumber	HourlyRate	MonthlyIncome	MonthlyRate	NumCompaniesWorked	PercentSalaryHike	TotalWorkingYears	TrainingTimesLastYear	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
count	1470.000000	1470.000000	1470.000000	1470.000000	1470.000000	1470.000000	1470.000000	1470.000000	1470.000000	1470.000000	1470.000000	1470.000000	1470.000000	1470.000000	1470.000000
mean	36.923810	802.485714	9.192517	1024.865306	65.891156	6502.931293	14313.103401	2.693197	15.209524	11.279592	2.799320	7.008163	4.229252	2.187755	4.123129
std	9.135373	403.509100	8.106864	602.024335	20.329428	4707.956783	7117.786044	2.498009	3.659938	7.780782	1.289271	6.126525	3.623137	3.222430	3.568136
min	18.000000	102.000000	1.000000	1.000000	30.000000	1009.000000	2094.000000	0.000000	11.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	30.000000	465.000000	2.000000	491.250000	48.000000	2911.000000	8047.000000	1.000000	12.000000	6.000000	2.000000	3.000000	2.000000	0.000000	2.000000
50%	36.000000	802.000000	7.000000	1020.500000	66.000000	4919.000000	14235.500000	2.000000	14.000000	10.000000	3.000000	5.000000	3.000000	1.000000	3.000000
75%	43.000000	1157.000000	14.000000	1555.750000	83.750000	8379.000000	20461.500000	4.000000	18.000000	15.000000	3.000000	9.000000	7.000000	3.000000	7.000000
max	60.000000	1499.000000	29.000000	2068.000000	100.000000	19999.000000	26999.000000	9.000000	25.000000	40.000000	6.000000	40.000000	18.000000	15.000000	17.000000

From the table above, \(MonthlyIncome\), \(YearsAtCompany\), and \(YearsSinceLastPromotion\) stand out as highly skewed: in each case the mean (6502.93, 7.01, and 2.19) sits well above the median (4919, 5, and 1), which indicates a long right tail. Plotting the distribution of every variable gives more detail.
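
The skew hinted at by the summary table can also be quantified directly. The following is a minimal sketch, assuming a small made-up dataframe `toy_df` in place of the real `num_df` (the values are illustrative and not taken from the dataset). `DataFrame.skew()` returns the sample skewness of each column, which is positive when the right tail is long.

```python
import pandas as pd

# Hypothetical toy data standing in for num_df; not the real dataset
toy_df = pd.DataFrame({
    "right_skewed": [1, 1, 2, 2, 3, 3, 4, 20],
    "symmetric": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Positive skewness means a long right tail (mean pulled above the median)
skewness = toy_df.skew()
mean_minus_median = toy_df.mean() - toy_df.median()

# Rank the columns from most to least right-skewed
summary = pd.DataFrame({
    "skewness": skewness,
    "mean_minus_median": mean_minus_median,
}).sort_values("skewness", ascending=False)
print(summary)
```

Applied to the real data, the same two lines (`num_df.skew()` and the mean-median gap) would give a ranking to compare against the visual impression from the histograms.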

# Create interactive plots

# Create a widget for selecting column
numcols = widgets.Dropdown(options = numerical_columns, value = numerical_columns[0], description="Numerical columns")

# Create plotly trace of histogram
num_trace1 = go.Histogram(x=num_df[numerical_columns[0]], 
                         histnorm='probability', 
                         name = 'Distribution')

# Create plotly trace of box plot
num_trace2 = go.Box(x=num_df[numerical_columns[0]], 
                   boxpoints='outliers', name = 'Quartiles representation')

# Create a widget for histogram
ng1 = go.FigureWidget(data=[num_trace1],
                     layout = go.Layout(
                         title = dict(text='Distribution of features')
                     ))

# Create a widget for box plot
ng2 = go.FigureWidget(data=[num_trace2],
                     layout = go.Layout(
                         title = dict(text='Quartiles representation of features')
                     ))

# Create a function for observing the change in the selection
def num_response(change):
    """
    Function to update the values in the graph based on the selected column.
    """
    with ng1.batch_update():
        ng1.data[0].x = num_df[numcols.value]
        ng1.layout.xaxis.title = 'Distribution of ' + str(numcols.value) + ' variable'
    
    with ng2.batch_update():
        ng2.data[0].x = num_df[numcols.value]
        ng2.layout.xaxis.title = numcols.value
    
numcols.observe(num_response, names='value')

num_container = widgets.VBox([numcols, ng1, ng2])

display(num_container)

The distributions above lead to the following observations:

The average age of the participants is about 37 years and the median age is 36 years. Nearly the whole working-age range is represented, from 18 to 60. Age has no outliers.
Daily rate, hourly rate (except at the upper end of its range, near 100), and monthly rate approximately follow a uniform distribution.
Distance from home, monthly income, number of companies worked, percentage salary hike, total working years, and years at company are positively skewed.
Two variables are bimodal: years in current role and years since last promotion.
Only one variable, number of trainings in the last year, appears to follow a roughly normal distribution.
Outliers are present in monthly income, number of companies worked, total working years, number of trainings in the last year, years at company, years in current role, years since last promotion, and years with current manager. Deciding whether to keep or remove them requires a closer look at each variable.
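
Box plots conventionally flag points that lie more than 1.5 times the interquartile range beyond the quartiles. The following is a minimal sketch of that rule, assuming a hypothetical helper `iqr_outliers` and made-up values. The quartile interpolation in pandas may differ slightly from the method Plotly uses, so counts near the fences can differ from what the plot shows.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical toy values standing in for a column such as MonthlyIncome
toy_income = pd.Series([2000, 2500, 3000, 3500, 4000, 4500, 5000, 19000])
outliers = iqr_outliers(toy_income)
print(outliers)
```

Listing the flagged rows per column in this way is one starting point for the closer look mentioned above, since it shows how many points are affected and how extreme they are.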

Normality check

# Run the Shapiro-Wilk test on every numerical column.
# Rows are collected in a list because DataFrame.append was removed in pandas 2.0.
sw_rows = []
for column in numerical_columns:
    result = shapiro(num_df[column])
    # Alpha is set to 5%
    sw_rows.append({
        'Name of the column': column,
        'SW Statistics': result[0],
        'P-value': result[1],
        'Is Normal': result[1] > 0.05
    })
sw_df = pd.DataFrame(sw_rows, columns=['Name of the column', 'SW Statistics', 'P-value', 'Is Normal'])
sw_df

	Name of the column	SW Statistics	P-value	Is Normal
0	Age	0.977448	2.035274e-14	False
1	DailyRate	0.954398	5.330206e-21	False
2	DistanceFromHome	0.861593	4.085809e-34	False
3	EmployeeNumber	0.952486	2.001128e-21	False
4	HourlyRate	0.955029	7.413545e-21	False
5	MonthlyIncome	0.827908	4.403389e-37	False
6	MonthlyRate	0.954464	5.515457e-21	False
7	NumCompaniesWorked	0.848779	2.634180e-35	False
8	PercentSalaryHike	0.900604	7.476921e-30	False
9	TotalWorkingYears	0.907428	5.628518e-29	False
10	TrainingTimesLastYear	0.895095	1.583637e-30	False
11	YearsAtCompany	0.838994	3.669825e-36	False
12	YearsInCurrentRole	0.896182	2.140117e-30	False
13	YearsSinceLastPromotion	0.703726	4.203895e-45	False
14	YearsWithCurrManager	0.897460	3.058352e-30	False

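With 1,470 rows, the dataset is small enough for the Shapiro-Wilk p-values to be reliable (SciPy warns that the p-value may be inaccurate only above about 5,000 samples). Every p-value is far below the 5% threshold, so we conclude that none of the variables follows a normal distribution.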
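
A significance test only says whether a departure from normality is detectable, not how large it is. As a complement, the following sketch computes the correlation coefficient of a normal Q-Q plot with `scipy.stats.probplot`, where values close to 1 indicate a nearly straight Q-Q line. The two synthetic samples and the helper `qq_correlation` are assumptions made for illustration and are not part of the original analysis.

```python
import numpy as np
from scipy.stats import probplot

# Synthetic samples for illustration only; not drawn from the dataset
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_sample = rng.exponential(scale=1.0, size=500)

def qq_correlation(values):
    """Correlation between the ordered data and the normal quantiles."""
    # With the default fit=True, probplot returns
    # ((theoretical_quantiles, ordered_values), (slope, intercept, r))
    (_, _), (_, _, r) = probplot(values, dist="norm")
    return r

r_normal = qq_correlation(normal_sample)
r_skewed = qq_correlation(skewed_sample)
print(f"Q-Q correlation, normal sample: {r_normal:.4f}")
print(f"Q-Q correlation, skewed sample: {r_skewed:.4f}")
```

Running `qq_correlation(num_df[column])` for each numerical column would show which variables are only mildly non-normal and which, like the heavily skewed ones, deviate strongly.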

Categorical variable

Distribution

# Create interactive plots

# Create widget for selecting column
catcols = widgets.Dropdown(options=categorical_columns, value=categorical_columns[0], description='Categorical columns')

# Create bar trace of category counts
cat_trace1 = go.Bar(x = cat_df[categorical_columns[0]].value_counts().index, 
                    y = cat_df[categorical_columns[0]].value_counts().values)

# Create a widget for bar plot
cg = go.FigureWidget(data=[cat_trace1],
                     layout=go.Layout(
                         title = dict(text="Distribution of features")
                     ))

# Create function for observing the change in the column name
def cat_response(change):
    with cg.batch_update():
        cg.data[0].x = cat_df[catcols.value].value_counts().index
        cg.data[0].y = cat_df[catcols.value].value_counts().values
        cg.layout.xaxis.title = 'Distribution of ' + str(catcols.value) + ' variable'
        
catcols.observe(cat_response, names='value')

cat_container = widgets.VBox([catcols, cg])

display(cat_container)

The bar charts above lead to the following observations:

The target variable is highly imbalanced.
Most of the employees travel rarely. Frequent travellers and non-travellers are far fewer than rare travellers.
Most of the employees belong to the Research and Development department, followed by Sales and then Human Resources.
The largest group of employees holds a Bachelor’s degree, followed by those who also completed a Master’s degree.
Most employees majored in Life Sciences or Medical. Far fewer employees majored in Marketing, Technical Degree, Human Resources, or Other than in these top two fields.
Most employees are quite content with their work environment.
The dataset contains more males than females.
Employees are also content with their involvement in their respective jobs.
Most of the employees belong to the lower levels of the hierarchy, mostly level 1 and level 2.
The top 5 roles in the current sample are sales executive, research scientist, laboratory technician, manufacturing director, and healthcare representative.
Most of the employees are satisfied with their jobs, but a significant number are not.
The largest group of employees is married, but a significant portion is divorced.
Around one-third of employees work overtime.
Performance ratings for all employees fall in only 2 bands, i.e. excellent and outstanding.
Most of the employees are satisfied with their relationship with the company, but a significant portion does not feel so.
More than 75% of employees have a stock option level of 0 or 1.
More than 80% of employees feel that they have a work-life balance.
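
Percentage statements such as those above can be read off the bar charts, but they are easier to check from relative frequencies. The following is a minimal sketch, assuming a made-up column `toy_target` with hypothetical labels and counts in place of a real column of `cat_df`. `value_counts(normalize=True)` returns the share of each category.

```python
import pandas as pd

# Hypothetical toy column standing in for a categorical column of cat_df;
# the labels and counts are made up for illustration
toy_target = pd.Series(["No"] * 84 + ["Yes"] * 16, name="target")

# Absolute counts and percentage share of each category
counts = toy_target.value_counts()
shares = toy_target.value_counts(normalize=True).mul(100).round(1)

share_table = pd.DataFrame({"count": counts, "percent": shares})
print(share_table)

# Ratio of the largest class to the smallest class as a crude imbalance measure
imbalance_ratio = counts.max() / counts.min()
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

The same pattern applied to each entry of `categorical_columns` would produce a table of shares to accompany every bar chart.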
