Exploratory Data Analysis - Univariate
Tip
This notebook contains interactive graphs, which are not rendered directly here. Please use the “live code” option to run it here, or run the complete notebook in Google Colaboratory or Binder.
Note
The interactivity for matplotlib graphs does not work with live code functionality. Hence, running the code in Google Colaboratory or Binder is recommended.
# Import necessary packages
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from ipywidgets import widgets
from scipy.stats import shapiro
import statsmodels.api as sm
from matplotlib import pyplot as plt
# Load data into dataframe from local
# df = pd.read_csv('./../../datasets/online_shoppers_intention.csv')
# Load data into dataframe from UCI repository
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00468/online_shoppers_intention.csv')
Numerical features
Numbers alone cannot tell us much on their own, but a visual representation of those numbers is a treasure. This section focuses on univariate analysis, i.e. the analysis of variables one at a time. We will deal with the numerical features first, followed by the categorical features.
Histograms and box plots can help us understand the numerical features with regard to their distribution. They also help us identify outliers (values which lie too far away from the typical values) and understand the probability of an event's occurrence.
# Select numerical columns data
numerical_columns = ['Administrative', 'Administrative_Duration', 'Informational', 'Informational_Duration',
'ProductRelated', 'ProductRelated_Duration', 'BounceRates', 'ExitRates', 'PageValues',
'SpecialDay']
num_df = df[numerical_columns]
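Before diving into the plots, a quick tabular summary can anchor the visual analysis. A minimal sketch using pandas' built-in describe; it relies only on the num_df frame defined above:
# Summary statistics (count, mean, std, min, quartiles, max) for each numerical feature
num_df.describe().T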
# Create interactive plots
# Create a widget for selecting a column
numcols = widgets.Dropdown(options=numerical_columns, value='Administrative', description='Numerical columns')
# Create a plotly trace for the histogram
num_trace1 = go.Histogram(x=num_df['Administrative'],
                          histnorm='probability',
                          name='Distribution')
# Create a plotly trace for the box plot
num_trace2 = go.Box(x=num_df['Administrative'],
                    boxpoints='outliers', name='Quartiles representation')
# Create a widget for the histogram
ng1 = go.FigureWidget(data=[num_trace1],
                      layout=go.Layout(
                          title=dict(text='Distribution of features')
                      ))
# Create a widget for the box plot
ng2 = go.FigureWidget(data=[num_trace2],
                      layout=go.Layout(
                          title=dict(text='Quartiles representation of features')
                      ))
# Create a function for observing the change in the selection
def num_response(change):
    """
    Update the values in the graphs based on the selected column.
    """
    with ng1.batch_update():
        ng1.data[0].x = num_df[numcols.value]
        ng1.layout.xaxis.title = 'Distribution of ' + str(numcols.value) + ' variable'
    with ng2.batch_update():
        ng2.data[0].x = num_df[numcols.value]
        ng2.layout.xaxis.title = numcols.value

numcols.observe(num_response, names='value')
num_container = widgets.VBox([numcols, ng1, ng2])
display(num_container)
All the page-related columns have most of their values concentrated near 0. Bounce rates and exit rates have values near both extremes, though for both of them more values lie towards 0 than towards the other end. The page values and special day columns have a large number of values that are exactly 0.
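The concentration of zeros can also be verified numerically; here is a minimal sketch, assuming only the num_df frame defined above:
# Fraction of exactly-zero values per numerical column,
# sorted to highlight the most zero-heavy features
zero_share = (num_df == 0).mean().sort_values(ascending=False)
print(zero_share)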
For this analysis, I won't be considering any point in the administrative pages (Administrative), bounce rates (BounceRates), exit rates (ExitRates) and special day (SpecialDay) columns as an outlier. For the remaining columns, the following criteria will be used to select and remove outliers (a filtering sketch follows the list):
Time spent on administrative pages (Administrative_Duration) - values of more than 3,000 seconds.
Number of informational pages (Informational) - values of more than 20 pages.
Time spent on informational pages (Informational_Duration) - values of more than 2,500 seconds.
Number of product related pages (ProductRelated) - values of more than 600 pages.
Time spent on product related pages (ProductRelated_Duration) - values of more than 40,000 seconds.
Page values (PageValues) - values of more than 300.
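Applying these criteria amounts to a simple boolean filter. A minimal sketch, where outlier_mask and clean_df are hypothetical names and the thresholds are the ones listed above:
# Hypothetical outlier-removal step: drop rows violating any criterion above
outlier_mask = ((df['Administrative_Duration'] > 3000)
                | (df['Informational'] > 20)
                | (df['Informational_Duration'] > 2500)
                | (df['ProductRelated'] > 600)
                | (df['ProductRelated_Duration'] > 40000)
                | (df['PageValues'] > 300))
clean_df = df[~outlier_mask]
print(f'Removed {outlier_mask.sum()} rows as outliers')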
Next, we can check the normality of the numerical features.
# Perform the Shapiro-Wilk test to check the normality of the numerical features
sw_rows = []
for column in numerical_columns:
    result = shapiro(df[column])
    sw_rows.append({'Name of the feature': column,
                    'SW Statistics': result[0],
                    'P-value': result[1],
                    'Is Normal': result[1] > 0.05})
sw_df = pd.DataFrame(sw_rows)
sw_df
UserWarning: p-value may not be accurate for N > 5000.
| | Name of the feature | SW Statistics | P-value | Is Normal |
|---|---|---|---|---|
| 0 | Administrative | 0.734400 | 0.0 | False |
| 1 | Administrative_Duration | 0.481695 | 0.0 | False |
| 2 | Informational | 0.458277 | 0.0 | False |
| 3 | Informational_Duration | 0.259782 | 0.0 | False |
| 4 | ProductRelated | 0.610410 | 0.0 | False |
| 5 | ProductRelated_Duration | 0.555028 | 0.0 | False |
| 6 | BounceRates | 0.492207 | 0.0 | False |
| 7 | ExitRates | 0.699234 | 0.0 | False |
| 8 | PageValues | 0.355064 | 0.0 | False |
| 9 | SpecialDay | 0.343015 | 0.0 | False |
The Shapiro-Wilk (SW) test is chosen because, among the common normality tests, it has the best power for a given significance level. The null hypothesis of the Shapiro-Wilk test is that the population is normally distributed.
Due to its high power, the SW test is quite sensitive for large sample sizes. This means that as the number of observations increases, even a slight deviation from the normal distribution becomes statistically significant and the p-value approaches 0 (hence the N > 5000 warning above). To confirm the deviation from normality in this situation, it is necessary to check the Quantile-Quantile (Q-Q) plots.
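This sample-size sensitivity is easy to demonstrate on synthetic data. A minimal sketch, assuming only numpy in addition to the imports above; the roughly 1% contamination level is an arbitrary illustration:
# Demonstrate SW sensitivity: a large, almost-normal sample still gets p ~ 0
import numpy as np
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(size=10_000),        # normal bulk
                         rng.uniform(-6, 6, size=100)])  # slight contamination
stat, p = shapiro(sample)  # shapiro is already imported above
print(f'SW statistic: {stat:.4f}, p-value: {p:.2e}')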
# Q-Q plots for numerical features
%matplotlib inline
def plot_qq(x):
    fig, ax = plt.subplots(figsize=(10, 8))
    probplot = sm.ProbPlot(num_df[x], fit=True)
    probplot.qqplot(line='s', ax=ax)
    plt.show()

widgets.interact(plot_qq, x=numerical_columns)
Based on the Shapiro-Wilk test, the sample distributions of the features do not follow the normal distribution, i.e. none of the numerical features is normally distributed. The usual criterion is that if the p-value is less than alpha, the null hypothesis is rejected. Assuming an alpha of 0.05 (95% confidence level), not a single feature has a p-value greater than this, and hence the null hypothesis is rejected in every case.
For Q-Q plots, all the points should lie on the standardized line. The deviation of the sample distributions from the normal distribution is thus further confirmed by the Q-Q plots of those features.
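A complementary numeric check is skewness and excess kurtosis, both of which are approximately 0 for a normal distribution. A minimal sketch using scipy (this import is in addition to the ones at the top):
# Skewness and excess kurtosis per feature; values far from 0
# corroborate the deviation visible in the Q-Q plots
from scipy.stats import skew, kurtosis
for column in numerical_columns:
    s = skew(num_df[column])
    k = kurtosis(num_df[column])  # Fisher definition: 0 for a normal distribution
    print(f'{column:>25}: skew = {s:6.2f}, excess kurtosis = {k:8.2f}')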
Categorical features
For the categorical features, we can check the counts of the available categories in each variable.
# Select categorical features
categorical_columns = ['OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType', 'Weekend', 'Month',
'Revenue']
cat_df = df[categorical_columns]
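Since the interactive widget below will not render on a static page, a plain value_counts loop gives the same information non-interactively; a minimal sketch:
# Static fallback: category counts for each categorical feature
for column in categorical_columns:
    print(f'--- {column} ---')
    print(cat_df[column].value_counts(), end='\n\n')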
# Create interactive plots
# Create a widget for selecting a column
catcols = widgets.Dropdown(options=categorical_columns, value='OperatingSystems', description='Categorical columns')
# Create a bar plot trace of the category counts
cat_trace1 = go.Bar(x=cat_df['OperatingSystems'].value_counts().index,
                    y=cat_df['OperatingSystems'].value_counts().values)
# Create a widget for the bar plot
cg = go.FigureWidget(data=[cat_trace1],
                     layout=go.Layout(
                         title=dict(text='Distribution of features')
                     ))
# Create a function for observing the change in the column name
def cat_response(change):
    """
    Update the values in the graph based on the selected column.
    """
    with cg.batch_update():
        cg.data[0].x = cat_df[catcols.value].value_counts().index
        cg.data[0].y = cat_df[catcols.value].value_counts().values
        cg.layout.xaxis.title = 'Distribution of ' + str(catcols.value) + ' variable'

catcols.observe(cat_response, names='value')
cat_container = widgets.VBox([catcols, cg])
display(cat_container)
Beginning with the operating systems: operating systems 5, 6 and 7 have very little data, but they can't be considered outliers. Operating system 2 is used by the majority of visitors.
Next come the browsers. There is no data for browser 9. Though browsers 11 and 12 have little data, they can't be considered outliers either.
The number of visitors is highest for region 1 and lowest for region 5.
Traffic type has the largest number of categories. Types 11, 16 and 17 do not have any data. Most visitors are of traffic type 1.
Most of the visitors are of the returning type; compared to returning visitors, new visitors are far fewer in number.
Most of the sessions take place on weekdays rather than on weekends.
Data is available for only 10 months; the missing months are January and April. Most of the shopping takes place in May and November, closely followed by March and December.
Most visitors do not finalize the transaction. The ratio of transacting to non-transacting users is approximately 1:5.
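These proportions can be read directly off the value counts; a minimal sketch supporting the last two observations:
# Shares of weekend sessions and of completed transactions
print(df['Weekend'].value_counts(normalize=True))
print(df['Revenue'].value_counts(normalize=True))
# Approximate transacting : non-transacting ratio
revenue_counts = df['Revenue'].value_counts()
print(f'Ratio ~ 1:{revenue_counts[False] / revenue_counts[True]:.1f}')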