Exploratory Data Analysis - Multivariate

Tip

This notebook contains interactive graphs, which are not rendered directly here. Please use the “live code” option to run it here, or run the complete notebook in Google Colaboratory or Binder.

Note

The interactivity for matplotlib graphs does not work with live code functionality. Hence, running the code in Google Colaboratory or Binder is recommended.


# Import necessary packages
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from ipywidgets import widgets
# Load data into dataframe from local
# df = pd.read_csv('./../../datasets/online_shoppers_intention.csv')

# Load data into dataframe from UCI repository
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00468/online_shoppers_intention.csv')

For multivariate analysis, it is best to start by finding the correlation of the numerical features with each other. This will also help us identify which numerical features are related to each other.

# Select numerical columns data
numerical_columns = ['Administrative', 'Administrative_Duration', 'Informational', 'Informational_Duration', 
                     'ProductRelated', 'ProductRelated_Duration', 'BounceRates', 'ExitRates', 'PageValues', 
                     'SpecialDay']
num_df = df[numerical_columns]
# Compute correlation matrices using both the Pearson and Spearman correlation methods
pearson_corr_matrix = num_df.corr() # The default method used is pearson correlation
spearman_corr_matrix = num_df.corr(method='spearman')
# Plot Pearson correlation heatmap
fig = px.imshow(pearson_corr_matrix)
fig.show()

Page-related columns are positively correlated with each other, but the correlations are weak. It is not surprising that the number of pages of each type visited is highly correlated with the time spent on pages of that type. This is strongest for product-related pages. Bounce rate, exit rate and special day are negatively correlated with the page type features. Bounce rate is also positively correlated with exit rate. Bounce rate, exit rate and special day are negatively correlated with the page value feature.
Two pairs of features, ProductRelated with ProductRelated_Duration and BounceRates with ExitRates, exhibit high positive correlation. Hence one feature of each pair can be eliminated. This elimination will be considered when training models for which multicollinearity matters, as the features in each pair might provide nearly the same information to the model.
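
As a minimal sketch of how such highly correlated pairs could be listed programmatically rather than read off the heatmap, the snippet below scans the upper triangle of a correlation matrix. The helper name `highly_correlated_pairs`, the 0.8 threshold, and the synthetic `demo` frame are assumptions made for illustration; they are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.8, method="pearson"):
    """Return (feature_a, feature_b, correlation) for pairs with |correlation| >= threshold."""
    corr = frame.corr(method=method)
    # Keep only the strict upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # Comparisons with NaN are False, so any masked-out cells that survive stack() are removed here
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]

# Synthetic stand-in data (not the shopping dataset): 'b' is nearly a copy of 'a', 'c' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

pairs = highly_correlated_pairs(demo, threshold=0.8)
print(pairs)
```

On the notebook's data the call would presumably be `highly_correlated_pairs(num_df)`, with the threshold chosen to suit the model being trained.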

# Plot Spearman correlation heatmap
fig = px.imshow(spearman_corr_matrix)
fig.show()

Page-related features have stronger Spearman (rank) correlations with each other than Pearson correlations, which suggests that their relationships are monotonic but not strictly linear. These features also have a weak positive non-linear correlation with the page type features.
These correlations will be clearer with scatter plots.
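
To illustrate why the two heatmaps can disagree, here is a small sketch on synthetic data (assumed for illustration, unrelated to the shopping dataset). For a relationship that is strictly increasing but strongly non-linear, the Spearman coefficient, which works on ranks, stays at essentially 1, while the Pearson coefficient, which measures linear association, comes out noticeably lower.

```python
import numpy as np
import pandas as pd

# Synthetic example: y grows exponentially with x, so it is monotonic but far from linear
rng = np.random.default_rng(42)
x = rng.uniform(0, 5, size=1000)
demo = pd.DataFrame({"x": x, "y": np.exp(2 * x)})

pearson_r = demo["x"].corr(demo["y"])                      # default method is Pearson
spearman_r = demo["x"].corr(demo["y"], method="spearman")  # rank-based correlation

print(f"Pearson: {pearson_r:.3f}, Spearman: {spearman_r:.3f}")
```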

# Create interactive plots

# Create column selection widgets
numcols1 = widgets.Dropdown(options=numerical_columns, value='Administrative', description='X axis')
numcols2 = widgets.Dropdown(options=numerical_columns, value='Administrative', description='Y axis')

# Create scatter plot trace
num_trace1 = go.Scatter(x = num_df['Administrative'], y=num_df['Administrative'], mode='markers')

# Create widget for scatter plot
ng1 = go.FigureWidget(data=[num_trace1], 
                      layout = go.Layout(
                          title = dict(text='Relation between variables'),
                          xaxis=dict(title=dict(text='Administrative')),
                          yaxis=dict(title=dict(text='Administrative'))
                      ))

# Create functions that respond to changes in the selection
def num_response1(change):
    """Update the x values and x-axis title based on the first dropdown"""
    with ng1.batch_update():
        ng1.data[0].x = num_df[numcols1.value]
        ng1.layout.xaxis.title.text = numcols1.value
        
def num_response2(change):
    """Update the y values and y-axis title based on the second dropdown"""
    with ng1.batch_update():
        ng1.data[0].y = num_df[numcols2.value]
        ng1.layout.yaxis.title.text = numcols2.value
        
numcols1.observe(num_response1, names='value')
numcols2.observe(num_response2, names='value')

num_container = widgets.VBox([widgets.HBox([numcols1, numcols2]), ng1])
display(num_container)

The scatter plots confirm the correlations observed in the heatmaps.

Next, we can look at some fundamental differences between the visitors who transacted and those who did not.

# Create interactive plots

# Create widget to select columns
numcols3 = widgets.Dropdown(options=numerical_columns, value='Administrative', description='Numerical columns')

# Select and slice required data
revenue = df[df['Revenue']]
non_revenue = df[~df['Revenue']]

# Create figure for the plot

fig1 = px.histogram(df, x="Administrative", color="Revenue", 
                    cumulative=True, opacity=0.5, histnorm="probability", barmode="overlay")

# Create widget for the plot
ng3 = go.FigureWidget(fig1)

# Create function to respond to changes
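def num_response3(change):
    """Update both histogram traces and the x-axis title for the selected column"""
    # Match each trace to its subset by name: plotly express names the traces after the
    # 'Revenue' values ('True'/'False'), and their order follows first appearance in df
    subsets = {'True': revenue, 'False': non_revenue}
    with ng3.batch_update():
        for trace in ng3.data:
            trace.x = subsets[trace.name][numcols3.value]
        ng3.layout.xaxis.title = numcols3.value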
        
numcols3.observe(num_response3, names='value')

num_container2 = widgets.VBox([numcols3, ng3])
display(num_container2)

From the cumulative distributions, it can be observed that visitors who transact visit more administrative pages and spend more time on them, visit marginally more informational pages and spend more time on them, and visit more product-related pages and spend more time on them. They also have lower bounce rates and exit rates but higher page values. Special days do not matter for transacting visitors.
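
The comparison read from the cumulative histograms could also be checked numerically. The sketch below is only an illustration on synthetic data: the `pages` column, the Poisson rates, and the helper `ecdf_by_group` are assumptions, not values or functions from the notebook. It computes, for each `Revenue` group, the median and the fraction of rows at or below a chosen value, which is the height of that group's cumulative histogram at that point.

```python
import numpy as np
import pandas as pd

def ecdf_by_group(frame, value_col, group_col, at):
    """Fraction of rows in each group whose value is <= `at` (empirical CDF at one point)."""
    return frame[value_col].le(at).groupby(frame[group_col]).mean()

# Synthetic stand-in data: the True group is drawn with a higher average page count
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Revenue": np.repeat([False, True], 500),
    "pages": np.concatenate([rng.poisson(2, 500), rng.poisson(5, 500)]),
})

medians = demo.groupby("Revenue")["pages"].median()
cdf_at_3 = ecdf_by_group(demo, "pages", "Revenue", at=3)

print(medians)
print(cdf_at_3)
```

A lower cumulative fraction for one group at the same value means that group's distribution is shifted toward larger values, which is how the overlaid cumulative histograms are read.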