Introduction to numerical features

# Import necessary packages
import pandas as pd
# Read file into dataframe
df = pd.read_csv('./../../../datasets/online_shoppers_intention.csv')

Column Descriptions

The dataset contains the following numerical features:

Administrative

Number of pages visited by the visitor about account management.

Administrative duration

Total amount of time (in seconds) spent by the visitor on account management related pages.

Informational

Number of pages visited by the visitor about the website, communication, and address information of the shopping site.

Informational duration

Total amount of time (in seconds) spent by the visitor on informational pages.

Product related

Number of pages visited by the visitor about products.

Product related duration

Total amount of time (in seconds) spent by the visitor on product related pages.

Bounce rate

Average bounce rate value of the pages visited by the visitor. The bounce rate of a page is the percentage of visitors who enter the site from that page and then leave without triggering any other requests to the analytics server during that session.

Exit rate

Average exit rate value of the pages visited by the visitor. The exit rate of a page is the percentage of all pageviews of that page that were the last in the session.

Page value

Average page value of the pages visited by the visitor. It represents the average value of a web page that a user visited before completing an e-commerce transaction.

Special day

Closeness of the site visiting time to a special day. Values in this dataset range from 0 to 1.
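
The bounce rate and exit rate definitions above can be made concrete with a small worked example. This is a minimal sketch on hypothetical toy sessions, not the dataset's actual computation: the `sessions` list and the helpers `bounce_rate`, `exit_rate`, and `session_average` are invented for illustration, a bounce is approximated as a single-page session, and averaging per-page rates over the pages in a session is an assumption about how the per-visitor columns are derived.

```python
from collections import Counter

# Hypothetical toy sessions: each list is the ordered pages one visitor viewed.
sessions = [
    ["home"],                     # single-page session, counted as a bounce
    ["home", "product", "cart"],
    ["product"],                  # single-page session, counted as a bounce
    ["home", "product"],
    ["product", "home"],
]

entrances = Counter(s[0] for s in sessions)               # sessions that start on each page
bounces = Counter(s[0] for s in sessions if len(s) == 1)  # single-page sessions per entry page
pageviews = Counter(p for s in sessions for p in s)       # total views of each page
exits = Counter(s[-1] for s in sessions)                  # sessions that end on each page


def bounce_rate(page):
    """Share of sessions entering on `page` that viewed nothing else."""
    return bounces[page] / entrances[page] if entrances[page] else 0.0


def exit_rate(page):
    """Share of pageviews of `page` that were the last in their session."""
    return exits[page] / pageviews[page] if pageviews[page] else 0.0


def session_average(session, rate):
    """Average a per-page rate over the pages viewed in one session."""
    return sum(rate(p) for p in session) / len(session)


for page in sorted(pageviews):
    print(page, round(bounce_rate(page), 3), round(exit_rate(page), 3))
print(round(session_average(sessions[1], exit_rate), 3))
```

Note that the two rates use different denominators: bounce rate divides by entrances to the page, while exit rate divides by all pageviews of the page.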

# Select numerical columns data
numerical_columns = ['Administrative', 'Administrative_Duration', 'Informational', 'Informational_Duration', 
                     'ProductRelated', 'ProductRelated_Duration', 'BounceRates', 'ExitRates', 'PageValues', 
                     'SpecialDay']
df = df[numerical_columns]

Column statistics

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Administrative           12330 non-null  int64  
 1   Administrative_Duration  12330 non-null  float64
 2   Informational            12330 non-null  int64  
 3   Informational_Duration   12330 non-null  float64
 4   ProductRelated           12330 non-null  int64  
 5   ProductRelated_Duration  12330 non-null  float64
 6   BounceRates              12330 non-null  float64
 7   ExitRates                12330 non-null  float64
 8   PageValues               12330 non-null  float64
 9   SpecialDay               12330 non-null  float64
dtypes: float64(7), int64(3)
memory usage: 963.4 KB

There are no null entries in the numerical data. Three columns hold integer data and seven hold floating-point data.
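
The same two facts can also be checked programmatically rather than read off the `info()` output. The sketch below uses a hypothetical three-column frame named `toy` (with one missing value planted on purpose so the check has something to find); applying the same calls to `df` is assumed to give all-zero null counts, as reported above.

```python
import pandas as pd

# Hypothetical stand-in for the numerical dataframe; values are illustrative only.
toy = pd.DataFrame({
    "Administrative": [0, 1, 4],
    "Administrative_Duration": [0.0, 7.5, 93.25],
    "BounceRates": [0.2, 0.0, None],  # None is stored as NaN
})

null_counts = toy.isna().sum()            # missing values per column
dtype_counts = toy.dtypes.value_counts()  # number of columns per dtype
has_nulls = bool(null_counts.any())       # True if any column has a missing value

print(null_counts)
print(dtype_counts)
print(has_nulls)
```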

df.describe()
       Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration   BounceRates     ExitRates    PageValues    SpecialDay
count    12330.000000             12330.000000   12330.000000            12330.000000    12330.000000             12330.000000  12330.000000  12330.000000  12330.000000  12330.000000
mean         2.315166                80.818611       0.503569               34.472398       31.731468              1194.746220      0.022191      0.043073      5.889258      0.061427
std          3.321784               176.779107       1.270156              140.749294       44.475503              1913.669288      0.048488      0.048597     18.568437      0.198917
min          0.000000                 0.000000       0.000000                0.000000        0.000000                 0.000000      0.000000      0.000000      0.000000      0.000000
25%          0.000000                 0.000000       0.000000                0.000000        7.000000               184.137500      0.000000      0.014286      0.000000      0.000000
50%          1.000000                 7.500000       0.000000                0.000000       18.000000               598.936905      0.003112      0.025156      0.000000      0.000000
75%          4.000000                93.256250       0.000000                0.000000       38.000000              1464.157214      0.016813      0.050000      0.000000      0.000000
max         27.000000              3398.750000      24.000000             2549.375000      705.000000             63973.522230      0.200000      0.200000    361.763742      1.000000

The mean time spent per session differs widely across page types. Visitors view the most pages and spend the most time on product related pages (mean of about 1,195 seconds), followed by administrative pages (about 81 seconds) and then informational pages (about 34 seconds). The maximum time spent on product related pages in a session, 63,973.5 seconds (nearly 18 hours), looks like an outlier. The informational columns are highly skewed: their 75th percentile is 0, while their maximums are 24 pages and about 2,549 seconds.
It should be noted that the maximum value of both bounce rates and exit rates is 0.2. The bounce rate and special day columns are also highly skewed.
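
Observations about skew and outliers like these can be backed with numbers. The following is a rough sketch on a hypothetical frame `toy` with made-up values; it uses pandas' sample skewness and Tukey's 1.5 × IQR fence, which is one common convention for flagging outliers and not necessarily the criterion used for the remarks above. The same calls are assumed to work unchanged on `df`.

```python
import pandas as pd

# Hypothetical toy values, chosen only to illustrate the checks.
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_tailed": [0.0, 200.0, 600.0, 1500.0, 64000.0],
})

# Sample skewness per column: near 0 for symmetric data,
# positive when the distribution has a long right tail.
skewness = toy.skew()

# Tukey's rule: flag values above Q3 + 1.5 * IQR as candidate outliers.
col = toy["right_tailed"]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = col[col > upper_fence]

print(skewness)
print(upper_fence, outliers.tolist())
```

A flagged value is only a candidate for inspection: a very long session may be genuine, so whether to cap, transform, or drop it depends on the downstream model.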