Data Preparation

Data preparation is the most important step in any machine learning project, since a model's performance depends heavily on the quality of its data. This dataset does not contain any missing values, which saves a lot of work.

# Import necessary packages
import pandas as pd
import numpy as np

# Load file into the dataframe
df = pd.read_csv('./../../datasets/online_shoppers_intention.csv')

# Data preview
df.head()

	ProductRelated	ProductRelated_Duration	BounceRates	ExitRates	Month	OperatingSystems	Browser	Region	TrafficType	VisitorType	Weekend	Revenue
0	1	0.000000	0.20	0.20	Feb	1	1	1	1	Returning_Visitor	False	False
1	2	64.000000	0.00	0.10	Feb	2	2	1	2	Returning_Visitor	False	False
2	1	0.000000	0.20	0.20	Feb	4	1	9	3	Returning_Visitor	False	False
3	2	2.666667	0.05	0.14	Feb	3	2	2	4	Returning_Visitor	False	False
4	10	627.500000	0.02	0.05	Feb	3	3	1	4	Returning_Visitor	True	False

# Data general information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Administrative           12330 non-null  int64  
 1   Administrative_Duration  12330 non-null  float64
 2   Informational            12330 non-null  int64  
 3   Informational_Duration   12330 non-null  float64
 4   ProductRelated           12330 non-null  int64  
 5   ProductRelated_Duration  12330 non-null  float64
 6   BounceRates              12330 non-null  float64
 7   ExitRates                12330 non-null  float64
 8   PageValues               12330 non-null  float64
 9   SpecialDay               12330 non-null  float64
 10  Month                    12330 non-null  object 
 11  OperatingSystems         12330 non-null  int64  
 12  Browser                  12330 non-null  int64  
 13  Region                   12330 non-null  int64  
 14  TrafficType              12330 non-null  int64  
 15  VisitorType              12330 non-null  object 
 16  Weekend                  12330 non-null  bool   
 17  Revenue                  12330 non-null  bool   
dtypes: bool(2), float64(7), int64(7), object(2)
memory usage: 1.5+ MB
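
Every column above reports 12330 non-null entries, which matches the row count and is what the "no missing values" observation rests on. The same check can be made explicit. The following is only a minimal sketch: the helper name `missing_report` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.Series:
    """Return missing-value counts per column, keeping only columns that have any."""
    counts = frame.isna().sum()
    return counts[counts > 0]


# Toy frame standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "BounceRates": [0.20, 0.00, np.nan],
    "Month": ["Feb", "Feb", "Mar"],
})

print(missing_report(toy))
```

Applied to the shoppers dataframe, `missing_report(df)` should come back empty if the dataset is as complete as the `df.info()` output suggests.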

There is no preprocessing required for numerical features at this moment.

It would be better to one-hot encode the categorical features.
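
To illustrate what one-hot encoding does, here is a small sketch on a toy frame with made-up values (not rows from the dataset). It uses the `columns=` argument of `pd.get_dummies`. Integer-coded categories such as `Browser` have to be listed explicitly, because by default only object, string, and category columns are encoded.

```python
import pandas as pd

# Toy frame (hypothetical values): one string column, one integer-coded
# categorical column, and one genuinely numerical column.
toy = pd.DataFrame({
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Returning_Visitor"],
    "Browser": [1, 2, 1],
    "BounceRates": [0.20, 0.00, 0.05],
})

# Each listed column is replaced by one indicator column per distinct value,
# named "<column>_<value>"; unlisted columns are left untouched.
encoded = pd.get_dummies(toy, columns=["VisitorType", "Browser"])

print(encoded.columns.tolist())
```

The untouched `BounceRates` column comes first, followed by the indicator columns for `VisitorType` and then `Browser`.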

# One-hot encode categorical columns
onehot_columns = ['OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType', 'Month']
for column in onehot_columns:
    # get_dummies already joins prefix and value with '_', so pass the bare column name
    temp_df = pd.get_dummies(df[column], prefix=column)
    df = pd.concat([df, temp_df], axis=1)
    df.drop(column, axis=1, inplace=True)

# Remove outliers
df = df[df['Administrative_Duration'] < 3000]
df = df[df['Informational'] < 20]
df = df[df['Informational_Duration'] < 2500]
df = df[df['ProductRelated'] < 600]
df = df[df['ProductRelated_Duration'] < 40000]
df = df[df['PageValues'] < 300]

df.shape

(12325, 75)

The shape of the new dataframe shows that only 5 of the original 12,330 rows were removed as outliers, so very little data is lost.
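
To see which threshold accounts for the dropped rows, the filters can be written as a dictionary and counted per column. This is a sketch that reuses the upper bounds from the outlier cell above. `OUTLIER_LIMITS`, `count_outliers`, and `drop_outliers` are hypothetical helper names, and the toy frame only stands in for the real data.

```python
import pandas as pd

# Upper bounds copied from the outlier filters above (rows must be strictly below).
OUTLIER_LIMITS = {
    "Administrative_Duration": 3000,
    "Informational": 20,
    "Informational_Duration": 2500,
    "ProductRelated": 600,
    "ProductRelated_Duration": 40000,
    "PageValues": 300,
}


def count_outliers(frame, limits):
    """Count, per column, the rows whose value is at or above the limit."""
    return pd.Series({col: int((frame[col] >= lim).sum()) for col, lim in limits.items()})


def drop_outliers(frame, limits):
    """Keep only the rows that are below every limit."""
    mask = pd.Series(True, index=frame.index)
    for col, lim in limits.items():
        mask &= frame[col] < lim
    return frame[mask]


# Toy frame with made-up values: row 1 and row 2 each break at least one limit.
toy = pd.DataFrame({
    "Administrative_Duration": [0.0, 3400.0, 10.0],
    "Informational": [0, 1, 24],
    "Informational_Duration": [0.0, 0.0, 2550.0],
    "ProductRelated": [1, 2, 10],
    "ProductRelated_Duration": [0.0, 64.0, 627.5],
    "PageValues": [0.0, 0.0, 0.0],
})

print(count_outliers(toy, OUTLIER_LIMITS))
print(drop_outliers(toy, OUTLIER_LIMITS).shape)
```

A single row can exceed several limits at once, so the per-column counts may add up to more than the number of rows actually removed.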

# Save the processed data
df.to_csv('preprocessed_osi.csv', index=False)
