Data Preparation
Data preparation is one of the most important steps in any machine learning project, because model performance depends heavily on data quality. This dataset contains no missing values, which saves a lot of work.
# Import necessary packages
import pandas as pd
import numpy as np
# Load file into the dataframe
df = pd.read_csv('./../../datasets/online_shoppers_intention.csv')
# Data preview
df.head()
| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.20 | 0.20 | 0.0 | 0.0 | Feb | 1 | 1 | 1 | 1 | Returning_Visitor | False | False |
| 1 | 0 | 0.0 | 0 | 0.0 | 2 | 64.000000 | 0.00 | 0.10 | 0.0 | 0.0 | Feb | 2 | 2 | 1 | 2 | Returning_Visitor | False | False |
| 2 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.20 | 0.20 | 0.0 | 0.0 | Feb | 4 | 1 | 9 | 3 | Returning_Visitor | False | False |
| 3 | 0 | 0.0 | 0 | 0.0 | 2 | 2.666667 | 0.05 | 0.14 | 0.0 | 0.0 | Feb | 3 | 2 | 2 | 4 | Returning_Visitor | False | False |
| 4 | 0 | 0.0 | 0 | 0.0 | 10 | 627.500000 | 0.02 | 0.05 | 0.0 | 0.0 | Feb | 3 | 3 | 1 | 4 | Returning_Visitor | True | False |
# Data general information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Administrative 12330 non-null int64
1 Administrative_Duration 12330 non-null float64
2 Informational 12330 non-null int64
3 Informational_Duration 12330 non-null float64
4 ProductRelated 12330 non-null int64
5 ProductRelated_Duration 12330 non-null float64
6 BounceRates 12330 non-null float64
7 ExitRates 12330 non-null float64
8 PageValues 12330 non-null float64
9 SpecialDay 12330 non-null float64
10 Month 12330 non-null object
11 OperatingSystems 12330 non-null int64
12 Browser 12330 non-null int64
13 Region 12330 non-null int64
14 TrafficType 12330 non-null int64
15 VisitorType 12330 non-null object
16 Weekend 12330 non-null bool
17 Revenue 12330 non-null bool
dtypes: bool(2), float64(7), int64(7), object(2)
memory usage: 1.5+ MB
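The Non-Null Count column above shows 12330 non-null entries for every column, which is what supports the "no missing values" statement. An explicit check can make this easier to read at a glance. The following is a minimal sketch, assuming a hypothetical helper named `missing_value_report` and a small toy frame standing in for `df`:

```python
import numpy as np
import pandas as pd


def missing_value_report(frame: pd.DataFrame) -> pd.Series:
    """Count missing values per column, keeping only columns that have any."""
    counts = frame.isna().sum()
    return counts[counts > 0]


# Toy data standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "PageValues": [0.0, 12.5, np.nan],
    "Month": ["Feb", "Mar", "Mar"],
})

report = missing_value_report(toy)
print(report)
```

Applied to the real data, `missing_value_report(df)` should return an empty Series if the `df.info()` output above holds.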
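No preprocessing is required for the numerical features at this point.
The categorical features should be one-hot encoded. OperatingSystems, Browser, Region, and TrafficType are stored as integers but are category codes, so they are encoded together with VisitorType and Month.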
# One-hot encode categorical columns
onehot_columns = ['OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType', 'Month']
for column in onehot_columns:
    temp_df = pd.get_dummies(df[column], prefix=column+'_')
    df = pd.concat([df, temp_df], axis=1)
    df.drop(column, axis=1, inplace=True)
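Because `pd.get_dummies` already joins the prefix and the category value with `_` by default, passing `prefix=column+'_'` in the loop above yields column names with a double underscore, such as `Month__Feb`. As a possible alternative, the same encoding can be done in a single call through the `columns` argument, which also drops the original columns. This is a sketch on a toy frame (illustrative values only), not a drop-in replacement: it produces single-underscore names, so any later code that relies on the double-underscore names would need adjusting.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "Region": [1, 9, 1],
    "Month": ["Feb", "Feb", "Mar"],
    "PageValues": [0.0, 3.2, 0.0],
})

# Encode the listed columns in one call. Integer-coded columns such as
# Region are encoded too because they are named explicitly in `columns`.
encoded = pd.get_dummies(toy, columns=["Region", "Month"])
print(list(encoded.columns))
```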
# Remove outliers
df = df[df['Administrative_Duration'] < 3000]
df = df[df['Informational'] < 20]
df = df[df['Informational_Duration'] < 2500]
df = df[df['ProductRelated'] < 600]
df = df[df['ProductRelated_Duration'] < 40000]
df = df[df['PageValues'] < 300]
df.shape
(12325, 75)
The new dataframe has 12325 rows, down from 12330, so only 5 rows (about 0.04%) exceeded the thresholds above. The data loss is negligible.
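The chained filters can also be written as one combined mask, which makes it easy to report how many rows were dropped. The sketch below reuses the upper bounds from the filtering cell; the helper name `drop_above_bounds` and the toy frame are assumptions for illustration.

```python
import pandas as pd

# Upper bounds copied from the filtering cell above.
upper_bounds = {
    "Administrative_Duration": 3000,
    "Informational": 20,
    "Informational_Duration": 2500,
    "ProductRelated": 600,
    "ProductRelated_Duration": 40000,
    "PageValues": 300,
}


def drop_above_bounds(frame: pd.DataFrame, bounds: dict) -> pd.DataFrame:
    """Keep only rows where every listed column is strictly below its bound."""
    keep = pd.Series(True, index=frame.index)
    for column, bound in bounds.items():
        keep &= frame[column] < bound
    return frame[keep]


# Toy data (illustrative only): the second row exceeds the
# Administrative_Duration bound and should be removed.
toy = pd.DataFrame({
    "Administrative_Duration": [0.0, 3400.0, 80.0],
    "Informational": [0, 1, 2],
    "Informational_Duration": [0.0, 10.0, 5.0],
    "ProductRelated": [1, 20, 10],
    "ProductRelated_Duration": [0.0, 500.0, 627.5],
    "PageValues": [0.0, 0.0, 12.0],
})

filtered = drop_above_bounds(toy, upper_bounds)
removed = len(toy) - len(filtered)
print(f"removed {removed} of {len(toy)} rows ({removed / len(toy):.2%})")
```

Keeping the bounds in one dictionary also documents the thresholds in a single place, which helps if they are revisited later.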
# Save the processed data
df.to_csv('preprocessed_osi.csv', index=False)