Data Preparation

Data preparation is often the most important step in a machine learning project, since model performance depends heavily on the quality of the data. This dataset contains no missing values, which saves a lot of work.

# Import necessary packages
import pandas as pd
import numpy as np
# Load file into the dataframe
df = pd.read_csv('./../../datasets/online_shoppers_intention.csv')
# Data preview
df.head()
   Administrative  Administrative_Duration  Informational  Informational_Duration  ProductRelated  ProductRelated_Duration  \
0               0                      0.0              0                     0.0               1                 0.000000
1               0                      0.0              0                     0.0               2                64.000000
2               0                      0.0              0                     0.0               1                 0.000000
3               0                      0.0              0                     0.0               2                 2.666667
4               0                      0.0              0                     0.0              10               627.500000

   BounceRates  ExitRates  PageValues  SpecialDay  Month  OperatingSystems  \
0         0.20       0.20         0.0         0.0    Feb                 1
1         0.00       0.10         0.0         0.0    Feb                 2
2         0.20       0.20         0.0         0.0    Feb                 4
3         0.05       0.14         0.0         0.0    Feb                 3
4         0.02       0.05         0.0         0.0    Feb                 3

   Browser  Region  TrafficType        VisitorType  Weekend  Revenue
0        1       1            1  Returning_Visitor    False    False
1        2       1            2  Returning_Visitor    False    False
2        1       9            3  Returning_Visitor    False    False
3        2       2            4  Returning_Visitor    False    False
4        3       1            4  Returning_Visitor     True    False
# Data general information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Administrative           12330 non-null  int64  
 1   Administrative_Duration  12330 non-null  float64
 2   Informational            12330 non-null  int64  
 3   Informational_Duration   12330 non-null  float64
 4   ProductRelated           12330 non-null  int64  
 5   ProductRelated_Duration  12330 non-null  float64
 6   BounceRates              12330 non-null  float64
 7   ExitRates                12330 non-null  float64
 8   PageValues               12330 non-null  float64
 9   SpecialDay               12330 non-null  float64
 10  Month                    12330 non-null  object 
 11  OperatingSystems         12330 non-null  int64  
 12  Browser                  12330 non-null  int64  
 13  Region                   12330 non-null  int64  
 14  TrafficType              12330 non-null  int64  
 15  VisitorType              12330 non-null  object 
 16  Weekend                  12330 non-null  bool   
 17  Revenue                  12330 non-null  bool   
dtypes: bool(2), float64(7), int64(7), object(2)
memory usage: 1.5+ MB
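
The non-null counts above can also be checked programmatically. The following is a minimal sketch, assuming a small made-up frame in place of the real dataset; the helper name `missing_report` and the toy values are hypothetical and only illustrate the check.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.Series:
    """Return the number of missing values in each column."""
    return frame.isna().sum()


# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "BounceRates": [0.20, 0.00, np.nan],
    "Month": ["Feb", None, "Mar"],
    "Revenue": [False, False, True],
})

report = missing_report(toy)
print(report)
print("Any missing values:", bool(report.sum() > 0))
```

Applied to the real dataframe, `missing_report(df).sum()` would be expected to return 0, in line with the `df.info()` output.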

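No preprocessing is required for the numerical features at this stage.

The categorical features should be one-hot encoded. This covers the text columns (Month, VisitorType) as well as OperatingSystems, Browser, Region, and TrafficType, which are stored as integers but represent categories rather than quantities.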
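
Integer-coded categories need to be named explicitly when encoding, because `pd.get_dummies` only converts object, string, or category columns by default. A small sketch on a made-up frame (the rows are hypothetical, not taken from the dataset) shows the difference:

```python
import pandas as pd

# Hypothetical toy frame: Browser is an integer code, VisitorType is text
toy = pd.DataFrame({
    "Browser": [1, 2, 1],
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Returning_Visitor"],
    "PageValues": [0.0, 12.5, 0.0],
})

# Default behaviour: only the text column is encoded, Browser stays numeric
default_encoded = pd.get_dummies(toy)

# Explicit column list: the integer-coded Browser column is encoded as well
explicit_encoded = pd.get_dummies(toy, columns=["Browser", "VisitorType"])

print(list(default_encoded.columns))
print(list(explicit_encoded.columns))
```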

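# One-hot encode categorical columns (including the integer-coded ones)
onehot_columns = ['OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType', 'Month']
for column in onehot_columns:
    # The prefix ends in '_' and get_dummies adds its own '_' separator,
    # so the new columns are named like 'Month__Feb' (double underscore)
    temp_df = pd.get_dummies(df[column], prefix=column+'_')
    df = pd.concat([df, temp_df], axis=1)
    df.drop(column, axis=1, inplace=True)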
# Remove outliers by keeping only rows below a fixed upper bound for each feature
df = df[df['Administrative_Duration'] < 3000]
df = df[df['Informational'] < 20]
df = df[df['Informational_Duration'] < 2500]
df = df[df['ProductRelated'] < 600]
df = df[df['ProductRelated_Duration'] < 40000]
df = df[df['PageValues'] < 300]
df.shape
(12325, 75)

Comparing the new shape with the original 12,330 rows shows that only 5 rows (about 0.04%) were removed as outliers, so the data loss is negligible.
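
To see which thresholds account for the removed rows, the filters can be applied as one combined mask while counting the rows each bound flags. This is a sketch under assumptions: the helper `filter_upper_bounds` and the toy values are hypothetical, and the bounds are passed in as a dictionary rather than written as separate statements.

```python
import pandas as pd


def filter_upper_bounds(frame, bounds):
    """Keep rows strictly below every bound.

    Also returns, per column, how many rows that bound alone would flag.
    """
    keep = pd.Series(True, index=frame.index)
    flagged = {}
    for column, limit in bounds.items():
        below = frame[column] < limit
        flagged[column] = int((~below).sum())
        keep &= below
    return frame[keep], flagged


# Hypothetical toy data and bounds, for illustration only
toy = pd.DataFrame({
    "Informational": [0, 3, 25, 1],
    "PageValues": [0.0, 350.0, 10.0, 5.0],
})
bounds = {"Informational": 20, "PageValues": 300}

filtered, flagged = filter_upper_bounds(toy, bounds)
print(flagged)
print(len(toy) - len(filtered), "rows removed")
```

Because every condition is a strict `<` comparison, the combined mask keeps the same rows as applying the filters one after another.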

# Save the processed data
df.to_csv('preprocessed_osi.csv', index=False)
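
Passing `index=False` matters here because the filtered dataframe no longer has a contiguous index; writing the index would add an extra unnamed column when the file is read back. A minimal sketch, using an in-memory buffer and a made-up frame instead of the real file (the helper `round_trip` is hypothetical):

```python
import io

import pandas as pd

# Hypothetical toy frame; filtering leaves a gap in the index (0, 2)
toy = pd.DataFrame({"PageValues": [0.0, 350.0, 5.0]})
toy = toy[toy["PageValues"] < 300]


def round_trip(frame, **to_csv_kwargs):
    """Write the frame to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)


with_index = round_trip(toy)                   # old index written as a column
without_index = round_trip(toy, index=False)   # only the data columns

print(list(with_index.columns))
print(list(without_index.columns))
```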