Introduction to categorical features

# Import necessary packages
import pandas as pd
# Load file into dataframe
df = pd.read_csv('./../../../datasets/online_shoppers_intention.csv')

Column description

The dataset contains the following categorical features:

OperatingSystems

Operating system of the visitor. This column is already encoded and the true values are not available.

Browser

Browser of the visitor. This column is already encoded and true values are not available.

Region

Geographic region from which the visitor started the session. This column is already encoded and true values are not available.

TrafficType

Traffic source by which the visitor has arrived at the Web site (e.g., banner, SMS, direct). This column is already encoded and true values are not available.

VisitorType

Visitor type, one of “New Visitor”, “Returning Visitor”, or “Other”. It is not known which visitors fall under “Other”.

Weekend

Boolean value indicating whether the date of the visit falls on a weekend.

Month

Month of the visit date.

Revenue

Class label indicating whether the visit has been finalized with a transaction. This is the target variable.

# Select categorical features
categorical_columns = ['OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType', 'Weekend', 'Month', 
                       'Revenue']
df = df[categorical_columns]

Column statistics

# Data information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   OperatingSystems  12330 non-null  int64 
 1   Browser           12330 non-null  int64 
 2   Region            12330 non-null  int64 
 3   TrafficType       12330 non-null  int64 
 4   VisitorType       12330 non-null  object
 5   Weekend           12330 non-null  bool  
 6   Month             12330 non-null  object
 7   Revenue           12330 non-null  bool  
dtypes: bool(2), int64(4), object(2)
memory usage: 602.2+ KB

Even though all of the selected features are categorical, some of them are already encoded and stored as numbers. ‘VisitorType’ and ‘Month’ are not encoded and are stored as strings.
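One way to make the categorical nature of these columns explicit is to cast them to the pandas `category` dtype, so that integer codes are not mistaken for numeric quantities. The sketch below is only an illustration: `sample` is a small made-up frame, not the real dataset, and its values are assumptions.

```python
import pandas as pd

# Hypothetical miniature of the dataset; all values are made up for illustration.
sample = pd.DataFrame({
    'OperatingSystems': [1, 2, 2, 3],
    'VisitorType': ['Returning Visitor', 'New Visitor', 'Returning Visitor', 'Other'],
    'Weekend': [False, True, False, False],
})

# Cast every column to the 'category' dtype so that integer-encoded
# columns are treated as labels rather than as numeric quantities.
sample_cat = sample.astype('category')

# Every column should now report the 'category' dtype
print(sample_cat.dtypes)

# List the distinct categories of one integer-encoded column
print(sample_cat['OperatingSystems'].cat.categories.tolist())
```

The same cast could be applied to the real frame with `df.astype('category')`, assuming every column in `df` is meant to be treated as categorical.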

# Find number of unique values in each categorical feature
unique_cnt_df = pd.DataFrame(columns=['Column', 'Number of unique values'])
unique_cnt_df['Column'] = categorical_columns
unique_cnt_df['Number of unique values'] = unique_cnt_df['Column'].apply(lambda x: df[x].nunique())
unique_cnt_df
             Column  Number of unique values
0  OperatingSystems                        8
1           Browser                       13
2            Region                        9
3       TrafficType                       20
4       VisitorType                        3
5           Weekend                        2
6             Month                       10
7           Revenue                        2

The ‘Month’ column would be expected to have 12 unique values but has only 10, so two months do not appear in the data.
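To see which months are absent, the observed labels can be compared against a full calendar list. The following is a minimal sketch under stated assumptions: the helper `missing_months` is hypothetical, the sample labels are made up, and the month strings are assumed to begin with the usual three-letter English abbreviation (for example ‘Feb’ or ‘June’). That assumption should be checked against `df['Month'].unique()` first.

```python
import pandas as pd

# Three-letter English month abbreviations in calendar order
EXPECTED_MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                   'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

def missing_months(month_series):
    """Return the month abbreviations that never occur in month_series."""
    # Normalise each label to its first three letters, e.g. 'June' -> 'Jun'
    observed = set(month_series.astype(str).str[:3].str.title())
    return [m for m in EXPECTED_MONTHS if m not in observed]

# Made-up sample labels, not the real data
sample_months = pd.Series(['Feb', 'Mar', 'May', 'June', 'Nov', 'Dec', 'Feb'])
print(missing_months(sample_months))
```

On the real frame the call would be `missing_months(df['Month'])`.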