Introduction to categorical features¶
# Import necessary packages
import pandas as pd
# Load file into dataframe
df = pd.read_csv('./../../../datasets/online_shoppers_intention.csv')
Column description¶
The dataset contains the following categorical features:
OperatingSystems¶
Operating system of the visitor. This column is already encoded and the true values are not available.
Browser¶
Browser of the visitor. This column is already encoded and true values are not available.
Region¶
Geographic region from which the visitor started the session. This column is already encoded and true values are not available.
TrafficType¶
Traffic source by which the visitor has arrived at the Web site (e.g., banner, SMS, direct). This column is already encoded and true values are not available.
VisitorType¶
Visitor type, one of “New Visitor”, “Returning Visitor”, or “Other”. Which visitors are grouped under “Other” is unknown.
Weekend¶
Boolean value indicating whether the visit date falls on a weekend.
Month¶
Month of the visit date.
Revenue¶
Class label indicating whether the visit has been finalized with a transaction. This is the target variable.
# Select categorical features
categorical_columns = ['OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType', 'Weekend', 'Month',
'Revenue']
df = df[categorical_columns]
Column statistics¶
# Data information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   OperatingSystems  12330 non-null  int64
 1   Browser           12330 non-null  int64
 2   Region            12330 non-null  int64
 3   TrafficType       12330 non-null  int64
 4   VisitorType       12330 non-null  object
 5   Weekend           12330 non-null  bool
 6   Month             12330 non-null  object
 7   Revenue           12330 non-null  bool
dtypes: bool(2), int64(4), object(2)
memory usage: 602.2+ KB
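Even though all of the selected features are categorical, several of them are already encoded: ‘OperatingSystems’, ‘Browser’, ‘Region’ and ‘TrafficType’ are stored as integers, and ‘Weekend’ and ‘Revenue’ are stored as booleans. Only ‘VisitorType’ and ‘Month’ are not encoded and are available in string format (dtype object). None of the columns contains missing values.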
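Because the integer codes are labels rather than quantities, one option is to cast those columns to the pandas `category` dtype so that they are not treated as numeric by accident. The snippet below is a minimal sketch of that idea, not a required step of this notebook: `sample` and its values are made up for illustration and only mimic the column names and dtypes shown above.

```python
import pandas as pd

# Hypothetical rows that mimic a few of the dataset's columns and dtypes.
sample = pd.DataFrame({
    'OperatingSystems': [1, 2, 2, 3],
    'Browser': [1, 2, 1, 4],
    'Region': [1, 1, 9, 3],
    'TrafficType': [1, 2, 3, 2],
    'Weekend': [False, True, False, False],
})

# The integer codes identify categories, so arithmetic on them is meaningless.
encoded_columns = ['OperatingSystems', 'Browser', 'Region', 'TrafficType']

# Cast only the integer-encoded columns; 'Weekend' keeps its bool dtype.
sample = sample.astype({col: 'category' for col in encoded_columns})

print(sample.dtypes)
# The distinct levels of a column are available through the .cat accessor.
print(sample['Browser'].cat.categories.tolist())
```

The same `astype` call could be applied to `df` with the real data; whether the boolean columns should also be converted depends on the model that consumes them.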
# Find number of unique values in each categorical feature
unique_cnt_df = pd.DataFrame(columns=['Column', 'Number of unique values'])
unique_cnt_df['Column'] = categorical_columns
unique_cnt_df['Number of unique values'] = unique_cnt_df['Column'].apply(lambda x: df[x].nunique())
unique_cnt_df
| | Column | Number of unique values |
|---|---|---|
| 0 | OperatingSystems | 8 |
| 1 | Browser | 13 |
| 2 | Region | 9 |
| 3 | TrafficType | 20 |
| 4 | VisitorType | 3 |
| 5 | Weekend | 2 |
| 6 | Month | 10 |
| 7 | Revenue | 2 |
The ‘Month’ column is expected to have 12 unique values but has only 10, which means that two months do not appear in the data at all.
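To see which months are absent, the observed labels can be compared against a reference list of all twelve. The following is a rough sketch under stated assumptions: `observed` is a made-up series standing in for `df['Month']`, and `all_months` assumes three-letter English abbreviations, which may not match the spelling used in the file.

```python
import pandas as pd

# Hypothetical sample standing in for df['Month']; not the real data.
observed = pd.Series(['Feb', 'Mar', 'May', 'June', 'Nov', 'Dec', 'Mar'],
                     name='Month')

# Assumed reference labels (three-letter abbreviations).
all_months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Compare on the first three characters so a label such as 'June'
# still matches the reference label 'Jun'.
present = set(observed.str[:3])

# Keep calendar order by iterating over the reference list.
missing_months = [month for month in all_months if month not in present]
print(missing_months)
```

Replacing `observed` with `df['Month']` would list the months missing from the real dataset, provided the labels follow the assumed abbreviations.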