Data Selection and Segregation

# Import necessary packages
import pandas as pd

Combine datasets

Data from 2017 and 2018 will be used as the training set, and data from 2019 as the test set. For preprocessing, however, all three years are combined first and then separated again before modelling.

# Combine data
data = pd.concat([
    pd.read_csv('./../../../datasets/cleaned_osmi_2017.csv'),
    pd.read_csv('./../../../datasets/cleaned_osmi_2018.csv'),
    pd.read_csv('./../../../datasets/cleaned_osmi_2019.csv'),
])
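
Since the combined data has to be split back into training and test sets later, it may help to tag each row with its survey year before concatenating. The following is only a minimal sketch of that idea: the small in-memory frames stand in for the three cleaned files, and the `year` column is an assumption, not something the cleaned datasets are known to contain.

```python
import pandas as pd

# Hypothetical stand-ins for the three cleaned yearly files.
frames = {
    2017: pd.DataFrame({"age": [30, 41]}),
    2018: pd.DataFrame({"age": [25]}),
    2019: pd.DataFrame({"age": [37, 52]}),
}

# Tag every row with its survey year (assumed helper column), then combine.
combined = pd.concat(
    [df.assign(year=year) for year, df in frames.items()],
    ignore_index=True,
)

# ... shared preprocessing would run on `combined` here ...

# Separate again for modelling and discard the helper column.
train = combined[combined["year"].isin([2017, 2018])].drop(columns="year")
test = combined[combined["year"] == 2019].drop(columns="year")
```

If the cleaned files already carry a year column, the tagging step can be skipped and the split applied to that column directly.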

Feature Selection

For now, it is convenient to drop every column that contains subjective, free-text answers. The list below also includes the two US state columns.

# Selecting columns to be dropped
drop_columns = ['describe the conversation you had with your employer about your mental health, including their reactions and what actions were taken to address your mental health issue/questions.', 
                'describe the conversation with coworkers you had about your mental health including their reactions.', 
                'describe the conversation your coworker had with you about their mental health (please do not use names).', 
                'describe the conversation you had with your previous coworkers about your mental health including their reactions.', 
                'describe the conversation your coworker had with you about their mental health (please do not use names)..1', 
                'describe the conversation you had with your previous employer about your mental health, including their reactions and actions taken to address your mental health issue/questions.',
                'what disorder(s) have you been diagnosed with?',
                'why or why not?',
                'why or why not?.1',
                'describe the circumstances of the badly handled or unsupportive response.',
                'describe the circumstances of the supportive or well handled response.',
                'briefly describe what you think the industry as a whole and/or employers could do to improve mental health support for employees.',
                'if there is anything else you would like to tell us that has not been covered by the survey questions, please use this space to do so.',
                'what us state or territory do you live in?',
                'what us state or territory do you work in?']
# Drop selected columns and store the data
print("Dropping columns")
data.drop(drop_columns, axis=1, inplace=True)
Dropping columns
data.to_csv('./../../../datasets/data.csv', index=False)
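
The column names to drop are long and must match the headers exactly, otherwise `drop` raises a `KeyError`. As a hedged sketch, the names could be checked against the frame before dropping. The toy frame and its values below are purely illustrative and not taken from the survey data.

```python
import pandas as pd

# Toy frame with illustrative values standing in for the combined survey data.
data = pd.DataFrame({
    "age": [30, 41],
    "why or why not?": ["some text", "more text"],
})
drop_columns = ["why or why not?", "why or why not?.1"]

# Names listed for removal that are not present in the frame.
missing = [col for col in drop_columns if col not in data.columns]
print("Not found:", missing)

# errors="ignore" skips absent labels instead of raising KeyError.
data = data.drop(columns=drop_columns, errors="ignore")
```

Reviewing `missing` before dropping makes a misspelled or renamed header visible instead of silently leaving that column in the data.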