My first exposure to a Machine Learning problem.

Tackling the world-famous Titanic Machine Learning problem and introducing myself to the world of ML.

Shaunak Varudandi
Towards Data Science


Photo by NOAA on Unsplash

Introduction

Everyone in this day and age knows about the RMS Titanic. The Titanic was the biggest ship of its time and was deemed unsinkable, so much so that the ship’s engineers and architects concluded there was no need to add an adequate number of lifeboats. This meant the ship could accommodate more passengers and let them move around the decks freely.

Disaster struck on the 14th of April, 1912, when the Titanic collided with an iceberg, causing extensive damage to the ship. Two hours and forty minutes later, the ship was at the bottom of the ocean. Owing to the designers’ poor planning, the lifeboats could not accommodate all the passengers on board, and the result was an extensive loss of life. The estimated death toll was around 1,517 passengers and crew.

If you are reading this blog, you might be having difficulty connecting its title with the introduction I have given so far. That is understandable; I have given this brief account of the Titanic disaster because my first machine learning project is based on that very topic.

Problem Statement

The Titanic challenge on Kaggle is an open competition for beginners to take part in and get acquainted with the Kaggle platform. It also provides a starting point for people beginning their machine learning journey.

The problem statement is simple. Given a training set and a test set, I had to develop a machine learning algorithm that predicts which passengers survived the Titanic disaster and which perished.

Workflow

Since this was my first exposure to a machine learning problem, I wanted to test as many algorithms as I could on the data set. I ended up importing a number of machine learning algorithms, from simpler ones such as Logistic Regression and K Nearest Neighbors to more intricate ones such as the Random Forest classifier and the AdaBoost classifier.

(Image by author)

After importing the libraries and loading the data into a pandas data frame, I started exploring the data, looking out for possible outliers or missing values that could become a cause of concern for the algorithm in the later stages of the project.

Data Exploration

As the first step of data exploration, I used the describe function to get an overview of the numerical data in the training set. The describe function is a useful tool since it gives the user a statistical overview of the data: the count, mean, standard deviation, and quartile values of every numerical column. The training set holds information on 891 passengers.

(Image by author)
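
A minimal sketch of this setup, assuming the standard file names from the Kaggle download:

```python
import pandas as pd

# Load the competition data; the file names assume the standard
# Kaggle Titanic download.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Count, mean, standard deviation, min/max, and quartile values
# for every numerical column (891 rows in the training set).
print(train.describe())
```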

In the next step, I used bar graphs to explore the categorical variables. Bar graphs are a good way of examining the count of each label in a categorical feature. A bar graph of the target label, Survived, is given in the image below, where 1 signifies survived and 0 signifies did not survive.

(Image by author)
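
A bar graph like this one takes a single line with pandas; the plotting details below are my own sketch, not lifted from the original notebook:

```python
import matplotlib.pyplot as plt

# Count each target label (0 = did not survive, 1 = survived)
# and plot the counts as a bar graph.
train["Survived"].value_counts().plot(kind="bar")
plt.xlabel("Survived")
plt.ylabel("Number of passengers")
plt.show()
```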

Bar graphs were also used to compare the number of women and the number of men who survived the accident. As the graphs below show quite clearly, the number of women who survived the accident far outweighs the number of men who survived it. The history books offer an explanation: the captain explicitly issued an order for women and children to be saved first. As a result, the survival rate for women was almost three times higher than for men.

Bar Graph depicting the number of women who survived (left) v/s Bar Graph depicting the number of men who survived (right). (Image by author)
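
The comparison behind these graphs can also be reproduced numerically; a small sketch:

```python
# Counts behind the two bar graphs above.
print(train.groupby(["Sex", "Survived"]).size())

# Survival rate by sex: the mean of the 0/1 target in each group.
print(train.groupby("Sex")["Survived"].mean())
```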

After getting a fairly decent idea of the categorical variables in the training set, I moved my attention to the missing values present in the data sets (i.e. the training data and the test data). Missing values can cause an algorithm to misbehave and underperform, which is why they should be dealt with appropriately. I used the pandas isnull and sum functions to get the count of missing values. In the training data, Age, Cabin, and Embarked were the columns with missing values, whereas in the test data, Fare, Age, and Cabin had missing entries.

Summary of missing values in Training Data (left) and Test Data (Right). (Image by author)
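
In pandas, this check is a one-liner per data set:

```python
# Per-column count of missing values in each data set.
print(train.isnull().sum())
print(test.isnull().sum())
```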

Data Engineering (Training Set)

After the data exploration step, I had a pretty good idea of the data set at hand, and hence I decided to make the necessary modifications to the data to suit the needs of the machine learning algorithms.

Step 1: Removing Rows/Columns with Missing Data.

The “Embarked” column had 2 missing values. Owing to this low count, I decided to drop the rows containing them; dropping two rows does not lead to a significant loss of information for the algorithm. The “Cabin” column, on the other hand, had a considerable number of missing values, accounting for 77% of its entries. Gaps on that scale can keep a model from reaching optimal accuracy, so I decided to drop the “Cabin” column entirely.
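
A sketch of this step in pandas:

```python
# Drop the two rows with a missing "Embarked" value, then drop
# the "Cabin" column, which is roughly 77% empty.
train = train.dropna(subset=["Embarked"])
train = train.drop(columns=["Cabin"])
```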

Step 2: Fill the Missing Values in a Data Set.

The only column left with missing values in the training set was the “Age” column. To fill them in, I used a boxplot. Since the “Pclass” column has no missing values, I plotted the available “Age” values against it. The resulting boxplots gave me a representative “Age” value for passengers in each class on the ship, and these values were then substituted for the missing ones, yielding a complete “Age” column. The boxplot can be seen in the image below.

(Image by author)
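
I read the replacement values off the boxplot; a programmatic equivalent is sketched below. The per-class median is assumed here, since that is the value a boxplot’s center line marks, and seaborn is assumed for the plot itself:

```python
import seaborn as sns

# The boxplot described above: Age distributions per passenger class.
sns.boxplot(x="Pclass", y="Age", data=train)

# Impute each missing age with a per-class value (median assumed).
train["Age"] = train.groupby("Pclass")["Age"].transform(
    lambda age: age.fillna(age.median())
)
```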

Step 3: One-hot encoding of the categorical variables.

One-hot encoding is a process by which categorical variables are converted into a numerical form that ML algorithms can interpret, helping them do a better job at prediction. The pandas get_dummies function converts categorical variables into this one-hot encoded format. For this project, I converted the “Sex” column and the “Embarked” column into one-hot encoded features. The image below shows the new columns added to the data frame; the “male” and “female” columns were derived by applying one-hot encoding to the “Sex” column.

(Image by author)
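
A sketch of this step; applied to a single column, get_dummies names the new 0/1 columns after the category values, which is where “male” and “female” come from:

```python
# One-hot encode "Sex" and "Embarked"; the new columns are named
# after the category values ("female"/"male", and the port codes).
train = pd.concat(
    [train, pd.get_dummies(train["Sex"]), pd.get_dummies(train["Embarked"])],
    axis=1,
)
```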

Step 4: Dropping columns which are not required

After performing steps 1, 2, and 3, I was left with a few columns that are not required in the later stages of the project. Firstly, I dropped the “Sex” and “Embarked” columns, since their one-hot encoded equivalents were now in place; secondly, I dropped the “Ticket” and “Name” columns, since they do not add to the predictive power of the machine learning models.
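
A sketch of this step:

```python
# Remove the originals of the encoded columns and the two columns
# that carry no predictive value for this model.
train = train.drop(columns=["Sex", "Embarked", "Ticket", "Name"])
```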

Data Engineering (Test Set)

Since the training set and the test set were two different CSV files, after completing the data engineering on the training set I switched my attention to the test set, cleaning it with the same four steps described in the previous section.

In summary, the “Cabin” column was dropped, the row with a missing “Fare” value was removed, and the missing “Age” values were substituted using the same methodology as in the training set. Lastly, the categorical variables were one-hot encoded, and the columns that were not required were dropped from the test set.

Implementing Machine Learning algorithms on training data

Now that I had clean data sets, the only part remaining was to assign the training and test sets to variables. The same can be seen in the image below.

(Image by author)
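
A sketch of this step; keeping PassengerId out of the feature set is my assumption, not something stated above:

```python
# Separate the features from the target in the training set.
y_train = train["Survived"]
X_train = train.drop(columns=["Survived", "PassengerId"])

# Keep the test-set IDs aside for the submission file.
passenger_ids = test["PassengerId"]
X_test = test.drop(columns=["PassengerId"])
```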

Once the above step was executed, I started implementing Machine learning algorithms on the training data set. The algorithms that I implemented on the training data are:

  • Logistic Regression
  • Support Vector Machine (Kernel: Linear, RBF, and Sigmoid)
  • Decision Trees
  • Ridge Classifier
  • Random Forest Classifier
  • XGBoost Classifier
  • AdaBoost Classifier
  • K Nearest Neighbors Classifier

After implementing the above-listed algorithms on the training data, I calculated the model accuracy as well as the cross-validation score of the fitted models. The accuracy scores and the cross-validation scores of every model have been listed in the table below.

(Image by author)
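
A sketch of the fit-and-score loop for two of the listed models; the rest follow the same pattern, and the 5-fold count is an assumption:

```python
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import cross_val_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Ridge Classifier": RidgeClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    # Accuracy on the data the model was trained on...
    train_acc = model.score(X_train, y_train)
    # ...versus the mean accuracy across 5 cross-validation folds;
    # a large gap between the two suggests overfitting.
    cv_acc = cross_val_score(model, X_train, y_train, cv=5).mean()
    print(f"{name}: accuracy = {train_acc:.3f}, CV score = {cv_acc:.3f}")
```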

Inference

Looking at the above table, SVM (Kernel: RBF), Decision Trees, and the XGBoost Classifier had remarkable model accuracy but a low cross-validation score, an indication of overfitting on the training data. The Ridge Classifier, on the contrary, had a high model accuracy as well as a high cross-validation score. Thus, I concluded that the Ridge Classifier was the optimal algorithm for my machine learning project.

Submission on Kaggle

I used the Ridge Classifier on the test set and generated a CSV file for submission on Kaggle. The submission file consists of the predicted values for the passengers present in the test set. The model achieved an accuracy of 76%.
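
A sketch of the submission step, reusing the passenger IDs kept aside earlier; the output file name is arbitrary:

```python
from sklearn.linear_model import RidgeClassifier

# Fit the chosen model on the full training set and predict the
# test set.
model = RidgeClassifier()
model.fit(X_train, y_train)

# Kaggle expects a two-column file: PassengerId and Survived.
submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": model.predict(X_test),
})
submission.to_csv("submission.csv", index=False)
```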

Conclusion

My first attempt at a machine learning project was a fruitful one. I came across various methods for cleaning data and handling missing values, and I was able to implement various machine learning algorithms on the data set and assess their accuracy.

In the future, I will be learning more about the intricacies of various machine learning algorithms and how they work. This will allow me to assess a situation and use the algorithm that best satisfies its needs. Lastly, I will be working on many more interesting machine learning projects and sharing them with the machine learning enthusiasts out there.

The process flow I followed for the Titanic project can be found on GitHub. I hope you enjoyed reading my blog.

