Making it to the Top 4% of a Kaggle leaderboard and why I shouldn't be there
If there is one thing I learned by doing this, it's that I really don't know very much. Now that I've gotten that off my chest, here are the methods, mistakes, and lessons I learned while attempting my first ML models in R.
TLDR; I wanted to apply my beginner R skills to a complex problem, and chose machine learning because why not. I managed to do well, but my lack of knowledge blinded me to the fact that I was kinda cheating, as I relied heavily on trial and error to climb the ranks. Despite this, I learned a lot about classification models and random forests, wrote more complex R code than the beginner stuff I had done so far, and enjoyed messing around and reflecting on the experience. All in all, it’s still a win for me.
I think everyone can agree that 2020 was full of surprises, most of them unpleasant. But one of the better ones was being introduced to R as part of my psychology master's. Because psychology is faced with a serious replication crisis, where researchers are only able to reproduce 36% of studies with similar results, providing open-source code in journal articles is becoming the new standard.
Rather than gate-keeping data and using archaic tools like SPSS, the Wordpress of data analysis tools, many psychology researchers now publish their analysis code and raw data with their articles. By increasing transparency, researchers hope to improve the validity of a field that is often plagued with pseudoscience-y rhetoric and misleading headlines. I mean, articles that lead with “scientists say smart people eat more brie” based on a paper with a 3-person sample will probably never disappear, but this is hopefully a step in the right direction.
Fast forward half a year: I got over the initial excitement of learning something new, endured a lot of frustration and feelings of defeat triggered by incomprehensible errors, and eventually became somewhat comfortable with R. I even started to like it. There’s something I always find so satisfying about building things from scratch, whether it’s an article or a function that loops through columns without needing to use an interface. But at some point I got bored of running t-tests and power analyses on clean data prepped by my lecturers, so I turned to Google (mostly Reddit) to answer the question: what else can I do with this? And the answer was Kaggle.
Kaggle and the “Titanic - Machine Learning from Disaster” competition
Kaggle is a community that hosts data science and machine learning competitions. Before starting, I fell into the majority of people who think they understand what an algorithm is but couldn’t clearly explain one (being able to explain something is the metric I use to gauge comprehension). People often use the word magic to describe copywriting, and it’s a word that gets attached to data science and machine learning just as often. So naturally, I’ve always been convinced that there is absolutely nothing magical about it. And for the first time ever, I felt equipped with at least some of the basic skills I needed to venture into this mystical realm of predictions.
The very first competition recommended to Kaggle beginners involves predicting whether or not people on the Titanic survived the disaster. What you get is a file with info like gender, age, how much they paid for their tickets, and a bunch of other information you need to make “magic” with. I’ll explain to the best of my abilities how I did that, and how I managed to get to the top 4% of the competition leaderboard. But first, I need to explain why my rank means absolutely nothing, and how I (unknowingly) cheated and guessed my way through this process.
Why my high leaderboard rank is incredibly misleading
Training a model is all about finding relationships in data, storing them (as a model), and applying them to new data to answer a question. How well that question is answered is determined in testing. Say you’re given a multiple choice quiz in school to test your knowledge and understanding. If you can make unlimited attempts at submitting and reiterating your answers based on the results, you will eventually guess what most of the right answers are - even if you know absolutely nothing.
It was only after I finished the competition that I realized that this was basically what I did. I used many submissions to test my models, made plenty of random and uninformed guesses, and only kept the parameter tweaks and methods that boosted my score. There is absolutely nothing scientific about this. That said, I still learned a lot and expanded my understanding of what R can do. But in terms of applying even very simple ML to a real-life problem, I would still be very lost.
What I did, how it went, and the R libraries I used
Here is a quick overview of the models I built and how well they performed on the Kaggle test data.
| Model | Classification Tree | Random Forest | Conditional Inference Tree (CIT) | Random Forest (CIT) |
| --- | --- | --- | --- | --- |
| Accuracy | 77.99% | 75.83% | 75.20% | 80.38% |
library(tidyverse) #For data wrangling
library(ggplot2) #For general data visualization
library(ggridges) #For density plots
library(randomForest) #For random forest modeling (CART)
library(rpart) #For decision tree modeling (CART)
library(rpart.plot) #For plotting decision trees
library(caret) #For model training and optimization
library(party) #For conditional inference trees and random forests (CIT)
Data wrangling and feature engineering (a fancy way of saying “make better variables from existing ones”)
The data on Kaggle is already divided into two files, one for training the model and one to test it. Unlike the training data, the test data excludes the column Survived, because the whole point of the challenge is to figure that out. I combined the two data sets into one table and added a Survived column with NAs for the test data. This makes it easier to modify and create new variables once, instead of doing it twice for each dataset. Here’s how the first few rows look:
train <- read.csv("train.csv")
test <- read.csv("test.csv")
test$Survived <- NA
combined <- rbind(train, test)
head(combined)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
When taking a look at how the data is structured, I could see that some nominal variables are incorrectly classed as integers (since number codes are used, e.g. for Survived, 0 means they didn’t make it and 1 means they did, but the numbers alone mean nothing). Also, some variables like Name were messy and not necessarily useful, at least not in their current format.
str(combined)
## 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
Step 1: Classify nominal and categorical variables as factors
I know that “male” and “female” are two options for the variable Sex, not just a series of random letters. And based on my knowledge from traveling, I can guess that Pclass represents how much passengers paid for their ticket, and that the lowest number is actually the highest class. Reclassifying these variables just gives R the context it needs to interpret these values correctly as factors, rather than as a series of letters or numbers.
class(combined$Sex) #Right now this is classed as a character variable
## [1] "character"
Variables like Survived
and Sex
are nominal variables, meaning there is no order of importance or value (though we could argue that surviving beats death). On the other hand, Pclass
or passenger class goes from 3rd to 2nd then 1st in terms of their hierarchical value, making it an ordinal variable. Because this is counter-intuitive to R, which assumes higher numbers are of higher value, this needs to be set manually with the levels
argument.
combined$Survived <- factor(combined$Survived)
combined$Sex <- factor(combined$Sex)
combined$Embarked <- factor(combined$Embarked)
combined$Pclass <- factor(combined$Pclass, order = TRUE, levels = c(3, 2, 1))
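To double-check that the custom ordering actually stuck, a quick comparison of two passengers should do the trick:
#Passenger 1 travelled 3rd class and passenger 2 travelled 1st class,
#so with the ordering 3 < 2 < 1 this comparison should return TRUE
levels(combined$Pclass) #should show "3" "2" "1"
combined$Pclass[1] < combined$Pclass[2]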
Step 2: Visualize the data to explore relationships
I like pictures, and graphs make it easier to understand the data and spot relationships in it. A quick look at Survival by the other variables in the train dataset made a few things clear, which I grouped by categorical and numeric (interval/ratio) variables.
Survival
by categorical variables
Sex
- More women than men survived
mosaicplot(table(train$Sex, train$Survived),
color = TRUE,
xlab = "Sex",
ylab = "Survived",
main = "Sex vs Titanic Survival",
sub = "Survived: 0 = died, 1 = survived"
)
Pclass
- More high class passengers survived than low class passengers
mosaicplot(table(train$Pclass, train$Survived),
color = TRUE,
xlab = "Sex",
ylab = "Survived",
main = "Passenger Class vs Titanic Survival",
sub = "Survived: 0 = died, 1 = survived"
)
Embarked
- Passengers boarding at Cherbourg had the highest proportion of survivors
mosaicplot(table(train$Embarked, train$Survived),
color = TRUE,
ylab = "Survived",
main = "Port of Embarkation vs Titanic Survival",
sub = "\n Survived: 0 = died, 1 = survived \n C = Cherbourg, Q = Queenstown, S = Southampton"
)
Survival
by numeric (interval/ratio) variables
Age
- The bimodal distribution for survivors suggests that more children and elderly passengers survived than didn’t (note that there are 177 missing Age values out of the 891 passengers in the train data)
theme_set(theme_ridges())
#Need to drop NAs for now for visualizations to run
train_fs <- train %>%
na.omit()
ggplot(train_fs, aes(x = Age, y = as.factor(Survived), height = after_stat(density))) +
  geom_density_ridges(aes(fill = as.factor(Survived)),
                      stat = "density",
                      trim = TRUE,
                      show.legend = TRUE,
                      panel_scaling = FALSE) +
  scale_x_continuous(expand = c(0.01, 0)) +
  labs(title = "Age in Years") +
  scale_fill_grey() +
  theme_minimal()
Fare
- A larger proportion of people who died paid low ticket fares
ggplot(train_fs, aes(x = Fare, y = as.factor(Survived), height = after_stat(density))) +
  geom_density_ridges(aes(fill = as.factor(Survived)),
                      stat = "density",
                      trim = TRUE,
                      show.legend = TRUE,
                      panel_scaling = FALSE) +
  scale_x_continuous(expand = c(0.01, 0)) +
  labs(title = "Ticket Fare") +
  scale_fill_grey() +
  theme_minimal()
SibSp
- Number of siblings and spouses on board seems to be slightly higher for survivors than non-survivors
ggplot(train, aes(x = SibSp, y = as.factor(Survived), height = after_stat(density))) +
  geom_density_ridges(aes(fill = as.factor(Survived)),
                      stat = "density",
                      trim = TRUE,
                      show.legend = FALSE,
                      panel_scaling = TRUE) +
  scale_x_continuous(expand = c(0.01, 0)) +
  labs(title = "Number of Siblings/Spouses") +
  scale_fill_grey() +
  theme_minimal()
Parch
- Number of parents and children on board seems to be slightly higher for survivors than non-survivors
ggplot(train, aes(x = Parch, y = as.factor(Survived), height = after_stat(density))) +
  geom_density_ridges(aes(fill = as.factor(Survived)),
                      stat = "density",
                      trim = TRUE,
                      show.legend = TRUE,
                      panel_scaling = FALSE) +
  scale_x_continuous(expand = c(0.01, 0)) +
  labs(title = "Number of Parents/Children") +
  scale_fill_grey() +
  theme_minimal()
What about the remaining variables?
There are four variables left, but they are tricky to work with for the following reasons:
- PassengerId - Not very useful, just used for indexing rows, so it is excluded when training models
- Name - As a string, this variable isn’t that useful, for the same reasons as PassengerId
- Ticket - These ticket numbers are not useful in their original format (929 unique values across both train and test data)
- Cabin - The cabin numbers are also not useful in their original format (186 unique values across both train and test data, plus 1,014 missing values)
combined %>% group_by(Ticket) %>% summarize(count=n())
## # A tibble: 929 × 2
## Ticket count
## <chr> <int>
## 1 110152 3
## 2 110413 3
## 3 110465 2
## 4 110469 1
## 5 110489 1
## 6 110564 1
## 7 110813 2
## 8 111163 1
## 9 111240 1
## 10 111320 1
## # … with 919 more rows
combined %>% group_by(Cabin) %>% summarize(count=n())
## # A tibble: 187 × 2
## Cabin count
## <chr> <int>
## 1 "" 1014
## 2 "A10" 1
## 3 "A11" 1
## 4 "A14" 1
## 5 "A16" 1
## 6 "A18" 1
## 7 "A19" 1
## 8 "A20" 1
## 9 "A21" 1
## 10 "A23" 1
## # … with 177 more rows
Step 3: Feature engineering (creating new variables)
While more data doesn’t necessarily make for a more accurate model, some of the trickier variables can be manipulated into something more useful for the model, which is (sometimes) better than excluding them altogether.
New variable 1: FamilySize
from SibSp
and Parch
When visualizing the data, it was clear that there were only minor differences in the SibSp and Parch distributions between passengers who survived and those who didn’t. But what if they were added together (plus one for the passenger themselves) to create a new variable called FamilySize? Once combined into FamilySize, it was clearer that there is a larger proportion of non-survivors at the lower end of the x-axis (people with fewer family members).
combined$FamilySize <- combined$SibSp + combined$Parch + 1
combined_fs <- combined %>%
na.omit()
familysize_plot <- ggplot(combined_fs, aes(x = FamilySize, y = Survived, height = after_stat(density))) +
  geom_density_ridges(aes(fill = Survived),
                      stat = "density",
                      trim = TRUE,
                      show.legend = FALSE,
                      panel_scaling = TRUE) +
  scale_x_continuous(expand = c(0.01, 0)) +
  labs(title = "Family Size") +
  scale_fill_grey() +
  theme_minimal()
familysize_plot
New variable 2: Title
from Name
Considering the distribution of Survival by Fare, it made sense to assume that the more you paid, the more likely you were to get special treatment and VIP seats on the life rafts. In addition to the passengers’ names, the Name variable also includes their titles. Based on the title, the wealth or status of passengers who survived can be compared with that of those who didn’t.
The Name variable is classed as a character or string, which can be split up by separators like spaces, periods, and commas. The following code shows how the first name, title, and family name are split and ordered.
strsplit(combined$Name[1], split='[,.]')
## [[1]]
## [1] "Braund" " Mr" " Owen Harris"
strsplit(combined$Name[200], split='[,.]')
## [[1]]
## [1] "Yrois" " Miss"
## [3] " Henriette (\"Mrs Harbeck\")"
class(strsplit(combined$Name[200], split='[,.]')) #Tells us that these words are put into a list
## [1] "list"
The title is wedged between the last and first names, so its position is 2 in the list that strsplit() from base R creates as output, and it can be selected by subsetting with [[1]][2]. The first index [[1]] is the position in the list, while [2] is the position of the desired piece of the split string within that list item (here, the title). For example:
strsplit(combined$Name[200], split='[,.]')[[1]][2]
## [1] " Miss"
I used sapply
to repeat this process over the entire dataset to create a new variable called Title
.
getitle <- function(x) {
strsplit(x, split='[,.]')[[1]][2]
}
combined$Title <- sapply(combined$Name, getitle)
table(combined$Title) #View all titles in a table
##
## Capt Col Don Dona Dr
## 1 4 1 1 8
## Jonkheer Lady Major Master Miss
## 1 1 2 61 260
## Mlle Mme Mr Mrs Ms
## 2 1 757 197 2
## Rev Sir the Countess
## 8 1 1
Some of these seem repetitive. I dealt with this by grouping them as follows:
combined$Title <- trimws(combined$Title) #Remove the extra white space left behind by strsplit(); without this, the replacements below don't match
combined$Title[combined$Title %in% c('Mme', 'Mlle')] <- 'Mlle' #French ladies
combined$Title[combined$Title %in% c('Capt', 'Don', 'Major', 'Sir')] <- 'Sir' #Rich or high-ranking men
combined$Title[combined$Title %in% c('Dona', 'Lady', 'the Countess', 'Jonkheer')] <- 'Lady' #Rich ladies
table(combined$Title)
##
## Col Dr Lady Master Miss Mlle Mr Mrs Ms Rev Sir
## 4 8 4 61 260 3 757 197 2 8 5
Because decision trees (the basis of all the methods I used for modeling) favor variables with more factor levels, I made these groups even smaller to try and prevent bias.
combined$Title[combined$Title %in% c('Mlle', 'Mrs', 'Ms', 'Lady', 'Miss')] <- 'Ladies' #All women
combined$Title[combined$Title %in% c('Mr')] <- 'RegGuys' #Regular dudes
combined$Title[combined$Title %in% c('Col', 'Rev', 'Dr')] <- 'SmartGuys' #Educated/career dudes
combined$Title[combined$Title %in% c('Sir', 'Master')] <- 'RichGuys' #Guys who get small 7-figure loans from daddy to start a real-estate business
table(combined$Title)
##
## Ladies RegGuys RichGuys SmartGuys
## 466 757 66 20
combined$Title <- factor(combined$Title)
Visually, the results look like this. Right off the bat, it was clear that RegGuys
had the highest ratio of death to survival.
mosaicplot(table(combined$Title, combined$Survived),
color = TRUE,
ylab = "Survived",
main = "Title vs Titanic Survival",
sub = "\n Survived: 0 = died, 1 = survived"
)
New variable 3: FamilySurvival
from Name
and Ticket
Note: many disgruntled people on Reddit claim that using this variable is cheating, since this information wouldn’t have been available to anyone at the time who needed to predict who would die. But then again, there were also no computers then, and I’m not about to do all of this with pen and paper.
This was the most confusing variable to wrap my head around, and truthfully I borrowed and Frankensteined most of the following code from a few kernels that did the same thing to improve model accuracy. Essentially, FamilySurvival is determined by looking at passengers who match by Ticket and last name (pulled from the Name variable) and checking whether at least one family member survived, or none did. This factor variable has the following levels:
- 0 = no survivors
- 0.5 = neither condition met (neutral)
- 1 = at least one survivor
First, the code extracts the last names into a new variable called LastName, to determine whether any people with the same last name survived. Next, it loops through the Ticket and LastName columns to see whether people traveling under the same ticket number (who we assume are related, or at least traveling together) or sharing the same family name survived or not. If either condition is met, those passengers are considered family.
#Get that last name from Name and create a new column
combined$LastName = gsub(",.*$", "", combined$Name)
#Set FamilySurvival default to 0.5 so it's neutral
combined$FamilySurvival = 0.5
#Loop over each ticket number and update FamilySurvival for the whole group
for (i in unique(combined$Ticket)) {
  a <- filter(combined, Ticket == i)
  if (nrow(a) > 1 & any(a$FamilySurvival < 1) & any(a$Survived == 1, na.rm = TRUE)) {
    #More than one person on the ticket and at least one known survivor
    combined[combined$Ticket == i,]$FamilySurvival = 1
  } else if (nrow(a) > 1 & all(a$FamilySurvival < 1) & all(a$Survived == 0, na.rm = TRUE)) {
    #More than one person on the ticket and no known survivors
    combined[combined$Ticket == i,]$FamilySurvival = 0
  }
}
#Convert to factor
combined$FamilySurvival <- factor(combined$FamilySurvival)
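For what it’s worth, the same grouping logic could probably be written more compactly with dplyr. Something along these lines should do it (a rough sketch I didn’t actually use, grouping by ticket only like the loop above):
#Rough dplyr sketch of the same idea: within each ticket group, flag 1 if
#anyone is known to have survived, 0 if everyone is known to have died,
#and leave 0.5 otherwise
combined_alt <- combined %>%
  group_by(Ticket) %>%
  mutate(FamilySurvival_alt = case_when(
    n() > 1 & any(Survived == "1", na.rm = TRUE) ~ 1,
    n() > 1 & all(Survived == "0", na.rm = TRUE) ~ 0,
    TRUE ~ 0.5
  )) %>%
  ungroup()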
The plot below highlights that passengers with no surviving family members had the largest proportion of deaths. Essentially, a very useful predictor created out of two originally useless pieces of information.
mosaicplot(table(combined$FamilySurvival, combined$Survived),
color = TRUE,
ylab = "Survived",
main = "Title vs Titanic Survival",
sub = "\n Survived: 0 = died, 1 = survived"
)
New variable 4: CabinClass
from Cabin
As mentioned earlier, the Cabin values were really messy, and there was a huge number of blank or missing values. But before dealing with that, I stripped out the numbers to extract just the leading letter (the cabin class or deck), which could then be compared with Pclass.
combined$CabinClass <- substr(combined$Cabin, 1, 1)
table(combined$CabinClass)
## 
##         A    B    C    D    E    F    G    T 
## 1014   22   65   94   46   41   21    5    1
combined$CabinClass <- factor(combined$CabinClass)
Unlike Pclass, CabinClass doesn’t say much about Survival. But the main reason for this seems to be the huge proportion of missing values.
pclass_plot <- mosaicplot(table(combined$Pclass, combined$Survived),
color = TRUE,
xlab = "Sex",
ylab = "Survived",
main = "Passenger Class vs Titanic Survival",
sub = "Survived: 0 = died, 1 = survived"
)
Step 4: Deal with missing Age, Fare, Cabin and Embarked values
To determine how many missing values there were in both the test and train datasets, I created a function and then looped it over each column to sum up the number of NAs.
missing <- function(x) {
  missing_vals <- sum(is.na(x))
  if (missing_vals > 0) {
    print(missing_vals)
  } else {
    print("Complete!")
  }
}
table(lapply(combined, missing))
## [1] "Complete!"
## [1] 418
## [1] "Complete!"
## [1] "Complete!"
## [1] "Complete!"
## [1] 263
## [1] "Complete!"
## [1] "Complete!"
## [1] "Complete!"
## [1] 1
## [1] "Complete!"
## [1] "Complete!"
## [1] "Complete!"
## [1] "Complete!"
## [1] "Complete!"
## [1] "Complete!"
## [1] "Complete!"
## , , Pclass = Complete!, Name = Complete!, Sex = Complete!, Age = 263, SibSp = Complete!, Parch = Complete!, Ticket = Complete!, Fare = 1, Cabin = Complete!, Embarked = Complete!, FamilySize = Complete!, Title = Complete!, LastName = Complete!, FamilySurvival = Complete!, CabinClass = Complete!
##
## Survived
## PassengerId 418
## Complete! 1
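For what it’s worth, a more compact way to get the same per-column counts (an alternative sketch, not what I actually used above) is:
#Count NAs per column in one go; returns a named vector of counts
sapply(combined, function(x) sum(is.na(x)))
colSums(is.na(combined)) does the same thing in even fewer characters.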
There were a lot of missing values, especially for Age: 86 missing Age values in the test dataset and 177 in the train dataset. There was also one missing Fare value, which the imputation below will take care of. Embarked, on the other hand, needed a bit more cleaning.
Missing values can either be eliminated with the na.omit() function, replaced with the mean, median, or mode, or imputed statistically. I chose the third option since it seemed like the most accurate way to go about it. To do this, I used rfImpute() from the randomForest package, which estimates the missing values based on the relationships between the values that are already there.
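For comparison, the simpler “replace with the median” option mentioned above would look something like this (shown on a throwaway copy so it doesn’t interfere with the imputation below):
#Median imputation for Age, done on a copy of the data purely for illustration
combined_median <- combined
combined_median$Age[is.na(combined_median$Age)] <- median(combined_median$Age, na.rm = TRUE)
summary(combined_median$Age) #no more NAs, but every missing passenger gets the same age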
Remove redundant variables
Some variables like Name, LastName, Cabin and Ticket are character variables that are no longer necessary. They also cannot be included in the data when imputing missing values. The reason, as I mentioned before, is that they cannot be used to find relationships or patterns, since R only sees them as arbitrary strings of text. Converting them to factors would also make no sense, as being named John probably has no significant impact on your survival (even if some relationship could technically be found).
combined <- combined %>%
select(-Ticket, -Cabin, -LastName, -Name)
str(combined)
## 'data.frame': 1309 obs. of 13 variables:
## $ PassengerId : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Ord.factor w/ 3 levels "3"<"2"<"1": 1 3 1 3 1 1 3 1 1 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
## $ FamilySize : num 2 2 1 2 1 1 1 5 3 2 ...
## $ Title : Factor w/ 4 levels "Ladies","RegGuys",..: 2 1 1 1 2 2 2 3 1 1 ...
## $ FamilySurvival: Factor w/ 3 levels "0","0.5","1": 2 3 2 3 2 2 1 1 3 3 ...
## $ CabinClass : Factor w/ 9 levels "","A","B","C",..: 1 4 1 4 1 1 6 1 1 1 ...
Another issue is that Embarked and CabinClass have a bunch of blank cells. Again, R doesn’t understand that these are missing, and interprets the blank as a factor level with a weird name. But R understands NA, so I manually replaced the blanks with NA before using rfImpute() to, ehm, impute the missing data.
#Let R know that the blanks in CabinClass and Embarked are actually NAs
combined$Embarked[combined$Embarked == ""] <- NA
combined$CabinClass[combined$CabinClass == ""] <- NA
set.seed(99) #important for reproducing results later on
#Impute the missing values with a random forest; Pclass serves as the response here (rfImpute() needs a response with no NAs)
combined <- as_tibble(rfImpute(Pclass ~., ntree = 500, iter = 5, data = combined))
## ntree OOB 1 2 3
## 500: 2.83% 1.55% 6.86% 2.17%
## ntree OOB 1 2 3
## 500: 4.20% 1.97% 12.64% 1.86%
## ntree OOB 1 2 3
## 500: 4.13% 2.54% 10.83% 1.86%
## ntree OOB 1 2 3
## 500: 4.35% 2.12% 13.00% 1.86%
## ntree OOB 1 2 3
## 500: 4.58% 2.12% 14.08% 1.86%
Bin Age
values into AgeGroup
to minimize issues with outliers**
** Strangely enough, I tried this because many notebooks and blogs said that it helped with decision tree and random forest accuracy. In my case it made everything worse, at least when I replaced Age with AgeGroup. But when I kept both variables, overall performance improved for every single model. Honestly, I have no idea why.
summary(combined$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.17 21.00 26.00 29.33 37.00 80.00
combined <- combined %>%
  mutate(AgeGroup = cut(Age, breaks = c(0, 12, 20, 30, 40, 50, 60, 80),
                        right = T, labels = F)) #bin 1 = children (0-12), bin 7 = 60-80
combined$AgeGroup <- factor(combined$AgeGroup) #convert to factor since the bin numbers are categories, not quantities
table(combined$AgeGroup)
##
## 1 2 3 4 5 6 7
## 99 191 527 231 166 62 33
Remove redundant factor levels for Embarked and CabinClass
There was still the teeny issue of Embarked and CabinClass each having an extra, empty level, since the blank spaces were counted as a unique entry. I got rid of this with the droplevels() function in base R.
str(combined$Embarked) #Still has 4 instead of 3 levels thanks to the blanks
## Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
combined$Embarked = droplevels(combined$Embarked) #Drop blank level
str(combined$Embarked) #All good!
## Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
str(combined$CabinClass) #Still has 9 instead of 8 levels thanks to the blanks
## Factor w/ 9 levels "","A","B","C",..: 7 4 7 4 7 7 6 8 8 7 ...
combined$CabinClass = droplevels(combined$CabinClass) #Drop blank level
str(combined$CabinClass) #All good!
## Factor w/ 8 levels "A","B","C","D",..: 6 3 6 3 6 6 5 7 7 6 ...
Split combined
back into the test
and train
data, and create a validate
dataset
#Create a validation sample from train
train <- combined[1:891,]
fractionTraining <- 0.70
fractionValidation <- 0.30
sampleSizeTraining <- floor(fractionTraining * nrow(train))
sampleSizeValidation <- floor(fractionValidation * nrow(train))
set.seed(89008)
indicesTraining <- sort(sample(seq_len(nrow(train)), size = sampleSizeTraining))
indicesNotTraining <- setdiff(seq_len(nrow(train)), indicesTraining)
indicesValidation <- sort(sample(indicesNotTraining, size = sampleSizeValidation))
validate <- train[indicesValidation,] #subset validate before overwriting train
train <- train[indicesTraining,]
#Separate test data from combined data
test <- combined[892:1309,]
#Remove PassengerId from train and validate dataset
train <- train %>%
select(-PassengerId)
validate <- validate %>%
select(-PassengerId)
#Remove the imputed Survived values from the test dataset
test <- test %>%
select(-Survived)
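Just as a sanity check, the resulting sample sizes should look like this (the numbers in the comments follow from the fractions above):
#Quick check of the resulting sample sizes
nrow(train)    #623 rows = floor(0.7 * 891)
nrow(validate) #267 rows = floor(0.3 * 891)
nrow(test)     #418 rows, the passengers Kaggle wants predictions for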
Method 1: Classification and Regression Tree (CART) Models
After summarizing and visualizing the data, I could already intuitively guess that certain groups of people were more likely to survive than others, i.e. women or higher-paying passengers. But what wasn’t clear was if being a woman or being rich was more likely to get you off safely (Jack would argue that both help).
A decision or classification tree model works the same way, but in more science-y terms: it uses parameters or predictor variables to predict values for a response variable. Because the response variable in this case has only two outcomes, it is binary, and the method used is referred to as a classification or partition tree. The other option, for numeric or ratio data, is a regression tree. Together, they are referred to as CART.
How classification trees work
Out of the 12 predictor variables in the train dataset, the classification tree uses a Gini impurity measure or index to arrange and split the nodes, and reports variable importance alongside the tree. The Gini measure is the probability of incorrectly classifying an observation drawn from a node, so the lower the better, and variable importance is a measure of how much a variable contributes to the model’s accuracy, so the higher the better.
Using the Gini measure, the model selects a top-level variable, or root node, whose split reduces impurity the most. It then repeats the process within each branch, re-evaluating the probability of survival and picking another predictor variable to split on beneath it. What you end up with is a “tree” of nodes, with the probability of the response variable in the leaves at the bottom.
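To make the Gini part a bit more concrete, here’s a tiny illustration (my own toy calculation, not part of the modeling code): the impurity of a node is 1 minus the sum of its squared class proportions, and a good split produces child nodes that are “purer” than their parent.
#Gini impurity of a node, given the class proportions inside it
gini <- function(p) 1 - sum(p^2)

#Root node of the training data: roughly 37% survived, 63% died
gini(c(0.37, 0.63)) #~0.47, a fairly mixed (impure) node

#A branch where only 16% survived (like the RegGuys/SmartGuys split below)
gini(c(0.16, 0.84)) #~0.27, purer, so the split carries information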
Building a tree model using train
data
The tree starts by splitting the sample according to Title
, and above the split is the overall probability of survival for 100% of the sample, 0.37 or 37%. If the passenger had a title that fell under RegGuys or SmartGuys, they then had a 0.16 or 16% chance of survival. As you continue down the tree, you see that if they also had family members who didn’t survive, chances of survival went down to 0%. I could go through each and every branch, but you get the idea.
#Recode the Survived variable to make the tree plot easier to read
train_tree <- train %>%
mutate(Survived = recode(Survived, "0" = "Died", "1" = "Survived"))
# Create model
set.seed(564)
model1 <- rpart(Survived ~ .,
data = train_tree,
method = 'class')
rpart.plot(model1, extra = 106, box.palette="Grays")
Looking at variable importance
As mentioned earlier, the more important the variable, the higher up in the decision tree you’d think it would be. But the table of predictor variable importance below is only somewhat reflected in the tree plot above, since the Gini metric is the main player in deciding how the model is built. Interestingly, Sex
isn’t selected as a splitting variable despite ranking as the third most important variable. However, this could be because Title
already classifies passengers by gender to some extent, and ranks first in the list.
varimp_tree <- as.data.frame(model1$variable.importance)
varimp_tree$Variable <- rownames(varimp_tree)
row.names(varimp_tree) <- NULL
varimp_tree <- varimp_tree %>%
rename("Importance" = "model1$variable.importance") %>%
select(Variable, Importance)
varimp_tree
## Variable Importance
## 1 Title 89.192821
## 2 FamilySurvival 81.691307
## 3 Sex 77.892380
## 4 FamilySize 49.884852
## 5 Parch 31.710769
## 6 Age 29.317875
## 7 SibSp 21.489273
## 8 Fare 11.206918
## 9 CabinClass 2.490842
## 10 Embarked 2.465980
## 11 Pclass 2.163779
## 12 AgeGroup 1.902670
Determine model accuracy on unseen data - validate
: 86.77%
Time to test the model out on data it hasn’t seen before. An accuracy score of 86.77% seems really good, but it may not work as well with a different sample (i.e. test).
predict_train <- predict(model1, validate, type = "class")
table_predictions <- table(validate$Survived, predict_train)
accuracy_Test <- sum(diag(table_predictions)) / sum(table_predictions)
print(paste('Accuracy for validate', accuracy_Test))
## [1] "Accuracy for validate 0.867724867724868"
Tree model accuracy on unseen data - test
: 77.99%
After submitting the output below to Kaggle, the tree model was only moderately accurate. It was far from a great score but definitely higher than my very first attempt using a KNN model that, truthfully, I barely understood (it was only 65.68% accurate - I could have just tossed a coin).
test$Survived <- predict(model1, test, type = "class")
model_predictions_tree <- test %>%
select(PassengerId, Survived) %>%
mutate(Survived = recode(Survived, "Died" = "0", "Survived" = "1")) #Switch back to original outputs for Kaggle
write.csv(model_predictions_tree, "./model1.csv", row.names = FALSE)
Tweaks and parameter tuning for classification trees
Before moving on, there are a few ways to tweak tree models that can affect how accurate it’s predictions will be. These are set using rpart.control()
function in the rpart
package and added to the control
argument in rpart()
like this:
# Default controls
control <- rpart.control(minsplit = 20, #default
maxdepth = 30, #default
maxcompete = 4, #default
cp = 0.01 #default
)
- minsplit - the minimum number of observations needed before a split is attempted; the higher this is, the more constrained the model becomes, as it has to consider more samples at each node
- maxdepth - the maximum depth of the tree; higher values mean more splits and more information captured from the data
- maxcompete - the number of competitor splits retained in the output, i.e. which variable came in second, third, etc.
- cp - the complexity parameter, used to set the optimal size of the decision tree; any split that does not decrease the lack of fit by a factor of cp is not kept (the higher this number, the more “strict” the model is about adding splits)
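And here’s a quick sketch of how that control object would actually be passed to rpart() (with the defaults above, this should just reproduce model1):
#Same formula and data as before, but with the control settings passed explicitly
set.seed(564)
model1_ctrl <- rpart(Survived ~ .,
                     data = train_tree,
                     method = 'class',
                     control = control)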
I only went for cp tuning because it made the most sense to me - not very scientific, but I wanted to be able to understand the things I did. Using the following function, I got an overview of how the model performs at different cp values. The results show that xerror, a cross-validated measure of model fit, was lowest when cp = 0.01, which is already the default.
printcp(model1)
##
## Classification tree:
## rpart(formula = Survived ~ ., data = train_tree, method = "class")
##
## Variables actually used in tree construction:
## [1] Age FamilySurvival Fare Title
##
## Root node error: 233/623 = 0.374
##
## n= 623
##
## CP nsplit rel error xerror xstd
## 1 0.437768 0 1.00000 1.00000 0.051833
## 2 0.184549 1 0.56223 0.60086 0.044713
## 3 0.017167 2 0.37768 0.40343 0.038344
## 4 0.010000 4 0.34335 0.37339 0.037131
Limitations of decision tree models
When using a classification tree model this way, the entire train sample is used to find patterns. Because of this, patterns caused by variability or coincidence can be misinterpreted as applying to ALL passengers beyond this sample. In academic terms, this affects the generalizability of the model, which explains the drop in accuracy as soon as the model was applied to the test data. In the end, the validate data was split from the original train data, and I am not sure how Kaggle split the train and test samples in the first place. As with most of the things I write here, I am not 100% sure - this is just my semi-informed opinion.
Method 2: Random Forests (CART)
Random forests address the limitations of decision trees by repeatedly re-sampling the training data (bootstrapping) and growing a separate tree on each re-sample. Each of these trees predicts the class of the response variable (Survival in this case), and out of the many trees created, the forest keeps the prediction that the majority of trees agree on - a vote (#democracy).
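To make the idea less abstract, here’s a bare-bones sketch of bagging, the re-sample-and-vote part of a random forest (my own illustration using rpart; a real random forest additionally samples a random subset of variables at each split):
#Grow a handful of trees on bootstrap re-samples of train, then let them vote
set.seed(123)
n_trees <- 25
votes <- sapply(1:n_trees, function(i) {
  boot_rows <- sample(nrow(train), replace = TRUE) #bootstrap sample (with replacement)
  tree <- rpart(Survived ~ ., data = train[boot_rows, ], method = "class")
  as.character(predict(tree, validate, type = "class")) #each tree votes on validate
})
#Majority vote across the trees for each passenger
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(majority == validate$Survived) #accuracy of the bagged vote on validate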
Train the model
Unlike rpart(), the output of randomForest() isn’t a mega decision tree forest. Instead, you get some numbers and values, including one called OOB or out-of-bag error. The OOB error is the error rate of predictions made for each passenger using only the trees that did not see that passenger in their bootstrap sample, i.e. the difference between the predicted and actual Survived outcomes on data each tree was not trained on.
This is similar to what I calculated for the decision tree model, comparing the column of Survived values in the train data with the predicted values from the model.
set.seed(299)
model2 <- randomForest(Survived ~ .,
data = train,
importance = TRUE, #returns output of variables ranked by importance
mtry = 3, #default is the square root of the number of predictor variables, rounded down
ntree = 2000) #number of trees
model2
##
## Call:
## randomForest(formula = Survived ~ ., data = train, importance = TRUE, mtry = 3, ntree = 2000)
## Type of random forest: classification
## Number of trees: 2000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 14.13%
## Confusion matrix:
## 0 1 class.error
## 0 357 33 0.08461538
## 1 55 178 0.23605150
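One nice thing here is that the OOB estimate can be sanity-checked by hand, since the fitted object stores each passenger’s out-of-bag prediction (a quick sketch):
#model2$predicted holds each row's prediction from only the trees that did
#not see that row in their bootstrap sample (i.e. the out-of-bag prediction)
mean(model2$predicted != train$Survived) #should match the OOB estimate reported above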
Note: When taking a closer look at the OOB error, it looks like the model is much better at predicting death than survival. Part of the reason is probably that the training data contains more non-survivors than survivors, so the trees simply see more examples of death. Or it might be the same reason it’s easier to foresee a shitty day than a good one - but that might just be the Nordic winter blues getting to me.
oob.error.data <- data.frame(
Trees = rep(1:nrow(model2$err.rate), times = 3),
Type = rep(c("OOB", "0", "1"), each = nrow(model2$err.rate)),
Error = c(model2$err.rate[, "OOB"],
model2$err.rate[, "0"],
model2$err.rate[, "1"])) %>%
mutate(Type = recode(Type, "0" = "Died", "1" = "Survived"))
ggplot(data = oob.error.data, aes(x = Trees, y = Error)) +
geom_line(aes(color = Type)) +
ggtitle("Random forest using default parameters") +
scale_color_manual(values=c("#5c0099", "#fdc500", "#DDDDDD")) +
theme_minimal()
Looking at variable importance by random forest tree “votes”
Right off the bat, it’s clear that instead of Title
, the trees in the random forests voted FamilySurvival
as the root node. This means that if I tried a few hundred (or in this case 2,000) rounds of decision trees I would probably get more with FamilySurvival
at the top than any other variable. Also, SibSp
and Parch
went from the middle to the bottom of the list. Whether or not this makes the model more accurate when applied to unseen data was still a big question mark.
varImpPlot(model2)
Random forest model accuracy on unseen data - validate
: 95.23%
predict_train <- predict(model2, validate)
table_predictions <- table(validate$Survived, predict_train)
accuracy_Test <- sum(diag(table_predictions)) / sum(table_predictions)
print(paste('Accuracy for validate', accuracy_Test))
## [1] "Accuracy for validate 0.952380952380952"
Random forest model accuracy on unseen data - test
: 76.55%
Compared to the first decision tree model, this more complex model was actually less accurate (76.55% vs 77.99%). Yikes, not what I had hoped for, especially given its performance during validation.
test$Survived <- predict(model2, test)
model_predictions_rf <- test %>%
select(PassengerId, Survived)
write.csv(model_predictions_rf, "./model2.csv", row.names = FALSE)
In an attempt to Google why, I came across the following in this StackExchange thread:
“A tree within the random forest has the same bias as a random forest, where the single tree is restricted by bootstrap and no# of regressors randomly selected at each split (m). A fully grown, unpruned tree outside the random forest on the other hand (not bootstrapped and restricted by m) has lower bias. Hence random forests / bagging improve through variance reduction only, not bias reduction.”
Basically, by re-sampling we introduce more bias, since each tree in the forest is built on less data, and this can counteract the benefit of weeding out spurious relationships through re-sampling. So, what if I used fewer trees (a.k.a. a smaller ntree)?
Smaller random forest model accuracy on unseen data - test
: 76.94%
Ok, so barely a difference. Time to move on.
set.seed(2991)
model2 <- randomForest(Survived ~ .,
data = train,
importance = TRUE, #returns output of variables ranked by importance
mtry = 3, #default is the sqrt(#predictorvars) rounded down
ntree = 500) #number of trees
test$Survived <- predict(model2, test)
model_predictions_rf <- test %>%
select(PassengerId, Survived)
write.csv(model_predictions_rf, "./model2.csv", row.names = FALSE)
Method 3: Conditional Inference Tree (CIT) Models
A different, but not necessarily “better”, method for building decision trees is the conditional inference tree or CIT. The main (if not only) difference between CIT and CART models is that CIT uses statistical tests with p-values (something I found a lot easier to understand) instead of a Gini measure to split nodes. Naturally, my first thought after reading that was: wtf does this mean?
As I understand it, rather than selecting splits according to how much they improve the overall fit, CIT models test whether there is a difference in the response variable between two or more groups (determined by the predictor variables). The relationship with the lowest p-value (the probability of seeing a relationship at least this strong if none actually exists) is selected for the split.
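To get a feel for what “statistical tests with p-values” means here, a rough stand-in (my own illustration; ctree actually uses permutation-based conditional inference tests rather than plain chi-squared tests) is to test each predictor’s association with the outcome and let the smallest p-value win:
#The flavour of question a CIT asks at each node: is Survived associated with Sex?
chisq.test(table(train$Sex, train$Survived))
#...and is the association with Embarked stronger or weaker?
chisq.test(table(train$Embarked, train$Survived))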
In all honesty, I am more of a “try it first, understand later” person, so I created the following model before attempting to understand what happens behind the scenes. The result is a hideous and confusing decision tree. A great resource and critique of this method is this article, which was also comforting, as the authors did note how confusing the documentation for cforest() was. And I can confirm that I was, at many points, beyond confused as to why this method was touted as so much better by the library authors.
set.seed(444)
model3 <- ctree(Survived ~ .,
data = train_tree,
controls = ctree_control(mtry = 0)) #Default for this function, I'll keep it
plot(model3, type = "simple",
inner_panel = node_inner(model3,
abbreviate = FALSE,
pval = FALSE,
id = FALSE),
terminal_panel = node_terminal(model3,
abbreviate = TRUE,
digits = 1,
id = FALSE))
Tree model (CIT) accuracy on unseen data - validate
: 88.88%
predict_train <- predict(model3, validate)
table_predictions <- table(validate$Survived, predict_train)
accuracy_Test <- sum(diag(table_predictions)) / sum(table_predictions)
print(paste('Accuracy for validate', accuracy_Test))
## [1] "Accuracy for validate 0.888888888888889"
Tree model (CIT) accuracy on unseen data - test
: 77.03%
I am not too sure why this did worse than the first decision tree built with rpart(). I do know that p-values are sensitive to sample size, and the larger the sample, the more likely it is that you will find a statistically significant result. Considering this, I went ahead with a conditional inference random forest model, which re-samples the data into smaller groups. I also did this because that’s what I did for my CART random forest and it improved accuracy once before, so why not try it again.
test$Survived <- predict(model3, newdata = test)
model_predictions_rfinf <- test %>%
select(PassengerId, Survived) %>%
mutate(Survived = recode(Survived, "Died" = "0", "Survived" = "1")) #Switch back to original outputs for Kaggle
write.csv(model_predictions_rfinf, "./model3.csv", row.names = FALSE)
Method 4: Random Forests (CIT)
This was ultimately the best method, but not at first. Without too much foreshadowing, here’s what I did and what the results were.
Train the model
set.seed(2222)
model4 <- cforest(Survived ~ .,
data = train,
controls = cforest_unbiased(ntree = 2000,
mtry = 5)) #the default for this model
model4
##
## Random Forest using Conditional Inference Trees
##
## Number of trees: 2000
##
## Response: Survived
## Inputs: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, FamilySize, Title, FamilySurvival, CabinClass, AgeGroup
## Number of observations: 623
I used the train()
function from caret
to get the accuracy score for this model, since output from cforest()
is nothing beyond the argument values and sample size.
accModel4 <- train(Survived ~ .,
data = train,
method = "cforest",
tuneGrid = data.frame(.mtry = 5),
trControl = trainControl(method = "oob"))
accModel4$results$Accuracy
## [1] 0.8539326
Random forest (CIT) model accuracy on unseen data - validate
: 89.42%
predict_train <- predict(model4, newdata = validate, OOB = TRUE, type = "response")
table_predictions <- table(validate$Survived, predict_train)
accuracy_Test <- sum(diag(table_predictions)) / sum(table_predictions)
print(paste('Accuracy for validate', accuracy_Test))
## [1] "Accuracy for validate 0.894179894179894"
Random forest (CIT) model accuracy on unseen data - test
: 76.55%
So, not great again. So far the simplest model, the first classification tree, had still performed the best. So I went ahead and tried some parameter tuning.
test$Survived <- predict(model4, newdata = test, OOB = TRUE, type = "response")
model_predictions_rfinf <- test %>%
select(PassengerId, Survived)
write.csv(model_predictions_rfinf, "./model4.csv", row.names = FALSE)
Tweaks and parameter tuning for CIT random forests
The following script loops over a range of values for mtry and prints the value with the maximum accuracy score on the validate data. In this case there are 12 predictor variables, so when mtry = 12 no random sampling of variables occurs; any smaller value is the number of randomly sampled candidate variables considered at each node.
#Create empty vector
accuracy_mtry <- vector(length = 12)
#For loop for different mtry values, store results in empty vector
for (i in 1:12) {
set.seed(666)
temp.model <- cforest(Survived ~ .,
data = train,
controls = cforest_unbiased(ntree = 2000,
mtry = i))
predict.temp <- predict(temp.model,
newdata = validate,
OOB = TRUE, #IMPORTANT! Need to use validate dataset
type = "response")
predict.table <- table(validate$Survived, predict.temp)
accuracy_mtry[i] <- sum(diag(predict.table)) / sum(predict.table)
}
#Create data frame using vector data
rfinf_mtry <- data.frame(matrix(ncol = 2, nrow = 12))
x <- c("Accuracy", "mtry")
colnames(rfinf_mtry) <- x
rfinf_mtry$Accuracy <- accuracy_mtry
rfinf_mtry$mtry <- 1:12
#Find optimal value for mtry
paste("Optimal value for mtry =", which.max(rfinf_mtry$Accuracy))
## [1] "Optimal value for mtry = 3"
Validate one more time - 88.89%
model4 <- cforest(Survived ~ .,
data = train,
controls = cforest_unbiased(ntree = 2000,
mtry = 2)) #This is what changed
predict_train <- predict(model4, newdata = validate, OOB = TRUE, type = "response")
table_predictions <- table(validate$Survived, predict_train)
accuracy_Test <- sum(diag(table_predictions)) / sum(table_predictions)
print(paste('Accuracy for validate', accuracy_Test))
## [1] "Accuracy for validate 0.888888888888889"
Final Results: 80.38% - A Top 4% Score 🎉
The last thing I expected when starting this challenge was to score well. But considering how many variations of trees and models I tried, it was inevitable. It is very likely that this score is a result of overfitting the test data, since I continuously used my Kaggle score to make iterations. If I hadn’t had that information available, I would never have been able to score so high. And I doubt I would again if I applied my model to a different test sample.
Mistakes, learning outcomes, and future challenges
- Cheating by spamming Kaggle with submissions: I think I’ve made this point clear. In future, I will limit my submissions to a certain amount for each model.
- Not using set.seed() for sampling: One of the most annoying mistakes I made was forgetting to set the seed for the random number generator that samples and models the data. This made many of my results inconsistent and non-reproducible, and is why some of the outputs here do not match the code.
- Not using Github to track file changes and versions: I ended up wasting a lot of time trying to remember what I did, and losing information when R crashed after I accidentally (on several occasions) added an extra digit to ntree and sent my laptop into a frenzy.
- Diving in without really understanding the data: Many early versions didn’t reclassify variables as factors, remove spaces, or deal with missing values appropriately (or at all). The majority of the work outlined here was understanding the variables I was looking at, and making sure that they were correctly understood and interpreted by the models.
- Not creating a data set for validation: Another side-effect of diving in too quickly and not thinking through my method. Initially, I was checking model performance on the same data used to build it, which resulted in most parameter tuning reducing performance. But now I know better.
- Feeling discouraged and demotivated when things didn’t work: Ah, toxic perfectionism and millennial impatience - expecting to understand something inherently complex after watching a couple of YouTube videos. If anything at all, I learned just how complex ML is, and how normal it is to be confused about it as someone with absolutely no background or experience in messing around with it. I can’t even recall how many threads, papers, videos, and notebooks I studied to figure things out. And even now, I realize that I barely scratched the surface.
- What’s next: As cool as this was, I’d like to try working with text data since much of my work is related to words rather than numbers. The next beginner Kaggle challenge I picked out involves using natural language processing (NLP) models to predict whether Tweets are about real disasters or not.