TLDR; I wanted to apply my beginner R skills to a complex problem, and chose machine learning because why not. I managed to do well, but my lack of knowledge blinded me to the fact that I was kinda cheating, as I relied heavily on trial and error to rise through the ranks. Despite this, I learned a lot about classification models and random forests, wrote more complex R code than the beginner stuff I had done so far, and enjoyed messing around and reflecting on the experience. All in all, it’s still a win for me.
I think everyone can agree that 2020 was full of surprises, most of them unpleasant. But one of the better ones was being introduced to R as part of my psychology master’s. Because psychology faces a serious replication crisis, with researchers able to reproduce only 36% of studies with similar results, providing open-source code in journal articles is becoming the new standard.
Rather than gate-keeping data and using archaic tools like SPSS, the WordPress of data analysis tools, many psychology researchers now publish their analysis code and raw data with their articles. By increasing transparency, researchers hope to improve the validity of a field that is often plagued by pseudoscience-y rhetoric and misleading headlines. I mean, articles that lead with “scientists say smart people eat more brie” based on a paper with a three-person sample will probably never disappear, but this is hopefully a step in the right direction.
Fast forward half a year: I got over the initial excitement of learning something new, endured a lot of frustration and feelings of defeat triggered by incomprehensible errors, and eventually became somewhat comfortable with R. I even started to like it. There’s something I always find so satisfying about building things from scratch, whether it’s an article or a function that loops through columns without needing a point-and-click interface. But at some point I got bored of running t-tests and power analyses on clean data prepped by my lecturers, so I turned to Google (mostly Reddit) to answer the question: what else can I do with this? And the answer was Kaggle.
Kaggle and the “Titanic - Machine Learning from Disaster” competition
Kaggle is a community that hosts data science and machine learning competitions. Before starting, I fell into the majority group of people who think they understand what an algorithm is, but wouldn’t be able to clearly explain what it is (which is the metric I use to gauge comprehension). The word “magic” gets thrown around a lot in copywriting, and it’s a word often attached to data science and machine learning too. So naturally, I’ve always been convinced that there is absolutely nothing magical about it. And for the first time ever, I felt equipped with at least some of the basic skills I needed to venture into this mystical realm of predictions.
The very first competition recommended to Kaggle beginners involves predicting whether or not people on the Titanic survived the disaster. What you get is a file with info like gender, age, how much they paid for their tickets, and a bunch of other information you need to make “magic” with. I’ll explain to the best of my abilities how I did that, and how I managed to get to the top 4% of the competition leaderboard. But first, I need to explain why my rank means absolutely nothing, and how I (unknowingly) cheated and guessed my way through this process.
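To give a feel for what that “magic” looks like in practice, here is a minimal sketch in R. The column names (Survived, Pclass, Sex, Age, Fare) are real columns from Kaggle’s Titanic training file, but the handful of passenger rows below are made up so the example runs on its own; I also use base R’s logistic regression (`glm`) as a simpler stand-in for the random forests mentioned in the TLDR.

```r
# Tiny made-up stand-in for Kaggle's train.csv, using the real column names.
# In the actual competition you'd do: train <- read.csv("train.csv")
train <- data.frame(
  Survived = c(0, 1, 1, 0, 1, 0, 0, 1),
  Pclass   = factor(c(3, 1, 2, 3, 1, 3, 2, 1)),
  Sex      = factor(c("male", "female", "female", "male",
                      "male", "female", "male", "female")),
  Age      = c(22, 38, 26, 35, 28, 40, 19, 31),
  Fare     = c(7.25, 71.3, 13.0, 8.05, 52.0, 7.9, 10.5, 80.0)
)

# Fit a logistic regression: "find relationships in the data and store
# them as a model."
fit <- glm(Survived ~ Pclass + Sex + Age + Fare,
           data = train, family = binomial)

# "Apply the model to data" to get predicted survival probabilities.
probs <- predict(fit, newdata = train, type = "response")
```

For a real submission you would predict on the separate test file and round the probabilities to 0/1, but the shape of the workflow, fit on one dataset and predict on another, is the same.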
Why my high leaderboard rank is incredibly misleading
Training a model is all about finding relationships in data, storing them (as a model), and applying them to new data to answer a question. How well that question is answered is determined in testing. Say you’re given a multiple choice quiz in school to test your knowledge and understanding. If you can make unlimited attempts, resubmitting and revising your answers based on the results, you will eventually guess most of the right answers - even if you know absolutely nothing.
It was only after I finished the competition that I realized this was basically what I did. I burned through submissions to test my models, made plenty of random and uninformed guesses, and kept only the parameter tweaks and methods that boosted my score. There is absolutely nothing scientific about this. That said, I still learned a lot and expanded my understanding of what R can do. But in terms of applying even very simple ML to a real-life problem, I would still be very lost.
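The quiz analogy is easy to demonstrate with a quick simulation (mine, not from the competition). Below, 418 is the size of the Titanic test set; the “hidden” labels are random stand-ins. A single know-nothing submission scores about 50%, but if you submit many times and keep only your best score, the best of the bunch looks noticeably better than chance, despite every submission being pure guessing.

```r
set.seed(42)
n <- 418  # number of passengers in Kaggle's Titanic test set
truth <- sample(c(0, 1), n, replace = TRUE)  # pretend hidden labels

# One genuinely random submission: accuracy hovers around 0.5.
one_guess <- sample(c(0, 1), n, replace = TRUE)
single_score <- mean(one_guess == truth)

# Now make 200 random submissions and keep only the best score --
# the "unlimited quiz attempts" strategy.
best_score <- max(replicate(
  200,
  mean(sample(c(0, 1), n, replace = TRUE) == truth)
))
```

The best score is inflated purely by selection, which is exactly why repeatedly tuning against the leaderboard isn’t a measure of how good the model really is.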
![](https://cdn.prod.website-files.com/677075b735bf051110ca893f/6772d3d464ad1e5feb508554_6772d3d1e14c20d7a1c798e4_Line%2520break.png)
Edit: December 30th, 2024
Since I published this article, I migrated and rebuilt this website on Webflow. Previously, I was using an absurdly complex software package called blogdown that ran on R. Now that my obsession with learning this particular data science language has passed — I blame COVID — I made the sacrifice of no longer being able to display the graphs and scripts that were originally included in this article.
That said, if you would still like to embark on the full 39-minute read, you can view the original article here. It might not be as pretty, but hopefully it's still enjoyable.