Background
As a refresher, League of Legends is an online multiplayer game where teams try to destroy each others' base. The aim of this project is to predict the eventual winner of a game at early timepoints, using game conditions; for example, I would like to be able to say the Blue Team with a 10,000 gold lead has an X% chance of winning, and determine X by machine learning. I gathered data from Riot Games's API, and created pandas dataframes containing information from 30,000 games at 5 minute intervals. For examples, you can see previous posts. In this post, as mentioned above, I will explore how I optimized my random forest model.Forest size
The basic task my random forest is trying to do is predict the winner of a game of League of Legends, a classification task. Random forests work by generating a large number of decision trees, and averaging the results across trees. Thus one of the most obvious random forest parameters to optimize is the number of trees in the forest. To look at this, I ran the prediction algorithm many times, with different number of trees for each run.
Node size
The software package I am using for building random forests, scikit-learn, by default creates decision trees where each leaf is pure (that is, each group contains only wins or losses, and no mixture of wins and losses). This can lead to overfitting of individual trees, and by extension the random forest. However, you can set a parameter in the random forest to increase the minimum leaf size, or the minimum size of nodes for splitting. To see how this influenced prediction accuracy, I ran the prediction algorithm over a range of minimum node sizes.
The model accuracy goes up as the minimum sample size increases, plateauing around a 200 sample minimum. The model then maintains its accuracy for larger minimums, before decreasing once the minimum size is 5,000, or approximately 15% of the data. Once the minimum is that large, the trees are restricted in their depth, and cannot make enough splits to separate the data.
Rather than making graphs like this for every parameter, I ran a grid search over tree depth, leaf size, and node size. The best parameters for my dataset with 30,000 games, and a dozen features was a maximum depth of 10 levels, minimum leaf size of 10, and minimum split size of 1000. Optimizing these parameters yielded a 2-5% increase in accuracy over the default parameters (over a range of timepoints).
Rather than making graphs like this for every parameter, I ran a grid search over tree depth, leaf size, and node size. The best parameters for my dataset with 30,000 games, and a dozen features was a maximum depth of 10 levels, minimum leaf size of 10, and minimum split size of 1000. Optimizing these parameters yielded a 2-5% increase in accuracy over the default parameters (over a range of timepoints).
Dimensionality reduction
There is a lot of collinearity in my dataset, which can be a problem for certain machine learning algorithms; theoretically, random forests are resilient to collinearity. To see this for myself, I performed dimensionality reduction, specifically PCA, on my features, and then ran the random forest algorithm again. Here I am presenting the data slightly differently, in terms of prediction accuracy at specific times within the game.
The PCA model is actually a little worse! In the past I have used PCA on data before performing regression, and which improved regression performance by 5%. It's good to know that random forests work as advertised, and don't require dimentionality reduction beforehand.
Naïve Bayes
Part of the reason I did this project was to gain experience with random forests, and compare their performance with other algorithms. Naïve Bayes is a decent, fast default algorithm which is often used for benchmarking. I hadn't used it previously, as there is a little bit of manipulation necessary to combine categorical and continuous features in Naïve Bayes (like 5 lines of code!). As a final experiment, I wanted to see how Naïve Bayes compared to Random Forests.
Naïve Bayes performs almost as well! And runs a whole lot faster! What is probably happening here is that the dataset is just not deep enough for the strength of random forests to shine.