Last year, I attempted to predict the winner of the 2018 World Series (WS18). That model was built on only a few features: the statistically significant ones that distinguish World Series winners from all other teams. These were ERA, Runs, and WP (win percentage). However, that model was a touch simplistic and not terribly accurate. The goal this year is to build a more robust model without narrowing the dataset.
Before diving into the data, however, there is a bit of controversy that must be addressed. This concerns the definition of the "modern era" of baseball. There are two years that could be posited as the start of the "modern era": 1905 and 1969.
1905.
While the available data reaches back to 1876, it wasn't until 1905 that the World Series became an annual event. The game's rules also changed significantly around that period: foul balls counted as strikes, RBIs and ERAs were tracked, the mound was moved further back (allowing for breaking balls), and spitballs were made illegal.
1969.
The case for 1969 rests on rule changes that bring the game closer to current baseball rules. For example, 1969 brought the first year of division play and the expanded postseason, a smaller strike zone, and a pitcher's mound lowered by five inches. All of these changes result in very different metrics posted by each team.
Data.
Two datasets will be obtained. The first will include regular season data from 1905-2019; the second will include data from 1969-2019. All statistics from hitting, fielding, and pitching will be included in the dataset, with the exception of calculated features.
Models.
Each dataset will be run through several readily available models, and each model will be fine-tuned. The resulting models will be validated against data from 2016-2018 to check how accurate their predictions are.
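As a rough illustration of that validation plan, here is a minimal sketch of holding out whole seasons with pandas. The column names (`Year`, `ws_winner`) and the toy rows are assumptions made up for the example, not the actual schema of the scraped data.

```python
import pandas as pd

def split_by_season(df, holdout_years=(2016, 2017, 2018), year_col="Year"):
    """Hold out whole seasons for validation; train on everything else."""
    is_holdout = df[year_col].isin(holdout_years)
    return df[~is_holdout].copy(), df[is_holdout].copy()

# Tiny invented frame, only to show the shape of the split.
toy = pd.DataFrame({
    "Year": [2014, 2015, 2016, 2017, 2018],
    "Team": ["A", "B", "C", "D", "E"],
    "ERA": [3.9, 4.1, 3.5, 3.7, 4.0],
    "ws_winner": [0, 1, 0, 0, 1],  # hypothetical 0/1 label column
})

train, holdout = split_by_season(toy)
print(len(train), "training rows,", len(holdout), "validation rows")
```

Splitting by season rather than at random keeps every team from a validation year out of training. The in-progress 2019 season has no winner yet, so it would presumably be set aside in the same way before fitting.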
In the earlier rudimentary model, I used data from Kaggle. However, that dataset is outdated. To get the most recent data, I decided to scrape MLB and ESPN. All of the data are for teams, not individual players.
The notebooks for scraping and cleaning data can be found in my GitHub.
Each dataset has its own set of problems. I generated some visualizations to get a quick look at the differences between the datasets.
In the 1905-2019 dataset, 4.59% of the population are winners. In the 1969-2019 dataset, 3.54% of the population are winners. Clearly, the classes are heavily imbalanced.
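A percentage like this can be computed directly from a 0/1 label column. The sketch below assumes a hypothetical column named `ws_winner` and uses invented rows. It also shows why imbalance matters: a model that always answers "not a winner" already scores very high on plain accuracy.

```python
import pandas as pd

def winner_share(df, label_col="ws_winner"):
    """Percentage of rows labelled as World Series winners."""
    return 100.0 * df[label_col].mean()

# Invented example: 1 winner among 20 rows.
toy = pd.DataFrame({"ws_winner": [1] + [0] * 19})

share = winner_share(toy)
# Accuracy of a useless model that always predicts "not a winner".
always_no_accuracy = 100.0 - share

print(f"{share:.2f}% winners, {always_no_accuracy:.2f}% accuracy for always-no")
```

With winners this rare, accuracy alone says little, so it is worth also checking how many actual winners a model recovers.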
With so many features, it was important to see how the winners and the not-winners differ in the two datasets. There are more positively correlated statistics in the 1969-2019 dataset than in the 1905-2019 dataset. One of the highest correlations is CG (Complete Games). This statistic is important because it reflects the health of the pitchers: the higher the CG number, the healthier the pitchers on the team. Another high correlation is SHO (Shutouts). The higher the number, the tougher it is to score runs against that team's pitchers. In other words, the pitchers throw "dirty" stuff.
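One way to produce such a ranking is to correlate every numeric statistic with a 0/1 winner label (Pearson correlation against a binary column). This is a sketch under assumptions: the label name `ws_winner` is hypothetical, the numbers are invented, and the notebook may compute its correlations differently.

```python
import pandas as pd

def correlation_with_winner(df, label_col="ws_winner"):
    """Rank numeric features by their correlation with the winner label."""
    numeric = df.select_dtypes("number")
    corr = numeric.corr()[label_col].drop(label_col)
    return corr.sort_values(ascending=False)

# Invented rows: winners get more CG and SHO and a lower ERA.
toy = pd.DataFrame({
    "ws_winner": [0, 0, 0, 1, 0, 1],
    "CG": [2, 3, 1, 9, 4, 8],
    "SHO": [5, 6, 4, 12, 7, 14],
    "ERA": [4.5, 4.2, 4.8, 3.1, 4.0, 3.3],
})

ranked = correlation_with_winner(toy)
print(ranked)
```

In this toy frame CG and SHO come out positive and ERA negative, which is the pattern described above. Real data will be much noisier.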
The notebook for data visualization can be found in my GitHub.
A number of different models were tested against the 1905-2019 dataset and the 1969-2019 dataset. As I tested these models, I ran into the problem of imbalanced data: there were far more not-winners of the World Series than actual winners. As a result, the models leaned heavily toward predicting not-winners and produced complete junk for predictions on the validation set.
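One common remedy for this kind of imbalance is random oversampling: winner rows in the training set are duplicated until both classes are the same size. The sketch below uses only pandas. The column name `ws_winner` and the toy data are assumptions, and it is not necessarily the exact procedure used in the notebooks.

```python
import pandas as pd

def upsample_winners(train, label_col="ws_winner", random_state=0):
    """Duplicate winner rows (sampling with replacement) to match the majority class."""
    winners = train[train[label_col] == 1]
    others = train[train[label_col] == 0]
    winners_up = winners.sample(n=len(others), replace=True, random_state=random_state)
    combined = pd.concat([others, winners_up])
    # Shuffle so the duplicated winners are not grouped at the end.
    return combined.sample(frac=1, random_state=random_state).reset_index(drop=True)

# Invented training frame: 2 winners, 18 not-winners.
toy_train = pd.DataFrame({
    "ERA": [3.0 + 0.1 * i for i in range(20)],
    "ws_winner": [1, 1] + [0] * 18,
})

balanced = upsample_winners(toy_train)
counts = balanced["ws_winner"].value_counts()
print(counts)
```

Oversampling should be applied to the training rows only, after the validation seasons have been split off. Otherwise copies of the same winner can appear on both sides of the split and inflate the validation score.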
The notebooks for upsampling, scaling, and splitting the data, for model building and fine-tuning, and for validation can be found in my GitHub.
Given the current standings in the 2019 regular season, the model predicts the Red Sox to win the World Series.