World Series: Prediction Model

Last year, I attempted to predict the winner of the 2018 World Series (WS18). That model was built on only a few features, chosen because they showed a statistically significant difference between World Series winners and all other teams: ERA, Runs, and WP (win percentage). However, that model was a touch simplistic and not terribly accurate. The goal this year is to build a more robust model without narrowing the dataset to a handful of features.

Controversy.

Before diving into the data, however, there is a bit of controversy that must be addressed. This concerns the definition of the "modern era" of baseball. There are two years that could be posited as the start of the "modern era": 1905 and 1969.

1905.

While the available data reaches back to 1876, it wasn't until 1905 that the World Series became an annual event. In 1905, there were also significant rule changes: foul balls were counted as strikes, RBIs and ERAs were tracked, the mound was moved further back (allowing for breaking balls), and spitballs were made illegal.

1969.

The case for a 1969 start date rests on significant rule changes that brought the game closer to current baseball rules. For example, 1969 marked the first year of division play and the expanded postseason, a smaller strike zone, and a pitcher's mound lowered by five inches. All of these changes result in very different metrics posted by each team.

Data.
Two datasets will be obtained. The first will include regular season data from 1905-2019; the second will include data from 1969-2019. All statistics from hitting, fielding, and pitching will be included in the dataset, with the exception of calculated features.

Models.
Each dataset will be put through different readily available models. These models will be fine-tuned. The generated models will be validated against data from 2016-2018 to ensure that the predictions are accurate.

Who is the next World Series winner?

In the earlier rudimentary model, I used data from Kaggle. However, that dataset is outdated. In an effort to get the most recent data, I decided to scrape MLB and ESPN. All of the data were for teams, not individual players.

There were four scrapes for the following data: hitting data, pitching data, fielding data, and World Series winners. Each dataset was saved as a csv.
The World Series winners were obtained from ESPN.

Hitting, pitching, and fielding data were obtained from MLB. This involved a scrape using BeautifulSoup. Each row of data sat inside a tr tag.
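As a rough illustration of pulling rows out of tr tags with BeautifulSoup, here is a minimal sketch. The HTML fragment, the column names, and the parse_rows helper are all assumptions made for the example; the real MLB pages will be laid out differently and need to be fetched first.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a scraped team-stats page.
SAMPLE_HTML = """
<table>
  <tr><th>Team</th><th>Year</th><th>R</th></tr>
  <tr><td>Boston Red Sox</td><td>2018</td><td>876</td></tr>
  <tr><td>Houston Astros</td><td>2018</td><td>797</td></tr>
</table>
"""


def parse_rows(html):
    """Return one list of cell strings per <tr> element."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        # Header cells (<th>) and data cells (<td>) are both collected.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:  # skip rows with no cells
            rows.append(cells)
    return rows


rows = parse_rows(SAMPLE_HTML)
header, records = rows[0], rows[1:]
```

From here, header and records can be written out with the csv module or handed to pandas.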

Data before 1905 was removed from all datasets. Additionally, data from 1994 was removed: a players' strike that year cut the regular season short and cancelled the World Series.
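The year filter can be sketched in pandas as follows. The column name "year", the toy rows, and the drop_unusable_seasons helper are assumptions for illustration, not the names used in the notebooks.

```python
import pandas as pd

# Illustrative rows only; the column name "year" is an assumption.
stats = pd.DataFrame({
    "team": ["A", "B", "C", "D"],
    "year": [1899, 1905, 1994, 2018],
})


def drop_unusable_seasons(df, first_year=1905, strike_year=1994):
    """Keep seasons from first_year onward, excluding the strike season."""
    keep = (df["year"] >= first_year) & (df["year"] != strike_year)
    return df.loc[keep].reset_index(drop=True)


cleaned = drop_unusable_seasons(stats)
```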

Merge the hitting, pitching, and fielding data on team and year.

Add a winner column: 1 indicates a World Series winner; 0 indicates a not-winner.
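One way the merge and the winner flag might look in pandas is sketched below. The frame names, column names, and numbers are placeholders, and the left join with an indicator column is just one possible way to build the flag.

```python
import pandas as pd

# Placeholder frames; real column names and values will differ.
hitting = pd.DataFrame({"team": ["A", "B"], "year": [2018, 2018], "R": [876, 797]})
pitching = pd.DataFrame({"team": ["A", "B"], "year": [2018, 2018], "ERA": [3.75, 3.11]})
fielding = pd.DataFrame({"team": ["A", "B"], "year": [2018, 2018], "E": [77, 63]})
champions = pd.DataFrame({"team": ["A"], "year": [2018]})

keys = ["team", "year"]

# Inner-join the three stat tables on the shared keys.
merged = hitting.merge(pitching, on=keys).merge(fielding, on=keys)

# Left-join the champions list; the indicator column marks rows with a match.
flagged = merged.merge(champions, on=keys, how="left", indicator=True)
flagged["winner"] = (flagged["_merge"] == "both").astype(int)
flagged = flagged.drop(columns="_merge")
```

If hitting and pitching tables share stat names (both typically report hits, walks, and strikeouts), the suffixes argument of merge keeps the two versions apart.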

Create two datasets that contain team statistics for 1905-2019 and 1969-2019.
Remove columns that have too many missing values. In these data, missing values were scraped as "-".
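A sketch of the era split and the missing-value cleanup is below. The post does not say how many missing values count as "too many", so the 50% threshold is an assumption, as are the toy columns and the drop_sparse_columns helper.

```python
import numpy as np
import pandas as pd

# Toy data: "SV" is scraped as "-" for the early seasons.
raw = pd.DataFrame({
    "year": [1905, 1930, 1950, 1969, 1990, 2019],
    "R": ["612", "700", "650", "640", "710", "800"],
    "SV": ["-", "-", "-", "-", "30", "45"],
})


def drop_sparse_columns(df, max_missing=0.5):
    """Turn "-" into NaN and drop columns whose missing share exceeds max_missing."""
    out = df.replace("-", np.nan)
    missing_share = out.isna().mean()
    return out.loc[:, missing_share <= max_missing]


full_era = drop_sparse_columns(raw[raw["year"].between(1905, 2019)])
modern_era = drop_sparse_columns(raw[raw["year"].between(1969, 2019)])
```

In this toy example the sparse column is dropped from the 1905-2019 frame but survives in the 1969-2019 frame, so the two datasets can end up with different feature sets.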

The notebooks for scraping and cleaning data can be found in my GitHub.

Each dataset has its own set of problems. I generated some visualizations to get a quick look at the differences between the datasets.

Of the 1905-2019 dataset, 4.59% of the population are winners. With the 1969-2019 dataset, 3.54% of the population are winners. Clearly, the data are not balanced.
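The percentages above are simply the mean of the winner column. A small sketch, assuming the column is called "winner" and using a toy frame rather than the real data:

```python
import pandas as pd


def winner_share(df, label="winner"):
    """Percentage of rows whose label equals 1."""
    return 100 * df[label].mean()


# Toy frame: 1 winner among 20 team-seasons.
toy = pd.DataFrame({"winner": [1] + [0] * 19})
share = winner_share(toy)
```

With shares this small, a model that never predicts a winner would still be right roughly 95 to 96% of the time, which is why plain accuracy says little here.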

With this many features, it was important to see how the winners differ from the not-winners in each dataset. More statistics correlate positively with winning in the 1969-2019 dataset than in the 1905-2019 dataset. One of the highest correlations is with CG (complete games). This statistic is important because it reflects the health of the pitching staff: the higher the CG number, the healthier the pitchers on the team. Another high correlation is with SHO (shutouts). The higher the number, the tougher it is to score runs against that team's pitchers. In other words, the pitchers throw "dirty" stuff.
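A comparison like this can be approximated by correlating every numeric feature with the winner column. The sketch below uses made-up numbers and assumes Pearson correlation; the post does not say which correlation measure the visualizations used.

```python
import pandas as pd

# Made-up values for three stats and the winner flag.
toy = pd.DataFrame({
    "CG": [12, 3, 5, 10, 2, 4],
    "SHO": [15, 6, 8, 14, 5, 7],
    "E": [80, 95, 90, 85, 100, 92],
    "winner": [1, 0, 0, 1, 0, 0],
})


def correlation_with_label(df, label="winner"):
    """Pearson correlation of each numeric feature with the label, highest first."""
    corr = df.select_dtypes("number").corr()[label].drop(label)
    return corr.sort_values(ascending=False)


ranked = correlation_with_label(toy)
```

Running the same function on both era datasets gives two ranked lists that can be compared side by side.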

The notebook for data visualization can be found in my GitHub.

A number of different models were tested against the 1905-2019 dataset and the 1969-2019 dataset. As I tested these models, I ran into the problem of unbalanced data: there were far more not-winners of the World Series than actual winners. As a result, the models overfit to the not-winners and produced complete junk when predicting on the validation set.

Upsample the dataset, scale the features, and split into train and test sets. With upsampling, a number of ratios of not-winners to winners were tried (e.g., 1:1, 1:0.5, 1:0.25).
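The upsample/scale/split step can be sketched with scikit-learn on synthetic data, as below. The upsample_winners helper and the feature names are hypothetical. One deliberate choice in the sketch: it splits first and upsamples only the training portion, so duplicated winner rows cannot land on both sides of the split. That is a choice made for this example, not a description of the original notebooks.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample


def upsample_winners(df, ratio, label="winner", seed=0):
    """Resample winners with replacement until winners = ratio * not-winners."""
    losers = df[df[label] == 0]
    winners = df[df[label] == 1]
    target = int(round(ratio * len(losers)))
    boosted = resample(winners, replace=True, n_samples=target, random_state=seed)
    # Shuffle so the classes are interleaved.
    return pd.concat([losers, boosted]).sample(frac=1, random_state=seed)


# Synthetic stand-in data: 200 team-seasons, 10 of them winners.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f1", "f2", "f3", "f4"])
data["winner"] = [1] * 10 + [0] * 190

train, test = train_test_split(
    data, test_size=0.25, stratify=data["winner"], random_state=0
)
train_up = upsample_winners(train, ratio=0.5)  # the 1:0.5 condition

# Fit the scaler on training features only, then apply it to both sets.
scaler = StandardScaler().fit(train_up.drop(columns="winner"))
X_train = scaler.transform(train_up.drop(columns="winner"))
X_test = scaler.transform(test.drop(columns="winner"))
y_train = train_up["winner"].to_numpy()
y_test = test["winner"].to_numpy()
```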

Train the model and evaluate F1-scores.
I tried a number of different models. Oftentimes, they were overfitted and produced pure junk. However, there was one model that worked fairly well. This was a grid-search with SVC.
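A grid search over an SVC, scored by F1, might look like the following sketch. The data are synthetic, and the parameter grid is hypothetical since the post does not list the values that were searched.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (roughly 10% positives).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid; the values searched in the post are not given.
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),  # scaling happens inside each CV fold
    param_grid,
    scoring="f1",  # F1 on the positive (winner) class
    cv=3,
)
search.fit(X_train, y_train)

test_f1 = f1_score(y_test, search.predict(X_test))
```

After fitting, search.best_params_ holds the chosen combination and search.best_estimator_ is the refitted pipeline.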

Validate the model with data from 2016, 2017, and 2018.
With the SVC-grid model, all three upsampling conditions yielded appropriate results for 2018 and 2017. I am disregarding 2016 simply because the winner of 2016 was the Cubs. (And it's the Cubs.) It was an improbable win based on the heart of the team (something that cannot be measured by a statistic!). I will give credit to models that pick the Indians for 2016, as they made it to the World Series that year.
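The post does not spell out how a single winner is picked per season. One plausible approach, assumed here, is to score every team and take the highest score in each year. The scores below are made up and stand in for something like a classifier's decision_function output; pick_champions is a hypothetical helper.

```python
import pandas as pd


def pick_champions(scored, score_col="score"):
    """Return the top-scoring team for each season as {year: team}."""
    top = scored.loc[scored.groupby("year")[score_col].idxmax()]
    return top.set_index("year")["team"].to_dict()


# Made-up scores standing in for model output on held-out seasons.
scored = pd.DataFrame({
    "year": [2017, 2017, 2018, 2018],
    "team": ["HOU", "LAD", "BOS", "LAD"],
    "score": [1.4, 0.9, 2.1, 0.3],
})
actual = {2017: "HOU", 2018: "BOS"}

picks = pick_champions(scored)
hits = {year: picks[year] == team for year, team in actual.items()}
```

Sorting by score within each year instead of taking only the maximum would also give a second-place pick.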

The upsampling ratios of not-winners to winners worked for the first two conditions (1:1, 1:0.5). Below is the 1:0.5 model.

The models were saved as a pickle.
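Saving and reloading a fitted model with pickle can be sketched as below. The tiny SVC and the file name svc_model.pkl are placeholders for the example.

```python
import os
import pickle
import tempfile

from sklearn.svm import SVC

# Tiny placeholder model so the example is self-contained.
model = SVC(kernel="linear").fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "svc_model.pkl")

with open(path, "wb") as fh:
    pickle.dump(model, fh)

with open(path, "rb") as fh:
    restored = pickle.load(fh)

predictions = restored.predict([[0.0], [3.0]])
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.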

Predict with partial regular season data from 2019. Interestingly, all three models predicted the Red Sox to win it all. Second place was less certain as all three models gave a different result.

The notebooks for upsampling/scaling/splitting data, model building/fine-tuning, and validation can be found in my GitHub.

The 1969 data works better than the 1905 data. I would therefore posit that the "modern era" of baseball starts in 1969.

The data needed to be upsampled because winners are heavily outnumbered by not-winners.

The SVC-grid model worked well. There were other models that gave complete rubbish due to overfitting.

Using most of the statistics, as opposed to just three major features, can produce a solid model.

The model does not consider "heart", "will to win", or "chemistry" when determining the winner.

The model does not account for freak accidents sustained during the post season.
