Baseball has always been unpredictable. This is especially true when trying to predict the next winner of the World Series. Some rely on statistics, while others simply rely on gut instinct. In this study, I will use a machine learning model to predict the World Series winner for 2018.
Thankfully, baseball is a sport largely driven by statistics. The challenge, however, lies in narrowing the dataset to the statistically significant data that distinguish World Series winners from all other teams. Narrowing the dataset involves two decisions. The first is which years are relevant; the second is which metrics should be studied.
There is a bit of controversy in determining the "modern" era of baseball. Some references argue that 1905 is appropriate, since from 1905 onward the World Series was played on a consistent annual basis. Additionally, it was around this time that foul balls were counted as strikes, RBIs and ERAs were tracked, the mound was moved further back (allowing for breaking balls), and spitballs were made illegal. However, other references argue that 1969 is the true start of "modern" baseball. The case for 1969 rests on significant rule changes that brought the game closer to current baseball rules. For example, 1969 marked the first year of division play and the expanded postseason, the strike zone shrank, and the pitcher's mound was lowered by five inches. All of these changes resulted in very different metrics being posted by each team.
When considering which metrics should be studied, three stood out as truly important: ERA (earned run average), Runs, and WP (win percentage). ERA reflects the quality of a team's pitching. Total runs during the regular season indicate the quality of the bats in the dugout. Finally, win percentage is a basic statistic that can be used to compare teams. Data for the type of batting outcome (double, triple, home run) will be ignored, as those statistics do not directly correlate to runs scored or to winning games. Data for defense will also be ignored, on the assumption that differences between teams in the quality of defensive play are negligible.
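As a rough illustration of how these three metrics could be derived from raw season totals, here is a minimal sketch. The function names and the sample season line are hypothetical and made up for this example; the ERA formula is the standard one (earned runs allowed per nine innings pitched).

```python
def win_percentage(wins: int, losses: int) -> float:
    """Fraction of decided games that were won (ties are excluded)."""
    games = wins + losses
    if games == 0:
        raise ValueError("team has no decided games")
    return wins / games


def earned_run_average(earned_runs: int, innings_pitched: float) -> float:
    """Earned runs allowed per nine innings pitched."""
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    return 9 * earned_runs / innings_pitched


# Hypothetical season line, invented purely for illustration.
season = {
    "wins": 100,
    "losses": 62,
    "runs": 800,
    "earned_runs": 600,
    "innings_pitched": 1450.0,
}

# The three features discussed above: win percentage, total runs, ERA.
features = (
    win_percentage(season["wins"], season["losses"]),
    season["runs"],
    earned_run_average(season["earned_runs"], season["innings_pitched"]),
)
print(features)
```

Many public datasets already ship these columns precomputed, in which case this step reduces to selecting them.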
These two graphs depict the percentage of games each team won during the regular season, one for 1905-2015 and one for 1969-2015. In both graphs, World Series winners are in blue, and non-winners are in red. Both populations are large and have a somewhat normal distribution. The means of the two sample groups (winners and non-winners) were calculated and compared. A Z-Score was then calculated and used to determine P-Values and confidence levels.
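For readers who want to reproduce this kind of comparison, the sketch below shows one common way to compute a two-sample Z-Score and a single-tail P-Value using only the Python standard library. It is an assumption about the procedure, not a record of the exact calculation used here: the variances are left unpooled, the sign of the Z-Score depends on which group is subtracted from which, and the sample values are toy numbers invented for the example (far too few for a real z-test).

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_z(sample_a, sample_b):
    """Z statistic for mean(sample_a) - mean(sample_b) with unpooled variances."""
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error


def one_tailed_p(z, tail="lower"):
    """Single-tail P-Value; 'lower' tests whether mean_a is below mean_b."""
    cdf = NormalDist().cdf(z)
    return cdf if tail == "lower" else 1.0 - cdf


# Toy win percentages, made up for illustration only.
non_winners = [0.45, 0.50, 0.48, 0.52, 0.47, 0.51]
winners = [0.60, 0.62, 0.58, 0.61]

# Subtracting winners from non-winners yields a negative Z-Score
# when winners have the higher mean.
z = two_sample_z(non_winners, winners)
p = one_tailed_p(z, tail="lower")

alpha = 0.0001
print(f"z = {z:.2f}, p = {p:.3g}, reject null: {p < alpha}")
```

The null hypothesis is rejected only when the P-Value falls below the chosen alpha, which is the decision rule applied in the tests that follow.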
The test hypothesis is that World Series winners tend to have a higher win percentage than non-World Series teams. This is a single-tail test. The calculated Z-Score is -22.78, and the corresponding P-Value is less than 0.00001. I chose a confidence level of 99.99% (alpha = 0.0001). Since the P-Value is less than alpha, the null hypothesis is rejected in favor of the test hypothesis. There is a statistically significant difference between the World Series winners and non-World Series teams.
Again, the test hypothesis is that World Series winners tend to have a higher win percentage than non-World Series teams. This is a single-tail test. The calculated Z-Score is -11.93, and the corresponding P-Value is less than 0.00001. I chose a confidence level of 99.99% (alpha = 0.0001). Since the P-Value is less than alpha, the null hypothesis is rejected in favor of the test hypothesis. There is a statistically significant difference between the World Series winners and non-World Series teams.
There is a statistically significant difference between World Series winners and non-winners for win percentage. However, there does not appear to be much of a difference between the 1905-2015 and 1969-2015 populations.
These two graphs depict the total runs of each team during the regular season, one for 1905-2015 and one for 1969-2015. In both graphs, World Series winners are in blue, and non-winners are in red. Both populations are large and have a somewhat normal distribution. The means of the two sample groups (winners and non-winners) were calculated and compared. A Z-Score was then calculated and used to determine P-Values and confidence levels.
The test hypothesis is that World Series winners tend to have a higher number of total runs than non-World Series teams. This is a single-tail test. The calculated Z-Score is -6.66, and the corresponding P-Value is less than 0.00001. I chose a confidence level of 99.99% (alpha = 0.0001). Since the P-Value is less than alpha, the null hypothesis is rejected in favor of the test hypothesis. There is a statistically significant difference between the World Series winners and non-World Series teams.
Again, the test hypothesis is that World Series winners tend to have a higher number of total runs than non-World Series teams. This is a single-tail test. The calculated Z-Score is -3.94, and the corresponding P-Value is less than 0.0001. I chose a confidence level of 99.9% (alpha = 0.001). Since the P-Value is less than alpha, the null hypothesis is rejected in favor of the test hypothesis. There is a statistically significant difference between the World Series winners and non-World Series teams.
There is a statistically significant difference between World Series winners and non-winners for total runs. However, there does not appear to be much of a difference between the 1905-2015 and 1969-2015 populations.
These two graphs depict the ERA of each team during the regular season, one for 1905-2015 and one for 1969-2015. In both graphs, World Series winners are in blue, and non-winners are in red. Both populations are large and have a somewhat normal distribution. The means of the two sample groups (winners and non-winners) were calculated and compared. A Z-Score was then calculated and used to determine P-Values and confidence levels.
The test hypothesis is that World Series winners tend to have a lower ERA than non-World Series teams. This is a single-tail test. The calculated Z-Score is -9.32, and the corresponding P-Value is less than 0.00001. I chose a confidence level of 99.99% (alpha = 0.0001). Since the P-Value is less than alpha, the null hypothesis is rejected in favor of the test hypothesis. There is a statistically significant difference between the World Series winners and non-World Series teams.
Again, the test hypothesis is that World Series winners tend to have a lower ERA than non-World Series teams. This is a single-tail test. The calculated Z-Score is -5.88, and the corresponding P-Value is less than 0.00001. I chose a confidence level of 99.99% (alpha = 0.0001). Since the P-Value is less than alpha, the null hypothesis is rejected in favor of the test hypothesis. There is a statistically significant difference between the World Series winners and non-World Series teams.
There is a statistically significant difference between World Series winners and non-winners for ERA. However, there does not appear to be much of a difference between the 1905-2015 and 1969-2015 populations.
So far, the data indicate that win percentage, total runs, and ERA all differ significantly between World Series winners and non-winners. Additionally, judging by the P-Values, the 1905 start point is not much different from the 1969 start point. Therefore, in building the ML model, win percentage, total runs, and ERA from 1905-2015 will be considered.
Two models were tested. Initially, the Random Forest model from Scikit-learn was used to evaluate the data. The model yielded a 32.7% success rate, a 67.3% failure rate, and a 4.14% false-positive rate. Because of the low success rate seen with the Random Forest model, a second model was tried. Logistic Regression yielded an 82.7% success rate, a 17.3% failure rate, and a 19% false-positive rate. The Logistic Regression model, while having a good success rate, had a very high false-positive rate. Of the two models, Logistic Regression appears to yield better results.
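The comparison above can be sketched with Scikit-learn roughly as follows. This is not the code behind the numbers reported here. The data are synthetic stand-ins generated on the spot, the hyperparameters are arbitrary, and the metric definitions are assumptions: "success rate" is taken to be the share of actual winners the model identifies (recall), and "false-positive rate" is the share of non-winners flagged as winners.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic team seasons: columns are win percentage, total runs, ERA.
n = 600
X = np.column_stack([
    rng.normal(0.50, 0.07, n),
    rng.normal(700, 80, n),
    rng.normal(4.0, 0.5, n),
])
# Synthetic label: the top 10% of a made-up score are marked as "winners".
score = (10 * (X[:, 0] - 0.5) + (X[:, 1] - 700) / 200
         - (X[:, 2] - 4.0) + rng.normal(0, 0.3, n))
y = (score > np.quantile(score, 0.9)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def evaluate(model):
    """Fit the model and return (success rate, false-positive rate) on the test split."""
    model.fit(X_train, y_train)
    tn, fp, fn, tp = confusion_matrix(
        y_test, model.predict(X_test), labels=[0, 1]
    ).ravel()
    success_rate = tp / (tp + fn)          # assumed definition: recall on winners
    false_positive_rate = fp / (fp + tn)   # non-winners wrongly flagged as winners
    return success_rate, false_positive_rate


models = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    # Features are scaled first so the regression converges reliably.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

results = {name: evaluate(model) for name, model in models.items()}
for name, (success, fpr) in results.items():
    print(f"{name}: success rate={success:.3f}, false-positive rate={fpr:.3f}")
```

Because winners are rare (one team per season), plain accuracy would look high for almost any model, which is why the sketch reports the two rates separately.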
The model was trained on data from 1905-2015 and tested on data from 2016 and 2017. Data from 2016 yielded a successful test run where the prediction matched the actual outcome. The model predicted the Cubs to win, and the Cubs did win the World Series. Interestingly, the Cubs' total runs during the regular season were the highest of any team.
Data from 2017 did not yield a successful run. The model predicted the Indians to win, while the Astros actually won the World Series. As with the 2016 season, the Astros' total runs were the highest of any team. It may be that total runs in the regular season is, on its own, the metric to evaluate when predicting the World Series winner.
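To make the two competing approaches concrete, the sketch below picks one team per season in two ways: by the highest model probability and by the highest regular-season run total. How the model's per-team outputs were turned into a single pick is not described here, so the "highest probability wins" rule is an assumption. The team names and numbers are placeholders, not real season data.

```python
def pick_winner(team_scores):
    """Return the team with the highest score (a probability or a run total)."""
    if not team_scores:
        raise ValueError("no teams to choose from")
    return max(team_scores, key=team_scores.get)


# Placeholder season: probabilities and run totals are invented for illustration.
season = {
    "Team A": {"win_probability": 0.31, "runs": 780},
    "Team B": {"win_probability": 0.27, "runs": 845},
    "Team C": {"win_probability": 0.12, "runs": 760},
}

# Model-based pick: the team the classifier rates most likely to win.
model_pick = pick_winner({team: s["win_probability"] for team, s in season.items()})

# Baseline pick: the team that scored the most regular-season runs.
runs_pick = pick_winner({team: s["runs"] for team, s in season.items()})

print(f"model pick: {model_pick}, most-runs pick: {runs_pick}")
```

Running both rules over many past seasons and counting how often each matches the actual champion would be a fairer comparison than the two test years available here.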
Given the current standings in the 2018 regular season, the model predicts the Astros to win the World Series. This agrees with a prediction found on the MLB website. However, if the recent trend of the team with the highest total runs winning the World Series holds, the Boston Red Sox might be poised to win it all.