By Ron Yurko (@flyingSerb21)
In a recent blog post, the renowned Tom Tango points out a moment in which Brian Kenny (the sabermetrics frontman for MLB Network) uses flawed reasoning to pick the best pitchers in baseball. Kenny only used stats such as ERA and FIP, and while FIP is intended to be independent of the defense, both of these metrics are based on the outcomes of at-bats. As Tango states, “The outcome numbers are just observations, filled with random variation. We really only care about the pitcher’s actual talent.” This led me to consider looking only at measurements of a pitcher’s control, his ability to induce swings, and batted ball information. Thanks to PITCHf/x we have access to much more information on pitchers and can get a really good picture of their true ability. Using PITCHf/x data, Bill Petti and Jeff Zimmerman developed several amazing metrics on a pitcher’s placement of his pitches, such as the percentage of pitches he throws on the edge of the strike zone. Courtesy of Fangraphs and Petti’s great Shiny app, I chose 25 metrics that only describe a pitcher’s pitches and the resulting batted balls, with no information regarding the actual outcomes such as strikeouts, walks, hits, outs, or home runs. These are listed in Table 1 with their descriptions:
|Metric||Description|
|Line Drive % (LD%)||Line Drives / Balls in Play|
|Ground Ball % (GB%)||Ground Balls / Balls in Play|
|Fly Ball % (FB%)||Fly Balls / Balls in Play|
|O-Swing %||Swings at pitches outside the zone / pitches outside the zone|
|Z-Swing %||Swings at pitches inside the zone / pitches inside the zone|
|Swing %||Swings / Pitches|
|O-Contact %||Number of pitches on which contact was made on pitches outside the zone / Swings on pitches outside the zone|
|Z-Contact %||Number of pitches on which contact was made on pitches inside the zone / Swings on pitches inside the zone|
|Contact %||Number of pitches on which contact was made / Swings|
|Zone %||Pitches in the strike zone / Total pitches|
|F-Strike %||First pitch strikes / Plate appearances|
|SwStr %||Swings and misses / Total pitches|
|Pull %||Pulled balls / Total batted balls|
|Center %||Center balls / Total batted balls|
|Oppo %||Oppo balls / Total batted balls|
|Soft %||Soft hit balls / Total batted balls|
|Med %||Medium hit balls / Total batted balls|
|Hard %||Hard hit balls / Total batted balls|
|Horizontal Edge %||See article by Bill Petti|
|Top Edge %||(see above)|
|Bottom Edge %||(see above)|
|Total Edge %||(see above)|
|Heart %||(see above)|
|Out of Zone %||(see above)|
|Edge to Heart Ratio||(see above)|
Using these metrics, I want to see if there is some natural separation between various types of pitchers and their performances. However, since there are 25 dimensions (which is more confusing than Interstellar…), I need to apply a dimension reduction technique in order to easily view the pitchers. Since it is the most widely used linear dimension reduction technique, and almost as old as statistics itself, I will use Principal Component Analysis (PCA).
BEWARE MATH DETAILS FOLLOW, IF YOU DON’T CARE SKIP TO NEXT PARAGRAPH:
(But you should read this, don’t be ignorant…)
The essential goal of a linear dimension reduction technique is to find the straight lines in the feature space of the data set along which the data exhibit the largest variance. Usually when people hear the term variance they immediately think of uncertainty and don’t associate it with providing information. But consider the following: if I’m looking at 100 pitchers and all of them have the same ERA, what does that tell me about the pitchers? There are no differences; the variance is 0. But if the ERAs of the 100 different pitchers vary greatly, now we have something suggesting differences between the pitchers. In other words, with dimension reduction techniques we are attempting to find the dimensions with the most interesting or noticeable trends, providing us with useful information using fewer variables. PCA uses an orthogonal transformation to project the observed data set of potentially related variables into a lower-dimensional space of uncorrelated variables. This is the beauty of PCA: while many of the above 25 variables are obviously related, the projections in the lower-dimensional space are uncorrelated with one another. To do this, we find the projections that maximize the variance. The first principal component is the direction in which the projections possess the largest variance. The second principal component is the direction which maximizes variance among all directions orthogonal to the first, so its projections are uncorrelated with those of the first and provide additional information. This process continues, finding directions up to the number of original dimensions, which in this case is 25. What I’ll be able to look at are the principal component directions, which show how the 25 variables in Table 1 are related to each component, as well as the scores for each component, which are the projections of the pitchers under consideration onto that component.
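For readers who stuck with the math, here is a minimal sketch of the procedure described above, written with NumPy. It is not the code used for this post: the function name `pca` and the synthetic data are assumptions for illustration only, and computing PCA through the singular value decomposition is one common way to do it among several.

```python
import numpy as np

def pca(X):
    """PCA via the SVD of the centered, unit-variance-scaled data matrix.

    Returns (directions, scores, variances): directions has one column per
    component, scores are the projections of each row onto the components,
    and variances are the sample variances of the score columns.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # center and scale columns
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    directions = Vt.T                    # columns are orthonormal directions
    scores = Z @ directions              # projections of the rows
    variances = s**2 / (Z.shape[0] - 1)  # sorted from largest to smallest
    return directions, scores, variances

# Synthetic stand-in for the pitcher data (hypothetical, NOT real stats):
# 148 rows and 5 correlated columns built from 2 hidden factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(148, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.3 * rng.normal(size=(148, 5))

directions, scores, variances = pca(X)

# The covariance matrix of the scores is diagonal, i.e. the score columns
# are uncorrelated even though the original columns are not.
print(np.round(np.cov(scores, rowvar=False), 3))
```

The directions play the role of the "component directions" discussed in this post, and the score columns are what gets plotted when pitchers are viewed in the reduced space.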
For simplicity, I only looked at 2015 starting pitchers with at least 5 starts (prior to May 28th) resulting in a data set of 148 pitchers with the 25 different variables in Table 1. I then performed PCA on this 148 by 25 matrix (after centering and scaling the columns). Prior to viewing how the 148 pitchers compare in a lower dimensional space, it is important to note that with any dimension reduction technique you are losing information. However, we are able to calculate the proportion of variance explained by a principal component and are thus able to approximate the amount of information captured by a certain number of components. These percentages are displayed in Table 2 for the first 10 components:
|Principal Component #||Proportion of Variance Explained||Cumulative Proportion|
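As a rough illustration of where proportions like those in Table 2 come from, the sketch below computes the proportion of variance explained and its running total from the singular values of a centered and scaled matrix. The helper name `variance_explained` and the random 148 by 25 matrix are assumptions made for this example; the printed numbers are not the values from this analysis.

```python
import numpy as np

def variance_explained(X):
    """Proportion and cumulative proportion of variance per component."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # center and scale columns
    s = np.linalg.svd(Z, compute_uv=False)            # singular values, descending
    proportion = s**2 / np.sum(s**2)
    return proportion, np.cumsum(proportion)

# Hypothetical correlated data with the same shape as the pitcher matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(148, 25)) @ rng.normal(size=(25, 25))

proportion, cumulative = variance_explained(X)
for k in range(10):  # first 10 components, as in Table 2
    print(f"{k + 1:>2}  {proportion[k]:.4f}  {cumulative[k]:.4f}")
```

The cumulative column always reaches 1 once all 25 components are included, which is another way of saying that information is only lost when components are dropped.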
Although only 35.84% of the total variation in the data set is accounted for by the first two principal components, we can easily view the 148 pitchers by projecting into this two-dimensional space as seen in Figure 1:
Remember, these principal components are uncorrelated with one another, so no linear relationship should be visible, which is why a random scattering of players appears. However, there are some noticeable features in regard to where certain pitchers are located. For instance, Francisco Liriano and Phil Hughes are on completely opposite sides for both components, possibly indicating how different they are as pitchers. Additionally, Clayton Kershaw, Chris Sale, and Corey Kluber are all pretty close together, capturing the similarities between these aces. As mentioned before, we can view the component directions to see how each variable is related. The directions for the first two components are seen in Table 3:
|Variable||Component 1 Direction||Component 2 Direction|
|Line Drive % (LD%)||-0.055||-0.077|
|Ground Ball % (GB%)||0.239||0.073|
|Fly Ball % (FB%)||-0.225||-0.042|
|Horizontal Edge %||-0.056||-0.215|
|Top Edge %||-0.301||0.019|
|Bottom Edge %||0.126||-0.025|
|Total Edge %||-0.147||-0.211|
|Out of Zone %||0.358||0.204|
|Edge to Heart Ratio||0.251||0.001|
To understand how to interpret these values consider this example:
- The first component has a positive projection for GB% but a negative projection for FB%, meaning it separates groundball pitchers from flyball pitchers. Meanwhile, both GB% and FB% are barely projected at all for the second component.
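To make the sign interpretation concrete, here is a small worked example using the GB% and FB% directions for component 1 from Table 3. The two pitchers and their standardized values are hypothetical, and the result is only the part of the score contributed by these two variables; a full score sums over all 25.

```python
# Component 1 directions for two variables, copied from Table 3.
pc1 = {"GB%": 0.239, "FB%": -0.225}

def partial_score(z, direction):
    """Sum of direction * standardized value over the listed variables only."""
    return sum(direction[name] * z[name] for name in direction)

# Hypothetical standardized values (standard deviations from the sample mean).
groundballer = {"GB%": 1.0, "FB%": -1.0}
flyballer = {"GB%": -1.0, "FB%": 1.0}

print(round(partial_score(groundballer, pc1), 3))  # 0.464
print(round(partial_score(flyballer, pc1), 3))     # -0.464
```

Opposite signs on the two directions push the groundball pitcher and the flyball pitcher to opposite ends of the first component, which is what "separates" means here.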
We can also note the interesting contrast between Pull% and Oppo% in the first component, as well as between pitches in the heart of the zone and pitches out of the zone. The second component seems to place more weight on opposing batters’ plate discipline metrics, as well as on separating pitchers that induce soft contact.
To see how these first two projections compare with the outcome numbers, Figures 2 and 3 show the two-dimensional space but with a color scale for ERA and FIP respectively. For both, larger values are red and lower values are green, with values near the median in white.
Interestingly, in both plots the high and low values appear to separate along the second component more so than the first. Furthermore, the separation appears clearer in the plot with the FIP scale, which should not be surprising since the point of using FIP rather than ERA is to isolate the pitcher’s skill from the defense. I decided to look at the correlations between the projections of all components and each pitcher’s FIP, with the first ten listed in Table 4:
|Principal Component #||Correlation w/ FIP|
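A comparison like the one in Table 4 can be sketched as a Pearson correlation between each column of component scores and an outcome statistic. Everything below is assumed for illustration: the helper name `score_correlations`, the random scores, and the made-up FIP-like vector that is deliberately tied to the second and third columns.

```python
import numpy as np

def score_correlations(scores, outcome):
    """Pearson correlation of each score column with an outcome vector."""
    scores = np.asarray(scores, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    return np.array([
        np.corrcoef(scores[:, j], outcome)[0, 1]
        for j in range(scores.shape[1])
    ])

# Hypothetical scores for 148 pitchers on 10 components, plus a fake outcome
# that depends on components 2 and 3 (columns 1 and 2) and random noise.
rng = np.random.default_rng(2)
scores = rng.normal(size=(148, 10))
fip = 3.9 - 0.4 * scores[:, 1] + 0.3 * scores[:, 2] + rng.normal(scale=0.5, size=148)

corr = score_correlations(scores, fip)
for k, r in enumerate(corr, start=1):
    print(f"{k:>2}  {r:+.3f}")
```

The same function would work for any outcome vector of matching length, such as K/9.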
Unsurprisingly, the first three components display the strongest correlations with FIP. While these are not perfect relationships, the moderate values for components two and three show that descriptive metrics alone do relate to the outcome statistics. To reiterate this point, Figure 4 displays the projections for components 2 and 3, while Figure 5 also includes the FIP color scale, showing a slightly more noticeable separation. The directions for component 3 are displayed in Table 5.
|Variable||Component 3 Direction|
|Line Drive % (LD%)||-0.092|
|Ground Ball % (GB%)||0.405|
|Fly Ball % (FB%)||-0.383|
|Horizontal Edge %||-0.023|
|Top Edge %||-0.129|
|Bottom Edge %||0.218|
|Total Edge %||0.039|
|Out of Zone %||-0.126|
|Edge to Heart Ratio||-0.112|
Using the projections of components two and three, I also decided to check whether there was separation by strikeout rate and by pitch type usage, all metrics that were not included in the dimension reduction. Figure 6 displays the projections with a K/9 color scale, while Figures 7 to 10 use scales for fastball %, changeup %, curveball %, and slider % respectively (blue for low, red for high):
In this two-dimensional space, there appear to be pretty clear groupings of pitchers with high K/9 rates compared to those without, noticeably along the second component. The correlation between the second component and K/9 is .647, a fairly strong relationship. Revisiting the second component’s directions, this is not surprising considering the positive projections for inducing the batter to swing at pitches outside the strike zone. Noticeably, pitchers with lower fastball usage tend to be grouped together with pitchers with lower strikeout rates in this space.
This was just my first attempt at trying to detect differences between pitchers without using any outcome statistics. Since Petti’s app has data going back to 2010, over this summer I’m going to try to develop a ranking for the best pitchers in baseball from 2010 to 2014 using only descriptive metrics. This initial attempt with just the partial 2015 season revealed that there is an underlying structure among types of pitchers without considering the outcomes of at-bats. Evaluating a pitcher’s skillset in this manner essentially embraces the approach scouts use to evaluate pitchers. They don’t place grades on arms by merely counting outcomes, but rather by what they see in how the player pitches. With the plethora of data from PITCHf/x, we have measurements for the different takeaways scouts are looking for and can try to boil them down to a single number. Any feedback on what I’ve done so far or suggestions for this analysis would be greatly appreciated. Also, I want to point out that I did not include velocity as one of the variables, but I will be looking at how it relates to the reduced dimensional space.
Statistics courtesy of Fangraphs and Bill Petti’s app.
Hastie, Trevor, Robert Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer, 2009. Print.