Note: This piece was written as part of a hackathon sponsored by TruMedia.
By Steven Silverman (@)
TruMedia’s detailed pitch-by-pitch data allow for extraordinarily deep analysis. Having access to velocity, spin, and movement for every pitch, plus the identifying information about the involved parties and the outcome of the pitch and the plate appearance opens up many possibilities and avenues of exploration that were not available previously. In this report, I detail one narrow application of this new data set: clustering pitchers. I heavily modify an idea put forth by Vince Gennaro several years ago for use with TruMedia’s data. Should the clustering work, it could open up new options for player preparation, advance scouting and strategy, and player evaluation.
There are many clustering methods, but all seek to group similar things together to add information. In this analysis, I first separate pitchers by handedness and then examine their usage and movement against batters of each handedness: that is, no lefthanded pitchers will be clustered with any righthanders, but “Clayton Kershaw against RHBs,” “Clayton Kershaw against LHBs,” “Jon Lester against RHBs,” and “Chris Sale against LHBs” could all be in the same cluster. I use the provided data for 2013 and 2014 together to cluster the pitchers, first removing any pitchers with fewer than 100 pitches thrown across both years, then centering and scaling the data. (Pat Venditte, had he pitched in the majors in 2013 or 2014, would have had up to four entries, one for each combination of pitcher and batter handedness.) In particular, I use the following set of variables:
- Average fastball release velocity
- Average horizontal and vertical release point
- Swinging Strike %
- Zone %: percentage of pitches in the strike zone, using the rulebook definition for the sides of the zone and the provided variable for the top and bottom
- Edge %: percentage of pitches on the horizontal edges of the strike zone (not over the plate), defined as in this article
- Pitch usage Herfindahl index: the Herfindahl index is designed to measure competition among firms in an industry, but I have adapted it to measure how much a pitcher relies on one or two pitches rather than mixing them up. It is calculated by summing the squares of the usage percentages of each pitch.
- Average fastball spin rate
- Average non-fastball spin rate
- Average horizontal and vertical fastball movement: unfortunately, the TruMedia files do not include columns for pitch movement. However, by using the standard kinematic equations and the given position, velocity, and acceleration columns, I was able to back-calculate the time each pitch took to travel to home plate, and from there infer the “movement” by subtracting the implied position at home plate from the actual position. For vertical movement, I removed the effect of gravity. These numbers will be slightly off, as baseballs spinning through air are neither entirely ballistic nor subject to constant acceleration, but the error should be small enough not to affect the results significantly. In addition, I believe there is systematic measurement error, as calculating the duration of each pitch from each of the three axes yields three different results. However, since I center and scale the data before clustering, a consistent bias of this kind should not cause problems, because all measurements are relative anyway.
- Average horizontal and vertical non-fastball movement (calculated similarly)
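Two of the derived variables in the list above, the pitch usage Herfindahl index and the back-calculated movement, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the study: the argument names (`y0`, `vy0`, `ay`, `ax`, `az`), the units (feet and seconds), the assumed plate position `PLATE_Y`, and the convention that the ball travels in the negative y direction are all hypothetical, since the TruMedia column layout is not described here.

```python
import math

G = 32.174       # gravitational acceleration in ft/s^2 (assumes feet and seconds)
PLATE_Y = 1.417  # assumed y-coordinate of the front of home plate, in feet


def herfindahl(pitch_counts):
    """Sum of squared usage shares: 1.0 for a one-pitch pitcher, 1/k for k pitches used equally."""
    total = sum(pitch_counts.values())
    return sum((n / total) ** 2 for n in pitch_counts.values())


def flight_time(y0, vy0, ay, y_plate=PLATE_Y):
    """Solve y0 + vy0*t + 0.5*ay*t^2 = y_plate for the positive flight time t."""
    a, b, c = 0.5 * ay, vy0, y0 - y_plate
    if abs(a) < 1e-12:
        # No acceleration along y: the equation is linear in t.
        return -c / b
    disc = b * b - 4 * a * c
    # With vy0 < 0 and y0 > y_plate, this root is the physically relevant one.
    return (-b - math.sqrt(disc)) / (2 * a)


def movement(ax, az, t):
    """Actual minus straight-line position at the plate, with gravity removed from z (feet)."""
    dx = 0.5 * ax * t ** 2
    dz = 0.5 * (az + G) * t ** 2
    return dx, dz
```

For example, `herfindahl({"FF": 60, "SL": 40})` gives 0.52, reflecting a heavy two-pitch mix. The movement values come out in feet under these assumptions; multiply by 12 for inches.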
A pairs plot of each predictor variable against every other doesn’t reveal any major issues with collinearity: more horizontal movement is associated with less vertical movement, and higher spin rates mean more movement in general. One interesting relationship is between spin rate and fastball movement, shown below for lefthanders:
Multiple relationships are in play. The total movement seems constrained: as pitchers add sidespin for more horizontal movement, they sacrifice backspin and vertical movement. In addition, more total spin is associated with more total movement, as makes sense intuitively.
After centering and scaling each variable, I calculate the pairwise Euclidean distance between points, and then run a hierarchical clustering algorithm (using Ward’s method) to create clusters. By inspecting the dendrograms (tree diagrams) and choosing a suitable height at which to cut the trees, I ended up with nineteen clusters of LHPs and twenty-eight clusters of RHPs. (390 LHPs met my cutoff criteria, as did 987 RHPs.) Below is an example of one cluster of lefthanded pitchers.
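The standardize-then-cluster step can be sketched as follows. This is a naive pure-Python illustration of Ward's criterion, not the implementation used for the study; the function names and the `cut_height` parameter are hypothetical. The merge height is defined here as the square root of twice the increase in within-cluster sum of squares, so merging two single points costs exactly their Euclidean distance. In practice a library routine (for example SciPy's `linkage` with `method="ward"`) would be far faster on roughly a thousand pitcher entries.

```python
import statistics


def standardize(rows):
    """Center each column to mean 0 and scale it to sample standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]  # a constant column would give 0 here
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def ward_clusters(points, cut_height):
    """Merge clusters by Ward's criterion until the cheapest merge exceeds cut_height."""
    # Each cluster is a pair: (list of member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (mi, ci), (mj, cj) = clusters[i], clusters[j]
                ni, nj = len(mi), len(mj)
                sq = sum((a - b) ** 2 for a, b in zip(ci, cj))
                height = (2 * ni * nj / (ni + nj) * sq) ** 0.5
                if best is None or height < best[0]:
                    best = (height, i, j)
        height, i, j = best
        if height > cut_height:
            break  # equivalent to cutting the dendrogram at cut_height
        (mi, ci), (mj, cj) = clusters[i], clusters[j]
        ni, nj = len(mi), len(mj)
        centroid = [(ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj)]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((mi + mj, centroid))
    labels = [0] * len(points)
    for label, (members, _) in enumerate(clusters):
        for m in members:
            labels[m] = label
    return labels
```

Because Ward merge heights never decrease as the tree is built, stopping at the first merge above `cut_height` yields the same groups as building the full tree and cutting it at that height.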
Note the slightly different heights at which each level of the tree joins up, as well as how Chapman, Kershaw, Miller, and Dunn approach hitters the same regardless of handedness.
Next, I calculate the wOBA (using FanGraphs’ published weights) against each of the forty-seven total clusters for every batter with at least 100 plate appearances in 2015. I also calculate the 2015 wOBA for such batters against every pitcher whom they faced at least ten times in 2015, and the wOBA for those batter-pitcher pairs in 2013 and 2014 combined. I chose wOBA as my evaluation metric since it is fairly simple to calculate, does not require any park or league adjustment (admittedly, this limits its accuracy, and should be corrected in a more detailed study), and is a reasonable estimate of hitting ability.
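A rough sketch of the wOBA calculation and the pooling by cluster is shown below. The formula is the standard FanGraphs one (weighted unintentional walks, hit-by-pitches, and hits over AB + BB - IBB + SF + HBP). The weights in `WEIGHTS` are rounded, illustrative placeholders rather than the exact published values, and the data shapes (`matchups`, `cluster_of`) are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Illustrative placeholder weights; take the actual season values from FanGraphs' "Guts!" page.
WEIGHTS = {"BB": 0.69, "HBP": 0.72, "1B": 0.88, "2B": 1.25, "3B": 1.59, "HR": 2.06}


def woba(counts, weights=WEIGHTS):
    """counts maps AB, BB, IBB, SF, HBP, 1B, 2B, 3B, HR to event totals."""
    unintentional_bb = counts["BB"] - counts["IBB"]
    numerator = weights["BB"] * unintentional_bb + sum(
        weights[k] * counts[k] for k in ("HBP", "1B", "2B", "3B", "HR")
    )
    denominator = counts["AB"] + counts["BB"] - counts["IBB"] + counts["SF"] + counts["HBP"]
    return numerator / denominator if denominator else None


def woba_by_cluster(matchups, cluster_of):
    """Pool each batter's events over all pitcher entries in a cluster, then compute wOBA.

    matchups: iterable of (batter, pitcher_key, counts) tuples (hypothetical shape)
    cluster_of: dict mapping pitcher_key to a cluster id
    """
    pooled = defaultdict(Counter)
    for batter, pitcher_key, counts in matchups:
        pooled[(batter, cluster_of[pitcher_key])].update(counts)
    return {key: woba(c) for key, c in pooled.items()}
```

Pooling the raw event counts before dividing weights each plate appearance equally, rather than averaging per-pitcher wOBA values that rest on very different sample sizes.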
With actual results, the cluster-predicted results, and historical results, we can now examine whether such clustering is indeed effective at predicting hitter performance.
The initial results are very promising for a rather superficial study. The mean absolute error of the projected wOBA values was roughly .148, while using historical results against just one pitcher led to a MAE of about .207. For a metric that averages a bit above .300, both these numbers are quite large, but improving accuracy by over 25% is helpful nonetheless. I will discuss how this method could be improved later on.
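The "over 25%" figure follows directly from the two error values quoted above. A small sketch of the arithmetic, with hypothetical helper names:

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between paired predictions and outcomes."""
    pairs = list(zip(predicted, actual))
    return sum(abs(p - a) for p, a in pairs) / len(pairs)


def relative_improvement(baseline_error, new_error):
    """Fraction by which new_error improves on baseline_error."""
    return (baseline_error - new_error) / baseline_error


# One-on-one history (MAE about .207) versus the cluster projection (MAE about .148).
print(round(relative_improvement(0.207, 0.148), 3))  # 0.285
```

That is, the cluster-based projection cuts the mean absolute error by roughly 28.5% relative to the batter-versus-pitcher history.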
Besides creating a basic hitter projection, this method is advantageous for another reason: the identities of each of the clusters can lend insight into how pitchers that might not appear similar actually are, which could be of use in advance scouting and player preparation. Below, I present a few especially interesting clusters and pieces of information. (A (B) after a pitcher means their matchups against righties and lefties are both included in the cluster; otherwise, the included handedness is specified. Only pitchers who played in 2015 are listed.)
- Mike Dunn (B), Sean Doolittle (L), Andrew Miller (B), Clayton Kershaw (B), Aroldis Chapman (B): This interesting mix of tall lefthanders all throw reasonably hard (with Chapman pulling the average FB velocity up to 94) and work the strike zone (43.0 Zone%, the highest of any cluster). With the Yankees’ recent acquisition of Chapman, they now have two similarly elite bullpen arms. Other than Kershaw, they all rely on relatively few pitches (mostly fastball-slider), with the highest Herfindahl index of any cluster at 0.46. They also get lots of swinging strikes and have great spin rate and vertical movement (lack of drop) on fastballs.
- Randy Choate (B), Javier Lopez (L), Joe Thatcher (R), Alex Claudio (L): This is the soft-tossing lefty group, with the average fastball velocity a mere 85 MPH. They have very low fastball spin rates overall and throw lots of sinkers and two-seamers, leading to the lowest fastball vertical movement of any cluster.
- Bartolo Colon (R), Matt Albers (R), Chad Qualls (B), Jared Hughes (R), Burke Badenhop (R), Ryan Webb (B), et al.: Righthanders who throw sinkers and two-seamers without much velocity, these pitchers all induce lots of ground balls. (I elected not to include GB% and other batted ball metrics to avoid results-based clustering, with the exception of swinging strike percentage. Besides, vertical movement should serve as a rough proxy for batted ball distribution.) They have the third-lowest fastball vertical movement among righthanded clusters, and the second-lowest swinging strike percentage.
- Charlie Morton (B), Jeremy Jeffress (B), Sam Dyson (R), Craig Kimbrel (B), Cody Allen (B), Aaron Sanchez (B), Jose Fernandez (R), Shelby Miller (B), et al.: This group of righthanders all have excellent movement on their breaking pitches. With the exception of Dyson, they all throw curveballs that break hard down and in on lefties and down and away from righthanders (Dyson relies on his slider). For the most part, pitchers in this cluster are included against batters of both handednesses. They have the third-highest non-fastball spin rate among righthanded clusters, but the elite movement suggests that more of the spin is “useful,” with the spin axis more perpendicular to the direction of motion. Unfortunately, the spin direction provided is only in the xz-plane, so that hypothesis cannot be confirmed with this data set.
Each of the clusters reveals more and more insight about how pitchers are similar, sometimes in ways that are not intuitive from their results or from scouting.
With more time and computational power, I could significantly improve upon these results. One drawback of the current method is that there are only forty-seven possible predicted values for a given batter across all pitchers. This coarse grouping can cause distortions for pitchers who are on the edge of being in one cluster or another, since pitchers who are some distance apart are given the same projection, while pitchers very close to each other but on opposite sides of a break point are given different ones. Using all pitchers within a certain distance of the pitcher of interest is a better method, but is much more computationally expensive to perform.
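The distance-based alternative described above might look something like the sketch below. It was not part of this study; the function name, the data shapes, and the choice to weight each neighbor by plate appearances are all assumptions made for illustration.

```python
import math


def neighbor_projection(target, features, batter_history, radius):
    """Project a batter's wOBA against `target` from all pitchers within `radius` of him.

    features: dict of pitcher key -> standardized feature vector (hypothetical shape)
    batter_history: dict of pitcher key -> (wOBA, plate appearances) for one batter
    Returns a PA-weighted average, or None if the batter has faced no one in range.
    """
    center = features[target]
    weighted_sum = 0.0
    total_pa = 0
    for key, vec in features.items():
        if key not in batter_history:
            continue
        # Euclidean distance in the standardized feature space; the target himself is at 0.
        if math.dist(center, vec) <= radius:
            woba_value, pa = batter_history[key]
            weighted_sum += woba_value * pa
            total_pa += pa
    return weighted_sum / total_pa if total_pa else None
```

Each pitcher then gets his own neighborhood, so two pitchers who sit close together no longer receive different projections merely because a cluster boundary falls between them. The cost is one distance computation per pitcher pair instead of a single clustering pass.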
Another drawback of the pure clustering is that the past quality of the pitcher is not included, except as a small part of the historical plate appearances against the cluster. Nor did I include an aging adjustment, regression to the mean, or any of the other standard modifications to projection systems that yield big improvements. As mentioned previously, wOBA does not control for park, league, or quality of opponents, which more complicated stats do. With that in mind, the large increase in predictive power over simple one-on-one historical results is actually quite impressive, and the method can be further refined to eliminate much of the error.
Finally, I would like to redo this study with a different set of predictor variables. With so many included, dimensionality becomes an issue, so a preprocessing algorithm like PCA could be useful. I could also examine the importance of each of the variables in the data set to past performance and either select or weight the variables based on that information. As a first step, however, this project is a good proof of concept.
- Vince Gennaro, “Clustering Pitchers By Similarity: Part 1”: http://vincegennaro.mlblogs.com/2013/04/22/clustering-pitchers-by-similarity-part-1/
- Vince Gennaro, “Clustering Pitchers By Similarity: Part 2”: http://vincegennaro.mlblogs.com/2013/06/03/clustering-pitchers-by-similarity-part-2/
- FanGraphs, “wOBA”: http://www.fangraphs.com/library/offense/woba/
- FanGraphs, “Guts!”: http://www.fangraphs.com/guts.aspx
Appendix: Summary Results
Below are the summary statistics for each cluster—first the table for LHPs, then RHPs.