This is a sequel to my “Similar Games” article last year. Welcome to the 2019 update.
Readers of my first article on my March Madness Comparables Model should know the methodology behind predicting all 67 games of the March Madness tournament. If not, please visit bit.ly/Comparables2018 for an introduction to my work. In its conclusion, I noted several improvements that I would try to implement for this year’s Tournament. Through a combination of my own work and a well-connected friend, I made two additions to fix holes in my model.
Addition #1: Clustering Analysis
A quick primer for time-crunched readers: my model is centered around using “comparable” games to predict performance. As an example, for the VCU vs UCF matchup, I calculated similarity scores to identify the offenses most similar to UCF’s that VCU played this past season, and repeated the same process for each unit matchup. However, I had no structure for cases where there were no “similar” matchups, which was a common situation in 1 v 16 and 2 v 15 games. Instead of applying black magic (otherwise known as my intuition) to pick winners, I decided to apply clustering analyses in these situations.
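To make the fallback concrete, here is a minimal Python sketch of the decision between season comparables and cluster history. It is only an illustration: the actual similarity score is defined in the first article, and plain Euclidean distance on already-normalized stat vectors is swapped in here as a stand-in. The function names, the `max_distance` cutoff, and the team labels are all hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length stat vectors (a stand-in metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def comparable_opponents(target, opponents, max_distance):
    """Return (name, distance) pairs for opponents within max_distance, closest first.

    target       -- stat vector of the unit we want comparables for
    opponents    -- dict mapping opponent name to its stat vector
    max_distance -- hypothetical cutoff beyond which a game is not "similar"
    """
    scored = [(name, distance(target, stats)) for name, stats in opponents.items()]
    close = [(name, d) for name, d in scored if d <= max_distance]
    return sorted(close, key=lambda pair: pair[1])


def pick_method(comparables):
    """Use season comparables when any exist, otherwise fall back to cluster history."""
    return "season comparables" if comparables else "cluster history"
```

With a lopsided matchup such as a 1 v 16 game, `comparable_opponents` would typically come back empty, and `pick_method` would route the prediction to the clustering described next.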
I took the offensive and defensive 4F+ statistics from kenpom.com for every NCAA Tournament team from 2010 onwards, normalized them, and applied two clustering methods to group teams. I started with standard k-means clustering and decided on 10 clusters. This weighted each 4F+ statistic evenly, which does not accurately represent basketball, as statistics related to per-possession efficiency (AdjOE, AdjDE) and shooting ability (eFG%) are 2-3x more important than the other 4F+ statistics.
This led me to introduce a weighted clustering method courtesy of the R flexclust package. I applied a simple linear regression to identify the importance of each 4F statistic to winning and weighted each 4F+ statistic from there. Note that AdjOE and AdjDE were left out of the regression, as these statistics are themselves a result of performance in the 4F statistics. From this I determined that shooting explains about 50% of a game’s result, with turnovers, rebounding, and free throw rate (the rate at which a team shoots free throws) splitting the remaining 50% nearly equally. The resulting weights vector gave adjusted efficiency and effective field goal percentage equal weights of 1, while the remaining three 4F+ statistics were given a weight of .4 to represent the difference in importance.
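For readers who think in code, the normalize-then-weight idea can be sketched in a few lines of Python. This is not the flexclust implementation used for the model, just a small from-scratch k-means in which each squared difference is multiplied by a per-statistic weight. The feature names and their order are assumptions for illustration; the weights follow the 1 versus .4 split described above.

```python
import math
import random

# Hypothetical feature order; weights follow the 1 vs .4 split described in the text.
FEATURES = ["adj_eff", "efg", "to_pct", "or_pct", "ft_rate"]
WEIGHTS = [1.0, 1.0, 0.4, 0.4, 0.4]


def zscore_columns(rows):
    """Normalize each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has standard deviation 0; divide by 1.0 instead.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def weighted_sq_dist(a, b, weights):
    """Squared Euclidean distance with a weight on each statistic."""
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights))


def weighted_kmeans(rows, k, weights, iters=100, seed=0):
    """Lloyd-style k-means using the weighted distance; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = [None] * len(rows)
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: weighted_sq_dist(r, centers[j], weights))
            for r in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centers
```

Setting every weight to 1 recovers ordinary k-means, so the same function covers both clustering passes; lowering a weight toward 0 makes that statistic matter less when teams are grouped.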
After this, I had all NCAA Tournament teams grouped using two different clustering methods. I combined each team’s two cluster assignments into a “combined cluster” (a “ccluster” column) to form more distinct groups of NCAA Tournament offensive and defensive units. For anyone who wants to see the results of the clustering, don't hesitate to reach out to me.
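The exact way the two assignments were merged is not spelled out here, so the following Python sketch shows one plausible reading: every distinct pair of (k-means label, weighted label) becomes its own combined cluster. The function name and the toy labels are hypothetical.

```python
def combine_clusters(kmeans_labels, weighted_labels):
    """Give each distinct (k-means, weighted) label pair its own combined id.

    Returns the combined id for every team plus the pair-to-id lookup,
    so the number of combined clusters is len(pair_to_id).
    """
    pair_to_id = {}
    combined = []
    for pair in zip(kmeans_labels, weighted_labels):
        if pair not in pair_to_id:
            pair_to_id[pair] = len(pair_to_id)  # next unused id
        combined.append(pair_to_id[pair])
    return combined, pair_to_id


# Toy example with five teams (labels are made up):
ccluster, lookup = combine_clusters([0, 0, 1, 1, 2], [3, 4, 3, 3, 4])
# ccluster is [0, 1, 2, 2, 3]; only pairs that actually occur get an id.
```

Under this reading, teams share a combined cluster only when both methods agree they belong together, which is why the combined groups are more numerous and more distinct than either clustering alone.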
Combining the two methods created 48 unique clusters of NCAA Tournament offenses and 45 of NCAA Tournament defenses. As you can see by sorting through the tables above, most teams fell into a few clusters with outliers populating the rest, which suggests to me that exceptional offenses and defenses, while elite, are not revolutionary. I noted that outliers in playstyle, such as the vaunted VCU and Texas Tech defenses of recent years, have higher upside, as opponents have less time and familiarity to prepare for them, and this clustering method places these two defenses into a sparsely-populated cluster. This clustering enabled me to search for the results of previous matchups when games from the current season did not suffice. I also had an additional source, thanks to a lucky and well-placed connection, that I was able to use for further help.
Addition #2: Hoop Lens
First, this is not free advertising. Second, Hoop Lens is an incredible resource formed by my friend, Jay Cipoletti, and other partners that has the only lineup statistics available online at hooplens.com. After meeting him at the inaugural U.S. Soccer Hackathon, I gained access to this website for a limited cost. Not only can I calculate aggregate offensive and defensive points per possession over a series of games, but I can also calculate on/off splits for teams, which came in handy in injury situations. For example, Kansas lost Lagerald Vick and Udoka Azubuike for the NCAA Tournament, and so I wanted to remove possessions that occurred with either of the two players on the court. KenPom statistics were unable to differentiate the performance of lineups with those players from those without, but with Hoop Lens I was able to check Kansas’ performance in groups without those two players to make a more accurate judgment on their play.
With improved regressions and two additions to my model, I was confident in a marked improvement in my performance for March Madness 2019.
This Year’s Performance
I ran less advanced versions of this model over the last two years. Each year, I had great Second Round results. I started a perfect 15-0 in Year 1 on the way to finishing 29-3, and I went 25-7 in Year 2. However, I had a middling 19-13 this year. I would have done better selecting games using black magic or my intuition as opposed to using several information sources to add more data to my prediction process. (As a side note, if I were President, I would mandate that all university Spring Breaks occur on the week after Selection Sunday so that I can run my model and make predictions in peace without dealing with schoolwork. Any 2020 candidate that mandates this gets my vote.)
Here are the units, players, or moments that thwarted my quest for perfection in the Second Round. In the order of games played in the Second Round (check out my projections for this round in Table 1 below):
- NCAA’s Pitino Punishment – Louisville was one of my biggest favorites. Apparently, Richard Pitino really wanted to avenge his father’s questionable recruiting methods at Louisville. I’ll take note of revenge narratives next year.
- Miye Oni, Yale – Crime: going 1/10 from the 3-PT line, and yet Yale only lost by 5. I’m sure he’ll be beating himself up with his Yale degree (Never mind, it’s political science).
- New Mexico State’s decision making – This time, (unnamed player) decides to pass to (UP2) for a 3-pt shot instead of shooting a game-tying, wide-open layup. UP2 proceeds to get fouled for 3 free throws … and misses 2 to return Auburn’s gift. I’m not picking the Aggies ever again.
- Northeastern’s offense – They went 34.5% from 2, 21.4% from 3 – that’s terrible. (And yes, I did pick Northeastern to beat Kansas)
- Caleb Martin/Jordan Caroline, Nevada – Your most important players cannot go a combined 5/18 from 2, 2/15 from 3, and 10/16 from the FT line if you want to win. The last 6 weeks of Nevada’s season undermined an exciting and promising start.
- Syracuse’s (formerly) vaunted 2-3 zone – Baylor went 16/34 from 3 and scored 1.3 PPP against a Syracuse zone that normally gives headaches to opposing offenses.
- Potpourri of issues, Wisconsin – Going 6/30 from 3 while allowing 7/15 from 3 guaranteed Ethan Happ an unceremonious sendoff.
- Potpourri of issues, Utah State Edition – The turnover influenza took form in a 30.4% turnover percentage, and allowing 10/17 from 3 dealt a big blow to my Cinderella pick of the tournament.
- VCU’s offense – 13/35 from 2, 6/26 from 3. Again, that’s terrible, and it cost them a chance to test their special defense against Zion.
It is commonly stated that those who do well in the beginning often fare poorly in the more important latter stages and vice versa, and I can say that this was the case for my bracket.
In the Round of 32, I had solid success (see Table 2 below). My model was able to predict the correct winner in 12 of 16 games, though I note that I used updated information to make picks for later rounds. Thus, some picks made with new information differ from those in my submitted brackets.
Table 3 then shows my predictions using updated data for the later rounds. I predicted 6 out of 8 Elite Eight members, 2 out of 4 Final Four members, and both participants in the National Championship Game. Table 4 shows how many teams my submitted brackets correctly selected from the Sweet 16 onwards.
Ultimately, my bracket results were good enough to place in the top 10%, and if Jarrett Culver had passed the ball at the end of the National Championship Game, potentially in the top .1%.
With additional information, I made some predictions that would have foreshadowed chaos in the Midwest Region. Prior to the Tournament, I felt that North Carolina was an extremely weak #1 seed due to their lack of a legitimate NBA prospect (in my opinion), their crashing out of the Tournament last season, and their overseeding by the NCAA Tournament committee (per KenPom pre-tourney rankings). Their region had several teams with strong offenses and with defenses capable of holding North Carolina’s offense in check, and I believed that this would manifest in multiple lower seeds making it into the later rounds in that region. While Iowa State, Utah State, and New Mexico State all lost in their first game, I was partially vindicated by Auburn, who were only a 5-seed but ended up making it into the Final Four.
My new additions to the model helped to account for the power of Texas Tech’s unique defensive style, with comparable defenses (such as VCU and Cincinnati over past years) having unprecedented defensive success in the Tournament. There were no special offenses in this Tournament, as KenPom’s top offenses of the year, Virginia and Gonzaga, were in the well-populated, “regular” elite cluster. However, I was taken by one team whose offense was in a cluster of its own and put together some special performances during the season. Thus, I was more inclined to select them to progress further in the Tournament. That team was Iowa State, who let me down, losing to a poor Ohio State team in the Second Round.
My performance in the later rounds outweighed my failure in the Round of 64. I believe that my new additions accounted for more nuanced situations, specifically injuries, and gave me a better structure in which to think about matchups and make predictions. For improvements, I will reference many of the same ones listed in my first article. Time allowing, I will try to do some work on the fates of “overseeded/underseeded” teams and outliers in past NCAA Tournaments. For those who would like to offer their help or ask me for clarification, please reference my contact information in my first article (email preferred). Lastly, as the resident college basketball expert and as a gift for getting this far in my article, I will lay my reputation on the line and state that Michigan will be a Final Four team next year and is my favorite to be crowned the best team in the country.