Introduction to next-gen-scrapy

Written By: Sarah Mallepalle

Introducing next-gen-scrapy

The NFL's Next Gen Stats provides player tracking charts throughout the season, visualizing passing performances and real-time player locations. However, outside of these images, weekly-updated pass and player tracking data is not publicly available in the form of a dataset. In fact, as resident NFL-stats-master Ron Yurko (@Stat_Ron) loves to point out on Twitter, the NFL has historically released very little open data for us hungry fans in the analytics community. However, that seems to be changing, with the NFL’s announcement of the Big Data Bowl and release of player tracking data for the first six weeks of the 2017 season. Big thanks to Michael Lopez for the initiative in this exciting time for the sports analytics community, and for making all of us (especially Ron) here at CMSAC very excited for the future of data analytics in the NFL!

To build on 2017’s first six weeks of player tracking data from the Big Data Bowl, today I’m happy to introduce the first release of next-gen-scrapy: a GitHub repository that allows users to scrape all pass chart images and JSON data from Next Gen Stats, and extract all pass locations relative to the line of scrimmage from the 2017 regular season onwards. The output dataset, which will be updated weekly, can be used by anyone to examine pass performance for an individual player or team throughout a given season.

How next-gen-scrapy works

In every pass chart image on Next Gen Stats, we can see the locations of four different types of passes relative to the line of scrimmage: green circles are completions, white circles are incompletions, blue circles are touchdowns, and red circles are interceptions. Also, for every pass chart, the HTML stores JSON data that contains the number of passes of each type. The task of pass location detection is overall pretty simple – for every pass type/color, extract all circles on the chart, and then run k-means on this data, with k equal to the number of passes as specified in the JSON data. However, there are a number of issues with the pass chart images and JSON data that make this simple task much harder, so we have to do a little bit of maneuvering to extract pass locations as accurately as possible. Some of these issues include, but are not limited to:

  1. The number of passes shown in the pass chart does not always match the number of passes given in the JSON data.
  2. Two or more pass locations of the same type frequently overlap.
  3. Regarding touchdown passes, the color of the pass location circle is the same color as the line of scrimmage, and each has a path trajectory line attached to it, which is almost the same color as the pass location itself.
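
To make the basic idea concrete, here is a minimal sketch of the k-means step in plain Python. It is not the code from the repository: the toy pixel coordinates, the `circle_pixels` and `kmeans` helpers, and the farthest-first initialization are all assumptions made for illustration. In practice the pixels would come from one color-thresholded pass type, and k from the JSON counts.

```python
def dist2(p, q):
    """Squared Euclidean distance between two (x, y) points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def circle_pixels(cx, cy, r=3):
    """Toy stand-in for one pass circle: all pixels within r of (cx, cy)."""
    return [(x, y)
            for x in range(cx - r, cx + r + 1)
            for y in range(cy - r, cy + r + 1)
            if (x - cx) ** 2 + (y - cy) ** 2 <= r * r]

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm with a deterministic farthest-first start (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next starting center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(n_iter):
        # Assign every pixel to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its pixels (unchanged if it has none).
        new_centers = [
            (sum(x for x, _ in m) / len(m), sum(y for _, y in m) / len(m)) if m else c
            for c, m in zip(centers, clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

# Two well-separated toy circles, so k = 2.
pixels = circle_pixels(10, 10) + circle_pixels(40, 25)
centers = kmeans(pixels, k=2)
```

Each returned center is the estimated pixel location of one pass. When circles overlap or the JSON count disagrees with what is drawn, this simple version is no longer enough, which is where the issues listed above come in.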

To demonstrate, here are two example pass charts below: Josh Rosen in 2018 on the left, and Tom Brady in 2017 on the right. In Rosen’s chart, we can see examples of issues #1 and #3 – only 10 out of the supposed 27-15=12 incomplete passes are shown on the field, and the blue circle for the touchdown location is almost unidentifiable with the pass trajectory path right on top of it. In Brady’s chart, we can see issues #1 and #2 – Brady threw an interception in this game that is nowhere to be seen on the pass chart, and we can see on the right around the 15-yard line two incomplete passes that overlap. A very reasonable explanation for Rosen’s two missing incomplete passes is that these were thrown out of bounds; however, I cannot think of any explanation as to why Brady's interception is missing from the chart.
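
Since the JSON counts decide k for each pass type, here is a rough sketch of how that bookkeeping might look. The field names and numbers are invented for illustration (the real payload on Next Gen Stats may be laid out differently), and the sketch assumes standard stat accounting in which touchdowns count as completions and interceptions count as attempts.

```python
import json

# Hypothetical JSON payload; the real field names and values may differ.
payload = '{"attempts": 30, "completions": 20, "touchdowns": 2, "interceptions": 1}'

def expected_circle_counts(stats):
    """Number of circles of each color we would expect to find on the chart."""
    return {
        # Touchdowns are drawn in their own color, so remove them from completions.
        "complete": stats["completions"] - stats["touchdowns"],
        "touchdown": stats["touchdowns"],
        "interception": stats["interceptions"],
        # Every attempt that was neither caught nor intercepted.
        "incomplete": stats["attempts"] - stats["completions"] - stats["interceptions"],
    }

counts = expected_circle_counts(json.loads(payload))
```

With zero interceptions, the incomplete count reduces to attempts minus completions, which is the 27-15=12 calculation used for Rosen above. Because of issue #1, these counts are best treated as an upper guide for k rather than a guarantee.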


The extensive details of how we deal with these issues by using both k-means and DBSCAN will be laid out in a more in-depth post coming very soon, in which I’ll explain the entire pass location detection process step-by-step in a painful amount of detail. But for now, I'll take you through an example pass chart and image cleaning, and show you a final dataset result of pass location detection.

Example Pass Location Detection

For this example, we’ll use a pass chart of Philly’s Lord and Savior who brought home a championship exactly a decade after my beloved Phillies won the World Series: big... guy Nick. We start off with the image on the left scraped from Next Gen Stats. Next, we “clean” the image by getting rid of everything except for the football field and turning the trapezoidal field into a rectangle, to fix the field distortion and make all yard lines evenly spaced from each other. On the bottom right is a visual of the axes for determining pass locations, with the y-axis vertically running down the center of the field, and the x-axis lying directly on top of the line of scrimmage.
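
As a rough sketch of those two steps, the block below computes a perspective transform from four field corners and then converts a pixel in the rectangular image to yards on the axes just described. Everything specific here is an assumption for illustration: the corner coordinates, the 1200x650 output size, the row of the line of scrimmage, and the 65 yards of field shown are made-up values, and the helper names are hypothetical rather than taken from the repository.

```python
FIELD_WIDTH_YARDS = 160 / 3  # an NFL field is 53 1/3 yards wide

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def homography(src, dst):
    """Eight coefficients of the projective map sending each src corner to its dst corner."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    return solve(rows, rhs)

def warp(h, x, y):
    """Apply the projective map to one pixel of the original (trapezoidal) image."""
    w = h[6] * x + h[7] * y + 1
    return ((h[0] * x + h[1] * y + h[2]) / w, (h[3] * x + h[4] * y + h[5]) / w)

def pixel_to_field(col, row, img_width, img_height, los_row, yards_shown):
    """Convert a pixel in the rectangular image to (x, y) yards from the axes origin."""
    x = (col - img_width / 2) * FIELD_WIDTH_YARDS / img_width  # 0 at field center
    y = (los_row - row) * yards_shown / img_height             # rows grow downward
    return x, y

# Made-up trapezoid corners: top-left, top-right, bottom-right, bottom-left.
src = [(300, 100), (900, 100), (1100, 700), (100, 700)]
dst = [(0, 0), (1200, 0), (1200, 650), (0, 650)]
h = homography(src, dst)

# A pixel at the horizontal center, 150 rows above the assumed line of scrimmage.
x_yards, y_yards = pixel_to_field(600, 400, 1200, 650, los_row=550, yards_shown=65)
```

In this toy setup the pixel lands at the center of the field, 15 yards past the line of scrimmage. An image library would normally do the warp for every pixel at once; the hand-rolled solver is only there to keep the sketch self-contained.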


Once we have our cleaned pass chart, we perform color thresholding to extract all four pass types. Below, we see each pass type after performing this step. We can see in the top row of images that there are overlapping passes for completions and incompletions. In the Touchdown figure on the bottom left, the bottommost touchdown appears so small because it is covered by (1) its own touchdown trajectory path, (2) the trajectory path of the touchdown to its left, and (3) a complete pass circle. We also know from the JSON data approximately how many of each pass type are present in the chart, which we use when performing k-means to find the pass locations.
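
Color thresholding itself can be sketched in a few lines. This is a simplified illustration, not the repository's code: the reference RGB values and the tolerance are guesses (the real chart colors would need to be sampled from the images), and the "image" is a tiny hand-written grid of pixels.

```python
# Hypothetical reference colors as (R, G, B); real values would be sampled from a chart.
PASS_COLORS = {
    "complete": (0, 200, 0),
    "incomplete": (255, 255, 255),
    "touchdown": (0, 0, 255),
    "interception": (255, 0, 0),
}

def threshold(image, target, tol=40):
    """Return (col, row) for every pixel whose channels are all within tol of target."""
    return [
        (col, row)
        for row, line in enumerate(image)
        for col, pixel in enumerate(line)
        if all(abs(c - t) <= tol for c, t in zip(pixel, target))
    ]

# Toy 3x2 image: mostly dark-green turf with a few pass-colored pixels.
image = [
    [(30, 90, 40), (10, 210, 15), (250, 250, 250)],
    [(5, 190, 0), (240, 20, 10), (30, 90, 40)],
]
masks = {name: threshold(image, color) for name, color in PASS_COLORS.items()}
```

Each mask is the list of pixel coordinates for one pass type, which is exactly the kind of point set that the clustering step then groups into individual passes.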


Output Data

Once we extract all pass locations, we get a data frame in which each row is one pass location, relative to the line of scrimmage. The variable names and descriptions are as follows:

And here is an example subset of rows for Nick Foles in Super Bowl LII.

Example Usages of next-gen-scrapy

Let’s dive into a few examples from our next-gen-scrapy Shiny app, which shows the kinds of data visualization and analysis we can do with our output dataset. As a start, we plot all the pass locations in the 2018 Regular Season, and color the locations by whether they are complete or incomplete. We define complete to include completions and touchdowns, and incomplete to include incompletions and interceptions. To extend the data in the scatterplot on the left, for all locations on the field between 10 yards behind and 55 yards in front of the line of scrimmage, we use a generalized additive model to estimate completion percentages, taking into account individual and interactive effects of (x,y) coordinate field locations.
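
The app fits a generalized additive model, which is more than a short snippet can show. The sketch below uses a much cruder stand-in, raw completion rates in square bins of the field, just to illustrate how the complete/incomplete definition turns into a number per field location. The pass type labels, the tuple layout, and the toy passes are all hypothetical.

```python
from collections import defaultdict

# Completions and touchdowns count as complete; labels are hypothetical.
COMPLETE_TYPES = {"COMPLETE", "TOUCHDOWN"}

def completion_by_bin(passes, bin_size=10):
    """Empirical completion rate in each square bin of side bin_size yards."""
    totals = defaultdict(lambda: [0, 0])  # bin -> [completions, attempts]
    for x, y, pass_type in passes:
        key = (int(x // bin_size), int(y // bin_size))
        totals[key][0] += pass_type in COMPLETE_TYPES
        totals[key][1] += 1
    return {key: done / tried for key, (done, tried) in totals.items()}

# Toy passes as (x, y, type), in yards relative to the line of scrimmage.
passes = [
    (-3.0, 4.0, "COMPLETE"),
    (-8.5, 2.0, "INCOMPLETE"),
    (5.0, 12.0, "TOUCHDOWN"),
    (7.0, 18.0, "INTERCEPTION"),
    (2.0, 15.0, "COMPLETE"),
    (1.0, -2.0, "COMPLETE"),
]
rates = completion_by_bin(passes)
```

Binned rates like these get noisy wherever there are few passes, which is one reason a smooth model over (x, y) is the better tool for the actual charts.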

Just as we fit a generalized additive model to the entire league, we can fit one for any individual team or player. The next-gen-scrapy Shiny app currently does this kind of modeling for quarterback completion percentage charts, for players with more than 100 passes in a season in the dataset, and for team defense completion percentage allowed charts, for all teams.

Quarterback Completion Percentages

First, let’s look at how we can predict completion percentages for quarterbacks across the field using the same type of generalized additive models as above. For this example, we’ll look at NFC championship contender and veteran, Drew Brees. He’s consistently thrown short for years with New Orleans, with Sean Payton’s offense heavily taking elements from a West Coast scheme. According to Next Gen Stats’ passing statistics for the 2018 regular season, Brees’ average intended air yards is 7.1 yards. His air yards differential of just -1.2 yards – the lowest of any quarterback this season – further attests to his incredible efficiency at completing these short passes.


Next Gen Stats gives Drew Brees’ pass charts for 12 games in the 2018 regular season, so we can also further support these claims using next-gen-scrapy’s data! From the scatterplot on the left, we see that in the regular season, Brees’ five longest throws past the 30 yard line are incompletions, while almost all passes before the 10 yard line are completions. In the middle, we see a clear divide in completion percentage between the 20 and 25 yard lines across the width of the field. On the right, the green vs. purple areas show that Brees has a higher than average completion percentage essentially everywhere before the 30 yard line, and is lower than average everywhere past that.

Furthermore, from the data we see that a whopping 46 out of 54 (85.2%) of Brees’ passes thrown behind the line of scrimmage are completions. Of his 30 touchdown passes thrown in the regular season, seven were thrown behind the line of scrimmage, the median distance past the line of scrimmage is 9 yards, and 70% of those touchdown passes were thrown less than 15 yards.


On the other side of the spectrum from Brees, we have rookie quarterback Josh Rosen, with the lowest passer rating of the regular season. For the 9 games provided by Next Gen Stats, Rosen’s pass location scatterplot shows very few completions past the 20-yard line, and his completion percentage versus league average chart is almost completely purple past the line of scrimmage. Looking at our data, Rosen’s completion percentage on passes more than 10 yards past the line of scrimmage is 36.6%, and that number drops to 20.1% past the 15-yard line.
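
Summary numbers like the ones quoted for Brees and Rosen are simple filters over the output data. Here is a small sketch of how they might be computed; the toy passes and the type labels are invented for illustration and are not either quarterback's data.

```python
from statistics import median

COMPLETE_TYPES = ("COMPLETE", "TOUCHDOWN")  # hypothetical labels

def completion_pct(passes, keep):
    """Completion percentage among passes whose depth y satisfies keep(y)."""
    kept = [p for p in passes if keep(p[0])]
    done = sum(pass_type in COMPLETE_TYPES for _, pass_type in kept)
    return 100 * done / len(kept)

# Toy passes as (y, type), with y in yards past the line of scrimmage.
passes = [
    (-2, "COMPLETE"), (-1, "COMPLETE"), (3, "COMPLETE"), (6, "TOUCHDOWN"),
    (8, "INCOMPLETE"), (12, "COMPLETE"), (14, "TOUCHDOWN"), (18, "INCOMPLETE"),
    (22, "INTERCEPTION"), (25, "TOUCHDOWN"),
]

behind_los = completion_pct(passes, lambda y: y < 0)    # passes behind the line
past_ten = completion_pct(passes, lambda y: y > 10)     # passes deeper than 10 yards
td_depths = [y for y, pass_type in passes if pass_type == "TOUCHDOWN"]
median_td_depth = median(td_depths)
share_td_under_15 = sum(y < 15 for y in td_depths) / len(td_depths)
```

Passing the depth filter in as a function keeps the same helper usable for "behind the line of scrimmage", "past the 15-yard line", or any other cut of the field.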

Team Defense Completion Percentage Allowed

While all-green charts are better for quarterback completion percentage vs. league average, the opposite is true for team defense charts – all-purple charts mean that completion percentage allowed is less than league average. Let’s take a look at Tampa Bay, a team whose defensive roster has been obliterated by injuries this entire season, on top of the firing of Mike Smith after just five games. The Bucs ended the season ranked second-worst in the league for both average points per game allowed (29) and yards allowed per play (6.1), and sixth-worst in yards allowed per game (383.4). But looking at the completion percentage vs. league average chart, it’s interesting to see such a large purple patch where performance is better than league average. For the 14 games in our data, completion percentage allowed across the field before the 20-yard line is 77.2%, but that number drops all the way to 51.2% after the 20-yard line.


All of the source code of next-gen-scrapy and the final output dataset in this article are available on GitHub here. In the future, the dataset will be updated every Tuesday, as Next Gen Stats releases pass charts weekly for games played over the weekend and Monday. While this current release only processes pass charts, upcoming work will handle route charts, to extract and model wide receiver route locations. Wow!!

Most importantly, huge thank you to Sam Ventura, Kostas Pelechrinis, and Ron Yurko for all your help and guidance with the code and data analysis for this project!
