Jan 4, 2017

2017 Quality Curve - January Edition

Custom has it that you always wish your readers a Happy New Year in your first article of a new year. I'm not going to do that since our New Year begins in November. Instead, I am going to dive into what could be a PPB tradition for the beginning of January: the first Quality Curve analysis of the current season. It seems fitting that we have enough information by this point in the season to examine its current state and what it may hold in store for March.

However, the 2016-17 season presents a brand new caveat: the methodology behind the data set (KenPom Ratings) used to build the Quality Curve has changed. Instead of blindly fitting the current curve to historical curves and inferring tournament results from the year of best fit, we first need to determine whether the old way of doing things still works with the new data set.



REVIEWING THE CHANGES

If you are interested in the details of the changes, Ken Pomeroy summarized those changes here. If you want my summary of Ken's summary, here's what I have been able to gather.
  • Win Probabilities vs Score Probabilities
    • The old ratings system calculated a win probability for Team A against the average Division I team. For example, pre-tourney 2016 KU had a Pythag rating of .9503, meaning they would be expected to beat the average Division I opponent about 95% of the time. Another method, which Ken termed the Log5 method, lets you compare the Pythag ratings of two teams in a game and calculate each team's win probability for that game.
    • The new rating system calculates a margin of victory for Team A against the average Division I team. For example, post-tourney 2016 KU had a rating of 29.64, meaning they would, on average, beat a typical Division I opponent by almost 30 points. To compare two teams in a game, you simply subtract one team's rating from the other, adjust that value with a home-court advantage constant, and you have the expected point differential (positive for the favorite, negative for the underdog). Both comparison methods are sketched in code after this list.
  • Indeterminate Scale vs Linear Scale
    • The difference between two teams' old ratings meant absolutely nothing without the Log5 method. Team A and Team B with ratings of .9500 and .9400, respectively, did not mean the same thing as Team C and Team D with ratings of .5700 and .5600. You might think a .0100 differential between two teams means the same thing no matter where it appears on the scale, but that is not the case with Pythag values.
    • The new rating scale is linear, meaning a differential at one end of the scale means exactly the same thing at any other location on the scale.
  • Different Reference Point
    • The old ratings system used a reference point of 0.5000. Since it calculates a win percentage against the average Division I opponent on a scale from 0 to 1, we would expect the average Division I team to have a rating of .5000. When you average the ratings of each and every team in Division I under the old system, you typically approach the 0.5000 value, which is to be expected (sometimes a bit more or less, which I think could be due to rounding or some other compensating factor like home-court advantage being baked into the ratings).
    • The new ratings system uses a reference point of 0. Since it calculates an expected score differential against the average Division I opponent, the scale has no guaranteed maximum or minimum value, but the average (and therefore the sum) of all of the ratings (again, each and every team in Division I) should approach 0, much the same way the average of the old ratings approached 0.5000.
  • Other changes were made involving strength of schedule, the recency and importance of each game, and the value of the home-court advantage constant, but Ken described those changes as insignificant, even though he did elaborate on how they changed.
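To make the old-versus-new comparison concrete, here is a minimal sketch (in Python, and emphatically not Ken's actual code) of how each system turns two ratings into a head-to-head expectation. The Log5 formula is the standard one; the home-court constant is an illustrative placeholder rather than Ken's published value.

```python
# A minimal sketch (not Ken's actual code) of how the old and new systems
# compare two teams, using the example numbers from the list above.

def log5_win_prob(pythag_a, pythag_b):
    """Old system: Log5 probability that Team A beats Team B, where each
    Pythag value is a win probability against the average D-I team."""
    return (pythag_a - pythag_a * pythag_b) / (pythag_a + pythag_b - 2 * pythag_a * pythag_b)

def expected_margin(rating_a, rating_b, a_is_home=False, hca=3.5):
    """New system: expected point differential for Team A against Team B.
    hca is an illustrative home-court constant, not Ken's published number."""
    return rating_a - rating_b + (hca if a_is_home else 0.0)

# Old scale: the same .0100 gap translates to different win probabilities
# depending on where it sits on the scale (Team A/B vs. Team C/D above).
print(log5_win_prob(0.9500, 0.9400))   # ~0.548
print(log5_win_prob(0.5700, 0.5600))   # ~0.510

# New scale: a gap in ratings is simply a gap in expected points,
# no matter where the two teams sit on the scale.
print(expected_margin(29.64, 20.00))   # ~9.64 on a neutral floor
```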
BACK-TESTING THE NEW DATA

By calculating each team's rating relative to the average Division I team, you produce a rating system that acts like a zero-sum game. For my numbers to improve, your numbers have to suffer, but the size of the pie never changes. Essentially, a ratings system built this way should illustrate the amount of parity in the game. If parity exists, ratings at the very top should be weaker than in a typical year, ratings at the very bottom should be stronger than in a typical year, and the tournament should be rather insane. If parity does not exist, ratings at the very top should be stronger than in a typical year, ratings at the very bottom should be weaker, and the tournament should be rather chalky. So how does this year's parity compare with that of previous years?

Unfortunately, I do not have the pre-tourney ratings under the new system, which would be ideal since we make our bracket predictions each year based on pre-tourney ratings. However, Ken Pomeroy makes his post-tourney ratings available for everyone to see, and this is an adequate, although not perfect, substitute. Thus, it is very important that I note this key detail: all comparisons of this year's ratings will be made against post-tourney ratings, not pre-tourney ratings. There is significant movement in the ratings between pre-tourney and post-tourney, especially for teams that make deep runs in the NCAA, NIT and (since 2008) the CBI. Here is a chart comparing this year (in WHITE) to all years back to 2002. Another important note: data for this year covers all games up to December 26, 2016, because major conferences began playing significant quantities of conference games on December 27, and I did not want to include conference games in these ratings (read this article to see why I chose to avoid them).
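A quick aside for anyone who wants to recreate the chart: here is a rough sketch of how I think of each curve. Sort a season's ratings from best to worst, keep the Top 50, and plot rating against rank. The file name and column names below are hypothetical placeholders, not KenPom's actual export format.

```python
# Rough sketch of building and plotting Quality Curves from season ratings.
# "kenpom_ratings_by_season.csv", "Season", and "AdjEM" are hypothetical
# placeholders for whatever form your own ratings data takes.
import pandas as pd
import matplotlib.pyplot as plt

def quality_curve(ratings, top_n=50):
    """Return the top_n ratings sorted from Team #1 down to Team #top_n."""
    return sorted(ratings, reverse=True)[:top_n]

seasons = pd.read_csv("kenpom_ratings_by_season.csv")
for year, group in seasons.groupby("Season"):
    curve = quality_curve(group["AdjEM"].tolist())
    plt.plot(range(1, len(curve) + 1), curve, label=str(year))

plt.xlabel("Team rank")
plt.ylabel("Rating (points vs. average D-I team)")
plt.legend(fontsize="small")
plt.show()
```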


Yes, this chart is a hot mess, but I will try to make some sense out of it (if you can't see the whole chart, let me know in the comment section). In a previous article, I grouped tournaments based on the similarities between their pre-tournament quality curve and their post-tournament results. I tried to color-code this chart based on those groupings, where sane years (2002, 2003, 2007, 2008, 2009) received blue or greenish colors and insane years (2006, 2010, 2011, 2014) received reddish and orangish colors. Moderate years I tried to keep yellowish and bright greenish. While it is hard to match each exact line with its reference in the legend, I'll try to do that with text (what a cheap plot to keep you reading my work!).
  • The 2017 line (WHITE) grades out as the third-weakest #1 team, but while other years quickly decline from their top team, 2017 declines much, much more slowly. In fact, it declines so much more slowly that 2017 has the 2nd-strongest #5, #6, and #8, the strongest #10 (tied with 2013), the third-best #11, the strongest #12 and #13, and the fourth-strongest #14 and #15. In other words, 2017 has some of the strongest potential 2-, 3- and 4-seeds (assuming the committee seeds all the teams this way, and they never do). From this point in the ratings, 2017 has below-average #18-#23 teams, average #24-#41 teams, and below-average #42-#50 teams.
  • For your own reference, the high-heading teal-blue line is 2015 and the bottom-left orange line is 2006. The upward-bowing (in the middle part) red line is 2014. The high-tailed dark blue line is 2007, and the high-tailed orange is 2010. The low-tailed yellow line is 2005.
As you can see, I worry about how the new system translates into prediction-by-comparative-analysis. The chart doesn't really meet my expectations. Chalk years should start higher on the left and end lower on the right, whereas insane years should start lower on the left and end higher on the right. Instead, the chalkiest years (in blue) start midway on the left, and one ends among the highest two on the right. Likewise, the craziest years (in red) also seem to be in the middle on the left side as well as in the middle on the right side. Then, slightly less chalky years (in green) start highest of all, with one among the bottom two on the left. Not to mention, slightly less insane years (in orange) show up among the bottom on the left side, whereas a moderate year (in yellow) shows up at the bottom on the right side. Simply put, this new ratings system puts chalk years in the same area of the chart as insane years, and not where one would theoretically expect them to appear.

If we want to do a prediction-by-comparative-analysis, we might have to look at the raw data, and that is what I have in the table below.

Let me give a quick summary of what you see. This is the new rating for each team at that particular rank at the end of each given season (2002-2016). Above each year are three values (CovarT50, CorrelT50, and Stdev) that relate that particular year to the current year. The CovarT50 value is the 1-for-1 covariance between all teams in the Top 50 of that particular year and their counterparts in the current year (2017). The CorrelT50 value is the correlation coefficient for the same. I'm not exactly sure what those data points are conveying, because their calculation involves each year's statistical mean, and we already know that particular component may be unreliable since the ratings system approximates a zero-sum game. This is something I will probably test over the next month and report back on in the February QC Analysis. (I cannot stress this enough: we need to make sure our old comparative methods still work with the new ratings system.)
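For anyone who wants to check my arithmetic, here is a minimal sketch of how I compute CovarT50 and CorrelT50, assuming two plain lists of 50 ratings already sorted from Team #1 down to Team #50. The population form of the covariance is used here; a sample version would differ only slightly.

```python
# Minimal sketch of the CovarT50 and CorrelT50 values: rank-for-rank
# covariance and correlation between a past year's Top 50 and the 2017 Top 50.
# past_top50 and current_top50 are hypothetical lists of 50 ratings each,
# ordered from Team #1 down to Team #50.
import numpy as np

def covar_t50(past_top50, current_top50):
    # Population covariance of the rank-matched pairs (bias=True -> divide by N).
    return float(np.cov(past_top50, current_top50, bias=True)[0, 1])

def correl_t50(past_top50, current_top50):
    # Pearson correlation coefficient of the same rank-matched pairs.
    return float(np.corrcoef(past_top50, current_top50)[0, 1])
```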

The Stdev value is the standard deviation of all 1-for-1 differences between the given year and the current year. For each rank (Team #1, #2, etc.), I subtracted the rating of the given year (e.g., 2002) from the rating of its counterpart in 2017: Difference = 2017 #1 minus 2002 #1. I did this for each and every team in the Top 50. Then I took the population standard deviation of the 50 differences that resulted from this process. In theory, if the 2017 ratings match up with the ratings of a historical year (2002-2016), then the standard deviation should be closest to 0. From the chart, the lowest Stdev value comes from 2013, the 2nd-lowest from 2004, and the 3rd-lowest from 2012. We can also see that the highest Stdev value comes from 2015, the 2nd-highest from 2010, and the 3rd-highest from 2002. Based on what I see in the data, I would expect 2017 to match tournaments like 2013 and 2004 (2012 is a surprise), and I would expect 2017 to have nothing in common with 2015 and 2002 (2010 is a surprise). I guess I will have to wait until these numbers finalize in March (and until I get a better understanding of the new system) before drawing any significant conclusions. Until then, thank you for reading, and the February Quality Curve Analysis should be up on the first Wednesday of that month.
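And here is the corresponding sketch for the Stdev column, again assuming two hypothetical rank-ordered lists of 50 ratings. Note that np.std defaults to the population form, which is what I described above.

```python
# Sketch of the Stdev column: the population standard deviation of the
# 50 rank-for-rank differences (2017 rating minus the past year's rating).
import numpy as np

def stdev_of_differences(current_top50, past_top50):
    diffs = np.array(current_top50) - np.array(past_top50)
    return float(np.std(diffs))  # ddof=0 by default, i.e. the population form
```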

2 comments:

  1. I found this blog by chance and I am glad that I did. I enjoy reading all of these items. Thank you for putting out all of this information.

    Officiating is just awful in every sport and ruins most games, but March Madness will always be fun (hopefully).

    Again, thank you for all that you do. This is awesome.

    - HomelessSkittle

  2. You are very welcome! I am sorry for the late reply. I've been so busy with other stuff, I haven't checked this site or my email for notifications, and as you can see by the article for 1/18, I haven't had any time in the last two weeks to put anything together. But I do appreciate you contacting me, and as always, Thank You for reading.
