Mar 11, 2018

The Mind of the Selection Committee

As readers of my blog already know, a proven method of spotting upsets in a tournament is exploiting the knowledge gap between the Selection Committee and the data scientists. In my introduction articles for the last two seasons (links: 2016 and 2017), I took a simplistic ex post facto approach to understanding the Selection Committee's seeding principles. In short, it seemed as if the Selection Committee put an added value on conference affiliation, where teams from conferences with a better conference-RPI received seeds higher than their individual resumes would suggest they deserved. In effect, these teams were over-seeded while teams from less-valued conferences were under-seeded, and this type of match-up usually favors the under-seeded team. In the 2016 tournament, the B12 and P12 were beneficiaries of this process, yet in the tournament, teams from these conferences went 9-7 and 4-7, respectively (combined 13-14, and 7 of the B12's 9 wins can be attributed to OU's F4 run and KU's E8 run). In the 2017 tournament, the ACC was a huge beneficiary of this process, yet in the tournament, the ACC went 11-8 overall with only one team making it to the S16 (6 of the ACC's 11 total wins were UNC's title run). Now that I have seen this process in action for the last two years, I know what to look for in the 2018 tournament... unfortunately, it may not happen this way. Why? The Selection Committee has a brand new toy for the 2018 tournament called the Quad System. How they implement it will affect our ability to identify potential upsets, and that is the focus of this article. (NOTE: Some sections may be long-winded due to the level of detail, so if I feel a section needs condensing, I will provide a summary labeled "TL/DR", which means Too Long/Didn't Read.)



The Nitty Gritty of the Selection Process

The obvious starting point is to understand the entire selection process. The 3-step process and the rigid stipulations for each step are detailed in a roughly 5-page document (Link: Here), and this document is my source for all information in this section. I have summarized the steps in the numbered bullet points below and provided my opinions on them in the succeeding paragraph.
  1. Selecting the Field - Picking the 68 teams to fill the 68 available spots in the field. (This is the step of the three with which we are least concerned.)
  2. Seeding the Teams - Ranking all 68 selected teams in order from 1-68.
    1. Committee members submit a "List X best teams" ballot. The eight teams receiving the most votes across all lists go onto the "next seed line ballot". The teams on this ballot are ranked 1 thru 8 at each member's discretion, with 1 being best and 8 being worst. The four teams with the overall lowest ranking totals are moved to the current seed line, and the remaining four go onto the next seed line ballot. Repeat until all 68 teams have been seeded.
    2. After all 68 teams have been placed onto the true seed list, committee members can vote by simple majority to move a team along the true seed list. This is known as scrubbing the bracket, and it permits the Committee to "affirm true seed accuracy" throughout the duration of championship week (the week of conference tournaments before Selection Sunday). At no point in the selecting or seeding steps can a committee member participate in a vote involving an institution in which the member or a close family member has a working interest, in order to avoid conflicts of interest.
  3. Bracketing the Field - Putting together the seeds, pods, and regions that form the bracket.
    1. 1-seeds are bracketed first, and regions must be organized so that the top-ranked 1-seed's regional winner will face the fourth-ranked 1-seed's regional winner in the Final 4 and the second-ranked 1-seed's regional winner will face the third-ranked 1-seed's regional winner in the Final 4.
    2. 2-seeds are bracketed next, followed by 3-seeds, then 4-seeds, with each team being bracketed according to its order on the true seed list. Typically, these teams are bracketed into their region of locational interest, but this can be relaxed to avoid one region having the strongest team on each seed line. After all sixteen of these teams have been bracketed, the committee can make adjustments to balance the weight of the four regions (the sum of the true seed values -- 1 thru 16 -- of the four teams in a region must be within five of the sum for every other region). No two of the top four teams from the same conference can be assigned to the same region if both receive a 1-, 2-, 3- or 4-seed.
    3. Once the top four seed lines have been bracketed, each team is assigned in order of true seed value to a first-/second-round pod site. The highest seed of each pod (1-seed, 2-seed, 3-seed and 4-seed) shall not be placed in a pod that would put them in a locational or home-crowd disadvantage.
    4. At this point, the 5- thru 16-seeds are bracketed according to a number of stipulations (some have already been described). Teams from the same conference cannot meet until an E8 game if they have played three or more games against one another prior to the NCAA tournament, cannot meet until a S16 game if they have played exactly two games against one another prior to the NCAA tournament, and cannot meet until a R32 game if they have played exactly one game against one another. A team cannot play at any site at which it has played three or more games prior to the NCAA tournament, and host teams cannot play at their host site.
    5. Teams on each seed line should be as equal as possible, and teams can be moved up or down one seed line to meet any of these locational, conference, or other game-day restrictions; in extreme cases, a team can be moved up or down two seed lines.
TL/DR: I listed these selection, seeding and bracketing rules to make a couple of points. First, the voting processes in the first two steps and the bracketing principles and restrictions in the third step are set up in a way to filter out any potential individual bias in the process. However, all of the committee members are provided standardized data sets (Nitty Gritty Reports and NCAA Team Sheets) to guide their decision-making. All ten members can have their own perspective of each and every team in the field and under consideration, but these standardized data sets pull their perspectives back toward a common viewpoint. Second, in a strange and uncomplicated fashion, two processes are put into place that allow bias to re-enter the equation: the scrubbing process in Step 2 and the full participation of the Committee in Step 3. In the scrubbing process, the committee members by simple majority (5 out of 10*) can move a team up or down along the true seed line, even though the team's initial position along the true seed line was determined by the combined opinion of all vote-eligible members (a member can't vote in a process involving their own team). The bracketing process (Step 3) makes no mention of ineligibility like the first two steps, so I assume a member can bracket their own team as well as their team's probable opponents. As someone who understands the game of basketball and knows my favorite team inside and out, it wouldn't be hard for me (even as one of ten votes) to steer a potential Cinderella or match-up disadvantage away from my team's pod and/or region. In short, as many have said before me, "It is an imperfect process, and imperfect processes can lead to imperfect results."
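Two of the bracketing restrictions above are mechanical enough to sketch in code. This is a minimal illustration, not the Committee's software: the rule thresholds come from the document summarized above, while the function names, team seed values, and region assignments are made up.

```python
def regions_balanced(regions):
    """Check the region-weight rule: the sum of true seed values (1-16)
    of the top four teams in each region must be within five of the
    sum for every other region."""
    sums = [sum(seeds) for seeds in regions.values()]
    return max(sums) - min(sums) <= 5

def earliest_allowed_round(prior_meetings):
    """Earliest round two conference foes may meet, based on how many
    games they played against one another before the tournament."""
    if prior_meetings >= 3:
        return "E8"   # three or more meetings: no rematch before the E8
    if prior_meetings == 2:
        return "S16"  # exactly two meetings: S16 at the earliest
    if prior_meetings == 1:
        return "R32"  # exactly one meeting: R32 at the earliest
    return "R64"      # never met: any round is allowed

# Hypothetical top-four true seed values for each region
regions = {"East": [1, 8, 10, 15], "West": [2, 7, 11, 14],
           "South": [3, 6, 12, 13], "Midwest": [4, 5, 9, 16]}
print(regions_balanced(regions))   # all four sums are 34, so balanced
print(earliest_allowed_round(2))   # S16
```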

My Process

What I want to accomplish in this article is to produce a predictive blue-print. I want to produce a true seed list (or multiple ones) that predicts what the tournament field would look like if the Committee favors misinformed or misguided tools. As I implied in the opening paragraph, conference affiliation has nothing to do with team quality. If a team is the 25th-best team in the country, that team should be the highest 7-seed in the bracket (6x4=24, so 25 is the first spot on the next seed line). When the 33rd-best team (9-seed quality) in the country gets a 7-seed in the tournament simply because of its conference affiliation, the seeding process is either misguided or misinformed. Thus, I want to produce a blue-print of what the field could look like if a misguided or misinformed tool is applied. Then, when the actual bracket is revealed on Selection Sunday, I will be able to compare it to the blue-print to see what the Selection Committee was favoring, and as a result, we will know where the knowledge gaps exist and where to look for potential upsets.
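The rank-to-seed-line arithmetic above is simple enough to express directly. This is just a sketch of the ideal mapping; `seed_line` is my own helper name, not a Committee term.

```python
def seed_line(true_seed_rank):
    """Map a team's true seed rank (1-based) to its ideal seed line:
    ranks 1-4 are 1-seeds, ranks 5-8 are 2-seeds, and so on."""
    return (true_seed_rank - 1) // 4 + 1

print(seed_line(25))  # 7 -- the 25th-best team is the top 7-seed (6x4=24)
print(seed_line(33))  # 9 -- a 33rd-best team belongs on the 9-seed line
```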

Your first question at this point should be: How do you hope to accomplish this feat? The simple answer is that I'm going to take a shortcut. Since the Committee uses the same standardized data sets to select and seed the field, I'm going to identify a pattern in the data sets and use this pattern to select and seed the field myself. Furthermore, the Selection Committee released a mock bracket on Feb 11 in which they selected the top 4 seeds in each region (the Top 16 teams). I simply find the pattern in the mock bracket and apply it to the current data sets. So, let's have some fun.

Methodology

To be perfectly honest, I've never done anything like this before. My specialty is predicting the bracket after it is released, not before, but as I stated in the previous section, I believe there is good value in knowing the mind (more specifically, the values) of the Selection Committee. Nonetheless, the only course of action I could take was trial and error. I knew the desired target was the true seed line, and the mock bracket showed the Committee's appraisal and order of the Top 16 teams. I just had to find a means to get to that end.

The first trial balloon was correlation analysis. I simply calculated the correlation coefficient of a particular attribute against the true seed line and did this for each possible attribute (RPI, SOS, Road Wins, Quad 1 Wins, etc.). I knew right away the results were unusable. Some attributes were producing really strong correlations (RPI) whereas others were producing weak correlations, like SOS and Non-Conf SOS (NCSOS). In fact, Q1 Wins and Road Wins were producing negative correlations, which is expected since teams ranked 1-4 (low values) should have more Q1 and Road wins (high values) and teams ranked 13-16 (high values) should have fewer Q1 and Road wins (low values). I couldn't see how I was going to combine a broad array of correlations (strong positive, weak positive, strong negative, and weak negative) into a meaningful formula that would produce a predictive true seed line. Thus, I scrapped this approach.
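For illustration, here is what that first pass looked like in spirit. The numbers below are randomly generated stand-ins for the real team data, chosen only to reproduce the sign pattern described above.

```python
import numpy as np

# True seed values for the 16 mock-bracket teams (1 = best, 16 = worst)
true_seed = np.arange(1, 17)

rng = np.random.default_rng(0)
# RPI rank tracks the true seed line closely (small random wobble)
rpi_rank = true_seed + rng.integers(-2, 3, size=16)
# Better (lower-numbered) teams pile up more Q1 wins, hence the inversion
q1_wins = 10 - true_seed // 2 + rng.integers(-1, 2, size=16)

print(np.corrcoef(rpi_rank, true_seed)[0, 1])  # strongly positive
print(np.corrcoef(q1_wins, true_seed)[0, 1])   # negative, as described above
```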

My next attempt was multiple regression. Intuitively, I thought this approach would fail as well, and for reasons similar to correlation analysis. First, the relationship between the team data sets and the true seed line values is not exactly linear, which is what multiple regression aims to quantify. Second, the individual attributes are not exactly independent variables, which multiple regression analysis assumes them to be. Yet, if this means could get me a usable and predictive approach, I was willing to relax those assumptions. After the initial run, I knew I was already miles ahead of correlation analysis, but the results were still messy. I thought the mess was due to differences in data types. Simply put, RPI and SOS are rank-order data (just like the desired true seed line) whereas win and loss counts are ratio-scale data. My first solution was to convert everything into rank-order data and let the multiple regression equation work everything out. This didn't work, so with all of the data in this format, I tried a simple average-rank equation. This approach produced error residuals higher than I wanted, and even worse, I wasn't sure the method could be extrapolated to the full field. (Keep in mind, I am only looking at the teams deemed Top 16 in the mock bracket. I do not know the true seed line values for teams outside the Top 16, as UK, WVU, URI, GONZ, and WICH were at the time. These teams also had good resumes, but the committee ranked their resumes lower than 16th without giving a true seed value for them. Thus, I need a means that works for teams outside the Top 16 just as well as it does for the Top 16.) I even tried converting the data into winning percentages (for W/L data) and specific ratings values (for RPI and SOS) so that I was looking at all ratio-scale data, but this did not bear fruit either.

TL/DR: Since the end result was going to be rank-order data (the true seed line), I determined that only one input should be rank-order data and that the rest of the inputs should be ratio-scale data that modifies the single rank-order input toward the intended result. The attribute with rank-order quality that best approximates the intended result is RPI. The remaining task was to determine which ratio-scale data should be used and which should be discarded, and I settled on Road Wins, Q1+Q2 Wins, Q1 Wins, Conference Wins, and Q3+Q4 Losses (shown in the table below).
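The model form in the TL/DR -- one rank-order input (RPI) plus ratio-scale modifiers, fit against the mock bracket's order -- can be sketched with an ordinary least-squares fit. Every attribute value below is invented for illustration; the real inputs would be the Nitty Gritty / Team Sheet numbers for the Top 16.

```python
import numpy as np

# Hypothetical attributes for 16 teams, listed in true seed order
rpi_rank  = np.array([1, 3, 2, 5, 4, 7, 6, 9, 8, 12, 10, 14, 11, 15, 13, 16], float)
road_wins = np.array([8, 7, 9, 6, 7, 5, 6, 4, 5, 3, 4, 2, 4, 2, 3, 1], float)
q12_wins  = np.array([14, 13, 13, 12, 12, 11, 11, 10, 10, 9, 9, 8, 8, 7, 7, 6], float)
q1_wins   = np.array([9, 8, 8, 7, 7, 6, 6, 5, 5, 4, 4, 3, 4, 2, 3, 2], float)
conf_wins = np.array([13, 12, 12, 11, 11, 10, 10, 9, 9, 8, 9, 7, 8, 6, 7, 6], float)
q34_loss  = np.array([0, 0, 1, 0, 1, 1, 2, 1, 2, 2, 3, 3, 2, 4, 3, 4], float)
true_seed = np.arange(1.0, 17.0)  # the target: the mock bracket's order

# Design matrix: intercept, the single rank-order input, then the modifiers
X = np.column_stack([np.ones(16), rpi_rank, road_wins,
                     q12_wins, q1_wins, conf_wins, q34_loss])
coef, *_ = np.linalg.lstsq(X, true_seed, rcond=None)

pred = X @ coef
rss = np.sum((true_seed - pred) ** 2)
r2 = 1 - rss / np.sum((true_seed - true_seed.mean()) ** 2)
print(coef.round(3), round(rss, 3), round(r2, 3))
```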

Before moving to the multiple regression results, I should explain two things. First, I used four dates' worth of data, Feb 7 thru Feb 10, because I wasn't sure which day's data set the Selection Committee used in making the mock bracket. From the media interviews conducted that day, I was certain that they did not use Sat Feb 10's data, but instead used the scrubbing process to account for game results from Sat Feb 10. Second, I used both the Q1+Q2 Wins and Q1 Wins attributes (which double-counts Q1 wins) because trial and error with Q1 Wins by itself, and with Q1 Wins and Q2 Wins as independent inputs, produced high levels of error. Using both in conjunction with Conf Ws and Road Ws brought the error residuals lower and gave a larger significance to the committee's new appraisal tool -- the Quad System -- in the results.

Results

Here are the results of calculating the multiple regression equations.

My summary of the results:
  • Negative coefficients mean a positive relationship to the true seed line value. Since these values are being subtracted from the RPI component, they result in a lower true seed line value, which means a better seed. This is expected for Road Ws and Q1 Ws, but for Conf Ws it is surprising.
  • Positive coefficients mean a negative relationship to the true seed line value. This is expected for Q3+Q4 Losses because bad losses should result in a lower seed. It is surprising to see a positive coefficient for Q1+Q2 Ws.
  • If I had to give a logical reason for Conf Ws being negative and Q1+Q2 Ws being positive, the best explanation I have is the lack of independence between the inputs. Keep in mind, multiple regression typically assumes independence among the explanatory variables. In these team attributes, there can be overlap among the categories. For example, an OKST win at KU can be a Road W, a Conf W, and a Q1 W all in one. Thus, it would make sense if the two coefficients are flip-flopped to account for the lack of independence. (NOTE: After the fact, I altered the regression formula so that "Conf Ws" gets the coefficient in the "Q1+Q2 W" column and "Q1+Q2 Ws" gets the coefficient in the "Conf Ws" column.)
  • Finally, the 2/10 regression is the only one whose coefficient is negative for Q1+Q2. This is most likely the subtle proof that 2/10's data wasn't used in formulating the mock bracket. Not only that, the coefficients for conference wins and Q3+Q4 Losses are well out of the ranges of the other three days.
  • For those unfamiliar with statistics and multiple regression, RSS and R^2 are descriptive values. RSS is the Residual Sum of Squares: each individual error residual is squared, and the squares are summed. Thus, the smaller the RSS value, the better. R^2 is the coefficient of determination, the proportion of variance in the desired variable that is explained by the input variables. This value ranges from 0 (no explanation) to 1 (perfect explanation), so the closer to 1, the better. From these descriptive values, the regression coefficients for Feb 9 are the best estimators of the true seed line. In the application section, I will be working with all results from Feb 7, Feb 8, and Feb 9.
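For the curious, both descriptive values are easy to compute from the residuals. The actual and predicted values below are invented just to show the mechanics.

```python
import numpy as np

actual    = np.array([1, 2, 3, 4, 5, 6, 7, 8], float)              # true seed values
predicted = np.array([1.2, 1.8, 3.5, 3.9, 5.4, 6.1, 6.6, 8.3])     # regression estimates

residuals = actual - predicted
rss = np.sum(residuals ** 2)                 # smaller is better
tss = np.sum((actual - actual.mean()) ** 2)  # total variation in the target
r2  = 1 - rss / tss                          # closer to 1 is better
print(round(rss, 2), round(r2, 4))           # 0.76 and roughly 0.982
```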
The final step in the results is to produce a blue-print by using our multiple regression equations to estimate the true seed line of the 2018 tournament. I used team attribute data for the dates 3-6, 3-7, 3-8, and 3-9. I seriously doubt the Committee started deliberations on March 10, which is the only way they could have used 3-9 data. 3-6 also seems way too early, as very few (if any, if memory serves me correctly) tournament-eligible teams had played a game in their conference tournaments by that date. Finally, I took only the Top 75 RPI teams, and teams had to be in the Top 75 on all four days to be included (this eliminated four teams).
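Mechanically, this blueprint step is just plugging the March attributes into a February regression equation and rank-ordering the raw estimates. The coefficients and team rows below are placeholders, not the fitted values from the actual regressions.

```python
import numpy as np

# Placeholder regression coefficients, in the column order
# [intercept, RPI rank, Road Ws, Q1+Q2 Ws, Q1 Ws, Conf Ws, Q3+Q4 Ls]
coef = np.array([0.5, 0.9, -0.3, 0.2, -0.4, -0.1, 0.6])

# Four hypothetical teams' March attributes (leading 1 is the intercept term)
march_data = np.array([
    [1, 2, 8, 13, 8, 12, 1],
    [1, 1, 7, 14, 9, 13, 0],
    [1, 4, 6, 11, 6, 10, 2],
    [1, 3, 7, 12, 7, 11, 1],
], float)

raw_estimates = march_data @ coef
# Double argsort converts raw values into 1-based ranks:
# the rank-ordered adjustment that yields the adjusted true seed values
adjusted_seed = raw_estimates.argsort().argsort() + 1
print(adjusted_seed)  # [2 1 4 3] -- the second team grades out best
```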

In the chart below, the yellow/white columns represent the calculated values of the true seed line, applying the regression equations for each date -- Feb 7, Feb 8 and Feb 9 -- to each of the four current dates of data -- 3-6, 3-7, 3-8, and 3-9. The blue/white columns represent the rank-ordered adjustment to the regression-calculated values.

In my honest opinion, the best two choices for a blue-print would be "2-9 applied to 3-7" and "2-9 applied to 3-8". In the image below, I have re-ordered these two blueprints according to the adjusted true seed value.

Application and Interpretations

In layman's terms, this blue-print is fuzzy. If I had to describe it in technical terms, I would say it is RPI-based with leanings towards Q1, which is pretty much what my methodology should have produced. What is this blue-print telling us?
  1. Foundational: This model was built using only the Top 16 seeds in February in hopes that it would apply to the whole field in March. With the exception of one glaring misfire (noted below), the model looks pretty accurate for 15 of the Top 16 on Selection Sunday, give or take one seed line (Examples include UNC/XAV, AUB/TENN/PUR, UK/WVU/TXTC/GONZ).
  2. Observational: Mis-seeds will happen, not just with this model but likely with the actual bracket too. I seriously doubt WICH is in contention for a 3-seed, and MIST is not in purgatory with a 5-seed. Likewise, this model assumes LOU and MTSU are safely in at the 9- to 10-seed range, and I doubt this result too. The real question is the magnitude of the mis-seeds. In 2014, seven teams were under-seeded by at least three lines, and six of these won at least one game (the seventh lost to one of those six). If the Quad System produces this quantity of large-magnitude mis-seeds, this is where we will find our upsets.
  3. Empirical: Though I tried to construct a system to replicate the selection committee's process, maybe the bias is in the data itself. The results seem to highly favor the SEC, which also happens to be the conference with the most Q1+Q2 wins. If you have followed my blog, you will know that I thought the SEC was the most balanced conference, not necessarily the best, meaning on any given night any team can win. With plenty of teams ranked high enough in the RPI, every night is a chance at a Q1 or Q2 win. Keep in mind, over the last two years, the process of weighting conference affiliation resulted in upsets of members of the favored conferences. If the Quad System shows this partiality, look for early-round upsets of teams from the favored conference. The next most favored conference seems to be the B12, followed by the ACC.
Well, the only thing left to do now is wait and see. As always, thanks for reading my work, and be on the look-out for Final Edition of the 2018 QC Analysis in the coming days.

4 comments:

  1. Very interesting analysis. Really shows how a group of committee members will each process quantitative evidence in their own way and negotiate a common “qualitative” solution. Question is, does that human processing step improve the quality of the seeding, or just add noise?

    1. Most likely, it is changing one set of noise into another. For example, RPI inflates quality of wins whereas advanced metrics do not. In RPI, a 26-4 team in the Sun Belt can be graded better than a 22-8 team in a power conference because RPI only sees 26W/30G, not knowing that 16 of those 26 are not wins over power-conference-quality teams. Changing from pure RPI rankings to the Quad System breakdown is like pouring a 1-gallon jug of milk into four 1-liter bottles.

  2. Thank you so much for this monumental amount of work. Basketball nerds unite!

    1. It is something I enjoy doing, and sharing it with others who have the same passion makes it that much more fun. As always, thanks for reading and I'm glad to hear from a fellow CBB nerd.
