The NCAA Men’s Basketball Tournament is a loser machine. Sixty-four teams enter the tournament after a long season of hard work, just for 63 to leave with another loss on their schedule.
A single-elimination tournament leaves room for a lot of unpredictability, with any team capable of running red-hot or ice-cold at any time. Volatility is the name of the game (just ask Virginia).
That is why it is important to identify which strengths show through the most often in March, and which weaknesses are most easily exploited. This is the key to projecting the victor of what may be the most unpredictable tournament in the world.
Armed with the regular-season per-game statistics and seed for every college basketball team to make any of the last ten NCAA tournaments, as well as each team’s performance in those tournaments, I got to work. You can go ahead and skip past this next part if you aren’t concerned with my pencil-pusher mumbo-jumbo; the TL;DR is that this model should effectively predict March results.
First, I labeled the two participants in each game as “Team A” and “Team B”. Team A was not always the winner or the higher-seeded team; the label was assigned at random to one of the two participants. Then, I marked whether Team A or Team B won each game. To turn this data into predictions, I fit a few models of the likelihood that Team A would win the game, using a variety of methods, including logistic regression, classification trees, and random forests.
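For the curious, here is a minimal sketch of that random labeling step. It is an illustration and not the code behind this article: the function name, the (winner, loser) input format, and the placeholder team names are all assumptions. The point of the coin flip is that if Team A were always the winner, the outcome column would be the same for every game and a model would have nothing to learn.

```python
import random

def assign_teams(games, seed=0):
    """Randomly label each game's two participants as Team A and Team B.

    `games` is a list of (winner, loser) pairs (a hypothetical input format).
    Returns rows of (team_a, team_b, a_won), where a_won is 1 if Team A won.
    """
    rng = random.Random(seed)  # fixed seed so the labeling is reproducible
    rows = []
    for winner, loser in games:
        if rng.random() < 0.5:
            rows.append((winner, loser, 1))  # winner landed in the Team A slot
        else:
            rows.append((loser, winner, 0))  # loser landed in the Team A slot
    return rows

# Placeholder names, not real results.
games = [("Winner 1", "Loser 1"), ("Winner 2", "Loser 2"), ("Winner 3", "Loser 3")]
for team_a, team_b, a_won in assign_teams(games):
    print(team_a, "vs.", team_b, "-> Team A won:", a_won)
```

The `a_won` column is what a classifier such as a logistic regression would then be trained to predict.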
Unsurprisingly, the models varied greatly in their predictions. Because of this, I selected the model that most effectively predicted the results of randomly selected tournament games that were held out of the model-fitting process. My final algorithm was derived from a logistic model, which correctly predicted 135 of the 201 holdout games from the last ten tournaments.
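A rough sketch of what a holdout check like this can look like is below. The split fraction, the function names, and the toy data are assumptions made for illustration; the only numbers taken from this article are the 135 correct picks out of 201 holdout games.

```python
import random

def holdout_split(rows, holdout_fraction=0.2, seed=0):
    """Shuffle the games and set a fraction aside (the fraction is an assumption)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]  # (training, holdout)

def accuracy(predict, holdout):
    """Share of holdout games where the model's pick matches the real result.

    `predict` maps a game's features to P(Team A wins); a probability of
    0.5 or more counts as picking Team A.
    """
    correct = 0
    for features, a_won in holdout:
        picked_a = predict(features) >= 0.5
        if picked_a == bool(a_won):
            correct += 1
    return correct / len(holdout)

# Toy check: the "model" here just returns the feature as its probability.
toy_holdout = [(0.9, 1), (0.2, 0), (0.6, 0), (0.4, 1)]
print(accuracy(lambda p: p, toy_holdout))  # 2 of 4 picks are right

# The figure reported in this article: 135 of 201 holdout games.
print(round(135 / 201, 3))
```

Running each candidate model through the same `accuracy` call on the same holdout games gives one simple way to compare them side by side.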
While an accuracy rate of 67% may not sound incredibly promising, a more effective measure of model performance when predicting categorical outcomes like this is cross-entropy loss, which takes into account the confidence with which the model predicts a victor. This metric starts at 0 for perfect predictions and has no upper limit, so lower is better; for reference, calling every game a 50/50 coin flip scores about 0.69 when the natural logarithm is used. With a value below 0.5, I felt confident taking this model into this season’s tournament and seeing what kinds of predictions it makes.
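To make the metric concrete, here is a small sketch of binary cross-entropy using the natural logarithm. The function name and the example probabilities are made up for illustration; they are not outputs of the model in this article.

```python
import math

def cross_entropy(probs, outcomes, eps=1e-15):
    """Average binary cross-entropy (natural log).

    probs[i] is the predicted chance that Team A wins game i;
    outcomes[i] is 1 if Team A actually won and 0 otherwise.
    """
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # keep log() away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

# Illustrative inputs only.
coin_flip = cross_entropy([0.5, 0.5], [1, 0])        # equals ln 2, about 0.693
confident_right = cross_entropy([0.9, 0.1], [1, 0])  # small loss
confident_wrong = cross_entropy([0.9, 0.1], [0, 1])  # large loss

print(coin_flip, confident_right, confident_wrong)
```

Two models can have the same 67% accuracy and very different losses: the one that is confident when right and cautious when wrong scores lower, which is why this metric says more than accuracy alone.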
Now here’s where the real fun begins.
Thirty-two games that feature everything from David and Goliath to Blazers and Cougars. The dreaded seed #3 vs. seed #14 and the “c’mon that’s not a real upset” #7 vs. #10 games will be sure to give most bracket makers trouble, but the points you pick up in this round can be key to winning your pool. Here are some things to look for when making your picks for this round:
Wow, two dreaded #3 vs. #14 games are the most confident predictions this model has for the entire round of 64? Yes. Yale and Montana State are good teams that have a puncher’s chance at best—Purdue and Texas Tech are in a completely different stratosphere though.
Here is one of those #3 vs. #14 plays you were looking for. At this level of confidence, we think Wisconsin should win, but it’s worth poking around at the underdog and seeing what you like. Colgate lost its tournament game last year as a #14-seed, and is facing a Wisconsin team that turns the ball over less frequently than any other in the tournament. It will be no easy task, but the model is at least curious: are the Raiders the next #14-seed to advance to the Round of 32?
Another potential upset that could hit if the lower-seeded team can play its best basketball. Mind you, this should not necessarily be expected to hit, but could account for some early points that your pool-mates aren’t getting from a team that probably won’t make a deep run in the tournament anyway.
No, you will not pick the next UMBC. If you’re right, you get one extra point over all your friends. If you’re wrong, you could be eliminating a potential national champion before it even breaks a sweat. Pick the #1 and move on confidently.
Sorry, sorry. I promise the fun stuff is at the bottom of this section, with some sneaky upset picks and games where you could see some unexpectedly close final scores. But these are not those games. Pick the two seeds, even if it’s just because you’ll be losing the same number of points as your friend who goes chalk every year and always finishes top-3 in the pool.
When I took a cursory glance at the bracket, this was a game that I agonized over. How much magic does Sister Jean have left? Will E.J. Liddell and Ohio State have a new level of focus after last year’s embarrassing performance? In spite of the first-round exit as a #2-seed last season, my model is sure Ohio State will head to the Round of 32. I appreciate its conviction.
Here’s our first upset, and it’s a good one. Iowa State dominated non-conference play and LSU could be in shambles after firing Will Wade just days ago, and the model doesn’t even know that. This should confirm any suspicions you had about this game.
Now there’s a First Four team I can get behind. The Scarlet Knights have more than enough talent to take down the Crimson Tide, and have shown this season that they can beat highly regarded opponents. If you want to have the 2021 UCLA on your bracket, it should be Rutgers.
Here is another predicted upset. The fact that the model does not take recent play into account means that it’s all the more meaningful that it picks Virginia Tech over Texas, as this isn’t an overreaction to one hot conference tournament. This is the model correctly assessing what kinds of teams fare best in a winner-takes-all atmosphere.
These might be too close to call for some, so let this be your difference maker. Pick up two free points, on me. Or not. Either way, we’re not looking at contenders here. You probably shouldn’t have any of these four teams in your Sweet 16.
I promised Blazers and Cougars, and here I am delivering. The model gives a slight edge to the Blazers, so this could be a worthwhile upset pick. But the Cougars made a deep run last year, so don’t count them out from doing so again this year.