Essays on education, debate, and math instruction; neat math problems; and whatever else I get around to.

Sunday, January 4, 2015

100th Post!!

My hundredth post! When I started five and a half years ago, I never imagined I would get here. It turns out that I have written a post, on average, every 20 days. I thought for this special occasion, I would go back to one of my original reasons for starting this blog: my dislike for the traditional methods of power matching in debate tournaments. My opinion was -- and still is -- that power matching doesn't give each debate team a fair experience at the tournament. Many debate teams make it to elimination rounds without facing good opponents. My solution was to create a strength-of-schedule pairing that power-matched but also attempted to even out schedule strength. That was five years ago. Since that time, I have come to believe that it's better to abandon power matching rather than try to improve it. The alternative is random prelims. Random prelims work for the N.S.D.A. Nationals (formerly the N.F.L.) and can even be improved with geographic mixing.

I decided to test it out with an experiment. How do different pairing methods compare at reproducing the actual ranking of the teams? Obviously, one only knows the actual rankings in a simulated experiment. To start, I generated 200 random teams, giving each one a true strength. The true strengths were in a Normal distribution with an average of 27 points and standard deviation of 1 point. This is realistic, based on previous empirical analysis I've done. Next, I paired the teams against each other using one of four pairing methods. Each team's performance could deviate from its true strength by a random number that followed a Normal distribution, average of 0, standard deviation of 1 point. In other words, a team would perform within +/- 1 point of its true strength about 68% of the time. This may seem like a lot but is realistic from the same empirical analysis. This deviation in performance accounts for off-rounds, surprise strategies, and judging variability (e.g., point trolls), too.
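For readers who want to play along, the setup above can be sketched in a few lines of Python. This is only an illustration under the stated assumptions (200 teams, true strengths from a Normal distribution with mean 27 and standard deviation 1, per-round deviation from a Normal distribution with mean 0 and standard deviation 1). The names `make_teams` and `perform` are made up for this example; this is not the code behind the experiment.

```python
import random

def make_teams(n=200, mean=27.0, sd=1.0, seed=0):
    """Draw a true strength (in speaker points) for each of n teams."""
    rng = random.Random(seed)
    return [rng.gauss(mean, sd) for _ in range(n)]

def perform(true_strength, rng, noise_sd=1.0):
    """One round's speaker points: true strength plus a random deviation."""
    return true_strength + rng.gauss(0.0, noise_sd)

strengths = make_teams()
rng = random.Random(1)

# Share of simulated performances that land within 1 point of true strength.
trials = [perform(27.0, rng) for _ in range(20000)]
share_within_one = sum(abs(p - 27.0) <= 1.0 for p in trials) / len(trials)
print(len(strengths), round(share_within_one, 2))
```

The printed share should come out close to the 68% figure, since that is just the one-standard-deviation band of a Normal distribution.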

Based on the two teams' strengths and factoring in their random deviations from their true strengths, I decreed a winner. Then I set up the next round using the stated pairing method. After all six rounds, I calculated each team's win/loss record, total speaker points, and median speaker points (more on this in a moment). I ran the same tournament four times, once for each pairing method: (1) simple random, (2) random within win/loss bracket, (3) high-high power matched, and (4) high-low power matched. I used the same round 1 pairing for all four methods to give them all even starting conditions. For each one of the methods, I used the results at the end to calculate a traditional ranking (win/loss, then total speaker points, then median speaker points) from 1 to 200 and also a "median points" ranking (first, median speaker points, then total speaker points, then win/loss record) from 1 to 200 -- and compared them to the true rankings. The results kind of blew my mind and switched my perspective around.
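The four pairing methods can be sketched roughly like this. It is a simplified illustration, not the code from the experiment: the function name `pair_round` and the method labels are invented here, "high-low" is taken to mean top seed against bottom seed within a bracket, odd brackets are handled by dropping the lowest seed into the next bracket, and rematches are not checked.

```python
import random
from itertools import groupby

def pair_round(team_ids, wins, points, method, rng):
    """Return a list of (team, opponent) pairs for one round.

    Simplifications (assumptions of this sketch): an even number of teams,
    no side constraints, no school conflicts, and no check for rematches.
    """
    if method == "random":
        order = list(team_ids)
        rng.shuffle(order)
        return [(order[i], order[i + 1]) for i in range(0, len(order), 2)]

    # Seed teams: most wins first, ties broken by total speaker points.
    seeded = sorted(team_ids, key=lambda t: (-wins[t], -points[t]))
    pairs, pulled_down = [], []
    for _, group in groupby(seeded, key=lambda t: wins[t]):
        bracket = pulled_down + list(group)
        pulled_down = []
        if len(bracket) % 2 == 1:
            # Odd bracket: the lowest seed drops into the next bracket.
            pulled_down = [bracket.pop()]
        if method == "random_within_bracket":
            rng.shuffle(bracket)
            pairs += [(bracket[i], bracket[i + 1]) for i in range(0, len(bracket), 2)]
        elif method == "high_high":
            # 1st seed meets 2nd, 3rd meets 4th, and so on.
            pairs += [(bracket[i], bracket[i + 1]) for i in range(0, len(bracket), 2)]
        elif method == "high_low":
            # Top seed meets bottom seed, 2nd meets 2nd-to-last, and so on.
            half = len(bracket) // 2
            pairs += list(zip(bracket[:half], reversed(bracket[half:])))
        else:
            raise ValueError(f"unknown method: {method}")
    return pairs

# Tiny example: eight teams after one round, four winners and four losers.
ids = list(range(8))
wins = {0: 1, 1: 1, 2: 1, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0}
points = {t: 30 - t for t in ids}
hh = pair_round(ids, wins, points, "high_high", random.Random(0))
hl = pair_round(ids, wins, points, "high_low", random.Random(0))
print(hh)
print(hl)
```

In the example, high-high pits the two best 1-0 teams against each other, while high-low gives the best 1-0 team the weakest 1-0 opponent.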

A few caveats for the nit-pickers: yes, I ignored low-point wins. Those aren't too frequent and, as you will see, including them would only make my case stronger. And yes, I ignored side constraints and I pretended like the teams were from 200 different schools. Again, using those constraints would only strengthen my case. Without further ado, here are the results:

This is the r-squared, the coefficient of determination you might have learned about in Intro to Stats, of the traditional and median points rankings to the actual ranking for each pairing method I tested out. A high r-squared is good; it means the listed ranking closely corresponds to the truth.
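As a refresher, here is one way such an r-squared could be computed. This sketch assumes it is the squared Pearson correlation between each team's computed rank and its true rank; the function name `r_squared` is just for illustration.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

true_ranks = [1, 2, 3, 4]
perfect = r_squared([1, 2, 3, 4], true_ranks)   # ranking matches the truth
scrambled = r_squared([2, 1, 4, 3], true_ranks) # neighbors swapped
print(perfect, scrambled)
```

One thing to keep in mind: because the correlation is squared, a perfectly reversed ranking would also score 1, so r-squared is only meaningful here because the rankings are positively related to the truth.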

A couple of things to draw your attention to: (a) the r-squared for the median points rankings does not change much from one pairing method to another; (b) the r-squared for the median points rankings is higher than or almost equal to that of the traditional rankings for every pairing method; and (c) the traditional rankings are closest to true rankings for the high-low pairing, then the random within brackets pairing, then the simple random pairing, and lastly the high-high pairings.

It is actually worth looking at that last one:

Notice the clear pattern? In a high-high pairing, a ton of decent teams get screwed by getting several very hard opponents and therefore have terrible records -- these are the outliers that are very low on the y-axis (indicating true strength) but on the right side of the x-axis (indicating very poor records). Notice the one team in the far bottom right: pity the poor team that had a true ranking of 19 but ended up with a 1-5 record and ranked 183rd by the traditional tiebreakers. Enragingly, a lot of weak teams somehow squeak by to great records. Notice the one team in the upper left: 148th in truth, but given several easy opponents, ending up 5-1 and ranked 21st by the traditional tiebreakers. Visually, you can see how unjust the whole high-high pairing is when coupled with using win/loss record as the primary criterion for ranking, as it is in the traditional method. The median points ranking does not suffer from the same problem; even the good team that gets several tough opponents and ends up 1-5 is not penalized in the rankings, as long as that team continued to earn high points in each one of its rounds.

For comparison, here is what the best correlation looked like:

To be sure, the correlation is far from perfect. But that's just about variability in the teams' in-round performances compared to their true strengths (that random deviation score I added). In other words, what you are looking at is just the off-days, surprises, and crappy judging that is unavoidable. It isn't really possible to do better than about 0.82 or 0.83 -- that's why the median rankings have about the same correlation, no matter what the pairing method.

On the other hand, the traditional rankings are very sensitive to the pairing method. Why? A team's record depends on both its true strength and the opponents it faces! In the high-high pairing method, many teams get unfairly hard or unfairly easy opponents. The method drives down the correlation between true strength and record by screwing some and blessing others. However, in the high-low pairing method, the assignment of opponents pushes up the correlation between true strength and record -- better teams face weaker opponents, so get a few more easy wins.

It can be a bit hard to interpret what these correlations mean, so I also calculated the mean absolute deviation for each pairing method and ranking. For each team, I took its traditional rank and its true rank, found the difference, and took the absolute value. Then I averaged those to produce the mean absolute deviation (MAD). I also did the same thing for the median points rankings.

For example, for the random within bracket pairing method, the median points ranking had a MAD of 18.66. That means, on average, the median ranking was off by 18.66 places from the truth. The lower the MAD, the better.
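The MAD calculation described above is short enough to write out. This is a minimal sketch; the function name and the four-team example are made up for illustration and are not data from the experiment.

```python
def mean_absolute_deviation(ranks, true_ranks):
    """Average number of places each team's rank is off from its true rank."""
    return sum(abs(r - t) for r, t in zip(ranks, true_ranks)) / len(ranks)

# Four teams; the 2nd- and 3rd-best teams were ranked in the wrong order.
true_ranks = [1, 2, 3, 4]
computed_ranks = [1, 3, 2, 4]
mad = mean_absolute_deviation(computed_ranks, true_ranks)
print(mad)  # two teams off by one place each, averaged over four teams: 0.5
```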

The exact same patterns appear as in the correlations table. In general, the best we can hope for is to be within about 20 places of the truth. Given that debaters have off-rounds, and that our sample size is only six rounds, this isn't terrible: 20/200 is 10%. The true ranking is probably within +/- one decile of the median points ranking. Notice that the traditional rankings are sensitive to the pairing method in the exact same pattern. If one uses the traditional criteria for ranking, then the high-low pairing is best. So, was I wrong five years ago?

In both of the tables I've given so far, the high-low pairing method plus traditional ranking was marginally superior to any pairing method plus median points ranking. But the problem is that it is not equally important to rank every team correctly. It is more important to get the top teams right. Enter the weighted rule. As I did for the MAD, I took each team's traditional rank, subtracted its true rank, and took the absolute value. But before I averaged, I divided by the team's true rank. Thus, getting a good team's results wrong by a lot was worth big negative points; getting a weak team's results wrong by a lot was worth only a few negative points. The results:

The pattern is almost the same as before, except that... the combination of high-low pairings and traditional rankings is worse (higher score) than random pairings with median rankings. This means that the high-low plus traditional combination made more mistakes ranking the best teams than the random plus median combination.
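To see how the weighted rule differs from the plain MAD, here is a small sketch. The function name `weighted_rank_error` and the four-team examples are invented for illustration; they are not results from the experiment.

```python
def weighted_rank_error(ranks, true_ranks):
    """Average of |rank - true rank| / true rank, so top teams count more."""
    return sum(abs(r - t) / t for r, t in zip(ranks, true_ranks)) / len(ranks)

true_ranks = [1, 2, 3, 4]

# Same size of mistake (one swap of neighbors), at the top vs. at the bottom.
top_swapped = weighted_rank_error([2, 1, 3, 4], true_ranks)
bottom_swapped = weighted_rank_error([1, 2, 4, 3], true_ranks)
print(top_swapped, bottom_swapped)
```

Both swaps have the same plain MAD (0.5), but the weighted score is larger when the mistake involves the top two teams, which is the whole point of dividing by true rank.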

What are the take-aways?

1. If you are doing high-high power matching, STOP IT RIGHT NOW. Even one round of high-high power matching is harmful. You are screwing many teams over.

2. Consider using the median points ranking instead of the traditional ranking.

On T.R.P.C., it means putting the "drop two high - drop two low speaker points" as the first criterion for ranking. (For a three- or four-round tournament, the "drop high - drop low" option is equivalent to the median. For a five- or six-round tournament, the double-drop option is equivalent to the median. For a seven- or eight-round tournament, the triple-drop option is equivalent to the median.) You can make win/loss record the second or third criterion.

All the data from my experiment show that the median ranking is simply more accurate, no matter how you pair the tournament. The win/loss record is too variable.
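The claim in take-away 2 that the drop options are equivalent to the median can be checked with a short sketch, which also shows the two ranking orders side by side. Everything here is illustrative: the function names are made up, the scores are invented, and the sketch averages the scores left after dropping (a tab program may total them instead, which orders teams the same way when every team has the same number of rounds).

```python
import statistics

def drop_high_low_average(points, drops):
    """Average what is left after dropping the `drops` highest and lowest scores."""
    kept = sorted(points)[drops:len(points) - drops]
    return sum(kept) / len(kept)

def traditional_key(team):
    """Win/loss first, then total points, then median points."""
    return (-team["wins"], -sum(team["points"]), -statistics.median(team["points"]))

def median_points_key(team):
    """Median points first, then total points, then win/loss."""
    return (-statistics.median(team["points"]), -sum(team["points"]), -team["wins"])

six_rounds = [27.5, 28, 26, 29, 27, 28.5]
double_drop = drop_high_low_average(six_rounds, drops=2)
print(double_drop, statistics.median(six_rounds))

# A strong team with a rough schedule vs. a weaker team with an easy one.
teams = [
    {"name": "A", "wins": 1, "points": [28.5, 28, 29, 28.5, 28, 29]},
    {"name": "B", "wins": 5, "points": [26, 26.5, 27, 26, 26.5, 27]},
]
traditional_order = [t["name"] for t in sorted(teams, key=traditional_key)]
median_order = [t["name"] for t in sorted(teams, key=median_points_key)]
print(traditional_order, median_order)
```

With six scores, dropping two high and two low leaves the middle two, and their average is the median. In the two-team example, the traditional order puts the 5-1 team first, while the median points order puts the higher-scoring 1-5 team first.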

3. Consider not doing high-low power matching either.

It is enormously time intensive to run a power-matched tournament. In some cases, power-matching adds 2-3 hours for a six-round tournament: 30-45 minutes after rounds 2, 3, 4, and 5 -- although one or two of those lag times might occur during a food break that had to happen anyway. But 2-3 hours might be used in other ways... say, to squeeze in an extra round. Another round would, in fact, yield more data and would improve the accuracy of the results far more than stopping frequently to power match. And, as the experiment data show, high-low pairings do not improve the accuracy any more than simply switching over to median rankings. (Furthermore, I suspect that high-low pairings plus traditional rankings' accuracy peaks at around five to six preliminary rounds; my suspicion is that for longer tournaments, the accuracy starts to go down again because the brackets start to get too small.)

High-low power matching does have something to argue for it: teams get to see more opponents of similar ability levels (to themselves). But there's a counterargument: random pairings enable teams to see a wide cross-section of opponents' skill levels, and better gauge where they fall on the spectrum. Getting your butt kicked can inspire striving, and besides, your tournament should have a novice and JV division for teams that are afraid of the best opponents in the top division.

However, if you feel like high-low power matching is something you want to preserve but you do want to speed up your tournament, then go to lag-powering. For example, round 3 would be power matched, but only off of the results of round 1. You can slip round 3 pairings under the doors while round 2 is wrapping up, cutting your turnaround time drastically. You probably won't push down your accuracy too much (see how well random within brackets plus traditional rankings compares) if you lag-power -- but especially not if you use median points rankings.
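The lag-powering idea in take-away 3 boils down to which rounds' results feed the brackets. Here is a rough sketch under assumed data shapes: `results` is a hypothetical list with one entry per completed round, each entry a list of (winner, loser) pairs, and the function names are invented for this example.

```python
def wins_through(results, last_round):
    """Count each team's wins in rounds 1..last_round."""
    wins = {}
    for round_results in results[:last_round]:
        for winner, loser in round_results:
            wins[winner] = wins.get(winner, 0) + 1
            wins.setdefault(loser, 0)
    return wins

def records_for_pairing(results, upcoming_round, lag=1):
    """Records used to bracket `upcoming_round`.

    lag=0 is ordinary power matching (use every completed round);
    lag=1 ignores the round immediately before, so pairings can be
    prepared while that round is still in progress.
    """
    return wins_through(results, max(upcoming_round - 1 - lag, 0))

results = [
    [("A", "B"), ("C", "D")],  # round 1
    [("A", "C"), ("D", "B")],  # round 2
]
lagged = records_for_pairing(results, upcoming_round=3, lag=1)
normal = records_for_pairing(results, upcoming_round=3, lag=0)
print(lagged)
print(normal)
```

With a lag of one, round 3 brackets are built from round 1 records only, so they do not have to wait for round 2 ballots.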

4. Do everything you can to help your judges give speaker points more consistently.

If speaker points are more accurate than records, that means we ought to put more weight on speaker points AND strive to make them seem less arbitrary. Brief training sessions at the beginning of the tournament for less experienced judges, clearly delineated rubrics for speaker points, or scoring grids for various attributes of speaking all help!

A changed perspective

I used to think that power matching started as the best way people had, when tabbing on notecards, to improve the accuracy of tournament results. Maybe that is why it got started, but as we can see, all it does is bring accuracy to parity with median rankings. Is there any other reason tabbers might have started to use power matching?

Then it dawned on me: power matching reduces the likelihood that two teams that have met before will be randomly drawn against each other in later rounds. Team A might meet team B in round 1, win, then lose round 2. Team B might do the opposite and win round 2, giving the two opponents a possibility of being randomly drawn against each other for round 3 -- but overall, power matching makes it less likely than simply randomly assigning everyone in one big pool. When you're tabbing on notecards, it speeds things up considerably if this is a rare occurrence. Maybe power matching began, not with HH or HL but with the random within brackets pairing. In other words, how we pair double elimination tournaments (undefeateds and down-ones in two separate brackets, randomly assigned in each) might have gotten translated to all preliminary rounds at all tournaments.

Just a suspicion. It does make sense: high-low (or the awful high-high) brackets are hard to do on notecards, but random within brackets is easy to do.