While I was making this post, I got to thinking: we should be able to measure which is the best statistical tiebreaker. All we need are situations in which teams (preferably with the same record, but not necessarily) play each other more than once: for some fraction of these, the results for head-to-head, PPG, PPB, and whatever other tiebreakers will differ. We can then easily see which is the strongest correlate to winning a given match (so, like, if PPG differential predicts the winner 87% of the time but head-to-head only 62%, we can quantifiably say that PPG differential is a better tiebreaker). As such matchups happen not infrequently at tournaments, we should be able to assemble some data fairly quickly if some of you are willing to look over old stats. What do people think of this idea? If anyone wants to, I welcome them to find a tournament with such a matchup (a team playing another more than once) and see how often each tiebreaker correctly predicts the result of the actually played match.

MaS

PS: The first thing I came across was the finals of this year's IO, but this provides no useful data since the head-to-head, PPG, and PPB tiebreakers all predict the same result (unless there's a tiebreaker that's broken that I don't know about.)

I was actually thinking of doing this a couple of months ago, but I never got around to it. I think with my database of tournaments I was using to calculate my individual computer rankings, I should be able to write a script to do a bunch of these comparisons. I might try working on this over the weekend.

Let's say you had two teams play common opponents. One of them narrowly wins all of its matches, going 5-0 with an average margin of victory of 50 PPG. The other one blows out all of its opponents but goes on a negfest against that other team, going 4-1 with an average margin of victory of 250 PPG. If I had to predict which team would win a rematch, I would pick the 4-1 team. However, if I had to select one team to go into the Championship Bracket, I would pick the 5-0 team. In other words, the best team and the team most deserving of advancement are not necessarily the same team.

My example is a bit extreme, but there have been plenty of cases similar to it. At IHSA Sectionals, four teams play a Round Robin with one team advancing. If the team generally considered the best loses to the team generally considered the second best, then that generally decides who advances even if it is a very narrow defeat as long as those two teams win their other matches by whatever scores they rack up.

The fact that you are talking about a tiebreaker somewhat alleviates this, but there is still the issue of whether the team with the higher PPG has earned the right to advance--because answering more questions is an accomplishment in itself--as opposed to us finding a more complex metric that may better predict success at the next level.

David Reinstein, IHSSBCA Chair (2004-2014)New Trier Coach (1994-2011); Head Writer and Editor for Scobol Solo and Masonics (Illinois); Writer for NAQT; co-TD for New Trier Scobol Solo and New Trier Varsity; PACE Member; former writer for CMST; former editor for IHSA

One problem I can foresee is that there is a lot of interaction between the different tiebreakers. For instance, teams with high bonus conversions generally have high points per game.

I do not have the base of tournaments necessary to run this, but it strikes me that a more prudent approach might be to record each game as a six-dimensional vector where:

1 means the winner of the game was higher in the stat
0 means the winner of the game was equal or lower in the stat

We then put this into a 2x2x2x2x2x2 matrix, where each entry is the number of games with that particular vector.
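The tallying step could be sketched like this (a minimal Python sketch; the example vectors are invented, and the order of the six statistics is whatever we agree on):

```python
from collections import Counter

# Each game becomes a 6-bit vector: component i is 1 if the game's winner
# was higher in statistic i, 0 if equal or lower (per the encoding above).
# These example games are made up purely for illustration.
games = [
    (1, 1, 0, 1, 1, 0),
    (1, 0, 0, 1, 1, 1),
    (1, 0, 1, 1, 1, 0),
    (1, 1, 0, 1, 1, 0),
]

# The 2x2x2x2x2x2 matrix is just a count of games per vector.
cells = Counter(games)
```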

For each cell, moving the equivalent of down/right would be the equivalent of changing a 1 to a 0. We can then get a ranking of what's most important by comparing that cell with all cells "above" it. So if there were 45 games in cell 110110 but only 24 in cell 100111 and only 12 in 101110, then for cell 100110, "flipping statistic 2" is more likely to explain the winner than "flipping statistic 6", which is more likely to explain the winner than "flipping statistic 3". For each cell, then, we would have a "ranking" of which 0s flipping to 1s are most likely to explain the winner, given that the 1s stay the same.

Among the 64 cells, we have:

1 ranking of 0 stats
6 rankings of 1 stat only
15 different rankings of 2 stats
20 different rankings of 3 stats
15 different rankings of 4 stats
6 different rankings of 5 stats
1 ranking of all 6 stats

We can then use any method we like to interpret these rankings. The best strategy, it seems to me, is to "play 30 games": pit statistic A against statistic B in the 16 cells where both are ranked; whichever is ranked ahead in more cells wins that game, and the best W-L-T record wins overall. One could also use a 6-5-4-3-2-1 point system (438 total points) to determine the order, look for interesting trends in the data, etc.

Shcool wrote:Philosophically, I wonder if this is the best way to go....The fact that you are talking about a tiebreaker somewhat alleviates this...

Yeah, I think you're misunderstanding me. We're talking about situations in which we need a tiebreaker. I'm just proposing a measurement to determine which popular tiebreaker is actually the most valid (best correlate to winning.) Obviously the team with the best record should win regardless of whatever tiebreakers another team may hold against them.

Dwight: I think you're misunderstanding the nature of what I'm proposing to do here. We don't want to compare W-L because that isn't a tiebreaker; only W-L against the same team is. We can easily determine how predictive, for example, PPG differential is of the outcome of any game, but that isn't very useful because we can't make the same comparison for head-to-head except in the case of a repeat matchup, which means we can't isolate the other factors (so no direct comparison can be made). Only in the case of a repeat matchup can we isolate all the factors. Also, the fact that the tiebreakers are correlated isn't important; the proposed measurement measures only the differences between them.

It's an accepted principle in sports analysis (and yes, I will break my virulent opposition to sports analogies here because the sabermetricians are much more advanced with their data and mathematical thought than we are) that a loss or victory by a small margin means nothing, but long-term trends in scoring mean everything. Who wins a 280-245 quizbowl game comes down to the luck of the draw in terms of whether that third arts tossup was better for one team's opera specialist or the other team's architecture player; who consistently scores 35 PPG more over the course of the tournament reliably indicates more knowledge or at least more ability to play quizbowl.

I don't like any appeals to "you're making the head-to-head result meaningless" because:

1) The head-to-head result IS meaningless, essentially, when we're talking about a tie situation--the teams must be very close in ability if they are tied, especially if the one who won the head-to-head game then went and lost to someone who the opponent beat, which mathematically must happen in the "two-way tie at the top of the standings" scenario. If the head-to-head result was a 300-point blowout and then the winning team went and lost to someone who the losing team also beat by 300, then something is wrong with the questions. In the more usual scenario, if the head-to-head result is a very close game, then it has very little value in determining who the better team would be in a longer series of games.

2) The head-to-head result is taken into account to create the tie; without it, someone is 1 game ahead. That game has all the value in the world when we're talking about the difference between "you are 1 game ahead and you have won the tournament/earned the advantage in the final" and "well, I guess we have a tie now, let's find some way to break it." That's value enough for any one game without artificially adding any more.

The two long-term data trends that emerge from quizbowl games are PPG and PPB. I am in favor of using PPG (because it incorporates the entirety of quizbowl activity) when the teams have played common opponents. When teams haven't played common opponents, I think the only fair thing to do is to use bonus conversion, since that is much less affected by the opponents one plays than PPG is. Ideally PPB is context-neutral, but depending on how variation in packets and opponents line up, it might not be.

Well, look; if you guys believe these sports analogies, they should be reflected in the measurement I'm proposing to make, so you have nothing to lose and everything to gain. More importantly, if you believe in reason, you can't advocate uncritically using one tiebreaker based on those arguments; rather, you are compelled to acknowledge that relying on a priori arguments when a posteriori evidence is available is the very pinnacle of unreasonable, unscientific thinking.

The case remains this: all else equal, long-term trends like PPG/PPB have lower fluctuations due to (massively) larger sample size, but are less predictive per datum, while the outcomes of previous games between tied teams are more predictive per datum, but potentially contain very large fluctuations. Therefore, until we can quantify things (which is exactly what I'm proposing to do), doubt must remain regarding which is the better tiebreaker.

In short, both of you are compelled either to advocate this comparison as the justification of your beliefs or to abandon reason (and, concomitantly, your arguments), in which case you must either form a new argument or not argue against this measurement. So far, we have one datum indicating unit correlation to winning (and to one another) for the head-to-head, PPG, and PPB tiebreakers. I know we can do better than that.

Matt: Are you talking about 35 PPG over the course of an entire tournament that is a true round robin, or one that has several divisions of teams? If it is the former kind of tournament, then say so explicitly; if it is the latter, then your argument holds little weight, since common opponents must be factored into any metric attempting to break a tie between 2 teams with identical records. A 35 PPG differential means little, if anything, if the only common opponent between Team A and Team B is the other one. Team A may have played some of their 6 divisional games against middle school teams while Team B played games against Dorman B, Charter C, and RM D, among others. 35 PPG more for Team A means very little if they finish with the same record as Team B but lost to them head-to-head. If I am missing some piece of your argument, please clarify your post.

elrountree wrote:Matt: Are you talking about 35 PPG over the course of an entire tournament that is true round-robin, or one that has several divisions of teams? If it is the former kind of tournament, then state so explicitly; if it is the latter one, then your argument holds little weight since common opponents must be factored into any metric attempting to break a tie between 2 teams with identical records. A 35 PPG differential means little, if anything, if the only common opponent between Team A and Team B is the other one. Team A may have played some of their 6 Divisional games against Middle School teams while Team B played games against Dorman B, Charter C, RM D, among others? 35 PPG more for Team A means very little if they finish with the same record as Team B, but lost to them head-to-head. If I am missing some piece of your argument, please clarify your post.

I think in general people don't support PPG comparisons unless they're made against teams with all common opponents; if you have to compare across brackets, you always prefer PPB to PPG. The only rare circumstance in which this fails is if Team A wins all its games 600-0, getting twenty tossups per game and 20 PPB, and Team B wins all its games 80-0, getting two tossups per game and 30 PPB--or some more realistic corner case, I suppose. But this relies on absolutely atrocious bracket balance. Getting at least decent bracket balance means that the teams that only get two tossups per game--the teams for which PPB means little due to a relatively small sample--will also lose a whole lot.
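The arithmetic behind that corner case, spelled out (a trivial sketch assuming 10-point tossups and the stated bonus conversions):

```python
# Team A: 20 tossups per game at 10 points each, 20 PPB across 20 bonuses heard.
team_a_score = 20 * 10 + 20 * 20    # 600 points per game

# Team B: 2 tossups per game at 10 points each, 30 PPB across 2 bonuses heard.
team_b_score = 2 * 10 + 2 * 30      # 80 points per game

# Team B has the better PPB (30 vs 20) despite scoring far fewer points.
```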

To continue with sports analogies, there is overwhelming evidence that the Patriots were the best team in the NFL last year. However, that does not mean that they should be considered the NFL Champions. Titles and playoff berths go to teams that earn them through criteria decided ahead of time, not to teams that prove themselves the greatest statistically.

If somebody knowledgeable with statistics goes through a large amount of data, they could produce a complex formula to determine which teams are better than which other teams. They will not find that PPG is always the best predictor--they will find that PPG correlates to a certain extent with being better, PPB correlates to a certain extent, team record correlates, etc. There very well could be correlations with the number of negs and, in NAQT tournaments, with the number of powers. If somebody wants to, as best as possible, determine which team is better, then they will need a formula that takes all available correlating statistics into account. Is your goal to use such a formula to break ties?


Shcool wrote:To continue with sports analogies, there is overwhelming evidence that the Patriots were the best team in the NFL last year. However, that does not mean that they should be considered the NFL Champions. Titles and playoff berths go to teams that earn them through criteria decided ahead of time, not to teams that prove themselves the greatest statistically.

This part of your post makes no sense to me. The whole point of this thread is to decide "ahead of time" the criteria to use to break ties in tournaments. The goal is not to retroactively change the outcome of tournaments, as calling the Patriots the NFL Champions would be, but to find a fair way to do it in the future.

Captain Scipio wrote:Dwight: I think you're misunderstanding the nature of what I'm proposing to do here. We don't want to compare W-L because that isn't a tiebreaker; only W-L against the same team. We can easily determine how predictive, for example, PPG differential is in the outcome of any game, but that isn't very useful because we can't make the same comparison to head-to-head unless in the case of a repeat matchup, which means we can't isolate the other factors (so no direct comparison can be made.) Only in the case of a repeat matchup can we isolate all the factors. Also, the fact that the tie-breakers are correlated isn't important; the proposed measurement measures only the differences between them.

I think that it is actually useful to determine how predictive any given stat is in the outcome of any given game (in order to better quantify "upsets" for instance). However, seeing what you are actually trying to do now, this seems to be a project reserved for a later time.

Are you looking for repeat matchups, or just repeat matchups between teams of the same record? If the latter, here are some data points. Statistics are calculated at the instantaneous point in time that the match began, not data from the entire tournament. If you want to know entire tournament data you can calculate that yourself but it's mostly similar.

Not only am I not wrong, but you've said almost nothing even germane to what I'm saying. I'm begging you and everyone to stop arguing with sports analogies (or really any analogies whatsoever): you're only confusing yourselves. Please look at the actual situation at hand.

Nobody is talking about supplanting winning and losing games to determine tournament winners. That has nothing to do with anything. This thread is about measuring which is the best tiebreaker*. Of course, it would be easy to regress any number of formulae onto winning percentage as you say, but that's not of interest here.

So, again, what we want to do here is to practically compare commonly used (or usable) tiebreakers. I've devised a method that seems to isolate other factors and allows us to draw an immediate conclusion regarding which is the best (most predictive) among the three common tiebreakers (PPG differential over common opponents, PPB differential, head-to-head). If you or anyone else has an easily computable tiebreaker formula that you'd like to see in use, I invite you to publish it here: any such should be comparable by the method I've outlined. Of course, given enough data, we could use regression to determine a statistically best tiebreaker, but let's worry about that later.

MaS

*Maybe people are confused on this point. A tiebreaker is used to choose the best among several teams with equal records to determine, for example, seeding or sometimes other things. The impetus for this thread was a dispute in a previous thread regarding a tiebreaker to award a tournament championship, so it is a positive fact that things like that are happening.

Mike, I take it that you're only looking at the *'d ones, so that's what I'll keep looking for. Unfortunately, because a lot of tournaments don't have round numbers attached, and because you have to manually add and subtract things (SQBS won't necessarily give you a tournament snapshot after, e.g., Round 9 of a 12 round tournament), I'm not sure there's an easier way to do this kind of thing. That said, if people want to put in the time and scour stats pages, this is what you should look for:

A tournament small enough to run a full round robin (usually <15 teams). Anything more and you get bracketed round robins, which skews the data. In these tournaments, the first or last game in a playoff bracket, or a finals game, is guaranteed to be between teams that have faced the exact same opponents (except for themselves). It's just then a matter of manually sorting through that subset of games to find ones between teams of the same record.

If you prefer the end-of-tournament overall data to the instantaneous-point-in-time data, then it's easier to just read numbers off the page; I think that the instantaneous-point-in-time data is more correct to use, but it's also more time-consuming to get.

Hi Dwight,

Well, thanks for your effort, then! The point-in-time data are indeed what we want here, though even the whole-tournament data have some validity. The unstarred data will be included later, but I consider them to be less predictive (since there are more non-isolated factors; the starred data isolate everything possible).

With 19 games (all the *'d data): total points difference, 0.6 > PPB difference, 0.55 > H-H, 0.53 > H-H point difference, 0.5 = PPG difference, 0.5. I'll now include the non-starred data. If anyone else can get me more, I've found a method to enter them pretty quickly. I may just post the spreadsheet on Google Docs to let people enter them by themselves.

So, at this point, I'll conclude three things:

1. Point difference is the best tiebreaker in these data by a fair margin.
2. No normal tiebreaker significantly outperforms any other; they're all in the 50-70% range at predicting the right winner of an actual game.
3. Relatedly, no standard tiebreaker is very good, so meaningful ties should absolutely be played off if a tournament wants to find a fair winner.

I'd further suggest, as I have little faith point 3 will carry the weight it ought, that we take up Coach Reinstein's suggestion and consider a better, composite tiebreaker. I'm open to suggestions in this area and will gladly test any. If we can find enough data, I will try a regression study.

I'll add the caveat that I'm currently confused about one thing in these data: how can a team hold points per game but not point difference if they've played the same number of games? Perhaps I've misunderstood what Dwight meant by point differential; I took that to mean difference in total points scored. Dwight, please let me know what's up; I can update this easily to reflect whatever changes.

MaS

PS: Perhaps point differential means, like, the difference between the teams' mean point difference per match. That might explain the discrepancy.

Captain Scipio wrote:I'll add the caveat that I'm currently confused about one thing in these data: how can a team hold points per game but not point difference if they've played the same number of games? Perhaps I've misunderstood what Dwight meant by point differential; I took that to mean difference in total points scored. Dwight, please let me know what's up; I can update this easily to reflect whatever changes.

If Team A has 350 PPG and 275 PPGA while Team B has 300 PPG and 200 PPGA, then Team A has higher PPG while Team B has higher point differential. Usually that means that Team B is better at answering tossups (hence less chance for the opponent to score) but worse at bonuses (hence lower PPG).
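In code form (just the numbers from the example above), the two orderings disagree:

```python
# Team A has the higher PPG; Team B has the higher point differential
# (points per game minus points per game against).
team_a = {"ppg": 350, "ppga": 275}
team_b = {"ppg": 300, "ppga": 200}

diff_a = team_a["ppg"] - team_a["ppga"]   # 75
diff_b = team_b["ppg"] - team_b["ppga"]   # 100
```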

Also, while the data crunching is neat and all, I think the margin of error is way too great for what we have right now. But then again, I'm just eyeballing these numbers.

hwhite wrote:If Team A has 350 PPG and 275 PPGA while Team B has 300 PPG and 200 PPGA, then Team A has higher PPG while Team B has higher point differential. Usually that means that Team B is better at answering tossups (hence less chance for the opponent to score) but worse at bonuses (hence lower PPG).

This is exactly what I meant, and exactly what I think that statistic means (which is why it would be useful as a tiebreaker).

Mike, since I've given the exact scores for something like 37 of those games, would it be possible to run a regression involving not just who wins, but by how much (e.g. if team A holds PPG tiebreaker, but team B holds head-to-head, and team A beats team B 230-180, then it would be +50 for the PPG tiebreaker and -50 for the h2h tiebreaker).
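A sketch of what such a margin-based fit could look like, in pure Python (the encoding, the choice of three tiebreakers, and every number here are invented for illustration--this is not Dwight's actual data):

```python
def solve_linear(A, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# Hypothetical encoding: one row per repeat matchup; each column is +1 if the
# eventual winner held that tiebreaker going in, -1 if the loser did.
# Columns: PPG, PPB, head-to-head (all games and margins are made up).
A = [[1, -1, 1],
     [1, 1, -1],
     [-1, 1, 1]]
y = [50.0, 120.0, 35.0]   # winner's margin of victory in each game

# With exactly as many games as unknowns the fit is exact; with more games
# one would solve the least-squares normal equations A^T A w = A^T y instead.
weights = solve_linear(A, y)   # points of margin each tiebreaker "explains"
```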

Harry, can you elaborate about the margin of error? I think Mike is saying exactly that when he claims that no statistic significantly outperforms any other, though he hasn't quantified that significance/error.

I'll see if I can scrounge up some more data for all-else-equal matches.

cvdwightw wrote:Harry, can you elaborate about the margin of error? I think Mike is saying exactly that when he claims that no statistic significantly outperforms any other, though he hasn't quantified that significance/error.

(N.b. I don't claim to be a statistician, nor have I taken a statistics course, so I could be wrong)

If you remember the presidential election polls, it works in the same way. Long story short, if you want 80% confidence (which is rather low, but then again, tiebreaking is not perfect to begin with), then with the current sample size of 38, you have a 10% margin of error, which means that no tiebreaker is statistically significantly better than the others (H-H could be 10% higher than reported, and PPG difference could be 10% lower than reported). If you increase your sample size to 100 games, you'll be down to a 6% margin of error, which may start to allow you to confidently (statistically speaking) rule out options.

Regarding what the margin of error on these numbers is, I'm pretty sure you can just use a binomial distribution. In that case, the standard deviation on the number of successes is just sqrt(n*p*(1-p)). So, for example, Mike said point difference had a success rate of 65.79 percent, out of 38 games. In other words, there were 25 successes in 38 games. The error on that is sqrt(38*0.6579*(1-0.6579)) = 2.92. So, we have (25 +/- 2.92)/38 = 0.6579 +/- 0.0768. The errors on the other numbers will be similar. So, I agree with Harry that the errors on these numbers are too big to say definitively which tiebreaker is the best. We probably need to lower the error from the current 7.7 percent to about 3 percent or less to say with much confidence which tiebreaker is the best. Since error scales like 1/sqrt(n), this means we might need 6 times more data than we currently have. Whether that's feasible or not I don't know.
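That computation as a rerunnable sketch (Python; the 25-of-38 figures are the ones quoted above):

```python
import math

def binomial_error(successes, n):
    """Standard error on an observed success rate under a binomial model."""
    p = successes / n
    return math.sqrt(p * (1 - p) / n)

# Point difference: 25 correct predictions in 38 repeat matchups.
err = binomial_error(25, 38)            # about 0.077, i.e. 65.8% +/- 7.7%

# Games needed to shrink the error to ~3% at the same success rate,
# from err = sqrt(p*(1-p)/n)  =>  n = p*(1-p)/err^2:
p = 25 / 38
n_needed = p * (1 - p) / 0.03 ** 2      # about 250 games, ~6.6x the current 38
```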

Schweizerkas wrote:Since error scales like 1/sqrt(n), this means we might need 6 times more data than we currently have. Whether that's feasible or not I don't know.

Considering that this is just from one small, isolated circuit that doesn't run a lot of tournaments (as compared to, say, the Midwest), we should be able to find (hopefully) a near-equivalent amount of data from the Midwest, Northeast, Mid-Atlantic, and Southeast circuits. Plus, there's an entire high school circuit, if we can find small enough tournaments that run double RR or single RR + playoff brackets. I'd say it's feasible to get a sample size of ~200-250 games if we work at it and include anything between teams of the same record (not ideal, but hey, it's the best we can do if we're looking at 250 games).

I don't think we can work with the assumption that data trends in past quizbowl matches necessarily predict game results of future matches. I'm unconvinced that the body of quizbowl match results as a whole to this point represents the expected outcomes of matches to come, and I certainly reject outright the idea that a data set that mixes non-common and common-opponent schedules, is heavily skewed towards sketchily edited west coast sets, TRASH regionals, and IS set tournaments, and has a whole bevy of other problems has any useful extrapolative value whatsoever. These data stem from activities whose commonality barely extends past the use of questions and buzzers. Who says that stirring up all of these (or any other concoction) yields something that will be predictive for future quizbowl as a whole, or more importantly, any individual tournament?

I hold that there is a hefty burden that resides with those who advocate using past data to remake the tiebreaker system, and that that burden is to show that there is a predictive relationship between what has happened in the past and what will happen in the future. Unless someone can show that one tiebreaker stands above the rest regardless of the type of questions, level of competition, a team's slate of opponents, and a plethora of other variables, I don't think we can safely use this kind of data at all.

I still believe that it's best to set a reasonable, intuitive goalpost as the tiebreaker and stick to that. As we see above, points per game and points per bonus correlate similarly to other methods; even if you claim that the above data are valid, you are still forced to admit that the traditional tiebreakers of PPG and PPB appear to be about as useful as any other proposed method.

Moreover, they have the benefit of being both intuitive and positive. It makes a lot of sense that the better team will score more points against common opponents, or score more points per bonus on a differing schedule. Furthermore, it's a positive tiebreaker; you start from zero and go up, there is a goalpost out there, and once you pass it and another team doesn't, you win the tiebreaker. Which is more appealing, that a team should strive to score as many total points and as many points per bonus as possible, or that a team should hope that their margin of victory in one game (or some amalgamation of all of the proposed tiebreakers that historically boosts correlation by X%) was good enough that results from 1994 Wahoo Wars combined with data from Tartan Tussle XX will indicate that they have a 2.5% better chance of winning a follow-up game?

In sum, I hold that Mike's argument that we must reject theory (which amounts to intuition and reason coupled with practice) because there are data out there is ludicrous. There is no reason at all to take at face value these data as useful.

theMoMA wrote:Moreover, they have the benefit of being both intuitive and positive. It makes a lot of sense that the better team will score more points against common opponents, or score more points per bonus on a differing schedule. Furthermore, it's a positive tiebreaker; you start from zero and go up, there is a goalpost out there, and once you pass it and another team doesn't, you win the tiebreaker. Which is more appealing, that a team should strive to score as many total points and as many points per bonus as possible, or that a team should hope that their margin of victory in one game (or some amalgamation of all of the proposed tiebreakers that historically boosts correlation by X%) was good enough that results from 1994 Wahoo Wars combined with data from Tartan Tussle XX will indicate that they have a 2.5% better chance of winning a follow-up game?

What does this even mean? All the proposed tiebreakers and combinations of tiebreakers hold the following: it is better to win a game than not, it is better to answer tossups than not, it is better to answer bonus parts than not. We're using West Coast data because I know where those stats are and no one else has volunteered data.

Data is useful because it confirms intuition. Since there are good arguments to be made for various tiebreakers, it follows that we must go to whatever data is available, or collect new data, in order to verify one or more of these arguments. After all, in Georgia, they consider head-to-head to be "intuitive", a view with which you appear to disagree - therefore there is not a consensus on what is "intuitive". If you have a better set of data on immaculate questions with perfectly opponent-controlled matches, I'd love to see it, because it would be the best data set out there. But I don't think using the data that we do have is somehow invalid.

We're going back to the instantaneous point in a tournament at which the rematch occurs, and predicting which team will win given results of that tournament up to that point. We already know the result, so we're testing how often our predictor is right. 50% means it's a bad predictor, <50% means it's predicting that the team with the better stat will lose the game more often than it will win it. Can we extrapolate this to the future? I don't see why not. We're already pretty certain it can't replace tiebreaker matches, and as more tournaments happen we can feed more data into the machine and come up with the best "approximation" of a tiebreaker match for tournaments that don't have the luxury of that extra packet. I argue that doing this is independent of question quality and independent of strength of schedule; heck, I'm back with my "let's use W/L and predict outcomes of every match" suggestion.

The people pointing out that the data above is inconclusive are correct. If anything, they are understating how inconclusive it is. If two people each toss a coin 38 times, the expected value for the difference in the number of heads each one gets is about 3.5 heads, or about 9%.
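That coin-toss figure is easy to check by simulation (a quick Monte Carlo sketch; the exact expectation for the gap between two fair 38-flip head counts is sqrt(2 * 38 * 0.25) * sqrt(2 / pi), roughly 3.5):

```python
import random

def mean_head_gap(n_flips=38, trials=20000, seed=1):
    """Average |heads_A - heads_B| over many pairs of fair-coin sequences."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = sum(rng.random() < 0.5 for _ in range(n_flips))
        b = sum(rng.random() < 0.5 for _ in range(n_flips))
        total += abs(a - b)
    return total / trials

gap = mean_head_gap()   # about 3.5 heads, i.e. roughly 9% of 38 games
```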


theMoMA wrote:I don't think we can work with the assumption that data trends in past quizbowl matches necessarily predict game results of future matches. I'm unconvinced that the body of quizbowl match results as a whole to this point represents the expected outcomes of matches to come, and I certainly reject outright the idea that a data set that mixes non-common and common-opponent schedules, is heavily skewed towards sketchily edited west coast sets, TRASH regionals, and IS set tournaments, and has a whole bevy of other problems has any useful extrapolative value whatsoever. These data stem from activities whose commonality barely extends past the use of questions and buzzers. Who says that stirring up all of these (or any other concoction) yields something that will be predictive for future quizbowl as a whole, or more importantly, any individual tournament?

First of all, if you don't like these data, get me some more to your liking. I've addressed your concerns by publishing means (now with error bounds; thanks, Brian! I was about to get on that myself...) for both isolated tiebreaking and non-isolated tiebreaking. If we get more data, I can address them further by publishing data for different kinds of situations: competition level, type of questions, etc. There really shouldn't be anything systematic that we can't deconvolve with enough data. However, the fact is, even introducing more random errors* by considering the non-star data (or considering "skewed data," though your criticism of skew here is well wide of the mark: please reconsider what sets these data are from), we should (must) converge to the correct mean with enough data; that's just statistics. This also addresses your claim that these data are useless: no data are useless; we just have to carefully consider the nature of the error we introduce and consider propagating fluctuations.

Secondly, your arguments are massively unscientific. You're just arguing from untested dogmas and saying things that, again, are not counter to what we're examining here. Again, if long-term trends are the best tiebreakers, that will (must) be borne out by the data; if they're not, then it's your dogmas that are wrong. This is what is known as science.

theMoMA wrote:I hold that there is a hefty burden that resides with those who advocate using past data to remake the tiebreaker system, and that that burden is to show that there is a predictive relationship between what has happened in the past and what will happen in the future. Unless someone can show that one tiebreaker stands above the rest regardless of the type of questions, level of competition, a team's slate of opponents, and a plethora of other variables, I don't think we can safely use this kind of data at all.

Okay. I turn that burden back on you: justify uncritically retaining the traditional tiebreaker system without an appeal to tradition itself or to unverified dogmas like "long-term trends are always best." The simple fact is you can't: all untested dogmas are of the same standing and, as you are apparently opposed to looking at actual data and/or don't have any (or are holding out on me...), that's all you can possibly bring me.

Consider, for example, that the whole impetus for this is another person's appeal to "reason" and tradition in favor of the straight head-to-head tiebreaker. Consider, further, that that same tiebreaker, for those same reasons, was widely considered "the correct one" very recently in the college game and, further, that there's no reason it can't become so again. Evidently, you vehemently disagree with that person and with the practitioners of the college game of years past, but their arguments are just as sound as yours in the absence of data and analysis: you've all brought only your dicks to a sword fight.

MaS

*This is somewhat begging the question: Andrew evidently means to assert that competition level/question type may introduce systematic, rather than random, drifts. I don't know if I buy that, but, at the same time, it's not something I can safely dismiss out of hand. The answer is (you guessed it!) more data.

PS: Also, your argument is contradictory at least in this: You're arguing that different situations (types of questions, level of competition) may have different results for the most predictive tiebreaker. You're then arguing that everyone is therefore compelled to use the same tiebreaker in the name of reason. That does not follow.

You have not addressed my concerns, and your statement about error bounds reflects a fundamental misunderstanding of what I'm saying. Your error bounds are useless outside of the data themselves. You've yet to show that these data have any value outside of themselves (i.e., some kind of extraordinary power to predict future action), and until you do so, I will continue to reject what you're doing. I do hold that your data are useless, just like golf ball trajectory data are useless in determining who should win quizbowl tiebreakers. Until you show that the data are applicable to the situation at hand, I hold that we have no reason to assume that the data are valuable. When Dwight says "I argue that [feeding a bunch of data from past tournaments into a machine and coming up with a statistical tiebreaker] is independent of question quality and independent of strength of schedule," why on earth should we take him at face value? This is the major contention in using past data; you can't simply argue it away by putting "I argue" in front of an opinion.

Moreover, why would the burden be on me to get you data "to my liking"? I am the one making objections here; either find a way to counter them, find new data, or abandon your argument. Don't tell me that I have to counter my own argument for you. And stop mischaracterizing my argument. I am not opposed to looking at data, I am opposed to assuming that the data are useful in describing the situation at hand, which I find a hefty precondition to looking at the data.

I merely offer PPG and PPB as reasonable, intuitive, and positive. I am by no means saying that these are the only reasonable, intuitive, and positive tiebreakers that exist. The fact that some people see head-to-head as a legitimate tiebreaker doesn't do anything to my argument; those people can show up and convincingly justify their beliefs as such, which would only show that there can be more than one legitimate tiebreaker. Or they can be wrong. Neither of these possibilities undermines what I'm saying. I see no reason to accept the "other people believe differently and appeal to some of the same things you do, abandon your argument" argument.

It may very well be that the current mode of tiebreaking is an untested dogma, but you've got a responsibility to show that your test is actually the correct one. You haven't done anything to shift the burden back to me. Show that your data are meaningful, or be forced to submit to bottom-up instead of top-down tiebreakers.

theMoMA wrote:I am not opposed to looking at data, I am opposed to assuming that the data are useful in describing the situation at hand, which I find a hefty precondition to looking at the data.

Andrew, unless I'm horribly mischaracterizing your argument, you appear to be stating that we cannot use the data that we have because they are not at all useful. Do you agree with the following method:

Hypothesis: A is a better predictor of B than C is.

Testing Hypothesis: We make two Bernoulli random variables corresponding to A -> B and C -> B. We find a bunch of situations in which A occurs, and a bunch of situations in which C occurs. In each situation, either B will occur (a 1) or B will not occur (a 0). From this, we are able to estimate the mean of these Bernoulli variables, i.e., the true probability that A -> B and C -> B.

Data Analysis: We can run a one-sided z-test with H0: The true probability that B occurs given A and the true probability that B occurs given C are the same, and HA: The true probability that B occurs given A is greater than the true probability that B occurs given C.

Conclusion: If we get a p-value of less than our significance level, say 5%, then we reject H0 and claim that the true probability that B occurs given A is greater than the true probability B occurs given C. This necessarily implies that A is a better predictor of B than C is. If we get a p-value greater than our significance level, then we cannot reject H0 and we're back to "intuition" in deciding whether A or C is better.
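A minimal sketch of that test using only the standard library (the function name and the illustrative counts are mine, not from the thread):

```python
import math

def two_prop_z_test(hits_a, n_a, hits_c, n_c):
    """One-sided two-proportion z-test.
    H0: p_a == p_c; HA: p_a > p_c. Returns (z, p_value)."""
    p_a, p_c = hits_a / n_a, hits_c / n_c
    # Pool the samples to estimate the common proportion under H0.
    pool = (hits_a + hits_c) / (n_a + n_c)
    se = math.sqrt(pool * (1 - pool) * (1 / n_a + 1 / n_c))
    z = (p_a - p_c) / se
    # One-sided p-value: P(Z > z) for a standard normal Z.
    p_value = 0.5 * (1 - math.erf(z / math.sqrt(2)))
    return z, p_value

# Illustrative: one stat picks 40 of 53 rematches correctly, another 27 of 53.
z, p = two_prop_z_test(40, 53, 27, 53)
```

With those counts, z comes out around 2.6 and H0 would be rejected even at the 1% level; identical proportions give p = 0.5, as expected for a one-sided test.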

If you do not agree, tell me where there is a problem with this setup. If you do agree, tell me where I can find data that might be more "useful," or prove to me that no such data exists. Unless I'm terribly mischaracterizing your argument (and I think I am), you seem to be implying that the only useful data is future data, i.e., data that we don't have (and once we do have it it'll be invalid because it's now past data).

As Mike said, there may be some systemic drift between different types of questions, or between different records, and he's entirely right when he says that we can check this if there's enough data (using the method outlined above).

The only argument that I think you can really be making is that the data has not been randomly selected. I will agree with you there, because we don't have data from other circuits. From our small sample of data, we are making a generalization about the population of (rematches between teams of the same record on the same packet set). If there is a systemic reason why we should not include "old" or "poorly edited" tournaments in our sample, outside of that it might skew the data one way or another (which, as Mike said, we can deconvolve with enough data), then you need to explain to me what it is, because you haven't done that yet.

Performance on 1994 Wahoo Wars is probably not well predictive of performance of 2007 ACF Regionals, but we are comparing data from within tournaments (and their fields), not between tournaments (and their fields). That is, we are not taking data from 1994 Wahoo Wars and extrapolating to 2007 ACF Regionals. We are taking data from Wahoo Wars and comparing it to other data from Wahoo Wars, and doing the same thing with ACF Regionals.

As long as the match passes our exclusionary criteria (e.g. we need rematches so the teams are at theoretically the same level at which they played the last time, although this assumption does not always hold; furthermore, we need matches between teams of the same record because W-L record is probably the best predictor of who will win a given match), it should be included in the data set. You appear to be arguing that we need additional exclusionary criteria: please elucidate what exactly these criteria should be.

If there was some procedural change (for instance, if the halftime Whack-a-Mole game was played until 2002, then discontinued), then tiebreakers affected by that change would no longer be valid (we can't use Whack-a-Mole to predict tiebreakers because the probability that a team will win a tiebreaker given a Whack-a-Mole win is 0, since there is no chance a team will actually win the Whack-a-Mole game). The only "changes" that have really occurred in the past decade are that questions are almost uniformly longer and relatively easier. Neither of these are systemic changes that prevent us from taking a meaningful, for instance, bonus conversion statistic.

Okay, Andrew. I say that I have, in fact, understood and addressed your concerns. I will now try to do so a second time. If you see any objection of yours that isn't addressed, I invite you to point out what isn't addressed and how.

I hold as an axiom that past results are the best (and only reasonable) predictor of future results available. This is the farthest thing possible from "extraordinary power." I seek only to use the very basic predictive power of statistics. If you can see a better predictor than past performance, you're welcome to disclose what it is, but the claim I'm making here is hardly odd or extraordinary.

However, the fact that you continue to denigrate even the principle that future results can be predicted by past data leads me to believe that it is you who lack understanding in this case. Therefore, let me take your argument to its logical conclusion: if past results have negligible predictive power, one cannot justly have resort to any tiebreaker whatsoever because, as they're all based on past results of some kind, they're all inherently unfair and baseless fiat judgments that a TD foists on their field. Now, I don't believe that, and your argument about the reasonableness of traditional tiebreakers leads me to conclude that you don't believe that, either. This is, in fact, a major contradiction in what you're saying and gives you every reason to abandon your argument.

Now, then, I've said already and say again that you make a valid criticism by saying that differing types of questions or levels of competition may introduce systematic drifts in the data that we have no good way of compensating for. For a second time, I accept that that may be, though I have my doubts. That means that I can only publish what you will see as lower bounds on the error, for now.

However, I addressed that and address that again by saying that, with further data, we can observe what these drifts are by deconvolving whatever trends you like. Therefore, your criticism (to the extent that it is valid) is one of the data, not of the method per se. Given sufficient data, this method will observe which is the best tiebreaker in any situation that occurs frequently enough. But nobody, me least of all, has ever said that this method will work well with a paucity of data or with only certain kinds of data: in fact, I am saying and have always said the exact opposite of that.

However, if you cleave to this criticism and want to convince me of it, it is incumbent on you to demonstrate it. Find data for a situation of import (well-edited sets or top-flight teams or whatever) and show me that my results are badly different from your results for those. If you can't or won't do this, your criticism is in the realm of conjecture and my (or anyone else's) counter-conjecture is equally valid.

Now, I'll note for a second time that your argument that different tiebreakers may be more predictive in different situations directly contradicts your contention that we should just use PPG or PPB in all cases. Your argument, in fact, dictates that, if we would be fair, we must use the correct tiebreaker for the situation. That is a second major contradiction in what you're saying and, again, gives you every reason to abandon this argument.

Now, if you understand what I've said, you understand that I'm not assuming that every datum is equally valid in every situation. In fact, I'm saying quite the opposite of that: I'm saying that we are compelled to examine different situations to determine if the most predictive tiebreaker may be different in different cases. So, if you're not opposed to examining the data and drawing conclusions, you have no further issue with what we're doing here. However, you claim to understand what I'm saying and yet continue to oppose it. That is a third major contradiction in what you're saying and, again, gives you every reason to abandon this argument.

You say that other tiebreakers may be just as good as the ones you propose, even by your own standards. Then, I ask you: on what basis do you propose the ones you do and not others? The fact that the exact same argument you're making can be used to justify different conclusions (by your own admission!) formally indicates that your conclusion does not follow from your argument. That is a fourth major contradiction in what you're saying and, again, gives you every reason to abandon this argument.

In closing, I'll note that you're right that the responsibility is on me to show that my test is valid. Fair enough: I take as an axiom that, if we're fair, we are compelled to select the tiebreakers that would best predict the outcome of an actual match, since we would presumably play the match to break the tie if we could. However, I say that what is above shows precisely that, given enough data, this test will indicate which stats are most predictive of winning in any situations that you like. However, nothing substantive above is new; it is rather what I've been saying all along. Therefore, I claim that, if you have not heretofore understood that my proposed test is valid (or, indeed, if you don't understand that now), it is not because of my failure to demonstrate that it's so, but rather your failure to understand the principles of my arguments. I invite you to demonstrate that this is not so if you can.

MaS


I've written a script that scans the "*_games.html" SQBS files and finds cases where two teams face each other multiple times. It then finds all the opponents that those two teams have in common, and keeps track of both teams' stats for the games they play against common opponents (as well as for their head-to-head matchups). So, for each of the two teams, I keep stats for N games (N-2 games versus common opponents, and 2 head-to-head matchups). I require that the two teams have identical records in their first N-1 games, and I require N>5, so that the teams have at least a reasonable number of common opponents. If all these requirements are met, I calculate the teams' stats for those N-1 games (PPG, PPB, point differential, head-to-head), and see how well those stats predict which team wins their second head-to-head match.
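We don't have the actual script, but the rematch-finding step might look something like this sketch, assuming the SQBS games files have already been parsed into (round, team1, team2, winner) tuples (a format invented here for illustration):

```python
from collections import defaultdict

def find_rematches(games):
    """Given (round, team1, team2, winner) tuples, return the pairs of
    teams that meet more than once, with the rounds in which they meet."""
    meetings = defaultdict(list)
    for rnd, t1, t2, winner in games:
        # Sort the pair so (A, B) and (B, A) count as the same matchup.
        meetings[tuple(sorted((t1, t2)))].append(rnd)
    return {pair: rounds for pair, rounds in meetings.items() if len(rounds) > 1}

# Toy schedule: A and B meet in rounds 1 and 7.
example = [(1, "A", "B", "A"), (2, "A", "C", "A"),
           (3, "B", "C", "B"), (7, "A", "B", "B")]
rematches = find_rematches(example)
```

The real script would then gather each team's stat lines for the games in between and check which stat picked the round-7 winner.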

This is essentially equivalent to Dwight's *'d data points, except I'm loosening the requirement on what rounds the teams play their opponents. For example, let's say Teams A and B play 8 rounds, with the following opponents:

Team A plays [B,C,D,E,F,G,B,H]
Team B plays [A,E,C,H,D,F,A,K]

In this case, instead of using the first 6 rounds for comparison (where the teams don't play all common opponents), I can look at rounds [1,2,3,4,5,8] for Team A, and rounds [1,2,3,4,5,6] for Team B. In those rounds, A and B play each other once, as well as play C, D, E, F, and H. Assuming A and B both have identical records in those 6 games, we can look at their stats in those games, and see how they predict who wins their second head-to-head matchup (in round 7).
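That round-selection logic can be sketched as follows (the function and argument names are hypothetical): take each team's first head-to-head game plus its first game against each common opponent.

```python
def comparison_rounds(sched_a, sched_b, rival_a="B", rival_b="A"):
    """sched_a/sched_b list each team's opponents by round; rival_a is
    Team B's name as it appears in Team A's schedule, and vice versa.
    Returns the 1-based rounds to use for each team's comparison stats."""
    common = (set(sched_a) & set(sched_b)) - {rival_a, rival_b}
    def pick(schedule, rival):
        first_meetings = [schedule.index(rival)]               # first head-to-head
        first_meetings += [schedule.index(o) for o in common]  # first game vs. each common opponent
        return sorted(r + 1 for r in first_meetings)
    return pick(sched_a, rival_a), pick(sched_b, rival_b)

# The example from the post: single-letter team names, one opponent per round.
rounds_a, rounds_b = comparison_rounds(list("BCDEFGBH"), list("AECHDFAK"))
```

On the example schedules this reproduces [1,2,3,4,5,8] for Team A and [1,2,3,4,5,6] for Team B.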

I've applied this script to about a year's worth of tournaments (all the results I could find for the last year), resulting in 54 data points. Here are the results:

Here we're starting to see fairly significant differences between head-to-head and the other stats. We'll need a lot more data to distinguish between PPG, PPG differential, and bonus conversion, but in any case, it looks unlikely that there's a large difference between those three statistics.

Awesome! I'm glad people so much better at mining these data than I am exist. So, it seems what we need now are more SQBS files, then? How hard would it be to make splits for, like, record or tournament type using your script?

That looks awesome. If you need more data, you can try NAQT's database. You'd probably have to rework your script and make sure you don't get repeats, but I'm guessing that NAQT has some statistics that aren't elsewhere. Of course, you could also try going back to 2006-07 or 2005-06 too.

I've taken your data and really quickly run it through the 2-PropZTest function on my trusty TI-83+. I get the following p-values (I've defined the following: H0: p1 = p2; Ha: p1 > p2):

*Significant at the 5% significance level
**Significant at the 1% significance level

Given those data, I think we can safely conclude that head-to-head is the weakest of the four considered tiebreakers.

For those of you who haven't taken a statistics class, or don't remember anything from it, I define a null hypothesis that the true percentage of games accurately predicted by one tiebreaker is the same as the true percentage of games accurately predicted by a different tiebreaker, and an alternative hypothesis that the true percentage of games accurately predicted by one tiebreaker is greater than the true percentage of games accurately predicted by the other. I plug the data into a fancy mathematical formula to get a z-score, which I can turn into a p-value. If my p-value is less than my significance level, I reject my null hypothesis (and am forced to accept my alternative hypothesis, assuming I've defined my hypotheses correctly); otherwise I cannot reject the null hypothesis (and thus I must continue to assume that one tiebreaker is not better than the other).

Dwight, that looks really nice. I didn't realize TI calculators had that type of capability built in. The p-value looks like exactly the thing we want to be looking at.

Captain Scipio wrote:How hard would it be to make splits for, like, record or tournament type using your script?

Splitting up by tournament type is just a matter of sorting the SQBS files by hand into different categories (most of the tournaments I have are college ACF-style, but there are some NAQT and trash tournaments as well). So, that shouldn't be very hard. What exactly do you mean by "splitting by record"? I think it should be relatively easy to make splits based on any category you can imagine. One idea I had was to look at how predictive the stats are as a function of the stat difference between the two teams. So, for example, instead of just asking, "how often does the team with the higher PPB win the second H2H matchup?", we can look at, "how often does a team with 1 (or 2, or 3, etc.) higher PPB win the second H2H matchup?" This way, we can find out, is a 1 PPB advantage more or less significant than (e.g.) a 20 PPG advantage?

I'll see if I can make a Google Docs spreadsheet available with all the numbers for the 54 datapoints, so you can play around with the numbers yourself.

Also, I noticed that one of my datapoints accidentally appeared twice, because I had two copies of the same tournament in my directory. I removed the extra datapoint, and here are the new numbers (based on 53 points):

PPG: 0.7170 +/- 0.0619
PPG Differential: 0.6792 +/- 0.0641
Bonus Conversion: 0.7547 +/- 0.0591
Head to Head: 0.5094 +/- 0.0687
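As a side note, the quoted +/- values look like plain binomial standard errors, sqrt(p(1-p)/n). A quick sketch to check (the function name is mine, and the count of 38 correct picks is inferred from 0.7170 × 53):

```python
import math

def prediction_rate(correct, n):
    """Fraction of rematches a stat predicted correctly, with the
    binomial standard error sqrt(p * (1 - p) / n)."""
    p = correct / n
    return p, math.sqrt(p * (1 - p) / n)

# PPG picked the winner in 38 of the 53 rematches: 0.7170 +/- 0.0619.
ppg_rate, ppg_err = prediction_rate(38, 53)
```

The same formula reproduces the error bars for the other three stats as well.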

If two people each flip a coin 53 times, the expected value for the difference in their number of heads is a little over 4, which is approximately the difference in the number of successful picks between PPG, PPG Differential, and Bonus Conversion. P-values around 0.3 should not be used to draw any conclusions other than that more research is necessary. (I'm not contradicting anybody--I'm just making the statistical uncertainties more explicit in case anybody reading this thread thinks it's a good idea to draw conclusions at this point.)
