All mirrors of 2018 ACF Regionals have finished. The packets will soon be posted to the official packet archive (in standard docx format).

In addition, I am temporarily releasing the online packets, which allowed most moderators to track buzz points. Both versions of the set (1/20 and 1/27) are available. This is intended to be a reference tool for the buzz points (for use with the detailed stats). It is not a permanent packet archive; I cannot guarantee how long it will stay up or if it will eventually be posted elsewhere.

Thanks to Andrew Wang for letting me accidentally borrow your webcam for so long.
For the slim chance that I post more better-produced quizbowl videos in the future or even live stream myself working on the detailed stats, you can subscribe to my channel.

Detailed stats warm-up survey

I know that many people are eager to see the detailed stats, but it will take a few more days to iron things out. Sorry to disappoint everyone, and thanks for your patience.
However, while I work on preparing them, I would like to invite you all to participate in an informal survey.

I've said before that my priority is to finally make valuable new quizbowl datasets exist. By that, I'm not just talking about fancy numbers like conversions and buzz points. I'm really interested in learning more about the perceptions that players and authors have about quizbowl questions, and I think that addressing this topic is important to the progress of quizbowl.

For a long time, I have wanted to run an experiment before revealing the detailed stats. I don't have a formal survey ready as originally planned, but these are the goals I wanted to accomplish:

To measure the difference between expectations and reality. How does a question's perceived difficulty compare with conversion? How reliable is players' judgment of the whole set?
This might have taken the form of an exit poll given as a prerequisite to viewers of the detailed stats. For example, you might be asked of a given question: How many 10s/–5s do you think there were? Where do you think the first buzz was? Where was the median/mean buzz location? I would have liked to poll authors too, but naturally they're too busy writing questions. In the future, an "entrance poll" could be integrated into the editing process. Most authors already consider such concerns anyway, but if they made explicit predictions, we could test how well those predictions agree with the results (like I once did for a vanity packet).

To get people to actually engage with detailed stats. Ask yourself, what questions do you want to have answered? What do you want to get out of them? What will it take for other people to start making them? To be honest, it can be discouraging to release detailed stats and rarely ever receive any germane responses (and to hear that I've ruined tournament experiences by inflicting miserable spreadsheets on people for nothing…). We now have an excellent opportunity for real discourse: more than 500 players and 100 staff took part in ACF Regionals. Many of you wrote questions that made it into the set, which was played in 69 rooms around the world – a historic sample size for detailed stats on the exact same questions. This is our chance to think seriously about what's the big deal with detailed stats before our observations interfere with any relatively unspoiled ideas about them.

So please post your hypotheses in this thread. For example, Victor predicted that very few teams regularly buzz before the 50% mark.

As an incentive to participate, email me a good-faith quantifiable prediction by Monday night and I will give the most impressive prognosticators early access to the detailed stats spreadsheets. Please state whether you were a player, moderator, or just an observer; and disclaim whether you have seen any preliminary ACF Regionals detailed stats or have seen detailed stats before.

The bonus on 4-color-theorem/computers/Tait was 20'ed in at least 75% of rooms.

Questions written by editors will have a lower standard deviation for both average buzzpoint and conversion rate than submitted questions.

There will be a positive correlation between a submitting team's performance and the average difficulty of their questions (that is, better teams will have written harder questions, on average).

Literature will have the highest bonus conversion rate.

EDIT TO ADD: "Bad" teams (let's say <12PPB) will have a higher percentage of 30'd bonuses than buzzes in the first two lines, and "good" teams (>18PPB) will have a higher percentage of buzzes in the first two lines than 30'd bonuses.
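For anyone who wants to check the standard-deviation prediction once the spreadsheets are out, here is a minimal sketch. The function name and the sample numbers are hypothetical, made up purely for illustration; the real inputs would be per-question conversion rates (or average buzz points) split by whether an editor or a submitting team wrote the question.

```python
from statistics import pstdev


def spread_comparison(editor_values, submitted_values):
    """Return (editor_sd, submitted_sd, prediction_holds).

    prediction_holds is True when editor-written questions show the
    smaller population standard deviation.
    """
    editor_sd = pstdev(editor_values)
    submitted_sd = pstdev(submitted_values)
    return editor_sd, submitted_sd, editor_sd < submitted_sd


# Hypothetical conversion rates, NOT real Regionals data.
editor = [0.80, 0.85, 0.90]
submitted = [0.60, 0.85, 1.00]

editor_sd, submitted_sd, holds = spread_comparison(editor, submitted)
print(round(editor_sd, 3), round(submitted_sd, 3), holds)
```

The same function works for average buzz point by passing buzz positions instead of conversion rates.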

Last edited by CPiGuy on Mon Jan 29, 2018 1:30 am, edited 1 time in total.

CPiGuy wrote:
There will be a positive correlation between a submitting team's performance and the average difficulty of their questions (that is, better teams will have written harder questions, on average).

You mean after edits? I can assure you that there is no such pattern in submissions - in my experience, difficulty just fluctuates far more wildly in the packets of weaker submissions.

CPiGuy wrote:
There will be a positive correlation between a submitting team's performance and the average difficulty of their questions (that is, better teams will have written harder questions, on average).

You mean after edits? I can assure you that there is no such pattern in submissions - in my experience, difficulty just fluctuates far more wildly in the packets of weaker submissions.

Well, yes, obviously after edits. I think this is most likely to come out in dead tossup rate, since better teams might be slightly more likely to write tossups on harder answerlines, and the answerline is the thing the editors won't have changed.

I would have hypothesized that the variance in difficulty of questions by a certain team would correlate inversely with that team's PPB or whatever, but I figured that would be effectively edited out.

Having absorbed and sought out a lot of feedback about the set over the past week, I regret that a prevailing impression of it is that it seemed to lack middle clues. While it is true that the project I laid out in the general discussion thread led to many leadins that pushed the envelope, I felt as if I was in "middle clue mode" from the second sentence onward. Perhaps I was wrong in thinking that some of my more gimmicky tossups (such as Moby Dick from its first chapter, Ulysses from the Stephen Dedalus scenes, and sexual reproduction in fungi) held up throughout their cluing structures off the strengths of their conceits alone. Alternatively, maybe I banked too much on textbook knowledge or knowledge about other aspects of familiar quizbowl terms (thinking here of tossups such as the specific heat capacity and equilibrium constant ones) being familiar to swathes of the playing audience. However, in every circumstance I was always confident that I was following each clue with a markedly easier one. In any case, I will definitely use the conversion data on my questions to calibrate future questions I write at this difficulty. Maybe this will take the form of (assuming that I am significantly ahead of the production schedule, which I wasn't for this set) stepping back from each tossup that I write, evaluating exactly how many teams I expect to buzz on each clue, and rewriting any clue that falls under a certain threshold (i.e., fewer than 5, 10, 20, 30, etc. for successive clues).

I would like to push back against one unstated assumption that I've heard echoed in Victor's post and in the takes of a few others. A team that is regularly buzzing even at the 60-70% mark at this tournament or is getting 13-18 ppb is a decent, even good team and should be proud of how they performed! My hope is that any team in those spots feels like their knowledge was rewarded by at least some part of these questions, and has a sense for what they need to do to kick it to the next level.

Last edited by Auroni on Mon Jan 29, 2018 3:21 am, edited 1 time in total.

CPiGuy wrote:the answerline is the thing the editors won't have changed.

This is an incorrect assumption. Editors will often rework a submitted question to be on a different answerline that preserves the theme of the question, since, as Will suggests, achieving a consistent difficulty across packets is the overriding priority.

Thanks for taking your time to do this, Ophir. Everyone I know is eager to see the stats when they are up.

My bold prediction: The stuff Thomas Young did in his spare time bonus is the easiest bonus in the set by overall conversion and over 50% of teams 30'd it. I will email you some others if I have time to think about them.

hftf wrote:Ask yourself, what questions do you want to have answered? What do you want to get out of them?

To me, detailed stats serve as an effective tool to compare teams and players while divided into specific subject areas. Unfortunately, for the player, it seems that advanced stats are nothing more than a way to boast or feel bad about being hesitant. This is highly self-serving, but could also be useful for teams trying to optimally substitute with a team of 5, or looking at which of your new players out of a mountain of fresh recruits would combine to make the best D2 team. You could do both of these things by inspection, but advanced stats allow you to back it up mathematically.

Having written my first substantial collegiate tournament this year in WAO, I feel that detailed stats would have been really helpful in improving my writing. This would be a direct way of getting feedback without only relying on talking to people about it. This would have been especially useful for music questions, where note clues are not always easy to parse, and would have been a good way to tell me what worked and what didn't. It would also help improve the set weekend by weekend if the tournament, unlike ACF or NAQT, has mirrors lined up over a few weeks.

hftf wrote:What will it take for other people to start making them?

How much work do you estimate it would take someone to set up a generic detailed stats system (to the level of detail you did for CO Art and EFT)? How much more work would it take on a per-set basis? Would pack editors be able to help out by tagging questions with categories or whatever? I would be willing to chip in a couple of extra bucks a tournament to ensure that work on the per-set end can get done by whoever takes your helm. I have no idea how much work one would put into these things or how much money is usually transferred in a large tournament such as SCT or Regionals, so I can't really be more specific.

I think this stuff can be super helpful especially to writers to see how their questions played out so it should definitely be continued.

Having just read the packets, I really love the way the pronunciation guides go over the words -- having them after the words can be confusing both for long words or multi-word phrases and for things I already know how to pronounce; these guides are significantly better.

Hey folks - much like Ophir, I would be very interested in having a more proactive discussion using testable hypotheses based on comments or assumptions (thanks, Conor, for proposing a number of these!). However, I'd like to go beyond specifics like that and try to dig deeper and better inform a number of common areas of discussion.

Here's one I've been thinking about - what do the buzz numbers of "outlier tossups" look like? People often make comments to the effect of "I'm not sure how appropriate these outlier tossups are at Regionals" while others

For example, we can actually set a definition of "outlier" difficulty - maybe "equivalent to or harder than a solid middle part" i.e. 40 to 60% conversion. Outlier tossups would thus be tossups with less than 60% conversion. We can look at the distribution of buzzes in these outlier tossups and see if they're similar to other tossups (and simply fewer buzzes overall) or whether they are empirically more top-heavy.

Apparently these tossups are outliers using these criteria (in no particular order): Lycidas, Anne Sexton, Concept of Mind, Rebecca Solnit, superconducting critical temperature, Middletown, Nicholas Malebranche, Charles Bukowski, James Longstreet, Cindy Sherman. Polynices and Million Man March ring in at 61% and there's a couple others in the mid-low 60s. These weren't all evenly distributed across packets.
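The cutoff above is easy to apply mechanically once per-tossup conversion is known. A minimal sketch, assuming conversion rates are stored as fractions keyed by answerline; the function name and the sample values are hypothetical and are not real Regionals numbers.

```python
def outlier_tossups(conversion_by_answer, threshold=0.60):
    """Return answerlines whose conversion is strictly below the threshold."""
    return sorted(
        answer
        for answer, rate in conversion_by_answer.items()
        if rate < threshold
    )


# Hypothetical conversion rates, for illustration only.
sample = {
    "Tossup A": 0.45,
    "Tossup B": 0.61,  # just above the cutoff, so not an outlier
    "Tossup C": 0.92,
    "Tossup D": 0.58,
}

print(outlier_tossups(sample))  # ['Tossup A', 'Tossup D']
```

Because the comparison is strict, a tossup at 61% stays out of the outlier group, which matches how the two borderline tossups are treated above.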

As a bit of a semi-related aside - I personally have not looked at the buzz distributions of my tossups, but if I had to guess which ones had later buzzes, here's my go for world history. I approached each tossup with the idea of creating a similar difficulty structure, but given awareness of people studying based off old packets, people will be more familiar with some topics than others. These would be areas I would think people would be most familiar with, ranked from most to least:

Part of the discussion about this tournament has been about how it felt harder than usual, with the counter argument being that it's just the longer question length that makes it seem harder than SCT Div 1.

People often talk about the first line of Regionals packets being too difficult, but what I find is that buzzing in on middle clues is unusually difficult and where I personally neg the most. I think there's a gap between when the good players buzz in on their core subjects and when most players might get something or be confident in buzzing near the end of the question.

Because of this, I think that many of the first buzzes at each site are going to be distributed bi-modally on the second and third line area and the lines directly before FTP. I'm gonna take a look at the Canada site first, and then other sites with stronger and weaker fields.

EDIT: I'll post the total results tomorrow, but for the Canada site this hypothesis is looking to be quite wrong.

Erik Christensen
University of Waterloo - School of Planning Class of '18
I write trash
Defending VETO top scorer

I would predict that a large majority of music buzzes (maybe up to 90%) were after the third line or so, even more than for other categories. From personal experience (both from playing the set at UCSD and from watching other teams play), good players were converting music consistently around the fourth or fifth line and less often near the beginning or end of a tossup.

I really appreciate reading advanced stats; they not only help me get an idea of where people are buzzing as a whole, but also provide pointers for the specific subcategories that I can improve.

Last edited by nsb2 on Tue Jan 30, 2018 3:23 am, edited 1 time in total.

At the Southeast (Georgia Tech) site, the majority of music buzzes were VERY late in the question, generally close to or in the last line. Percentage-wise, I'll guess that the average buzz was beyond the 75% point.

My reasoning for this prediction is that we (Auburn) have no "music" player, and we got the music tossup in something like 7 or 8 of our 10 games, and I believe I got all of those buzzes except for maybe 1, and I've done very little music studying ever.*

Perhaps this was just dumb luck in our games, or maybe the detailed stats will confirm what I suspect. I don't have enough knowledge at all to say whether or not the music in this set was too hard/just right/too easy, just wanted to observe that music buzzes in our games were generally very late.

I greatly enjoy browsing advanced stats, and hope they continue to be produced. I'd also be willing to pay a little bit extra per tournament to have these continued.

*All of my music buzzes were last line (I think).

Chandler West
Auburn University 2016-20xx
Good Hope High School (Cullman, AL) 2012-16

So I was wrong for the most part about when people were buzzing. In general, buzz rate increased as the questions got longer. UCSD notably buzzed in correctly much earlier on average. The strange dip in the late-middle of questions I was expecting isn't showing up.

Interestingly, UCSD had far fewer conversions at the end of the tossup, which is partially caused by a 3% lower neg rate than the other American sites.

Here's a thought experiment - what if ACF Regionals had powers? Not to argue for whether it should or not, but rather to test difficulty predictions / presumptions.

In my experience editing, power marks usually are somewhere between 40 and 70 percent of the way through the vast majority of tossups - we'll call 55% a fair midpoint. How many buzzes are by the 55% mark? It looks like about 20% of buzzes are (25% at UCSD). So, if this set had powers - those would be just in or just out of power, so you'd have a power rate of at most 20% (in all likelihood closer to 15, and closer to 20 for UCSD). A 15-20% power rate is, to my understanding, what NAQT aims for in its tournaments - not sure what median SCT power rate is, but the median ICT and HSNCT power rates usually fall within that range. So if we were to use the same criteria then it would seem Regionals is about on par in terms of "power rate," perhaps a bit lower than ideal but not egregiously so.

Is the perception of a lack of early buzzes affected by a lack of powers? There are definitely some assumptions in my logic here, but this seems about right to me. I think you can make an argument for perhaps trimming wording, leadins, etc. but it doesn't look like the number of early buzzes was super off base.
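The hypothetical power rate in this thought experiment is just the share of correct buzzes that land at or before the chosen mark. A minimal sketch, assuming each buzz is recorded as a fraction of the tossup read (0.0 to 1.0); the function name and the sample buzz positions are hypothetical, not taken from the actual stats.

```python
def share_of_buzzes_by(buzz_points, mark=0.55):
    """Fraction of buzzes at or before `mark` (positions are 0.0-1.0)."""
    if not buzz_points:
        raise ValueError("need at least one buzz")
    early = sum(1 for point in buzz_points if point <= mark)
    return early / len(buzz_points)


# Hypothetical buzz positions, for illustration only.
buzzes = [0.30, 0.52, 0.60, 0.71, 0.75, 0.80, 0.88, 0.93, 0.97, 1.00]

print(share_of_buzzes_by(buzzes))        # 2 of 10 buzzes -> 0.2
print(share_of_buzzes_by(buzzes, 0.70))  # 3 of 10 buzzes -> 0.3
```

Moving `mark` between 0.40 and 0.70 shows how sensitive the implied power rate is to where the power mark is placed.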

Periplus of the Erythraean Sea wrote:Here's a thought experiment - what if ACF Regionals had powers? Not to argue for whether it should or not, but rather to test difficulty predictions / presumptions.

In my experience editing, power marks usually are somewhere between 40 and 70 percent of the way through the vast majority of tossups - we'll call 55% a fair midpoint. How many buzzes are by the 55% mark? It looks like about 25% of buzzes are. So, if this set had powers - those would be just in or just out of power, so you'd have a power rate of at most 25% (in all likelihood closer to 20). A 15-20% power rate is, to my understanding, what NAQT aims for in its tournaments. So if we were to use the same criteria then it would seem Regionals is about on par in terms of "power rate."

Is the perception of a lack of early buzzes affected by a lack of powers?

I think this perception may also stem from the differences in question length between ACF and NAQT - the character cap on NAQT questions leads to shorter questions than in ACF. So while the hypothetical power numbers are similar in terms of the percentage distribution, you're hearing more question in ACF before someone buzzes than in NAQT, in absolute number of words heard before the buzz.

cruzeiro wrote:
I think this perception may also stem from the differences in question length between ACF and NAQT - the character cap on NAQT questions leads to shorter questions than in ACF. So while the hypothetical power numbers are similar in terms of the percentage distribution, you're hearing more question in ACF before someone buzzes than in NAQT, in absolute number of words heard before the buzz.

Sure - there are arguments for making questions shorter. In general I think we could have (mostly) gotten away with seven line tossups (ignoring PGs etc) if we tried harder, and there are good arguments in favor of this. But in terms of how things actually played out, it seems like miscellaneous comments I have gotten about a "lack of early correct buzzes" are not corroborated, and that most tossups had solid middle and mid-late clues. I think one could argue that a distribution that looks more like a right triangle / pyramid slope would be ideal (in which 25% of cumulative buzzes would be by the 50% mark, integrating under the curve). However, it's worth keeping in mind that negs are A Thing and looking at neg distribution might help figure out what's pushing things away from that triangle. As Erik pointed out, a 3% dip in the neg rate (across 20 tossups per round, 10 teams, and 10 rounds per team, that's 50 games played x 20 tossups each = 1000 tossups heard, meaning thirty fewer negs, meaning just three fewer negs per team across the whole tournament) can make a real difference in how the curve looks overall.
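To spell out the 25% figure, here is a short worked version. It assumes the "right triangle" means a buzz density that rises linearly from zero at the start of the tossup to its maximum at the end; the symbols f, F, and x (the fraction of the tossup read) are labels introduced here for illustration.

```latex
f(x) = 2x, \quad 0 \le x \le 1, \qquad
F(x) = \int_0^x 2t \, dt = x^2, \qquad
F(0.5) = 0.5^2 = 0.25
```

Under the same assumption, the share of buzzes by the 55% mark would be about 0.55^2, roughly 30%.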

FWIW I'd love for someone to mount a good attack against this line of reasoning, since it's no doubt in part motivated by my attempt to say "hey our set wasn't as brutal as you say!"

EDIT: This also seems to me to be further reinforcement of the "negs are really bad" theory of quizbowl. Obviously it's hard to quantify aggression unless we are able to quantify player certainty about an answer (no way in hell for now) but if three fewer negs per tournament makes that big of a difference on your buzz curve, imagine what having six or nine fewer negs is like! I am sure there are other confounding factors, such as other teams buzzing earlier and giving you fewer opportunities to neg. To compare some actual teams: consider Berkeley (22 negs across 10 rounds), Columbia (17.5 negs, normalized to 10 rounds), [Aidan-less] Penn (27.3, normalized), Chicago (25.4, normalized, with a much higher SOS than these other teams), and Cambridge (20.9, normalized).
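The back-of-the-envelope neg count above can be reproduced in a few lines. This is a sketch under the same assumptions as the post (10 teams, 10 rounds each, 20 tossups per game, and a 3-percentage-point dip applied to every tossup heard); the function and parameter names are hypothetical.

```python
def neg_dip_effect(teams=10, rounds_per_team=10, tossups_per_game=20, neg_dip_pct=3):
    """Return (games, tossups_heard, fewer_negs, fewer_negs_per_team)."""
    # Each game involves two teams, so team-games are halved.
    games = teams * rounds_per_team // 2
    tossups_heard = games * tossups_per_game
    fewer_negs = tossups_heard * neg_dip_pct / 100
    per_team = fewer_negs / teams
    return games, tossups_heard, fewer_negs, per_team


print(neg_dip_effect())  # (50, 1000, 30.0, 3.0)
```

Doubling the dip to 6 points gives six fewer negs per team, which is the "imagine six or nine fewer" case.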

cruzeiro wrote:
I think this perception may also stem from the differences in question length between ACF and NAQT - the character cap on NAQT questions leads to shorter questions than in ACF. So while the hypothetical power numbers are similar in terms of the percentage distribution, you're hearing more question in ACF before someone buzzes than in NAQT, in absolute number of words heard before the buzz.

Sure - there are arguments for making questions shorter. In general I think we could have (mostly) gotten away with seven line tossups (ignoring PGs etc) if we tried harder, and there are good arguments in favor of this. But in terms of how things actually played out, it seems like miscellaneous comments I have gotten about a "lack of early correct buzzes" are not corroborated, and that most tossups had solid middle and mid-late clues. I think one could argue that a distribution that looks more like a right triangle / pyramid slope would be ideal (in which 25% of cumulative buzzes would be by the 50% mark, integrating under the curve). However, it's worth keeping in mind that negs are A Thing and looking at neg distribution might help figure out what's pushing things away from that triangle. As Erik pointed out, a 3% dip in the neg rate (across 20 tossups per round, 10 teams, and 10 rounds per team, that's 50 games played x 20 tossups each = 1000 tossups heard, meaning thirty fewer negs, meaning just three fewer negs per team across the whole tournament) can make a real difference in how the curve looks overall.

FWIW I'd love for someone to mount a good attack against this line of reasoning, since it's no doubt in part motivated by my attempt to say "hey our set wasn't as brutal as you say!"

I had a longer response typed but it was eaten by the Forum Gremlins.

The crux of my point is the following: We can bin clues into three broad categories - "early", "middle", and "late". With longer questions, there is more text (words/characters) to distribute in each of those three categories, including "early" clues. With ACF questions being longer, you're spending more time as a player hearing the "early" clues, buzzing on the "middle" and "late" clues. That extra time spent waiting for the middle clues is, in my hypothesis, why you're getting (unfounded) complaints that this set was brutal.

To make this concrete: suppose you had a tossup with one early, one middle, and one late clue (a caricature of NAQT question length). Compare that to a second tossup with two of each clue type (closer to ACF question length). The clues are chosen such that the actual difficulty of the questions is the same. My guess is that people would consider the second harder, because they're spending more time listening to the early clues that fewer people are buzzing on (even if it forms proportionally the same amount of the question, and the question isn't actually harder), despite the fact that the buzz distribution is the same across the two questions. That's what I think happened here and why you're getting those complaints, despite the fact that the data doesn't validate the complaints. Effectively, I don't think that people properly recalibrate their difficulty expectations to account for the longer questions you get in mACF formats.
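A quick numeric version of the same comparison, as a sketch only: the question lengths below are hypothetical stand-ins for a shorter and a longer tossup, and the function name is made up for illustration. The point is that an identical buzz position, measured as a fraction of the question, means more words heard when the question is longer.

```python
def words_heard(question_length_words, buzz_fraction):
    """Approximate number of words read before a buzz at `buzz_fraction`."""
    return round(question_length_words * buzz_fraction)


# Hypothetical lengths: a shorter tossup and one twice as long.
short_len, long_len = 60, 120
buzz_fraction = 0.75  # same relative buzz position in both

print(words_heard(short_len, buzz_fraction))  # 45
print(words_heard(long_len, buzz_fraction))   # 90
```

The buzz distributions match as percentages, yet the player sits through twice as many words before the buzz in the longer question.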

Periplus of the Erythraean Sea wrote:Here's a thought experiment - what if ACF Regionals had powers? Not to argue for whether it should or not, but rather to test difficulty predictions / presumptions.

In my experience editing, power marks usually are somewhere between 40 and 70 percent of the way through the vast majority of tossups - we'll call 55% a fair midpoint. How many buzzes are by the 55% mark? It looks like about 20% of buzzes are (25% at UCSD). So, if this set had powers - those would be just in or just out of power, so you'd have a power rate of at most 20% (in all likelihood closer to 15, and closer to 20 for UCSD). A 15-20% power rate is, to my understanding, what NAQT aims for in its tournaments - not sure what median SCT power rate is, but the median ICT and HSNCT power rates usually fall within that range. So if we were to use the same criteria then it would seem Regionals is about on par in terms of "power rate," perhaps a bit lower than ideal but not egregiously so.

Is the perception of a lack of early buzzes affected by a lack of powers? There are definitely some assumptions in my logic here, but this seems about right to me. I think you can make an argument for perhaps trimming wording, leadins, etc. but it doesn't look like the number of early buzzes was super off base.

I know personally I would have buzzed early a few times more if there were powers. It gives more incentive to buzz when you are fairly certain but not 100%.

At the same time, questions with power marks often have a certain rhythm that clues you in on a distinguishing clue that is just out of power. You would have to change how some questions are written or have long powers for some of the tossups in this tournament - where would the power mark on that Moby Dick or Mongols question be?

Also, my comment on negs was more focused on why fewer people buzzed at the end at UCSD (because, of course, there's no point in buzzing until the end if the other team has negged).

Erik Christensen
University of Waterloo - School of Planning Class of '18
I write trash
Defending VETO top scorer

ErikC wrote:I know personally I would have buzzed early a few times more if there were powers. It gives more incentive to buzz when you are fairly certain but not 100%.

At the same time, questions with power marks often have a certain rhythm that clues you in on a distinguishing clue that is just out of power. You would have to change how some questions are written or have long powers for some of the tossups in this tournament - where would the power mark on that Moby Dick or Mongols question be?

Also, my comment on negs was more focused on why fewer people buzzed at the end at UCSD (because, of course, there's no point in buzzing until the end if the other team has negged).

I agree about the distinguishing rhythm bit generally and about the tendency for people to buzz earlier when there are powers. I'd like to think that I wrote my tossups with the same sort of aim that I do when there are powers - given a seven or eight line tossup, by the end of line four or five I want to be in the realm of pretty famous stuff. I'd probably power mark the Mongols tossup at Rabban bar Sauma or Nestorians (likely the latter to err on the side of generosity).

I haven't seen the advanced stats yet to confirm this, but on my read-through of the set, I recalled early buzzes by basically every team we faced through the day. Personally, my buzzes also seemed pretty evenly distributed throughout the tossups. With that in mind, I'm going to disagree with the notion that there was a lack of middle clues and instead praise the editing team for the presence of those clues. There are always going to be a few tossups that come down to a buzzer race somewhere or some hard parts that shoot way over - I don't think it can be avoided in a large set like this.

Periplus of the Erythraean Sea wrote:
EDIT: This also seems to me to be further reinforcement of the "negs are really bad" theory of quizbowl. Obviously it's hard to quantify aggression unless we are able to quantify player certainty about an answer (no way in hell for now) but if three fewer negs per tournament makes that big of a difference on your buzz curve, imagine what having six or nine fewer negs is like! I am sure there are other confounding factors, such as other teams buzzing earlier and giving you fewer opportunities to neg. To compare some actual teams: consider Berkeley (22 negs across 10 rounds), Columbia (17.5 negs, normalized to 10 rounds), [Aidan-less] Penn (27.3, normalized), Chicago (25.4, normalized, with a much higher SOS than these other teams), and Cambridge (20.9, normalized).

It’s possible you’re mixing up cause and effect here, because I’m pretty sure you could also interpret the 3% dip in negs to mean “the other team got the tossup faster so there was no time for us to neg” or something along those lines. This isn’t to say negs aren’t terrible and game-losing—they are, but the fact that these buzzes at our site happened earlier in the tossup on average AND resulted in fewer negs makes me think scaled down player aggression isn’t the causal mechanism in the “Neg vs. Average Earliness of Buzz” relationship in this graph.

Put in gameplay terms, I highly doubt that any team besides Berkeley A at our site (1 out of 5 rooms per round) was able to consistently assume that they’d get the tossup after the 15% mark but before the 70% or so mark against every other team and therefore waited a line or so after they thought they knew it. Even then, using that as a general model of a team rather than, like, one or two matches per tournament is pretty iffy to me.

Evan Lynch wrote:I haven't seen the advanced stats yet to confirm this, but on my read-through of the set, I recalled early buzzes by basically every team we faced through the day. Personally, my buzzes also seemed pretty evenly distributed throughout the tossups. With that in mind, I'm going to disagree with the notion that there was a lack of middle clues and instead praise the editing team for the presence of those clues. There are always going to be a few tossups that come down to a buzzer race somewhere or some hard parts that shoot way over - I don't think it can be avoided in a large set like this.

A couple of predictions:
Concordant with the above, I predict the British site will have the highest percentage of first buzzes (even after excluding British content).

The "Big Data"/Cathy O'Neil/recidivism bonus will have the lowest PPB of any bonus in the set.

I'd like to look into whether or not there is a correlation between opponent strength and neg rate. I predict that there is not one, since I don't think teams are as strategic as people might think in how they approach different games, and because the weaker team would have fewer opportunities to buzz anyway. I would also like to know more generally about where teams neg relative to their strength in that category, although I don't have a specific prediction for that.

Granny Soberer wrote:
A couple of predictions:
Concordant with the above, I predict the British site will have the highest percentage of first buzzes (even after excluding British content).

In terms of median first buzz, the UCSD site was the best at 74.5%. The UK site was second at 75%.

If you only look at top 25% or lower of all of a site's buzzes, the UK starts to creep ahead of the field.

155 players from 91 teams had global first buzzes on tossups played in 5 or more rooms. The top team in terms of global first buzzes was Penn A with 20 first buzzes. Eric was first individually (11) and Jaimie was 3rd (7). The second- through fourth-place teams tied at 12, and there were teams at every number below that. I'm pretty surprised at these stats and would have assumed the better teams had a lot more global first buzzes. It's hard to interpret more because so many people had 1 or 2 first buzzes. I wonder how this spread would have looked in the past with more generational superstars than just Eric.

I will periodically update this post if Ophir's cool with me analyzing the detailed stats.

I'd be intrigued to see how Auroni's goal of including more clues about the "academic study" of literature/history/etc. in leadins worked out in terms of generating early buzzes -- I didn't see any buzzes on most of those clues, so it'd be interesting to see how they played out at other sites.

It's worth keeping in mind that first buzz data may be limited by some rooms not keeping advanced stats - there weren't too many issues with this but there were a few. So, keep that in mind when evaluating some people's performances. That said, to my knowledge this only affected the Kansas State and UCSD sites - one room at each, but affecting the former much more as there were only four teams.

So, Penn is probably still ahead on first buzzes - but if I had to guess, Berkeley "should" be second given the omitted data. UCSD had five rooms, one of which didn't keep advanced stats, so scaling up gives (5/4) * 12 = 15, and we can probably estimate Berkeley's "true" first buzzes at 15.
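For anyone who wants to redo this kind of correction for other sites, here is a minimal Python sketch of the scaling. The function name and parameters are made up for illustration, and it assumes the rooms without data behaved like the rooms with data, which may well not hold.

```python
def scale_for_missing_rooms(observed_first_buzzes, total_rooms, rooms_with_data):
    """Extrapolate a first-buzz count to all rooms at a site.

    Assumption (not verified): rooms that did not keep buzz data
    produced first buzzes at the same rate as rooms that did.
    """
    if rooms_with_data <= 0 or rooms_with_data > total_rooms:
        raise ValueError("rooms_with_data must be between 1 and total_rooms")
    return observed_first_buzzes * total_rooms / rooms_with_data


# Berkeley at UCSD: 12 recorded first buzzes, 5 rooms, 4 of them keeping data.
print(scale_for_missing_rooms(12, total_rooms=5, rooms_with_data=4))  # 15.0
```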

The bonus on 4-color-theorem/computers/Tait was 20'ed in at least 75% of rooms.

Questions written by editors will have a lower standard deviation for both average buzzpoint and conversion rate than submitted questions.

There will be a positive correlation between a submitting team's performance and the average difficulty of their questions (that is, better teams will have written harder questions, on average).

Literature will have the highest bonus conversion rate.

EDIT TO ADD: "Bad" teams (let's say <12PPB) will have a higher percentage of 30'd bonuses than buzzes in the first two lines, and "good" teams (>18PPB) will have a higher percentage of buzzes in the first two lines than 30'd bonuses.

Well, that bonus was 20'ed in every room in which it was heard, although because it was bonus 20 there was likely a confounding effect where weak teams were less likely to hear it.

The predictions based on who wrote questions can't really be evaluated yet, I don't think.

Literature did not have the highest bonus conversion -- other arts, some history, and RMPSS all had higher conversion rates.

The last thing does not have immediately available stats, and hasn't been combined yet, so I'll wait to analyze this until I have more time.

-Hard parts were significantly harder than middle parts, leading to a "wall effect" around 20 ppb
-Science bonuses were, on average, harder than literature bonuses
-If you plot a histogram of players organized by # of first buzzes, it will have an exceptionally long tail that probably fits Zipf's Law with some crazy coefficient (meaning there are a ton of players getting 1 first, because they randomly have very deep knowledge of a particular thing)
-Jaimie Carlson brought the thunder and reaped the whirlwind
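
The long-tail prediction above could be checked with something like the following Python sketch, which fits a straight line on log-log axes to the histogram of "number of first buzzes" vs. "number of players". The function name is invented for illustration, the histogram is a placeholder rather than real data from the set, and a log-log regression is only a rough way to estimate a power-law exponent.

```python
import math


def fit_power_law(players_by_first_buzzes):
    """Fit players ~ c * k**(-s) by least squares on log-log axes.

    players_by_first_buzzes maps k (number of first buzzes) to the
    number of players with exactly k first buzzes. Returns (s, c).
    """
    points = [
        (math.log(k), math.log(n))
        for k, n in players_by_first_buzzes.items()
        if k > 0 and n > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two non-empty histogram bins")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Placeholder histogram (NOT real data): an exact 60/k curve,
# so the fit should recover an exponent of about 1.
hypothetical = {k: 60 / k for k in range(1, 7)}
exponent, scale = fit_power_law(hypothetical)
print(round(exponent, 3), round(scale, 3))
```

An exponent well above 1 on the real histogram would mean the tail drops off even faster than classic Zipf.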

I'd also love to determine some kind of statistic for category control, maybe something like an LD50 - what percentage of tossups in a given category a given team or player answers by the halfway point.
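
A rough Python sketch of that statistic, under one possible reading: the share of tossups a player heard in a category that they answered correctly at or before the halfway point. The record layout (`category`, `player`, `position`, `correct`) and the function name are assumptions for illustration, not the format of the actual detailed stats.

```python
def halfway_share(buzzes, category, player, tossups_heard, cutoff=0.5):
    """Share of tossups heard in a category that the player answered
    correctly at or before `cutoff` (position as a fraction of the tossup).
    """
    if tossups_heard <= 0:
        raise ValueError("tossups_heard must be positive")
    early_correct = sum(
        1
        for b in buzzes
        if b["category"] == category
        and b["player"] == player
        and b["correct"]
        and b["position"] <= cutoff
    )
    return early_correct / tossups_heard


# Hypothetical buzz records, for illustration only.
sample = [
    {"category": "Literature", "player": "A", "position": 0.30, "correct": True},
    {"category": "Literature", "player": "A", "position": 0.45, "correct": True},
    {"category": "Literature", "player": "A", "position": 0.80, "correct": True},
    {"category": "Literature", "player": "A", "position": 0.20, "correct": False},
    {"category": "Science", "player": "A", "position": 0.10, "correct": True},
]
print(halfway_share(sample, "Literature", "A", tossups_heard=4))  # 0.5
```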

Periplus of the Erythraean Sea wrote:It's worth keeping in mind that first buzz data may be limited by some rooms not keeping advanced stats - there weren't too many issues with this but there were a few. So, keep that in mind when evaluating some people's performances. That said, to my knowledge this only affected the Kansas State and UCSD sites - one room at each, but affecting the former much more as there were only four teams.

So, Penn is probably still ahead on first buzzes - but if I had to guess, Berkeley "should" be second given the omitted data. UCSD had five rooms, one of which didn't keep advanced stats, so scaling up gives (5/4) * 12 = 15, and we can probably estimate Berkeley's "true" first buzzes at 15.

Yeah I assumed missing buzzpoints were distributed randomly which of course they're not. Do you know about the issue where some people were given a buzzpoint % of over 100%?

Another hypothesis: Judging by the low number of FBs claimed by even the best players and teams, the majority of first buzzes will have been by players from lower-ranking teams or lower-scoring players within higher-level teams. If the data records that such players and teams also have much higher neg rates toward the beginning of the question, an explanation that such players generally buzz less cautiously will be likely. If not, an alternate explanation will have to be found.

Jakob Myers
MSU '21, Naperville North '17
"No one has ever organized a greater effort to get people interested in pretending to play quiz bowl"
-Ankit Aggarwal
Member, PACE
Memerator

There may be a useful statistic for cliffing - places in a tossup where an abnormal number of buzzes happen at once. That'd require figuring out what the standard distribution of buzzes across a tossup is though.
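
One way such a statistic might work, as a hedged Python sketch: bin the buzz positions for a tossup, compare each bin against a baseline share of buzzes, and flag bins that are far over the baseline. The function name, the thresholds, and the baseline used in the example are all placeholders; the real baseline would have to come from the data.

```python
def find_cliffs(positions, baseline_shares, ratio=2.0, min_buzzes=3):
    """Flag bins of a tossup with far more buzzes than a baseline predicts.

    positions: buzz positions as fractions of the tossup (0 to 1).
    baseline_shares: expected share of buzzes per equal-width bin (sums to 1).
    Returns (bin_start, bin_end, observed, expected) for each flagged bin.
    """
    n_bins = len(baseline_shares)
    counts = [0] * n_bins
    for p in positions:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    total = len(positions)
    cliffs = []
    for i, (observed, share) in enumerate(zip(counts, baseline_shares)):
        expected = share * total
        if observed >= min_buzzes and observed > ratio * expected:
            cliffs.append((i / n_bins, (i + 1) / n_bins, observed, expected))
    return cliffs


# Hypothetical baseline: buzzes get steadily more common toward the end.
baseline = [(2 * i + 1) / 100 for i in range(10)]
# Hypothetical tossup: six buzzes pile up on one clue around the 35% mark.
buzzes = [0.31, 0.33, 0.35, 0.35, 0.37, 0.39, 0.75, 0.85, 0.92, 0.97]
for start, end, observed, expected in find_cliffs(buzzes, baseline):
    print(f"cliff between {start:.0%} and {end:.0%}: "
          f"{observed} buzzes vs. {expected:.1f} expected")
```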

Sit Room Guy wrote:Another hypothesis: Judging by the low number of FBs claimed by even the best players and teams, the majority of first buzzes will have been by players from lower-ranking teams or lower-scoring players within higher-level teams. If the data records that such players and teams also have much higher neg rates toward the beginning of the question, an explanation that such players generally buzz less cautiously will be likely. If not, an alternate explanation will have to be found.

An alternate explanation may be that even weaker players often have extremely deep knowledge of one or two specific things, and getting a first buzz on a question only requires that, not any sort of expansive knowledge of the category as a whole. I claim that the best players aren't defined by their ability to pull off sick first-clue buzzes, but rather on their ability to consistently buzz before their opponents.

The bonus on the do-nothing congress will be in the top 3 of most 30d history bonuses.

Social Science, excepting econ, tossups will have the lowest conversion rates.

A few people will neg the Bradley effect clue in the Armenians one with African-American (hah David) and at least one person other than me will mistake which Henry II died in a jousting accident and neg that clue in the Catherine de Medici TU.

Within literature, Non-Epic Poetry will have the lowest PPB and the lowest conversion rate of TUs.

Sima Guang Hater wrote:There may be a useful statistic for cliffing - places in a tossup where an abnormal number of buzzes happen at once. That'd require figuring out what the standard distribution of buzzes across a tossup is though.

I feel like it looks something like the Pareto distribution (possibly mirrored or reversed depending on how you denote the random variable). I was thinking the RV would be "lines left in tossup," so as lines increase, the probability of a buzz occurring quickly increases.

I have argued before on this forum though that it's silly to make distribution assumptions about quizbowl statistics without looking at the data first. It's pretty clear to me that there is no proper way to sample easy, medium, hard bonus parts such that the P{bonus correctly answered} resembles the normal distribution. Maybe the mean PPB is ~15 and the distribution of PPB itself may have 95% of the data between, say, 10 ppb and 20 ppb, but I think the tail behavior of, like, the hardest bonus parts is not very symmetric (i.e. there is heavy skew). Anyone with the bonus conversion data want to graph it in R or something and print the result here? I'm actually more curious about bonus conversion data by bonus and the distribution's skew.
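
For anyone who'd rather not fire up R, here is a minimal Python sketch of the check I have in mind: the moment-based sample skewness of per-bonus PPB, plus a crude text histogram. The function names are mine, and the numbers in the example are placeholders, not actual conversion data.

```python
from collections import Counter


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("all values are identical; skewness is undefined")
    return m3 / m2 ** 1.5


def text_histogram(values, bin_width=5):
    """Print one row of '#' marks per bin of width `bin_width`."""
    bins = Counter(int(v // bin_width) * bin_width for v in values)
    for start in sorted(bins):
        print(f"{start:>2}-{start + bin_width:<2} {'#' * bins[start]}")


# Placeholder per-bonus PPB values (NOT real data from the set).
hypothetical_ppb = [4, 9, 11, 12, 13, 14, 15, 15, 16, 17, 18, 19, 21, 27]
print(round(skewness(hypothetical_ppb), 3))
text_histogram(hypothetical_ppb)
```

A value near zero would be consistent with a roughly symmetric distribution; a clearly positive or negative value would point to the kind of skew I'm describing.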

Harris Bunker
Grosse Pointe North High School '15
Michigan State University 2015-

With the encouragement of Ophir I'll post a piece of minutiae that's probably only interesting to me: although the UCSD site had the "earliest" cumulative buzz distribution, the SOS measures for teams at the UCSD site don't have the same dramatic difference with other sites.

Sima Guang Hater wrote:There may be a useful statistic for cliffing - places in a tossup where an abnormal number of buzzes happen at once. That'd require figuring out what the standard distribution of buzzes across a tossup is though.

Islam had the most egregious difficulty cliff out of all first-half-of-tossup cliffs. 16/49 people knew that most major Chinese translations of this religion’s scriptures were authored by scholars with the surname Ma.

Progcon wrote:
I have argued before on this forum though that it's silly to make distribution assumptions about quizbowl statistics without looking at the data first. It's pretty clear to me that there is no proper way to sample easy, medium, hard bonus parts such that the P{bonus correctly answered} resembles the normal distribution. Maybe the mean PPB is ~15 and the distribution of PPB itself may have 95% of the data between, say, 10 ppb and 20 ppb, but I think the tail behavior of, like, the hardest bonus parts is not very symmetric (i.e. there is heavy skew). Anyone with the bonus conversion data want to graph it in R or something and print the result here? I'm actually more curious about bonus conversion data by bonus and the distribution's skew.

For what it's worth, I think 15 PPB as a median is an ideal trotted out for regular difficulty in general. I also have used it as a target for EFT because the audience for that tournament is (probably) larger and also somewhat weaker.

EDIT: And yeah, mea culpa on that Islam clue. I moved it to be in the fourth sentence in the set's second iteration - still not a great look. It feels like only a few years ago the Hui and Three Mas were pretty obscure to quizbowl. I guess I'm old now. Though admittedly, that should say 16/49 rooms, not people.

Here is a big dump of buzzpoint data by team. I feel that this is OK for how buzzpoints should be distributed for a tournament like this, but ACF could definitely afford to cut an early line and add more pre-FTP fluff.

Aaron Manby (ironmaster) wrote:Depending on what Victor used to define "regularly":

No team had 50% of their buzzes before the 50% or 60% mark.

4 teams had 25% of their buzzes before the 50% mark.

34 teams had 25% of their buzzes before the 60% mark.

55 teams had 10% of their buzzes before the 50% mark.

89 teams had 10% of their buzzes before the 60% mark.

Thank you for finding this information. I guess I didn't exactly define "regularly" before, and I hate to be defining it now after the fact, but I would define "regularly" based on this graph, and also by room instead of team (this was a similar analysis done of the buzz points, but with a smaller sample size from Terrapin). I'll be very specific in my ideal definition of "regularly."

On a graph of [% of rooms answering the question correctly] vs. [buzz mark], I think the ideal average should connect the following three points, not in a straight line: (0% rooms, 10% buzz mark) + (40% rooms, 60% buzz mark) + (80% rooms, 90% buzz mark). That should look like this (pretend that the y-axis is % of rooms, and that each of the lines is an individual room). A few things:

1. For the first point, you can't start this line from the zero mark, because buzzes can only happen after the first substantive clue is read and comprehended. I think that starts around the 10% mark. I don't think this can really be disputed, although my next two points can be disputed.
2. For the second point, this is just a number that I think is ideal: if 40% of rooms have answered a tossup correctly by the 60% mark, I think the set won't be considered to be "dragging on."
3. For the third point, this is another number that I arbitrarily decided on. 80% of rooms answering a tossup correctly by the FTP clue (which I estimate happens around the 90% mark most of the time) seems like a number to strive for.

I think that Regionals (at least at Penn State) this year had this line depressed to the right, something like this. Only 20% of rooms were answering questions by the 60% mark, and only 75% of rooms were answering questions by the 90% mark.
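
To make the comparison above concrete, here is a small Python sketch that measures how far a set of correct-buzz positions falls from my three target points. The names are made up for illustration, and the positions in the example are invented to mimic my rough read of the Penn State numbers (20% by the 60% mark, 75% by the 90% mark); they are not actual data.

```python
# Target points: buzz mark -> share of rooms that should have answered by then.
IDEAL_TARGETS = {0.10: 0.00, 0.60: 0.40, 0.90: 0.80}


def share_answered_by(correct_positions, rooms, mark):
    """Share of rooms with a correct buzz at or before `mark`."""
    return sum(1 for p in correct_positions if p <= mark) / rooms


def gaps_from_ideal(correct_positions, rooms, targets=IDEAL_TARGETS):
    """Observed share minus target share at each mark (negative = behind)."""
    return {
        mark: share_answered_by(correct_positions, rooms, mark) - target
        for mark, target in targets.items()
    }


# Invented example: 20 rooms, 18 correct buzzes, 2 rooms where it went dead.
hypothetical_positions = [
    0.35, 0.45, 0.55, 0.58,
    0.65, 0.70, 0.72, 0.75, 0.78, 0.80, 0.82, 0.85, 0.86, 0.88, 0.89,
    0.95, 0.97, 0.99,
]
for mark, gap in gaps_from_ideal(hypothetical_positions, rooms=20).items():
    print(f"by the {mark:.0%} mark: {gap:+.0%} relative to target")
```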

I think that eliminating the first line or two in tossups could automatically correct this curve without sacrificing ability to differentiate between lower-tier teams. I urge writers to consider shortening their questions in the future.

I'll roughly concur with Victor, postulating an ideal buzz distribution as approximately equal to (2 * integral [0 to X] x dx) - this being a slope of a perfectly smooth pyramid. This gives the following numbers:

1% of all buzzes are by 10% (first-clues) - perhaps could be a bit harder, but in general I think the first clue of a tossup should be tough and also fresh (and also players may hesitate, reducing the total number of first buzzes)
25% of all buzzes are by 50%
81% of buzzes are by the 90% mark (FTP)
100% of all buzzes are by 100%
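
Spelled out, with X taken to be the fraction of the tossup that has been read and F(X) as my label for the cumulative share of buzzes by that point (both are notation introduced here for clarity), the integral evaluates to a simple square:

```latex
\[
F(X) = 2\int_0^X x\,dx = X^2, \qquad 0 \le X \le 1
\]
\[
F(0.1) = 0.01, \quad F(0.5) = 0.25, \quad F(0.9) = 0.81, \quad F(1) = 1
\]
```

The underlying buzz density is 2x, a straight line rising from zero at the start of the tossup, which is the "slope" referred to above.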

Naturally, there are some complicating factors:

1) Negs. Assuming a (fairly high) neg rate with 25% of all buzzes being negs, and further assuming that negs are evenly distributed throughout a tossup, this means (roughly - the math here isn't perfect) that 20% of all correct buzzes are by 50%. That's a 20% "power rate" - which I think we can agree is pretty reasonable. It also means that 25% of tossups automatically get converted at the end.
2) Tossups that not all teams convert. Assuming a (pretty reasonable) conversion rate of 90%, this means that you're ending up with an 18% power rate. However, since powers are disproportionately scored by good teams, who are also more likely to know tougher answers, I suspect the attenuating factor on powers and early buzzes wouldn't be quite as bad - tossups going dead are, more often than not, weaker teams failing to answer them after hearing the whole question.

Periplus of the Erythraean Sea wrote:
Apparently these tossups are outliers using these criteria (in no particular order): Lycidas, Anne Sexton, Concept of Mind, Rebecca Solnit, superconducting critical temperature, Middletown, Nicholas Malebranche, Charles Bukowski, James Longstreet, Cindy Sherman. Polynices and Million Man March ring in at 61% and there's a couple others in the mid-low 60s. These weren't all evenly distributed across packets.

I suspect that the low conversion on "superconducting critical temperature" was due to teams not knowing the exact name or mixing it up with the Curie point.