Tournament Reports and the Metagame "Data" Trend

I was not sure if I wanted to post this within one of the SMIP announcement threads or within General Discussion. Please forgive me if you have real training in stats, as I am trying to keep things relatively simple.

This kind of reporting allowed sharp members of the community to pick apart the metagame to a greater degree than was possible with simpler Top 8-only styles of TO reports. It let people look at the metagame saturation of individual decks (how many copies of a deck were played) and similar data points. With the arrival of MTGO and the ability to replay games, it has become easier to look at more than just metagame saturation, and more and more we have seen members of our community delving even deeper into the tournament data.

Working with MTGO (or WER for the more dedicated) and third-party software (generally Excel), the trend has been for TOs (or, for the online events, @ChubbyRain & Co.) to report not only the top-performing decks but also a more detailed and intricate "matchup analysis": after rewatching enough of the tournament to pin down who was on which archetype, they extrapolate from the final results of each round to get total matches played/matches won for each matchup.

There are many places where you can find this information, but for ease of communication I have attached a copy of the NYSE #4 breakdown below.

NYSE #4 is to my knowledge the largest of the recent events to have received this treatment so this is a great place to start.

As cool as the pretty colours and fancy numbers look, most of these numbers mean absolutely nothing.

.
.
.
That's right. Most of this data is unusable for any real statistical analysis. There are simply too few data points to draw any conclusions from the data presented. Excel is a great program which I used every day for years and years, but it is not a fantastic tool for the kind of analysis we are trying to do here. Trying to put this kind of data into a more rigid program such as R-Tools really shows the deficiency in our data. For example, between the two largest groups (Gush & Shops) we have only 41 games. As you go down the list it gets worse and worse.
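To put those 41 games in perspective, here is a minimal sketch (Python, standard library only) of a normal-approximation 95% confidence interval for a matchup win rate. The 24-17 split is hypothetical, chosen purely to illustrate how wide the interval is at this sample size:

```python
import math

def winrate_ci(wins, games, z=1.96):
    """Observed win rate plus a normal-approximation 95% confidence interval."""
    p = wins / games
    half = z * math.sqrt(p * (1 - p) / games)  # standard error times z
    return p, max(0.0, p - half), min(1.0, p + half)

# Hypothetical 24-17 record over the 41 Gush vs. Shops games.
p, lo, hi = winrate_ci(24, 41)
print(f"observed {p:.1%}, 95% CI {lo:.1%} to {hi:.1%}")
```

Even a roughly 58% observed win rate over 41 games gives an interval stretching from about 43% to 74%, which comfortably contains 50% - the data cannot even tell you which deck is favoured.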

The grouping is also quite coarse, as Vintage is not in the position it has been in the past, where 60+ cards would be identical across an archetype. By grouping decks like this we lose sight of the individual differences between the various lists. That said, that has always been an issue with looking at tournaments as whole entities.

While 41 data points is enough to frame a binary Y/N question, it's hard to answer even something as simple as "is deck X favoured in this matchup?" with any real certainty from so little information. As you may know, the more games you have, the more accurate you can expect your data to be; small sample sizes are more prone to variation from what you would expect. Take a look at the Oath of Druids vs Combo matchup in the NYSE #4 data above. If you took it as gospel and expected your next event to have a lot of Storm, you might end up leaning toward Oath as a foil to that metagame.
.
.
.
Let me know how that works out for you.

Now, I am not asking people to stop doing this kind of analysis. As I said, I have taken part in it myself, but PLEASE take all of this data with a grain of salt. I don't actually think anyone is seriously taking the above as gospel, but words from the SMIP Podcast had me worried. This kind of analysis should have little to no place in B&R discussion. We simply do not have enough information to draw conclusions with any sort of certainty, especially if we keep the information restricted to individual tournaments.

I see a few options going forward regarding this kind of analysis.

Keep doing this analysis, but for the love of God keep it away from any kind of B&R discussion in its current form. I fear that this kind of data, restrictive as it is, shifts the focus from winning decks and decks with an unhealthy metagame saturation to decks which meet various other criteria. Under this kind of analysis, the numbers for decks with larger metagame saturation will naturally fall, despite perhaps performing well at the top tables, simply because of the copies that did not make Top 8 (which will generally outnumber those that did). We saw an example of this in a recent SMIP podcast where, despite Gush being 4/8, its MW% was low. This was always to be expected, as 30% of the metagame cannot all make Top 8. I don't point this out to rag on Gush, but simply to use it as an example of the possible unintended consequences of changing our policies re: B&R.

With the current setup, we can expect well-metagamed decks that appear in small numbers to be hit much harder by B&R policy than decks with higher metagame saturation.

Expand the analysis. Work together with the various TOs who have made this data available and combine it into a much broader "Vintage metagame over time" analysis. This has its own issues, however: you would lose all sense of metagame change over time, even more so than the standard loss of deck individuality within individual tournaments, and you would lose any kind of trend data along with it.

Ignore this data.

Now, clearly we should not use any single form of analysis as our sole source of data for B&R discussion. We should be using everything we have at our disposal and working out what is correct from there - not that we have any real power to effect change at the DCI level. I simply want to avoid bad data and bad analysis being used as a soapbox for Vintage community outcry - or lack thereof - at various cards/decks. Without proper instruction, this new rage for 'big data' may do more harm than good in the long run if it is kept in its current form.

I think this is important to contextualize in one way: Steve and I are guilty of not emphasizing that this matchup data is not precisely predictive. In the past, we've enjoyed using recent data to predict metagames (mostly for Champs) and we've had reasonable success in doing so. We don't feel we can (or do) make the same predictions about matchups. Our observations on matchups are only about what happened and how to react to it.

That said, I don't think it's appropriate to separate this data from B/R policy. These matchup results are just as valid for analysis as any other data (or lack thereof) that the DCI has relied upon in the past. There is certainly a risk of making inappropriate/unfounded assertions about certain matchups, but we are forced to evaluate actual results when making policy, and this data is part of that.

We're all forging new knowledge and understanding about Vintage these days, and that's a Good Thing(tm).

Edit: unless Vintage has a massive increase in popularity, we will never have enough data to make truly meaningful data sets by size. All adding time to the equation will do is force us to speak in increasingly general terms, as archetypes shift over time. The DCI works on time frames (quarters) that are too short to lean too heavily on historical precedent (see what happened with Lodestone). There will never be enough data to speak with great certainty in situations like Lodestone. There will always be room for debate ;).

When it comes to integrating matchup results into B/R policy discussion, we addressed during our last show how different combinations of true metagame representation, Top 8 penetration, and performance against the field will be tricky to evaluate in the future. Understanding how small certain data sets are will be part of that evaluation.

To be clear, we include the number of matches to emphasize how unreliable most of the percentages are. Rather than us telling you which data has a large enough sample size that it should be looked at, we present all of it and let individuals decide what is meaningful and what is not.

As @CHA1N5 mentions, it is unlikely there will ever be enough data to make rigorous conclusions from the data. If you think the entire venture is basically useless, then that's your right, and perhaps your informed opinion. However I'm quite certain B&R or metagaming decisions in the past have been made on far worse data.

I personally look at matchups with a "reasonable" number of games played and look at the percentages. If it's off from what I expect then I try to see if I can convince myself that my gut feeling is correct, or perhaps it needs to be reevaluated.

"We saw an example of this in a recent SMIP podcast where despite Gush being 4/8 its MW% was low. This would always have been expected as 30% of the metagame cannot make top 8 etc."

I actually disagree with this. There were over 300 matches of Gush played, and the win rate, excluding mirror matches, ended up around 50%. I don't think you should shrug this off by saying that "30% of the metagame cannot make Top 8". It seems like you are treating the "what made the Top 8?" metric as less assailable to criticism, when really it's just the only thing that has traditionally been measurable. I'm pretty confident any statistical analysis which tells you that 300 games is not significant will tell you the same thing about the makeup of a few Top 8s. I don't want to start a big debate, but as I said, the point of this is to make note of some things that previously were not being included.
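The gap between ~300 matches and a handful of Top 8 results can be sketched with the same back-of-the-envelope math. Assuming a win rate near 50% (as stated above), the half-width of a normal-approximation 95% confidence interval at various sample sizes looks like this (Python, standard library only; the sample sizes are illustrative):

```python
import math

def ci_half_width(games, p=0.5, z=1.96):
    """Half-width of the normal-approximation 95% CI at win rate p."""
    return z * math.sqrt(p * (1 - p) / games)

for n in (8, 41, 300):
    print(f"{n:>4} games: +/- {ci_half_width(n):.1%}")
```

At 8 games the interval is roughly +/- 35 percentage points; at 300 it tightens to about +/- 6. Neither is laboratory precision, but 300 matches is clearly the more informative sample.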

The complaints about "bad data" or, more precisely, the limits of the data, are hilarious, because we now have more and better data than ever before.

It's like complaining about cash flow problems upon suddenly becoming affluent. We may not have enough money, but it's far more than before.

Does anyone feel that the old quarterly morphling.de Top 8 analysis (aggregating Top 8 decks from tournaments of 33 or more players) that I did for years for starcitygames.com is any less flawed? How can 10 tournaments in Q1 of 2008 (and therefore a mere 80 data points) be any more reliable than the deeper data dive we have now?

And yet, the DCI and players have based decisions on less. Some player and DCI decisions have been made on the basis of a single event's Top 8, or no events at all!

Our podcast never purported to present data as the end-all-be-all, or that what happened in one event perfectly predicts another. All these results should be taken with a grain of salt. Some people are over-reading/misreading our discussion. For example, no data point can predict how a White Eldrazi player, say, will make adjustments to shore up or improve the Dredge matchup, and the % will thereby change.

But to suggest that such data should not inform DCI B&R list policy is nonsense. It's no less flawed than looking at a Top 8, with all its variance.

Personally, I would appreciate transparency around B&R decisions that relate to Vintage. This is not likely to happen, but at the least it would be nice to have more details than have been provided in recent updates. If they use data to drive decisions, then I am curious about the reservations around publishing such data. That being said, it is entirely possible that data is not the basis for decision making. That, however, is a scary thought.

My opinion is that more data is always better, BUT you must have the ability to accurately interpret it. How I generally handle it is:

Arguably, the most valuable information is still the top 8 or top 16 of a large event. These are the decks that are winning, either because they are good decks, good players thought they were good decks, or someone was running hot that day. It's what most people look at, and it has the most influence on what people play (that and the VSL, though the VSL doesn't have the gauntlet of tournament play behind it).

The metagame breakdown is also very important in my opinion - what are people actually showing up to a tournament with? It influences my deck selection and construction greatly, but it also has broader ramifications. Normally the Top X is used as a proxy for the metagame, but it needs to be validated, and if there are incongruities, I would look for reasons. For instance, the NYSE Top 16 was all Gush, Eldrazi, and Dredge (with Belcher getting a scoop from Delver, so make of that what you will). You don't know whether non-Gush Blue decks, Oath, and DPS were absent, or were there but none of their players did well. If absent, you might be tempted to look at them as potential solutions for the metagame. From the metagame breakdown, we see that Oath, DPS, and Blue Control were nearly 30% of the field - people were playing the decks, but none did well enough to Top X. That leads you to ask why. Are the decks bad, or are good/experienced players disproportionately not playing them? If the latter, why are they not playing them?

The least valuable piece of information by far is the matchup win percentages. There are so many factors that influence the outcome of a game of Magic - luck, player skill level, standings in the case of scoops or concessions, player fatigue, deck choices, sideboard strategies, etc. I lost the last round of Champs in 2015 when my opponent and I were making dinner plans for Fogo's Brazilian Steakhouse and I let his Xantid Swarm resolve while holding four counterspells (he's a bud and we had a prize split, but the indifference was real)... The sample sizes are so small compared to the effect of these confounders.

It's also only a single event, and decks change with time. Eldrazi has really fizzled from MTGO Dailies over the past couple of weeks for whatever reason, but the increased numbers of Baleful Strixes, Snuff Outs, Balances, etc. have probably had a real impact on Eldrazi's prevalence and success. This also complicates longitudinal studies. I did what you suggested for the first-quarter Power 9's of 2016. That does help with the sample sizes, but again you are sampling from different time periods with different trends in deck construction. The tongue-in-cheek conclusion I reached was that the prevailing opinions that Oath and DPS beat Gush needed to be reevaluated.

There are obviously nuances here, but "X beats Y" is a generic statement that can be evaluated by these types of broad analyses. The more valuable questions that follow are "are these percentages significant?" and, "if significant, why?", which provide direction for testing and increase your insight into the format. From the link above, Gush went 17-7 against Oath while going 29-16 against Combo, the majority of which was DPS. I would argue these records are significant given the spreads, and the prevailing opinions are, if not wrong, then in doubt. I actually got validation of my own testing and tournament experiences from this, which is a valuable thing for me to have.
Ryan and I do these because we can and they do have some level of use when used intelligently, but I feel as you do when people overstate what it means.
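The 17-7 and 29-16 records quoted above can be checked with an exact binomial test against a 50/50 null, using nothing but the standard library. This is a sketch of the "are these percentages significant" question, not a test the original analysis actually ran:

```python
from math import comb

def binom_p_at_least(wins, games, p=0.5):
    """One-sided exact binomial p-value: P(X >= wins) under a p = 0.5 null."""
    return sum(comb(games, k) * p**k * (1 - p)**(games - k)
               for k in range(wins, games + 1))

for name, wins, losses in [("Gush vs Oath", 17, 7), ("Gush vs Combo", 29, 16)]:
    pval = binom_p_at_least(wins, wins + losses)
    print(f"{name}: {wins}-{losses}, one-sided p = {pval:.3f}")
```

Both one-sided p-values come out a little under 0.05, while the two-sided versions (double these, since the null is symmetric) sit just above it - about as close to the conventional significance line as it gets, which is consistent with treating these spreads as suggestive rather than conclusive.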

Regarding B&R discussions, you will note that the DCI never references "win percentage" as a relevant statistic. They have more information than Ryan and I could possibly generate, but it just doesn't matter. The more prevalent a deck is, the more people gear to beat it. Eldrazi in Modern and Affinity in Block or Standard probably had relatively unimpressive match win percentages, but that was with decks running gimmicky hate like Worship and Ensnaring Bridge in the case of Eldrazi, or 10 artifact removal spells in the case of Affinity. I was running four maindeck March of the Machines during Affinity Standard - I beat Affinity, but it was absurd that that was a thing. If you are trying to use match win % in support of a particular B&R decision, you are frankly clueless... Diversity of competitive strategies and interactivity are the things explicitly listed, but these are of course very subjective. The data does have a role here, as the winning lists and metagame breakdowns are reflections of a format's diversity.

"One key to the continued health of Magic is diversity. It is vitally important to ensure that there are multiple competitive decks for the tournament player to choose from."

Personally, I would appreciate transparency around B&R decisions that relate to Vintage. This is not likely to happen, but at the least it would be nice to have more details than have been provided in recent updates. If they use data to drive decisions, then I am curious about the reservations around publishing such data. That being said, it is entirely possible that data is not the basis for decision making. That, however, is a scary thought.

I agree that DCI decision-making should be empirically based, but Mind's Desire was restricted before it ever appeared in a tournament.

There were years - like 1999 - when I can literally find almost no tournament data whatsoever.

We now have more than ever before, and it should all be considered - warts and all.

@Smmenen I once had a professor tell me that one of the best indicators of high intelligence was to be able to receive meaningless stimulus and to not draw conclusions from it. I still think that's one of the more brilliant adages I've encountered.

By the way, how much better do we think Top 8 data is for large tourneys as compared to small ones? Jaco's deck, for example, finished just outside the Top 8 at NYSE... after the intentional draw. To what degree does that indicate it's any worse than the decks that outperformed it? And knowing that the answer is probably "not at all", what does that say about the data?

There could even be, depending on tourney structure, diminishing value as tourneys get larger... The records of decks that finish just outside the Top 8 at very large tourneys are often very, very good. The distinction between 10-0, 9-1, or 8-2 (to say nothing of decks held out of Top 8s on breakers, or people getting byes) is at best 10% per tier. When an experienced player considers all the games that might have gone the other way if not for a 20% out getting hit one way or another, the difference between any of those decks, as indicated by any of those records, is effectively nothing... or rather, it clearly falls within the space in which we just don't know the difference.
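The 10-0 vs 9-1 vs 8-2 point can be made concrete with Wilson score intervals, which behave sensibly even for undefeated records (a sketch; the records are the hypothetical tiers from the paragraph above):

```python
import math

def wilson_ci(wins, games, z=1.96):
    """Wilson score 95% interval; well-behaved even for perfect records."""
    p = wins / games
    denom = 1 + z * z / games
    center = (p + z * z / (2 * games)) / denom
    half = (z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games))
            / denom)
    return center - half, center + half

for wins, losses in [(10, 0), (9, 1), (8, 2)]:
    lo, hi = wilson_ci(wins, wins + losses)
    print(f"{wins}-{losses}: 95% CI {lo:.1%} to {hi:.1%}")
```

The three intervals overlap heavily (8-2's upper bound sits above 90% while 10-0's lower bound sits around 72%), which is exactly the "we just don't know the difference" space, in numbers.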

I personally put much more stock in playtesting decks than in this sort of stat-taking, which is absolutely how DCI decisions should be made.

The Rich Shay Introspection of the day:
I don't know. It's better than dying... I guess.

My opinion is that more data is always better, BUT you must have the ability to accurately interpret it. How I generally handle it is:

Arguably, the most valuable information is still the top 8 or top 16 of a large event. These are the decks that are winning, either because they are good decks, good players thought they were good decks, or someone was running hot that day. It's what most people look at, and it has the most influence on what people play (that and the VSL, though the VSL doesn't have the gauntlet of tournament play behind it).

The metagame breakdown is also very important in my opinion - what are people actually showing up to a tournament with? It influences my deck selection and construction greatly, but it also has broader ramifications. Normally the Top X is used as a proxy for the metagame, but it needs to be validated, and if there are incongruities, I would look for reasons. For instance, the NYSE Top 16 was all Gush, Eldrazi, and Dredge (with Belcher getting a scoop from Delver, so make of that what you will). You don't know whether non-Gush Blue decks, Oath, and DPS were absent, or were there but none of their players did well. If absent, you might be tempted to look at them as potential solutions for the metagame. From the metagame breakdown, we see that Oath, DPS, and Blue Control were nearly 30% of the field - people were playing the decks, but none did well enough to Top X. That leads you to ask why. Are the decks bad, or are good/experienced players disproportionately not playing them? If the latter, why are they not playing them?

The least valuable piece of information by far is the matchup win percentages.

This is basically what I was saying. More data is always better, but the rate at which TOs have picked this up is what spurred me to write the OP. These MW%s are not great indicators of anything, especially with such small sample sizes.

It is easier to be wrong in Vintage than in any other format. It's also easier to argue that you actually are right. As a result, what people believe to be true is usually more relevant than the actual reality.