Best/Worst Performance Measures

Living in the era of big data and Magic Online, we have more data and performance measures than ever before, prompting some debate on the utility of various measures.

Performance measures we now have include:

Top 8/16 & Daily Event 3-1/4-0 appearances

Metagame Penetration (% of a deck/archetype/card/engine of total metagame)

Overall Match Win Percentages

Specific Matchup Win/Loss Percentages

We can even add other performance measures, such as average or median rank performance, and so on.

The goal of the DCI is to promote competitive balance, and to use restriction as a policy tool to maintain competitive balance, just as the FTC and federal courts use antitrust laws to maintain market competition. These tools are most obviously appropriate when there is a dominant deck or monopoly power.

But what's the best measure of dominance or performance?

For most of the history of the format, we lacked detailed overall metagame data. All that we had was Top 8 decklists, if that. Now, with Magic Online, we can see every deck that was played in a Premier event, giving us more data than ever before. We can see the entire metagame and compare it to the Top 8. We can see how players performed, and calculate win percentages. But what does all of this tell us?

Here are some initial thoughts:

1) Performance Matters

It's possible for a deck to have an absurd metagame presence, and yet perform poorly. For example, it is possible for a deck to be 60% of a field, but have 0% of the Top 8 or Top 16, and a sub 50% Match Win %.

2) Matchup Win Percentages Matter

They tell us whether a deck has bad matchups or not. They are not the definitive answer, but they tell us something.

3) Performance Over Time Matters

Metagames are dynamic. A deck can dominate a tournament one week, and get crushed the next. Performance metrics at any single point in time cannot be determinative. We must understand how decks perform over time; dominance requires sustained performance, not performance that waxes and wanes.

I don't pretend to have the answers, but these three principles suggest a few things conclusively, as a matter of logic:

Metagame Saturation is not a Performance Metric, and cannot be a valid direct indicator of dominance. This metric is only useful as context, in helping us see whether a deck under- or overperformed its presence in the field.

The best example of this is the May MTGO Premier event. Eldrazi were 14% of the field, but 63% of the Top 8.

For that reason, I consider metagame saturation the worst possible indicator; it is only valid as an indirect/contextual indicator. Alone, it is logically incapable of telling you anything about performance.

Tournament Size Matters

Tournaments must be large enough that decks have a chance to compete and show up in the data. For that reason, I'd give more weight to larger tournaments, and less weight to things like Daily Events. But within Daily Events, I'd give more weight to 4-0 performances than to 3-1s.

My thought is that Daily Event results should be disregarded altogether. The sample size is too small (usually 12 players), and the 3-1 and 4-0 decklists don't provide any useful information because you don't know what they played against.

(number of appearances in top X) / (number of appearances in the field)

(number of X) / (number of players)

It could be used to compare performance across multiple events as well.

In case this isn't clear: it's a ratio of ratios. I'll try to explain it another way:
A: number of appearances in top X
B: number of appearances in the field
X: number of top placing decks (chosen arbitrarily, depending on whether you want the Top 8, Top 4, Top 2, Top 1, etc.)
N: number of players

(A/B)/(X/N)

Equivalently:

(A*N)/(B*X)

It's chosen such that if the entire field is one deck, and so is the Top 8, the ratio is 1. If the entire field is the same deck except for 8 players, and those 8 make Top 8 while the others do not, then those 8 decks get a high score while the others get 0. It measures the deck's penetration of the Top X bracket vs. its presence in the field. This has been my chosen metric for performance ever since more metagame data became available than just Top X appearances.
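For concreteness, the ratio can be sketched in a few lines of Python (the function name and example numbers are illustrative, not from any real event):

```python
def performance_ratio(a, b, x, n):
    """Ratio of ratios: (A/B) / (X/N), i.e. (A*N) / (B*X).

    a: deck's appearances in the Top X
    b: deck's appearances in the field
    x: size of the top bracket (8 for a Top 8)
    n: number of players in the event
    """
    if b <= 0 or x <= 0:
        raise ValueError("field count and bracket size must be positive")
    return (a * n) / (b * x)

# If the entire 64-player field is one deck, and so is the Top 8,
# the deck scores exactly 1:
print(performance_ratio(a=8, b=64, x=8, n=64))  # 1.0
```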

I really like the formula. This formula can only be applied to a specific deck, though. Can you run it for a couple of decks in recent events to see what kind of values it gives? Why don't you do it for the NYSE, and the last two MTGO events? For, say Shops and Gush?

@Smmenen Using the variable names from my last post:
A: number of appearances in top X
B: number of appearances in the field
X: number of top placing decks (chosen arbitrarily, depending on whether you want the Top 8, Top 4, Top 2, Top 1, etc.)
N: number of players

@Smmenen Right.
You can choose X to include whatever bracket you want, so that you can look at the performance of a deck that would get a zero when using X = 8 (the Top 8 bracket).

Edit: Also, if you choose X for each event such that you are including the top Y percentile of performing decks in the field, then you could use the scores to compare decks across multiple events. I.e., if you choose X based on N so that the ratio of the two is the same (or as close as possible) for each event, then the scores of each deck should be normalized.

The main problem with trying to make a performance measure is the problem of "deck identity." Let's say you put Scornful Egotist in Doomsday. Remove Black Lotus and you've strongly impacted the deck's ability to perform. Remove a Preordain and the resulting impact is so small that you can't measure it in a single tourney. So, how do you define a deck when cards make wildly unequal contributions such that some variations barely matter?

Ideally, the performance measures would also take into account the strength of the competitors; i.e., they would jointly solve for an N x N matrix of matchup win probabilities, and a P x N matrix of how well each player can pilot each deck.

Of course this is only tractable if there are many players who play many different decks in the results data.
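A minimal sketch of the simplest piece of that, the empirical N x N matchup matrix, built from raw match records. The match data and deck names here are hypothetical; the full joint player-skill estimation would be considerably more involved:

```python
from collections import defaultdict

# Hypothetical match records: (winner's deck, loser's deck)
matches = [
    ("Gush", "Shops"), ("Shops", "Gush"), ("Gush", "Shops"),
    ("Gush", "Dredge"), ("Dredge", "Shops"),
]

wins = defaultdict(int)    # wins[(a, b)] = matches deck a won against deck b
total = defaultdict(int)   # total[pair] = matches played between the pair

for winner, loser in matches:
    pair = tuple(sorted((winner, loser)))
    total[pair] += 1
    wins[(winner, loser)] += 1

def matchup_win_rate(a, b):
    """Empirical probability that deck a beats deck b (None if no data)."""
    pair = tuple(sorted((a, b)))
    return wins[(a, b)] / total[pair] if total[pair] else None

print(matchup_win_rate("Gush", "Shops"))  # 2 wins out of 3 matches
```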

Is there a way to scale the score or value so that it resembles something more intuitive?

The formula as I read it is already scaled. You are basically taking the deck's share of the Top 8 (A/X) and dividing it by the deck's share of the metagame (B/N), which is algebraically the same as (A/B)/(X/N). This is a measure of a deck's throughput into the Top X, with a ratio of >1 indicating a deck outperformed its slice of the metagame and a ratio of <1 indicating it underperformed its percentage of the metagame. It's a really nice equation @Aaron-Patten and something I would like to use going forward (if you don't mind), but it's limited in that you need a complete breakdown of the metagame to calculate it.

"One key to the continued health of Magic is diversity. It is vitally important to ensure that there are multiple competitive decks for the tournament player to choose from."


No, you are right. Originally, I thought the upper bound might be a really odd number, but the upper bound is always going to be 100, and the lower bound 0, with 1 signalling a useful pivot point. So by "scaled" I meant translating the number into something that is intuitively useful, but it already is!

So, how do you define a deck when cards make wildly unequal contributions such that some variations barely matter?

It seems like most people just pick certain key cards and count their inclusion as the definition of the archetype, like calling any deck that plays both Gush and Monastery Mentor a Gush-Mentor deck. There are other ways, but they're a bit more involved. I'm still working through some ideas in that department myself.
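That key-card approach is easy to sketch. The rules below are illustrative, not an established taxonomy; the first matching rule wins, so more specific archetypes should be listed first:

```python
# Classify a decklist by the presence of archetype-defining cards.
RULES = [
    ("Gush-Mentor", {"Gush", "Monastery Mentor"}),
    ("Shops",       {"Mishra's Workshop"}),
    ("Dredge",      {"Bazaar of Baghdad", "Golgari Grave-Troll"}),
]

def classify(decklist):
    cards = set(decklist)
    for name, key_cards in RULES:
        if key_cards <= cards:  # all key cards present in the list
            return name
    return "Other"

print(classify(["Gush", "Monastery Mentor", "Preordain"]))  # Gush-Mentor
```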


Exactly right, and I'd be delighted to see the formula used going forward.
Within a Top 8 you could still use it to measure the performance of the finalists, or even the Top 4, but it is limited, in that respect, by how much data is available.


The upper bound for this performance metric will always be:

N/X

So for any given event the values will all have the same upper bound, and if you choose to maintain that X/N ratio throughout your comparisons between different events, the scores will all be scaled to match. If you decide to compare the top 10% for multiple events, for example, you will consistently have an upper bound of 10. If you choose the top 20%, the upper bound is consistently 5; for the top 25%, it is 4, and so on.
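A quick numeric check of that bound, using the same ratio as above with hypothetical player counts. The score is maximized when every copy of the deck in the field makes the bracket (A = B), which collapses (A*N)/(B*X) to N/X:

```python
def performance_ratio(a, b, x, n):
    # (A/B) / (X/N), written as (A*N)/(B*X)
    return (a * n) / (b * x)

# When all B copies of a deck make the Top X (A == B), the score is N/X:
assert performance_ratio(a=3, b=3, x=8, n=157) == 157 / 8  # 19.625

# Fixing X at a percentile of N keeps the bound roughly constant across
# events of different sizes (a Top-10% bracket bounds scores near 10):
for n in (64, 157, 400):
    x = max(1, round(n * 0.10))
    print(n, x, n / x)
```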

If I understand correctly, the aspiration here would be to find a community supported method of objectively measuring performance such that we could mathematically help determine banned and restricted list policy. A non-exhaustive few problems stand out:

22 years after opening my first Revised starter, I still have no idea who or what precisely comprises the "DCI." Since I have no idea how these decisions are actually made, or who performs them, I'm hesitant to believe that even with a "perfect" algorithm, this mysterious decisionmaking body would substitute the findings for its own procedures and judgment (or lack thereof, as some may criticize).

It's already been established that "fun" is a factor taken into consideration when changing the B&R list, and this seems, on its face, to preclude the use of a strictly mathematical determination.

I agree with the DCI that subjective factors and factors existing outside of tournament results should play a role as we're talking about an enterprise whose goal is maximum enjoyment by the human beings who participate. Additionally, I believe the secondary market should be an acceptable factor for consideration. Restricting something like Mishra's Workshop for instance causes a great deal of tangible harm to community members who are our friends, teammates, TO's, and so forth. That's not something we can or should easily brush aside.

Focusing on performance of individual cards is likely to yield seemingly mathematically incontrovertible conclusions like "Polluted Delta should be restricted." Since we know intuitively, that is "wrong," we'd apply an asterisk to set the result aside. And then we'd do that for Force of Will, Wasteland, Flooded Strand, Tundra, and so forth and the whole pretense of objectivity would look like a sham. :-D

-B

Samantha: “Matt, the deck lost to Merfolk.”
Matt: “I don’t build decks to beat Merfolk. In any format.”

If I understand correctly, the aspiration here would be to find a community supported method of objectively measuring performance such that we could mathematically help determine banned and restricted list policy.

In brief, that is not the goal of the thread or the OP. Rather, it is an attempt to discuss and search for the best set of metrics (and to identify the limitations of others) for evaluating deck performance. One possible application of such a metric is to inform DCI policy, but that is not the only application.

Now that we have more metrics than ever before, such a conversation is valuable in its own right, so that we can discuss the advantages and disadvantages of various measuring sticks.

A non-exhaustive few problems stand out: 22 years after opening my first Revised starter, I still have no idea who or what precisely comprises the "DCI." Since I have no idea how these decisions are actually made, or who performs them, I'm hesitant to believe that even with a "perfect" algorithm, this mysterious decisionmaking body would substitute the findings for its own procedures and judgment (or lack thereof, as some may criticize).

Those are two separate problems by my count. Who or what constitutes the DCI is not the same issue as whether any set of metrics would be used by the DCI. But in any case, it shouldn't undermine the search for better metrics. By the same token, no one knows exactly what kinds of metrics go into the Federal Reserve's management of monetary policy, but data like unemployment rates and inflation rates (CPI, etc.) are included. Quantitative information does not mean the objectification of policy making; the policy makers still have to figure out how to weigh all of the data they have.

It's already been established that "fun" is a factor taken into consideration when changing the B&R list and this seems to on its face preclude the use of a strictly mathematical determination.

No doubt subjective information plays a role in DCI decision making. But that should not preclude the search for better metrics (or a discussion on the merits of existing ones) for measuring deck performance.

I agree with the DCI that subjective factors and factors existing outside of tournament results should play a role as we're talking about an enterprise whose goal is maximum enjoyment by the human beings who participate. Additionally, I believe the secondary market should be an acceptable factor for consideration. Restricting something like Mishra's Workshop for instance causes a great deal of tangible harm to community members who are our friends, teammates, TO's, and so forth. That's not something we can or should easily brush aside.

Focusing on performance of individual cards is likely to yield seemingly mathematically incontrovertible conclusions like "Polluted Delta should be restricted." Since we know intuitively, that is "wrong," we'd apply an asterisk to set the result aside. And then we'd do that for Force of Will, Wasteland, Flooded Strand, Tundra, and so forth and the whole pretense of objectivity would look like a sham. :-D

-B

Yeah - I think your points flow from the mistaken assumption that the purpose of this thread is the narrow goal of identifying a perfect measurement that can then solve DCI policy making. No such measure exists. But it is important to have a discussion on the advantages and limitations of existing metrics, and to search for better approaches in the era of big data.

Just as an additional example, I've compiled a more complete breakdown for the Top 8 of the NYSE results, in descending order:

Shops: (2 * 157)/(17 * 8) = 2.309
Dredge: (1 * 157)/(11 * 8) = 1.784
Gush: (4 * 157)/(51 * 8) = 1.539
Eldrazi: (1 * 157)/(17 * 8) = 1.154

So from this we know that there are 2.31 times more Shops decks per capita in the Top 8 than showed up to the event, 1.78 times more Dredge decks per capita, 1.54 times more Gush decks per capita, and 1.15 times more Eldrazi decks per capita.
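As a sanity check, those scores can be recomputed in a few lines (the counts are taken from the breakdown above; the dictionary layout is just for illustration):

```python
# NYSE Top 8: (Top 8 appearances, field appearances) per archetype,
# out of N = 157 players and a Top X = 8 bracket.
decks = {
    "Shops":   (2, 17),
    "Dredge":  (1, 11),
    "Gush":    (4, 51),
    "Eldrazi": (1, 17),
}
N, X = 157, 8

scores = {name: (a * N) / (b * X) for name, (a, b) in decks.items()}
for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name}: {scores[name]:.3f}")
# Shops: 2.309, Dredge: 1.784, Gush: 1.539, Eldrazi: 1.154
```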

@brianpk80 Though Steve's goal is not quite as you described, I completely agree with all of those points. The only reason I agree with 3.5 is that I don't want to see a mass exodus from Vintage like what happened in 2008. I would personally prefer that the secondary market not be a factor in their considerations if it were safe to do so; it seems like having that constraint could work against creating the best Vintage format possible.

Your point about Polluted Delta is one I think a lot of people miss in general. It's easy to notice the occasional turn-1 Tinker for Blightsteel Colossus because the result is dramatic, but I suspect that basing restriction purely on performance as gauged by popularity would result in a Vintage format very different from what we have today. It's hard to create an objective argument against restricting cards like Polluted Delta or Underground Sea if you base your performance metric on how frequently they show up. I wonder how each of them would score using the formula. The difference between Polluted Delta and Gush, for example, is that the prevalence of Polluted Delta gets ignored as acceptable while the prevalence of Gush does not. This seems like an oversight caused by some psychological aspect of the audience.

If I had a complete set of decklists in some kind of easily manipulable spreadsheet for some large events, I might be able to see how each individual card in each event's metagame scores according to the formula. This sounds like a daunting task, though, especially if I have to do all of that data entry. For the namesake cards of each deck it's easy to calculate their score from the metagame breakdown, because the namesake card is used in the deck name, but for other staples such as duals and fetchlands it's not obvious without decklists.

Another interesting point about evaluating based on frequency of appearance or metagame presence: it's plausible that large numbers of people could just be wrong about what they should bring, and that there are so many of them that the likelihood of their success is increased. Metagame presence is definitely an important measure, but I don't think it carries any real weight on its own without a performance metric to show that those decks are also performing at a certain level, rather than just being in great abundance. It's a measure of what people were betting on, rather than a measure of what performed well in an event.

Yeah - I think your points flow from the mistaken assumption that the purpose of this thread is the narrow goal of identifying a perfect measurement that can then solve DCI policy making. No such measure exists. But it is important to have a discussion on the advantages and limitations of existing metrics, and to search for better approaches in the era of big data.

Oh I know that wasn't the explicit or avowed purpose of discussing which metrics should be used for determining [fill in the blank], but I sense that is the underlying interest that drives these discussions.


It's certainly a part of it, but I think there is another major driver: the availability of more and different data sources than ever before. It's not just that we have more data sources; we also have more kinds of data: 1) more total metagame data, 2) win percentages, and 3) Daily Event reports, which aren't commensurate with Top 8 data sets.

Given the multiplicity of data sets, there is a need to discuss and explore what the data means, what our measures tell us, and how we should think about them. I think that's the larger driver here.

FWIW, this isn't a problem limited to Magic. In almost every field, there are new (old) debates emerging over data largely because of the availability of new data sets.

@Aaron-Patten Something I would like to say about the classification of decks: while the cards in a deck are important, I think how the deck plays is the best way to classify it. The best example I can think of is that a Sylvan Mentor deck playing a single Mana Drain doesn't become a Mana Drain deck. I think that's what you meant by one of the more involved ways, but I still wanted to put that out there.