Modern MTGO Deep Dive: Win Rate Analysis (Part 2)

In last week’s article, we took a deep dive into the world of unpublished MTGO dailies, capturing performance by both the 4-0/3-1 decks and the 2-2 or worse finishers. This dataset gives us a chance to look at a deck’s performance throughout an event, not just in the winner’s bracket. MTGO statistics in action! We saw the power of Abzan Liege, Amulet Bloom, and RUG Twin. We also broke down the statistical significance of all deck performance in the tier 1 and tier 2 categories, getting a top-level view of how decks stack up against each other.

In this week’s article, I want to dig a little deeper and highlight some of the decks that did not make the tier 2 cut. Our Top Decks page gives a list of both the decks and the criteria used to decide on tiering, and none of the decks I’m discussing today managed to meet that tier 2 definition. But as we will see soon, the data suggests they are all formidable contenders. Or, in a few cases, they might be the exact opposite!

Going back to our test of significance approach, we will break down the match win percentages (MWP) of a variety of decks on the Modern MTGO scene. This analysis will point to decks that are likely overperformers or underperformers online. As with all analysis like this, there will be a number of limitations and qualifications that must underscore our work. But if we are aware of the effect of small sample size and a non/semi-random sample, we can still get a lot of good information from this analysis of Modern MTGO deck performance. If you’re like me, you got into Modern to play all the interesting decks existing further down the tiering spectrum, decks like Soul Sisters, Mono U Tron, Esper Mentor, etc. Today, we are going to see how such decks are performing.

Also, a big thanks to the guys at MTGSalvation, especially pizzap, for their work on getting this dataset put together. Without their work, we’d still be stuck on public data from the mothership.

Dataset Description

We are using the same dataset described in the article from last week. The data covers 11 MTGO events from 3/24 to 4/12. The average event size was about 70 players and there are roughly 750 decks captured, covering thousands of individual matches between those decks. The big difference between this dataset and those we have looked at in the past is the inclusion of 2-2 or worse decks, not just 4-0/3-1 ones.

As awesome as this dataset is, it still has some important limitations we need to keep in mind. For one, this is by no means a complete population of all MTGO events in the time frame. It’s just 11 dailies over a 2-3 week period, which is actually less than an event per day. Although our deck and match Ns are very large, the event N is quite small, so we need to understand that in analyzing the numbers. Perhaps more importantly, it’s not even a random sample. Dailies were just observed whenever my partners and I had time, which is only semi-random at best. These limitations (and others I am sure you can think of; data entry errors are always at play in big, manually recorded datasets) do not undercut the value of what we are doing, but they do force us to consider their effect on our conclusions.

Additionally, there were a few dailies added to the dataset in the last 24 hours that might modify the final numbers when all analysis is said and done. Because those dailies were added so recently, they haven’t been fully input into the overall dataset, so it’s very possible their addition will alter some of our conclusions by the time we next revisit the numbers. In samples with a small N, the effect of 1-2 events can be quite noticeable.

Finally, as with all other articles of this kind, all statistics and data analysis disclaimers apply!

Overperforming Decks

I don’t want to go too in-depth on methods, but here’s a quick recap from last week’s article. I’m comparing the MWP of a single deck to the overall weighted MWP of all decks in the dataset. The goal is to see if any one deck’s MWP is above, below, or within the expected MWP variance for all decks. Just as a fair coin won’t always flip 5 heads and 5 tails, so too might an average deck under- or overperform in only a few events. The more times a deck appears, the more likely we are to have its “true” MWP. Following this analysis, all decks receive a P value (ranging from 0 to 1) indicating how likely we would be to see a result at least this far from the overall MWP if the deck were truly average. A high P value suggests the deck is well within expected variance. A low P value, particularly lower than .1 or .05, suggests it is outside of expected variance and is actually above or below average. Note: a low P value does not necessarily denote an overperforming deck. It just means the deck is probably outside of the expected variance, in either direction.
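To make the coin-flip analogy concrete, here is a minimal sketch of one way to get such a P value: an exact two-sided binomial test on a deck’s match record. This is an illustration only. The article does not say which test was actually used, so the choice of test, the function names, and the 50% baseline are all assumptions, and the numbers it produces will not necessarily match the P values reported here.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k wins in n matches at a true win rate of p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p_value(wins: int, matches: int, baseline: float = 0.5) -> float:
    """Exact two-sided binomial P value (assumed test, not the article's).

    Sums the probability of every outcome that is no more likely than the
    observed one, assuming the deck's true win rate equals the baseline.
    """
    observed = binomial_pmf(wins, matches, baseline)
    # Small relative tolerance so floating-point ties still count.
    cutoff = observed * (1 + 1e-9)
    total = sum(
        prob
        for prob in (binomial_pmf(k, matches, baseline) for k in range(matches + 1))
        if prob <= cutoff
    )
    return min(1.0, total)


# Fair-coin example from the text: 5 heads in 10 flips is perfectly ordinary.
print(two_sided_p_value(5, 10))   # about 1.0
# 10 heads in 10 flips is far outside expected variance.
print(two_sided_p_value(10, 10))  # 2 / 1024, about 0.002
```

Swapping in a deck’s match wins and total matches, plus the dataset-wide MWP as the baseline, gives a rough feel for how quickly the P value drops as N grows at a fixed win rate.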

Last time I gave some tables for tier 1 and tier 2 decks. Because there are so many more tier 3 or lower decks in the dataset, there isn’t a lot of point in looking at all the different entries on that list. Instead, I want to get right to the good stuff and just highlight the overperforming decks. We’ll go from least overperforming to most overperforming, building up to the awesome and surprising winner.

Grixis Twin (N = 7 ; MWP = 60% ; P = .285)

This version of Temur Twin, powered by Tasigur, the Golden Fang, hasn’t enjoyed the same success as another Grixis deck on MTGO, but it has still seen some play and some strong performances. Whether packing Tasigur in the main or shipping him to the board to rely on Terminate and company in the maindeck, Grixis Twin has a promising trajectory. No, it hasn’t reached a significant P value yet. No, its N isn’t as large as we would like. And no, it isn’t doing quite as well as Temur Twin, which has just as many entries in the dataset but a higher MWP (66.67%) and a lower P value (.085). But there is enough overlap between the two decks that I am willing to consider this a promising riser.

In essence, both decks are running the Twin/Deceiver Exarch combo alongside a tempo-oriented backup plan. This was good enough for Temur Twin to give it a pretty sizable MWP lead over other decks, as well as one of the smallest P values in the dataset. This suggests we aren’t totally off our rockers in thinking Grixis Twin could enjoy similar success with a similar gameplan. We also know the Grixis color pairing is enjoying lots of success on MTGO right now elsewhere (go, go Delver of Secrets!), which further suggests another Grixis deck could be successful for similar reasons. All of this points to us viewing the Grixis Twin MWP and P value more favorably than we might otherwise, because it is situated in a context where both Grixis colors and Tempo Twin builds are successful.

One reason I believe Grixis Twin is enjoying less success than Temur Twin is deck maturity. Temur Twin is a well established deck with considerable historical success. Grixis Twin had a few appearances at PT Fate Reforged but otherwise doesn’t have the same kind of established foundation we see with Temur Twin. This is evidenced in the decklists themselves, which vary between Tas in the main and Tas in the board, as well as different black card ratios throughout the whole 75. So although I think there is something promising in Grixis Twin, we aren’t quite there yet.

Esper Mentor Midrange (N = 5 ; MWP = 70% ; P = .107)

If you caught my article back at the end of March, it’s no secret I love Esper Mentor Midrange. The deck is cool even without a statistically significant win rate, so imagine my happy surprise when I saw it would make this list. Let’s get that small N out of the way before we go any further. N = 5 is not what we want to see to make a firm conclusion about a deck’s performance. So this is definitely something we need to revisit with more data. The big reason I am not as worried about the N = 5 value is the deck’s past performance in other events. Between its finishes in Japan, in past weeks on MTGO, or even its T8 appearance at SCG Providence just this weekend, we know this deck has legs. It would be a different story if our analysis identified some low-N deck like Zombies, Five Color Humans, or Tooth and Nail as a high performer. But it’s something else entirely when we already have datapoints suggesting the deck is good. This just confirms it.

We can take a closer look at the Esper Mentor matchups to get a sense as to why this deck might be a true MWP overperformer and not just sneaking in off a small N. Esper Mentor had a pretty typical distribution of wins and losses against different decks, but one matchup stood out over the rest: UR Twin. The deck was 5-0 against UR Twin, going 2-1 in three matches and 2-0 in two more. Any deck batting a 5-0 against UR Twin should catch our attention. Although it’s possible the Esper Mentor players were just facing the five worst Twin players on MTGO, this isn’t suggested by the data. Three of those UR Twin matchups came in round 3 of the daily, against opponents who would themselves go on to 3-1 finishes. So it wasn’t as if Esper Mentor just played against inexperienced Twin players. Of course, five matches isn’t exactly the kind of statistically commanding sample we want to see, but it’s enough that we should look at the deck to see if anything might explain the wins.

And honestly, the 5-0 record against Twin makes sense. Esper Mentor packs a number of efficient clocks Twin can’t reliably burn to death. It also has as many as 8 hard removal spells (between Path, Murderous Cut, and Slaughter Pact), along with countermagic and hand disruption. That’s exactly the sort of nightmare anti-Twin package we would expect of a deck beating Twin. All of this suggests the deck is actually an overperformer, and I expect more datapoints to confirm this rather than challenge it. And a lot of this is probably on the back of the deck’s Twin performance.
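One way to put a number on the small-sample caveat for that 5-0 record is a confidence interval on the matchup win rate. The sketch below uses the Wilson score interval, which behaves sensibly for tiny samples and for perfect records. It is my own illustration rather than part of the original analysis, and the function name is hypothetical.

```python
from math import sqrt


def wilson_interval(wins: int, matches: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate (z = 1.96 is roughly 95%)."""
    if matches <= 0:
        raise ValueError("matches must be positive")
    p_hat = wins / matches
    denom = 1 + z**2 / matches
    center = (p_hat + z**2 / (2 * matches)) / denom
    half_width = (
        z * sqrt(p_hat * (1 - p_hat) / matches + z**2 / (4 * matches**2)) / denom
    )
    # Clamp to the valid range for a probability.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Esper Mentor's 5-0 match record against UR Twin, from the text.
low, high = wilson_interval(5, 5)
print(f"plausible matchup win rate: {low:.1%} to {high:.1%}")
```

For a 5-0 record the lower end lands somewhere in the mid-50s, so the data fits anything from a slightly favorable matchup to a lopsided one. That is why the decklist reasoning above carries so much of the argument.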

Soul Sisters (N = 14 ; MWP = 62.5% ; P = .036)

It’s weird enough Soul Sisters is one of the highest performing decks in the entire dataset. It’s way weirder that it’s actually THE HIGHEST performing deck in the entire dataset, beating out Amulet Bloom, Abzan Liege, Infect, and a huge range of other decks we would expect to rock in Modern. Astute MTGO players might have some suspicions about why this is the case (hint: it’s about matchups, not about statistics), but we’ll turn to that in a moment. For now, let’s just relish the fact that this budget, Goyf-less, fetch-less, Twin-less deck is such a rockstar. It’s not just the deck’s high win percentage with a respectable N. It’s that it is the only deck in the entire dataset to clear the magical .05 cutoff for statistical significance. As you might remember from your high school/college stat classes, the 95% confidence level is a sort of magical (and, admittedly, semi-arbitrary) point all P values aspire to cross. Soul Sisters is the only deck in Modern MTGO to do this, at least with the data we have so far.

The painfully obvious reason for Soul Sisters’ success is its Burn matchup. Surprising absolutely no one, Soul Sisters tore it up against Burn on the MTGO scene. We saw Soul Sisters battle Burn in 7 different matches, going 7-0 overall, split between five 2-1 finishes and two at 2-0. Some pros and Modern experts have observed Soul Sisters doesn’t always have the strongest Burn matchup, and although that might be the case at a high level event like a PT, it is definitely not the case on MTGO. Against your average MTGO Burn player or grinder, Soul Sisters is just as dominant as we would think. But as you might expect, Soul Sisters would never be able to get to a 62.5% MWP off of the Burn matchup alone. By even the most generous assessment, Burn is only about 10-12% of the MTGO metagame, so you can’t bank on beating Burn alone. So how did Soul Sisters do it?

In my last metagame update, I talked about two rising decks: Affinity and Grixis Delver. I predicted Affinity would actually surpass Burn as the format’s premier aggro deck, and I guessed Delver would keep rising. Both of these predictions are more or less coming true, and that’s the best possible news for Soul Sisters, which has an awesome matchup in both contests. In the dataset, Soul Sisters was 6-0 against Affinity (two at 2-0, four at 2-1) and 4-0 against Grixis Delver (two at 2-0, two at 2-1). You probably can’t climb to a significant MWP on the back of just the Burn matchup, but if you are also succeeding against Affinity and Grixis Delver, that’s a much safer gamble. Those three decks collectively make up 25% or more of MTGO, and the data suggests Soul Sisters is strongly favored in all three. Who cares if the deck is bad against Amulet Bloom (0-4) and not great against Twin (2-4)? When you have a 100% match win rate against three of the biggest decks in the format, even if only in this small sample size, that’s suggestive of a very well positioned deck.
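The arithmetic behind that argument can be sketched in a few lines. The first function asks what win rate Soul Sisters would need against the rest of the field to land on its overall MWP, under the simplifying assumption (mine, not the dataset’s) that a deck meets opponents in proportion to their metagame share. The second asks how likely each sweep would be if the matchup were really a coin flip. Function names are hypothetical.

```python
def required_rest_of_field_mwp(
    overall_mwp: float, favored_share: float, favored_mwp: float
) -> float:
    """Win rate needed against everything else to reach overall_mwp.

    Solves: overall = share * favored_mwp + (1 - share) * rest_mwp.
    """
    if not 0 <= favored_share < 1:
        raise ValueError("favored_share must be in [0, 1)")
    return (overall_mwp - favored_share * favored_mwp) / (1 - favored_share)


def sweep_probability(matches: int, win_rate: float = 0.5) -> float:
    """Chance of winning every one of `matches` matches at a given win rate."""
    return win_rate**matches


# Figures from the text: 62.5% MWP, about 25% of the field, 100% in those matchups.
rest = required_rest_of_field_mwp(0.625, 0.25, 1.0)
print(f"needed against the rest of the field: {rest:.1%}")

# Sweeps from the text, treated as coin flips for the sake of argument.
for deck, matches in [("Burn", 7), ("Affinity", 6), ("Grixis Delver", 4)]:
    print(f"{deck} {matches}-0 at 50/50: {sweep_probability(matches):.2%}")
```

Under those assumptions, merely breaking even against everything else is enough to reach 62.5%. The sweep probabilities should be read with care, since these matchups were picked out after the fact precisely because they were lopsided.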

Underperforming Decks

With the exception of Scapeshift, none of the decks in last week’s tier 1 and tier 2 MWP analysis were even close to significantly lower than the MTGO average. Scapeshift was pushing it, with an average MWP of 39.33%, but it wasn’t significant at any noteworthy level (p=.635). But once we push down the tierings, we start to see some underperforming decks, not just overperformers. This doesn’t mean these are bad decks! It just means there are some factors preventing these decks from succeeding right now. Part of this might just be low card/deck quality, but it is more likely a combination of other factors such as pilot bias, hostile metagame, etc. So just because a deck shows up here, it doesn’t mean you need to abandon ship on it if you thought it was decent.

Storm (N = 14 ; MWP = 38.1% ; P = .273)

Poor Storm. First it loses Seething Song and it all but disappears for a year. Then it comes back after DRS follows Song, to prompt cries for a Manamorphose banning (a top 5 contender for “worst ban suggestions in Modern actually made”). Then it starts to come back again during the TC era, only to drop back to the bottom in recent months. Storm players have been trying to get this deck to work for years. When it works, it’s as if it works too well and Wizards hates it. When it doesn’t work, it really doesn’t work, and that’s the trend we are seeing right now on MTGO. In defense of Storm and Storm players, it’s not like our N is large enough to unconditionally condemn the deck. Nor is the P value low enough to prove the deck is “bad”. But these are not the kind of numbers we want to see of a successful deck, so we need to look at them and ask what’s going on. It would be one thing if we were finding these numbers for a deck that otherwise appears to be doing fine. But when we find these numbers for a deck that appears to be struggling on other counts, we need to pay attention.

Looking at the different Storm matchups in the dataset, we don’t see a lot to explain Storm’s lower performance. It’s 1-2 against Affinity, 2-3 against UR Twin, 2-3 against Burn, and has similar sub-50% MWPs against most other decks in the format. Which is to say, it’s not like Storm has some bad matchups bringing it down. Rather, Storm isn’t matching up well against a lot of decks. Grixis Delver appears to be even worse than the average, with Storm falling 1-5 to Delver decks over the 11 events. So that might be a legitimately bad matchup, which makes sense given Storm’s gameplan. But otherwise, it looks like Storm is generally struggling, not struggling because of some specific metagame context. I’m not even sure this is attributable to pilots, because only one pilot has actually returned to Storm in our dataset. Otherwise, it’s 14 appearances with 13 unique players. I’m not willing to go so far as to say Storm is a bad deck. After all, it has about a 1.7% overall metagame share, which it couldn’t get if it was just terrible. It probably just has a high skill cap, and perhaps many MTGO players who can master Storm might be trying their luck elsewhere.
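For readers who want to run the same kind of matchup breakdown on their own results, here is a minimal sketch that tallies the four Storm records quoted above. Only those four records come from the article. The data layout and function name are assumptions for illustration.

```python
# Storm's match records from the text, as (wins, losses) per opposing deck.
storm_records = {
    "Affinity": (1, 2),
    "UR Twin": (2, 3),
    "Burn": (2, 3),
    "Grixis Delver": (1, 5),
}


def win_rate(wins: int, losses: int) -> float:
    """Match win percentage as a fraction; 0.0 if no matches were played."""
    total = wins + losses
    return wins / total if total else 0.0


for opponent, (wins, losses) in storm_records.items():
    print(f"{opponent:<14} {wins}-{losses}  {win_rate(wins, losses):.1%}")

# Pool the listed matchups to see how they compare as a group.
total_wins = sum(wins for wins, _ in storm_records.values())
total_losses = sum(losses for _, losses in storm_records.values())
overall = win_rate(total_wins, total_losses)
print(f"{'Combined':<14} {total_wins}-{total_losses}  {overall:.1%}")
```

With only three to six matches per opponent, each row is noisy on its own, which is why the pooled record is more informative than any single line.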

Mono U Tron (N = 13 ; MWP = 35.71% ; P = .172)

But but but, isn’t Mono U Tron supposed to be “MTGO’s Best Deck”? Didn’t it have the highest GWP in that public dataset, even if the results weren’t statistically significant? I still believe Mono U Tron is a great deck rewarding tight play and experienced pilots. But Mono U Tron, like Storm, has a pretty questionable (at best) performance record on MTGO right now. Also like Storm, it doesn’t appear as if any single matchup is bringing the deck down. It’s just suffering from poor performance across the board. But unlike Storm, it actually has a smaller metagame share, which suggests it might be even worse overall. Not only is the MWP analysis suggesting a low-performing deck, but so too is its overall metagame share. That could be a bad sign for pilots who rely on one of Modern’s more traditional control decks.

As I see it, the key difference between Storm and Mono U Tron is deck presentation. Storm is presented as a tricky combo deck with a high skill cap. It is also often presented as a deck hammered by bannings and hated by Wizards. All of these qualitative factors influence the type of player picking up Storm. Mono U Tron is very different. It is often presented as the budget deck to get into MTGO. It’s cheap, relatively competitive, doesn’t require fetchlands/Goyfs/Hierarchs/Commands/etc., and has the rogue feeling to it that still makes you seem like you are playing a neat Modern deck. These are all highly subjective classifications, but I think they match nicely with most MTGO players’ sense of this deck. This suggests players who pilot Mono U Tron might be new to the client and/or the format, people who want to get into Modern but want to make a smaller investment before going too crazy. If so, these players might be less experienced and less prepared for the format, which would definitely explain the deck’s lower win rate. Indeed, this is supported in the data, where we see regulars like shoktroopa still racking up a number of 4-0/3-1 finishes, but a lot of one-time players dropping to 2-2 or worse. So in the case of Mono U Tron, I don’t think we are looking at a bad deck. We are just looking at a deck that appeals to less experienced and newer players, particularly for budget reasons.

Next Steps

As with any dataset like this, the most important next step is to keep updating the dataset and keep revisiting past conclusions. Small N datasets can be necessary evils in in-depth data analysis, especially with the sheer quantity of information you can glean from just one MTGO daily. Adding events might change our earlier findings. For instance, maybe Amulet takes a big performance fall in 1-2 events. That would definitely bring down its overall MWP and probably raise its P value well outside the range of statistical significance. Or maybe a deck pushing the upper end of average variance (say, Infect) enjoys a string of daily successes. That could get it into the significant range. So we are going to need to revisit all of the past articles on this topic as the dataset evolves.
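To see how much one or two events can move a small-N deck, here is a rough sketch using entirely hypothetical records (they are not taken from the dataset). It compares what a single 0-4 daily does to a deck with 20 recorded matches versus one with 200.

```python
def mwp(wins: int, losses: int) -> float:
    """Match win percentage as a fraction of matches played."""
    return wins / (wins + losses)


def add_event(record: tuple[int, int], event: tuple[int, int]) -> tuple[int, int]:
    """Combine an existing (wins, losses) record with one more event."""
    return record[0] + event[0], record[1] + event[1]


bad_daily = (0, 4)  # hypothetical 0-4 finish in a four-round daily

# Hypothetical small-N deck: 14-6 over 20 matches (70% MWP).
small = (14, 6)
small_after = add_event(small, bad_daily)

# Hypothetical large-N deck with the same 70% MWP over 200 matches.
large = (140, 60)
large_after = add_event(large, bad_daily)

print(f"small N: {mwp(*small):.1%} -> {mwp(*small_after):.1%}")
print(f"large N: {mwp(*large):.1%} -> {mwp(*large_after):.1%}")
```

The same bad day costs the small-N deck more than ten points of MWP and the large-N deck under two, which is the practical reason the conclusions here need revisiting as events are added.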

Perhaps more importantly, we need to check these quantitative findings against our qualitative experience with the decks. Because the dataset only encompasses 11 events (13+ after it is updated) in a 2-3 week period, it is very much just a snapshot of the metagame. The best way to shore up such a snapshot, apart from adding more datapoints, is to compare our quantitative-based conclusions to our personal experience in events. This can either confirm or challenge a finding.

In a few weeks, with some more events added to the dataset, I’ll come back to all of the findings in both part 1 and part 2 of this article and see how they are holding up. And if anything new has emerged, we’ll take the time to look at that too. So get on out there with your favorite deck, take down those dailies, and rack up some wins to boost up those MWPs!

Sheridan is the former Editor in Chief of Modern Nexus and a current Staff Author. He comes from a background in social science data analysis, database administration, and academia. He has been playing Magic since 1998 and Modern since 2011.

9 thoughts on “Modern MTGO Deep Dive: Win Rate Analysis (Part 2)”

Awesome stuff. I’m not surprised to see Sisters performing as well as they have, because of (as you astutely noted) excellent matchups against most of the aggro decks in Tier 1. My curiosity is piqued by your comment in the Esper Mentor Midrange section regarding decks like 5C Humans, Zombies, and Tooth and Nail. Would it be productive to maybe slap several of those decks with broad-strokes labels, such as “Rogue Aggro”, and then see if there’s anything there to be derived? I’m just thinking out loud here, but maybe if the “Rogue X” share does well enough, it’s a handy reminder to not get metagame tunnel-vision in a format with such a large card pool.

That’s not a bad idea at all. I think there are probably a few ways to pool small N decks that would make sense. As long as the rogue decks had relatively similar plans, even if their cards weren’t the same, we could probably group them without too much danger. I’ll look into it and see if there’s anything worth reporting!

1) I’ve been surprised and impressed by the quality of content here. The bold claim of “The Premiere site for Modern” has actually been substantiated nicely.

2) Do the rules of the /r/spikes forum allow posters to reference your articles to springboard discussions? I know that content creators can’t spam their work onto reddit or they get banned, and I’m not sure how you guys feel about potentially moving discussion of articles onto a forum not your own.

We fully encourage people to post/link our content to reddit as appropriate. We’re allowed to submit it ourselves provided we are active redditors and not just posting and leaving (though even then, some people get away with it). We submitted our “launch post” there and articles have been submitted by others, all to great reception, so please do!

For me the next step (in addition to increasing the data size) would be differentiating between matchups. I don’t particularly care about my overall winrate, but about my winrate against specific decks. What is the winrate between Twin and Esper Mentor? How hard does Merfolk actually lose to Affinity? etc.

We can go even further:
Winrates on the play, winrate on the draw in each matchup.
Winrate Preboard, winrate Postboard.

And NOW I will look for Esper Mentor lists, because you got me hooked… Damn YOU!

Spoiler alert! This is definitely the next step and articles are in the works that will discuss this. My article for today, publishing in about 1.5 hours, does just this for one of the format’s coolest new decks. Sample size starts to become an issue when calculating matchups, but as long as a deck has at least 10 or so appearances, we can get some sense of MWPs against specific decks. I also try to incorporate this data in the explanations for why decks are doing well, so you can see some of this interspersed throughout the article.
