The Making of a Metric: Part 1

Simulationcraft now has full support for protection paladin mechanics, and hopefully in a week or two I’ll get around to writing a how-to blog post on using it and interpreting the data. Once I write a few batch files, it should produce all of the DPS results I could generate from the MATLAB FSM code and more.

What it doesn’t have yet is a good smoothness metric with which we can assess our survivability. Of course, that’s not really Simcraft’s fault, because a good smoothness metric doesn’t exist yet.

So if we want a metric, we have to build it ourselves.

Up until now, we’ve looked at data and made qualitative assessments about the results to come up with statements like “X smooths better than Y.” But now we want to quantify that thought, which is a lot more difficult. We want a numerical estimate of how much better X is than Y. And to do that, we need to get a little introspective and think about what we’re doing when we make those qualitative assessments. We need to analyze our process and figure out how to translate it into numbers.

Data

So let’s start with some data. Below are the results of a 10k-minute sim like I usually show on the blog. We’ll use this sample data set for all of the following analysis. The gear sets are variants on the Control/Haste setup – the first is just C/Ha, followed by sets where I add 1000 of a given stat. In the case of hit and expertise, I subtract 1000 since we start at the cap. We’ll use a boss that swings for 350k after mitigation every 1.5 seconds, the standard SH1 finisher priority, back-calculated Seal of Insight with no overhealing apart from inherent, and Sacred Shield enabled.

For now, I’ve narrowed our focus to 4-attack moving averages. In theory this will be equally applicable to any string size, but there’s little point in providing data that we’re not going to use at the moment. And for Simcraft, we’re going to have to choose a default time window for our moving average. The 3- and 4-attack moving averages are the ones we focus on the most, and 4 attacks is closest to the time window I had in mind (5-6 seconds).

This is some Inception-level shit right here…

Now, let’s think about how we analyze this data. Normally we look at the top few categories and draw qualitative conclusions from that. For example, adding 1000 haste vs. 1000 mastery, we see that haste comes in lower (or equal) in spike representation for all rows above 90% player health. On the other hand, we pay less attention to rows that contain a large percentage of attacks across the board, because those are very likely to happen, and reducing the amount is rarely meaningful. If something happens 8% of the time rather than 9%, that’s not a huge change because it’s still going to happen a lot during an encounter, so you need to plan for it happening anyway.

Within those top few categories that we consider, we tend to put heavier emphasis on the larger spikes than the smaller ones. If we can significantly reduce spikes that are 130% of our health, then that’s perceived as a lot more important than an equal reduction in spikes that are 110% of our health, especially if there’s still a sizable percentage of those events.

So according to this data, we would conclude that haste is better than mastery, though not by a huge amount. Dodge and parry are both worse than haste, but stamina is a little better. It’s hard to say how hit/exp fare since we subtracted 1000 points instead of adding 1000 points, so ignore them for now.

Analyzing the analysis

Our qualitative assessment primarily looked at two factors:

Spike magnitude – A 130% spike is more important than a 120%, than a 110%, and so on. We mentally assigned more importance to the largest spikes than the smaller ones. This has to be accounted for in our numerical analysis.

Spike frequency – This one is more complicated, because it’s not as straightforward as “bigger is worse.” We care about how frequently things happen, and we care about changes in that frequency. But not all changes are created equal. Some examples:

If we have 5%-10% representation in a category, those are going to happen unless we eliminate them. Going from 7% to 6% (as we do in the 80% spike category when we add 1000 stamina to C/Ha) isn’t really all that meaningful a change, and shouldn’t be worth a whole lot.

But a representation of 0.002% isn’t that likely at all, and may not even be worth worrying about. Going from 0.002% to 0.001% is probably not that meaningful either even though you’re halving the number of spikes, because those spikes weren’t very likely in the first place.

Reducing a 1% chance to 0.1% would be a meaningful change, because that’s a pretty noticeable reduction from a non-trivial amount. You’re taking something that was very likely and making it fairly unlikely. Similarly for 0.1% to 0.01% – something unlikely to something really unlikely.

Going from 0.01% to 0.001% is the same change (a factor of 10), but maybe not as important because it wasn’t very likely to begin with. Going from 0.001% to 0.0001% is almost irrelevant, because both are so unlikely.

That said, there’s always some comfort in the certainty that an event can’t happen, so there’s almost always some (admittedly nebulous) value in reducing a representation to 0.000%. But it’s obviously more valuable when you reduce a non-trivial chance into 0%.

It’s easier to describe how to do this with some pictures. Below is the overall spike histogram for the haste and mastery gear sets. This is exactly the same data that’s in the table, just in bar plot format, and instead of using bins that are 10% of your health wide, I’m using 2% wide bins. The x-axis is the percentage of your health (expressed in decimal form, so 1.00 = 100%) and the y-axis is the number of events. The distribution is roughly centered around 50% health, with a large spike at 0% due to 4-attack avoidance or full absorb strings. This is pretty standard for these sorts of histograms.

Histogram of the 4-attack damage string data.

However, we don’t look at the whole histogram (in fact, the table only ever shows the top half of it). We look only at the highest-damage parts, which you can’t really see on that plot because the bars are tiny compared to the bulk of events in the middle. So in the plot below I’ve zoomed in on the very top end of the distribution. This figure shows the top 5% of all events – i.e., the 5% of events that have the highest damage value. Relating this to the table, it’s cutting off every row that has a number greater than 5.000 in the C/Ha column (around 82% player health is the cutoff).

A close-up of the top 5% of the data.

We see that haste and mastery are pretty similar here, which isn’t surprising since their data isn’t that different. Haste is a little better, but it’s easier to see that on the table than in this plot because the table uses a coarser binning. Nonetheless, we want to use this plot (or rather, the data it’s showing) to generate a quantitative metric.
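As an aside, isolating the “top X%” of events is just a percentile cutoff on the list of moving-average damage totals. Here’s a minimal sketch in Python; the distribution is invented for illustration (real sim output is not Gaussian), and the variable names are my own:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Stand-in for sim output: one damage total per 4-attack window,
# expressed as a fraction of maximum health. Invented data.
events = [random.gauss(0.5, 0.2) for _ in range(100_000)]

# The "top 5%" of events is everything above the 95th percentile.
events.sort()
cutoff = events[int(0.95 * len(events))]
top5 = [e for e in events if e > cutoff]

print(len(top5) / len(events))  # ~0.05 by construction
```

The same cutoff value is what determines which rows of the table survive (around 82% player health for the real data set).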

The first thought is to subtract the two bar plots from one another. In other words, do something like this:

Histogram showing the difference of the haste and mastery histograms.

We could just sum all of the bins and get a number that represents the difference between haste and mastery. However, that’s ignoring the fact that these events are not equal: the events at 1.3 (130% of your health) are much, much worse than the ones at 0.85 (85% of your health). So what we really want to do is apply a weight function. Basically, we multiply each bar by a number representing how important that bin is, and then perform the sum. You may be familiar with the term “weighted average,” which is exactly the operation we’re performing.

For a simple example, consider the table below. The first two data columns show the representations for the +1000 haste and +1000 mastery gear sets. The third data column shows the difference between these values, just like we’re showing in the differential histogram above. The next column is the weight factor, which represents how much we care about a certain category. The final column is the product of the difference and the weight factor, which gives us a numerical representation of how much “value” we’ve gained or lost in that spike category. And if we sum that column we get the weighted average, which is an overall “score” that tells us how much better or worse haste performs than mastery at smoothing.

Percentile | Haste Set | Mastery Set |   Diff | Weight | Weighted Value
-----------|-----------|-------------|--------|--------|---------------
       80% |     6.320 |       6.214 |  0.106 |   0.25 |  0.0265
       90% |     2.056 |       2.137 | -0.081 |   0.5  | -0.0405
      100% |     0.443 |       0.571 | -0.128 |   1.0  | -0.128
      110% |     0.066 |       0.080 | -0.014 |   2.0  | -0.028
      120% |     0.016 |       0.018 | -0.002 |   4.0  | -0.008
      130% |     0.001 |       0.001 |  0.000 |   8.0  |  0.000
      140% |     0.000 |       0.000 |  0.000 |  16.0  |  0.000
           |           |             |        |   Sum: | -0.178
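The calculation in that table is only a few lines of code. Here’s a sketch in Python, with the numbers copied straight from the table (variable names are my own):

```python
# Representations (percent of events) in each spike category, copied
# from the table above for the +1000 haste and +1000 mastery sets.
haste   = [6.320, 2.056, 0.443, 0.066, 0.016, 0.001, 0.000]
mastery = [6.214, 2.137, 0.571, 0.080, 0.018, 0.001, 0.000]

# Weight factor per category (80% through 140% player health);
# each step up in spike size doubles the weight.
weights = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]

# Weighted average: per-category difference times that category's weight.
score = sum(w * (h - m) for w, h, m in zip(weights, haste, mastery))
print(round(score, 3))  # -0.178, matching the table's sum
```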

In practice we’d do things slightly differently to get scale factors. We would start the same way, by calculating the histograms for a baseline (C/Ha) and for new configurations with 1000 of each stat added. But then we would subtract all of the +1000 sets from the baseline (rather than from each other), and then perform the same weighted average sum to get the scale factors describing how well each stat “smoothed” our damage intake.
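A sketch of that scale-factor procedure looks like the following. Everything here is illustrative: the baseline row is invented, and the weights are the same doubling scheme used in the table above.

```python
weights = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]

# Invented spike representations (percent of events, 80%-140% health)
# for a baseline set and two +1000-stat variants.
baseline = [6.50, 2.20, 0.60, 0.090, 0.020, 0.002, 0.000]
plus_1000 = {
    "haste":   [6.32, 2.06, 0.44, 0.066, 0.016, 0.001, 0.000],
    "mastery": [6.21, 2.14, 0.57, 0.080, 0.018, 0.001, 0.000],
}

def weighted_sum(diff):
    """Weighted average of a differential histogram."""
    return sum(w * d for w, d in zip(weights, diff))

# Scale factor per stat: how much the +1000 set improves on the baseline.
# Positive means the stat reduced weighted spike representation.
scale = {stat: weighted_sum([b - s for b, s in zip(baseline, hist)])
         for stat, hist in plus_1000.items()}
print(scale)
```

With these made-up numbers, haste comes out ahead of mastery, but both come out positive, which is the sanity check you’d want from any stat that reduces spikes across the board.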

Of course, we’ve glossed over the most important part of the problem: what weight function do we use? This is a critical consideration, because the weight function is everything. Get it right and you have a robust metric that works well; get it wrong and you have garbage. This is where our breakdown of the factors we used in the qualitative assessment comes into play. We want to weight the higher spike categories more heavily than the lower spike categories, and we want larger changes to be more valuable than smaller changes.

There are lots of functions we could choose – a simple linear function, the Fermi function we explored when modeling Seal of Insight, and dozens of others. But what felt natural to me was an exponential function. For example, $w(x) = e^{a(x-1)}$, where $a$ is some constant that determines how quickly the function changes and $x$ is the spike size (again, in decimal form, so 100% of your health is 1.0, 90% of your health is 0.9, and so on). For those who aren’t familiar with what an exponential function looks like, here it is:

Exponential weight function with $a=10\ln(2)$.

So for example, the bin corresponding to 100% of your health is worth exactly $w(1)=e^{a(1-1)}=e^{0}=1$. The bin corresponding to 90% of your health is worth $w(0.9)=e^{a(-0.1)}=e^{-a/10}$. This form is a bit unwieldy because it’s hard to translate between the constant $a$ and the behavior of the function. We know that a larger $a$ will make the function steeper and increase the value change between one bin and the next, and that a smaller $a$ will reduce that value change. But it’s not obvious from looking at it what happens to the relative valuation of one spike size to another with an arbitrary change in $a$.

To make this more intuitive, let’s make a substitution. Let $a=10 \ln(h)$. That makes the equation $w(x)=e^{10 \ln(h) (x-1)} = h^{10(x-1)}$. Why does this make it easier? Well, consider what happens if we evaluate $w(x)$ for increments of 10% of your health, as I’ve done on the table below:

Spike size |  50% |  60% |  70% |  80% |  90% | 100% | 110% | 120% | 130% | 140%
-----------|------|------|------|------|------|------|------|------|------|-----
$w(x)$     | 1/32 | 1/16 |  1/8 |  1/4 |  1/2 |    1 |    2 |    4 |    8 |   16
In other words, for every 10% of your health you get a factor of $h$. A spike that’s 10% larger is $h$ times more important, and a spike that’s 10% smaller is $1/h$ times as important. The single variable $h$ controls how much weight a bin gains or loses if it’s one health decade away from 100% health. So we could call it the health decade factor, or HDF for short. If our top events are at 140% health and the smallest event we want to consider is at 90%, the top events will be approximately $h^5$ more valuable than the bottom ones. Very straightforward.
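This substitution is easy to sanity-check numerically. A minimal sketch (the function name is mine):

```python
import math

def w(x, h=2.0):
    """Exponential weight function w(x) = h**(10*(x-1)),
    equivalent to exp(a*(x-1)) with a = 10*ln(h)."""
    return h ** (10 * (x - 1))

# Each 10% of health is worth one factor of h:
assert math.isclose(w(1.0), 1.0)       # 100% health -> weight 1
assert math.isclose(w(0.9), 1 / 2)     # 90% -> half as important
assert math.isclose(w(1.1), 2.0)       # 110% -> twice as important
assert math.isclose(w(0.5), 1 / 32)    # 50% -> 1/32, as in the table

# The two forms of the function agree for an arbitrary x:
a = 10 * math.log(2)
assert math.isclose(math.exp(a * (0.73 - 1)), w(0.73))
```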

The other reason this is good is that it works no matter where you are on the x-axis. If our top events were between 50% and 100% of our health, each bin would get multiplied by a smaller factor, but you still have the same relative $h^5$ weight between the top and the bottom. If we used a different function this might not be the case, and the metric wouldn’t work as well for an arbitrary distribution.

Of course, we still need a good value for $h$ that represents the effects we want. There’s no obvious choice here, either. It is, by its nature, sort of arbitrary. We’re not attempting to model an exact number we can expect to see in game, like DPS or DTPS. We’re trying to come up with a number that represents smoothness in damage intake, but the actual value isn’t that important. What is important is that the value mimics a thorough qualitative assessment. It doesn’t matter whether the value we get is 10 or 100 or 1000 as long as a similar amount of improvement from another stat gives a similar value and a larger or smaller amount of improvement gives a larger or smaller value, respectively.

Meloree and I discussed this at some length and guessed at a value of $h = 2$, which is also what I used for the weight function plot above. The idea here is that a 90% health spike is about half as important as a 100% spike, while a 110% spike is twice as important. To see how this weight function affects the histogram, let’s multiply the entire histogram by the weight function. That gives you the plot below.

Weighted histogram generated by multiplying the raw histogram data by an exponential weight function with $h=2$.

If you compare this to the un-weighted distribution provided earlier, you can clearly see that the weighted distribution gets shifted up by the nonlinear weighting function. The events near 0 are practically worthless, while the events near the top get more valuable. If we zoom in on the top 5% of events again, we get the following plot:

The top 5% of the raw histogram data after multiplication by the exponential weight function.

This has clearly increased the value of the higher-magnitude spikes compared to the lower-lying spikes. Mission accomplished, perhaps?

Well, no, not quite. There are a few problems with this.

First, note that the few events occurring near 1.35 on the x-axis are still not worth very much – far less than the thousands of events that occur at ~0.85. Those events at 0.85 are going to happen; that’s around 1% of all events in that bin alone. I’m not sure they should have so much more weight than the far more dangerous ones at the top.

Second, if you calculate the weighted average of the difference between the haste and mastery data, we get a number that suggests that mastery is better than haste. But our qualitative assessment gave us exactly the opposite answer! What’s going on here? Did we screw something up?

The result seems paradoxical at first until you look a little deeper and realize what’s happening. With this value of $h$ we’re still very, very sensitive to “edge effects.” You get a different answer if you look at the top 5% of events than if you look at the top 4%, top 3%, top 7%, etc. To illustrate that, here are the differential figures for the top 3%, 4%, 5%, 7%, and 10% of all events:

Differential histogram of the top 3% of the raw haste and mastery data.

Differential histogram of the top 4% of the raw haste and mastery data.

Differential histogram of the top 5% of the raw haste and mastery data.

Differential histogram of the top 7% of the raw haste and mastery data.

Differential histogram of the top 10% of the raw haste and mastery data.

You can see that the last bin on the left is generally the largest one, and while these plots are un-weighted, the problem is that the number of events in that last bin tends to increase in size faster than the weight function dies off. To quantify that further, let’s actually calculate some stat weights. Below is a table of the stat weights calculated this way for different cutoffs, starting from the top 1% and going to the top 10% most damaging attacks, along with a final row where we include 100% of attacks (i.e., all-inclusive) for reference. Note that since we’re considering stat weights, a bigger number means it’s a better smoothing stat. I’ve also properly adjusted for the fact that we’re subtracting hit and expertise instead of adding them, which just requires an inversion (e.g. -5922 becomes 5922).

As you can see, the values fluctuate a lot as we change the percentage of attacks we consider. If you go down the haste and mastery rows, you see that they’re swapping places depending on which row you look at. That’s not good. Ideally they would be fairly consistent from row to row. Even if the values change (which is fine), the relative value of the two shouldn’t. But because that last bin is so important, our results depend heavily on what exactly we choose as a cutoff, even though the events near that cutoff are the least important!

Apart from the oddity with haste/mastery, the order is generally Hit>Exp>Stam>(Haste/Mast)>Dodge>Parry. This is about what we expected, so that part at least is good. Dodge beats out parry because of diminishing returns, as in these gear sets dodge is diminished much less. Hit and expertise are both very strong, just as we know they are.

The bottom row includes 100% of events, so it sums the entire histogram. This tends to inflate the value of dodge and parry more. In general, the more exclusive you make the metric, the worse dodge and parry do because they trade the presence of worse spikes for lower average damage taken. Why? Well, when you restrict your view to just the top X% of events, those worse spikes really hurt dodge and parry. As you increase the percentage of events being considered though, dodge and parry start to perform better because it starts adding value to the large mass of events in the middle of the unweighted distribution.

This all leads to the third problem: the choice of bin size matters. I’ve made the plots with 100 bins (between 0% and 200% health so bins of 2% health), but the data table above uses 200 (bins of 1% health). You get slightly different answers with the exclusive metrics because changing the number of bins sometimes shifts a chunk of events into or out of the part of the data you’re considering. That’s not good because again, the events near the bottom of the region we’re considering are supposed to be the least valuable ones, and thus have the smallest influence on the distribution. Being so sensitive to edge effects introduces this artificial dependence on bin size. And we don’t want to get significantly different results just because we altered the bin size slightly.

There are a few ways to try and fix this, but the most obvious to me was to increase the HDF. That increases the weight discrepancy between lower-damage bins and higher-damage bins, making the lower bins less valuable and the higher bins more valuable. So I repeated the calculation for an HDF of 3:

Much better-looking. The variation with percentage is smaller, though we’re still seeing edge effects. If you go down the haste and mastery rows you’ll see mastery jump around relative to haste a good bit. But the increased HDF has created a larger ramp on the weight function, making those edge effects smaller overall. The other interesting thing here is that the 100% (all-inclusive) version is a pretty good average – we’re still seeing the approximate 3-to-1 ratio of haste-to-avoidance value in the 100% row that we’re seeing in the 5% row. This all-inclusive version has another perk – it isn’t subject to edge effects at all, so it’s quite a bit more robust.

Just for completeness, here are the weighted histograms we get with $h=3$:

Weighted histogram with $h=3$.

The top 5% of the weighted histogram with $h=3$.

Of course, 2 and 3 are pretty arbitrary choices for the HDF. To get the best metric, we really need to nail down the best value for $h$. So the next step is to explore what happens when we change $h$ while keeping the percentage of events fixed. When we do this for the top 5%, 10%, and 100% of events, we get the results below:

First, a small HDF tends to keep the scaling between stats small, which means they’re more vulnerable to edge effects. We can see this from the first two tables, as haste and mastery swap positions from one table to the other. We already sort of knew that, but it’s good that the data reaffirms that expectation.

A really high HDF tends to increase that gap dramatically, which reduces edge noise.
On the other hand, a high HDF does something strange to dodge and parry. As you can see, the stat weights of dodge and parry both plummet, and the dodge value even becomes negative at one point.

Your first thought might be to rationalize these results. For example, dodge and parry tend to give a wider distribution, allowing for larger spikes but reducing overall damage taken. But remember, we’re not comparing a dodge/parry gear set to a control gear set here, we’re adding 1000 dodge or parry to it. There should be absolutely no circumstance where adding 1000 dodge actually makes your smoothness metric worse! It should always make it better, just not necessarily as good as other stats.

In fact, it’s a different issue entirely. If you look at the source data, adding 1000 dodge reduces spike presence in every category except one: the very top one, 140%, where all of a sudden we have a 0.001 instead of a pure zero. This isn’t because adding dodge suddenly created higher-damage spikes; it’s simulation noise. That’s what’s causing the huge drop in dodge’s value here, and a similar (though less pronounced) effect is happening in the parry data. This is bad: it essentially means that our HDF is so large that it’s becoming super-sensitive to simulation noise.

So we’re stuck between a rock and a hard place. If the HDF is too high it causes simulation noise problems, but reduces our edge-effect issues. But if it’s too low we have the reverse: edge-effect noise problems, but low sensitivity to simulation noise in the higher categories. One solution is to just simulate longer and reduce the noise, but that’s not a great solution either. Ideally, we want to pick an HDF somewhere in the middle, so that we get good data fidelity and low sensitivity to noise so that we don’t have to simulate for hours at a time.

For the tables where we look at only the top 5% and 10% of spike events, it’s clear that $h=2$ is too low. 2.5 is on the border of what I’d deem acceptable, but even that is a little volatile. On the high end, 3.5 is pretty good, but is right on the borderline of the region where simulation noise is a problem. But probably our ideal value lies somewhere between 2.5 and 3.5. Nominally, I’m going to say it’s around 3.0, because that data seems pretty consistent between both tables. But I have to do a more detailed analysis of this range to really feel comfortable picking a final value.

However, there’s also another solution to consider. The table that uses 100% of the data eliminates boundary issues entirely because it includes all of the data, and the weight function makes sure that each category is worth less as we work our way from higher to lower damage ranges. That makes it more stable at low $h$-values than either of the percentile-based versions. I’d want an HDF larger than 2, because if it’s too low it risks watering down the strength of small changes near the top end and over-valuing avoidance. But it’s even passable at $h=2$ in this table, unlike the percentile versions.

From considering all of this data, I’m leaning towards defining the metric using an all-inclusive histogram with a moderately high HDF, probably around $h=3$. However, we have a lot of further testing to do before we settle on final values. And of course, we’ll want to implement some sort of normalization scheme such that the values are independent of the number of iterations we use. I’ll detail all of that in the next two blog posts.

What’s in a name?

However, before we end, I want to offer a parting thought about the greater applicability of this metric, and suggest a name for it. While I’ve framed this entire discussion in terms of subtracting histograms from one another, the way you would likely do this in a computer is a little different. The distributive property tells us that $a(b-c) = ab - ac$. That means it doesn’t matter whether we subtract the two histograms first and multiply by the weight function afterwards or vice versa.

This is actually really good news, because it means you could multiply the weight function times the histogram for one experimental configuration (e.g. gear set, talent combination, glyphs, etc.) to get a single number. Then you could repeat that process for any other gear set or configuration you want and get a new number. And you’ll get a unique number for each configuration that describes the smoothness of your damage intake under those particular conditions.
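That per-configuration calculation can be sketched in a few lines. The function name and the histogram values here are mine, invented for illustration:

```python
def smoothness_index(hist, bin_centers, h=3.0):
    """Collapse a spike histogram to one number: each bin's event count
    times its exponential weight h**(10*(x-1)), summed over bins."""
    return sum(n * h ** (10 * (x - 1)) for n, x in zip(hist, bin_centers))

bins = [0.8, 0.9, 1.0, 1.1, 1.2]                  # bin centers, fraction of health
haste_hist   = [6.32, 2.06, 0.44, 0.066, 0.016]   # invented event counts
mastery_hist = [6.21, 2.14, 0.57, 0.080, 0.018]

# One index per configuration...
i_h = smoothness_index(haste_hist, bins)
i_m = smoothness_index(mastery_hist, bins)

# ...and by distributivity, subtracting the two indices gives the same
# result as weighting the differential histogram directly.
diff = [a - b for a, b in zip(haste_hist, mastery_hist)]
assert abs((i_h - i_m) - smoothness_index(diff, bins)) < 1e-9
```

So each configuration can be reduced to its own index once, and comparisons become simple subtractions of those indices.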

There’s a bit of arbitrariness to this, in that the weight function is whatever we decided to come up with. However, that’s OK, since we’re not trying to give a measurable in-game number like DPS or DTPS. As an analogy, think about the stock market. The Dow Jones Industrial Average (DJIA) is a stock market index, which is just a weighted average of the stocks of 30 large companies. It’s used to reflect the performance of the entire stock market, but the companies chosen are pretty much arbitrary. Even if some sort of formula is used to choose them, that formula is essentially arbitrary, and there are other indexes that use different formulas (S&P 500, Russell 2000, etc.) and give slightly different impressions of the overall health of the market.

Another good analogy is your credit score. Your credit score is calculated by an algorithm, which is somewhat arbitrary and chosen to model credit-worthiness. But your credit score isn’t an estimate of some measurable value. Having a credit score of 750 doesn’t tell you exactly what APR you’ll get on your mortgage or how large a loan you can take out, even though it affects those things. But it’s also very clear that a score of 750 is better than a score of 600. And there are multiple ways of calculating credit score with different ranges, all trying to convey the same rough information about your borrowing risk.

And that’s really what we’re doing if we come up with a unique number for each simulation configuration: producing an index. You sim your gear set and you get a number out that tells you how smooth your damage intake is. You can re-sim with a different gear set to see if it improves. In this case, that means the number goes down, because just like DTPS a lower number is better with this type of smoothness metric. And you can calculate scale factors, which is just a fancy way of saying “sim once with a baseline gear set, then sim again with +1000 haste, then with +1000 mast, etc., and subtract each of those indices from the baseline value.”

In short, it’s exactly what an index should be: a solid number with a very clearly-defined calculation method that you could compare to any other configuration that uses the same boss mechanics. That’s really the only constraint here: the boss’ attacks need to be identical for two different indices to be comparable. You can’t compare an index generated with Hogger to one generated by Lei Shen and expect to get anything useful out of it. We’ll go into more detail on that point in the third post in this series, later this week.

But it will be clear that a result of 100 is a lot smoother than a result of 1000. It may not be clear exactly why until you look at the details – maybe one boss was Hogger and the other was Lei Shen, or maybe it was a gearing difference, or who knows what. Either way, it will be clear that the tank with the result of 100 wasn’t in much danger, while the tank clocking in at 1000 was. If we choose our overall normalization factor properly, we’ll be able to do this very accurately. For example, a smoothness value larger than 1000 would clearly be dangerous, while one below 500 wouldn’t be, or something like that. Again, we’ll talk about the details of that normalization process later this week.

The concept of the DJIA came up while Mel and I were discussing this, so my first thought was that we should call this the Theck-Meloree Industrial Smoothness Average as a bit of a joke. However, it struck me that there was a much more natural name that was less of a mouthful: the Theck-Meloree Index. Which, of course, would be abbreviated TMI. Which is not only amusing, but also eerily accurate based on the amount of background work we’re presenting here to develop it.

So yes, you heard it here first. We are going to get tanks to compare their TMIs. And it will be glorious.

39 Responses to The Making of a Metric: Part 1

It’ll take a while depending on how the metric is marketed and eventually understood by SimCraft users, but it will happen eventually. Not likely in this expansion though since it’ll take many months before this will be widespread. I’m just hoping people can be smart enough to realize that setting an arbitrary number for that isn’t going to work so well, particularly with how different tanks have different mechanics that make a number across all tanks untenable.

My biggest curiosity is still how Blizz will react to this kind of metric coming to all tanks eventually and what they’ll do with dodge and parry.

I’m actually really curious about one point you mentioned: how TMI will vary from tank to tank. I would not be surprised if there was some pretty serious variation based on class.

Regarding Blizzard’s response: I wouldn’t expect much at first. They may have their own internal smoothness metrics that they check while balancing tanks (or not – it’s clear they put a lot more emphasis on DTPS than smoothness based on their opinion of Dodge/Parry). But if the metric becomes widespread enough, maybe it will encourage them to give more weight to smoothness arguments.

Perhaps, however I’ve been playing long enough to realize that portions of the community seem to thrive on the touting of certain metrics as a means of deciding who is “good enough” to join in with them.

With the new Flex-Raid option I wouldn’t be surprised at someone latching on to this as a way to decide on who can join it for a weekly server raid that they are running, without fully understanding what the term means, only that the lower you can get the better.

It would go the same way that “gear score” did in BC and Wrath, and “item level” does currently. This concept is simply better at judging a tank’s theoretical durability than those one-size-fits-all metrics, which gives the number more weight as a representation of the underlying data.

I look forward to getting to test it out in SimC myself on my gear and set up, but at the same time I will still chuckle to myself when I see a post like that in trade chat, when someone is looking for a tank in the future.

At least TMI is a simulation-based metric. Meaning that it’s not something they can easily intuit from an addon like GearScore or read off of the armory. So unless they’re going to Sim your character to find out your actual TMI, they’ll have no idea whether you’re lying or telling the truth.

That’s probably the best reason that TMI won’t take off as a PUG filter statistic. It’s just not as easy to verify as ilvl or GearScore.

I would hope that you are right; however, I have been in a guild, for about a month, that had a “gear score” officer whose job in the guild was to verify that everyone was “qualified” to join a particular raid group.

Needless to say the guild didn’t last long, but I still wouldn’t put it past people to sim others’ characters to make sure they are up to their standards. The morality of doing that is up for discussion, but it isn’t out of the realm of possibility, and really isn’t too much of a time commitment, depending on how many people you are running with, and how often you get new people.

I don’t think that simming characters for TMI is at all likely to happen.

See, the TMI you’d receive would depend on the rotation and whatnot. If the tank uses a different rotation, or prefers Shield Block over Shield Barrier, or something, the TMI value becomes entirely useless.

They’re actually different degrees of freedom. Even in your variable bin size example, the relative weight of the 0.5 bin (which could even be zero) to the 0.8 bin to the 0.9 bin and so on is determined by HDF (or a similar degree of freedom).

I have maybe an odd question. Where do I begin to learn how to understand this? I did trig and pre-calc in high school, then technical engineering (12T MOS) and radiation safety in the Army, which was mostly more advanced geometry, trig and calc, no algebra above high school that I can remember. This kind of math is way over my head, but still intriguing.

There’s nothing in this post in particular that requires more complicated math than trig/pre-calc. If you’re familiar with the exponential function (i.e. $e^x$) and the natural logarithm ($\ln(x)$), and understand the basics of bar charts (histograms) and sums, you’ve got the background required. There will be a tiny bit of calculus in parts 2 and 3, but nothing worth agonizing over.

I spent the afternoon on Wikipedia, which I’ve been using a lot to try and keep up, and reacquainted myself with natural log, along with a few tangents. Get it? Eh? Alright. What made it really click into place was realizing that in your graphs you were referring to damage taken as percent of health, not percent of health remaining after damage. Seems so obvious in retrospect.

Interestingly, I think, I’ve typed 5 different questions into this box, scrolling back and tabbing around to form them precisely, and just the act of doing so has led me to the answers.

I guess I’m realizing the decay of having been out of school for a decade, with only the Army’s notoriously sub-par technical standards to meet. Seriously, outside of radiation safety (which was very important!), most of my job was basic construction surveying, finding the optimal moisture content of a soil sample, and slump tests on concrete. Not very brain-challenging tasks. Use it or lose it for sure.

When Theck starts posting actual formulas, I usually just skim past them, because he could just as well have typed them in Chinese.

However, he does a good job of explaining what the formula is for. For example, the 1x multiplier at 100%, 0.5x at 90% and 2.0x at 110% makes sense to me. I understand the logic and reasoning behind it.

So I don’t think you have to understand the math, as long as you understand the logic that the math is supposed to represent.
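The multiplier logic described above (0.5x at 90%, 1x at 100%, 2x at 110%) amounts to a weight that doubles for every extra 10% of max health a spike takes. A minimal sketch, assuming the base-2 exponential form and the 10% step purely for illustration (not the final metric):

```python
# Sketch of the spike-weighting idea: a spike's weight doubles for every
# additional 10% of max health it represents. The base-2 form and the 10%
# step are assumptions for illustration, not the finished TMI formula.

def spike_weight(damage_fraction):
    """Weight for a spike of size `damage_fraction` (1.0 = 100% of max HP)."""
    return 2.0 ** ((damage_fraction - 1.0) / 0.1)

# 90% of health -> 0.5x, 100% -> 1x, 110% -> 2x
print(spike_weight(0.9), spike_weight(1.0), spike_weight(1.1))
```

The point of the exponential form is exactly the logic described in the comment: near-lethal spikes are weighted far more heavily than small ones, and the ratio between adjacent weights is constant.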

@ Jack, thanks for the Intmath.com tip. I like it! While Wikipedia is very, very detailed and informational, it’s just not very educational by itself. I’ve always seen it as a good supplement to education. Intmath.com seems organized in a more educational, progressively advanced context.

@ Thels, I can’t personally disconnect understanding the math from understanding the logic the math represents. I think understanding one necessitates the other. This could be because I’ve always viewed math as basically an extension of language, and I always want to hear and understand everything someone says, not just part.

Either way, my brain certainly takes a deep breath of relief when it sees “Summary” or “Conclusion” in bold.

Great post Theck. Having seen your pre-implementation in SimC, the most interesting part was your elaboration of the health decade factor.
The good thing in SimC is that we can measure statistical noise, including that of the TMI metric. As discussed, my tests resulted in pretty high variance. That would make TMI more costly to compute than other measurements (dps, dtps), but it should still be feasible with a bit of patience.
Bin sizes don’t matter in SimC, as you take the numbers directly from the distribution, without creating intermediate or merged histograms.

This was with my testbed “standard boss,” which has only one action:
actions=auto_attack,damage=1000000,attack_speed=1.5

I’m going to do some fooling around to see how much (if any) magic damage I want to introduce into that. I also need to figure out how to invert the scale factors (since TMI uses “golf rules” – lower is better).

Perhaps to make TMI get bigger you just report it as 1 - TMI or something as the end result. Although maybe having it become lower isn’t such a bad idea, but it might need a name change (sorry!) to convey the notion that the character takes less damage or is less susceptible to spikes.

Those scale factors look pretty stable, yes. I did my tests with the default enemy settings (including spell nuke and dot) on the Paladin_Protection_T15N.simc profile, and that resulted in a bit more variance. I still think the variance of the TMI directly correlates with how dependent the TMI is on the tail end of the distribution.

Is there a special reason to have inverted scale factors? I tried to make sure that negative scale factors are properly displayed in the graphics, so that shouldn’t be a problem. And if lower is better for a metric, so be it. Export links, e.g. to wowhead, which assume more is better, are already inverted automatically in SimC (I had the same problem with dtps).

Yes, it would. If you extracted all of your damage and self-healing events from the log, you could perform the TMI calculation. That said, the sample size will be very small, so the error bounds on the estimate will be relatively large.

Still, it might be interesting to run it on the entire log of a series of 20+ wipes during progression just to see how useful it is.

Hey Theck, look at the 2nd and 3rd graphs in this post for me. In the 2nd graph, it looks like Mastery is totally missing from the first data point (leftmost, near 0.85). Is this because Mastery was 0? And then in the 3rd graph, it looks like you’ve ADDED like a thousand to the Haste value, as if Haste-Mastery was 2xxx – (-1xxx) = 3xxx. That -is- a RAW histogram, right? Cause it’s labeled with HDF=3.00, but even if it weren’t raw, the value should have decreased, not increased, since the weights are <1 below 100% of max HP… so I was totally confused about that. Is it from a different sim entirely, maybe?

If so, you may want to look at where you were showing the difference between the cutoffs of the top X% of spike events, because there's the same graph with hdf=3.00 whereas all the others have 2.00… Yet in the 7% and 10%, the 3k+ spike is still there. I'm not really sure what it is, but something goofy is going on there.

The “hdf=0” on those plots is irrelevant, because neither of them is weighted. I just hard-coded “h=0” into the title to remind myself they weren’t weighted. Apparently I forgot to do that for the binned bar plots (like #3), which are also raw histograms. Sloppy title management on my part, sorry.

As for your other question: it’s not mastery that’s missing, it’s haste. And the reason it’s missing is because of the plot limits. There’s definitely a haste bar in the bin, but because MATLAB’s bar plot system puts one bar to the left of bin center and one to the right, it’s being cut off. That’s probably why you were confused – the large haste-mastery value is because there’s a large haste bar just to the left of the plot limits that’s being cut off.

Different question: Is there an intuitive reason why the +1k Haste and +1k Mastery sets are so different for certain bins, causing these edge effects? These aren’t totally different gear sets; they’re C/Ha sets with a bit of stats added. I’m thinking it’s simulation noise combined with the fact that the possible number of attack reductions is discrete, and they all reduce the attack by a fixed amount (unless SoI/SS “absorbs” were not completely eaten, but does that happen often?). e.g. looking at just one attack in an attack string, +Haste will more often reduce a 0.4 attack to a 0.25, but +Mastery reduces it to a 0.23, putting the end sum in a different bin (one which may be impossible/unlikely to get to with C/Ha). Certainly a 0.4 can occur in either case, so some bins should look very similar, but I guess some can look drastically different.

If true, this is possibly another good reason you should remove the histogram middle-man as you do in part 3.

Some of it is certainly noise, but these are 10k-minute runs, so I don’t think that’s the dominant issue. I think it’s a combination of the histogram binning and the way mastery works. Boosting mastery by 1k increases your SotR mitigation by a bit less than 2%. As I said in the post, this tends to take that string of 4 attacks, at least one or two of which are SotR-mitigated, and reduce the overall damage by ~1%, which shifts that event to the left on the histogram.

This “shifting” of damage spikes is exactly the smoothing mechanism that we want to get from mastery, of course, but on the histogram that means we’re often shifting over a bin boundary. Since we’re dealing with discrete events, some spike sizes are more likely than others – this sim doesn’t vary the damage of attacks at all, so each attack is limited to full damage, 70% damage (blocked), whatever the SotR-mitigated size is, and 70% of that SotR-mitigated size. Sacred Shield absorbs and SoI heals do add some variation on that, but those too are fixed values, so in the end you just get permutations of those basic building blocks.

As a result, if we take for example the second figure (raw histogram close-up), the haste set is a lot more likely to land in the 0.95 bin than in the 0.93. But shifting some of that haste to mastery pushes a chunk of those events into the 0.93 bin and empties out the 0.95 bin, creating an imbalance. That’s the major reason that we see differences in certain bins.
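The “fixed building blocks” idea can be made concrete with a small sketch: each attack lands at one of a few discrete sizes, so 4-attack sums can only take certain values, and a small mitigation change relocates those values across bin boundaries. The 70% block value comes from the discussion above; the SotR mitigation factors (0.55 baseline, 0.53 with extra mastery) and the `four_attack_sums` helper are placeholders chosen purely for illustration:

```python
# Each attack is one of a few discrete sizes: full, blocked (70%),
# SotR-mitigated, or blocked AND SotR-mitigated. The SotR factors used
# here (0.55 baseline, 0.53 with +mastery) are illustrative placeholders,
# not exact in-game numbers.

from itertools import product

def four_attack_sums(sotr_factor, block_factor=0.7):
    """All possible 4-attack damage totals, in units of one full attack."""
    sizes = {1.0, block_factor, sotr_factor, block_factor * sotr_factor}
    return sorted({round(sum(combo), 4) for combo in product(sizes, repeat=4)})

baseline = four_attack_sums(0.55)
more_mastery = four_attack_sums(0.53)

# A string containing one SotR-mitigated hit shifts by 0.02 of an attack's
# damage -- often enough to cross a histogram bin boundary.
print(baseline[:5], more_mastery[:5])
```

Because the two sets of possible totals are slightly offset from each other, two nearly identical gear sets can populate adjacent histogram bins very differently, which is exactly the bin-imbalance effect described above.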

And I agree, abandoning the histogram middle-man gets rid of a lot of those issues. You’d still have the edge effects if you only considered events over an arbitrary threshold, but those effects are relatively small as long as you include enough of the histogram (and of course, in the end we chose to just keep them all).

Cool! I’m studying Control Theory, which is pretty general in the sense that it’s used in all parts of Science/Tech/Eng/Math, so I wouldn’t be surprised if you knew stuff about it. Thus, I’m kind of interested in how you answered the question of “When should a Prot Paladin hit SotR?” and I will probably check out your earlier blog posts about it soon. I noticed your graphs (at least for the C/Ha set?) have “finisher=SH1”, so I assume you played around with a couple of rotations.

Usually in Controls, we examine the problem of “How can we control our state to our liking using the most efficient input?”, assigning a cost of some kind to the input. But here, the amount of HoPo you have is also a state, and the input is whether or not to spend it (a discrete input, not continuous), which is quite a bit different. Also, the main state (the last x attacks summed) probably can’t really be expressed as a clean diff eq here. I’m curious whether there are many similarities as a result… Problems like that have come up in my classes, but usually they’re a bit contrived, and the standard tools don’t usually work.

Actually I guess it wouldn’t be terribly hard to express the problem with difference equations (discrete time control problem). Continuous time with integrating delta functions at various time points seems unnecessarily complicated.

While I didn’t cast it in the form of a differential or difference equation, you’re absolutely correct that it could be. We’re essentially evaluating the state of the system represented by your holy power and recent damage timeline, with a moderately complicated decision function.
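As a rough illustration of that framing, here is a minimal discrete-time sketch: the state is (holy power, a queue of recent attack sizes) and the input is the binary decision to spend. The threshold value, the 4-attack window length, and “1 HP per step” are simplifying assumptions standing in for the actual SH1 priority, not the real logic from the post:

```python
# Minimal discrete-time sketch of the control problem: state = (holy power,
# recent damage history), input = the binary decision to cast SotR.
# The threshold, window size, and HP generation rate are all simplifying
# assumptions for illustration, not the exact SH1 rule.

from collections import deque

def spend_decision(holy_power, recent_damage, threshold=1.5):
    """Spend at 3+ Holy Power if the recent attacks sum above `threshold`
    (expressed in multiples of one full attack's damage)."""
    return holy_power >= 3 and sum(recent_damage) >= threshold

def step(holy_power, recent_damage, new_attack):
    """One step of the difference equation: update state, then apply input."""
    recent_damage.append(new_attack)        # damage state update (window of 4)
    holy_power = min(holy_power + 1, 5)     # builder generates 1 HP (simplified)
    if spend_decision(holy_power, recent_damage):
        holy_power -= 3                     # input applied: cast SotR
    return holy_power, recent_damage

hp, hist = step(2, deque([1.0, 0.7, 1.0], maxlen=4), 1.0)
print(hp)
```

The point of the sketch is just that the system fits the commenter's description: a discrete-time state update plus a decision function on that state, rather than a continuous control law.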