Introduction
The Coach Rating (CR) system is a way for coaches to get an idea of the relative strength of other coaches on FUMBBL. A higher CR indicates that the coach is playing at a higher level, whereas a lower CR means the opposite. It is important to understand that the system is tuned to give more weight to more recent results, and that it estimates coach strength from actual results rather than from true ability.

The difference here is that a number of coaches sometimes enjoy playing teams that are deliberately a tougher challenge (for example, switching to a new roster they are not used to, or playing all-lineman teams). This will inevitably push the CR below the level the coach "deserves".

Another important take-away is that the CR system is an estimate and not perfectly accurate. There is a fairly large uncertainty in each coach's actual rating number, and the precision shown should not be taken as an indication of accuracy.

Rating categories
Each coach has a number of different ratings. These are divided into one overall group and one group per competitive division (currently Ranked and Blackbox).

Each group has an overall rating, and one rating for each race available.

The overall ratings are completely independent from a mathematical perspective: they are entirely separate and do not relate to each other. An important effect of this is that there is no formula to convert divisional ratings to the overall one; each rating is updated based on the opponent's rating in that specific category.

The racial ratings are slightly different. They correlate to some extent with the overall rating for the division they are in. The reasoning behind this is that coaches often do not play all races on their own. For example: A coach playing goblins may play against an opponent that doesn't play goblins. Comparing the goblin rating to the opponent's goblin rating would therefore be inaccurate (e.g. the opponent will have the default rating rather than an actual rating). Another way to do this would be to use the rating of the race the opponent is currently using. I chose not to do this because the ratings between the different races are not necessarily in the same scale. A goblin rating of 160 may not be the same level of play as an orc rating of 160, because orcs are probably easier to play than goblins for most people. As a compromise, racial ratings are compared against the opponent's overall rating for the division.

The fact that the racial ratings are inherently based on fewer games makes them less accurate, and you should take these with a larger pinch of salt than normal.

Definitions
CR = Coach Rating. A starting coach has a CR value of 15000 internally (integer, no fractions allowed), which is shown divided by 100 on display as 150.00.
CR' = New CR (ie, the CR after a match has been processed)
k = An amplification factor designating the effect of a match result on CR
S = Factor for the actual result in the game.
p = Win probability of the match.

The Math
The core of the rating system is based on the Elo system, which has been heavily modified for the unique parts of Blood Bowl (primarily that teams are not equal). Note that this is calculated twice per match (and per CR type): once for each coach involved in the match.

Next, we define the S value, which is the actual result of the match:
S = 0.0 for a loss
S = 0.5 for a tie
S = 1.0 for a win

Then, we calculate the basic amplification factor k:

k = 2. This is the base k value that is used unless one of the exceptions below is in play.

We then evaluate the match:
outcome = (S-p)
bracket_diff = bracket_coach - bracket_opponent, where each bracket is a numerical value between 1 and 6 for the different CR brackets (Experienced through Legendary). Additional detail about these brackets is posted below.

k = 2 + abs(bracket_diff) / 2, if (bracket_diff < 0 and outcome > 0) or (bracket_diff > 0 and outcome < 0)
This means that if the coach did better than expected vs an opponent in a higher bracket, or worse than expected vs a coach in a lower bracket, the k value is amplified to somewhere between 2.5 and 4.5. In short, we amplify the result if there's a clear upset.

k = 2 - 2/(5 - abs(bracket_diff) / 2), if (bracket_diff > 0 and outcome > 0) or (bracket_diff < 0 and outcome < 0)
This means that if the coach did better than expected vs an opponent in a lower bracket, or worse than expected vs a coach in a higher bracket, the k value is dampened to somewhere between 1.2 and 1.56. In short, we dampen the result if there's a clear non-upset.
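As a rough Python sketch of the k rules above (not the actual FUMBBL implementation; bracket values run 1 through 6 as described, and outcome is S - p):

```python
def k_factor(bracket_coach, bracket_opponent, outcome):
    """Amplification factor k for one CR update.

    bracket_coach / bracket_opponent: numeric bracket value, 1-6.
    outcome: S - p for this coach (positive = better than expected).
    """
    diff = bracket_coach - bracket_opponent
    if (diff < 0 and outcome > 0) or (diff > 0 and outcome < 0):
        # Clear upset: amplify (k between 2.5 and 4.5).
        return 2 + abs(diff) / 2
    if (diff > 0 and outcome > 0) or (diff < 0 and outcome < 0):
        # Clear non-upset: dampen (k between 1.2 and ~1.56).
        return 2 - 2 / (5 - abs(diff) / 2)
    return 2  # base k: same bracket, or exactly the expected result
```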

After all this, we are ready to calculate the CR change:

CR' = CR + k * (S - p)

For each match, this is repeated a total of 8 times: the overall rating and the race-specific rating, in both the overall group and the match's division, all computed for each coach in the match.
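Putting the update itself into a minimal Python sketch (CR in display scale, with k computed from the rules above and passed in; not the actual site code):

```python
def update_cr(cr, k, s, p):
    """CR' = CR + k * (S - p), with CR in display scale (e.g. 150.00)."""
    return cr + k * (s - p)

# Both coaches are updated per match and per rating category. If p is
# the first coach's win probability, the opponent uses 1 - p, and the
# two S values sum to 1 for any result.
winner = update_cr(150.0, 2.0, 1.0, 0.6)  # 150.8
loser = update_cr(150.0, 2.0, 0.0, 0.4)   # 149.2
```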

Last edited by Christer on Oct 21, 2017 - 16:36; edited 2 times in total

The Rookie bracket is handled in a special way: a coach who has played fewer than 10 matches in a given CR category is listed as a Rookie. This is a filter applied only when displaying the bracket, and it does not affect the CR calculation above. Instead, coaches begin at Emerging Star as far as the rating system is concerned, much in the same way they start at CR 150.

The Legend bracket is also handled in a special way. It is loosely defined as the top 50 active coaches for a given CR category. Active in this context is defined as a coach who has played at least one match within the last 3 months (regardless of division).

If the number of active coaches in a category is less than 350 (although I believe this is not the case in any category), the top 50 is instead reduced to roughly 14% (50/350) of the active coaches.

The rest of the brackets, from Experienced through Super Star, are divided into CR ranges based on the standard deviation multiplied by a factor that is picked so that the top 50 coaches end up as Legends. I realize this sounds complex, so let's review an example:

Let's consider a category where the average CR is 150 (this is not always exactly 150.0, but it will be close), and the standard deviation of the ratings is around 5.47.

We take the lower CR of the top 50 coaches (let's say 174) and compute the factor that places this limit at the right number of standard deviations above the mean: (174 - 150) / 5.47 ≈ 4.39. The same standard deviation then defines the CR ranges for the lower brackets.

When a match is played, and a CR increases or decreases, the resulting CR is compared against these brackets. To avoid a coach constantly swapping between two brackets, the resulting CR must exceed the lower limit of a bracket by 25% of the higher bracket's size to be promoted, or fall below that limit by 25% of the lower bracket's size to be relegated. For promotion to Legend or relegation to Experienced, the 25% comes from the current bracket's size (as these two extremes are unbounded and have no real size).

On a monthly basis, a script is performed (after recalculating the bracket limits) that moves coaches into their bracket, regardless of this threshold.
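The hysteresis can be sketched as follows (hypothetical helper functions and numbers; the real limits and sizes are recomputed from the CR distribution as described):

```python
def should_promote(cr, upper_limit, higher_bracket_size):
    """Promote only once CR exceeds the bracket boundary by 25% of
    the higher bracket's size, so a coach doesn't bounce between
    two brackets on every match."""
    return cr > upper_limit + 0.25 * higher_bracket_size

def should_relegate(cr, lower_limit, lower_bracket_size):
    """Relegate only once CR falls below the boundary by 25% of
    the lower bracket's size."""
    return cr < lower_limit - 0.25 * lower_bracket_size
```

The monthly script then snaps everyone to their nominal bracket regardless of these thresholds.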


I've also considered the specific domain of Blood Bowl: the TV diff was changed to a normalized version (ie, we effectively look at the percentage difference between the two teams rather than the direct TV difference). This means that TV 1000 vs TV 1100 is equivalent to playing TV 2000 vs TV 2200. While this doesn't account for inducements, I feel that the normalized difference is better than the direct non-normalized one.
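The normalization might look like this; the exact denominator isn't stated, so dividing by the smaller TV is an assumption that satisfies the stated equivalence:

```python
def tv_diff_normalized(tv_a, tv_b):
    """Relative TV difference: TV 1100 vs 1000 gives the same value
    as TV 2200 vs 2000 (0.1), unlike the raw difference (100 vs 200).
    Dividing by the smaller TV is an assumption for illustration."""
    return (tv_a - tv_b) / min(tv_a, tv_b)
```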

With CR, we used a linear difference for a very long time. This is the root cause of why playing down a lot in CR terms (e.g. cherry picking rookies) is beneficial for your CR: the estimated win probability is simply too low for the match, and therefore winning has a larger effect than expected on the high-CR coach's rating.

CR system updates (Oct 2017)
So, I recently introduced the exponential curve, effectively X + X^3, where the linear X needs to be there to avoid silly effects in matches between coaches with very similar CR. In a way, you can think of this as an "uncertainty effect". With coaches close to each other in CR, the system considers them relatively equal, since the CR itself is an estimate, and uses a bit of caution. With coaches further apart, it becomes more and more certain that their skill levels are actually different, and a stronger effect is applied to the expected result.

The constants used (f_CR, f_1, f_3) are chosen to give a reasonable curve for the different CR differences. In my work in choosing these, I simply made an assumption that the normal distribution of the coach CRs would have a mean of 150.0 and a standard deviation of 10. This means that 68% of coaches on the site would be between 140 and 160 and 95% of coaches would be between 130 and 170. Something like 0.3% would be above 180.

Then I made the assumption that CR 180 vs CR 150 would have the system assume something along the lines of a 97% win rate, and CR 155 vs CR 150 would be relatively close to 55%. I tweaked this around a bit to get to a point where I felt comfortable with the curve (looking at CR 160 and CR 170 in between those points).
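As an illustration of the shape, a sketch of an Elo-style logistic over the X + X^3 transformed difference. The constants below are not the actual site values: F_1 and F_3 are fitted to the two anchor points mentioned (+5 CR ≈ 55%, +30 CR ≈ 97%), with f_CR folded into them:

```python
# Illustrative constants only; the real f_CR, f_1 and f_3 values are
# not published here. These are fitted so that a +5 CR difference
# gives ~55% and +30 gives ~97%.
F_1 = 0.0165
F_3 = 3.758e-5

def win_probability(cr_diff):
    """Logistic curve over the X + X^3 transformed CR difference.
    The cubic term sharpens the estimate for large differences while
    the linear term keeps near-equal coaches close to 50%."""
    x = F_1 * cr_diff + F_3 * cr_diff ** 3
    return 1 / (1 + 10 ** -x)
```

Note that the odd polynomial keeps the curve symmetric: the two coaches' probabilities always sum to 1.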

CR distribution
It's important to say here that overall low p values will generate a larger spread of CRs among coaches, while p values that quickly move away from 50% will narrow the curve (an effect I've mostly learned by trial and error, looking at how tweaking the numbers affects the overall distribution of CRs).

At the same time, having too "flat" p values, where the win probability is estimated to be too close to 50%, will make cherry picking very powerful in terms of CR.

At this point, I think the basic foundation is strong. I have the tools to adjust the constants and adjust the curve that is used for CR differences in a fairly granular way.

However, the last CR update I ran gave me somewhat of a surprise. I picked the constants in the formula in a way that would effectively force the standard deviation of the ratings to be close to 10, meaning very few coaches should be hitting CR 180 or higher. After running the script for a while (and also looking at the end result), we have loads of coaches beyond CR 200. While this isn't a problem as such, it was a surprising result to me.

Finding problems
Looking more into some details, "cherry picking" type games were still giving too much of a CR increase, despite the exponential CR difference *and* the k-value adjustment that is in place (which in a way penalizes the cherry picking behaviour even further). Also, race-specific ratings are all over the place and hard to understand. So what gives?

Looking back at the formula, you will see that there's a racial filter between the raw p formula and what's used to calculate the CR difference. Thinking about this further, I am of the opinion that this p adjustment is problematic.

What it does is take racial ratings and compare the average result for a racial matchup to the win rate. My thinking is that most of these racial factors are very close to 50%, regardless of TV brackets or races. In reality, win rates for all sorts of TV differences are incredibly narrow (basically between 45 and 55%, regardless of races and TV difference, up to extremes like 500k).

This, I believe, causes the expected win probabilities to be dragged closer to 50% than I would expect. In turn, this underestimates the win probability in cherry picking matchups (meaning cherry picking is good for CR) while also increasing the standard deviation of the CR curve; remember what I wrote about "flat" p values above.

What I am intending to do with the next update is to simply remove the racial factor. While I think it's still a good idea at its core, it's causing more problems than it solves. So removing it will hopefully make things better.

Last edited by Christer on Oct 21, 2017 - 16:37; edited 1 time in total

It seems to me the mechanics of the amplification factor, k, are central to many of the CR debates that rage across the forums from time to time. Finding the balance between reaching a somewhat accurate rating quickly and having a rating that resists wild fluctuations seems to be what k is all about, if I understand correctly. Is it better to have a much more accurate rating that takes 1000 games to become effective, or a rating that gets close after 10 games but will move around a lot? It seems obvious to me that "something in between" is the answer, but how to do it, exactly, with the added challenge that if a coach switches which teams they play, it throws everything out of whack...

Anyway, I look forward to the rest of your series of posts. Thanks for your time!

_________________Come join us in #metabox, the Discord channel for HLP, ARR, and E.L.F. in the box!

Ok very interesting, thanks for this
You're adjusting by TV differences,
and also tracking racial winrate for each race Vs each other race, at various tv brackets...

Are they both necessary? (Is it double counting the effect of TV, or did I misunderstand)

Could like a big matrix of race Vs race at all TV bracket combinations do the same job instead?

Maybe there are situations where tv has unexpected interaction with race... (inducements might make a difference)
Like maybe necro beat orcs if they can have babes, but they lose without them,
Maybe rookie lizards beats rookie chaos, but 2000tv lizards loses to 2000tv chaos, etc

On the other hand some matchups would just not have enough data to predict. Could sandbag the results by adding 100 fake draws to the stats or something, to make it lean heavily towards draws until it has enough data to overwhelm that
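The pseudo-draw idea could look like this (a hypothetical helper, using the 100 fake draws suggested above):

```python
def smoothed_win_rate(wins, draws, losses, prior_draws=100):
    """Win rate for a matchup, padded with fake draws so sparse
    matchups lean towards 50% until real games dominate."""
    games = wins + draws + losses + prior_draws
    points = wins + 0.5 * (draws + prior_draws)
    return points / games
```

With no real games this returns exactly 0.5; after a thousand real results the prior barely matters.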

In Azure machine learning, there's a function called tune model hyperparameters
Which basically can work out what the weights of constants should be

Is it possible to derive your constants from your existing data? Eg f_CR, f_TV. Also for finding the actual win rates here:
"CR 180 vs CR 150 would have the system assume something along the lines of 97% win rate, and CR 155 vs CR 150 would be relatively close to 55%"

Follow on point, you're using linear CR diff, but 150-155 may not have the same influence as 170-175?

edit - actually, did you change this already, the thing with X+X^3 ?

Last edited by Sp00keh on Oct 17, 2017 - 23:37; edited 5 times in total

They are both performing as expected, so would have the same CR?
This sounds great to me, as it means it doesn't matter which race you play if you want to farm CR; just go with the race you perform relatively best with, which would enable diversity

Hmmm, if they then both get beaten by CoachC, they would both suffer the same amplification factor k, no? As that's based on CR only
As Race1 loses more, CoachA's CR will take more punishment from k?
It's masked a bit by f_race, but it still means they wouldn't end up with the same CR, and you'd probably want to play tier 1 races to make Legend

Last edited by Sp00keh on Oct 17, 2017 - 20:27; edited 1 time in total

I have always (perceptions aside) defended CR as a reasonable attempt at defining coach rating. Go check, I stand by everything I have ever posted.

I struggle to keep up with the pure math, but can usually follow the reasoning and hence eventually get the math too.

But I am lost as to why a 4-0 win is better than a 1-0 win.
It would seem contrary to your stated aims.
Increasing, as it does, the value of unfair games with high score outcomes.
Promoting certain styles of play and certain races while devaluing others.
Bringing 'perception' of a 'good' win into something seeking a rational outcome.

So I am interested as to why you feel this is an appropriate measure? It feels like a step toward 'personal preference of play style' and away from 'how likely is a win' to me.

_________________Barbarus hic ego sum quia non intelligor ulli
I am a barbarian here because i am not understood by anyone

Kind of like school. Student 1 got an A in the class, student 2 got a C.

both students passed and will go on to graduate.

But student 1 is definitely the better student and therefore should get proper recognition over student 2.

Mmmm.. no. In one 4-0 loss, a noob with 1 re-roll wasted it on turn 1 on a troll block, losing the game 4-0 against Norse, or whatever. In a different 4-0 loss, a legend coach, victim of a BLITZ and several cas, who was down 2-nil at the half, tried the best possible plays he could in an effort to salvage a draw, or even a win. These amazing plays didn't get the dice, hence the 4-0 loss. These are two dramatically different games, with the same lopsided score, both of which happen often. Rewarding 4-0 is an error, imo.

_________________
Join the wait-list. Watch the action. Leave the Empire. Come to Bretonnia!