Suggested changes to the rating system for our new Smogon ladder

chaos asked me to suggest a rating system for our Smogon ladder, and here are my suggestions.

I propose that we use the Glicko-2 system, the same one implemented in the Shoddy ladder, with a few modifications. Let R be a player's mean rating, RD the rating deviation and v the volatility. The changes I suggest are the following:

1. The Rating displayed to the player is just round(R), not R - 4*RD as is used on Shoddy.

2. The Rating of a player is not always shown, however. It is only shown if RD < 100; otherwise the player's Rating is provisional. This way, a new player would need to play between 20 and 25 games for his or her rating to become visible. This should hopefully deter players from creating multiple accounts.

3. RD cannot drop below the threshold value of 60. If a player's RD falls below 60, it is set to 60. This lets the rating of a frequently-competing player keep changing at a reasonable pace instead of very slowly, which should encourage players to keep playing on their current account.

4. RD cannot go above the threshold value of 350. If it becomes greater than 350, it is set to 350. This is a very minor change, made so that a player's rating deviation never exceeds that of a beginning player, even if the player stops playing completely.

5. If a player does not battle on a particular day, phi (which is equal to RD / 173.7178) becomes sqrt(phi^2 + 4*v^2) instead of sqrt(phi^2 + v^2) as is currently implemented (and the new RD is then the new phi * 173.7178). This change makes a frequently-competing player's rating go provisional after about 14 consecutive days of inactivity, which should deter players from occupying the top of the ladder for a long time without playing. It also makes a player's rating become as uncertain as that of a beginning player after about 9 months of inactivity (so if you don't play for 9 straight months, the ladder would consider you a noob even if you were #1 when you stopped).
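For anyone who prefers to read the rules above as code, here is a minimal Python sketch of the display rule, the two RD thresholds and the idle-day RD increase. The function names are made up for illustration and are not taken from the Shoddy source, and the volatility value used in any example (v = 0.06) is only an assumption.

```python
import math

SCALE = 173.7178          # conversion factor between RD and phi
RD_FLOOR = 60.0           # proposed lower threshold for RD
RD_CEILING = 350.0        # proposed upper threshold for RD
PROVISIONAL_RD = 100.0    # ratings with RD >= 100 are provisional


def clamp_rd(rd):
    """Keep RD inside the proposed [60, 350] range."""
    return max(RD_FLOOR, min(RD_CEILING, rd))


def displayed_rating(r, rd):
    """Return round(R), or None when the rating is provisional."""
    if rd >= PROVISIONAL_RD:
        return None
    return round(r)


def rd_after_idle_day(rd, v):
    """Apply the proposed idle-day update: phi' = sqrt(phi^2 + 4*v^2)."""
    phi = rd / SCALE
    new_phi = math.sqrt(phi ** 2 + 4 * v ** 2)
    return clamp_rd(new_phi * SCALE)


if __name__ == "__main__":
    # v = 0.06 is an assumed volatility, not a value from the ladder.
    print(displayed_rating(1712.6, 80.0))
    print(displayed_rating(1800.0, 120.0))
    print(rd_after_idle_day(60.0, 0.06))
```

Returning None for a provisional rating is just one way to represent "not shown"; the ladder could equally list such players separately.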

I'd like to hear from players who participate on the ladder whether the above points address what they see as shortcomings of the Shoddy ladder, and whether there are points for further improvement.

Obviously we talked about it on #insidescoop, but I agree 100% with these changes. I hated having to make new nicks and alts because my progress on the ladder was basically halted after a certain amount of time. I feel that a rating system that rewards (or at least doesn't punish) consistency is the best way to go.


Okay, I'll make it 14 days. It's a pretty simple fix; I just need to replace the '6' in the formula with '4'. :) As a result, the time taken to return to an RD of 350 is now 9 months, not 6 months.
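As a rough check of those numbers, the idle-day update adds multiplier * v^2 to phi^2 every day, so the number of idle days needed to go from one RD to another has a closed form. The sketch below assumes the volatility stays constant at v = 0.06 while the player is idle (an assumption on my part), and the function name is hypothetical.

```python
import math

SCALE = 173.7178  # conversion factor between RD and phi


def idle_days_to_reach(start_rd, target_rd, multiplier, v=0.06):
    """Idle days until RD first reaches target_rd.

    Each idle day adds multiplier * v^2 to phi^2, so the day count is
    (phi_target^2 - phi_start^2) / (multiplier * v^2), rounded up.
    v = 0.06 is an assumed, constant volatility.
    """
    phi_start = start_rd / SCALE
    phi_target = target_rd / SCALE
    gap = phi_target ** 2 - phi_start ** 2
    return math.ceil(gap / (multiplier * v ** 2))


if __name__ == "__main__":
    for multiplier in (6, 4):
        to_provisional = idle_days_to_reach(60, 100, multiplier)
        to_beginner = idle_days_to_reach(60, 350, multiplier)
        print(multiplier, to_provisional, to_beginner)
```

Under that assumption, a multiplier of 6 gives about 10 days to go provisional and about 6 months to reach RD 350, while a multiplier of 4 gives about two weeks and about 9 months. A different volatility would shift these figures.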

Just wanted to ask something. The Shoddy page says that the ladder system tries to match you with a player whose conservative rating estimate (CRE) is close to yours. The CRE is the infamous R - 4 x RD used by Colin to represent a rating. Since we're going to use just R to represent a player's rating, that part of the program should be changed so that the ladder searches for an opponent whose Rating R is close to yours, not one whose CRE is.

Unfortunately, I am unaware of just how much someone would have to play to get their RD below 100, so it's possible that rule means this first part isn't an issue.

The reason Shoddy uses the 4*RD part is that Glicko doesn't attempt to give you a single rating, but rather a range of values. Displaying just R is saying that the player has a 50% chance of having an actual skill level at or above that value. For new players, the rating range is rather large because Glicko isn't quite sure just where they are. When Colin looked at the list sorted by R, nearly every player at the "top" was someone he and I had never heard of. Subtracting four deviations is saying "This player has a 99%+ chance of having this rating or higher," which has the effect of only including more certain players.
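To put numbers on the 50% and 99%+ figures, here is a small Python sketch. It assumes a player's true skill is modelled as normally distributed with mean R and standard deviation RD, which is my reading of the argument rather than something taken from the Shoddy code. The function names and the sample ratings are made up.

```python
import math


def conservative_estimate(r, rd, k=4.0):
    """Conservative rating estimate: R - k*RD (Shoddy uses k = 4)."""
    return r - k * rd


def prob_skill_at_least(threshold, r, rd):
    """P(true skill >= threshold), assuming skill ~ Normal(R, RD^2)."""
    z = (threshold - r) / rd
    return 0.5 * math.erfc(z / math.sqrt(2))


if __name__ == "__main__":
    # Hypothetical players: a brand-new account and an established one.
    for r, rd in ((1500.0, 350.0), (1700.0, 60.0)):
        cre = conservative_estimate(r, rd)
        print(r, rd, cre,
              prob_skill_at_least(r, r, rd),
              prob_skill_at_least(cre, r, rd))
```

With these sample values the new account's estimate drops by 1400 points while the established one drops by only 240, which is why sorting by R - 4*RD pushes uncertain players down the list.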

As for rule change 5, that really gets to the heart of what the purpose of the ladder is. If the purpose is to create an environment in which people are trying to get to the top and then have to fight to maintain it, then yes, having more "rating decay" is good. If the purpose of the ladder is to rank players in terms of their skill, then the "rating decay" should be roughly equal to the loss of skill over time (so much, much lower than on the Official Server).

As far as I can tell, in combination with what you proposed in 1. (the use of R over anything involving RD), this will give no "rating decay", so the only issue is keeping yourself from becoming provisional.

Unfortunately, I am unaware of just how much someone would have to play to get their RD below 100, so it's possible that rule means this first part isn't an issue.


It takes roughly 20 to 25 battles for your RD to drop below 100.

The reason Shoddy uses the 4*RD part is that Glicko doesn't attempt to give you a single rating, but rather a range of values. Displaying just R is saying that the player has a 50% chance of having an actual skill level at or above that value. For new players, the rating range is rather large because Glicko isn't quite sure just where they are. When Colin looked at the list sorted by R, nearly every player at the "top" was someone he and I had never heard of. Subtracting four deviations is saying "This player has a 99%+ chance of having this rating or higher," which has the effect of only including more certain players.


I know this, and this is why I'm making every rating with an RD of 100 or more provisional. If RD is that large, the rating is too uncertain to be reliable; hence, it's provisional. And yeah, I looked into that list that Colin made. All of the players at the top that you 'did not know' would have had a provisional rating in this new system, so they would not appear at all (or would appear at the bottom as 'provisional').

Here is an old list that Colin posted to prove his point that R - 4 x RD is the way to go. I added the RD at the end of each player's entry:

In this new system, the only players out of the above that would be listed on the ladder are the ones in bold. They would be listed as #1, #2, #3, etc. All the other players would have provisional ratings.
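A minimal Python sketch of that listing rule follows: players with RD below 100 are ranked by R, and everyone else is set aside as provisional. The player names and numbers below are invented purely for illustration and are not entries from Colin's list.

```python
from typing import NamedTuple


class Player(NamedTuple):
    name: str
    r: float   # mean rating R
    rd: float  # rating deviation RD


def build_ladder(players, provisional_rd=100.0):
    """Split players into a ranked list (RD < 100) and a provisional list."""
    ranked = sorted(
        (p for p in players if p.rd < provisional_rd),
        key=lambda p: p.r,
        reverse=True,
    )
    provisional = [p for p in players if p.rd >= provisional_rd]
    return ranked, provisional


if __name__ == "__main__":
    # Hypothetical data, not taken from any real ladder.
    sample = [
        Player("newcomer_a", 1950.0, 210.0),
        Player("veteran_b", 1820.0, 62.0),
        Player("newcomer_c", 1900.0, 150.0),
        Player("veteran_d", 1760.0, 75.0),
    ]
    ranked, provisional = build_ladder(sample)
    for position, player in enumerate(ranked, start=1):
        print(f"#{position} {player.name} {round(player.r)}")
    for player in provisional:
        print(f"provisional {player.name}")
```

In this made-up sample the two high-R newcomers never reach the ranked list, so the veterans appear as #1 and #2 even though their R is lower.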

As for rule change 5, that really gets to the heart of what the purpose of the ladder is. If the purpose is to create an environment in which people are trying to get to the top and then have to fight to maintain it, then yes, having more "rating decay" is good. If the purpose of the ladder is to rank players in terms of their skill, then the "rating decay" should be roughly equal to the loss of skill over time (so much, much lower than on the Official Server).


There would be no rating decay in this system, and that's why I made the RD increase faster. Otherwise, someone could reach the top 10, stop playing, and just sit there admiring his rating. With RD increasing faster, his rating would become provisional after about 14 days of inactivity (and that is if his RD is at the floor of 60; if it is higher, it would take even less time).

As far as I can tell, in combination with what you proposed in 1. (the use of R over anything involving RD), this will give no "rating decay", so the only issue is keeping yourself from becoming provisional.


Exactly. And that's why I made the rating go provisional quicker than normal. At first I made it go provisional after one week, then I made it 10 days. Then people suggested 14 days, so I changed it to that.

I ran a simulation in Excel using the proposed rating system. Interestingly, the volatility increases dramatically when the two players in a battle have mean ratings that are very far apart. This happens in Shoddy Battle because the opponent it finds for you is the one whose CRE is nearest to yours, not the one whose rating is nearest. I tested this by playing two games yesterday (with a crap team) to confirm it.

When the opponent was chosen to have a mean rating near yours (and, if possible, a similar RD as well), the volatility behaved better.

I don't know if this is possible to implement, but I'd suggest that the opponent the ladder proposes for you be one with a mean rating and RD similar to yours, not a similar CRE.
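To illustrate the difference between the two matching rules, here is a rough Python sketch. It is not how Shoddy actually picks opponents; the distance formula, the rd_weight parameter and the sample players are all assumptions made up for this example.

```python
def cre(r, rd):
    """Conservative rating estimate, R - 4*RD."""
    return r - 4 * rd


def pick_by_cre(me, candidates):
    """Pick the candidate whose CRE is closest to mine."""
    _, my_r, my_rd = me
    return min(candidates,
               key=lambda c: abs(cre(c[1], c[2]) - cre(my_r, my_rd)))


def pick_by_rating_and_rd(me, candidates, rd_weight=1.0):
    """Pick the candidate with the closest R, using RD as a secondary term.

    rd_weight is a hypothetical tuning knob for how much RD similarity counts.
    """
    _, my_r, my_rd = me
    return min(candidates,
               key=lambda c: abs(c[1] - my_r) + rd_weight * abs(c[2] - my_rd))


if __name__ == "__main__":
    # Hypothetical (name, R, RD) tuples.
    me = ("me", 1700.0, 60.0)
    pool = [("close_rating", 1720.0, 70.0), ("far_rating", 2100.0, 160.0)]
    print(pick_by_cre(me, pool)[0])
    print(pick_by_rating_and_rd(me, pool)[0])
```

In this made-up pool, matching on CRE selects the opponent whose R is 400 points away because the two CREs happen to coincide, while matching on R and RD selects the opponent 20 points away.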