I think the rating system in Ludum Dare (and on many other websites) is not good. I don’t know a better solution, but I think the problem is worth thinking about.

Let’s say there are 1000 submissions and 1000 reviewers. If each submission receives a lot of reviews, then the average of the reviews will give a good estimation of the quality of the submissions.

However, if each reviewer only reviews a few submissions, each submission will only get a few reviews, which means that the average will yield a very poor and “noisy” estimation that doesn’t represent the opinion of the community as a whole.

This means that simply asking people to rate and averaging the rates is not a good idea in this case.

I suggest that we invite our mathematician friends to come up with a smarter and more elegant solution for this. Wouldn’t that be awesome?

This entry was posted
on Sunday, July 29th, 2012 at 12:43 pm and is filed under MiniLD.

11 Responses to “Rating system”

I outlined how the previous competition (LD #23) had a wildly incorrect scoring system, which led to the overall results being crazy and the scores being completely messed up, as you can see from the examples I gave in my post…

There was some positive response to my post, and people generally agreed that something needs to be done. I suggested an alternative that would solve the problem (using Bayesian averages), but since then there has been no official word about anything, so I am unfortunately sad to say I think the next LD will have exactly the same problem…
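The Bayesian-average idea mentioned above can be sketched in a few lines. This is a minimal illustration, not whatever was proposed officially; the damping constant `C = 10` and the example numbers are assumptions I picked for demonstration:

```python
def bayesian_average(ratings, global_mean, C=10):
    """Damped average: treat the global mean as C phantom ratings,
    so a game with few real ratings is pulled toward the middle.
    C = 10 is an illustrative damping constant, not an LD value."""
    return (C * global_mean + sum(ratings)) / (C + len(ratings))

# Two 5-star votes no longer rocket a game to the top:
few  = bayesian_average([5, 5], global_mean=3.0)       # ~3.33, not 5.00
many = bayesian_average([4] * 100, global_mean=3.0)    # ~3.91
```

As a game accumulates ratings, the phantom votes stop mattering and the damped average converges to the plain average, which is exactly the behavior needed here.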

If you remember my statistics posts from LDs 22 and 23 (I’m too lazy to look up links, but they included spreadsheets of data and graphs), you’ll remember my findings basically pointed to this: regardless of a game’s quality or completeness, the more votes it got, the closer its rating approached 3.0; and near the minimum number of ratings (around 10), ratings would skew INCREDIBLY high or low, because people tend not to rate on a curve and a low number of votes can’t average that out. The result was that a slightly above-average game with fewer than 20 ratings would invariably end up rated higher than a good game with 100+ ratings.
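The few-votes-skew-wildly effect is easy to reproduce with a quick simulation. This uses an assumed rating distribution (a clamped Gaussian around a “true” quality of 3.5), not real LD data, so the numbers are only illustrative:

```python
import random

random.seed(1)

def average_of(n_votes, true_mean=3.5):
    """Simulate n integer star ratings (1..5) around a 'true' quality."""
    votes = [max(1, min(5, round(random.gauss(true_mean, 1.2))))
             for _ in range(n_votes)]
    return sum(votes) / n_votes

# 1000 simulated games each, with few vs. many ratings:
spread_few  = [average_of(10)  for _ in range(1000)]
spread_many = [average_of(100) for _ in range(1000)]

# The range of averages shrinks as votes accumulate -- games with
# 10 ratings land all over the scale, games with 100 cluster tightly:
print(max(spread_few)  - min(spread_few))    # wide
print(max(spread_many) - min(spread_many))   # much narrower
```

With many games clustered near the true mean and few votes per game, the lightly-rated games dominate both the top and bottom of the table, which matches the observation above.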

With 1000+ people, most of them not rating at all or only rating a single game, the rating skew is ENORMOUS now and the system no longer works. A simple rating system like ours could only possibly work if every single game got an even number of reviews, and even then the sample size is too small if the same people don’t review them all. So I’d have to agree with basically a combination of some of the things mentioned above.

Rating elimination rounds: each round scrapes off the bottom x%, and that’s the “tier” they score in for the final. After that percentage is eliminated from making it to the next tier up, voting re-opens on the remaining games for a few days, and the process repeats. Instead of a full month for ALL ratings, have each tier open for a few days, with the last few tiers basically being the gold/silver/bronze winners in those categories.

As long as you don’t scrape off too much of the bottom, the ratings should become fairer with each tier. And since you’re only scraping from the bottom in the “less fair” early tiers, even if you were eliminated, the skew shouldn’t have been so bad that you would have made it past the next tier anyway, as you would be even lower in the ranks there.

To give even fairer scores you could also use Bayesian averages, but in my opinion a system like this wouldn’t need them: the number of votes on a game would rise with each tier, so its ratings relative to other games in the same tier should be reasonably close, especially as you approach the gold/silver/bronze tiers. And it wouldn’t drill down to exact positions. As mentioned elsewhere here, your exact position is essentially random, as it’s beyond the precision of any of these rating systems. So gold could be the top 5%, silver the top 10%, bronze the top 20%, a fourth tier the top 40%, and everything below that essentially unranked. Precision drops significantly as you go down the ranks; below the top 40% you could still see your score, but as it is now, your “position” past the top 100 or so is meaningless anyway.
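The elimination-round idea can be sketched as a loop that repeatedly drops the bottom fraction of the field. The cut fraction, round count, and scores here are illustrative assumptions; a real system would re-open voting between rounds, which this sketch omits:

```python
def run_tiers(scores, cut_fraction=0.4, rounds=3):
    """Each round drops the bottom cut_fraction of remaining games.
    scores: dict of game -> current average rating (held fixed here;
    in the proposal, voting would re-open between rounds).
    Returns the tier each game ended in (higher = survived longer)."""
    tiers = {}
    remaining = dict(scores)
    for r in range(rounds):
        ordered = sorted(remaining, key=remaining.get)  # worst first
        n_cut = int(len(ordered) * cut_fraction)
        for game in ordered[:n_cut]:
            tiers[game] = r                 # eliminated in round r
        remaining = {g: remaining[g] for g in ordered[n_cut:]}
    for game in remaining:
        tiers[game] = rounds                # survived every round
    return tiers

tiers = run_tiers({"a": 2.1, "b": 3.0, "c": 3.4, "d": 4.2, "e": 4.8})
# "a" and "b" fall in round 0, "c" in round 1, "d" and "e" survive.
```

Because only the bottom is cut each round, a noisy score near the cutoff can only cost a game one tier, not fifty ranking places.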

I totally agree with you… that “INCREDIBLY high or low” skew was really transparent and obvious in the last LD, where, as I pointed out, games that came in at high overall positions such as 5th were rated by only a very small number of people, which in my opinion completely broke the scoring and rating system for the previous competition.

Alas I fear it will be the same, if not worse, for the next competition.

Improvement is always awesome, but it’s quite hard to achieve. The last changes made to the rating system had a really positive effect on actually getting people to rate, as they reward those who do.

Things that should be taken into account are:

* A lot of contestants rate very few games, and a surprising number of people do not rate a single game.
* Rating isn’t always fun, especially if you walk through the entries that have been rated the least. There are two reasons for an entry to be rated the least: 1) it’s bad, 2) it’s for a minor platform.

The last LD offered a couple of changes that rewarded players who rated entries by showing their entries first on the list of entries to rate. That way, if people wanted their entry to get rated, all they had to do was rate more entries. It increased the number of rated entries by 50% and was very effective. I’m _REALLY_ happy with how that turned out. The really good entries will always stick out, and the people who want to be considered properly only need to do their part of the chore that is rating other entries of … pretty various quality.

Let’s get a full understanding of the problems before us before we begin to create a solution.
Problems:
If a game gets advertised through the right channels, it gains additional hits that will likely never reach other games in the competition. A well-advertised game will be played more often, causing it to go viral. It’s a virtuous circle, but it plays against us in this case: one game will have many reviews, while the rest will have few to none.

How, then, do we make people play more than one game? And for that matter, how can we be sure someone has played a decent amount of the game before they review it?

A simple but not all-encompassing solution for the latter problem comes to mind: a Ludum Dare application that can download and play games right inside the app. You open up the LD48 app, pick out a game (or have it pick one for you), play it for a couple of minutes, and the Review option becomes available. Instantly we go from having to download and install multiple games to having one program serve games to you on a silver platter.

(1) QUANTIZATION
Votes go from 1 star up to 5 stars, and there are on average about 20 ratings per game, so there are only about 80 gradations between 1.00 and 5.00. When 1000 contestants are put into those 80 boxes, there will be over 10 people in each box on average. Rankings have an error of up to 10 places due to this quantization alone.
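The box-counting arithmetic above can be checked directly. With n ratings, the average is a sum of integers divided by n, so it can only take values that are multiples of 1/n:

```python
# With n ratings of 1..5 stars, the average is sum/n, so it can only
# take the values k/n for k from n to 5n -- that's 4n + 1 distinct boxes.
n_ratings = 20
boxes = {k / n_ratings for k in range(n_ratings, 5 * n_ratings + 1)}
print(len(boxes))   # 81 possible averages between 1.00 and 5.00

# 1000 contestants / ~80 boxes puts more than 10 games in each box
# on average, so ties alone blur the ranking by about 10 places.
```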

(2) AVERAGING NOISE
Let’s say I average 3.50 and then receive one more rating. All things being equal I will get either 3 or 4 stars, which nudges my average to either 3.476 or 3.524 and moves me up or down the ladder by up to 10 places. Rankings carry an additional error of up to 10 places due to this averaging noise.
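The 3.476 / 3.524 figures follow from the roughly-20-ratings-per-game assumption in (1):

```python
# Assume 20 existing ratings averaging exactly 3.50 (so their sum is 70),
# then one more vote arrives:
n, total = 20, 70
low  = (total + 3) / (n + 1)   # a 3-star vote: 73 / 21
high = (total + 4) / (n + 1)   # a 4-star vote: 74 / 21
print(round(low, 3), round(high, 3))   # 3.476 3.524
```

So a single unremarkable vote moves the average by about 0.024 stars in either direction, which in an 80-box field is enough to hop several ranking places.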

(3) SAMPLE NOISE
Suppose my game deserves 3.5 stars on average, so the people who rate it would be expected to give 3 or 4 stars, each with probability 0.5. But because of random sampling, I may get an unlucky run of 3-star givers — just like getting an unlucky run of tails in a series of coin flips. This sample noise is likely to cause an error of up to 30 places.
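The coin-flip analogy can be quantified. Twenty fair 3-or-4-star votes are a Binomial(20, 0.5) draw, so the resulting average has a known standard deviation, and a quick simulation shows how far an unlucky sample can drift:

```python
import math
import random

random.seed(0)

# 20 voters each give 3 or 4 stars with probability 0.5 -- a binomial.
# The average is 3 + Binomial(20, 0.5) / 20, so its standard deviation
# around the "true" 3.5 is sqrt(0.25 / 20):
n = 20
sd = math.sqrt(0.25 / n)
print(round(sd, 3))   # 0.112

# Over many simulated games, unlucky samples land well away from 3.5:
samples = [3 + sum(random.randint(0, 1) for _ in range(n)) / n
           for _ in range(10000)]
print(min(samples), max(samples))
```

A swing of ±0.1 stars is of the same order as the averaging noise in (2), and an unlucky two-standard-deviation run puts a 3.5-star game near 3.3, which is how a 30-place error arises.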

Those three random phenomena can easily nudge my game 50 places up or down, which makes the detailed rankings complete nonsense. As LD grows bigger, the rankings will become even greater nonsense.

I suggest we eliminate this noise by not ranking contestants from 1 to 1000 at all. Instead, give awards to the top percentiles.

Top 2% – Gold Award
Top 6% – Silver Award
Top 12% – Bronze Award

It’s not perfect, but percentiles will absorb a lot of the noise, and the stability of it would be constant, no matter how big LD grows. I feel it would also better reflect the spirit of LD.
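The percentile scheme above is simple to compute from any scoring. This sketch uses the 2% / 6% / 12% cutoffs from the comment; the scores are made-up example data:

```python
def award_tiers(scores,
                tiers=((0.02, "Gold"), (0.06, "Silver"), (0.12, "Bronze"))):
    """Assign awards by percentile rank. tiers holds cumulative top
    fractions, best first -- the 2% / 6% / 12% split suggested above."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    awards = {}
    for i, game in enumerate(ranked):
        frac = (i + 1) / len(ranked)
        awards[game] = next((name for cut, name in tiers if frac <= cut), None)
    return awards

# With 100 games: the top 2 get Gold, the next 4 Silver, the next 6 Bronze.
scores = {"game%d" % i: 100 - i for i in range(100)}
awards = award_tiers(scores)
```

Note that the cutoffs are cumulative, so “Silver” is the top 6% minus the Gold winners; whether a game sits at rank 20 or 50 inside a band no longer matters, which is exactly the noise absorption being argued for.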

Idea: there is a plugin for Chrome (and I think Firefox too) called WOT (Web of Trust).
What it does is allow users to rate websites in four categories: trustworthiness, vendor reliability, privacy, and child safety. Instead of simply averaging the results, it also takes into account how trustworthy the rater is, by counting how many websites that user has rated. For more information: http://www.mywot.com/en/faq/website/rating-websites.
A rating system like this could be a better way to go, but there may also be downsides to such a thing.
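A rater-weighted average in that spirit might look like the sketch below. The log weighting is my own assumption for illustration; it is not WOT’s actual formula, and how the weights should really be chosen is exactly the kind of question worth debating:

```python
import math

def weighted_rating(votes):
    """votes: list of (stars, rater_activity) pairs, where activity is
    how many entries that rater has rated. More active raters carry
    more weight via a log weighting (one possible choice -- not WOT's
    actual algorithm)."""
    weights = [math.log(1 + activity) for _, activity in votes]
    score = sum(stars * w for (stars, _), w in zip(votes, weights))
    return score / sum(weights)

# A 5-star vote from someone who rated one entry counts less than
# a 3-star vote from someone who rated fifty:
r = weighted_rating([(5, 1), (3, 50)])   # lands much closer to 3 than to 5
```

One obvious downside is the incentive it creates: people could farm weight by rating many entries carelessly, so activity alone is probably too crude a trust signal.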

I don’t really think we should cut the current winning aspect. It would be too drastic and take publicity from those who earned it. Personally, I don’t see too much of a problem with the current rating system, although if anything, I like jellonator’s suggestion.

I don’t really think we should cut the current winning aspect. It would be too drastic and take publicity from those who earned it.

Contestants earn an approximate position (such as being in the top 5%), but whether they come 20th or 50th is essentially random, for the reasons I explained. That random component isn’t earned, so it should not factor into what deserves merit or publicity.

It’s a common rule of statistics that when data is accurate to 1 decimal place, you don’t quote it to 3 decimal places, as that gives an illusion of precision where there isn’t any.

I’ve argued that the precision in LD is about 25 ranks out of 1000 contestants, i.e. 2.5% accuracy; yet when we rank people from 1 to 1000, we create an illusion of there being 0.1% accuracy.