Game Balancing: Data-driven Approach

Arthur Mostovoy, Lead Game Designer

April 25, 2017

In my previous article I tried to describe how to translate abstract game concepts into concrete numbers that can then be manipulated. We saw that this can help us derive the very first version of a balanced system for a given game. However, since such a system usually rests on a lot of assumptions and sometimes complex calculations, we cannot reliably assume that what looks great in theory will actually do great in practice when we release it to millions of players. Besides, if we decide to tweak it in the future, how do we figure out whether what we have in mind is good enough or needs additional adjustments?

We certainly can just go ahead with the release and see what happens, but this would not be ideal, as it may well break the balance and drive the users away. Fortunately, there’s a better approach: extensive testing in real-world conditions. For this purpose a lot of digital companies combine various testing methods with data to make sure the soon-to-be product is top notch. The methods I would like to mention here are beta testing (which is usually conducted before the initial release) and a public test server (which is used to test updates planned for the live server of a released game).

How it works

Both methods reproduce real-world conditions, provided the data you collect is relevant and there is enough of it. Relevance takes care of the qualitative side. When you conduct a beta test, you generally try to gather people who are interested in playing your game when it’s released, because they will most likely be your actual players. If you ask match-3 players to try out your zombie-killing hardcore online competitive game, what kind of feedback do you expect? The data you process after the test will be largely irrelevant. With a public test server, it’s mostly the actual players from your live server (who else?) who will be trying out future patches, so relevance is a given in this case.

A large audience takes care of the quantitative side. Statistically, more people testing your system means a more accurate result. But you don’t need to bring the whole world there. A few thousand people is reasonably high, yet if you can get more, go for it. Aside from granting players access to exclusive new content (even if it’s just balance changes, which mean a lot in online competitive games), you can also hand them additional rewards on the live server for participating and letting you know what they think.
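To get a feel for why a few thousand testers is already a reasonable number, here is a minimal sketch of how the uncertainty of a poll result shrinks as the audience grows. It uses the textbook normal-approximation margin of error for a proportion, which is my illustration rather than anything from our actual pipeline, and it assumes the testers behave like a random sample of your players (self-selected testers rarely do, so treat the numbers as a best case). The function name and the sample sizes are made up for the example.

```python
import math


def margin_of_error(sample_size, share=0.5, z=1.96):
    """Approximate 95% margin of error for a poll share.

    Normal approximation for a proportion; share=0.5 is the worst case.
    Assumes a random sample, which real test audiences only approximate.
    """
    return z * math.sqrt(share * (1 - share) / sample_size)


# Hypothetical audience sizes, from a tiny test to a large one.
for n in (100, 1000, 5000, 20000):
    points = margin_of_error(n) * 100
    print(f"{n:>6} testers: +/- {points:.1f} percentage points")
```

Under these assumptions the error drops from roughly ten percentage points with a hundred testers to about three with a thousand, and every further gain requires a much larger audience, which is why more people always help but with diminishing returns.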

Things to track

Data can be split into two categories: user feedback (impressions) and user behavior (stats).

With impressions, the idea behind the data-driven approach is that the weight of an individual impression diminishes as the number of unique impressions grows. Besides, there’s a physical limit to how many of them you can process as a human being. If your game is played by 5 people, you can easily listen to every one of them, discuss their thoughts with them and figure out what their consensus is. If your audience is literally a couple million people, how many impressions do you think you could process one by one before you lose track of what’s happening? And what if you run into contradicting impressions every once in a while (which is perfectly normal, since we’re all human and have our own opinions)?

For this reason, if your playerbase is large, it makes sense to give your users a tool that lets them speak their mind and lets you look at the big picture once they’re done. The first tool with such capabilities that comes to mind is simply a poll. After the test is conducted, you ask your users about every detail that is important to you and give them an exhaustive set of options to pick from. For instance, in War Robots, when we test a new weapon (which can and should be viewed not only as a new addition to the balance, but as an entity that by itself shifts the balance), we usually ask the players whether they think any of the weapon stats should be increased or reduced. We also give them an ‘everything is fine’ option and the option to speak their mind freely at the end of the poll if they want to. This allows us to group individual impressions into collective feedback and see what’s what.
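Grouping individual answers into collective feedback can be sketched in a few lines. The example below is only an illustration: the question, the option names and the answers are hypothetical, and a real poll would have one such tally per weapon stat.

```python
from collections import Counter

# Hypothetical answers to "What should happen to the weapon's damage?"
responses = [
    "reduce", "fine", "reduce", "increase",
    "reduce", "fine", "reduce", "fine",
]


def summarize_poll(answers):
    """Return (option, share of all answers) pairs, most popular first."""
    counts = Counter(answers)
    total = len(answers)
    return [(option, count / total) for option, count in counts.most_common()]


for option, share in summarize_poll(responses):
    print(f"{option:<10}{share:.0%}")
```

Free-text comments from the end of the poll do not fit into a tally like this and still need to be read by a person, but the shares tell you where to start reading.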

With user behavior, it all depends on which stats you think you need to track to decide whether anything should be changed. Again, if we take weapons as an example, I usually look at things like how much damage the weapon did, how many times it was used on the battlefield and so on. There can be many subtle metrics that affect gameplay and user experience, like how many times the user has died while using this specific weapon. Try to think outside the box when figuring out what metrics to track.
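As a rough sketch of what tracking these stats might look like, the snippet below turns raw battle records into the per-weapon numbers mentioned above: usage count, average damage and how often the owner died. The record layout, weapon names and values are all invented for the example and do not come from a real game.

```python
from collections import defaultdict

# Hypothetical battle log rows: (weapon, damage_dealt, owner_died)
battle_log = [
    ("new_cannon", 1200, False),
    ("new_cannon", 900, True),
    ("old_rifle", 400, True),
    ("old_rifle", 600, False),
    ("old_rifle", 500, True),
]


def weapon_stats(rows):
    """Aggregate uses, average damage and death rate per weapon."""
    totals = defaultdict(lambda: {"uses": 0, "damage": 0, "deaths": 0})
    for weapon, damage, died in rows:
        entry = totals[weapon]
        entry["uses"] += 1
        entry["damage"] += damage
        entry["deaths"] += int(died)
    return {
        weapon: {
            "uses": t["uses"],
            "avg_damage": t["damage"] / t["uses"],
            "death_rate": t["deaths"] / t["uses"],
        }
        for weapon, t in totals.items()
    }


stats = weapon_stats(battle_log)
for weapon, values in stats.items():
    print(weapon, values)
```

Adding a new metric is then a matter of logging one more field per battle and one more line in the aggregation.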

Corner cases

After hearing what players have to say about the weapon, i.e. after analyzing the impressions, I relate them to the stats: the way players actually behaved in battle, expressed in numbers. All this data combined is usually enough to decide whether to move forward with the update or to hold it back and tweak it a little more.

However, there are a couple of points to note here. Firstly, things can go wrong anywhere, anytime, and tests are no exception to this rule: somebody can forget some of the planned tweaks and include only part of them in the test (thus changing the carefully planned test scenario without realizing it), or something can go wrong on the technical side. A subtle change in the way a projectile is fired from a weapon can change its balance immensely in a manner you wouldn’t predict. This will ultimately affect the final data you receive and process. As such, you should always check that the tests have been conducted according to the initial plan. If they have not, you should either run more tests under the planned conditions or take the deviation into account when evaluating the numbers.
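One simple way to check that a test followed the plan is to compare the tweaks you intended with the values that actually ended up in the test build. The sketch below assumes both can be exported as flat key-value mappings. The keys and numbers are hypothetical, and a check like this will not catch purely technical problems such as a changed projectile behavior.

```python
def find_deviations(planned, applied):
    """List tweaks that are missing, differ from the plan, or were not planned."""
    deviations = []
    for key, expected in planned.items():
        if key not in applied:
            deviations.append(f"{key}: planned {expected}, not applied")
        elif applied[key] != expected:
            deviations.append(f"{key}: planned {expected}, got {applied[key]}")
    # Sorted so that the report order is stable between runs.
    for key in sorted(applied.keys() - planned.keys()):
        deviations.append(f"{key}: applied but not planned")
    return deviations


# Hypothetical plan versus what was found in the test build.
planned = {"cannon.damage": 950, "cannon.reload_s": 5.0, "cannon.range_m": 600}
applied = {"cannon.damage": 950, "cannon.reload_s": 4.0}

for line in find_deviations(planned, applied):
    print(line)
```

An empty list means the build matches the plan. Anything else is either a reason to rerun the test or a note to keep next to the numbers when you evaluate them.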

Secondly, it’s pretty obvious where things stand if impressions and stats go hand in hand. But what if they differ drastically? Although it might seem unrealistic at first, this may well be the case in certain situations. For instance, let’s say a new weapon that has been tested appears to do significantly more damage than any other weapon in the game and, according to the stats, simply decimates everyone. Imbalance, right? No problem, let’s prepare the tweaks. Wait a second though: the users are absolutely happy with the weapon and think it should be deployed on the live server right away! What then?

While the initial reaction of the users is highly favorable, in practice, if you just go ahead and release an imbalanced piece of content on the live server, it will quickly cannibalize all the other content in the game and become the meta (the single best option for players who want to play effectively and win). The hype around the new weapon will turn into hatred towards the new state of balance in the game. I would say that when impressions and stats collide, you have to let the stats override the impressions (presuming they are tracked correctly, of course), rebalance the weapon and conduct more tests until it fits your balance system. However, this situation may also indicate that the way your balance system is calculated is not ideal and does not take into account something that this new weapon provides, so that might also be worth a look before you make a final decision.
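The rule of letting the stats override the impressions can be written down as a small decision helper. This is only a sketch under my own assumptions: the 25% damage tolerance, the majority threshold and the outcome labels are placeholders rather than values from a real balance system, and the 'check_feedback' branch (stats fine, players unhappy) is an extra case that the reasoning above does not cover.

```python
from statistics import mean


def review_weapon(avg_damage, other_avg_damages, share_wanting_nerf,
                  damage_tolerance=1.25, nerf_majority=0.5):
    """Combine a stats signal with poll feedback into a suggested next step.

    All thresholds are hypothetical and should be tuned per game.
    """
    baseline = mean(other_avg_damages)
    overperforms = avg_damage > baseline * damage_tolerance
    players_object = share_wanting_nerf > nerf_majority

    if overperforms and not players_object:
        # Stats win over happy players; also re-check the balance model.
        return "conflict"
    if overperforms:
        return "rebalance"
    if players_object:
        # Assumption: stats look fine, so read the complaints first.
        return "check_feedback"
    return "ship"


# Hypothetical numbers: the new weapon doubles the usual damage,
# yet only 10% of polled players want it toned down.
print(review_weapon(1050, [500, 550, 450], share_wanting_nerf=0.1))
```

A 'conflict' result is the situation described above: rebalance and retest, but also look at whether the balance model is missing something the weapon provides.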

Afterword

While I used a single defined entity (a weapon) as an example representing a planned update to a released game, the case can be further generalized and applied to pretty much anything — a piece of content, a technical improvement or game balance as a whole.

However, do not assume that math and statistics are mandatory to design and ship a well-balanced game that people will love playing. If you can do it without using either of them, all the better for you! In the end, both of these are simply instruments in a game designer’s hands — instruments that are there to make your life easier and double-check your design if you know how to use them to your advantage.