Our video game database is being modernised on a grand scale, and this preview series is your chance at a sneak peek. In this first entry we will be discussing the new core data model.

The most important aspect of any game database is the underlying data model. Design it the wrong way and you're heading for disaster, forever entrenching your database in a morass of dirty workarounds for seemingly trivial problems. Over the last decade or two we have witnessed this in all the major gaming databases, each with its own set of problems. We tried to avoid these pitfalls with our initial design, but the scope of our project has changed somewhat since then, making it necessary to redesign the core model from scratch. Thanks to good data-layer encapsulation in our codebase and hard work on a script to convert our existing data, we have been able to design the new model without any compromises.

Before we explain some of the benefits of this model, we want to give you a quick overview of it. It's quite a simplified representation, leaving out some technical details, using better-known terms and even omitting several non-trivial levels. But it should still give you a good idea of what it's all about.

What are some of the advantages of this approach, you may ask, and why does it have to be so complex, featuring multiple levels? There are games, and they have releases, right? Isn't that all there is to it? Unfortunately, not quite. As video games are a curious mix of technology and art, things like ports, remakes, mods, compilations, add-ons, demos, pirated releases and even fan translations muddy the waters significantly, requiring a thorough design approach. Here are some examples of things only possible thanks to this complex design.

Game Relationships

One of our main goals with the game database here at RetroCollect is preserving the history of video games. We preserve this history by accurately storing data about games, called metadata. When people look back at games from the 1980s a hundred years from now, one of the most complicated tasks will be untangling the vast ways in which all these games were connected to each other.

Let's take a look at a seemingly simple example to see why some of the complexity above is needed: Final Fantasy IV (Final Fantasy II in the US).

The principal work of art in this case is Final Fantasy IV. It has a distinct set of game mechanics and a basic storyline which are the same across all its various releases and iterations. It is part of the Final Fantasy series, a sequel to Final Fantasy III, and has a sequel of its own in Final Fantasy V. There is a reinterpretation of the original game (Easy Type), there are ports (PS1), remakes (WonderSwan Color), enhanced remakes (3DS), ports of remakes (GBA), digital re-releases (Virtual Console), spin-offs (Interlude) and fan translations, and it has been part of compilations on various platforms (Final Fantasy Collection, Final Fantasy Chronicles, Final Fantasy Anthology, Final Fantasy IV Complete). Phew, yeah, that's right!

But even if you limit yourself to something as simple as the SNES version, you have to track the differences between regions, the name change for the US, the fact that it had multiple reprints (e.g. the USA-1 manual variant) and of course the whole Easy Type business. For games released in Europe this becomes even more complex. All of these things are only possible thanks to the various levels in our model shown above. Drop a single abstraction level from that design and you lose the ability to accurately track at least one of them. And this example from the 16-bit age doesn't even involve tricky things like episodic games, downloadable content, demakes or limited editions. Rest assured, none of that would be a problem either!
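To make the idea of stacked levels a bit more concrete, here is a purely illustrative sketch in Python/SQLite. The table and level names (game, version, release, variation) are our shorthand for this post, not the actual schema, which has many more levels and far more detail:

```python
import sqlite3

# Illustrative only: four of the levels, with Final Fantasy IV as the example.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE game      (id INTEGER PRIMARY KEY, title TEXT, series TEXT);
CREATE TABLE version   (id INTEGER PRIMARY KEY,
                        game_id INTEGER REFERENCES game(id),
                        name TEXT, platform TEXT, kind TEXT);   -- port/remake/...
CREATE TABLE release   (id INTEGER PRIMARY KEY,
                        version_id INTEGER REFERENCES version(id),
                        region TEXT, local_title TEXT);         -- regional naming
CREATE TABLE variation (id INTEGER PRIMARY KEY,
                        release_id INTEGER REFERENCES release(id),
                        detail TEXT);                           -- reprints etc.
""")
con.execute("INSERT INTO game VALUES (1, 'Final Fantasy IV', 'Final Fantasy')")
con.executemany("INSERT INTO version VALUES (?,?,?,?,?)", [
    (1, 1, 'Original',  'SNES',             'original'),
    (2, 1, 'Easy Type', 'SNES',             'reinterpretation'),
    (3, 1, 'Remake',    'WonderSwan Color', 'remake'),
])
con.executemany("INSERT INTO release VALUES (?,?,?,?)", [
    (1, 1, 'JP', 'Final Fantasy IV'),
    (2, 1, 'US', 'Final Fantasy II'),   # the US name change lives at this level
])
con.execute("INSERT INTO variation VALUES (1, 2, 'USA-1 manual reprint')")

# One game, many layers: each oddity attaches at exactly one level.
rows = con.execute("""
    SELECT g.title, v.name, r.region, r.local_title
    FROM game g
    JOIN version v ON v.game_id = g.id
    JOIN release r ON r.version_id = v.id
    ORDER BY r.id
""").fetchall()
```

The point of the sketch: the US title change sits on the release, the Easy Type reinterpretation is its own version, and the USA-1 reprint is a variation of one specific release. Drop any of those tables and one of these facts has nowhere sensible to live.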

Data Accuracy

But there's more to it than "just" correctly modelling the relationships between games. It's also very important to track each bit of data you want to save about a game at the proper level. An easy example of this is genre. It would make no sense to save the genre of a game for each regional release. If the US release Nintendo World Cup is a soccer game, then the original Japanese version Nekketsu Kōkō Dojjibōru Bu: Sakkā Hen is also a soccer game. Saving this information twice leads to errors and forces users to enter the data twice. The same is true for every bit of data. Does the game have a multiplayer mode, and what does it look like? That's a property of the game version, not of the packaging variant or port. Can a cartridge be played on a GBA? That's a piece of data related to the combination of version and medium. Was the Amiga version given a 90% score? That's directly related only to the port, not the game, but it's still relevant for the game. Was there a blue cover variant? That's a detail about a variation, not about a release. Did the Limited Edition come with an artbook? That needs to be tracked with the edition. We're sure you can see the pattern by now. Suffice it to say, only the complex design makes it possible to accurately store all of this.
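The "store each fact once, at the right level" rule can be sketched as a lookup that walks from the most specific record up to the game itself. This is an illustrative Python toy, not our actual implementation; all field and level names here are made up:

```python
# Illustrative only: each fact lives at exactly one level and is found
# by walking the parent chain from the most specific record upwards.
def resolve(record, attr):
    """Return attr from the first level (most specific first) that stores it."""
    node = record
    while node is not None:
        if attr in node["attrs"]:
            return node["attrs"][attr]
        node = node.get("parent")
    raise KeyError(attr)

game    = {"attrs": {"genre": "soccer"}, "parent": None}            # stored once
version = {"attrs": {"multiplayer": "4-player"}, "parent": game}
port    = {"attrs": {"review_score": 90}, "parent": version}        # e.g. Amiga 90%
release = {"attrs": {"local_title": "Nintendo World Cup"}, "parent": port}

resolve(release, "genre")         # the US release inherits "soccer" from the game
resolve(release, "review_score")  # 90, stored at the port level only
```

Because genre is stored exactly once on the game, the Japanese and US releases can never disagree about it, and nobody has to enter it twice.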

Collection Tracking

What if you're not interested in all this "metadata" and preservation? Maybe you simply want to track which games you own? We hear you! You still benefit from this design. Only through this complexity will you be able to define what exactly constitutes a "full set" for you. Include Australian games in your PAL set? Also want Japanese exclusives? No problem. What if you want to track exactly which edition or reprint of a game you own? Yep, once again the data model comes to your rescue. We'll describe the new collection tracking properly in a future preview post, so stay tuned.
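As a toy illustration of why release-level data makes this possible: a "full set" is then just a filter the user defines over releases. Python sketch with made-up titles and region codes, nothing here reflects the real schema:

```python
# Illustrative sketch: a "full set" is a user-chosen filter over releases.
releases = [
    {"title": "Probotector", "region": "PAL"},
    {"title": "Contra",      "region": "NTSC-U"},
    {"title": "Contra (JP)", "region": "NTSC-J"},
    {"title": "Footy Cup",   "region": "PAL-AUS"},  # hypothetical AUS exclusive
]

def full_set(releases, regions):
    """What counts as 'complete' depends entirely on the regions you pick."""
    return sorted(r["title"] for r in releases if r["region"] in regions)

pal_only     = full_set(releases, {"PAL"})
pal_plus_aus = full_set(releases, {"PAL", "PAL-AUS"})  # adds the AUS exclusive
```

Because every regional release is its own record, "my PAL set" and "my PAL set including Australian exclusives" are simply two different filters over the same data.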

Well, let's wrap it up for today. We hope you enjoyed this (small) peek into the depths of video game database design. You can look forward to more preview posts covering all kinds of topics in the future. If there's something in particular you want to hear about, please let us know.

Comments

Super interesting article. A lot of work is going into it, I see. Good luck with it all!

The one thing I'd love to see, which is more relevant to later consoles than earlier ones, is the ability to filter games which are exclusive to that console.

For example, if someone was collecting Gamecube, XBOX, PS1 or PS2, being able to filter just the games exclusive to that console would be another way to collect. This will get more important when platforms like the PS3 are added. Is it possible?

Nice read-up. What concerns me, though, is performance. I have no doubt this data integration will work smoothly offline and with a limited number of users. But online, with multiple users at a time, it may become a problem.

If you let me know what exactly you mean by data integration, I'll gladly reply to that ;)

Well - from your example it looks like when I query Final Fantasy IV I will pull all data from the database. Either that or the query / data string will be quite extensive.

I am no database expert by far, but I work in an environment where we use OLAP databases (SAP BPC). Even when working with as few as 6 people together on one dedicated server, we notice performance issues. And to be honest, we do not even have that big a model in place and stay well within the SAP-recommended number of dimensions. A typical "category" will have 400k records tops.

Just wondering how you will ensure we do not run into performance issues.

I would not take SAP as a good parallel to any other commonly used DB, at least not in the sense that any modern DB would have problems with 6 concurrent users. This data model does not look particularly taxing. That's not to imply performance won't still be something of a concern, but I wouldn't use your experience with SAP as a measure.

That said, maybe I missed this in another post but what is the DB in use here? Postgres? MySQL?

The 500 tables mentioned do make me wonder how exactly the small model above managed to translate into that many tables. Without much to go on, that seems like a lot for something like this and seems to speak to missed normalization opportunities. Not to say 500 tables will cause any problems at all, but I work with a very extensive Oracle DB with 2k tables in the core schema and probably another 3k spread across the other 30 or so schemas. It handles a pretty massive product space, from ordering and billing to minute network config details and monitoring data, and has taken 10 years to get to this point. I'd guess something like 15% of those tables are history/audit tables too. Would love to know more about how you got to 500 tables. Not knocking the design, I just wish I could see the ERD, since it sounds pretty extensive. Cheers on sharing what you have, and no hard feelings if you keep the rest vague :)

I would not take SAP as a good parallel to any other commonly used DB, at least not in the sense that any modern DB would have problems with 6 concurrent users. This data model does not look particularly taxing. That's not to imply performance won't still be something of a concern, but I wouldn't use your experience with SAP as a measure.

Well, considering I would expect a SAP database to be on a dedicated server, streamlined and built to endure several users at once, I would say the analogy works. (Though I must say, I keep wondering about certain SAP design choices.)

I am well aware the Retrocollect database may be a bit less heavy regarding the amount of data, but considering the website performance we are experiencing today, I think performance should certainly be one of the top concerns...

Yeah, I'm afraid I can't make the ERD public at the current time, maybe in the future. As I mentioned in the post, the above core model is simplified; there are actually 6 more "blue" levels in between those pictured above. We're using MySQL, and the number of tables is not the result of a single simple cause. The inherent complexity of video games as cultural and technological artefacts is the primary reason: the data is complex and thus requires an appropriately complex model. The design is almost completely normalized, with very few deliberate exceptions. But yes, there are history tables too (a central design, though, not a per-table one), and of course all the tables related to collection tracking and user votes. Heck, I think the platform stuff alone is ~50 tables. I'm probably going to explain some of this in another preview.

Flozem, you are quite right, the fact that lots of tables are needed to fully store the complexities of a single game means we have to run a rather high number of db queries for a single page. We are trying to counter this with multiple strategies (caching, lazy loading, a custom ORM), but yes, only extensive load tests will tell whether there are still problems.
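To give a rough idea of what the caching strategy buys, here's a toy sketch (in Python with SQLite purely for illustration; it is not our actual stack or code): once a page has been built from its many queries, repeat views are served from memory instead of hitting the database again.

```python
import functools
import sqlite3

# Toy illustration of caching an expensive page query; names are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE game (id INTEGER PRIMARY KEY, title TEXT)")
con.execute("INSERT INTO game VALUES (1, 'Final Fantasy IV')")

db_hits = 0  # count how often we actually touch the database

@functools.lru_cache(maxsize=1024)
def game_page(game_id):
    """Build the (in reality, many-query) page once; serve repeats from cache."""
    global db_hits
    db_hits += 1
    return con.execute(
        "SELECT title FROM game WHERE id = ?", (game_id,)
    ).fetchone()[0]

game_page(1)
game_page(1)  # cache hit: the database is not queried a second time
```

The real trade-off, of course, is cache invalidation whenever the underlying data changes, which is where the load tests mentioned above come in.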

Well, it is just the game database part, so no tables for the forum, news or community sections are included in that number. But I'm sure you already guessed that ;) Btw, I just checked: there are indeed 48 platform-related tables, and that doesn't include the system requirements stuff, only the hardware and software platforms as well as their abstractions and relations, of course including actual physical release editions as well as medium support and release regions.

I totally love the professional approach being taken here. And everything is being done for the community, without the need to pay for it :).

I also like how hydr0x has been almost completely silent about the update, only to come up with a short yet very informative and reassuring post like this, meaning you're probably busy with the work. Keep up the good work, can't wait to see the effect!
