Right now, I can tell you that about 37 percent of the roughly 781 million games registered to various Steam accounts haven’t even been loaded a single time. I can tell you that Steam users have put an aggregate of about 3.8 billion hours into Dota 2. I can tell you that Steam users tend to put nearly 600 percent more time into the multiplayer mode on Modern Warfare 2 than the single player mode.

Basically, I can give you an idea of how any of the thousands of games on Steam have performed, both in terms of sales and gameplay hours.

These estimates are based on publicly available information described in much more detail below. It's the kind of data that the public almost never gets access to in the video game industry. Sure, we get a monthly “Top 10” list of best-selling titles in the US from tracking firm NPD, but these results smash together myriad versions of multi-platform releases and don’t even contain specific sales numbers these days (foreign services like Britain’s Chart-Track and Japan’s Media Create are slightly more robust in their public reporting). Those with deep pockets can pay for access to a treasure trove of historic and current sales numbers, but subscribers are contractually forbidden from sharing those numbers with the public. Steam, to its credit, offers real-time and “daily peak” snapshots of how many players use its 100 most popular games, but these numbers can be transitory and don’t reflect total sales or play time very well.


We've come up with what we believe is a much more robust way of estimating sales and player data based on publicly available information, at least when it comes to games specifically on Valve’s Steam download service. The information we’ve collected over the past few months includes not only sales estimates for every game on Steam, but also data on how many hours Steam users have spent playing those games. The result is a wide-ranging survey of a service that estimates suggest represents 70 to 75 percent of the current PC gaming market in the US.

Today, we’re going to start sharing that data with you. But first, let’s discuss where our numbers actually come from as well as the strengths and limitations of our approach.

How we did it

As it happens, all the data needed to track sales figures on Steam was hiding in plain sight. The core of our data comes from the individual profile pages on Valve’s SteamCommunity.com social portal, such as this one for yours truly. Prominent on each of these public pages, again just a click away, is a list of every game that Steam users have registered to their accounts. This page also lists how many hours they’ve played for each of those titles (here’s mine if you have a desire to see my pile of shame for some reason).

Crucially, for our purposes, these profile pages don’t even require a specific username (e.g. http://steamcommunity.com/id/KyleOrl/) to access. They can also be brought up using a unique, usually hidden identifier known as a Steam Community ID number. As detailed on the Steam Developer Wiki, every individual user on Steam gets a unique 64-bit SteamID that can be converted algorithmically to a 17-digit decimal number, starting at 76561197960265729 and going up sequentially from there. That’s why my profile page can also be accessed using a URL in the form http://steamcommunity.com/profiles/76561197980357107 (you can find your own hidden SteamID number and Steam Community ID using this site; if you get results in the form of "STEAM_0:X:XXXXXX," put those results through again for the URL-able ID).
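For readers who want to try the conversion themselves, here is a minimal Python sketch. It assumes the commonly documented mapping for individual accounts, in which a legacy ID of the form STEAM_0:X:Y corresponds to the decimal Community ID 76561197960265728 + 2Y + X. The function names are our own, chosen for illustration, and are not part of any Valve API.

```python
import re

# Assumed base offset for individual accounts. The first valid Community ID
# mentioned above (76561197960265729) is this base plus one.
STEAM64_BASE = 76561197960265728


def legacy_to_community_id(legacy_id: str) -> int:
    """Convert a 'STEAM_0:X:Y' string to the 17-digit decimal Community ID."""
    match = re.fullmatch(r"STEAM_[0-5]:([01]):(\d+)", legacy_id)
    if match is None:
        raise ValueError(f"not a legacy SteamID: {legacy_id!r}")
    x, y = int(match.group(1)), int(match.group(2))
    return STEAM64_BASE + 2 * y + x


def profile_url(community_id: int) -> str:
    """Build the numeric profile URL in the form shown in the text."""
    return f"http://steamcommunity.com/profiles/{community_id}"


if __name__ == "__main__":
    print(profile_url(legacy_to_community_id("STEAM_0:1:0")))
```

Under that assumed mapping, STEAM_0:1:0 lands on the first ID in the sequence, 76561197960265729.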

A screengrab of my own profile page, with the kind of gameplay data that forms the core of our sampling.

Currently, the range of valid Steam Community IDs extends to include about 172 million pages (with some small gaps of invalid IDs occasionally popping up in the middle). Theoretically, if we could look at every one of those profile pages, we would have a comprehensive list of every game owned by every Steam user. Functionally, we here at Ars don’t have access to the kind of computing power needed to churn through hundreds of millions of webpages in a timely manner.

What we do have is a basic understanding of random sampling and an Amazon EC2 server instance that can scrape through more than 100,000 pages a day. Using our knowledge of the Steam Community ID structure (and some light PHP/MySQL coding), we’ve been conducting what amounts to a rolling, randomized poll of the Steam user universe for about two months now. Using this method, we've generated a generalized estimate of Steam sales and gameplay numbers.
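The heart of that randomized poll is simply picking Community IDs uniformly at random from the valid range. Our crawler was written in PHP, so the Python below is only an illustrative sketch of the idea, not the code we ran. The universe size is a rounded assumption, and `sample_profile_urls` is a hypothetical helper name.

```python
import random

FIRST_ID = 76561197960265729   # first valid Community ID, per the text above
UNIVERSE_SIZE = 172_000_000    # assumed approximate size of the ID range


def sample_profile_urls(n, rng=None):
    """Return n distinct profile URLs drawn uniformly from the ID range."""
    rng = rng or random.Random()
    # random.sample on a range draws without replacement and does not
    # build the full 172-million-element list in memory.
    offsets = rng.sample(range(UNIVERSE_SIZE), n)
    return [f"http://steamcommunity.com/profiles/{FIRST_ID + o}" for o in offsets]


if __name__ == "__main__":
    for url in sample_profile_urls(5):
        print(url)
```

Each URL would then be fetched and parsed, with private or invalid pages discarded before any counting is done.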

What we know and what we don’t

After throwing out pages that are marked as private or invalid for some reason, our crawler leaves us with a sample of about 80,000 to 90,000 valid Steam user pages every day. From that, we can generate an estimate of the percentage of all Steam users that have bought/played any particular game (and how many hours they’ve spent on that game). We then multiply that ratio out across the total size of the Steam Community ID universe (about 172 million but growing every day) to generate our sales and gameplay estimates.
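In code, the extrapolation is a single ratio scaled up to the size of the ID universe. The sketch below is illustrative only. The function names are hypothetical, and the sample counts in the example are made-up inputs rather than figures from our data.

```python
def estimate_owners(owners_in_sample, sample_size, universe_size):
    """Scale the ownership ratio seen in the sample up to the full ID range."""
    return round(owners_in_sample / sample_size * universe_size)


def estimate_total_hours(hours_in_sample, sample_size, universe_size):
    """Scale the hours logged by sampled users up to the full ID range."""
    return hours_in_sample / sample_size * universe_size


if __name__ == "__main__":
    # Hypothetical inputs: 1,300 owners found among 85,000 valid pages.
    print(estimate_owners(1_300, 85_000, 172_000_000))
```

Note that the denominator is the number of valid pages actually sampled, while the multiplier is the full range of Community IDs.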

Determining the denominator

This universe of about 172 million Steam Community pages in our valid sampling range is significantly larger than the 75 million active Steam users Valve cites as of January. There are a number of potential reasons for such a discrepancy: old user pages that are no longer counted as “active” by Valve, accounts that are created then abandoned without buying a game, or users that create multiple accounts for instance. Whatever the reason, we’ve found that multiplying out our results across the entire sample space of 171,340,250 valid Steam Community IDs (as of March 30) generates more accurate sales estimates, as described below.

Sampling what amounts to just 0.04 percent of Steam Community pages every day might not seem like an effective methodology, but the power of random sampling means that should be enough to generate a margin of error of only 0.33 percent from the actual numbers, statistically. (This is the same reason national political polls can be so accurate by sampling just a few thousand likely voters.)
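For the statistically curious, that figure is consistent with the textbook margin-of-error formula for a proportion. The sketch below assumes a 95 percent confidence level (z = 1.96), the worst-case proportion of 0.5, and a simple random sample, and it ignores the finite population correction. Those assumptions are ours for illustration.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the confidence interval for a sampled proportion."""
    return z * math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    for n in (1_000, 90_000, 250_000):
        print(f"n={n}: +/- {margin_of_error(n) * 100:.2f} percentage points")
```

With a daily sample of 90,000 pages, this works out to roughly 0.33 percentage points. The margin is expressed in points of the ownership share, so it is proportionally larger for games owned by only a small fraction of users.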

Still, to be as accurate as possible and to smooth out some noisiness in the day-to-day samples (especially for games that aren’t major sellers), we're using a three-day rolling sample to generate our final reported numbers. That means every day we “throw out” the data from three days prior and replace it with newer data from more recent crawling. Our rolling sample generally includes data from more than 250,000 valid Steam Community profiles at any time.
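The rolling window is easy to picture as a fixed-length queue of daily tallies. This is a minimal sketch under our own naming (the `RollingSample` class is hypothetical, and our actual crawler stored its data in MySQL), tracking a single game for simplicity.

```python
from collections import deque


class RollingSample:
    """Keep only the most recent N days of crawl results for one game."""

    def __init__(self, days=3):
        # A deque with maxlen silently drops the oldest day when a new one arrives.
        self._days = deque(maxlen=days)

    def add_day(self, profiles_sampled, owners_found):
        self._days.append((profiles_sampled, owners_found))

    def ownership_ratio(self):
        """Pooled share of sampled profiles that own the game."""
        total = sum(n for n, _ in self._days)
        owners = sum(k for _, k in self._days)
        return owners / total if total else 0.0


if __name__ == "__main__":
    window = RollingSample()
    # Hypothetical daily tallies: (valid profiles sampled, owners found).
    for day in [(85_000, 400), (88_000, 430), (82_000, 410), (90_000, 470)]:
        window.add_day(*day)
    print(window.ownership_ratio())
```

Pooling the raw counts, rather than averaging three daily percentages, weights each day by how many valid profiles it contributed.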

Granted, we don’t have a perfect random sample of all Steam users here, since some people have effectively “opted out” by setting their profile pages to "private" in their user settings. It’s possible that these “private” users are buying and playing games in a significantly different manner from the public at large, but we don’t consider this a huge risk. In any case, players that have actively set their profile to “private” account for only about six percent of all pages we try to randomly sample.

While we feel our methodology is sound, we wouldn’t be comfortable reporting these numbers unless we were able to spot check their accuracy and reliability against some real-world data. Fortunately, we’ve been able to compare our estimates to both public and private Steam sales data in a few cases now, and we've been pleased with how comparable the results have been.

On March 28 for instance, DayZ creator Dean Hall revealed that he sold 1.7 million copies of the standalone version of his game through Steam early access. Our sampling method estimated sales of about 1.76 million copies on that date. Similarly, Rust developer Garry Newman tweeted notice that the game’s sales on Steam passed one million as of February 10. Our sampling estimated the game sold just 930,000 copies by that date, but our numbers caught up to reality relatively quickly, reporting one million estimated sales for the game just two days later. This suggests our sample may be a bit behind reality for games that are selling particularly quickly, which makes sense logically. Our Rust sample on February 10 still included a lot of data collected on February 8, when the game had sold a little less than a million copies.

A few Steam developers have also been willing to share their own sales data with us off the record to help calibrate and confirm our sampling method to the point that we’re confident our numbers are at least in the right ballpark. We should stress again, however, that these are still just estimates based on a small sample of public Steam user pages. But in general, we’re confident that our reported numbers are within a few percentage points up or down from the actual sales and gameplay numbers. That accuracy seems to decrease a little bit for games that are selling quickly (as mentioned above) and games that haven’t sold much at all (and are therefore harder to find within our small sample). If you’re a Steam developer that would like to share your numbers to help us refine our data further, please get in touch.

It’s also important to note that looking at Steam data obviously doesn’t give a full picture of the entirety of the PC games market. Games that people buy and play via Battle.net, UPlay, Origin, Good Old Games, GamersGate, Desura, direct download, or even actual retail discs (remember those?) are not reflected in our numbers. That includes some of the most popular PC games out there, and this fact makes our report a necessarily incomplete view of the entire market for computer games.

For games that are sold both on and off of Steam, this kind of single-platform reporting can be downright misleading in fact. For instance, our sample shows only 465,000 Steam sales for The Sims 3, even though EA said the game sold 1.4 million copies in its first week on sale in 2009 (through all distribution methods). When using any of our numbers, it’s important to consider that a game may have a significant presence off Steam beyond what we're reporting.

All that being said, Steam does represent a significant chunk of the PC games market, and we’re excited to finally be able to share our estimates of what that portion of the market looks like. We also spent a bit of time analyzing what these figures say about the state of PC gaming.

Determining Steam’s most popular games

Games published by Valve itself dominate the list of most-owned titles on Steam.

So what are the most popular games on Steam? Well, that depends on how you define popularity to some extent (all data discussed in this section is up to date as of March 30, 2014, and it excludes DLC and non-game apps sold on Steam).

The easiest way to analyze the data is simply to see which games are registered to the most distinct Steam accounts (i.e. the number of “owners”), as shown in the chart above. But that number isn’t always indicative of how popular a game is among actual players. Let's look at those top-selling games again, but this time, let's set aside the owners that haven't played the game even once (Steam reports play time data in tenths of an hour, so it’s possible that players who put in less than six minutes with a game are showing up as “never played” in our data).

Games like Ricochet and Deathmatch Classic are registered to a lot of Steam accounts, but not played all that much since March 2009.

As you can see, just because a game is registered to a lot of Steam accounts doesn't mean it's popular. Half-Life 2: Lost Coast, for instance, is the third-most popular game on the service by ownership, registered to about 12.8 million Steam accounts by our count. But the tech demo, which shows off some deleted content from Half-Life 2, has only been actively loaded up by about 2.1 million of those owners, placing it behind 35 other Steam games by that metric. That may sound hard to believe, but remember that Lost Coast was offered for free to everyone who bought Half-Life 2 through Steam, putting it on millions of accounts that may not have been interested in playing it. Nvidia and ATI later offered the demo to anyone who bought one of their graphics cards as well.

[UPDATE: The "hours played" metric was first introduced to Steam in March of 2009, so games released before then might show up as being played much less than they have in reality. This would explain why many older Valve games show up as being played by so few owners, though it's interesting to see how little these games have been played in the post-2009 era in its own right.

In addition, gameplay hours might not be tracked accurately for users playing in "Offline mode," so the "total hours" played may be undercounted for some players. Ars regrets missing these caveats in the original piece.]

The “owned but unplayed” phenomenon is an extremely widespread issue across Steam according to our data. It shows up most frequently in games included in bundles, offered at heavily reduced prices, or simply given away to players who have yet to actually play them. It’s not limited to those games, though; in fact, out of the roughly 781 million games registered to Steam accounts, our data shows only 493 million, or 63 percent, have been played even once. [Update: This data is skewed somewhat by the fact that Steam did not start measuring gameplay hours until March of 2009. Among titles released since then, about 26 percent of registered games are "unplayed" by our estimates. See this update for more.]

For a version of this graph that does not include games released before Steam began tracking gameplay hours in 2009, check out this update

Before you protest that this number sounds incredibly high, consider how many Steam games you’ve acquired through some sort of indie game bundle or ridiculous seasonal sale. You probably heard some good buzz about these titles and told yourself you’d find the time to play them some day in the future. Then you went right back to playing Team Fortress 2 for dozens of hours every week instead, didn’t you? If this particular scenario doesn’t apply to you specifically, trust us that it does apply generally to a lot of Steam users (for corroboration, see this Kotaku survey that shows Steam users haven’t played 40 percent of games they bought in the last 12 months).

For a version of this graph that does not include games released before Steam began tracking gameplay hours in 2009, check out this update

It's important to note that the precise ratio of played-to-unplayed copies varies greatly for each distinct game sold on Steam, as shown in the chart above. Some games (mostly free-to-play titles) show practically 100 percent of all “owners” putting in at least a little bit of playtime. Others, primarily those that are often offered in bundles or as part of sales, are played by only a sliver of the registered owners on the service.

Dota 2: 25.9 million players and owners on Steam.

Team Fortress 2: 20.3 million players and owners on Steam.

Counter-Strike: Source: 8.9 million players, 12 million owners on Steam.

Left 4 Dead 2: 8.6 million players, 10.7 million owners on Steam.

Counter-Strike: 6.7 million players, 9.8 million owners on Steam.

The Elder Scrolls V: Skyrim: 5.7 million players, 6 million owners on Steam.

Kyle Orland
Kyle is the Senior Gaming Editor at Ars Technica, specializing in video game hardware and software. He has journalism and computer science degrees from University of Maryland. He is based in the Washington, DC area. Email: kyle.orland@arstechnica.com / Twitter: @KyleOrl