Science gleans 60TB of behavior data from Everquest 2 logs

Thanks to a partnership with Sony, a team of academic researchers have …

Researchers ranging from psychologists to epidemiologists have wondered for some time whether online, multiplayer games might provide some ways to test concepts that are otherwise difficult to track in the real world. A Saturday morning session at the meeting of the American Association for the Advancement of Science described what might be the most likely way of finding out. With the cooperation of Sony, a collaborative group of academic researchers at a number of institutions has obtained the complete server logs from the company's Everquest 2 MMORPG.

As the researchers who are dealing with this new resource describe it, it's one of those "be careful what you wish for" situations—with nearly 60TB of data, the standard procedures for tackling social data sets just aren't up to the job.

Dmitri Williams introduced the project and described how researchers have been approaching various game developers over the years. He paraphrased the conversation with Sony as:

"What do you collect?" "Well, everything—what do you want?" "Can we have it all?" "Sure."

The end result is a log that includes four years of data for over 400,000 players that took part in the game, which was followed up with demographic surveys of the users. All told, it makes for a massive data set with distinct challenges but plenty of opportunities.

Computer science challenges

Jaideep Srivastava is a computer scientist doing work on machine learning and data mining—in the past, he has studied shopping cart abandonment at Amazon.com, a virtual event without a real-world parallel. He spent a little time talking about the challenges of working with the Everquest II dataset, which on its own doesn't lend itself to processing by common algorithms. For some studies, he has imported the data into a specialized database, one with a large and complex structure. Regardless of format, many one-pass, exhaustive algorithms simply choke on a dataset this large, which is forcing his group to use some incremental analysis methods or to work with subsets of the data.
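As a rough illustration of what an incremental method looks like, the sketch below keeps a running mean and variance of session length per player while reading records one at a time, so the full log never has to sit in memory. It uses Welford's online algorithm, a standard technique; the article does not say which methods Srivastava's group actually uses, and the record layout `(player_id, session_hours)` and the sample values are assumptions made up for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Welford's online mean/variance: constant-size state per player."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; defined as 0.0 when fewer than two values were seen.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def summarize_sessions(records):
    """Consume an iterable of (player_id, session_hours), one record at a time."""
    stats = defaultdict(RunningStats)
    for player_id, hours in records:
        stats[player_id].update(hours)
    return stats


# Hypothetical records, not taken from the real logs.
sample = [("p1", 2.0), ("p2", 1.0), ("p1", 4.0), ("p1", 6.0)]
result = summarize_sessions(iter(sample))
print(result["p1"].n, result["p1"].mean, result["p1"].variance)
```

Because the function accepts any iterable, the same code could in principle be fed from a file reader or a database cursor, or from a sampled subset of the data.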

Srivastava then gave a short tour of the sorts of items the team is trying to extract from the raw logs. He apparently has graduate students working on non-traditional figures like the "monster composite difficulty index" and an "experience rate measure."

But many of the other measures that researchers in the social sciences want—trust, performance, expertise—are fairly subjective. To get estimates of them, the team is experimenting with trying to track physical proximity and direct interactions, such as when characters share experience from an in-game victory.

To give a concrete example of the data's utility, Srivastava described how he could explore the phenomenon of customer churn, something that's significant for any sort of subscription-based service, like cell phones or cable TV. With the full dataset, the team can now track how individual customers dropping out of the game influenced others with whom they typically played or interacted. Using this data, the spreading rate and influence factor could then be calculated, providing hard measures to work with.
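The talk, as reported, did not define "spreading rate" precisely, so the following is only one plausible reading, sketched under stated assumptions: among players who were still active when one of their regular partners quit, it measures the fraction who quit within a fixed window afterwards. The function name, the `neighbors` and `quit_day` structures, the 30-day window, and the toy data are all hypothetical.

```python
def spreading_rate(neighbors, quit_day, window=30):
    """Fraction of exposed partners who quit within `window` days of a churner.

    neighbors: dict mapping player -> set of players they regularly played with
    quit_day:  dict mapping player -> day number on which they quit
               (players who never quit are simply absent)
    """
    exposed = 0   # partners still active when the churner left
    followed = 0  # exposed partners who quit within the window
    for player, day in quit_day.items():
        for other in neighbors.get(player, ()):
            other_day = quit_day.get(other)
            if other_day is not None and other_day <= day:
                continue  # partner had already left (or left the same day)
            exposed += 1
            if other_day is not None and other_day - day <= window:
                followed += 1
    return followed / exposed if exposed else 0.0


# Toy interaction graph and quit days, invented for illustration.
neighbors = {"a": {"b", "c"}, "b": {"a"}, "c": {"a", "d"}, "d": {"c"}}
quit_day = {"a": 10, "b": 25, "c": 100}
rate = spreading_rate(neighbors, quit_day)
print(round(rate, 3))
```

A measure like this only becomes informative when compared against the background quit rate among players with no departing partners, which this sketch does not compute.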

Getting social

Noshir Contractor described how the data was allowing him to explore social network dynamics within the game. He described a variety of factors that are thought to influence the growth and extent of social networks, such as collective action, social exchange, the search for similar people, physical proximity, friend-of-a-friend (FoaF) interactions, and so on. Because these are well-developed concepts, statistical tools exist that can extract their signature from the raw data by looking at interactions like instant messaging, partnerships, and trade.

Contractor described the results of running these tests on a week's worth of data from a server that saw over 3,000 North American players during that span.

In that week, his team could detect over 2,000 players that became involved in partner relationships and about 2,500 who took part in trade interactions. The IM network had fewer participants; in the question-and-answer session afterwards, Williams suggested that many players rely on VoIP for their interactions—"It's easier to say 'look out' than take your hands off the controls and type it," he said.

Nevertheless, signatures of popularity and FoaF relationships were apparent in the IM data. FoaF relationships were the most common in other interactions as well.

Mixing in the demographic information produced a few surprises. Gender turned out to be a negative influence on interactions: even after their low numbers were taken into account, female players avoided interacting with each other. Time zones had some influence; players in the same time zone were 1.25 times more likely to partner than players even one time zone apart.

But distance had a much larger effect; players within 10 kilometers of each other were five times more likely to interact. Contractor concluded that, for the typical player, the game simply offered a way of continuing their real-world social interactions in a virtual setting.

Links between the real and virtual world

In addition to introducing the EQ2 logs as a resource, Dmitri Williams described some of the efforts involved in exploring how much of the real world spilled over into the virtual.

The average age of players turned out to be 31. "These aren't just pasty white teenage boys in a basement—to be sure, they're there, but they're not typical," he said. The older players tended to play more than the kids and, although the total hours played seem large, he said that the time mostly displaced either TV watching or movie going. And the surveys showed that those who viewed TV news in the first place continued to do so, suggesting that gamers really slotted EQ2 into their entertainment time.

Mostly, the gamers seemed healthy; their body mass index was better than the US average and, although they were slightly more depressed than average, they were also less anxious.

Buried among those happy, average players was a small subset of the population—about five percent—who used the game for serious role playing and, according to Williams, "They are psychologically much worse off than the regular players." They belong to marginalized groups, like ethnic and religious minorities and non-heterosexuals, and tended to use the game as a coping mechanism.

Implications for gaming and science

Williams pointed out one case where having access to the server logs allowed the researchers to identify some serious skewing in the responses to the demographic surveys. Older women turned out to be some of the most committed players but significantly under-reported the amount of time they spent in the game by three hours per week (men under-reported as well, but only by one hour). The example highlights the risk of using self-reporting for behavioral studies and the potential of the virtual world data.
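The comparison behind this kind of finding is simple to state: subtract each player's self-reported hours from the hours recorded in the logs, then average the gap within each demographic group. A minimal sketch follows, assuming rows of `(group, logged_hours, reported_hours)`; the field layout, the function name, and the numbers are invented for illustration (chosen only to mirror the three-hour and one-hour gaps quoted above) and are not the study's data.

```python
from collections import defaultdict


def mean_underreport(rows):
    """Average (logged - self-reported) weekly hours for each group.

    rows: iterable of (group, logged_hours, reported_hours)
    A positive result means the group under-reported its play time.
    """
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of gaps, count]
    for group, logged, reported in rows:
        totals[group][0] += logged - reported
        totals[group][1] += 1
    return {group: gap_sum / count for group, (gap_sum, count) in totals.items()}


# Made-up example rows, not real survey or log values.
rows = [
    ("women", 29.0, 26.0),
    ("women", 31.0, 28.0),
    ("men", 25.0, 24.0),
    ("men", 26.0, 25.0),
]
gaps = mean_underreport(rows)
print(gaps)
```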

Saying, "I'm not tenured yet, and I don't want to tick off that many people at once," Williams wouldn't get into the significance of this finding, but Srivastava was happy to do so. ("I'm tenured and I'm not in the social sciences," he said, "so I can tick as many people off as I like.")

In his view, the data suggests that many studies that report marginal male-female differences in gaming based only on self-reported figures most likely did so based on unreliable numbers. It's entirely possible that a number of other sanity checks on past studies are lurking in this virtual data trove.

There was also talk about the potential for a symbiotic relationship between game designers and researchers. Srivastava's work on customer churn, for example, could prove highly valuable for developers that rely on retaining subscribers, and many of the studies that the speakers were interested in doing could provide valuable feedback on how users were actually interacting with various features of the game.

For the most part, the companies that the researchers have approached either haven't been interested in sharing their logs or the logs themselves don't contain the sort of data that would make for fruitful research. In several cases, Williams has been told that he should ignore entire classes of events in the logs, because they were purely put in for debugging purposes.

But he argues that this isn't just about researchers losing out. "There are a lot of things we can show them about their bottom line, but these industries are deadline focused," Williams said. "They're not far enough beyond the garage-shop mentality."

51 Reader Comments

Having researchers figure out how to make people stay reminds me a lot of researchers figuring out how to make people believe things... it doesn't make the game better or the speech true, but people with money can afford it, and once discovered, anyone, for good or ill, can use the results of the research.

Hmm, not to second guess the expertise of the team, but 60TB is not uncommon in commercial OLAP databases. There are tools for dealing with such things, including open source solutions. Granted, you need more hardware than the group may have easy access to (several hundred spindles would be best).

"Jaideep Srivastava is a computer scientist doing work on machine learning and data mining—in the past, he has studied shopping cart abandonment at Amazon.com, a virtual event without a real-world parallel."

I hope this was being sarcastic, because this happens all the time in real life B & M stores! While I'm sufficiently conscientious to put perishables back, not everyone is, and for me, the typical reason is if I find myself in a time crunch and discover when I get to the registers that there's no way I'm getting through quickly, due to how busy they are, or... how understaffed they are, or a combination thereof. Another reason that may happen in the real world after that, is that of discovering you don't have enough cash on hand.

And only 60 TB of data? Bah, while that's perhaps a large amount for MS Access and comparable databases, if you get a good MPP database with a large enough cluster, it'll chomp through all that in hours or minutes: they just need the right database and hardware combination for that, which it sounds like they don't have.

Hmm, not to second guess the expertise of the team, but 60TB is not uncommon in commercial OLAP databases. There are tools for dealing with such things, including open source solutions.

I believe they are referring specifically to the software and techniques used to analyze this kind of data, not just store or manipulate it. Some of the more advanced modeling techniques that scientists use to study this type of data - like actor-oriented random graph models, or certain kinds of clustering algorithms, for example - can be very computationally expensive to run even on relatively small data sets (say, with a few hundred to a few thousand individuals).

Trying to identify algorithms and techniques for handling larger data sets is still an active area of research in many areas; for example, there are still people in econometrics trying to find ways to speed up computations in simple spatial regression models, which are far less computationally intensive than the kind of techniques people (like Nosh Contractor, one of the people in the article) are using in their research.

Am I the only one who is horrified at the privacy implications of this? Sony just handed over every conversation you ever had in the game, with seemingly little or no protections on how that is used or who it's given to? Did I miss something about that? I'm not in the habit of having super personal interactions in my MMORPGs, but many people do, and if this is truly the entire server log, isn't every conversation you had in the game now available for viewing, and at the very least identified with your particular account and/or character name? That's just not right.

Originally posted by pikahatonjon: wouldn't backing up 40tb be a pain in the ass?

No. I back up 200TB at my site, including weekly fulls, without any issues. We use a StorageTek SL8500 with six picker robots and thirty-five LTO-3 tape drives, connected via fibre channel to a bunch of Cisco MDS 9509s, and from there to my NAS and to my NetBackup master & media servers. The non-NAS hosts are backed up by Symantec NetBackup to the SL8500, and the NAS talks NDMP directly to the library (with robot control handled by the NetBackup master). All backup traffic rides on its own VSAN.

So this doesn't violate anybody's privacy? Hmmphf. Sony could easily be taken to court for doing so. Then using this article as evidence that it did so. And having a court demand the same x tera under a warrant to prove so. Privacy is a two-edged sword. Just because you are a business, or operate a server farm, or gouge users with writing user agreements for a living. Keeping the business on the door stop shouldn't be that easy for privacy lewpers. Of course the whole foray of the issue will be seen in several different seemingly non-related articles. So that when the point is taken, the only advantage is the one that one holds. Then being ignored as "I'm too big to care about privacy". However it is though, it is probably that since it is an ethical issue, where business and such doesn't have to have morality about it... it's a fatal mistake. Business is only ignorant to a certain span. It cannot be claimed as a patent to being so. So here in this article, you have example a. You might say exhibit a. The choice of a practitioner to choose the preposition. It is among most of the several others. In that, well, the reader is stupid. Point of the toss in the shill of conversation amongst some uninvited consequential indifference. Why take part if it only "serves the farm". "Oh, I really didn't know, I really didn't know". Perhaps a waiter spills your drink on your chest in business class. Then it dawns on you it is the same that was ordered. These agencies are at some time going to look at each other, and see that they are going straight to hell. Being that the hole they are falling in is the same dimension as the shadow being cast from them.

Originally posted by hirez: Am I the only one who is horrified at the privacy implications of this? Sony just handed over every conversation you ever had in the game, with seemingly little or no protections on how that is used or who it's given to? Did I miss something about that?

That was the first thing I thought about after reading the headline, too. I had some friends in grad school and they were always very cautious regarding experimental subject consent while conducting their research. I'm guessing that players give consent by accepting the game's EULA but I wonder if any of them imagined that this sort of log would be handed over to an external party.

I'm pretty sure I'm not going to be playing any Sony-operated MMOs in the future.

I've said it before and I'll say it again: privacy is an illusion on the internet. You want privacy? Live off the grid. It's the only way you'll ever get it.

I think this sounds rather interesting, and I hope they dig up all kinds of cool behavioral data. As for being able to analyze it, it sounds exactly like the kind of project distributed computing is designed to resolve. They need to get in touch with the BOINC folks.

Then, HappySin, the Internet is an illusion. Still I know that it is not. That is if you don't care about what your privacy is. Certainly you will not be in charge of that concerning someone else's. Just because that is what "you" think about your own liberal privacy. It does not change any fact of some others being violated with your indifference. Here nobody is speaking of the Internet. The data is not their own, because of the same indifference you do not discern. Howbeit your cast of characters is going to be separate, I'll bet you run for that control every time you CAN control it. Control of your own privacy, that is. There are a couple of homonyms for "dummy". Privacy is not one of them. Although your interpretation of 'privacy' may be crash-proof, the reasons for this indifference certainly won't escape its own subtlety to fail.

Originally posted by Commander Thanatos: Has anyone read the EQ EULA to see if you didn't forfeit your rights to privacy?

I think this part seems to cover it.

12. We cannot ensure that your private communications and other personally identifiable information will not be disclosed to third parties. For example, we may be forced to disclose information to the government or third parties under certain circumstances, or third parties may unlawfully intercept or access transmissions or private communications. Additionally, we can (and you authorize us to) disclose any information about you to private entities, law enforcement or other government officials as we, in our sole discretion, believe necessary or appropriate to investigate or resolve possible problems or inquiries. You agree that we may communicate with you via telephone, email and any similar technology for any purpose relating to the Game, the Software and any services or software which may in the future be provided by us or on our behalf. You expressly permit SOE to upload CPU, operating system, video card, sound card and memory information from your computer to analyze and optimize your Game experience, improve and maintain the Game and/or provide you with customer service. Furthermore, if you request any technical support, you consent to our remote accessing and review of the computer you load the Software onto for purposes of support and debugging. You may choose to visit www.everquest2.com,www.station.sony.com, or other SOE web sites if such web sites offer services such as an EverQuest II game-themed chat room or other services of interest to you. You are subject to the terms and conditions, privacy customs and policies of SOE while on such web sites and in connection with use of your Account and the Game, which terms and conditions, policies and customs are incorporated herein by this reference. Since we do not control other web sites and/or privacy policies of third parties, different rules may apply to their use or disclosure of the personal information you disclose to others. 
Solely for the purpose of patching and updating the Game and/or Software and ensuring the integrity of the Game, you hereby grant us permission to (i) upload Game-related file information and data from the Game directory and (ii) download Game files to you. You acknowledge that any and all character data is stored and is resident on our servers, and any and all communications that you make within the Game (including, but not limited to, messages solely directed at another player or group of players) traverse through our servers, may or may not be monitored by us or our agents, you have no expectation of privacy in any such communications and expressly consent to such monitoring of communications you send and receive. You acknowledge and agree that we may transfer Game and your Account information (including your personally identifiable information and personal data) to the United States or other countries or may share such information with our licensees and agents in connection with the Game.

I think the difficulty mining the data that the comments mention is not due to a lack of tools. It is more due to a lack of definition. Meaning, they have all this data but it's not in any form they can use yet. Also it's not structured in any easily queried way. The sheer volume would be overwhelming for a researcher. Think about it for a minute: while it would be easy to do a simple query to find out playtime, how would you do a query to find out how much time was spent questing, or just chatting? How much time was spent doing trade skills as opposed to raiding?

I think this is pretty bad work from the author of this article. Simply ignoring the privacy aspect of this issue as if it didn't exist makes me feel like he copied the text from some happy researchers' press release.

I personally feel very upset that Sony just feels it is free to hand out such data. Those who never played this game might not be able to understand how personal such data can be - after all, an MMORPG is a place where you meet friends and chat with them about the world and other things. It's just as if Microsoft announced they were giving out the complete chat logs for their Instant Messenger. Even if this data only contained movement profiles, it would still be offensive as long as it is not completely anonymized.

I'm pretty sure I won't ever play a Sony MMORPG again until this issue is solved, and if Sony Online Entertainment were based in my home country I would demand a complete disclosure of which data they collected and to whom they give/gave it, which is enforceable by the privacy protection laws here.

Guess that "no expectation of privacy" is about as legal glib as can be ascertained. It is a coined phrase denoting an early Supreme Court ruling considering that of email for some relationship to be that of having it so. The problem is that 'privacy' and 'security' are two different things. Sender and receiver in their definition are what makes up a big difference to society's description to use civil law over constitutional. Perhaps the use of it in this End User agreement is a reflection of the fact of figuring in just what it means to be 'private', or 'have privacy', when you must engage interactively to other influences prepositionally. "No expectation of privacy" just meets the criteria of filling. If however you have 'reasonable expectation of security' in its place, then the whole mention becomes something completely different. Of course it would be against the law to lessen a person's security to exploit that person's advantage of that security. DMCA is a good analogy to this. The DMCA has the right of an author to create a physical lock to a given 'digital work'. Howbeit it would be the author's right to 'unlock it', it would be somewhat moot if this 'digital lock' did not exist. Certainly seeing this, that replication is shades away. Look at IP, the coined phrase 'Intellectual Property'. Supposing that one's 'privacy', or 'reasonable expectation of security', was that analogous to Copyright. And IP. In that the Account information, and Game Interaction, is that secured by a law stating that the 'reasonable expectation of security' would not be abridged. Why would I for this then believe that in this inhibition replication of personal data was inevitable? A breach of the 'reasonable expectation of security'? Then the author is somewhat satirically composed of the contrast? Well I don't think that everybody is going to dye their hair green, or orange. Simply because it creates the effect of lessening the 'reasonable expectation of security'.
Or puts more grandeur, the 'reasonable expectation of security'. The inhibition to be summative, and additive, is there. It's a morality issue, and an ethical issue. So then, not tentative yet? Suppose that the same information was covered under Sarbanes Oxley? What group is gaining behavioral strength with this ascertation? It's not in this article to find the greatest, or least, amount of accessory to one's inhibition. I just happen to believe they are individual (inhibitions). They don't belong to someone, or somebody outside the scope of those who can choose to deny its question still inadequate for technical reason. The User Agreement basically says that the ideal in this is that everybody has dyed their hair green, or orange. To participate. However the majority in this field believe that a 'reasonable expectation of security' exists. What inhibition does that leave? And who's doing (or not doing) the work?

One would like to believe that they at least obscured the account names and character names. As for the chat logs, it depends on what exactly they gave out. Was it just the events, i.e. Player A /tell Player B or the full text of the message?

With regards to the privacy concerns, I would check out the EULA for your MMOs. Most of them contain some type of provision regarding using collected data and transferring that data to trusted 3rd parties.

I would like to think that what was transferred was character names and interactions between characters, and not any information regarding accounts themselves. Although to figure out churn rate, having account and not character data would be more useful.

But, other than data being released, what practical privacy concern is here? There is a link between a character name and an account name? Is simply having your account name released enough of a privacy concern to file a lawsuit? Certainly if real-world names, addresses, phone numbers, and billing info were released there would be a privacy issue. But if all the information that was released was related to virtual world actions of a character, or semi-public user names, what has really been violated? Someone finds out what you are doing inside a game? A character in-game could simply follow you around and record that information as well. What are the unique privacy concerns regarding this data? Or is it simply the fact that it was released without a person's explicit permission (for which I would again check the EULA - buyer beware)?

Williams pointed out one case where having access to the server logs allowed the researchers to identify some serious skewing in the responses to the demographic surveys. Older women turned out to be some of the most committed players but significantly under-reported the amount of time they spent in the game by three hours per week (men under-reported as well, but only by one hour).

We often play EQ2 and we load our characters into the game but we go off to make breakfast or whatever. Unless we go away for several hours, we don't log out when we go afk and we don't always set the afk flag (I almost never do). It just isn't important to log out of the game when you go away for a walk with the dogs or whatever.

They also make statements about physical location being a factor, but a lot of families play together. So, physical distance might be feet, not miles. It's a shared entertainment within a family, but that has nothing at all to do with how general players interact.

It's good to know that Sony hands all this data out though. Now I can really fuck with their heads. "Sony data reports massive surge in hermaphroditic game play, commits $11 billion to adapt all games to that market". I am going to start sending my alts random credit card information and login names and passwords. Populate the world with alts that stand around saying "Tard!" every minute or so (hmmm, that would make EQ2 be more like Galaxies though).

I will also have to make an extra effort to misrepresent myself in all questionnaires. Between leaving characters logged in 24/7 and saying I only play 30 minutes a week, and gushing Portuguese poetry into the data, maybe EQ2 will be more fun.

We often play EQ2 and we load our characters into the game but we go off to make breakfast or whatever. Unless we go away for several hours, we don't log out when we go afk and we don't always set the afk flag (I almost never do). It just isn't important to log out of the game when you go away for a walk with the dogs or whatever.

If you have access to the full logs, it can't be too difficult to algorithmically identify the difference between an idle character and active game play.

quote:

They also make statements about physical location being a factor, but a lot of families play together. So, physical distance might be feet, not miles. It's a shared entertainment within a family, but that has nothing at all to do with how general players interact.

Again this is something that can be picked apart by looking at the data. I'm not saying that their finding is not obvious, but it is significant.

I don't really see why everyone is having such a fit about the privacy issue... like someone previously posted, you should have no expectation of privacy on the internet (and certainly not in a proprietary virtual environment). I'm very excited that lots of different researchers got access to this; I think a lot could be gleaned from this data.

It seems to me the data is all flawed as to what they expect to unravel in the first place. Generally speaking, I have played quite a variety of online games, and RPGs are my favorite. One thing I can say for certain is that MOST people playing these games are NOT acting like they would by any stretch of real-world comparison. It may very well be that they want this research to show how people act DIFFERENTLY online behind a supposedly anonymous character versus real-life interactions, but I didn't get that vibe from the article. I, like many people, am worried about privacy to an extent, but it's not like I have to worry about corruption of minors for my in-game behavior. I feel very strongly that I act the same way I do in real life in any online interaction, but it is certainly my strong belief that MANY people do act A LOT differently than they would in face-to-face interactions. As for the privacy issue, I would think that is out the window with anything online anymore, seeing as most sites mine data and sell info for profit. For every honest, protective site there must certainly be 10 shady sites that would sell your soul for a few bucks.

Privacy concerns crossed my mind for about 1 second. Then I woke up and remembered that I already know that everything I do and type in an OLG is under the complete control of the game service provider.

And please don't tell us you're going to stop playing Sony games. I'll bet you that every single OLG has the same policy. They can do whatever they want with your data.

But, other than data being released, what practical privacy concern is here? There is a link between a character name and an account name? Is simply having your account name released enough of a privacy concern to file a lawsuit? Certainly if real-world names, addresses, phone numbers, and billing info were released there would be a privacy issue.

The problem is that all of that information and much more could be, and probably is, contained in the chat logs. MMORPGs are basically huge, multichannel chat rooms with graphics. People would be sending other people real world addresses, possibly even credit card information (imagine a brother asks a sister for her card # for whatever reason), proclaiming love, engaging in cyber sex (I hear!), confessing sins and even crimes, disclosing medical conditions, and all sorts of other things. You wouldn't believe some of the stuff I have heard on MMORPG chats. How could Sony possibly filter all this out if they truly gave up "everything"? Answer is, they can't.

Is some guy at the research lab going through there looking for this stuff? Probably not, but they might be, or will probably stumble across it in any case, and who knows who else will get ahold of this data? If they forked it over to academia, you can be sure it's hardly secure; I imagine copies of parts of this will be all over the net in a few weeks' time. So yes, it is kind of scary. I don't give a damn if they gave themselves the right to do it in the EULA, it's just not right and I think there will be a lot of backlash as users find out about it. Furthermore, it's been legally difficult for companies to fall back on the "It's covered in the EULA" defense because judges have ruled a number of times that they're so long and written in such torturously legalese language, that it's unreasonable to expect any normal person (user) to truly read and comprehend the whole thing. Yes, I do think you should be wary about what you disclose anywhere over the internet, but I don't think it's unreasonable to be pissed off that Sony just handed over whatever conversation you had over several years to...someone...or someones.

Of course Sony could have easily avoided the issue by announcing in advance that they were doing it, giving their pitch to the users about how researching the data will make for a better game experience, and then allowing each player to opt in or out. Players who opt out have their data filtered out of the results. I'm sure 20 TB to 40 TB would do the job just about as well. And if you really want permission badly, just offer them some in-game gold or special mount or pet or whatever, it costs them nothing.

Besides the game EULA, there's Sony's Playstation 3 EULA and the Playstation network EULA; for every system update, you have to agree to another EULA. What they don't include in a game EULA can be easily added to the platform or network EULA.

Basically, you're stuck once you've paid for the console (and any games) if you don't agree with the update EULA.

Not many people would stop using a game or console, simply because they disagreed with some of the clauses in a EULA.