The ramblings of Wesley Kerr

Latest Posts

Steam Recommender

This post goes into some of the detail behind the
steamrecommender.com site that generates
game recommendations based on your Steam play history. I have received
several questions about how I am able to generate the recommendations, so I
hope to give a fair account of the methods that I use.

If you have spent any time working on recommender systems, then you have
heard of the Netflix prize. In 2009, Netflix awarded the BellKor’s Pragmatic
Chaos team one million dollars for a 10% improvement in movie and TV
recommendations based on user ratings provided by Netflix. One outcome of
the Netflix prize was raised awareness of recommender systems and
standardized ways of talking about recommendations.

Training data for the Netflix prize consisted of a collection of triples:
(user, item, rating), where the user is a unique (anonymized, of course)
user id, the item is the movie or TV show being rated, and the rating is a
quality score from one to five, with five being best.
There are many open-source tools that work with this format or similar
formats, e.g. GraphChi (with
work by Danny
Bickson),
Prediction IO,
MLlib, and others.

In the end, I decided not to go with one of these existing services. This
project was for my personal enjoyment and so that I could learn more about
building web services, so I decided to build my own recommender system that
would plug in nicely with my service. In the next few sections I describe
different aspects of the system as a whole.

Data Ingestion

The recommendations generated from the steamrecommender.com site are
primarily based on an Amazon system that has been going strong for almost 20
years. You would recognize it as the panel beneath items in the Amazon store
with the title “Customers Who Bought This Item Also Bought.” The idea is
that this panel would represent what your friends would recommend you
purchase, if your friends were the entire user base of Amazon.

In order to generate this type of personalized recommendation for video
games, I searched for datasets where I could gather all of the games owned
by a person and how well they liked each game. The Steam store by Valve came
the closest, since they provide a web API for developers to build
integrations. In a piece by Kyle Orland published on Ars
Technica,
he describes a technique for querying this API for random Steam users to
build up a dataset. With this technique we are able to get all of the games
owned by a Steam user along with the number of minutes spent playing the
game in the last two weeks as well as total time spent playing the game.
Using similar techniques I am able to gather a sample of roughly 50k players
every day.

Model Building

The data ingestion system runs uninterrupted daily and gathers more and more
of these training samples. After a month of gathering publicly accessible
Steam accounts I have roughly 1.5 million profiles to construct
recommendations from. It is important to note that Steam users still have
control over their privacy. At any time they can set their profile to
private and I would not have any access to the games they own nor would I
have any access to the amount of time they have spent playing those games.

The recommendation engine powering
steamrecommender.com is based on the
original paper: Amazon.com Recommendations: Item-to-Item Collaborative
Filtering.
The idea is to construct a matrix wherein the rows are the Steam users and
each column is a different Steam game. The challenge is deciding what the
rating should be for a user and a game. We briefly explore the different
options and the intuition for what each entails below.

ownership: In the ownership rating system, each cell in the matrix
has a one in it if the user owns the game and a zero if the user does not.
Under this assumption, popular games are often recommended more frequently
because they happen to be owned by a larger number of people. Relying on
ownership for a rating is limited. Just looking through my own Steam
library, I realized that I had a rather large backlog of games that I owned
but had not yet played. If you were to personally ask me about those games, I
would not be able to recommend them because I had not played them. This
insight led to my second option for a rating.

played: In the played rating system, each cell in the matrix has a
one in it if the user has played the game and a zero otherwise. The user
must have spent at least 30 minutes in the game before we consider the game
played. If you quit a game prior to spending 30 minutes in the game, you
either hate the game or are saving it to come back later. In either case, I
see that as the user not wanting to recommend that game. Recommendations
from the played system are qualitatively better than those from the
ownership system, since we are better able to filter out impulse purchases
from Steam sales and have a better signal for removing disliked games.

inferred: For the inferred ratings, I go one step further and try to
compute a 5-star rating for each game from your playtime. My 5-star rating
system will assume you would give a 5-star rating to a game if you played it
to completion or longer. I make some assumptions (that may or may not be
true) in order to create inferred ratings. First, I assume that the majority of
players will not finish a game. They may become bored, or may dislike the
game, but either way they will not play through to completion. Second, I
assume that there are some die-hard players who will continue to replay a
game over and over simply because they love the game. Based on these two
assumptions, I decided that the 75th percentile of the total time spent
playing distribution (across all Steam users who played the game) would
represent the expected playtime of a game that you loved playing and would
happily recommend.

For every player, the inferred rating is computed by dividing the number of
minutes spent playing the game by the 75th percentile for that game and
then multiplying by five (for five stars), capping the rating at 5 stars.
For example, if you played a game for 120 minutes and the 75th percentile
is 280 minutes, then the 5-star rating for this game would be
(120 / max(280, 120)) * 5 ≈ 2.1.
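
As a minimal sketch of that computation (my own illustration with made-up names, not the production code):

```java
class Ratings {
    /**
     * Convert raw playtime into an inferred rating from 0 to 5 stars.
     * minutesPlayed is the user's total playtime for the game; p75Minutes is
     * the 75th percentile of playtime across all sampled players of the game.
     */
    static double inferredRating(double minutesPlayed, double p75Minutes) {
        // Dividing by max(p75, played) caps the rating at 5 stars for anyone
        // who meets or exceeds the 75th percentile playtime.
        return minutesPlayed / Math.max(p75Minutes, minutesPlayed) * 5.0;
    }
}
```

For the example above, inferredRating(120, 280) returns roughly 2.1 stars.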

The inferred rating system is the one that is currently deployed on the
website as it generated the best qualitative results. Once you have a method
for computing the ratings, building a model for generating recommendations
is straightforward. I will try to provide the intuition behind the
collaborative filter that I built and leave the details to the original
paper.
For a more gentle introduction to recommendation techniques, I recommend
Programming Collective
Intelligence.

To generate recommendations for a single user, we first gather the games
that he/she has played by making a call to the Steam web API. The playtimes
associated with each game are converted into inferred ratings and every game
that has not been played is dropped. At this point we should have a
collection of games that our Steam user has played enough to generate a
rating. What we would like is a principled way to find games similar to
these that other users have also found favorable.

That is where the power of the collective comes in handy. We iterate through
all of the players and look at the pairs of games that they have played. If
a player has two games that they have played and given high ratings to, then
odds are another player who has played one of those games will enjoy the
other one. If you do that over 1.5 million players, this signal becomes
pretty clear. We convert this intuition into something more formal by
defining a distance metric between two games as the cosine similarity
between the rating vectors. An example will hopefully help clear up any
confusion.

Consider a small table of ratings for four players (Jim, Sally, Gina, and
Bob) across four games (Skyrim, Civilization 5, Left 4 Dead 2, and
Bioshock). In this example, we can see that Jim either does not own or has not played
Left 4 Dead 2 (L4D2) and the same goes for Sally with Skyrim and
Civilization 5. Both Gina and Bob have played all of the games, but it seems
that Bob cannot find anything that he really likes and tends to play games
for short periods of time. Our recommendation system relies on recommending
similar games to the ones that you have played before, so we will compute
the cosine similarity
between all pairs of games. For example, we want
to compute the similarity between L4D2 and Bioshock. The rating vectors for
these two games are: [5, 3, 1] and [5, 3, 1]. Jim has not played both
games, so we do not include his rating in this computation. This example is
not particularly interesting: the vectors are identical, so the similarity
between them is one. Of course, with more examples this would not be the
case.
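
To make that concrete, here is a small sketch of the similarity computation (illustrative only; as in the Jim example, it skips any user who has not played both games):

```java
class Similarity {
    /**
     * Cosine similarity between two games' rating vectors, where index i is
     * a user and a rating of 0 means that user has not played the game.
     */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == 0 || b[i] == 0) continue; // skip users missing a rating
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Called on the identical vectors [5, 3, 1] and [5, 3, 1], cosine returns 1.0.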

The result is that we have a matrix of similarities between games. A single
column represents a single game and the similarity between it and all of the
other games. If someone had only played Sid Meier’s Civilization V and
he/she was looking for new recommendations, then we could order this column
by the most similar games to Civilization 5 and generate decent
recommendations (Note: these recommendations actually came from http://steamrecommender.com on 03/26/2015).

If someone has played more than one game, we weight each column by the
rating that the user gave to that particular game, sum across all of the games
the user has played and sort to find the most similar games to this
particular user.
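
A sketch of that weighting-and-summing step, assuming the similarity matrix from above is already in memory (the names here are mine):

```java
import java.util.Map;

class Recommender {
    /**
     * Score every game for one user: weight each played game's similarity
     * column by the user's inferred rating for it, then sum the columns.
     * sim[g][h] is the similarity between games g and h; userRatings maps a
     * played game's index to the user's inferred rating.
     */
    static double[] scoreGames(double[][] sim, Map<Integer, Double> userRatings) {
        double[] scores = new double[sim.length];
        for (Map.Entry<Integer, Double> entry : userRatings.entrySet()) {
            int played = entry.getKey();
            double rating = entry.getValue();
            for (int g = 0; g < sim.length; g++) {
                scores[g] += rating * sim[played][g];
            }
        }
        // Sorting game indices by descending score (and dropping games the
        // user already played) yields the recommendation list.
        return scores;
    }
}
```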

Conclusions

This post has taken longer than I had originally planned and glosses over some
details while getting into the gory details for others. Hopefully this post
mixed with some external reading will make everything clear.

If you have built recommendations before, or happen to be a researcher in
the field, you may wonder why I am using technology from the early 2000s
instead of using methods like matrix factorization that helped win the
Netflix prize. The primary reason is that I needed to be able to generate
recommendations for users who were not in my sample set. I would have no
guarantees that I have a factorization for a new user, as they may not have
been part of my initial training set. Of course, I could always add them to
the set and redo the factorization, but that takes time and would result in
a poor user experience.

The original idea for http://steamrecommender.com was to build a website
that showcased recommendations for a product area that I thought was
woefully neglected. At the time, Valve had not launched their updated
recommendation system and I often wondered what I should play next. Steam
sales were how I decided what to play, because I would often just purchase
whatever was on sale. Given my background in recommender
systems I thought that there had to be a better way. Now the site has become
a fun toy for me to think about different recommender challenges and
sometimes discover new games to play.

Now that this post is finished, I will begin work on the matrix
factorization blog post where I discuss computing game embeddings and using
them to improve discoverability, search, and recommendations.

1000 Words

So, I’m trying to get better at writing, and I have heard that the best way
to do that is by actually writing. The goal that I am going to work towards
is writing at least 1000 words per night during the week on some topic.
Tonight I am going to ramble a bit about Steam and the Web API that they
provide.

Steam Web API

For those of you not aware, I built and maintain
steamrecommender.com. I originally thought I had
this brilliant idea because the recommendations within Steam were not that
good and there were no other recommendation engines that were making use of
the data that would come out of the Steam platform. So, I started working on
an ingestion system that would crawl all of the Steam users (eventually).
Valve conveniently developed a web
api to which I could
make up to 100,000 requests per day.

My first idea was to generate a seed set of Steam users (which primarily
consisted of my friends) and then make the right API calls to get their
public details. In this case the public details consisted of all of the
games that they owned with the number of minutes they have spent playing
each and a list of all of their friends. The first call would be to get all
of the games for a known Steam ID.

/IPlayerService/GetOwnedGames/v0001/

The second call would help me to grow the graph of Steam users that I could
crawl.

/ISteamUser/GetFriendList/v0001/
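
As a concrete (and hedged) sketch, a call to the first endpoint with Java's built-in HTTP client might look like the following; YOUR_KEY is a placeholder for a developer API key, and the Steam ID is just an example of the 64-bit format:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OwnedGamesExample {
    public static void main(String[] args) throws Exception {
        String key = "YOUR_KEY";              // developer API key (placeholder)
        String steamId = "76561197960435530"; // an example 64-bit Steam ID
        String url = "https://api.steampowered.com/IPlayerService/GetOwnedGames/v0001/"
                + "?key=" + key
                + "&steamid=" + steamId
                + "&include_appinfo=1&format=json";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // The JSON response lists each owned appid along with playtime_forever
        // and playtime_2weeks, both in minutes.
        System.out.println(response.body());
    }
}
```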

The friend graph within Steam consisted of far more isolated subgraphs than
I had expected when I first started looking for users. I had to quickly
devise new plans to find additional users so that I could continue growing
the number of players in my samples. I soon discovered the Steam communities
where people could express interest in certain games and become part of
their groups. At the time you could visit a page that contained a listing of
every member of a group, so I generated a seed set of popular games and
wrote some code to scrape the Steam IDs from these pages. Even with these
additions growth was still slow. Note: Valve has since removed the
listing of the groups.

Another reason that I faced slow growth in the acquisition of new Steam IDs
was that I wanted to ensure that I was always working with fresh data.
Things change over time, users acquire new friends and new games. Therefore,
I needed to build a part of the system that would refresh each user every
so often in order to have the latest snapshot. With the 100,000 requests
per day limit, some fraction of my requests went to refreshing Steam IDs
whose data had gone stale, and another fraction went to the backlog of
Steam IDs that had not been loaded yet. That left
very few queries for gathering new Steam IDs through the friends API.

Despite these hurdles I was able to gather enough player information to
build an item-item collaborative filter that generated reasonable
recommendations. This allowed me to launch the website and begin playing
with improving the recommendations.

Well hell. I only made it to 550 words on my first try. Hopefully, with
more dedication and effort, tomorrow will have a better turnout.

Forest Generation

Before diving into how we are going to generate random forests in Slappy,
I wanted to give a brief reminder of the type of game that we are
building. Our player is stranded in an abandoned world that has been
left to nature to rebuild. So, there are large swaths of open land and some
old-growth forests that cover parts of the landscape. We already established
that each level will be rectangular, and now the challenge will be to fill
this area with forests without overwhelming the player.

So, why have forests in the first place? Forests provide natural cover for
our enemies and so the player must choose to enter the forest and be
prepared to deal with what is within. In addition to enemies hiding in the
forest the player could find valuable items and useful power ups.

Background

The forest algorithm is based on some research done during my time at the
University of Wyoming. Our research lab had a grant to build a multi-agent
simulator and demonstrate how decentralized control of flying swarms would
allow us to provide surveillance for a forested area. In the summer of 2004,
I began development on the simulator and needed to devise a way to generate
random forests that felt natural and covered some percentage of a
rectangular area (sound familiar?). For those interested, the original paper
can be found
here.
The rest of this post will focus just on the forest generation algorithm.

Algorithm

We decided to build an algorithm that behaves very similarly to the cellular
automaton described in a previous post,
but rather than it being on or off and switching immediately, we introduced
the idea of seeding.

We start with a small set of trees that grow and seed the neighboring area.
Each iteration some seeds become trees and will grow to begin expanding the
seeded area. In the previous post on procedural cave generation, there were
relatively few parameters that could be tuned in order to generate different
caves, whereas this algorithm has more parameters that can be varied to
create different types of forests.

nf - The initial number of trees in the forest.

Rs - The radius that seeds can travel.

Ss - The number of seeds that fall during seeding.

ds - The decay rate of the seeds that fall.

Fc - The percentage of ground to be covered with forest.

In each iteration of the forest generation algorithm, we go through a
seeding phase, and then some portion of those seeds become trees for the
next generation. Seeds that do not become trees decay according to the
defined decay rate.

Initial Implementation

The algorithm begins by initializing the map with a small set of trees. One
caveat is that rather than being a single pixel, each tree is a collection
of pixels shown below.

Initialization of forest generation algorithm.

From the initial set of trees, we will “grow” a forest until we have covered
some predefined percentage of the area. Each round of growth is divided into
three steps. In the first step, existing seeds decay by ds. We will
discuss what the decay is for in more depth in the next section, but the idea
is that we don’t want seeds to pile up indefinitely in each grid cell,
because in the second step we select new trees from the existing seeds. Those grid
cells with more seeds in them have a higher likelihood of becoming a tree
during this phase.

The final step in the forest growth algorithm is to have all of the trees
(including the new ones) seed the area around them for the next round of
growth. The following image represents the effects of the seeding algorithm.
Seeds are spread up to Rs away from the original tree. Think of
this as the canopy of a mighty oak tree and how far away seeds can drop from
the base of the tree. When two trees are close to each other and their
canopies overlap, then there will be twice as many seeds that fall within
that area increasing the likelihood that a tree will occur in the next round
of growth.

Seeding of new trees.
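
Putting the three steps together, here is a minimal sketch of one growth iteration. The parameter names follow the list above, each tree is a single cell for simplicity (the real implementation uses a small pixel cluster per tree), and the exact decay form and sprouting probability are my own reading of the algorithm, not necessarily what the original simulator used:

```java
import java.util.Random;

class ForestGrowth {
    /** One growth iteration: decay seeds, sprout trees, then spread seeds. */
    static void growthStep(double[][] seeds, boolean[][] trees,
                           double ds, double Ss, int Rs, Random rng) {
        int w = seeds.length, h = seeds[0].length;
        // Step 1: existing seeds decay by ds (they lose a ds fraction).
        for (int x = 0; x < w; x++)
            for (int y = 0; y < h; y++)
                seeds[x][y] *= (1.0 - ds);
        // Step 2: cells holding more seeds are more likely to sprout a tree.
        for (int x = 0; x < w; x++)
            for (int y = 0; y < h; y++)
                if (!trees[x][y]
                        && rng.nextDouble() < 1.0 - Math.exp(-seeds[x][y]))
                    trees[x][y] = true;
        // Step 3: every tree (including new ones) drops Ss seeds on each cell
        // within radius Rs, so overlapping canopies accumulate twice the mass.
        for (int x = 0; x < w; x++)
            for (int y = 0; y < h; y++)
                if (trees[x][y])
                    for (int dx = -Rs; dx <= Rs; dx++)
                        for (int dy = -Rs; dy <= Rs; dy++) {
                            int nx = x + dx, ny = y + dy;
                            if (nx >= 0 && nx < w && ny >= 0 && ny < h
                                    && dx * dx + dy * dy <= Rs * Rs)
                                seeds[nx][ny] += Ss;
                        }
    }
}
```

The caller simply repeats growthStep until the fraction of tree cells reaches Fc.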

The following screenshots show the forest creation algorithm in action.

Small forest after just a few steps.

Same forest after a few more steps.

At this point we have reached a reasonable looking forest. Since the forest
generation algorithm is supposed to work in conjunction with the cave
generation algorithm we outlined earlier, we should see what it looks like
overlaid on top of a randomly generated level.

Random forest overlaid with the level.

As you can see, there is definitely overlap between the forest and the empty
areas within the level, which is what we desired. The forest is also not
limited to the empty areas of the level and can exist on top of filled
in areas. I deliberately chose this route and will need to consider how the
art blends these two regions together.

One last thing to note is that the forest grid seems to be much finer than
the grid we used to generate the level. This is because each grid cell in
the level is supposed to be about 10m by 10m. We can definitely fit more
than one tree in a grid of this size and that’s reflected in the granularity
of the forest grid.

Intuition

I want to try to capture what this algorithm is doing intuitively, so that
you will not have to blindly vary the parameters, but can set them and get
the forests you would expect.

nf - The initial number of trees in the forest often
determines how many clumps of trees you will have. If you have spent
any time in the forest, you will notice that often there are stands of trees
that are isolated from each other but densely populated. If we have a 10x10
grid representing our forest and we want 30% of the ground covered in
forest, we could set the initial number of trees to 30 and have a uniform
distribution of trees.

Rs - Setting the radius that seeds can travel to a small number
means that the forest can only grow roughly one pixel at a time. So, you
will end up with very tight clusters of trees in your forest, whereas if you
set this number higher, your forest could still be very densely populated but
you would have small areas of light shining through.

Random forest with _Rs_=4.

Random forest with _Rs_=20.

Ss - In each iteration, seed is spread across the seed radius
with the strength determined by this parameter. The higher the value, the
higher the probability that a cell will sprout a tree. If this number is
high enough, each iteration will result in a new tree in every cell that is
seeded. The result is a perfectly rounded forest.

Random forest with _Ss_=1.

ds - The decay rate prevents stacking seeds on top of each
other forever until one becomes a tree. The idea is that if a tree does not
sprout the first year, then that seed has a smaller chance of sprouting the
next year and so on. A very high decay rate should result in a forest with
many more patches as demonstrated below.

Random forest with _ds_=0.99.

The parameters should be relatively straightforward. Try it out for
yourselves, modifying the parameters in the code and running it.

Running Demo

I included a demo of the system running
(thanks to LibGDX HTML5 output).

Thanks again for taking the time to read this far, and feel free to send me
any feedback.

Level Generation

The key to level generation for Slappy is to assume that the player is
not stranded in a wasteland à la the Fallout series, but rather stranded
in an abandoned part of the world. Around him are forests, open
space, and enemies trying to kill him. In my head I imagine it being very
similar to the movie/book I Am Legend but with our hero choosing to
survive in the woods rather than a major metropolitan area.

Each level will be rectangular in nature (easier to model with a computer)
and the focus of this blog post is how I have decided to fill in the
rectangular area with empty and filled space. The goal of level
generation is to create an environment that feels natural and spacious.

Initial Research

Searching for “procedurally generated
levels” on Google
yields plenty of resources and tutorials. A large portion of these resources
are similar to this one: Create a Procedurally Generated Dungeon Cave
System.
These tend to generate maps that feel like they were designed on graph
paper. Whenever I see these I think back to my time playing Dungeons and
Dragons with pen and paper. Lately, a lot of these levels have been
associated with the rogue-like sub-genre within RPG/Action games.

The tutorials that caught my eye were ones discussing cave generation and
were not focused on rogue-likes. I would see images like this one:

Looking at that image I can see the top-down view of a level that provides
freedom to explore the level as the player sees fit, and more importantly it
is a level that I would easily believe could occur naturally.

Many of the “procedural cave generation” articles that I read referenced
Cellular Automata and one algorithm in particular Cellular Automata
Method for Generating Random Cave-Like Levels. I am
not sure who first discovered this particular cellular automaton, nor
who was the first to apply it to procedural cave generation, but I will try
to call out each of the interesting articles that helped me refine my level
generator.

Initial Implementation

I have written a simple libGDX application for exploring different
procedural generation algorithms that can be found on my
GitHub account. Each blog
post will have a tag so that you can download the specific version of the
code that I’m referring to, since I plan to continually update the code.

The cave generation algorithm is based on a cellular automaton
(Wikipedia). The level is
represented as a two-dimensional matrix (boolean[][]) and each cell can be
in one of two states (EMPTY or FILLED). The matrix is initialized
randomly with roughly 40% of the matrix starting out filled. The borders of
the matrix are also initialized as filled just to ensure that the player is
unable to walk off of the map. A sample initialized matrix is visualized
below.

Random initialization of level.
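
A minimal sketch of that initialization (illustrative, not necessarily identical to the code in the repo):

```java
import java.util.Random;

class LevelInit {
    /** Randomly fill roughly 40% of the level, with borders always filled. */
    static boolean[][] initLevel(int width, int height,
                                 double fillChance, Random rng) {
        boolean[][] filled = new boolean[width][height];
        for (int x = 0; x < width; x++) {
            for (int y = 0; y < height; y++) {
                boolean border = x == 0 || y == 0
                        || x == width - 1 || y == height - 1;
                filled[x][y] = border || rng.nextDouble() < fillChance;
            }
        }
        return filled; // e.g. initLevel(64, 48, 0.40, new Random())
    }
}
```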

Now that our matrix is initialized we can begin evolving the cave structure
by repeatedly applying some rules for deciding whether or not a cell stays
filled or becomes empty. The rules are applied to each cell in the
matrix independently. In the cellular automata literature, these rules can
also be referred to as a birth rule and a survival rule. There are two
inputs to the rules that we define; each one is a count of the number of
neighbors within a specified range that are filled. A visual demonstration
of the neighborhoods is shown below.

Cellular Automata neighborhoods.

It is important to point out that although it looks like the neighborhoods are
disjoint, d=2 actually contains all of d=1.

The rules that I used are a combination of those found in these two
articles: one and two.
After initialization there are two phases to our cellular automaton. The first
of these phases is responsible for carving out the main cave structure, and the
second phase refines the structure and clears out some of the middle to create
more open spaces.

In the first phase x=5 and y=2. After one iteration we have the following:

After one iteration of phase one.

And after a few more iterations we can see the cave structure emerging.

After four iterations of phase one.

In the second phase we set x=5 and y=-1, effectively shutting down the
second conditional, since a cell cannot have a negative number of filled
neighbors. This begins removing small chunks in the middle
that do not have enough neighbors to stay around and opens up the space on
the interior.
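
Both phases apply the same update; only x and y change. Here is a sketch of one pass (one ambiguity the post leaves open is whether the center cell counts toward its own neighborhood; I include it, following the common formulation in the referenced articles):

```java
class CaveAutomaton {
    /**
     * One pass of the automaton: a cell ends up filled when at least x cells
     * within distance 1 are filled, or at most y cells within distance 2 are
     * filled. Phase one uses x=5, y=2; phase two uses y=-1, which can never
     * be satisfied and so disables the second rule.
     */
    static boolean[][] caStep(boolean[][] level, int x, int y) {
        int w = level.length, h = level[0].length;
        boolean[][] next = new boolean[w][h];
        for (int i = 0; i < w; i++) {
            for (int j = 0; j < h; j++) {
                int near = countFilled(level, i, j, 1); // d=1 neighborhood
                int far  = countFilled(level, i, j, 2); // d=2 contains d=1
                next[i][j] = near >= x || far <= y;
            }
        }
        return next;
    }

    /** Count filled cells within Chebyshev distance d of (cx, cy); cells off
     *  the map count as filled, which keeps the borders solid. */
    static int countFilled(boolean[][] level, int cx, int cy, int d) {
        int count = 0;
        for (int i = cx - d; i <= cx + d; i++)
            for (int j = cy - d; j <= cy + d; j++)
                if (i < 0 || j < 0 || i >= level.length
                        || j >= level[0].length || level[i][j])
                    count++;
        return count;
    }
}
```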

After one iteration of phase two.

And after a few more iterations.

After four iterations of phase two.

We have now reached a reasonable looking cave and happened to be lucky
enough that all of the chambers are connected. There is a small bit of code
included with the cave generation code to ensure that all rooms are
connected. The basic idea is to use a “flood fill” algorithm to identify all
of the unique rooms in the level
(redblobgames.com).
Once we have identified the unique rooms we begin a random walk towards the
center of the map clearing out filled cells until we hit another room.
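
As a sketch of the room-identification half (the random-walk connection code lives in the repo itself):

```java
import java.util.ArrayDeque;

class Rooms {
    /** Label each connected empty region ("room") via breadth-first flood
     *  fill. Cells in the same room share a positive label; walls stay 0. */
    static int[][] labelRooms(boolean[][] filled) {
        int w = filled.length, h = filled[0].length;
        int[][] room = new int[w][h];
        int[][] steps = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
        int nextLabel = 1;
        for (int x = 0; x < w; x++) {
            for (int y = 0; y < h; y++) {
                if (filled[x][y] || room[x][y] != 0) continue;
                // Flood outward from this empty, unlabeled cell.
                ArrayDeque<int[]> queue = new ArrayDeque<>();
                room[x][y] = nextLabel;
                queue.add(new int[] {x, y});
                while (!queue.isEmpty()) {
                    int[] cell = queue.poll();
                    for (int[] s : steps) {
                        int nx = cell[0] + s[0], ny = cell[1] + s[1];
                        if (nx >= 0 && nx < w && ny >= 0 && ny < h
                                && !filled[nx][ny] && room[nx][ny] == 0) {
                            room[nx][ny] = nextLabel;
                            queue.add(new int[] {nx, ny});
                        }
                    }
                }
                nextLabel++;
            }
        }
        return room;
    }
}
```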

Running Demo

I included a demo of the
system running (thanks to LibGDX HTML5 output).

Codename Slappy

So, I’m terrible at naming things, but I need to give this game idea a name
so that I can refer to it in future blog posts. Slappy is a top-down 2D
game that tries to emulate all of the things that I loved about the Left 4
Dead series.

Another Zombie Game

The hope is that this is not another zombie game, but I have not figured out
what to replace the brain-dead zombies with. Each level will be self
contained and will require the player to navigate the unsafe world until
they find the “goal.” In Left 4 Dead (L4D) this was a safe house and
reaching the safe house triggered the next level in addition to some details
about your performance on the current level.

Procedural Generation

I plan on relying on procedural generation in several parts of the gameplay
experience. The first and most obvious, is procedurally generated levels.
The plan for each level is for it to last between five and ten minutes so
that you can get in, knock out a level, and continue on with your day.

As part of the level generation, I will also be implementing a forest
generation algorithm that will provide cover and hiding areas for enemies.
Things within the forest cannot be seen until you enter the forest and even
then you have only a limited range of sight. One fun challenge for level
generation is developing the two algorithms so that they complement each
other and provide a cohesive experience.

The last area where I can currently envision procedural generation is in the
actual player experience. The L4D series ships with an AI Director which is
responsible for creating enjoyable experiences for the player. The AI
Director is designed to take into account the player’s abilities, current
performance, position, mood, and tension to decide how many enemies to
spawn, where to spawn them, etc. I plan on developing something similar
because the AI of L4D has intrigued me for a very long time.

Other Details

The game will be developed in Java with the libGDX
library and will be available on all major platforms (iOS, Android, Mac,
Windows). This blog will host all of the major updates to the game until I
decide on an actual name for the project and purchase the domain for it.