Tutorial: Design and implementation of a simple Twitter clone using PHP and the Redis key-value store

Tutorial: Design and implementation of a simple Twitter clone using PHP and the Redis key-value store¶

This article describes the design and implementation of a very simple
Twitter clone written using PHP
with Redis as the only database. The programming community has
traditionally considered key-value stores as a special purpose database
that couldn’t be used as a drop in replacement for a relational database
for the development of web applications. This article will try to show
that Redis data structures on top of a key-value layer are an effective
data model to implement many kinds of applications.

Before to continue, you may want to play a few seconds with the Retwis
online demo, to check what we are going to
actually model. Long story short: it is a toy, but complex enough to be
a foundation in order to learn how to create more complex applications.

Note: the original version of this article was written in 2009 when
Redis was released. It was not exactly clear at the time that the Redis
data model was suitable to write entire applications. Now after 5 years
there are many cases of applications using Redis as their main store, so
the gaol of the article today is to be a tutorial for Redis newcomers.
You’ll learn how to design a simple data layout using Redis, and how to
apply different data structures.

Our Twitter clone, called Retwis, is
structurally simple, has very good performance, and can be distributed
among any number of web and Redis servers with little efforts. You can
find the source code
here.

I use PHP for the example since it can be read by everybody. The same
(or better) results can be obtained using Ruby, Python, Erlang, and so
on. A few clones exist (however not all the clones use the same data
layout as the current version of this toturial, so please, stick with
the official PHP implementation for the sake of following the article
better).

Retwis-RB is a port of Retwis
to Ruby and Sinatra written by Daniel Lucraft! Full source code is
included of course, and a link to its Git repository appears in the
footer of this article. The rest of this article targets PHP, but
Ruby programmers can also check the Retwis-RB source code since it’s
conceptually very similar.

Retwis-J is a port of Retwis
to Java, using the Spring Data Framework, written by Costin
Leau. Its source code can be found on
GitHub,
and there is comprehensive documentation available at
springsource.org.

The essence of a key-value store is the ability to store some data,
called a value, inside a key. The value can be retrieved later only if
we know the specific key it was stored in. There is no direct way to
search for a key by value. In a sense, it is like a very large
hash/dictionary, but it is persistent, i.e. when your application ends,
the data doesn’t go away. So, for example, I can use the command SET
to store the value bar in the key foo:

SET foo bar

Redis stores data permanently, so if I later ask “What is the value
stored in key foo?” Redis will reply with bar:

GET foo => bar

Other common operations provided by key-value stores are DEL, to
delete a given key and its associated value, SET-if-not-exists (called
SETNX on Redis), to assign a value to a key only if the key does not
already exist, and INCR, to atomically increment a number stored in
a given key:

Something is wrong! We incremented the value two times, but instead of
going from 10 to 12, our key holds 11. This is because the increment
done with GET/increment/SETis not an atomic operation.
Instead the INCR provided by Redis, Memcached, ..., are atomic
implementations, and the server will take care of protecting the key
during the time needed to complete the increment in order to prevent
simultaneous accesses.

What makes Redis different from other key-value stores is that it
provides other operations similar to INCR that can be used to model
complex problems. This is why you can use Redis to write whole web
applications without using another database like an SQL database, and
without going crazy.

In this section we will see which Redis features we need to build our
Twitter clone. The first thing to know is that Redis values can be more
than strings. Redis supports Lists, Sets, Hashes, Sorted Sets, Bitmaps,
and HyperLogLog types as values, and there are atomic operations to
operate on them so we are safe even with multiple accesses to the same
key. Let’s start with Lists:

LPUSH means Left Push, that is, add an element to the left (or to
the head) of the list stored in mylist. If the key mylist does not
exist it is automatically created as an empty list before the PUSH
operation. As you can imagine, there is also an RPUSH operation that
adds the element to the right of the list (on the tail). This is very
useful for our Twitter clone. User updates can be added to a list stored
in username:updates, for instance.

There are operations to get data from Lists, of course. For instance,
LRANGE returns a range from the list, or the whole list.

LRANGE mylist 0 1 => c,b

LRANGE uses zero-based indexes - that is the first element is 0, the
second 1, and so on. The command arguments are
LRANGEkeyfirst-indexlast-index. The last-index argument can be
negative, with a special meaning: -1 is the last element of the list, -2
the penultimate, and so on. So, to get the whole list use:

LRANGE mylist 0 -1 => c,b,a

Other important operations are LLEN that returns the number of elements
in the list, and LTRIM that is like LRANGE but instead of returning the
specified range trims the list, so it is like Get range from mylist,
Set this range as new value but does so atomically.

Currently we don’t use the Set type in this tutorial, but since we use
Sorted Sets, which are kind of a more capable version of Sets, it is
better to start introducing Sets first (which are a very useful data
structure per se), and later Sorted Sets.

There are more data types than just Lists. Redis also supports Sets,
which are unsorted collections of elements. It is possible to add,
remove, and test for existence of members, and perform the intersection
between different Sets. Of course it is possible to get the elements of
a set. Some examples will make it more clear. Keep in mind that SADD
is the add to set operation, SREM is the remove from set
operation, sismember is the test if member operation, and SINTER
is the perform intersection operation. Other operations are SCARD
to get the cardinality (the number of elements) of a Set, and
SMEMBERS to return all the members of a Set.

Note that SMEMBERS does not return the elements in the same order we
added them since Sets are unsorted collections of elements. When you
want to store in order it is better to use Lists instead. Some more
operations against Sets:

Sorted Sets are similar to Sets: collection of elements. However in
Sorted Sets each element is associated with a floating point value,
called the element score. Because of the score, elements inside a
sorted set are ordered, since we can always compare two elements by
score (and if the score happens to be the same, we compare the two
elements as strings).

Like Sets in Sorted Sets it is not possible to add repeated elements,
every element is unique. However it is possible to update an element’s
score.

Sorted Set commands are prefixed with Z. The following is an example
of Sorted Sets usage:

In the above example we added a few elements with ZADD, and later
retrivied the elements with ZRANGE. As you can see the elements are
returned in order according to their score. In order to check if a given
element exists, and also to retrieve its score if it exists, we use the
ZSCORE command:

This is the last data structure we use in our program, and is extremely
easy to gasp since there is an equivalent in almost every programming
language out there: Hashes. Redis Hashes are basically like Ruby or
Python hashes, a collection of fields associated with values:

If you haven’t downloaded the Retwis source
code already please grab it now.
It contains a few PHP files, and also a copy of
Predis, the PHP client library we
use in this example.

Another thing you probably want is a working Redis server. Just get the
source, build with make, run with ./redis-server, and you’re ready to
go. No configuration is required at all in order to play with or run
Retwis on your computer.

When working with a relational database, a database schema must be
designed so that we’d know the tables, indexes, and so on that the
database will contain. We don’t have tables in Redis, so what do we need
to design? We need to identify what keys are needed to represent our
objects and what kind of values this keys need to hold.

Let’s start with Users. We need to represent users, of course, with
their username, userid, password, the set of users following a given
user, the set of users a given user follows, and so on. The first
question is, how should we identify a user? Like in a relational DB, a
good solution is to identify different users with different numbers, so
we can associate a unique ID with every user. Every other reference to
this user will be done by id. Creating unique IDs is very simple to do
by using our atomic INCR operation. When we create a new user we can
do something like this, assuming the user is called “antirez”:

Note: you should use an hashed password in a real application, for
simplicity we store the password in clear text.

We use the next_user_id key in order to always get an unique ID for
every new user. Then we use this unique ID to name the key holdign an
Hash with user’s data. This is a common design pattern with key-values
stores! Keep it in mind. Besides the fields already defined, we need
some more stuff in order to fully define a User. For example, sometimes
it can be useful to be able to get the user ID from the username, so
every time we add an user, we also populate the users key, which is
an Hash, with the username as field, and its ID as value.

HSET users antirez 1000

This may appear strange at first, but remember that we are only able to
access data in a direct way, without secondary indexes. It’s not
possible to tell Redis to return the key that holds a specific value.
This is also our strength. This new paradigm is forcing us to organize
data so that everything is accessible by primary key, speaking in
relational DB terms.

There is another central need in our system. A user might have users who
follow them, which we’ll call their followers. A user might follow other
users, which we’ll call a following. We have a perfect data structure
for this. That is... Sets. The uniqueness of Sets elements, and the fact
we can test in constant time for existence, are two interesting
features. However what about also remembering the time at which a given
user started following another one? In an enhanced version of our simple
Twitter clone this may be useful, so instead of using a simple Set, we
use a Sorted Set, using the user ID of the following or follower user as
element, and the unix time at which the relation between the users was
created, as our score.

So let’s define our keys:

followers:1000 => Sorted Set of uids of all the followers users
following:1000 => Sorted Set of uids of all the following users

Another important thing we need is a place were we can add the updates
to display in the user’s home page. We’ll need to access this data in
chronological order later, from the most recent update to the oldest, so
the perfect kind of data structure for this is a List. Basically every
new update will be LPUSHed in the user updates key, and thanks to
LRANGE, we can implement pagination and so on. Note that we use the
words updates and posts interchangeably, since updates are actually
“little posts” in some way.

posts:1000 => a List of post ids - every new post is LPUSHed here.

This list is basically the User timeline. We’ll push the IDs of her/his
own posts, and, the IDs of all the posts of created by the following
users. Basically we implement a write fanout.

OK, we have more or less everything about the user except for
authentication. We’ll handle authentication in a simple but robust way:
we don’t want to use PHP sessions, our system must be ready to be
distributed among different web servers easily, so we’ll keep the whole
state in our Redis database. All we need is a random unguessable
string to set as the cookie of an authenticated user, and a key that
will contain the user ID of the client holding the string.

We need two things in order to make this thing work in a robust way.
First: the current authentication secret (the random unguessable
string) should be part of the User object, so when the user is created
we also set an auth field in its Hash:

HSET user:1000 auth fea5e81ac8ca77622bed1c2132a021f9

Moreover, we need a way to map authentication secrets to user IDs, so we
also take an auths key, which has as value an Hash type mapping
authentication secrets to user IDs.

HSET auths fea5e81ac8ca77622bed1c2132a021f9 1000

In order to authenticate a user we’ll do these simple steps ( see the
login.php file in the Retwis source code): * Get the username and
password via the login form * Check if the username field actually
exists in the users Hash. * If it exists we have the user id, (i.e.
1000) * Check if user:1000 password matches, if not, return an error
message * Ok authenticated! Set “fea5e81ac8ca77622bed1c2132a021f9” (the
value of user:1000 auth field) as the “auth” cookie.

This happens every time a user logs in, but we also need a function
isLoggedIn in order to check if a given user is already
authenticated or not. These are the logical steps preformed by the
isLoggedIn function:

Get the “auth” cookie from the user. If there is no cookie, the user
is not logged in, of course. Let’s call the value of the cookie
<authcookie>

Check if <authcookie> field in the auths Hash exists, and
what the value (the user ID) is (1000 in the example).

In order for the system to be more robust, also verify that user:1000
auth field also matches.

OK the user is authenticated, and we loaded a bit of information in
the $User global variable.

Having loadUserInfo as a separate function is overkill for our
application, but it’s a good approach in a complex application. The only
thing that’s missing from all the authentication is the logout. What do
we do on logout? That’s simple, we’ll just change the random string in
user:1000 auth field, remove the old authentication secret from the
auths Hash., and add the new one.

Important: the logout procedure explains why we don’t just
authenticate the user after looking up the authentication secret in the
auths Hash, but double check it against user:1000 auth field.
The true authentication string is the latter, while the auths Hash
is just an authentication field that may even be volatile, or, if there
are bugs in the program or a script gets interrupted, we may even end
with multiple entries in the auths key pointing to the same user ID.
The logout code is the following (logout.php):

As you can see each post is just represented by an Hash with three
fields. The ID of the user owning the post, the time at which the post
was published, and finally the body of the post, which is, the actual
status message.

After we create a post and we obtain the post ID, we need to LPUSH the
ID in the timeline of every user that is following the author of the
post, and of course in the list of posts of the author itself (everybody
is virtually following herself/himself). This is the file post.php
that shows how this is performed:

The core of the function is the foreach loop. We use ZRANGE to
get all the followers of the current user, then the loop will LPUSH the
push the post in every follower timeline List.

Note that we also maintain a global timeline for all the posts, so that
in the Retwis home page we can show everybody’s updates easily. This
requires just doing an LPUSH to the timeline List. Let’s face
it, aren’t you start thinking it was a bit strange to have to sort
things added in chronological order using ORDERBY with SQL? I think
so.

There is an interesting thing to notice in the code above: we use a new
command called LTRIM after we perform the LPUSH operation in the
global timeline. This is used in order to trim the list to just 1000
elements. The global timeline is actually only used in order to show a
few posts in the home page, there is no need to have the full history of
all the posts.

Basically LTRIM + LPUSH is a way to create a capped collection
in Redis.

showPost will simply convert and print a Post in HTML while
showUserPosts gets a range of posts and then passes them to
showPosts.

Note: ``LRANGE`` is not very efficient if the list of posts start to be
very big, and we want to access elements which are in the middle of the
list, since Redis Lists are backed by linked lists. If a system is
designed for deep pagination of million of items, it is better to resort
to Sorted Sets instead.

It is not hard, but we did not yet checked how we create following /
follower relationships. If user ID 1000 (antirez) wants to follow user
ID 5000 (pippo), we need to create both a following and a follower
relationship. We just need to ZADD calls:

ZADD following:1000 5000
ZADD followers:5000 1000

Note the same pattern again and again. In theory with a relational
database the list of following and followers would be contained in a
single table with fields like following_id and follower_id. You
can extract the followers or following of every user using an SQL query.
With a key-value DB things are a bit different since we need to set both
the 1000isfollowing5000 and 5000isfollowedby1000
relations. This is the price to pay, but on the other hand accessing the
data is simpler and extremely fast. Having these things as separate sets
allows us to do interesting stuff. For example, using ZINTERSTORE we
can have the intersection of ‘following’ of two different users, so we
may add a feature to our Twitter clone so that it is able to tell you
very quickly when you visit somebody else’s profile, “you and Alice have
34 followers in common”, and things like that.

You can find the code that sets or removes a following / follower
relation in the follow.php file.

Gentle reader, if you reached this point you are already a hero. Thank
you. Before talking about scaling horizontally it is worth checking
performance on a single server. Retwis is extremely fast, without any
kind of cache. On a very slow and loaded server, an apache benchmark
with 100 parallel clients issuing 100000 requests measured the average
pageview to take 5 milliseconds. This means you can serve millions of
users every day with just a single Linux box, and this one was monkey
ass slow... Imagine the results with more recent hardware.

However you can’t go with a single server forever, how do you scale a
key-value store?

Retwis does not perform any multi-keys operation, so making it scalable
is simple: you may use client-side sharding, or something like a
sharding proxy like Twemproxy, or the upcoming Redis Cluster.

To know more about those topics please read our documentation about
sharding. However here the point to stress is
that in a key-value store, if you design with care, the data set is
split among many independent small keys. To distribute those keys to
multiple nodes is more straightforward and predictable compared to using
a semantically more complex database system.