Google I/O Bootcamp Codelab: Hands-on with the Datastore

- Ikai Lan

Introduction

This is an introduction to Google App Engine's datastore. We will be building a link-sharing site, and during the course
of this codelab we will learn about working with the datastore by writing code.

Prerequisites

Apache Ant (we will be using this for unit tests and for running the server)

The skeleton project TODO: link to skeleton project here

ShareSite

Our link sharing site will allow users to do the following:

Log in and post links

Log in and comment on links

Sort links by newest

Sort links by view count

Sort links by comment count

(Homework) Sort links by popularity in the last day

(Homework) Sort links by popularity in the last 7 days

(Homework) Allow users to post nested comments

Setup

Clone or download the repository. You will need Apache Ant installed.

Our data model

The basic data model is fairly simple. We have to record a few pieces of data for each Link:

Key

URL

title

description

posted - date when Link was posted

number of comments

owner's ID

Part 1: Defining basic persistence methods

Let's go through each of these steps now.

Step 1: Conversion to and from a POJO

Take a look at the toEntity() and fromEntity() methods in Link.java. These are already defined. Note the use
of indexed and unindexed properties: we need an index on every property we intend to query against. Let's walk
through the implementations.

In our toEntity() method, a few things are happening:

We create a new Entity instance. The single argument constructor is overloaded to take a Key or String argument.
If a Key argument is provided, we can infer the Kind as well as the key name. This is useful if the Link instance
already has a Key (it was previously saved). If it does not, we pass a String describing the Entity Kind. When
we save the Entity, a Key will automatically be assigned.

We set properties. There are two methods that we call: setProperty() and setUnindexedProperty(). setProperty()
tells the datastore that we want indexes to be written when the property is saved. Indexes are required
if we plan on querying or sorting on a value. We use setUnindexedProperty() on every property we will not query:
in this case, we'll never query on a URL, a title, or a description. In general, it's a best practice to index as little as
possible, as index writing consumes CPU as well as datastore space.
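Putting those two steps together, toEntity() might look roughly like this. This is a sketch: the exact field names (url, title, description, posted, numComments, ownerId) and the setKey() accessor are assumptions based on the data model above.

```java
// Types from com.google.appengine.api.datastore
public Entity toEntity() {
    Entity entity;
    if (key != null) {
        entity = new Entity(key);      // previously saved: reuse the existing Key
    } else {
        entity = new Entity("Link");   // new Link: a Key is assigned on first put()
    }
    // Indexed properties: we query and sort on these
    entity.setProperty("posted", posted);
    entity.setProperty("numComments", numComments);
    entity.setProperty("ownerId", ownerId);
    // Unindexed properties: never queried, so we skip the index writes
    entity.setUnindexedProperty("url", url);
    entity.setUnindexedProperty("title", title);
    entity.setUnindexedProperty("description", description);
    return entity;
}
```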

The fields of an Entity are typeless and can
be defined at runtime. Furthermore, the value of a property can be a String, Date, List, Key, or one of the other App
Engine-specific classes. A full list of property types is available in the official Entities documentation.

We'll also need to convert Entity instances back into Link instances. Below is the implementation of the fromEntity() method:
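A sketch of how that conversion might look, assuming the same field names as in toEntity():

```java
// Types from com.google.appengine.api.datastore; java.util.Date for posted
public static Link fromEntity(Entity entity) {
    Link link = new Link();
    link.key = entity.getKey();
    // getProperty() works identically for indexed and unindexed properties;
    // values come back as Object and must be cast to the proper type.
    link.url = (String) entity.getProperty("url");
    link.title = (String) entity.getProperty("title");
    link.description = (String) entity.getProperty("description");
    link.posted = (Date) entity.getProperty("posted");
    link.numComments = (Long) entity.getProperty("numComments");
    link.ownerId = (String) entity.getProperty("ownerId");
    return link;
}
```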

Fortunately, this is fairly straightforward. getProperty() can be used on any property. It does not distinguish between
indexed and unindexed properties. We have to cast the property to the proper type.

Step 2: implement put() and get() in LinkDAO

Let's define the put() method:
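A minimal sketch, assuming the DAO obtains its DatastoreService from DatastoreServiceFactory and that Link exposes a setKey() setter:

```java
// Types from com.google.appengine.api.datastore
public Link put(Link link) {
    DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
    Key key = datastore.put(link.toEntity());   // assigns a Key if the Entity is new
    link.setKey(key);
    return link;
}
```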

Here is the corresponding get() method:
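A sketch of get(), under the same assumptions as put():

```java
// Types from com.google.appengine.api.datastore
public Link get(Key key) {
    DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
    try {
        return Link.fromEntity(datastore.get(key));
    } catch (EntityNotFoundException e) {
        return null;   // no Link stored under this Key
    }
}
```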

These are basic persistence methods that convert the Link POJO to a low level Entity instance and save the instance.

Part 2: CommentDAO

Let's add comments to our model. Comments belong to a Link: a Comment cannot be associated with more than one Link, while one Link may have many Comments. This seems like a very good place to make use of Entity Groups, a hierarchical organization of related data that can be treated as a single unit. We'll make each Link an Entity Group root and set the ancestor of each Comment to the Link it references.

CommentDAO's put() method is already defined. It looks almost exactly like LinkDAO's put() method, so there's no value in repeating that exercise. Instead, we'll define two methods: createCommentForLink() and getForLink().

The job of createCommentForLink() is to accept a Link, create a Comment that contains the Link as the parent entity, and return the Comment without saving it. While that may seem simple, we'll have to take some extra steps to make it work the way we want to:
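The method might look roughly like this. The Comment setters, the field names, and the "Comment" Kind string are assumptions:

```java
// Types from com.google.appengine.api.datastore; java.util.Date for posted
public Comment createCommentForLink(Link parentLink, String ownerId, String text) {
    DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
    // Reserve one ID of Kind "Comment" inside this Link's entity group
    KeyRange keyRange = datastore.allocateIds(parentLink.getKey(), "Comment", 1);
    Key commentKey = keyRange.getStart();
    Comment comment = new Comment();
    comment.setKey(commentKey);        // the Key carries the Link as its parent
    comment.setOwnerId(ownerId);
    comment.setText(text);
    comment.setPosted(new Date());
    return comment;                    // caller persists it via CommentDAO.put()
}
```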

The key step is the call to allocateIds(). We ask the datastore to set aside some IDs for entities that have the parentLink's key as an ancestor. These IDs are guaranteed to be unique and will never be automatically reassigned, though it is possible for a developer to manually reuse them. The first parameter we pass is the parent Key. The second is the entity Kind. The third is the number of IDs to allocate.

Since only one Key is returned in this KeyRange, we can fetch it with the KeyRange's getStart() method.

The last part of this code is standard Java interaction with the Comment POJO.

The code in AddCommentServlet.java should now work correctly. Note that we are not transactionally incrementing the numComments field of Link. This has the potential to introduce a race condition if two Comments are saved at the same time. Our code doesn't currently make this transactional, mostly for the sake of simplicity. We would probably want to address this in a future update.
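Next, getForLink(). A sketch of how it might look; the posted property and the ascending sort order are assumptions:

```java
// Types from com.google.appengine.api.datastore; java.util.ArrayList / List
public List<Comment> getForLink(Link link) {
    DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
    Query query = new Query("Comment");
    query.setAncestor(link.getKey());   // ancestor query: strongly consistent
    query.addSort("posted", Query.SortDirection.ASCENDING);
    List<Comment> comments = new ArrayList<Comment>();
    for (Entity entity : datastore.prepare(query).asIterable()) {
        comments.add(Comment.fromEntity(entity));
    }
    return comments;
}
```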

The important step is setting the ancestor on the Query. We're telling the datastore to do an ancestor query: a query that spans a single entity group. The unit of atomicity and strong consistency in the High Replication datastore, the default option as of SDK 1.5.0, is the entity group, so it is important to organize data that functions as a single unit into entity groups when possible. If we created Comments without a parent and then queried shortly after an update using a non-ancestor query, the changes might not be visible right away. With entity groups and ancestor queries, we are guaranteed to see changes after they are committed.

Part 3: LinkDAO

To complete the example, we'll need to revisit LinkDAO and finish building it out. We'll add a few more methods. These methods fall into one of two categories. We have the query methods:

getNewest()

getSubmittedByOwnerId()

getPopularAllTime()

getMostDiscussed()

We also have the methods for dealing with the view counter:

incrementTotalViewCount()

flushCounterShards()

the run() method in FlushShardedCounters.java

We'll implement one of the query methods first, getNewest():
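A sketch of the method, assuming we sort on the indexed posted property and cap the result size with a limit parameter:

```java
// Types from com.google.appengine.api.datastore; java.util.ArrayList / List
public List<Link> getNewest(int limit) {
    DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
    Query query = new Query("Link");
    query.addSort("posted", Query.SortDirection.DESCENDING);   // newest first
    List<Link> links = new ArrayList<Link>();
    for (Entity entity : datastore.prepare(query)
            .asIterable(FetchOptions.Builder.withLimit(limit))) {
        links.add(Link.fromEntity(entity));
    }
    return links;
}
```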

The other methods are similar, so we won't cover them here. It may even make sense to extract a shared helper method in a future refactoring.

The two view counter methods are the interesting ones. We want to track how often a Link is viewed, and we want to be able to sort on that property. A naive implementation would simply update the viewCount property on a Link every time a user browses to the page. This will work, but it will hit a scaling wall very quickly once more users visit the site and the page's view rate exceeds the write rate a single entity can sustain. What if our site starts to attract hundreds or even thousands of visitors a second? With the naive counter implementation, we would either lose data or cause a large amount of write contention, resulting in errors being exposed to end users. Two possible solutions here are:

Increment a counter in Memcache. Periodically flush this counter to the datastore.

Use a sharded counter implementation.

Solution 1 is probably more than adequate for this particular use case: Memcache losing our data would not be the end of the world for view counts, where simply being in the general ballpark is good enough.

We'll implement solution 2 because it lets us introduce the principle of sharding. Sharding is the act of taking some data and splitting it across multiple logical entities. By employing sharding, we can linearly scale the write capacity of a single logical entity by adding more shards; effectively, we remove a bottleneck. Sharding is an implementation detail: users do not care how their data is persisted, as long as it is correct and they have access to it. This kind of scalability is not possible without data sharding.

Let's revisit our requirements. We want to be able to:

Sort links by most views

Have a pretty accurate view count stored with each Link so that we can take advantage of the datastore's indexing when we query and sort by the number of views

Our implementation will look like this:

When a user views a link detail page, we write to a sharded counter. Each Link is associated with N entities representing a distributed count of the total number of views this Link has recorded. To minimize contention, each counter shard will be a root entity.

When a user views a link detail page, we also create a new task queue task, scheduling it for the next discrete minute and naming it accordingly. Once a task has been given a name, a task with that same name cannot be enqueued again for a one-week period; the Task Queue API will throw an exception if we try. The job of this task is to collect all the sharded counters, total up the count, and update the Link entity with the view count.

We're going to make a compromise here: the view counts will not be entirely up to date, since they will be updated once a minute by a background task. This is perhaps the simplest example of eventual consistency. It takes a minute (or more) for the internal state of the Link entity to become “correct”, but it does eventually become correct. This is a compromise we can live with, as it allows us to greatly increase our write throughput.

We'll implement incrementTotalViewCount() first:
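A sketch of the method. The getNumShards() accessor, the LinkViewCountShard Kind name, the key-name scheme, and the incrementCounterShard() helper are all assumed names:

```java
// java.util.Random; datastore types from com.google.appengine.api.datastore
public void incrementTotalViewCount(Link link) {
    // Pick one of this Link's counter shards at random to spread the writes
    int numShards = link.getNumShards();   // stored as an unindexed property
    int shardNumber = new Random().nextInt(numShards);
    // Construct the shard's Key directly; no query needed
    String keyName = "shard" + shardNumber + "_link" + link.getKey().getId();
    Key shardKey = KeyFactory.createKey("LinkViewCountShard", keyName);
    incrementCounterShard(shardKey);
}
```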

We're going to reference the shards by Key. An alternative implementation would use queries and query for all the counter shards, updating the one we want, but doing it this way gives us:

Strong consistency in high replication datastore (gets by key are strongly consistent)

A little bit of speed. Not a lot, just a little. We also retrieve only the data we want.

Transactionality. A query spanning multiple entity groups cannot run inside a single transaction, so we would not have been able to transactionally read and update the data that way.

Here we've programmatically constructed the key based on the randomly generated shardNumber value, a value between 0 and the total number of counter shards this Link has. Storing the shard count as an unindexed property on the Link, rather than using a fixed global value, gives us the flexibility to assign more counter shards to any entity that turns out to require an even higher write rate.

Now let's implement the actual incrementing of the counter value:
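A sketch of a transactional helper for this step; the method name incrementCounterShard() and the count property are assumptions:

```java
// Types from com.google.appengine.api.datastore
private void incrementCounterShard(Key shardKey) {
    DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
    Transaction txn = datastore.beginTransaction();
    try {
        Entity shard;
        try {
            shard = datastore.get(txn, shardKey);
            shard.setUnindexedProperty("count",
                    (Long) shard.getProperty("count") + 1);
        } catch (EntityNotFoundException e) {
            shard = new Entity(shardKey);   // first view recorded on this shard
            shard.setUnindexedProperty("count", 1L);
        }
        datastore.put(txn, shard);
        txn.commit();
    } finally {
        if (txn.isActive()) {
            txn.rollback();   // commit failed, e.g. due to contention
        }
    }
}
```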

We transactionally get the counter shard by key. If it doesn't exist, we create a new Entity with the Key we created and set the value to 1.

Now we need to implement the flushCounterShards() method. This is the method that our background task will use to update the viewCount property on the Link entity:
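A sketch of flushCounterShards(), using the same assumed key-name scheme and accessors as above:

```java
// Datastore types from com.google.appengine.api.datastore; java.util collections
public void flushCounterShards(Link link) {
    DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
    // Build every shard Key up front so we can do a single batch get
    List<Key> shardKeys = new ArrayList<Key>();
    for (int i = 0; i < link.getNumShards(); i++) {
        String keyName = "shard" + i + "_link" + link.getKey().getId();
        shardKeys.add(KeyFactory.createKey("LinkViewCountShard", keyName));
    }
    Map<Key, Entity> shards = datastore.get(shardKeys);
    long total = 0;
    for (Entity shard : shards.values()) {
        total += (Long) shard.getProperty("count");
    }
    link.setViewCount(total);
    put(link);   // persist the rolled-up count on the Link itself
}
```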

Since we know the total number of view counter shards a link will have, we can build a List of
keys to do a batch get.

We iterate through these Keys, retrieving the counter property and adding it to our running total. We update our Link with this total and save it.

There's one last part to our flush implementation: the deferred task that will run the code. Fortunately, this is fairly simple. Open FlushShardedCounters.java and complete the implementation of the run() method:
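One way the run() method could look, assuming FlushShardedCounters implements the Task Queue API's DeferredTask interface and receives the Link ID through its constructor:

```java
// com.google.appengine.api.taskqueue.DeferredTask;
// com.google.appengine.api.datastore.KeyFactory
public class FlushShardedCounters implements DeferredTask {
    private final long linkId;

    public FlushShardedCounters(long linkId) {
        this.linkId = linkId;
    }

    @Override
    public void run() {
        LinkDAO dao = new LinkDAO();
        Link link = dao.get(KeyFactory.createKey("Link", linkId));
        if (link != null) {
            dao.flushCounterShards(link);
        }
    }
}
```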

This code is being scheduled to run in LinkDetailServlet.java:
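The scheduling code might look roughly like this; the task-name format and the one-minute rounding are assumptions consistent with the description below:

```java
// com.google.appengine.api.taskqueue.{QueueFactory, TaskOptions,
//   TaskAlreadyExistsException}
long linkId = link.getKey().getId();
// Round up to the next discrete minute boundary, in milliseconds
long nextMinute = (System.currentTimeMillis() / 60000L + 1) * 60000L;
String taskName = "flush-link" + linkId + "-" + nextMinute;
try {
    QueueFactory.getDefaultQueue().add(
        TaskOptions.Builder
            .withPayload(new FlushShardedCounters(linkId))
            .taskName(taskName)
            .etaMillis(nextMinute));
} catch (TaskAlreadyExistsException e) {
    // A flush for this link and minute is already scheduled; nothing to do
}
```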

Naming the task ensures that we will not enqueue two tasks with the same name. In this case, we concatenate the Link ID with the next discrete minute, represented in milliseconds, to form a unique identifier.
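The next-discrete-minute arithmetic is plain Java and can be sketched on its own. The helper names below are illustrative, not from the skeleton project:

```java
public class TaskNames {
    // Round nowMillis up to the next discrete minute boundary.
    static long nextMinuteMillis(long nowMillis) {
        return (nowMillis / 60000L + 1) * 60000L;
    }

    // Unique per link per minute: at most one flush task can be
    // enqueued for a given link in any given minute.
    static String flushTaskName(long linkId, long nowMillis) {
        return "flush-link" + linkId + "-" + nextMinuteMillis(nowMillis);
    }

    public static void main(String[] args) {
        // 120500 ms lies in minute 2, so it rounds up to 180000 ms
        System.out.println(flushTaskName(42L, 120500L));
    }
}
```

Every call within the same minute produces the same name, so the Task Queue's name-based deduplication collapses them into a single scheduled flush.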

Run the unit tests using

ant test

All the tests should now be passing.

Homework

The self-study in this lab is to implement two features:
- most popular in the last 7 days
- most popular today