Now let's say I get the first 100 posts using Post.all().fetch(limit=100), and pass this list to the template - what happens?

It makes 200 more datastore gets - 100 to get each author, 100 to get each author's city.

This is perfectly understandable, actually, since the post only has a reference to the author, and the author only has a reference to the city. The __get__ accessors on the post.author and author.city properties transparently do a get and pull the data back (see this question).

Some ways around this are

Use Post.author.get_value_for_datastore(post) to collect the author keys (see the link above), and then do a batch get to get them all - the trouble here is that we need to re-construct a template data object... something which needs extra code and maintenance for each model and handler.

Write an accessor, say cached_author, that checks memcache for the author first and returns that - the problem here is that post.cached_author is going to be called 100 times, which could probably mean 100 memcache calls.

Hold a static key to object map (and refresh it maybe once in five minutes) if the data doesn't have to be very up to date. The cached_author accessor can then just refer to this map.
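The third idea can be sketched in a few lines of plain Python. Everything here is hypothetical (the names, the load function, the five-minute TTL), and _load_all_authors is a stand-in for one batched datastore load:

```python
import time

_CACHE_TTL = 300  # refresh at most once in five minutes
_author_map = {}  # author key -> author entity
_last_refresh = 0.0

def _load_all_authors():
    # Stand-in for a single batched datastore load of every author.
    return {"author1": {"name": "Alice"}, "author2": {"name": "Bob"}}

def cached_author(author_key):
    """Serve authors from a static map, refreshing it when it goes stale."""
    global _author_map, _last_refresh
    if not _author_map or time.time() - _last_refresh > _CACHE_TTL:
        _author_map = _load_all_authors()
        _last_refresh = time.time()
    return _author_map.get(author_key)
```

The accessor then costs a dict lookup instead of a datastore get, at the price of data being up to five minutes stale.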

All these ideas need extra code and maintenance, and they're not very transparent. What if we could do

@prefetch
def render_template(path, data):
    return template.render(path, data)

Turns out we can... hooks and Guido's instrumentation module both prove it. If the @prefetch method wraps a template render, it can capture which keys are requested (at least to one level of depth), return mock objects, and do a batch get on them. This could be repeated level by level until no new keys are requested. The final render could then intercept the gets and return the objects from a map.
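To make the idea concrete, here is a plain-Python sketch of the multi-pass scheme, with a dict standing in for the datastore and an explicit get callback standing in for the property interception; none of these names exist in the SDK:

```python
# Dicts stand in for datastore entities; strings stand in for db.Key values.
DATASTORE = {
    "post1": {"author": "author1"},
    "post2": {"author": "author2"},
    "author1": {"city": "city1"},
    "author2": {"city": "city1"},
    "city1": {"name": "Springfield"},
}

def batch_get(keys):
    # One round trip for many keys (db.get(list_of_keys) in the real SDK).
    return {k: DATASTORE[k] for k in keys}

def render_posts(get, post_keys):
    # A stand-in "template": dereferences post -> author -> city per post.
    out = []
    for key in post_keys:
        post = get(key)
        author = get(post["author"]) if "author" in post else {}
        city = get(author["city"]) if "city" in author else {}
        out.append(city.get("name", "?"))
    return out

def prefetch_render(render, root_keys):
    # Dry-run the render repeatedly: each pass serves already-resolved
    # entities, records the keys discovered one depth level deeper, and
    # resolves them all in a single batch get. Stops when a pass asks for
    # nothing new; the final render sees a fully populated map.
    resolved, pending, batches = {}, set(root_keys), 0
    while pending:
        resolved.update(batch_get(pending))
        batches += 1
        pending = set()
        def get(key):
            if key in resolved:
                return resolved[key]
            pending.add(key)  # found a key one level deeper
            return {}         # mock entity for this dry run
        render(get, root_keys)  # output discarded
    return render(lambda k: resolved[k], root_keys), batches
```

For the three-level post/author/city chain this always costs three batch gets, no matter how many posts there are.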

This would change a total of 200 gets into 3, transparently and without any extra code. It would also greatly cut down the need for memcache, and help in situations where memcache can't be used.

Trouble is I don't know how to do it (yet). Before I start trying, has anyone else done this? Or does anyone want to help? Or do you see a massive flaw in the plan?

Something I omitted from my answer... you may or may not realize that RPC calls are extremely slow compared to your Python code, and they count against your quota. IIRC datastore fetches by key take around 100ms each, so those 200 fetches (one from each post to its author, and one from each author to its city) would take around 20 seconds, which will time out your page.
–
JasonSmithJan 16 '10 at 9:32

Precisely... my actual page will probably only show ten posts, and I'm recording 40 calls. That takes a little more than a second; call counts in the hundreds could easily time out.
–
Sudhir JonathanJan 16 '10 at 9:37

Also, hooray for a user in timezone >= GMT+5 (I'm assuming)! Finally I can answer before all the Westerners do! :p
–
JasonSmithJan 16 '10 at 9:38

Lol... yeah, I don't see too many eastern hemisphere people contributing here. Would be nice :)
–
Sudhir JonathanJan 16 '10 at 16:13

2 Answers

I have been in a similar situation. Instead of ReferenceProperty, I had parent/child relationships but the basics are the same. My current solution is not polished but at least it is efficient enough for reports and things with 200-1,000 entities, each with several subsequent child entities that require fetching.

You can manually fetch the data in batches and set it on the entities yourself if you want.

# Given the posts, fetches all the data the template will need
# with just 2 batched datastore gets.
posts = get_the_posts()
author_keys = [Post.author.get_value_for_datastore(x) for x in posts]
authors = db.get(author_keys)
city_keys = [Author.city.get_value_for_datastore(x) for x in authors]
cities = db.get(city_keys)
for post, author, city in zip(posts, authors, cities):
    post.author = author
    author.city = city

Now when you render the template, no additional queries or fetches will be done. It's rough around the edges but I could not live without this pattern I just described.

Also, you might consider validating that none of your entities are None, because db.get() will return None for any key that doesn't resolve to an entity. That is getting into just basic data validation, though. Similarly, you need to retry db.get() if there is a timeout, etc.
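As a sketch, that validation can live in a small hypothetical helper (not an SDK function) that fails fast when a batch get comes back with holes:

```python
def require_all(keys, entities):
    # db.get() returns None for any key whose entity no longer exists;
    # catch that here instead of crashing mid-render.
    missing = [k for k, e in zip(keys, entities) if e is None]
    if missing:
        raise ValueError("no entity found for keys: %r" % (missing,))
    return entities
```

Used like authors = require_all(author_keys, db.get(author_keys)) in the snippet above.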

(Finally, I don't think memcache will work as a primary solution. Maybe as a secondary layer to speed up datastore calls, but your app needs to work well even when memcache is empty. Memcache also has several quotas of its own, such as total calls and total data transferred, and overusing it is a great way to kill your app dead.)
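A read-through pattern like the following keeps memcache strictly secondary. Plain dicts stand in for the memcache service and the datastore here, and get_entities is a made-up name for this sketch:

```python
MEMCACHE = {}  # stand-in for the memcache service
DATASTORE = {"a1": {"name": "Alice"}, "a2": {"name": "Bob"}}

def datastore_batch_get(keys):
    # Stand-in for db.get(keys): one batched call for all the misses.
    return {k: DATASTORE[k] for k in keys}

def get_entities(keys):
    """Check the cache first, but stay correct when it is cold or empty."""
    found = {k: MEMCACHE[k] for k in keys if k in MEMCACHE}
    misses = [k for k in keys if k not in found]
    if misses:
        fetched = datastore_batch_get(misses)  # one call, not len(misses)
        MEMCACHE.update(fetched)               # backfill for next time
        found.update(fetched)
    return found
```

The app works identically with an empty cache; memcache only shaves off round trips when it happens to be warm.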

The second code block is very helpful... I didn't think of zip, or of using it that way. But the first block is actually a built-in feature of the SDK: references are cached after they're resolved once, so there's absolutely no need for that code.
–
Sudhir JonathanJan 17 '10 at 6:59

You are correct. My code was copied from a similar situation not using ReferenceProperty. It would be nice to populate the property via some back-door rather than just blowing away the .city attribute. But I believe that would work in a pinch.
–
JasonSmithJan 17 '10 at 9:00

Hmm.. I am reading the code in google/appengine/ext/db/__init__.py in the SDK. It looks like a simple assignment works fine because it will call the ReferenceProperty's __set__() method. I will update the answer to be shorter and clearer.
–
JasonSmithJan 17 '10 at 9:05

Done :) Also I forgot to mention, I use itertools.izip in my real code because I used to hit MemoryErrors from time to time. It's probably not necessary in general though.
–
JasonSmithJan 17 '10 at 9:09

Is the itertools version any faster? Why would there even be a difference? Zip seems like a very simple algo to me.
–
Sudhir JonathanJan 26 '10 at 6:02
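For what it's worth, the difference is memory rather than speed: in Python 2 (which App Engine ran at the time) zip() builds the whole list of pairs up front, while itertools.izip() yields them lazily, which is exactly what avoids MemoryErrors on big result sets. Python 3's zip() is lazy by itself, as this small sketch shows:

```python
import itertools

def first_pairs(n, count):
    # zip() here is lazy (Python 3), like itertools.izip() in Python 2:
    # only the pairs actually consumed are ever built.
    lazy = zip(range(n), range(n))
    return list(itertools.islice(lazy, count))

print(first_pairs(10**9, 3))  # builds 3 pairs, not a billion
```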