
Tech the Tech

This post is a plea for help. I’ve made something pretty cool, in a commercial context, that’s small and contained and useful and which I want to open source. I’ve got no experience in successfully creating open sourced code that other people might want to interact with, so I’m looking for guidance from someone who does, particularly in the Python and/or AppEngine communities.

AppEngine is a great platform, one I’ve been working in now, using Python, for a couple of years.

One of the ongoing annoyances for me has been the low-level nature of the abstractions available for distributed programming, particularly using task queues to kick off background tasks (which might want to kick off more background tasks, and so on).

At the most basic, you can create tasks on a queue. The queue is processed, the task runs eventually. The task is a web handler (a “post” handler by default). Simple, but it’s messy to set up if you want to do complex things: lots of boilerplate, lots of messing around with routes and so on.

Then there’s the excellent deferred library. It allows you to kick off a function with some arguments in a task, while hiding all of the task queue messiness. It makes tasks highly useful and usable. But there are still niggles.

Firstly, deferred functions are passed by name. Actually, it’s more complex than this; deferred takes a callable, which might be an object (must be picklable, is pickled), or a function (passed by name for the most part I think, maybe something else going on with built in functions). But in any case we have the following restrictions in the doc:

The following callables can NOT be used as tasks:
1) Nested functions or closures
2) Nested classes or objects of them
3) Lambda functions
4) Static methods

ie: you can’t use really interesting stuff.
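A quick way to see where restrictions like these come from is to try pickling the callables directly. This is only a sketch, assuming deferred leans on Python's pickle module to ship the callable to the task; `can_pickle` and `make_nested` are hypothetical helper names, not part of any library.

```python
import math
import pickle


def can_pickle(obj):
    """Return True if pickle can serialise obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def make_nested():
    # A nested function has no importable module-level name,
    # so pickle cannot store a reference to it.
    def nested(x):
        return x + 1
    return nested


print(can_pickle(math.sqrt))          # a function reachable by name
print(can_pickle(lambda x: x + 1))    # a lambda
print(can_pickle(make_nested()))      # a nested function
```

Functions are pickled as a reference to their name, which is why anything without a stable, importable name falls over.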

The second problem is that you launch a process, but then what? Does it ever execute? How do you know when it’s complete? How can you compose these calls?

As a higher level alternative to using deferred, I’ve made a library that provides distributed promises.

It lets you do things like this:

The interesting thing here is that both the functions are run in (separate) background tasks. If the first one times out, the second one will receive a timeout error as the result.

Javascript’s promises are used for handling an asynchronous environment inside a single process. So you can kick off something asynchronous, and have the handler for that be a closure over the resolve method, meaning you can signal the job is done in a finished handler, without needing to “return” from the originating function.

For python on appengine, if you want to manage asynchronous behaviour inside a single process, try ndb tasklets.

My distributed promises for appengine are instead for managing distributed processing; that is, algorithms coordinating multiple tasks to do something.

The function you pass to your promise (via when, then, etc) will run in a separate task, and can signal it is done by calling resolve() which will trigger further tasks (eg: as above) if necessary.

Also, to make promises really work and be interesting, I’ve included the ability to pass inner functions and closures and so on through to the promises. So while defer takes only callables that are picklable or referenceable by name, promises take functions (must be functions, not other callables) which are completely serialised with their entire closure context as necessary, and reconstructed in the task in which they will run. So you can do things like this:

Note the functions are referring to names from outside their own scope. Full closures are allowed. Also you can pass lambda functions.

The when() function returns a promise. The then() function returns a promise. There’s an all() function which also returns a promise (as well as variations such as allwhen and allthen). These functions are completely chainable.

The all() functions allow you to do something when a bunch of other things finish, ie:

promiseSpace.all([promises...]).then(somethingfinal)

Python has great exception handling, so I’ve dispensed with Javascript’s separate resolve and reject methods; to reject, just pass an exception to resolve. Also dispensed with are separate success / fail handlers; instead you get a result object with a value property, which throws the exception if the result was actually an exception.
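As a rough illustration of the result object idea, here is a minimal in-process sketch. `PromiseResult` and `handle` are hypothetical names made up for this example; the library's real class and property names may differ.

```python
class PromiseResult(object):
    """Holds either a normal value or an exception (hypothetical class)."""

    def __init__(self, outcome):
        self._outcome = outcome

    @property
    def value(self):
        # If the promise was resolved with an exception, raise it here.
        if isinstance(self._outcome, Exception):
            raise self._outcome
        return self._outcome


def handle(result):
    """One handler covers both the success and the failure case."""
    try:
        return "got %s" % result.value
    except ValueError as ex:
        return "failed: %s" % ex


print(handle(PromiseResult(42)))
print(handle(PromiseResult(ValueError("timed out"))))
```

The point is that a single handler with ordinary try/except replaces the separate success and fail callbacks.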

Here’s a more red-blooded example of a function for mapping over the datastore, an example of using that to make a count function, and an example of calling that function:

Start at the bottom and work upwards to see how a simple counting method is implemented.

There might be much better ways of writing this; I’m not especially good at using these promises yet!

So, in closing, if you can see some value in this and you could help me with how to best publish this code, please contact me.

In my professional life I’m working on a server side appengine based system whose next iteration needs to be really good at dealing with schema-less data; JSON objects, in practical terms. To that end I’ve thrown together a simple document database layer to sit on top of appengine’s ndb, in python.

gaedocstore

gaedocstore is a lightweight document database implementation that sits on top of ndb in google appengine.

Introduction

If you are using appengine for your platform, but you need to store arbitrary (data defined) entities, rather than pre-defined schema based entities, then gaedocstore can help.

gaedocstore takes arbitrary JSON object structures, and stores them to a single ndb datastore object called GDSDocument.

In ndb, JSON can simply be stored in a JSON property. Unfortunately that is a blob, and so unindexed. This library stores the bulk of the document in first class expando properties, which are indexed, and only resorts to JSON blobs where it can’t be helped (and where you are unlikely to want to search anyway).

gaedocstore also provides a method for denormalised linking of objects; that is, inserting one document into another based on a reference key, and keeping the inserted, denormalised copy up to date as the source document changes. Amongst other uses, this allows you to provide performant REST apis in which objects are decorated with related information, without the penalty of secondary lookups.

Simple Put

When JSON is stored to the document store, it is converted to a GDSDocument object (an Expando model subclass) as follows:

Say we are storing an object called Input.

Input must be a dictionary.

Input must include a key at minimum. If no key is provided, the put is rejected.

If the key already exists for a GDSDocument, then that object is updated using the new JSON.

With an update, you can indicate “Replace” or “Update” (default is Replace). Replace entirely replaces the existing entity. “Update” merges the entity with the existing stored entity, preferentially including information from the new JSON.

If the key doesn’t already exist, then a new GDSDocument is created for that key.

The top level dict is mapped to the GDSDocument (which is an expando).

The GDSDocument property structure is built recursively to match the JSON object structure.

Simple values become simple property values

Arrays of simple values become a repeated GenericProperty. ie: you can search on the contents.

Arrays which include dicts or arrays become JSON in a GDSJson object, which just holds “json”, a JsonProperty (nothing inside is indexed, or searchable)

Dictionaries become another GDSDocument

So nested dictionary fields are fully indexed and searchable, including where their values are lists of simple types, but anything inside a complex array is not.
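The mapping rules above can be sketched as a small classifier over plain Python values. This is not gaedocstore's code; `storage_kind` and its return labels are hypothetical, and it only names which rule would apply to each value.

```python
SIMPLE_TYPES = (str, int, float, bool, type(None))


def storage_kind(value):
    """Name the rule from the list above that applies to a JSON value."""
    if isinstance(value, dict):
        return "document"      # becomes another GDSDocument
    if isinstance(value, list):
        if all(isinstance(item, SIMPLE_TYPES) for item in value):
            return "repeated"  # repeated GenericProperty, searchable
        return "json"          # GDSJson blob, nothing inside is indexed
    return "simple"            # simple property value


person = {
    "name": "Fred",
    "tags": ["a", "b"],
    "address": {"city": "Stuffville"},
    "history": [{"year": 2012}, "note"],
}
for name in sorted(person):
    print("%s -> %s" % (name, storage_kind(person[name])))
```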

This will create a new person. If a GDSDocument with key “897654” already existed then this will overwrite it. If you’d like to instead merge over the top of an existing GDSDocument, you can use aReplace = False, eg:

lperson = GDSDocument.ConstructFromDict(lperson, aReplace = False)
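To make the difference between replacing and merging concrete, here is a sketch of merge semantics on plain dicts. It assumes nested dictionaries are merged recursively, which the description above doesn't spell out; `merge_documents` is a hypothetical helper, not part of gaedocstore.

```python
import copy


def merge_documents(stored, incoming):
    """Return stored overlaid with incoming; incoming wins on conflicts."""
    merged = copy.deepcopy(stored)
    for key, value in incoming.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Assumption: nested dicts are merged rather than replaced.
            merged[key] = merge_documents(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


stored = {"key": "897654", "name": "Fred", "address": {"city": "A", "zip": "1"}}
incoming = {"key": "897654", "name": "Freddy", "address": {"zip": "2"}}
print(merge_documents(stored, incoming))
```

With a replace, the stored entity would simply end up equal to the incoming dict instead.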

Simple Get

All GDSDocument objects have a top level key. The normal ndb get (eg: key.get()) is used to get objects by their key.

Querying

Normal ndb querying can be used on the GDSDocument entities. It is recommended that different types of data (eg Person, Address) are denoted using a top level attribute “type”. This is only a recommended convention however, and is in no way required.

You can query on properties in the GDSDocument, ie: properties from the original JSON.

Querying based on properties in nested dictionaries is fully supported.

Denormalized Object Linking also supports pybOTL transform templates. gaedocstore can take a list of “name”, “transform” pairs. When a key appears like

{
...
"something": { key: XXX },
...
}

then gaedocstore loads the key referenced. If found, it looks in its list of transform names. If it finds one, it applies that transform to the loaded object, and puts the output into the stored GDSDocument. If no transform was found, then the entire object is put into the stored GDSDocument as described above.

eg:

Say we have the transform “address” as follows:

ltransform = {
"fulladdr": "{{.addr1}}, {{.city}} {{.zipcode}}"
}

You can store this transform against the name “address” for gaedocstore to find as follows:

GDSDocument.StorebOTLTransform("address", ltransform)

Then when Person above is stored, it’ll have its address placed inline as follows:

And if the object is recreated in the future, then that linked data will be reinstated as expected.

Similarly, if an object is saved with a link, but the linked object can’t be found, “link_missing”: True will be included as above.
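The link handling described above can be sketched in plain Python. Everything here is illustrative: `resolve_link` is a hypothetical helper, the transform is an ordinary Python function standing in for a bOTL template, looking transforms up by property name is an assumption, and so is keeping the key alongside the transformed output.

```python
def resolve_link(prop_name, link, load, transforms):
    """Return the denormalised value to store for a {"key": ...} link."""
    key = link["key"]
    source = load(key)
    if source is None:
        # Linked object not found: keep the key and flag it.
        return {"key": key, "link_missing": True}
    transform = transforms.get(prop_name)
    if transform is None:
        # No transform registered: inline the whole object.
        return dict(source)
    inlined = dict(transform(source))
    inlined["key"] = key
    return inlined


# Made-up data for the example.
addresses = {
    "1234567": {"key": "1234567", "addr1": "1 Thing St",
                "city": "Stuffville", "zipcode": "54321"},
}
transforms = {
    "address": lambda a: {
        "fulladdr": "%s, %s %s" % (a["addr1"], a["city"], a["zipcode"])},
}

print(resolve_link("address", {"key": "1234567"}, addresses.get, transforms))
print(resolve_link("address", {"key": "nope"}, addresses.get, transforms))
```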

updating denormalized linked data back to parents

The current version does not support this, but in a future version we may support the ability to change the denormalized information, and have it flow back to the original object. eg: you could change addr1 in address inside person, and it would fix the source address. Note this won’t work when transforms are being used (you would need inverse transforms).

storing deltas

I’ve had a feature request from a friend, to have a mode that stores a version history of all changes to objects. I think it’s a great idea. I’d like a strongly parsimonious feel for the library as a whole: it should just feel like “ndb with benefits”.

I’m sick in bed today, floored with something viral I reckon. Glands are up, headache, the room’s spinning. Sucks man. But as often happens when I’m headachy, my brain is exploding with a new idea. This one is for a combination of the concepts of “social network” and “blog”, into a lightweight public-only improvement on both.

“Network X” is a distributed social network using the open internet to connect its members. You set up your “Stream” like you would set up a blog – you can self host it, use someone else’s hosting service, whatever. You can post with the ease of a social network (not heavyweight like a blog). You have a stream seeing stuff from all over the net, like a social network / feed reader combo. You can follow people. The public view of your “stream” is more like a blog, just showing your own stuff.

Your identity is your stream/blog’s url. It uniquely identifies you. Your identity is confirmed, security is dealt with, just by calling back to this url and asking it to confirm your actions. Clean, straightforward.

It’s entirely public; there’s no private stuff (just like a blog).

You have a unique url which shows two main views, the Stream, and the Blog.

Stream.

See this when you are logged in.

Your posts and everyone’s posts that you follow turn up in the stream in chronological order. You can follow other users of Network X. You can follow Facebook users. You can follow Google+ users, twitter users, etc. You can follow RSS feeds. Might also follow specific blogging software types (blogger? wordpress? etc)

For Network X follows, most of the social network follows, and blog follows, you can comment on other people’s posts as you would in those networks, and the comment goes back to the right source (more on this below).

There is some basic filtering, so you can include/exclude sources dynamically. A really basic one is to screen out everyone’s posts except your own (which then looks like your “blog”).

At the top of the stream is a “Share what’s new” box, which works like G+.

This might have an “advanced” button which kicks it into a more full on WYSIWYG/html editor like blogs have, but still inline.

There’s “Publish” and “Cancel” but there’s also “Save Draft”. The post might also autosave as draft. Drafts are exactly like regular posts, except only you can see them. Once you hit publish, the post is visible publicly.

There is an Edit button on your own posts, which puts you into inline-editing, like above, on existing posts.

Blog.

For when you are not logged in, or when someone else is viewing your url.

Just shows your public posts (with comments). This is really for other people to come along and read your “blog”.

People can comment here like they would with a blog. (Auth with your network X stream, or with fb, google, twitter, etc etc)

Each of your public posts has its own public page, so easily shared on other networks. Stuffed with +1, like, etc etc sharing options for other networks. Also a reshare option for Network X of course.

You can follow anything with an RSS feed. This is read-only following; it’ll show up in your stream but you won’t be able to comment or anything like that. You can click through to whatever the origin is.

You can follow other Network X users.

Your stream/blog url is your identity.

You paste their URL into a follow thing on your stream page. Done.

If you hit follow on their stream, it then asks you to paste in your stream url. Then that triggers something on your own stream page to confirm the follow?

Is there something more clever we can do with the browser already knowing your identity (because you’re logged in on your own stream page) so when you press follow on another person’s page, there’s no url pasting (just a confirm on your own page)? Smarter javascripty html5y people than me will know.

You can comment on anyone’s post that you follow. That posts the comment back to their stream. Your identity travels with the comment, in the form of your stream url. Their stream calls back to your stream to verify you actually sent the comment. If your stream says “yes”, the comment is posted.
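Here is a tiny sketch of that callback check, with the network call replaced by a plain function so it runs anywhere. `accept_comment`, the comment fields and the example URLs are all made up for illustration; no wire format has been designed yet.

```python
def accept_comment(post, comment, confirm_with_stream):
    """Attach comment to post only if the author's own stream confirms it.

    confirm_with_stream(stream_url, comment_id) stands in for an HTTP
    call back to the commenter's stream; it returns True or False.
    """
    if not confirm_with_stream(comment["author_url"], comment["id"]):
        return False
    post.setdefault("comments", []).append(comment)
    return True


# Stand-in for "ask Alice's stream whether it really sent comment c1".
sent = set([("https://alice.example/", "c1")])


def confirm(url, comment_id):
    return (url, comment_id) in sent


post = {"title": "Hello"}
print(accept_comment(post, {"id": "c1", "author_url": "https://alice.example/",
                            "text": "Nice!"}, confirm))
print(accept_comment(post, {"id": "c9", "author_url": "https://alice.example/",
                            "text": "Spoofed"}, confirm))
```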

You can manage who is allowed to comment on your posts. Anyone, no auth? Any Network X user with some identity info? Maybe people can also comment using facebook auth, G+ auth, etc? Probably allowing a moderation feature is a bit heavyweight, but you are free to delete comments from your own posts as you like.

There should be a +1/like mechanism, works same as comments for auth etc.

There should be a reshare mechanism, simply creates a new post with a link back to the other person’s post (wherever it originated).

You can block any identity from commenting. (just maintains an internal blacklist)

Some spam detection with auto blocking would be nice to have.

Integration with other social networks

You can see facebook, twitter, g+, etc in your stream.

To do this, you need to have an identity in those networks, and oauth using that, linking your network x identity (the url) to that social network identity.

Your normal stream/wall from fb/twitter/g+ is then integrated into your network X stream.

You can comment back to those networks (except G+ obviously, which is read only) and it posts back using your identity in that network.

You can reshare stuff out of any of these networks into your Network X blog/stream

Maybe you can reshare stuff from Network X or other social networks back into other social networks as well? That’d be fun!

Everywhere you can, you play nice with existing open internet protocols. eg: This url can be an OpenID. Use pubsubhubbub for realtime notifications. Have an rss feed. etc.

That’s all I’ve got for now. To build this would take some skilled front end javascripty goodness, but in principle the framework is actually pretty simple. Really, if you’re not trying to do the whole “social graph” thing that the facebooks of the world are doing, then this doesn’t need to be difficult!

I’ve started a new professional blog, AppEngine Development. It’s going to contain high rpm propeller hat stuff for developing systems on AppEngine. The first interesting post is The Worker, presenting a job processing class.

I’ve created the blog in the new look Blogger. It’s shiny! Lightweight and easy to use, too. However, it’s a bit light on features, a bit buggy here and there, and embeds terribly into Google+ . That last point is quite bizarre; aren’t these products meant to be part of the same ecosystem? Whatever 😉

You’ll recall in my initial post that I detected a, well, somewhat suboptimal algorithm, where I was touching the AppEngine datastore on the order of 10^8 times per day? Liz Fong made the comment that “Schlemiel the Painter algorithms are bad”. What? Who’s Schlemiel the painter?

Shlemiel gets a job as a street painter, painting the dotted lines down the middle of the road. On the first day he takes a can of paint out to the road and finishes 300 yards of the road. “That’s pretty good!” says his boss, “you’re a fast worker!” and pays him a kopeck.

The next day Shlemiel only gets 150 yards done. “Well, that’s not nearly as good as yesterday, but you’re still a fast worker. 150 yards is respectable,” and pays him a kopeck.

The next day Shlemiel paints 30 yards of the road. “Only 30!” shouts his boss. “That’s unacceptable! On the first day you did ten times that much work! What’s going on?”

“I can’t help it,” says Shlemiel. “Every day I get farther and farther away from the paint can!”

Is it Schlemiel or Shlemiel? Whichever it is, we need to fire them both.
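For anyone who prefers numbers to paint, here is a small sketch of the same effect when paging through records with a growing offset. It is a toy model, assuming the store has to walk past every skipped record before returning a page; the function names are made up and the datastore's exact accounting may differ.

```python
def reads_with_offsets(total, page_size):
    """Reads used when each page is fetched with a growing offset.

    Assumes each fetch walks past `offset` records before returning
    its page, so a page costs offset + page length.
    """
    reads = 0
    for offset in range(0, total, page_size):
        page = min(page_size, total - offset)
        reads += offset + page
    return reads


def reads_single_pass(total):
    """Reads used when the records are iterated once, start to finish."""
    return total


print(reads_with_offsets(1000, 100))
print(reads_single_pass(1000))
```

Under this model the offset version keeps walking back to the paint can: the work grows with the square of the record count, while the single pass grows linearly.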

Now, you’ll recall that this was all academic. AppEngine currently has no way to detect excessive datastore reads, apart from the billing info. So, I made changes to the code to give Schlemiel the flick, then we waited.

Estimated Charges Under New Pricing

The charges below are estimates of what you would be paying once App Engine’s new pricing model goes live. The amounts shown below are for your information only, they are not being charged and therefore do not affect your balance.

* Note this total does not take into account the minimum per-application charge in the new pricing model.

—

Let me just take a moment to say

w00t!!!!!

Oh yeah. It’s w00t, because this, although being a higher number, is one I can afford. I’m out of danger territory.

Anyway, let’s look at this in a little detail. Here’s what Schlemiel was projected to cost me:

Datastore Reads ($0.70/Million Ops): Used 59.06, Free 0.05, Billable 59.01, Charge $41.31

And here’s the post-Schlemiel picture:

Datastore Reads ($0.70/Million Ops): Used 2.30, Free 0.05, Billable 2.25, Charge $1.58

That, people, is a win.
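The two sets of figures above hang together if you read them as used, free quota, billable and charge, with the charge rounded up to the next cent. That reading, and the rounding rule in particular, is my inference from the numbers rather than anything documented here, and `daily_charge` is just a hypothetical helper to check the arithmetic.

```python
from decimal import Decimal, ROUND_UP


def daily_charge(used, free, rate):
    """Charge for one resource: (used - free) * rate, rounded up to a cent."""
    billable = max(Decimal(used) - Decimal(free), Decimal("0"))
    charge = billable * Decimal(rate)
    return charge.quantize(Decimal("0.01"), rounding=ROUND_UP)


# Figures are in millions of datastore reads, at $0.70 per million.
print(daily_charge("59.06", "0.05", "0.70"))   # with Schlemiel
print(daily_charge("2.30", "0.05", "0.70"))    # after the fix
```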

There’s still a decent cost there, $1.58/day is considerable money for a self funded pet project. So where’s that coming from?

Recall I projected there would still be a good chunk of reads being performed by the fix:

1400 * 720 = 1,008,000 datastore reads per day

That’s in the ballpark, and probably accounts for around half of this. I can totally remove this processing with a careful change to my initialization code for the offending objects, and I haven’t done so already purely in the interests of science. So maybe it’s now time to do that.

It does look like there is still a fair chunk of reads happening elsewhere, possibly another bit of algorithm iterating a thousand or so records on every 2 minute cron job run. Odds are that’s also unnecessary. I’ll worry about that later though; I’ll remove the previously mentioned processing first, see how that works out, then take it from there.

But what about Instances, I hear you cry?

Sweet, Sweet Instances

Previously on Point7: <law and order-esque Da Dum!>

If I can get the average instances to 3, I’ll be paying 4 cents X (3-1) X 24 = US$1.92/day . That’s still a chunk more than I’m currently paying, but it’s doable (as long as everything else stays lowish). Cautious optimism!

But what did the data say?

Frontend Instance Hours ($0.04/Hour): Used 41.31, Free 24.00, Billable 17.31, Charge $0.70

Wow, that’s much better! Compare that to my old CPU bill:

CPU Time ($0.10/CPU hour): Used 12.83, Free 6.50, Billable 6.33, Charge $0.64

What did I do? I moved a couple of sliders, and spread my tasks out with a couple of lines of code. That’s pretty tough stuff. No wonder people are bitching & moaning!

I’ve got a few days of billing data now, and it’s all similar, so I’d say it’s a solid result. I actually can’t explain why this is so cheap, it doesn’t appear to match the instance data. Here’s the instance graph:

AppEngine Instances for Syyncc, 7 Days, 10 September 2011

Whatever it is, I’m happy with it.

So I’ve Got That Going For Me Which Is Nice

Not only are the instances right down to super cheap territory, I’m pretty sure I can get the datastore reads down into free territory with just a little more work. It’s all coming up Milhouse.

Just as a fun aside, this post took a little longer than I thought, due to the billing taking a while to come through. Why’d that happen? Because my credit card was maxed out (yeah, that’ll happen!), the weekly payment for AppEngine couldn’t be processed, and so my billing info was frozen until I fixed it. It turns out they give you no more billing info, and don’t let you change your billing settings, until you pay for the thing. Oh the humanity! Well, at least your app doesn’t get turned off, that’s great. Could be a gotcha if you have your billing settings cranked up though!

I’d love an ability to throw a bunch of money in ahead of time, ie: be able to have a positive account that the weekly billing draws from (my credit card is notoriously unreliable). I guess I could implement that by getting a debit card purely for the app. But then I have to think about money and stuff, and that’s boring and awful 😉

What’s it all mean for AppEngine developers?

Straightforwardly, it means this: optimisation isn’t difficult, and you should quit whining and do it. You might learn something.

I know there are some people who are optimised and can’t fix this. Their apps are predicated on using more instances than is financially viable. That’s hard luck.

On the other hand, the vast majority of apps will just be running fat & happy, and can be fixed with a bit of attention. Just do it, it’s not that tough. In fact, if you want a hand, yell out, I’m happy to look at stuff (as long as you’re ok for me to talk about it publicly, anonymity can be preserved to protect the guilty).

I’m a huge fan of not prematurely optimising, absolutely. But premature won’t last forever. Here’s my new rule of thumb for AppEngine development:

You get to be willfully stupid or cheap but not both

Bullheaded idiocy is reserved for the rich. The rest of us need to use our brains just a little bit. And then it can be super cheap. So we little guys get a platform which can have longevity, and we get our way paid by wealthy and/or dumb people. I’m good with that.

Where to next?

My app has the exact same scenario as yours – I initiate ~250 URL Fetches every 15 minutes. If I allow multiple front-end instances to get created, all these fetches will occur in parallel, but I’d be paying a lot of money for all these mostly idle instances. My optimization centers around using one single backend that processes these URL fetches one at a time (totally acceptable for my app), by leasing them one at a time from a pull queue. The pull queue is populated by random events in front-end instances, and the back-end is triggered by a cron every 15 minutes to lease these URL Fetch tasks from the pull-queue. This way all my work gets done by a single backend that runs 24×7. I could easily change my cron to run once every hour instead of 15 minutes, and then my backend is running for free (just under 9 instance hours a day).

Another level of optimization is kicking off multiple URL Fetches using create_rpc() calls, so that my backend can do other things while the URL fetch completes (which, like in your case, can take several seconds to complete or timeout). With all this, I hope to stay under the free instance hour quota.

Some people, unlike me, can just say something awesome in a couple of paragraphs without going all tl;dr on it.

Firstly, Rishi has done the backend style optimisation which seemed like the way to go. And what’s that you’re saying, it could run for free? Now you’ve got my attention. That’s worth some thought.

Secondly, what’s this create_rpc() call of which Rishi speaks? Oh, it must be this:

“Asynchronous Requests: A Python app can make an asynchronous request to the URL Fetch service to fetch a URL in the background, while the application code does other things.”

The doco says that you create an rpc object using create_rpc(), then execute it with make_fetch_call(), then finally wait for it to complete with wait() or get_result(). You can have it call a callback function on completion, too (although it requires the call to wait() or get_result() to invoke it).
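The create / fire / collect sequence could look roughly like the sketch below. It uses the create_rpc(), make_fetch_call() and get_result() calls named in the doco, but takes the urlfetch module as a parameter so it can run outside App Engine; `fetch_all`, `FakeRpc` and `FakeUrlFetch` are hypothetical names, and the fake only mimics the shape of the real service.

```python
def fetch_all(urls, urlfetch, deadline=10):
    """Start every fetch first, then collect the results.

    `urlfetch` is expected to be google.appengine.api.urlfetch, or
    anything offering the same create_rpc / make_fetch_call interface.
    """
    pending = []
    for url in urls:
        rpc = urlfetch.create_rpc(deadline=deadline)
        urlfetch.make_fetch_call(rpc, url)  # returns straight away
        pending.append((url, rpc))

    # ... other work could happen here while the fetches are in flight ...

    # get_result() waits for a fetch if it hasn't finished yet.
    return dict((url, rpc.get_result()) for url, rpc in pending)


class FakeRpc(object):
    """Stand-in rpc object so the sketch runs without App Engine."""
    url = None

    def get_result(self):
        return "body of %s" % self.url


class FakeUrlFetch(object):
    """Stand-in for the urlfetch module; does no real networking."""

    def create_rpc(self, deadline=None):
        return FakeRpc()

    def make_fetch_call(self, rpc, url):
        rpc.url = url
        return rpc


print(fetch_all(["http://a.example/", "http://b.example/"], FakeUrlFetch()))
```

On App Engine you would pass the real urlfetch module in place of the fake.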

My code is full of long fetches to external urls. It might be possible to do other things while they execute, with a code restructure. I smell some coding and another blog post in my future.

As I’ve detailed in the previous posts, I’m facing a big Google AppEngine bill for Syyncc, based on the number of running instances. Lots of small developers like me have been struggling with this (and bellowing) but I’m betting that this is in my control, and due to suboptimal coding on my part.

In my last attempt at tuning, I managed to drop the average running instances from over 10 to around 5 or 6. Can I do better?

I already identified a likely culprit for the excess instances, which is that I’m kicking off 50 tasks, all at once, every 2 minutes. That kind of spiky load must force the scheduler to keep more instances running than really necessary. It’s more than necessary because I want the tasks done within the 2 minute period, but I don’t care about latency within that.

Now, these tasks are calls to /ssprocessmonitor, which the original post showed as averaging a bit over a second run time. The longest run time I’ve been able to see by checking my logs is about 4 seconds.

If I were to do these processes in a bit of sequential processing (in a backend?) then they’d finish before the two minutes were up, more or less. But I don’t want to go so far as to rearchitect around that. Can I do something simpler?

The problem here is that the tasks are being added to the taskqueue all at once. Wouldn’t it be nice if there were a simple way to stagger them, spread them out through the 2 minutes? Let’s look at the doco:

countdown: Minimum time to wait before executing this task, in seconds. Defaults to zero.

Using that, I could spread the tasks out, 2 seconds apart. That means the last one would at best begin 100 seconds after the tasks were scheduled, which still gives it (and any straggling tasks) 20 seconds to complete before this whole process starts again.
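A sketch of the staggering idea, under a few assumptions: `schedule_staggered` is a made-up helper, the "key" parameter name is hypothetical, and `add_task` is expected to accept the same url, params and countdown arguments as taskqueue.add. Passing the function in keeps the sketch runnable outside App Engine.

```python
def schedule_staggered(items, add_task, spacing=2):
    """Queue one task per item, each `spacing` seconds after the last.

    `add_task` is expected to behave like taskqueue.add from
    google.appengine.api.taskqueue.
    """
    for position, item in enumerate(items, 1):
        add_task(url="/ssprocessmonitor",
                 params={"key": item},       # hypothetical parameter name
                 countdown=position * spacing)


# Record the calls instead of touching a real queue.
calls = []


def record(**kwargs):
    calls.append(kwargs)


schedule_staggered(["monitor%d" % n for n in range(50)], record)
print(calls[0]["countdown"], calls[-1]["countdown"])
```

With 50 items and 2 second spacing, the last task becomes eligible 100 seconds in, leaving 20 seconds of slack before the next cron run.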

There are a lot of better ways to be doing what I’m doing here, but this might at least get me out of a jam.

So how did it go? Here’s the instances graph, see if you can spot where I uploaded the new code:

Instances Graph for Syyncc, 6 September 2011

Is this success? It’s better, a bit up and down. Maybe I need more time to see how it behaves.

Here’s the view over four days, so you can see it back when it was really bad, the previous optimisation, then this one.

4 day Instances Graph for Syyncc, 6 September 2011

I’m feeling better about that.

What I’m going to be billed under the new system is:

4 cents / instance hour, less 1 free hour per hour.

So that’s the area under the graph, minus a 1 unit high horizontal strip, times 4 cents.

If I can get the average instances to 3, I’ll be paying 4 cents X (3-1) X 24 = US$1.92/day . That’s still a chunk more than I’m currently paying, but it’s doable (as long as everything else stays lowish). Cautious optimism!
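The "area under the graph, minus a strip" sum can be written down as a small function. This is only my model of the pricing as described above, working in whole cents, with one free instance applied in every hour; `daily_instance_cost_cents` is a hypothetical name.

```python
def daily_instance_cost_cents(hourly_instances, cents_per_hour=4,
                              free_per_hour=1):
    """Cost in cents, given the average instance count for each hour."""
    billable_hours = sum(max(count - free_per_hour, 0)
                         for count in hourly_instances)
    return billable_hours * cents_per_hour


print(daily_instance_cost_cents([3] * 24))    # averaging 3 instances all day
print(daily_instance_cost_cents([10] * 24))   # the bad old days
```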

—

Now that’s all great, but there’s something else that I might have been able to use, and without a code change.

Apparently, push queues (ie: what I’m using) can be configured:

You can define any number of individual queues by providing a queue name. You can control the rate at which tasks are processed in each queue by defining other directives, such as rate, bucket_size, and max_concurrent_requests. You can read more about these directives in the Queue Definitions section.

The task queue uses token buckets to control the rate of task execution. Each named queue has a token bucket that holds a certain number of tokens, defined by the bucket_size directive. Each time your application executes a task, it uses a token. Your app continues processing tasks in the queue until the queue’s bucket runs out of tokens. App Engine refills the bucket with new tokens continuously based on the rate that you specified for the queue.
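To get a feel for how a token bucket would smooth out my burst of 50 tasks, here is a toy simulation. It is a generic token bucket, not App Engine's implementation; `TokenBucket`, `simulate` and the rate and bucket size values are all made up for the example.

```python
class TokenBucket(object):
    """Generic token bucket: refills at `rate` tokens per second."""

    def __init__(self, rate, bucket_size):
        self.rate = float(rate)
        self.bucket_size = float(bucket_size)
        self.tokens = float(bucket_size)  # start with a full bucket
        self.last = 0.0

    def try_run(self, now):
        """Return True if a task may start at time `now` (in seconds)."""
        elapsed = now - self.last
        self.tokens = min(self.bucket_size,
                          self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1  # each task that runs uses one token
            return True
        return False


def simulate(pending, rate, bucket_size, seconds):
    """Return the start time of each task, checking once per second."""
    bucket = TokenBucket(rate, bucket_size)
    started = []
    for now in range(seconds):
        while pending > 0 and bucket.try_run(now):
            started.append(now)
            pending -= 1
    return started


starts = simulate(pending=50, rate=0.5, bucket_size=5, seconds=120)
print(starts[:6], starts[-1], len(starts))
```

With these made-up settings the first five tasks go immediately (the full bucket), then one task starts every two seconds, so a burst of 50 gets spread across the two minute window without any code change in the scheduler.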

Instances

I’ve set the Max Idle Instances to 1, and Min Pending Latency to 15 seconds. ie: hang it all, don’t start more instances unless the sky is falling in.

Syyncc’s site still seems responsive enough, and the app itself is working as before, I can’t detect any functional difference. But what’s important is that the average instance count has dropped significantly:

Instances Graph for Syyncc, 4 Sep 2011

It’s dropped to 4 or 5 instances average, rather than 10 to 15 that it’s normally at. Not bad for shoving a couple of sliders around! And that’s without what will be necessary code changes, to stop the behaviour where the 50 tasks are scheduled concurrently once every 2 mins. That leads to a spike in activity, then nothing for most of the time, and is very likely the cause of excess idle instances. That’s all detailed in the previous post, linked at the top.

Given that the impact from the performance tuning is obvious, I’ll go ahead with refactoring the bursty scheduling code in the next few days, and post the results of that.

A bit more detail:

Instances Detail for Syyncc, 4 September 2011

Not quite good enough (I want to get down to under 2 average), but much better.

Datastore Reads

You’ll recall from the previous article I had the horrible code full of fetches with offsets. I’ve replaced it with this:

So that’s a simple iteration through the query that should behave much better; the debug line says

D2011-09-04 23:19:24.189 SetAllNextChecks processed 1403 monitors

and I’m fairly sure that this is actually accurate (ie: it’s not actually touching over 100,000 objects!)

There’s no way to tell if it’s helping at the moment, Datastore Reads are only surfaced in the billing, and the billing lags a few days behind. So, I’ll report back midweek or so with results of this change.

It’s a start

That’s it for now. Some preliminary success, but I’ve got a way to go. Stay tuned for the next update in a few days.