2014-01-30

I'd heard of the previous rounds of Stripe's Capture The Flag coding competition but hadn't actually looked into them in detail, so when a friend of mine decided to give this round a go I played along. I had no vision of winning, so to keep the effort time-boxed I went for the most minimalist solution I could envision for each problem. I only made it to level 3 and then stalled out, mostly because I'm unfamiliar with Scala and thus implementing my envisioned solution would have required a significant amount of time. I also wasted a significant chunk of time on level 1 (probably more than the time spent coding all the other levels combined) trying to pull off what I assumed was an invited hack before trying a straightforward solution. Overall I found it a lot of fun and would encourage anyone interested in software to give it a go next time.

In this level we're given a short ruby script that "highlights" (by adding angle brackets) all of the words in an input text which aren't found in a dictionary. The script works but is slow, as it simply reads the dictionary into an array and does a naive search of that array for each word in the input. My solution was trivial: read the dictionary into a hash instead, and check for the key. It could have been just two changed lines if I'd bothered to remember a terser array-to-hash technique.
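A minimal sketch of the idea (not Stripe's actual script; the word list and highlight helper here are just illustrative):

```ruby
# Read the dictionary into a Hash once, so each lookup is O(1) instead of a
# linear Array scan. The word list below stands in for the real dictionary file.
words = %w[the quick brown fox]
dictionary = Hash[words.map { |w| [w, true] }]

# Wrap any word not found in the dictionary in angle brackets.
def highlight(line, dictionary)
  line.gsub(/[^ \n]+/) { |w| dictionary.key?(w.downcase) ? w : "<#{w}>" }
end

puts highlight("the quick brwn fox", dictionary)  # => the quick <brwn> fox
```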

On this level I wasted a bunch of time trying a hack rather than coding
a solution. The task is to submit a git commit with a hash
"lexicographically less than the value contained in the repository's difficulty.txt file".
The fact that difficulty.txt was a local file rather than some constant or stored on a server
seemed to me to be asking for a hack like submitting a difficulty.txt file with a value like "ffffff" in it.
Could not make that work, but would be interested to hear from anyone who did.
After giving up on the hack it took about 5 minutes on Google and Stack Overflow to
find this utility:

beautify_git_hash: Beautify the Git commit hash! This is a little useless toy inspired by BitCoin's "proof of work" concept. It enables you to modify your Git commit to enforce a certain prefix on the Git commit hash.

I cloned this into my project and tested it by hand to see that it worked. Since this beautifies un-pushed commits via an amend, I changed prepare_index() in the miner to do a commit instead of an add, then simply replaced the default implementation's solve routine with a call to beautify_git_hash.
High level loop didn't change at all, so here's the guts of my code:
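A rough reconstruction of what that looked like (this is my sketch, not the exact code; in particular the beautify_git_hash invocation is an assumption, so check that tool's readme for the real CLI):

```shell
# prepare_index now commits (instead of just adding), so that
# beautify_git_hash can amend the un-pushed commit.
prepare_index() {
    # bump our entry in the ledger, or create it on the first attempt
    perl -i -pe 's/^('"$public_username"': )(\d+)$/$1 . ($2+1)/e' LEDGER.txt
    grep -q "^$public_username:" LEDGER.txt || echo "$public_username: 1" >> LEDGER.txt
    git add LEDGER.txt
    git commit --quiet -m "Give me a Gitcoin"
}

# solve just delegates to beautify_git_hash, then pushes the winning commit.
solve() {
    ./beautify_git_hash/beautify_git_hash "$(cat difficulty.txt)" \
        && git push origin master
}
```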

I fired up the miner and in a couple minutes I'd mined a gitcoin, which gave me a result of 50/50. Didn't
have to improve performance of the beautify_git_hash script at all so credit for this win
goes to the original author, Volker Grabsch.

Hey, finally a node level, how exciting! Well, not really, this turned out to be just a rehash of
level 0. In this level we have to create a proxy that can blacklist attackers based on IP. A full
implementation is provided with one key function missing: currently_blacklisted(ip).
The obvious implementation proved good enough, with a couple quick tests to figure out the right constant to use to detect a "bad" number of requests.
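A sketch of that obvious implementation (my reconstruction, not the shipped code; the threshold of 4 is illustrative, the post's actual constant was found by testing):

```javascript
// Count requests per IP and blacklist any IP that exceeds a threshold.
var MAX_REQUESTS = 4;   // "bad" request count, tuned by quick local tests
var requestCounts = {};

function currently_blacklisted(ip) {
  requestCounts[ip] = (requestCounts[ip] || 0) + 1;
  return requestCounts[ip] > MAX_REQUESTS;
}
```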

This got me a result of 95/85. I actually spent a little more time on this later and was able to get
higher scores during local testing, but never got a positive score on submitting them, so I suspect something was borked in the test environment.

2014-01-22

My work in the past few years has involved producing quite a few web services. Some are public facing but many are middle-tier type things accessed by other services or REST back-ends for client apps. Almost all have leveraged some sort of framework such as rails or django. Each such service must be bound to a port to be accessed, and since binding to the standard http port (80) requires root privileges most frameworks have some higher default port they run on during development. For example rails defaults to 3000, django to 8000, dropwizard uses 8080, etc. This works well enough when throwing together a single service, but what if you need a bunch of these running at once? Each framework lets you configure alternate ports, but what values should you actually use?

The answer to that question is really a function of how the services will be used and maintained. If you're part of a team developing multiple services that need to work together you will have to agree on some sort of standard, or pick arbitrary numbers and simply record them in a central place. In practice that "standard" may amount to little more than "we've been using rails, so our first service is 3000, our next is 3001, etc.". For a variety of reasons, including preserving your sanity, I suggest a somewhat more formal approach--a canonical port numbering scheme based on the name of your service.

A service's name might not be particularly well defined in all cases, but usually there is some simple short name that a team refers to a service by, and that's the one you should start with. The name of the project, app, or source code repository are all reasonable candidates. There are a variety of ways to turn that name into a number (e.g. a hash like CRC-16), but I've come to prefer simply interpreting the first few characters of the service name as a base-32 encoded number (base-32 is like hex but goes from 0 to v instead of just 0 to f). This approach typically ensures unique ports as long as you pick service names that don't start with the same three letters. It also allows you to guess the service from the port. Here's an example--assume you have a service named api. Interpreted as a base-32 number that's 11058. With a couple tweaks to account for the valid port range, restricted ports, and characters that aren't valid base-32 digits you have a complete solution. Here's an example algorithm in javascript:
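Something like the following (a sketch of the idea; the exact tweaks in my original code may differ, but this version reproduces the ports quoted in this post):

```javascript
// Interpret the first 3 characters of the service name as a base-32 number.
// Tweaks: pad short names with '0', substitute '0' for characters that aren't
// valid base-32 digits, and add 1024 to stay clear of restricted ports.
function portForName(name) {
  var digits = (name + "000").slice(0, 3).toLowerCase()
    .replace(/[^0-9a-v]/g, "0");        // non-base-32 chars become 0
  return parseInt(digits, 32) + 1024;   // "vvv" -> 32767, so max is 33791
}
```

For example `portForName("db")` yields 14688, the port used for the db service later in this post.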

Here are a few examples of service names and the ports they yield, and what you get from converting the port number back to a string

While just having a sane and consistent way of picking port numbers may be its own reward, this approach is actually an important component of automated service deployment tools I've been working with for the past year. I hope to explore these further in an upcoming post, but for example consider that well chosen service names can be part of a domain name. You can leverage this into a "config-less" approach to inter-service communication, e.g. if the api service is running on api.cluster-1.example.com it could assume the database service it should talk to lives at db.cluster-1.example.com, and will be found at port 14688. Initially the two services could be deployed on one box pointed to by both domain names. If the load later increases and the db service is moved to another box there's no need to update the config for the api service.

2013-03-13

AWS is organized into several regions, each totally separate from the others

we want to be able to deploy software in any region

in each region we will create an EC2 instance to create other instances and deploy software from

by default instances have an unfriendly public domain name e.g. ec2-54-234-123-12.us-west-2.compute.amazonaws.com

we want that instance to have an easy to remember domain name, e.g. deploy.region.example.com

we want that domain name to be assigned programmatically from the instance itself

Understanding this journey will probably be clearest if we start with a peek at the end. Domain
names can be assigned via Route53, which has an API. We're writing all our stuff for node.js so
here's the Javascript code for the final request we want to make to Route53.
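A rough sketch of that request (my reconstruction in the 2013-era aws-sdk client style used later in this post; the CNAME record type, TTL, and helper name are my assumptions):

```javascript
// Build params for Route53's ChangeResourceRecordSets API: create a CNAME
// from our friendly name to the auto-assigned EC2 public hostname.
function buildChangeRequest(hostedZoneId, name, value) {
  return {
    HostedZoneId: hostedZoneId,
    ChangeBatch: {
      Changes: [{
        Action: "CREATE",
        ResourceRecordSet: {
          Name: name,                          // our desired domain name
          Type: "CNAME",
          TTL: 300,
          ResourceRecords: [{ Value: value }]  // the auto-assigned EC2 name
        }
      }]
    }
  };
}

// With region and credentials set on AWS.config, the call is roughly:
// new AWS.Route53().client.changeResourceRecordSets(
//   buildChangeRequest(zoneId, "deploy.us-west-2.example.com",
//     "ec2-54-234-123-12.us-west-2.compute.amazonaws.com"),
//   callback);
```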

Most of this is boilerplate, except for the region
and credentials config values, and the HostedZoneId,
Name (our desired domain name), and Value (the automatically
assigned domain name) as parameters to Route53.
To make this work, all we have to do is find a way for the instance to know each
of these things.

Item 1: the new domain name

This is going to be composed of 3 pieces: the fixed string "deploy",
the name of the region the instance is running in (will be "us-west-2"
in these examples), and our organization's domain, "example.com".
There are actually a lot of ways for EC2 instances to know their
region (some hackier than others) so I'll focus on the other two fields for
now. My first thought was "hey, can't you assign tags to EC2 instances
when you launch them? I'll just add a tag that says what domain to use."
This was easy to try out, and my tags show up nicely in the EC2 console,
but how is code running on the instance going to access the tags?

Digression 1: EC2 instance meta-data

My naive hope that tags were somehow magically present on the instance as environment
variables or in a config file was quickly dashed. However I knew that instances had access to
some meta-data about themselves via
making HTTP
requests to a magic IP address (169.254.169.254). "Oh cool," I thought, "tags will be in there somewhere."

Digression 1.1: meta-data paths

The data available through this API is
organized hierarchically, that is if you fetch a URL like /latest/dynamic/ it
returns a list of "sub-folders" like instance-identity/, and you can
then ask for /latest/dynamic/instance-identity/ and so on.
The interesting bits are actually buried pretty deep, for example region
is inside a JSON document found at
/latest/dynamic/instance-identity/document (you can find
the availability-zone in there too, but that's also available as a plain
string at /latest/meta-data/placement/availability-zone).
There are command line tools
available that allow access to more commonly used values,
and a variety of existing modules for node as well.
None of them seemed to include tags, and I wasn't sure where those might be,
so I thought "I'll write a module that starts at the root and spiders its way down!".
Exactly how much code-that-sucks was involved in that effort will require another blog post,
but the result was a new module: ec2-instance-data.
Read the readme for details, but the basic idea is you get an object that will have all
of the instance's meta-data as nested objects, so you could then get the region via something like
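The region lives inside the JSON instance-identity document mentioned above; parsing it looks something like this (the sample document fields mirror the real meta-data, but the ec2-instance-data accessor path shown in the comment is my paraphrase of its readme, so double-check it there):

```javascript
// The document at /latest/dynamic/instance-identity/document is JSON with a
// "region" field (availabilityZone is in there too).
function regionFromIdentityDocument(json) {
  return JSON.parse(json).region;
}

// On an instance, via the ec2-instance-data module, roughly:
// var instance = require("ec2-instance-data");
// instance.init(function (err) {
//   var region = instance.latest.dynamic["instance-identity"].document.region;
// });
```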

Ok, where were we? Oh, right, trying to get the value of an instance's tags from
the meta-data. You can probably guess the next part: they're not in there. Well
how do you get them? Googling led me to this Stack Overflow
question
and from the answers I discovered there's an API for that:
DescribeTags.
Amazon recently released an "official" node module aws-sdk, which includes this API, so I thought,
hey cool I can use that. Except, as pointed out in some comments on Stack Overflow, you need credentials
to call that API.

Digression 2: IAM Roles for EC2 Instances

Hey, didn't I read something a while ago about automagically giving EC2 instances
permissions to call AWS APIs? Why yes I did,
it's called IAM Roles for EC2 Instances.
In short an IAM Role is a named group of permissions (a "policy" in AWS-speak), and you can specify one when
you create an instance. So I did the obvious, created a role with permissions to call DescribeTags (and the Route53 APIs), named
it "my-role" and assigned it to my instance. (Well I would have, except you can only assign roles when
you launch an instance, so I actually terminated my instance and started over with a new one.) Ok, goody,
now I can automagically call the DescribeTags API right? Um no, at least not with the AWS SDK for Node.
The AWS SDK for Ruby gets some IAM love however:

If the client constructor does not find credentials in AWS.config, or in the environment, it retrieves temporary
credentials that have the same permissions as those associated with the IAM role. The credentials are retrieved from the Instance Meta Data Service (IMDS).

Which produces a whole ton of output because it gets all the tags for every instance you have in the region,
not just the current instance's tags. To get just that you have to add a filter to the describeTags() call. Which
leads me to a classic moment of "how-many-names-for-the-same-thing-can-you-work-into-a-single-function-call":

Ok, making progress. Back to our domain name, we can now get
the "deploy" part from the "purpose" tag, the domain from the "HostedZone"
tag, and we've already pulled the region from the meta-data. Putting them all together yields
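Assembling those three pieces is just string concatenation (the tag names "purpose" and "HostedZone" come from above; the helper name is mine):

```javascript
// deploy + us-west-2 + example.com -> deploy.us-west-2.example.com
function domainName(tags, region) {
  return tags.purpose + "." + region + "." + tags.HostedZone;
}
```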

Item 2: AWS Credentials

I actually already had to do this in order to call describeTags, but glossed over it a bit above.
The magic is in metadata.iamSecurityCredentials(), which is defined in ec2-instance-data.
The credentials can be found at /latest/meta-data/iam/security-credentials/my-role/.
Except it turns out to be not as simple as one would hope, because the meta-data service
sticks the name of the role into the path, and the instance doesn't know its role. This means
we have to get it in two steps: first get all the roles, then, using the name of the first one
(currently there is only one), get the credentials for that role. Here's that method,
which also clocks in with 3 superfluous renamings.
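A sketch of the shape of that method (my reconstruction: the meta-data really does return a JSON document with AccessKeyId, SecretAccessKey, and Token fields, and renaming them to the SDK's conventions accounts for the three renamings):

```javascript
// Step 1: GET /latest/meta-data/iam/security-credentials/ -> role name(s).
// Step 2: GET /latest/meta-data/iam/security-credentials/<role> -> JSON creds.
// Then rename each field to what the SDK config expects:
function parseCredentials(json) {
  var doc = JSON.parse(json);
  return {
    accessKeyId: doc.AccessKeyId,
    secretAccessKey: doc.SecretAccessKey,
    sessionToken: doc.Token
  };
}
```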

Item 3: HostedZoneId

While we now have the HostedZone (== domain) name, what we need for
our call to Route53 is the id. Fortunately Route53 lets us look that up
with AWS.Route53.client.listHostedZones(...), but for some inscrutable
reason it doesn't accept a filter, so we're left looping over the result.
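The loop is straightforward (a sketch over the ListHostedZones response shape; note that Route53 returns zone names with a trailing dot):

```javascript
// Find the Id of the hosted zone whose Name matches our domain.
function findHostedZoneId(hostedZones, domain) {
  for (var i = 0; i < hostedZones.length; i++) {
    if (hostedZones[i].Name === domain + ".") {
      return hostedZones[i].Id;
    }
  }
  return null;
}
```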

Items 4 and 5: current domain name and region

The current domain name is readily available from the meta-data (/latest/meta-data/public-hostname),
and we've already pulled the region from meta-data, so we're now good to go. Our final call looks like this

2012-07-02

I've recently been working in Python again (after a decade long hiatus), including some rework of middle-tier REST(-ish) services built on webpy. This is a lean-and-clean web application framework that lends itself well to writing simple web services without all the cruft a "full-service" framework like Django brings with it. There was one particular annoyance for me in the service I was working on, which was that URL-to-controller routing was maintained separately from the controller classes themselves. Here's a stripped down example:
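The shape of the problem looked something like this (the class and path names are made up for illustration, but the urls-tuple convention is web.py's real routing mechanism):

```python
# URL-to-controller routing lives in a tuple far away from the controller
# classes it maps to; the parenthesized bits are RESTful parameters.
urls = (
    "/record/search/([0-9]+)", "RecordSearchController",
    "/record/annotate/([0-9]+)", "AnnotateRecordController",
)

class RecordSearchController(object):
    def GET(self, record_code):
        return "record %s" % record_code

class AnnotateRecordController(object):
    def POST(self, record_code):
        return "annotated %s" % record_code

# app = web.application(urls, globals())   # requires the webpy package
```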

This structure meant if you added a new controller class you'd have to go add its URL somewhere else, and it also made it (too) easy to have inconsistent URLs, or ones that had no obvious relationship to the controller (which is confusing for someone like me coming in to do maintenance). It seemed to me there had to be a better way, and in fact web.py has a built-in approach called auto_application(). Unfortunately it has a couple of (for me annoying) limitations: for one, it uses the exact class name as a default path (for these controllers yielding paths like /RecordSearchController, and changing the names of our controllers was a non-starter). It also doesn't seem to support RESTful parameters (the bits in parens in our urls above). There is also a slightly more flexible recipe for using metaclasses to do automatic URL registration. This also didn't quite meet my needs (e.g. it doesn't even try to automatically map controller names to urls) but it did set me in the generally correct direction. Here's an example of how things would look using this recipe:
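A reconstruction of that recipe's shape (details are mine, not the recipe's exact code; the original targeted Python 2's `__metaclass__` attribute, sketched here with Python 3's equivalent `metaclass` keyword so it runs):

```python
# Each controller declares its own url; the metaclass registers it at class
# definition time, so the routing table builds itself.
urls = []

class ActionMetaClass(type):
    def __init__(cls, name, bases, attrs):
        super().__init__(name, bases, attrs)
        if "url" in attrs:                  # controller supplies its path
            urls.extend([attrs["url"], name])

class RecordSearchController(metaclass=ActionMetaClass):
    url = "/record/search/([0-9]+)"

    def GET(self, record_code):
        return record_code
```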

This gets the URL into the controller's code so solves some of my problem, but definitely still didn't feel as clean (or DRY) as I'd like. For one thing I'd now have to add the __metaclass__ = ActionMetaClass boilerplate to every controller (I couldn't add it to the BaseController for complicated reasons). It also bugged me that this required more than one line--I really wanted to be able to write something like:

My first thought was to just add a URL param to ActionMetaClass.__init__. Unfortunately I quickly discovered you can't (directly) pass parameters to the value of __metaclass__ in this way, since its calling signature is assumed to match the default type(name, bases, dict). I also ran into some pain around the fact that BaseController was already using a __metaclass__ to do some cool tricks like adding some standard envelope fields to every response. At this point I set out some formal goals for my implementation:

not require changing any of the existing routes or class names

automatically generate canonical paths from class names which
matched as many of the existing paths as possible

require at most one additional line of code per controller class,
even those that don't use a canonical path

automatically add REST parameters to the URLs

play nicely with other metaclasses in the inheritance chain

I wound up meeting most of the goals, with a couple pragmatic tweaks. Before I break down exactly how the solution I came up with works here's the actual code (stripped of comments/doc strings for brevity, I'll try to get the full code up as a gist at some point soon).
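Since the gist isn't inlined here, this is a rough re-sketch of how such an autoroute can fit together (my reconstruction, not the original code: url_params contents and the example class are illustrative, and Python 3's `metaclass` keyword stands in for Python 2's `__metaclass__`; both accept any callable, which is the key trick):

```python
import functools
import inspect
import re

urls = []                                         # flat list consumed by web.py
url_params = {"record_code": "([0-9A-Za-z]+)"}    # illustrative shared patterns

def default_path(name):
    # VerbNoun -> /noun/verb, e.g. AnnotateRecord -> /record/annotate
    words = re.findall(r"[A-Z][a-z0-9]*", name)
    return "/" + "/".join(w.lower() for w in reversed(words))

def _new(path, name, bases, attrs):
    # like type's signature plus one extra leading parameter: the override path
    cls = type(name, bases, dict(attrs))
    url = attrs.get("url") or path or default_path(name)
    method = attrs.get("GET") or attrs.get("POST")
    if method:                                    # tack REST params onto path
        for arg in inspect.getfullargspec(method).args[1:]:   # skip self
            if arg in url_params:
                url += "/" + url_params[arg]
    cls.url = url                                 # handy for testing/debugging
    urls.extend([url, name])
    return cls

def autoroute(path=None):
    # partially evaluate _new so the result has a metaclass-compatible signature
    return functools.partial(_new, path)

class AnnotateRecord(metaclass=autoroute()):
    def GET(self, record_code):
        return record_code
```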

reverse the order (classes mostly followed a VerbNoun (or AdjectiveNoun) pattern, while the URLs were mostly /noun/verb) (→ ["Record", "Annotate"])

lowercase and join with slashes (→ "/record/annotate")

This algorithm is implemented in the default_path() function.

parameter detection/injection

Although there were dozens of controllers, there were actually fewer than a handful of distinct REST parameters that appeared in the URLs, and fortunately they had very consistent names across the controllers' GET and POST methods. Unfortunately, prior to this refactoring, the patterns were repeated many times (typically two parameters per URL) and multiple parameters shared patterns, so e.g. changing the record_code from numeric ([0-9]+) to alphanumeric required dozens of replace operations, each of which had to be vetted by hand by cross-referencing with the controller's source. These I pulled out into a single short dictionary of parameter names, url_params. I then use a little bit of reflection magic to figure out which params go in which URLs by looking at the argspec of the controller's GET method (or POST if there isn't one). I can then use the names of the args to look up the correct patterns to tack onto the path generated from the class name. This logic can be found in the center of _new. I also broke with my strict one-liner rule for a special url override: this is done by setting a url attribute in the controller rather than passing it to autoroute, to simplify the signature, and it turns out that having the URL around in the class is useful for testing/debugging as well.

metaprogramming magic

So remember the pain I had with not being able to pass a custom path to a metaclass's __init__? For a moment I thought of generating classes dynamically to get around this, but by RTFM I discovered that the value assigned to __metaclass__ doesn't have to be a class, it can be any callable, including a function--importantly it can even be a partially evaluated function! This insight is embodied in _new(), whose signature is like type.__new__'s but with one extra parameter (the non-standard path, if any). That parameter is then partially evaluated in autoroute's return statement, the result being a functools.partial that has the correct signature.

Note that in the end there's nothing super webpy specific about this solution other than the assumption that controllers are classes and they have methods named GET/POST/etc. Most of it should be adaptable to any other framework.

2012-06-19

I often generate a fair number of views while developing reporting SQL, and have off and on looked for a way to easily save their source code to a file. mysqldump doesn't directly support extracting only views, but with a little command line trickery and a query against INFORMATION_SCHEMA you can make it do the right thing:
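Something like the following (a sketch of the trick; adjust the mysqldump flags to taste):

```shell
# List the database's views via INFORMATION_SCHEMA, then hand the names to
# mysqldump on its command line.
mysql --skip-column-names --batch -u username database \
  -e "SELECT table_name FROM information_schema.views
      WHERE table_schema = 'database'" \
| xargs mysqldump -u username database > views.sql
```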

The skip-column-names and batch options produce output with just one view name per line, which is what xargs needs as input.
Be sure to replace both occurrences of username and database with appropriate values, and add -h for remote hosts and -p if the user requires a password. Here's a one-line example for user root with no password on localhost, with a database named "foo":
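With those substitutions the one-liner comes out roughly as:

```shell
# root, no password, localhost, database "foo"
mysql --skip-column-names --batch -u root foo -e "SELECT table_name FROM information_schema.views WHERE table_schema = 'foo'" | xargs mysqldump -u root foo > views.sql
```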