This site started as and continues to be a place where I can explore the craft of software development. Below you'll find articles on such diverse topics as Perl, C++, Ruby, Python, Git, and more. I've published quite a bit of open source software on my Projects page, including much of what you'll find in these articles.

I'm a nerd. I write software for a living. I spend a lot of my day either sitting in a chair in front of a computer or lying on my couch using my laptop. I'm not what you'd call... athletic. I did start lifting weights about six months ago, but that's really just led to gaining more weight, not losing it. A few years back I started counting calories and lost some weight, then stopped counting calories and gained it all back. Time to change that.

After using Calorific for a month two things have become very clear. First, I need to eat less. Holy crap do I need to eat less. I went on to SparkPeople just to get an idea of what I should be eating, and it told me between 2300 and 2680 kcal. I haven't implemented averaging yet, but a little grep/awk magic tells me I'm averaging 2793 kcal per day. This is too much. So. One thing to work on.

Let's pretend, just for a second, that you want to make some money on the stock market. Sounds easy, right? Buy low, sell high, yadda yadda blah blah blah. Except, how do you know when to buy and when to sell? Not so easy. Being a nerd, you want to teach your computer how to do this for you. But where to start? I discovered a few months ago that there are services out there that will sell you a data feed that literally blasts every single anonymous transaction that happens on any market in the US in real time. They'll also sell you access to a historical feed that provides the same tick-level information going back for several years.

Once in a while at [Rentrak][] we have programming competitions, where anyone who wants to, including sysadmins and DBAs, can submit an entry for whatever the problem is. The previous contest involved writing a poker bot which had to play two-card hold'em, while others have involved problems similar in spirit to the Netflix Prize. This time we chose to build virtual robots that shoot each other with virtual cannons and go virtual boom! We'll be using [RealTimeBattle][], which is a piece of software designed specifically to facilitate contests of this sort. It's kind of like those other robot-battle systems, except instead of requiring you to write your robot in their own arbitrary, broken, horrible language, this lets you write your bot in any language that can talk on stdin and stdout.

Creating actionable information out of raw data is sometimes pretty simple, requiring only small changes. Of the few feature requests that I've received for Calorific, most (all) of them have been for goals. Always listen to the audience, that's my motto!

In my day job I use a mix of Perl and C++, along with awk, sed, and various little languages. In our C++ we use a lot of Boost, especially simple things like the date-time libraries and tuple. <a href="http://www.boost.org/doc/libs/1_43_0/libs/tuple/doc/tuple_users_guide.html">Tuple</a> is a neat little thing, sort of like std::pair except it lets you have up to 10 elements of arbitrary type instead of just two. One of the major things it gives you is a correct operator&lt;, which gives you the ability to use it as a key in std::map. Very handy. One tricky thing, though, is generically iterating over every element in the tuple. What then?

Note: this article uses the Twitter V1 API which has been shut down. The concepts still apply but you'll need to map them to the new V2 API.

As some of you might know, Twitter provides a streaming API that pumps all of the tweets for a given search to you as they happen. There are other stream variants, including a sample feed (a small percentage of all tweets), "Gardenhose", which is a statistically sound sample, and "Firehose", which is every single tweet. All of them. Not actually all that useful, since you have to have some pretty beefy hardware and a really nice connection to keep up. The filtered stream is much more interesting if you have a target in mind. Since there was such a hubbub about "Lost" a few weeks ago I figured I would gather relevant tweets and see what there was to see. In this first part I'll cover capturing tweets and doing a little basic analysis, and in the second part I'll go over some deeper analysis, including some pretty graphs!

Modern web applications are complicated beasts. They've got database processes, web serving processes, and various tiers of actual application services. The first two generally take care of themselves. PostgreSQL, MySQL, Apache, Nginx, lighttpd, they all have well-understood ways of starting and keeping themselves up and running.

But what do you do if you have a bunch of processes that you need to keep running that aren't well understood? What if they're known to crash once in a while and you don't want to have to babysit them? You need a user space process manager. Zed Shaw seems to have coined this term specifically for the Mongrel2 manual, and it describes pretty accurately what you'd want: some user-space program running above init that can launch your processes and start them again if they stop. Dropping privileges would be nice. Oh, and it'd be cool if it were sysadmin-friendly. Oh, and if it could automatically detect code changes and restart that'd be nifty too.

Edit 2010-08-08: ProcLaunch now has a CPAN-compatible install process. See below for details.

I finally got the chance to work some more on proclaunch, my implementation of a user space process manager, like runit or mongrel or god. I wrote up a big overview of the currently available options [previously][12], but in summary: all of the existing options suck. They're either hard to set up, have memory leaks, have a weird configuration language, or are just plain strange. The only viable option was procer, and even that was just sort of a tech demo put together for the Mongrel2 manual.

That's why I started putting together proclaunch. I needed some of the features of runit, namely automatic restart, with none of the wackiness, and I wanted it to be easy to configure automatically. I also wanted it to be standalone so I wouldn't have to install a pre-alpha version of Mongrel2 just to manage my own processes.

Browsing around on Hacker News one day, I came across a link to a paper entitled "A micro-manual for Lisp - Not the whole truth" by John McCarthy, the self-styled discoverer of Lisp. One commenter said that they had been using this paper for a while as a code kata, implementing it several times, each in a different language, in order to better learn that language. The other day I was pretty bored and decided that doing the same might be a good way to learn something and alleviate said boredom. My first implementation is in Perl, mostly because I don't want to have to learn a new language and Lisp at the same time. The basic start is after the jump.

I started a small product a few weeks ago called FivePad, a simple, easy way to organize your apartment search. It's basically the big apartment-search spreadsheet that you and I and everyone we know have made at least three times, except FivePad is way smarter.

The initial versions of FivePad did everything in the web request cycle, including sending email and pulling down web pages. The other day I was about to add my third in-cycle process when I threw up my arms in disgust. The time had come to integrate Resque, a great little Redis-based job queueing system. Except that if I ran it the way Heroku makes easy, my costs would get a bit out of control for a project that isn't making much money yet.

Literally ten minutes after hitting the publish button on my last post I took a little tumble and broke a rather important bone in my back, and now I'm on medical leave from work for a while.

That doesn't stop me from doing fun things, though, so this morning I cooked up a tiny webapp using [Sinatra][], [DataMapper][], and [Bootstrap][] that will help me keep track of when I take painkillers. It's called [Painkiller Jane][] after the comic book character.

Dokuen (Japanese for "solo performance") is an amalgamation of open source components that I mashed together so I could run Heroku-style services on my shiny new Mac mini while retaining the paradigm of git push deployments and environment variables for configuration. Effectively, I wanted to be able to seamlessly deploy 12 factor applications in my local environment.

Update: I've rewritten Dokuen and released it as a gem. See this article for details.

Recently I sat down and reorganized my dotfiles around the tasks that I do day-to-day. For example, I have bits of configuration related to ledger and some other bits related to Ruby development. In my previous dotfile setup, this stuff was all mixed together in the same files. I had started to use site-specific profiles (i.e. home vs work), but that led to a lot of copied config splattered all over. I wanted my dotfiles more organized and modifiable than that.

In addition to writing on this blog, I've been keeping notes for various
things on Marginalia, my web-based note taking
and journaling app. In my previous post
I talked about the why and how of Marginalia itself. In this post I'd like to talk
more about what I actually use it for day to day, in particular to keep programming journals.

Earlier this year I bought a new Mac mini for various reasons. One of the big ones was so I would
have a place to stash private git repositories that I didn't want to host on 3rd party services
like Github or Bitbucket. This post describes how I set up Gitolite and my own hook scripts,
including how I mirror my git repos on S3 using JGit.

Heroku is a Platform as a Service running on top of Amazon Web Services where you can run web applications written using various frameworks and languages. One of the most distinguishing features of Heroku is the concept of Buildpacks, which are little bits of logic that let you influence Heroku as it builds your application. Buildpacks give you almost unlimited flexibility as to what you can do with Heroku's building blocks.

Hanging out in the #heroku irc channel, I sometimes see some confusion about what buildpacks are and how they work, and this article is my attempt to explain how they work and why they're cool.

Deploying Heroku-style 12 factor applications outside of Heroku has been an issue for lots of people. I've written several different systems that scratch this particular itch, and in this post I'll be describing a version that deploys one particular app using a Heroku-style buildpack, Foreman, and launchd on Mac OS X via Capistrano.

Yesterday I suddenly developed the intense need to add search to this site. One of the problems with this is that the site is kind of a weird hybrid between static and dynamic, and it has no database backend. If posts were stored in Postgres this would be a trivial matter, but they're just Markdown files on disk. After flailing around for a while I came across a library named Whistlepig which purported to do in-memory full-text indexing with a full query language.

November 5, 2013: I've removed search because nobody used it and this way the site can be 100% static.

For as long as I can remember I've been trying to find a good way to keep personal text notes. Recipes, notes, ideas, that kind of thing. Things that aren't really suited to blogging. Along the way I've used (and stuck with) PmWiki, DokuWiki, TiddlyWiki, and most recently I built my own sort-of-pseudo-wiki Marginalia.

Lately, though, it's been kind of a drag to use a web-based application just to write down some work notes. Having sort of an obsession with Markdown, I decided to just start keeping notes in Markdown-formatted files in a directory. Of course, files that aren't backed up are likely to disappear at any moment, so I naturally stuck them in a git repository and pushed to my personal git server. But then, how do I deal with syncing my work and home machines? I guess I'll manually merge changes...

Processing payments correctly is hard. This is one of the biggest lessons I've learned while writing my various SaaS projects. Stripe does everything they can to make it easy, with quick start guides and great documentation. One thing they really don't cover in the docs is what to do if your connection with their API fails for some reason. Processing payments inside a web request is asking for trouble, and the solution is to run them using a background job.
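To make the idea concrete, here's a minimal sketch of a background charge job. The names (`ChargeJob`, `MAX_ATTEMPTS`) are mine, and the actual payment call is injected as a lambda rather than a real Stripe API call, so this only illustrates the retry-outside-the-request pattern:

```ruby
# Sketch: run the charge outside the web request and retry transient
# failures instead of losing the payment when the API connection blips.
class ChargeJob
  MAX_ATTEMPTS = 3

  def initialize(charge_fn)
    # charge_fn stands in for the real API call (e.g. a Stripe charge).
    @charge_fn = charge_fn
  end

  def perform(order_id, amount_cents)
    attempts = 0
    begin
      attempts += 1
      @charge_fn.call(order_id, amount_cents)
    rescue IOError
      # Transient connection trouble: try again, up to a limit.
      retry if attempts < MAX_ATTEMPTS
      raise
    end
  end
end
```

In a real app the job would be enqueued from the controller and run by a worker (Resque, Sidekiq, and so on), with idempotency handled so a retry can't double-charge.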

For various projects including Mastering Modern Payments I've found it really useful to be able to view the Markdown source rendered as HTML but I don't really care about editing it online. I put together a little gem named page_viewer which renders Markdown files like this:

A good eBook needs a PDF version, of course. A while back I wrote an open-source service named Docverter that can render XHTML to PDF using a library named Flying Saucer, among other things. All you have to do is pipe in the HTML and other related files and you get back a rendered, self-contained PDF file. There are a few non-trivial aspects to this, of course, because HTML is not primarily intended for printable output. The W3C has worked up a whole CSS module for page-related styles, but it's not the most readable document. There are a few simple-ish things that you can do to your document to make it look nice, though.

Over the past few years I've written a number of Rails applications. It's become my default "scratch an itch" tool for when I need to build an app quickly to do a task. Even though Rails is mostly batteries-included, there are a few tools that make writing new applications so much easier. This is my list of tools that I use for pretty much every new Rails project.

Edit: The discussion on Hacker News has some great gems that you should consider using as well.

When you're writing a Rails application you usually end up with a lot of CRUD-only controllers and views just for managing models as an admin. Your user-facing views and controllers should of course have a lot of thought and care put into their design, but for admin stuff you just want to put data in the database as simply as possible. Rails of course gives you scaffolds, but that's quite a bit of duplicated code. Instead, you could use the one-two-three combination of Simple Form, Inherited Resources, and Rails' built-in template inheritance to DRY up most of the scaffolding while still preserving your ability to customize where appropriate. This lets you build your admin interface without having to resort to something heavy like Rails Admin or ActiveAdmin while also not having to build from scratch every time.

A little over a year ago I announced the launch of my latest (at the time) product, Marginalia. The idea was to be a sort of online journal: a cheaper, more programmer-friendly alternative to Evernote. It never took off, despite my best intentions, and so a few months ago I told the only active user that I was going to shut it down, and today I finally took that sad action. This post is a short history of the project and a few lessons learned.

My wife and I recently moved from Portland, OR to Ann Arbor, MI. Among the cacophony of change that is involved with a move like that, we of course changed to the local utility company. Browsing around in their billing application one day I came across a page that showed a daily graph of our energy usage, supposedly valid through yesterday, for both gas and electric. And it has a button that spits out a CSV file of the data, which means if I actually wanted to I could build my own tool to analyze our usage.

Trading markets of all kinds are in the news pretty much continuously. The flavor of the week is of course the Bitcoin markets but equity and bond markets are always in the background. Just today there is an article on Hacker News about why you shouldn't invest in the stock market. I've participated in markets in one way or another for about a decade now but I haven't really understood how they work at a base level. Yesterday I built a tiny market simulator to fix that.

A few days ago I built a new tool I'm calling Sites. It builds on top of git-backed wikis powered by GitHub's Gollum system and lets me build and deploy microsites in the amount of time it takes me to create a CNAME.

Something that I've wanted for a very long time is a way to stand up new websites with little more than a CNAME and a few clicks. I've gone through a few rounds of trying to make that happen but nothing ever stuck. The furthest I got was a Rails app exclusively hosting Comfortable Mexican Sofa, a simple CMS engine. I never ended up putting any sites on it, though.

GitHub's Pages are of course one of the best answers, but I'm sticking to my self-hosting, built-at-home guns.

Yesterday a friend of mine linked me to a fictional web serial that he was reading and enjoying, but could be enjoying more if it were available as a Kindle book. The author hasn't yet made one available and has asked that fan-made versions not be linked publicly.
That said, it's a very long story and would be much easier to read using a dedicated reading app, so I built my own Kindle version to enjoy. This post is the story of how I built it.

One of the very first things I did when I started working on the idea that eventually became Mastering Modern Payments was set up a Mailchimp mailing list. People would land on the teaser page and add themselves to the list so that when the book came out they would get a little note. After the book launch (with 30% of that initial list eventually buying) I started putting actual purchasers on the list.

I've had a GitHub account since 2008. June 16th, to be exact. For almost six years I've been hosting my code on someone else's servers. It was sure convenient, and free, and I don't regret it one bit, but the time has come to move that vital service in-house.

I've run my own private git server on the Mac mini in my living room since 2012. For the last few years, then, my GitHub account has become more of a public portfolio and mirror of a selection of my private repos. As of today, my GitHub account is deprecated. If you want to see what I'm working on now you can go to my Projects page. I'll be gradually moving old projects over to this page, and new projects will show up there first.

Sometimes you have a VirtualBox VM that's critical to your workflow. For example, the Mac mini in my basement hosts a VM that does things like host all of my private Git repos and provide a staging environment for all of my wacky ideas.

When I have to reboot that Mac mini for any reason, inevitably I find myself trying to push changes to some git repo and forgetting that I have to start up the VM again by hand. And then there's the yelling and the drinking and it's no good for anyone.

When I started Okapi LLC, my little consultancy and publishing house, I had to fax in some forms to the State of Michigan. The entire system for opening businesses in Michigan, in fact, is basically a fax-driven API. Being a modern, hip millennial I don't subscribe to a landline phone, nor do I own a fax machine. How was I supposed to fax things?

Enter Phaxio. They have a whole bunch of fax machines (actually they're probably banks of modems) in a data center somewhere and they let you use them with a simple HTTP API. All you have to do is go sign up and make an initial deposit. They'll provide you with an API key and secret pair that you can then use to send faxes using curl.

A few weeks ago Stripe's SSL certificate became invalid, along with those of several other major sites. Their certificate didn't expire; their certificate authority's root certificate did. This shouldn't happen, but as with most terrible things it crops up at rather inconvenient times.

There's not much you can do to protect yourself against a service provider's certificate expiring, but you can proactively protect yourself against your own certificate expiring. The biggest thing to do is have a schedule and a process.

Your SaaS application should support SAML (Security Assertion Markup Language) if you're at all interested in big fat contracts from large enterprise customers. And why is that?

One word: money. Large enterprise customers pay quite a lot of money for services that help them do their work with a minimum of fuss. They want to do as little management of your service as they can possibly get away with, preferably zero. If you can't make that happen, but your competitor can, guess who's not getting that big fat contract.

SAML is the technology that makes that happen. SAML came out in 2003, long before OpenID and OAuth and JWT and all those other, more modern, hipper authentication protocols. SAML is a stodgy old goat based on XML and X.509 certificates, which you may be familiar with because that's what SSL uses as well. It's supported by everyone that matters in the enterprise space.

When you set up SAML for your customer you're offloading all of the user management to their centralized system. They cryptographically vouch for users that they send your way, which means all you have to do is find or create a user account for them and sign them in. No passwords, no email verification, no nothing. It's great for your customer because they get to manage everything on their end. It's great for you because you don't have to deal with any support requests related to passwords or usernames.

Hosting my own CDN has long been a completely irrational goal of mine.
Wouldn't it be neat, I'd think, if I could tweak every knob instead of relying on CloudFront to do the right thing?
Recently I read this article by Janos Pasztor about how he built a tiny CDN for his website. This just proves to me that at least it's not an uncommon irrational thought.

Yesterday I decided to actually start building something.
Even if it doesn't make it into production, I'll at least have learned something.

Technical Goals

- Centrally manage all of the dozen or so sites that I run
- Automatically generate and renew LetsEncrypt certificates, both for publicly-facing sites and my own private sites. This means using the dns-01 challenge instead of the easier-to-understand http-01 challenge.
- Easily add new cache nodes with an authenticated `curl | sudo bash`
- Automatically reconfigure nginx on the cache nodes when certificates roll or sites change
- Stop paying so much for bandwidth. Transfer is $5/TB/mo from DigitalOcean vs $$$$ for CloudFront.

Additionally, I really want to learn how LetsEncrypt works.
certbot is great but it is very much a black box to me.
Command-line arguments in, certificates out.
If I write my own management system I can actually learn how the guts work.

Current Status

- Basic Rails app that knows about sites and proxies
- Creating or updating a site (re)generates a LetsEncrypt certificate for all of the domains that point at that site
- Wildcard domains are fully supported
- Authenticated endpoint that generates a zip file of all of the certificates and private keys

In my earlier post I talked about how I'm building my own content delivery network (CDN) but I didn't really go into what a content delivery network even is or why someone would want such a thing. A little back story is probably in order.

What is a content delivery network?

A CDN is a set of computers distributed around the globe that all point back at the server where your website is actually hosted. The CDN computers (or nodes) run a piece of software called a proxy, which grabs the content from your server and gives it to someone's web browser as if it were its own. Usually, but not always, the proxy caches the content (saves it locally to its own disk) so that the next web browser to come along doesn't have to wait for the origin server to respond; it just gets the saved content.

That's pretty much all there is to it. There's a standard way of telling a caching proxy (and a web browser, for that matter) exactly how long you want something cached and under what terms. Some commercial CDN offerings also let you write some code that executes at the cache itself, so you can do fancy things like modify requests on the fly before they hit your server.
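The "standard way" is mostly the `Cache-Control` header, whose `max-age` directive says how many seconds a response stays fresh. Here's a toy sketch of the decision a cache makes (the function names are mine, not from any real proxy):

```ruby
# Extract max-age (in seconds) from a Cache-Control header value.
def max_age(cache_control)
  match = cache_control.match(/max-age=(\d+)/)
  match ? match[1].to_i : 0
end

# A stored response is fresh while its age is under max-age.
def fresh?(cache_control, age_seconds)
  age_seconds < max_age(cache_control)
end

header = "public, max-age=3600"  # cacheable by anyone, for one hour
fresh?(header, 120)              # two minutes old: serve from cache
fresh?(header, 7200)             # two hours old: stale, refetch from origin
```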

That's great, but do I need one?

It depends! A CDN buys you nothing if your audience is all in one spot and that spot is close to your server. On the other hand, if your website has a global (or even semi-regional) audience, it will benefit from having caches scattered around the globe that can respond to requests. Ultimately response time is limited by the speed of light. A web browser on one side of the planet will have to wait 130ms under the most ideal conditions to hear back from a server on the other side, just due to how long it takes light to travel there and back. That doesn't include any kind of processing or disk access or buffering or anything. If you have a cache nearby, the viewer's experience will be that much better because your site will seem that much more responsive.
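That 130ms figure is just geometry, and you can check it with Earth's circumference and the speed of light:

```ruby
# Best-case round trip to a server on the opposite side of the planet.
earth_circumference_km = 40_075.0
one_way_km = earth_circumference_km / 2   # ~20,000 km to the antipode
round_trip_km = one_way_km * 2            # request there, response back
speed_of_light_km_per_s = 299_792.458

latency_ms = round_trip_km / speed_of_light_km_per_s * 1000.0
# roughly 134 ms, before any routing, processing, or queueing delays
```

Light in fiber actually travels at roughly two thirds of c, so the real-world floor is closer to 200ms.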

Sites that serve the same files to lots of people are also prime candidates for CDNs. Take Netflix for example. Netflix, at one time, accounted for over 20% of worldwide bandwidth usage. Their content consists of millions of tiny video files (each episode or movie is broken down into 15-30 second segments which are all encoded at a bunch of different bitrates to suit different speed connections). Netflix uses a huge fleet of servers that are all as close as possible to you, the customer, both to reduce the cost of transferring all those bits around and to make it faster for you to get to watching the latest episode of The Crown. Typically when you're accessing Netflix you're actually talking to a Netflix-owned server in your ISP's wiring closet, at least for the video content itself.

This site and the other sites I run don't get nearly that much traffic. I use a CDN because they take care of boring, error prone stuff for me. Right now most of my apps and sites live behind Amazon's CloudFront service because they automatically generate a free SSL certificate, for example. They renew it when it expires and generally let me ignore SSL completely.

The CDN I'm building for myself is also going to take care of SSL via LetsEncrypt, which is 100% free. It's also going to automatically manage the IPs of each site (which AWS does not do) so I don't have to care about DNS. It'll just automagically happen.

In my last post I talked about what a CDN is and why you might want one.
To recap, my goal is automatic, magical DNS/SSL/caching management.
Today we're going to talk about one aspect of this project: HTTPS and SSL.

SSL, or Secure Sockets Layer, is the mechanism web browsers use to secure and encrypt the connection between your computer and the server that is serving up the content you're looking for.

A few years ago browser vendors started getting very serious about wanting every website to be encrypted.
At the time, SSL was expensive to implement because you needed to buy or pay to renew certificates at least once a year.

Almost simultaneously with this increased need for encryption, organizations including the Electronic Frontier Foundation and the Mozilla Foundation started a new certificate authority (organization that issues certificates) named Let's Encrypt.
Let's Encrypt is different because it issues certificates for free with an API.

Most people use a tool named certbot that automates the process of acquiring certificates for a given website.
However, certbot doesn't really work for my purposes.
I want to centrally manage my certificates and copy them out to my CDN nodes on a regular basis, which means I need to use the DNS challenge type.
certbot's support for the DNS challenge isn't really adequate for my needs.

Challenge Types

Let's Encrypt uses challenges to verify that you own the domain that you're trying to acquire a certificate for.
Currently there are two different challenge types, http-01 and dns-01.

For http-01, you create a file at a well-known path on your website containing a challenge string that the API gives you.
Then you tell Let's Encrypt to go look for it.
If the file is there and contains the correct challenge string, Let's Encrypt will give you a certificate.

dns-01 works much the same way, except instead of creating a file you create a TXT record for your domain.
Let's Encrypt will ask your domain's DNS servers for the value of the TXT record, and if it matches what it expects, you get a certificate.

http-01 has the advantage of being really simple and easy to use with the certbot tool and whatever web server you happen to have.
However, with multiple servers in the mix it can get tricky to make sure that every server has a certificate without hitting Let's Encrypt's rate limits.

That's why I'm using dns-01.
I can easily drive the API from the central management node and copy the certificates out to all of the CDN nodes simultaneously.

How ACME Works

I use a gem called acme-client to drive Let's Encrypt's ACMEv2 API.
Once you know ACME's terminology it's easy to use.

- An order is the initial request to generate a certificate for one or more domain names.
- An authorization is LetsEncrypt's response to the order. It contains one or more challenges for each domain name in the order.
- After setting up the challenges with either http-01 or dns-01, you call `request_validation`. LetsEncrypt tries to verify that you were able to successfully install the challenges.
- Finally, after LetsEncrypt has seen the validations in the wild, you send a certificate signing request (CSR). LetsEncrypt responds with a properly signed certificate, valid for all of the domain names that you verified and sent with your CSR.
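Put together, the flow looks roughly like this. It's a sketch against the acme-client gem (the `require` sits inside the method so the sketch loads without the gem installed); `key` is the account's private key, and actually publishing the DNS records is elided:

```ruby
# Sketch of the full order lifecycle with acme-client.
def issue_certificate(key, domains)
  require 'acme-client'  # gem dependency, loaded lazily for this sketch

  client = Acme::Client.new(
    private_key: key,
    directory: 'https://acme-v02.api.letsencrypt.org/directory'
  )

  # The order names every domain the certificate should cover.
  order = client.new_order(identifiers: domains)

  # Each authorization carries the challenges for one domain.
  order.authorizations.each do |authorization|
    challenge = authorization.dns
    # ...publish challenge.record_name / challenge.record_content
    #    in DNS here, then ask for validation...
    challenge.request_validation
  end

  # Finalize with a CSR; once the order is valid, the signed
  # certificate comes back on the order itself.
  csr = Acme::Client::CertificateRequest.new(private_key: key, names: domains)
  order.finalize(csr: csr)
  order.certificate
end
```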

Getting a Certificate, End to End

Step 1: Sign up for an account

The first thing we need to do is sign up for a LetsEncrypt account.
Accounts are identified with a private key and an email address.
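Generating the key takes a couple of lines of Ruby's standard OpenSSL library; the registration call itself (shown in a comment) needs the acme-client gem, a contact address, and a network connection:

```ruby
require 'openssl'

# The account key identifies this account on every subsequent request.
# Generate it once and keep it safe.
account_key = OpenSSL::PKey::RSA.new(2048)
File.write('account.pem', account_key.to_pem)

# Registering with Let's Encrypt (sketch, not run here):
#
#   client = Acme::Client.new(
#     private_key: account_key,
#     directory: 'https://acme-v02.api.letsencrypt.org/directory'
#   )
#   client.new_account(contact: 'mailto:me@example.com',
#                      terms_of_service_agreed: true)
```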

Step 4: Wait for DNS to populate

Route53 takes some time to push your changes out so now we have to wait.
We also have to wait for all of the DNS servers that service the zone to return with the correct value because LetsEncrypt will pick one randomly to ask for the challenge.

Let's write a loop to wait for us.
First we need to get the list of nameservers for the zone:
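Something like this, using Ruby's built-in Resolv library (the method names are mine). The lookup is injectable so the waiting logic can be exercised without touching real DNS:

```ruby
require 'resolv'

# Authoritative nameservers for the zone.
def nameservers_for(zone)
  Resolv::DNS.open do |dns|
    dns.getresources(zone, Resolv::DNS::Resource::IN::NS)
       .map { |ns| ns.name.to_s }
  end
end

# All TXT values at record_name, as served by one specific nameserver
# (an IP address works most reliably here).
def txt_values(nameserver, record_name)
  Resolv::DNS.open(nameserver: [nameserver]) do |dns|
    dns.getresources(record_name, Resolv::DNS::Resource::IN::TXT)
       .flat_map(&:strings)
  end
end

# Poll until every nameserver serves the expected challenge value.
def wait_for_txt(nameservers, record_name, expected, lookup: method(:txt_values))
  loop do
    # all? bails out at the first stale server: no point checking the rest.
    break if nameservers.all? { |ns| lookup.call(ns, record_name).include?(expected) }
    sleep 5
  end
end
```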

This again uses Ruby's built-in Resolv library to get a list of values.
In this case we're asking for all of the TXT values that we set up with the Route53 upsert earlier.

We loop over each nameserver and ask if the value is what we're looking for.
If it isn't we bail out early because we need all of the nameservers to have the correct value.

Step 5: Request Validation

Finally, after verifying that DNS has the correct values set, we tell LetsEncrypt to validate our challenges.
If we had asked for validation immediately after the upsert, LetsEncrypt would have failed the order, and there's no way to restart it or ask it to check again.
You get one validation per order, and if you fail you have to start all over.

The acme-client library comes with a handy Acme::Client::CertificateRequest wrapper that takes care of building a CSR exactly how LetsEncrypt wants to see it, so all we have to fill in is the list of domain names we want the certificate to apply to.
After a short wait LetsEncrypt will return the bright shiny new certificate in `order.certificate`.

Wildcard Wrinkle

The above is great if you want to list out every domain name that you want the certificate to apply to.
LetsEncrypt recently added support for wildcard certificates, though, which are very useful but have one additional wrinkle.

Wildcard certificates apply to all of the subdomains at a single level for a given pattern.
Let's say you want your certificate to apply to these domain names:

Instead of listing all of these domains in the certificate request you can ask for a wildcard, like this:

example.com
*.example.com

The wildcard applies to any subdomain at that single level.
mx.example.com will match but blah.mx.example.com will not.

The wrinkle here is that LetsEncrypt will give you two challenges for the same domain name, because it wants you to verify both the root and the wildcard.
You can't easily set multiple TXT records for a given label in Route53, though, so you have to collapse them into one upsert:
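With the aws-sdk-route53 gem, that looks roughly like this. The helper (its name is my own sketch) collapses every challenge value into the `resource_records` array of a single TXT record set; each value must be wrapped in double quotes:

```ruby
# Build one UPSERT carrying every challenge value as a separate
# quoted string in a single TXT record set.
def challenge_change_batch(record_name, values)
  {
    changes: [{
      action: 'UPSERT',
      resource_record_set: {
        name: record_name,
        type: 'TXT',
        ttl: 60,
        resource_records: values.map { |v| { value: %("#{v}") } }
      }
    }]
  }
end

# Applying it (sketch; needs the aws-sdk-route53 gem and credentials):
#
#   route53 = Aws::Route53::Client.new
#   route53.change_resource_record_sets(
#     hosted_zone_id: zone_id,
#     change_batch: challenge_change_batch('_acme-challenge.example.com', tokens)
#   )
```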