Archive for February, 2011

Bitcoin is the new hot thing (i.e. I just heard about it on Hacker News). It’s a peer-to-peer virtual currency with a couple of wrinkles – the total number of bitcoins is strictly restricted, making it popular with those who worry about central banks debasing currencies, and you get about 50 bitcoins a year just for installing some software on your computer, making it popular with those who enjoy getting something for nothing.

Madison Hours is a 15-year-old alternative currency used in Madison, WI, based on the principle that everyone’s labor should be worth exactly the same amount: $10/hour.

At first glance, there’s not much in common between the global, decentralized, computerized Bitcoins and the local, cooperative-managed, paper Hours, but they actually have a lot of similarities.

First, that something for nothing thing. It’s long been known that you can jump-start an economy by giving everyone a bit of cash, and both systems do this – Bitcoin gradually over time, and Hours when you sign up. It’s the same principle as the $300 Bush tax cuts, though they’d both hate the comparison.

Second, both systems go against something that almost all economists believe. For Hours, it’s that prices should be determined by supply and demand. For Bitcoin, it’s that a central bank should regulate the money supply to help stabilize the economy.

And third, and most important, they’re both mostly useless. Madison Hours can be used to buy “squares of seed-embedded paper” and can be used in part-payment for “dolls handmade from old clothes”, while Bitcoin offers “web marketing and web 2.0 solutions” and a “karaoke conversion tool”. Not too surprising – if the financial system worked as badly for people as its detractors said it did, it would have collapsed long ago. (Yes, I know it almost did – nobody likes a clever dick).

Perhaps the best way to understand this is to look at Samuel Butler’s novel Erewhon, describing a fictional country which has regular banks and “Musical Banks”, which offer a parallel currency, officially esteemed but practically worthless.

Of course every one knew that their commercial value was nil, but all those who wished to be considered respectable thought it incumbent upon them to retain a few coins in their possession, and to let them be seen from time to time in their hands and purses. … I never could understand [..] why a single currency should not suffice them; it would seem to me as though their dealings would have been thus greatly simplified; but I was met with a look of horror if ever I dared to hint at it. Even those who to my certain knowledge kept only just enough money at the Musical Banks to swear by, would call the other banks (where their securities really lay) cold, deadening, paralyzing, and the like.

Butler was satirizing how religion was officially honored but ignored in practice. He would be amused that his satire had actually been implemented in Bitcoin, and that people were taking it seriously.

Will Bitcoin be successful? Well, it can certainly survive as a quasi-barter system for a few geeks and those who love them. But buying a car, renting an apartment, getting anything significant from a major company? Let’s just say that 3 Bitcoins and $4 will get you a Starbucks latte.

If you’ve ever worked with business users (people with an MBA, or who might as well have one), you’ll know that they want just two things: pivot tables, and graphs. And the graphs always have to have two different scales for the y-axis, for some reason. You’ll want a third thing – to read in and parse data. So why is it so hard to do all of these in any one statistical programming language?

I’ve picked on R for a while (and it actually gets a passing grade with the plyr and doBy packages), so let’s look at Python and pivot tables. This page shows how to create summary tables in Python – the key lines to define a summary function are:

    from itertools import groupby
    from operator import itemgetter

    def summary(data, key=itemgetter(0), value=itemgetter(1)):
        """Summarise the supplied data.

        Produce a summary of the data, grouped by the given
        key (default: the first item), and giving totals of
        the given value (default: the second item)

        The key and value arguments should be functions which,
        given a data record, return the relevant value.
        """
        # groupby only merges adjacent records, so data must be sorted by key
        for k, group in groupby(data, key):
            yield (k, sum(value(row) for row in group))

What a mess. [*] Obscure auxiliary functions? Check. Generator expression? Check. Iterator object? Check. Overcomplicated, overabstracted, and hard to understand? Check, check and check. (And see comments to the original link for an extension to a .csv file, which needs lambda functions.) This recipe extends it to multiple keys, with an example of use that looks like this:

Why so convoluted? It’s not as if this is a new problem – SAS basically solved it in the 70s – compare the use of PROC MEANS (example adapted from “The Little SAS book”). Even if you don’t know SAS, you can make a pretty good guess at what it’s doing.

(OK, I’m partly comparing the apple of a function definition with the orange of a function use – so what? The point is that the base SAS function is already defined, so you don’t have to worry about it, and that 30 years of “progress” in language development has given us something that is much less flexible and easy to use. Don’t get too smug though, SAS – you’re expensive, and your graphing is awful.)
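To put an apple next to an apple, here is a minimal sketch of what a use of the Python summary function above might look like. The sales records are made up for illustration, and the explicit sort is there because groupby only merges adjacent rows with the same key.

```python
from itertools import groupby
from operator import itemgetter


def summary(data, key=itemgetter(0), value=itemgetter(1)):
    """Yield (key, total) pairs; data must already be sorted by key."""
    for k, group in groupby(data, key):
        yield (k, sum(value(row) for row in group))


# Made-up (region, sales) records, purely for illustration.
sales = [("north", 10), ("south", 5), ("north", 7), ("south", 3)]

# Sort by the grouping key first, then total the second field per key.
totals = dict(summary(sorted(sales, key=itemgetter(0))))
print(totals)
```

Skip the sort and you silently get one row per run of adjacent keys rather than one row per key, which is exactly the kind of trap a business user should not have to know about.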

Well, let’s take a step back and think about the sync problem and what the ideal solution for it would do:

• There would be a folder.
• You’d put your stuff in it.
• It would sync.

They built that.

Why didn’t anyone else build that? I have no idea.

“But,” you may ask, “so much more you could do! What about task management, calendaring, customized dashboards, virtual white boarding. More than just folders and files!”

No, shut up. People don’t use that crap. They just want a folder. A folder that syncs.

“But,” you may say, “this is valuable data…certainly users will feel more comfortable tying their data to Windows Live, Apple Mobile Me, or a name they already know.”

No, shut up. Not a single person on Earth wakes up in the morning worried about deriving more value from their Windows Live login. People already trust folders. And Dropbox looks just like a folder. One that syncs.

“But,” you may say, “folders are so 1995. why not leverage the full power of the web? With HTML 5 you can drag and drop files, you can build intergalactic dashboards of stats showing how much storage you are using, you can publish your files as RSS feeds and tweets, and you can add your company logo!”

No, shut up. Most of the world doesn’t sit in front of their browser all day. If they do, it is IE 6 at work that they are not allowed to upgrade. Browsers suck for these kinds of things. Their stuff is already in folders. They just want a folder. That syncs.

That is what it does.

Memo to designers of statistical programming languages: You may say “What about tuples, lambda functions, generator objects?” No, shut up. People don’t use that crap. But they do want pivot tables, graphs, and to be able to read in and parse data.

[*] I should be clear that my complaint is with Python rather than the code as such.

Groupon’s Tibet ad was a stupid mistake, but not for the reasons you might think.

Let’s start with the obvious. Nobody really cares about Tibet (who can name two cities in it?), but Tibet is part of the stuff white people like, so everyone has to pretend to care about it. Groupon is more stuff white people like – its quirky humor and global-local emphasis converted your grandmother’s coupon-clipping into a multi-billion dollar business. Now, Groupon’s ad no more mocks the people of Lhasa and Shigatse than “Best In Show” makes fun of little doggies – the real target is clueless hipsters. Unfortunately, this is a key part of Groupon’s user base, and they don’t like to be made fun of, so they took offence … on behalf of the Tibetan people, of course.

So, a stupid mistake. Or was it? Groupon is trying to move into the mass market – why not do that by creating a controversy? Attack your base, garner attention from the mainstream media, and ride your Sister Souljah moment all the way to Facebook-style valuations. Of course, not all Super Bowl ad campaigns lead to commercial success – if you think they do, there’s a Groupon for 50%-off sock puppets you might want to buy.

R is great for prototyping models. Not so great if those models have to run a business. Here are some tips to help with that:

Validate, alert, and monitor

Sink

Use 64-bit Linux

Write your own functions

tryCatch

Validate, alert, and monitor: Sooner or later something is going to go wrong with your model. Maybe some parameter will get the wrong sign and it will recommend selling iPads for a nickel. You can guard against this by constrained optimization, but really you need to have an automated check on any results before they go into production. If model results change a lot between runs, you should be automatically notified. And even if the model is running fine, you should produce summaries of its performance, and how it’s changing over time. To email yourself the string message_text with subject my_subject in Unix, do:
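The automated check itself is language-agnostic. As a rough sketch in Python (this is not the Unix mail command referred to above; the function names, parameter names, addresses, and the 50% threshold are all assumptions made for illustration), comparing two runs and preparing an alert might look like this:

```python
from email.message import EmailMessage


def check_run(previous, current, max_rel_change=0.5):
    """Compare two runs of fitted parameters and return a list of warnings.

    previous and current map parameter names to fitted values.
    """
    warnings = []
    for name, new in current.items():
        old = previous.get(name)
        if old is None:
            warnings.append(f"{name}: new parameter, no previous value")
            continue
        if old * new < 0:
            # A sign flip is the "iPads for a nickel" scenario.
            warnings.append(f"{name}: sign flipped ({old} -> {new})")
        elif old != 0 and abs(new - old) / abs(old) > max_rel_change:
            warnings.append(
                f"{name}: changed by more than {max_rel_change:.0%} ({old} -> {new})"
            )
    return warnings


def build_alert(warnings, sender, recipient):
    """Wrap the warnings in an email message; sending is left to the caller."""
    msg = EmailMessage()
    msg["Subject"] = f"Model check: {len(warnings)} warning(s)"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(warnings))
    return msg


# Hypothetical parameter values from two consecutive runs.
previous = {"price_elasticity": -1.2, "intercept": 4.0}
current = {"price_elasticity": 0.3, "intercept": 4.1}

problems = check_run(previous, current)
if problems:
    alert = build_alert(problems, "model@example.com", "me@example.com")
    print(alert["Subject"])
```

The point of returning a list rather than raising on the first problem is that you get one email per run with everything that looks wrong, and the results stay out of production until someone has read it.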

Use 64-bit Linux: R is bad at memory management. You can try to use smaller structures, call the garbage collector (gc), and rm structures where possible, but the best solution is to run it on 64-bit Linux with lots of memory. Anything else is gambling.

… glm(), and its workhorse, glm.fit(), are a mess: They’re about 10 lines of functioning code, plus about 20 lines of necessary front-end, plus a couple hundred lines of naming, exception-handling, repetitions of chunks of code, pseudo-structured-programming-through-naming-of-variables, and general buck-passing. I still don’t know if my modifications [to produce bayesglm] are quite right–I did what was needed to the meat of the function but no way can I keep track of all the if-else possibilities.

Do you really want that code in a production system? Copy it and call it my_glm or my_bayesglm. That way it’s under your control, and will be easier to debug and fix.

tryCatch: Well, at least if you do run into an error, you can send yourself a nice email saying what went wrong and where – a little more elegant than just relying on your log file.
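tryCatch is R, but the pattern is the same anywhere: wrap each step, capture what failed and where, and hand that text to whatever sends your email. A minimal sketch of the idea in Python, where run_safely and fit_model are hypothetical names invented for this example:

```python
import traceback


def run_safely(step, *args):
    """Run one model step; return (result, None) or (None, error report)."""
    try:
        return step(*args), None
    except Exception as exc:
        # Record which step failed, the error, and the full traceback.
        report = f"{step.__name__} failed: {exc!r}\n{traceback.format_exc()}"
        return None, report


def fit_model(data):
    # Hypothetical stand-in for a real model fit.
    return sum(data) / len(data)


# An empty data set makes the stand-in fit fail with a division by zero.
result, report = run_safely(fit_model, [])
if report is not None:
    print(report)  # in production, this text would go in the email body
```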

So should you use R in a production system? Well, it’s free, and quick to develop in, so go ahead, but definitely keep your eyes open.