The always fabulous Louis Gray makes good points about using Gmail in a corporate environment, and he got me thinking in a different direction.

I began to consider: why can’t I share emails the same way I can share RSS entries?

Google Reader allows you to publish all entries you’ve tagged with specific keywords, or to share entries on an individual basis. Yet, despite the obvious analogue, it’s impossible for me to share email messages or threads in the same manner!

I realize there are some privacy concerns, since RSS and Atom explicitly make things public and email does not. However, there’s no reason I couldn’t use an email-to-RSS gateway and violate that expected convention easily.

I might also argue that opening email up to the same type of social collaboration we get via Google Reader could actually make things more secure.

For example, we could add a default copyleft-style license, a la Creative Commons, or a per-email “off the record” flag like Google Talk’s. There could even be “free to share” delivery options rather than keeping everything on an “honor system”.

This post is for anyone interested in any of the Government Transparency initiatives. If you’ve been following this topic then you’re probably aware that Vivek Kundra sees a dashboard as a way of accelerating the transparency and transformation of the Government.

After watching groups like the Sunlight Foundation and Change Congress work their magic, I’ve now begun seeing much of this transformation from the inside due to my new job.

However, I still endeavor to participate externally as well and so I wanted to do some analysis on the public data.

To start, I wanted to import the USASpending.gov information into Google Spreadsheets, since I can kill two buzzwords at once by leveraging Cloud Computing services for Transparency!

For a while, I fought with finding the easiest way to import the data, and I wanted to share what eventually worked for me.

First, create a new spreadsheet and name your tab appropriately. Then go to the USASpending Feeds page to select the specific data you want. I suggest starting with the Exhibit 300 information, since it’s typically a smaller dataset; my Exhibit 53 tab, with more than 1,200 rows, has proven to be very slow.

Next, pick the data fields you want (I highly suggest reordering them), then pick which Agency or Agencies you’d like info on. Again, keep in mind that requesting lots of data will be pretty slow.

Finally, select the CSV icon, which should open your browser’s download prompt. It’s unfortunate that the implementers didn’t use a dynamic link here, because you can’t simply copy the URL. Instead, I had to download the file itself and then copy the originating URL to my clipboard (I was using Chrome, so how to do this will depend on your browser).

Now that we’ve got the URL we can import everything into our spreadsheet by selecting the A1 cell and entering:

=ImportData("<url>")

Where “<url>” is, of course, the long URL you copied earlier.

After a few quick seconds your data should be automagically imported!

For the Exhibit 300, things worked just great, but for the Exhibit 53 data I ended up with each cell of Column A holding the full data for each entry. So in B1 I simply entered: =SPLIT(A1, ",") (note there’s a bug with Google where the quotes have to be double quotes, not single) and then things auto-populated left to right.

Unfortunately, SPLIT() didn’t auto-populate downwards as well, and dragging the function down the full B column is very, very painful.
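One workaround for that pain is to pre-split the downloaded CSV locally before it ever reaches the spreadsheet. Here’s a small sketch in Python; the filename `exhibit53.csv` is hypothetical, standing in for whatever the file saved from USASpending.gov is called:

```python
import csv
import io

def split_rows(csv_text):
    """Parse CSV text into a list of rows (each row a list of fields),
    handling quoted fields that contain commas."""
    return [row for row in csv.reader(io.StringIO(csv_text))]

if __name__ == '__main__':
    # 'exhibit53.csv' is a placeholder for the downloaded file
    with open('exhibit53.csv') as f:
        rows = split_rows(f.read())
    # Re-emit as tab-separated values, which paste cleanly
    # into a spreadsheet as one field per cell
    for row in rows:
        print('\t'.join(row))
```

Pasting the tab-separated output into the sheet sidesteps SPLIT() entirely, at the cost of losing the live ImportData refresh.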

A while back, after Cloudera released their lectures and VMware image for Hadoop, I watched the training sessions and worked through some of the initial exercises.

I must say I was a little disappointed by the videos but I believe that’s because I’d seen Christophe Bisciglia’s lectures when he was still at Google.

However, the exercises definitely get you thinking and are worth giving a shot. It’s sort of like ‘programming golf’, and I thought I’d share my version of the first map function vs. the packaged solution.

By definition they should produce the same output, i.e. the mappings should be identical, and barring buggy corner cases mine certainly passed the test.

What I found interesting was my instinctual desire to let regexps do the work, whereas their version relies on a simple split() to sort the input. Theirs is likely the faster solution, and given the massive amounts of data in large passes, it’s worth benchmarking.
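To make the contrast concrete, here’s an illustrative sketch of the two styles as Hadoop Streaming word-count mappers. This is neither the packaged Cloudera code nor my exact version, just the shape of each approach:

```python
import re
import sys

def map_split(line):
    """Split-based mapper: tokenize on whitespace,
    then trim punctuation and normalize case."""
    for word in line.split():
        word = word.strip('.,;:!?"\'()').lower()
        if word:
            yield (word, 1)

# Regex-based mapper: let the pattern define what a word is
WORD_RE = re.compile(r"[a-z']+")

def map_regex(line):
    for word in WORD_RE.findall(line.lower()):
        yield (word, 1)

if __name__ == '__main__':
    # Hadoop Streaming convention: emit tab-separated key/value pairs
    for line in sys.stdin:
        for word, count in map_regex(line):
            print('%s\t%d' % (word, count))
```

Notice how swapping the mapper means changing one function call, while swapping the tokenization rule in the regex version means changing only the pattern.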

However, although I’m clearly biased, I must admit I find mine easier to grok, and it should be more flexible, e.g. the input pattern could become a parameter rather than being hard-coded into the flow.

There’s certainly no “right” way to do it, other than one that works. The advantage of the MapReduce model is that the necessary code is often really, really short and easy to modify, but I thought others might find it interesting to realize that Perl doesn’t have an exclusive license on ‘TMTOWTDI’.

It’s been a while since I ran my CouchDB performance test, but many of the comments I received suggested that updating my codebase should yield some significant performance improvements. Unfortunately, at the time I didn’t have spare cycles to invest in building the latest branches of Erlang, CouchDB and everything else, so I hadn’t previously been able to rerun my tests.

However, I started a new project today and, like most developers, I took some time to sharpen my tools before I felt sufficiently prepared to proceed. Of course, since one of my favorite tools is CouchDB itself, I checked in to see how it had been progressing, and I was thrilled to see that Janl (along with, it looks like, other contributors) had released a new version of the excellent DBX bundle!

So after a round of updating DBX and CouchDB python library components, I decided to suffer a small distraction and give the new code a test drive.
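For context, a timing run of this sort can be as simple as a bulk insert against a local server using the couchdb-python library. This is only a sketch of the general shape, not my actual test harness; the document schema and database name are made up for illustration:

```python
import time

def make_docs(n):
    """Generate n trivial keyword documents for a timing run
    (hypothetical schema, purely for illustration)."""
    return [{'_id': 'kw-%06d' % i, 'keyword': 'term%d' % i}
            for i in range(n)]

if __name__ == '__main__':
    import couchdb  # couchdb-python, assumed installed

    # assumes a CouchDB instance on the default local port
    server = couchdb.Server('http://localhost:5984/')
    db = server.create('perf_test')

    docs = make_docs(1000)
    start = time.time()
    db.update(docs)  # bulk insert via the _bulk_docs API
    print('bulk insert of %d docs: %.2fs'
          % (len(docs), time.time() - start))
```

Timing the same loop before and after a library or server upgrade gives a rough apples-to-apples comparison, though wall-clock numbers will vary run to run.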
I wanted to check my baseline, so here’s a rough time sample for the original, file based, keywords code:

Alas, it doesn’t look like most of the performance improvements have really paid off for this testcase; in fact, every run I tried was slower than the last version.
Here’s a sample run which is fairly indicative of the rest: