HTTP cookies, or how not to design protocols - Browser protocols feel a lot more like Windows than Unix in their design and evolution. The lack of clear principles means we'll face the same endless-but-just-about-manageable cascade of bugs that afflicted Microsoft's OS.

Nilometer - Predictive analytics from a hole in the ground. The level of the Nile was such a strong sign of the strength of the harvest months later that Cairo's biggest festival was cancelled and replaced with prayers and fasting if it didn't measure up. The 1,400 years of time series data from this instrument spawned some fascinating research in modern times too.

Spatial isn't special - There's a life-cycle to every technology niche. As demand first emerges, the few developers who can serve it can make a handsome living, but gradually knowledge and tools diffuse to a wider world and the specialty becomes a skill that can be acquired rather than an expert you need to hire. This is a very good thing for the wider world, what were hard and expensive problems become cheap and easy to solve, but it's worth remembering that when the money's too good it won't last forever.

I'm a female who majored in computer science but then did not use my degree after graduating (I do editing work now). While I was great with things like red-black trees and k-maps, I would have trouble sometimes with implementations because it was assumed going into the field that you already had a background in it. I did not, beyond a general knowledge of computers.

I was uncomfortable asking about unix commands (just use "man"! - but how do I interpret it?) or admitting I wasn't sure how to get my compiler running. If you hadn't been coding since middle school, you were behind. I picked up enough to graduate with honors, but still never felt like I knew "enough" to be qualified to work as a "true" programmer.

I like editing better anyway, so I'm not unhappy with my career, but that enviroment can't be encouraging for any but a certain subset of people, privileged and pushed to start programming early.

Hollow visions, bullshit, lies, and leadership vs management - "the best creative work depends on getting the little things right", "organizations need both poets and plumbers". So much to absorb here, but it all chimes with my experiences at Apple. Steve Jobs may have been a visionary, but he also knew his business inside and out, and obsessed over details.

Exceptions in C with longjmp and setjmp - When I was learning C, I loved how it felt like complete mastery was within reach, it was contained and logical enough to compile mentally once you had enough experience. longjmp() and setjmp() were two parts I never quite understood until now, so it was fascinating to explore them here.

Web Data Commons - Structured data extracted from 1.5 billion pages. To give you an idea of the economics behind big data, the job only cost around $600 in processing costs.

Hammer.js - "Can't touch this!" - A cross-platform Javascript library for advanced gestures on touch devices like tablets and phones. Even just building a basic swipe gesture from tap events is a pain, so this is much needed.

Common Crawl is one of those projects where I rant and rave about how world-changing it will be, and often all I get in response is a quizzical look. It's an actively-updated and programmatically-accessible archive of public web pages, with over five billion crawled so far. So what, you say? This is going to be the foundation of a whole family of applications that have never been possible outside of the largest corporations. It's mega-scale web-crawling for the masses, and will enable startups and hackers to innovate around ideas like a dictionary built from the web, reverse-engineering postal codes, or any other application that can benefit from huge amounts of real-world content.

Rather than grabbing each of you by the lapels individually and ranting, I thought it would be more productive to give you a simple example of how you can run your own code across the archived pages. It's currently released as an Amazon Public Data Set, which means you don't pay for access from Amazon servers, so I'll show you how on their Elastic MapReduce service.

I'm grateful to Ben Nagy for the original Ruby code I'm basing this on. I've made minimal changes to his original code, and built a step-by-step guide describing exactly how to run it. If you're interested in the Java equivalent, I recommend this alternative five-minute guide.

1 - Fetch the example code from github

You'll need git to get the example source code. If you don't already have it, there's a good guide to installing it here:

Buckets are a bit like top-level folders in Amazon's S3 storage system. They need to have globally-unique names which don't clash with any other Amazon user's buckets, so when you see me using com.petewarden as a prefix, replace that with something else unique, like your own domain name. Click on the S3 tab at the top of the page and then click the Create Bucket button at the top of the left pane, and enter com.petewarden.commoncrawl01input for the first bucket. Repeat with the following three other buckets:

com.petewarden.commoncrawl01output

com.petewarden.commoncrawl01scripts

com.petewarden.commoncrawl01logging

The last part of their names is meant to indicate what they'll be used for. 'scripts' will hold the source code for your job, 'input' the files that are fed into the code, 'output' will hold the results of the job, and 'logging' will have any error messages it generates.

5 - Upload files to your buckets

Select your 'scripts' bucket in the left-hand pane, and click the Upload button in the center pane. Select extension_map.rb, extension_reduce.rb, and setup.sh from the folder on your local machine where you cloned the git project. Click Start Upload, and it should only take a few seconds. Do the same steps for the 'input' bucket and the example_input.txt file.

6 - Create the Elastic MapReduce job

The EMR service actually creates a Hadoop cluster for you and runs your code on it, but the details are mostly hidden behind their user interface. Click on the Elastic MapReduce tab at the top, and then the Create New Job Flow button to get started.

7 - Describe the job

The Job Flow Name is only used for display purposes, so I normally put something that will remind me of what I'm doing, with an informal version number at the end. Leave the Create a Job Flow radio button on Run your own application, but choose Streaming from the drop-down menu.

8 - Tell it where your code and data are

This is probably the trickiest stage of the job setup. You need to put in the S3 URL (the bucket name prefixed with s3://) for the inputs and outputs of your job. Input Location should be the root folder of the bucket where you put the example_input.txt file, in my case 's3://com.petewarden.commoncrawl01input'. Note that this one is a folder, not a single file, and it will read whichever files are in that bucket below that location.

The Output Location is also going to be a folder, but the job itself will create it, so it mustn't already exist (you'll get an error if it does). This even applies to the root folder on the bucket, so you must have a non-existent folder suffix. In this example I'm using 's3://com.petewarden.commoncrawl01output/01/'.

The Mapper and Reducer fields should point at the source code files you uploaded to your 'scripts' bucket, 's3://com.petewarden.commoncrawl01scripts/extension_map.rb' and 's3://com.petewarden.commoncrawl01scripts/extension_map.rb'. You can leave the Extra Args field blank, and click Continue.

9 - Choose how many machines you'll run on

The defaults on this screen should be fine, with m1.small instance types everywhere, two instances in the core group, and zero in the task group. Once you get more advanced, you can experiment with different types and larger numbers, but I've kept the inputs to this example very small, so it should only take twenty minutes on the default three-machine cluster, which will cost you less than 30 cents. Click Continue.

10 - Set up logging

Hadoop can be a hard beast to debug, so I always ask Elastic MapReduce to write out copies of the log files to a bucket so I can use them to figure out what went wrong. On this screen, leave everything else at the defaults but put the location of your 'logging' bucket for the Amazon S3 Log Path, in this case 's3://com.petewarden.commoncrawl01logging'. A new folder with a unique name will be created for every job you run, so you can specify the root of your bucket. Click Continue.

11 - Specify a boot script

The default virtual machine images Amazon supplies are a bit old, so we need to run a script when we start each machine to install missing software. We do this by selecting the Configure your Bootstrap Actions button, choosing Custom Action for the Action Type, and then putting in the location of the setup.sh file we uploaded, eg 's3://com.petewarden.commoncrawl01scripts/setup.sh'. After you've done that, click Continue.

12 - Run your job

The last screen shows the settings you chose, so take a quick look to spot any typos, and then click Create Job Flow. The main screen should now contain a new job, with the status 'Starting' next to it. After a couple of minutes, that should change to 'Bootstrapping', which takes around ten minutes, and then running the job, which only takes two or three.

Debugging all the possible errors is beyond the scope of this post, but a good start is poking around the contents of the logging bucket, and looking at any description the web UI gives you.

Once the job has successfully run, you should see a few files beginning 'part-' inside the folder you specified on the output bucket. If you open one of these up, you'll see the results of the job.

This job is just a 'Hello World' program for walking the Common Crawl data set in Ruby, and simply counts the frequency of mime types and URL suffixes, and I've only pointed it at a small subset of the data. What's important is that this gives you a starting point to write your own Ruby algorithms to analyse the wealth of information that's buried in this archive. Take a look at the last few lines of extension_map.rb to see where you can add your own code, and edit example_input.txt to add more of the data set once you're ready to sink your teeth in.

Big thanks again to Ben Nagy for putting the code together, and if you're interested in understanding Hadoop and Elastic MapReduce in more detail, I created a video training session that might be helpful. I can't wait to see all the applications that come out of the Common Crawl data set, so get coding!

You may have been wondering why I haven't been blogging for over a week. I've got the generic excuse of being busy, but truthfully it's because I've had a draft of this post staring back at me for most of that time. God knows I'm not normally one to shy away from controversy, but I also know how tough it is to talk about racism and sexism without generating more heat than light. After twomore head-slapping examples of our problem appeared just in the last few days, I couldn't hold off any longer. I'm not a good person to talk about explicit discrimination in the tech industry, I'd turn to somebody like Kristina Chodorow, but I have been struck by one of the more subtle reasons we discourage a lot of potential engineers from joining the profession.

I don't get paid for most of the things I spend my time on. I do my blogging, open source coding, and speak at conferences for free, my books provide beer money, and I've only been able to pay myself a small salary for the last few months, after four years of working on startups. This isn't a plea for sympathy, I love doing what I do and see it all as a great investment in the future. I saved up money during my time at Apple precisely so I'd have the luxury of doing all these things.

I was thinking about this when I read Rebecca Murphey's post about the Fluent conference. Her complaints were mostly about things that seemed intrinsic to commercial conferences to me, but I was struck by her observation that the lack of expenses for speakers hits diversity.

I think it goes beyond conferences though (and I've actually found O'Reilly to be far better at paying contributors than most organizers, and they work very hard on discrimination problems). The media industry relies on unpaid internships as a gateway to journalism careers, which excludes a lot of people. Our tech community chooses its high-flyers from people who have enough money and confidence to spend significant amounts of time on unpaid work. Isn't this likely to exclude a lot of people too?

And yes, we do have a diversity problem. I'm not wringing my hands about this out of a vague concern for 'political correctness', I'm deeply frustrated that I have so much trouble hiring good engineers. I look around at careers that require similar skills, like actuaries, and they include a lot more women and minorities. I desperately need more good people on my team, and the statistics tell me that as a community we're failing to attract or keep a lot of the potential candidates.

We're a meritocracy. Writing, speaking, or coding for free helps talented people get noticed, and it's hard to picture our industry functioning without that process at its heart. We have to think hard about how we can preserve the aspects we need, but open up the system to people we're missing right now. Maybe that means setting up scholarships, having a norm that internships should all be paid, setting aside time for training as part of the job, or even doing a better job of reaching larval engineers earlier in education? Is part of it just talking about the career path more explicitly, so that people understand how crucial spending your weekends coding on open source, etc, can be for your career?

I don't know exactly what to do, but when I look around at yet another room packed with white guys in black t-shirts, I know we're screwing up.

From CMS to DMS - Are we moving into an era of Data Management Systems, that play the same interface role for our data that CMS's do for our content?

Drug data reveals sneaky side effects - Drew Breunig pointed me at this example of how bulk data is more than the sum of its parts. By combining a large amount of adverse reaction reports, a large number of new side effects caused by mixing drugs were discovered.

Gisgraphy - An intriguing open-source LGPL project that offers geocoding services based on OpenStreetMap and Geonames information. I look forward to checking this out and having a play.

I'm doing a short talk at SXSW tomorrow, as part of a panel on Creating the Internet of Entities. Preparing is tough because don't I believe it's possible, and even if it was I wouldn't like it. Opposing better semantic tagging feels like hating on Girl Scout cookies, but I've realized that I like an internet full of messy, redundant, ambiguous data.

The stated goal of an Internet of Entities is a web where "real-world people, places, and things can be referenced unambiguously". We already have that. Most pages give enough context and attributes for a person to figure out which real world entity it's talking about. What the definition is trying to get at is a reference that a machine can understand.

The implicit goal of this and similar initiatives like Stephen Wolfram's .data proposal is to make a web that's more computable. Right now, the pages that make up the web are a soup of human-readable text, a long way from the structured numbers and canonical identifiers that programs need to calculate with. I often feel frustrated as I try to divine answers from chaotic, unstructured text, but I've also learned to appreciate the advantages of the current state of things.

Producers should focus on producing

The web is written for humans to read, and anything that requires the writers to stop and add extra tagging will reduce how much content they create. The original idea of the Semantic Web was that we'd somehow persuade the people who create websites to add invisible structure, but we had no way to motivate them to do it. I've given up on that idea. If we as developers want to create something, we should do the work of dealing with whatever form they've published their information in, and not expect them to jump through hoops for our benefit.

I also don't trust what creators tell me when they give me tags. Even if they're honest, there's no feedback for whether they've picked the right entity code or not. The only ways I've seen anything like this work are social bookmarking services like the late, lamented Delicio.us, or more modern approaches like Mendeley, where picking the right category tag gives the user something useful in return, so they have an incentive both to take the action and to do it right.

Ambiguity is preserved

The example I'm using in my talk is the location field on a Twitter profile. It's free-form text, and it's been my nemesis for years. I often want to plot users by location on a map, and that has meant taking those arbitrary strings and trying to figure out what they actually mean. By contrast, Facebook forces users to pick from a whitelist of city names, so there's only a small number of exact strings to deal with, and they even handily supply coordinates for each.

You'd think I'd be much happier with this approach, but actually it has made the data a lot less useful. Twitter users will often get creative, putting in neighborhood, region, or even names, and those let me answer a lot of questions that Facebook's more strait-laced places can't. Neighborhoods are a fascinating example. There's no standard for their boundaries or names, they're a true folksonomy. My San Francisco apartment has been described as being in the Lower Haight, Duboce Triangle, or Upper Castro, depending on who you ask, and the Twitter location field gives me insights into the natural voting process that drives this sort of naming.

There's many other examples I could use of how powerful free-form text is, like the prevalance of "bicoastal" and "flyover countries" as descriptions changes over time, but the key point is that they're only possible because ambiguous descriptions are allowed. A strict reference scheme like Facebook's makes those applications impossible.

Redundancy is powerful

When we're describing something to someone else, we'll always give a lot more information than is strictly needed. Most postal addresses could be expressed as just a long zip code and a house number, but when we're mailing letters we include street, city and state names. When we're talking about someone, we'll say something like "John Phillips, the lawyer friend of Val's, with the green hair, lives in the Tenderloin", when half that information would be enough to uniquely identify the person we mean.

We do this because we're communicating with unreliable receivers, we don't know what will get lost in transmission as the postie drops your envelope in a puddle, or exactly what information will ring a bell as you're describing someone. All that extra information is manna from heaven for someone doing information processing though. For example I've been experimenting with a completely free map of zip code boundaries, based on the fact that I can find latitude/longitude coordinates for most postal addresses using just the street number, name, and city, which gives me a cluster of points for each zip. The same approach works for extra terms used to in conjunction with people or places - there must be a high correlation between the phrases "dashingly handsome man about town" and "Pete Warden" on pages around the web. I'm practically certain. Probably.

Canonical schemes are very brittle in response to errors. If you pick the wrong code for a person or place, it's very hard to recover. Natural language descriptions are much harder for computers to deal with, but they not only are far more error-resistant, the redundant information they include often has powerful applications. The only reason Jetpac can pick good travel photos from your friends is that the 'junk' words used in the captions turned out to be strong predictors of picture quality.

Fighting the good fight

I'm looking forward to the panel tomorrow, because all of the participants are doing work I find fascinating and useful. Despite everything I've said, we do desperately need better standards for identifying entities, and I'm going to do what I can to help. I just see this as a problem we need to tackle more with engineering than evangelism. I think our energy is best spent on building smarter algorithms to handle a fallen world, and designing interchange formats for the data we do salvage from the chaos.

The web is literature; sprawling, ambiguous, contradictory, and weird. Let's preserve those as virtues, and write better code to cope with the resulting mess.

Hard science, soft science, hardware, software - I have a blog crush on John D. Cook's site, it's full of thought-provoking articles like this. As someone who's learned a lot from the humanities, I think he gets the distinction between the sciences exactly right. Disciplines that don't have elegant theoretical frameworks and clear-cut analytical tools for answering questions do take a lot more work to arrive at usable truths.

Don't fear the web - A good overview of moral panics on the internet, and how we should react to the dangers of new technology.

Using regression isolation to decimate bug time-to-fix - Once you're dealing with massive, interdependent software systems, there's a whole different world of problems. This takes me back to my days of working with multi-million line code bases, automating testing and bug reporting becomes essential.

Humanitarian OpenStreetMap Team - I knew OSM did wonderful work around the world, but I wasn't aware of HOT until now, great to see it all collected in one place.

Open Data Handbook Launched - I love what the Open Knowledge Foundation are doing with their manuals. Documentation is hard and unglamorous work, but has an amazing impact. I'm looking forward to their upcoming title on data journalism.

My first poofer Workshop - This one's already gone, but I'm hoping there will be another soon. I can't think of a better way to spend an afternoon than learning to build your very own ornamental flamethrower.

Using photo networks to reveal your home town - Very few people understand how the sheer volume of data that we're producing makes it possible to produce scarily accurate guesses from seemingly sparse fragments of information. When you look at a single piece in isolation it looks harmless, but pull enough together and the result becomes very revealing.

Introducing SenseiDB - Another intriguing open-source data project from LinkedIn. There's a strong focus on the bulk loading process, which in my experience is the hardest part to engineer. Reading the documentation leaves me wanting more information on their internal DataBus protocol, I bet that includes some interesting tricks.

IPUMS and NHGIS - As someone who recently spent far too long trying to match the BLS's proprietary codes for counties with the US Census's FIPS standard, I know how painful the process of making statistics usable can be. There's a world of difference between a file dumps in obscure formats with incompatible time periods and units, and a clean set that you can perform calculations on. I was excited to discover the work being done at the University of Minnesota to create unified data sets that cover a long period of time, and much of the world.

Roger Magoulas asked me an interesting question during Strata - what was the biggest theme that emerged from this year's gathering? It took a bit of thought, but I realized that I was seeing a lot of people from all kinds of professions and organizations becoming conscious and open about their identity as data scientists.

The term itself has received a lot of criticism and there's always worries about 'big-data-washing', but what became clear from dozens of conversations was that it's describing something very real and innovative. The people I talked to came from professions as diverse as insurance actuaries, physicists, marketers, geologists, quants, biologists, web developers, and they were all excited about the same new tools and ways of thinking. Kaggle is concrete proof that the same machine-learning skills can be applied across a lot of different domains to produce better results than traditional approaches, and the same is being proved for all sorts of other techniques from NoSQL databases to Hadoop.

A year ago, your manager would probably roll her eyes if you were in a traditional sector and she caught you experimenting with the standard data science tools. These days, there's an awareness and acceptance that they have some true advantages over the old approaches, and so people have been able to make an official case for using them within their jobs. There's also been a massive amount of cross-fertilization, as it's become clear how transferrable across domains the best practices are.

This year thousands of people across the world have realized they have problems and skills in common with others they would never have imagine talking to. It's been a real pleasure seeing so much knowledge being shared across boundaries, as people realize that 'data scientist' is a useful label for helping them connect with other people and resources that can help with their problems. We're starting to develop a community, and a surprising amount of the growth is from those who are announcing their professional identity as data scientists for the first time.