Data Mining and Taco Bell Programming

Programmer Ted Dziuba suggests an alternative to traditional programming that he calls “Taco Bell Programming.” The Taco Bell chain creates its entire menu from about eight ingredients. Dziuba wants to create many applications from combinations of about eight shell commands.

Here’s an example from Dziuba:

Here’s a concrete example: suppose you have millions of web pages that you want to download and save to disk for later processing. How do you do it? The cool-kids answer is to write a distributed crawler in Clojure and run it on EC2, handing out jobs with a message queue like SQS or ZeroMQ.

The Taco Bell answer? xargs and wget. In the rare case that you saturate the network connection, add some split and rsync. A “distributed crawler” is really only like 10 lines of shell script.
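Under a couple of stated assumptions (a file called urls.txt with one URL per line, which is not from the original post), the core of that shell-script crawler might be sketched like this; the leading echo turns it into a dry run that prints each wget command instead of touching the network:

```shell
# Sketch of Dziuba's xargs + wget crawler. urls.txt is an assumed input
# file with one URL per line; the URLs below are placeholders.
printf 'http://example.com/a\nhttp://example.com/b\n' > urls.txt

# -n1: one URL per wget invocation; -P8: up to eight downloads in parallel.
# The leading echo makes this a dry run; delete it to actually download.
xargs -n1 -P8 echo wget -q < urls.txt
```

Dropping the echo and pointing urls.txt at real pages is the whole crawler; split and rsync only enter the picture when one machine's network link saturates.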

Dziuba gives another example. Instead of using Hadoop to process that data once you have it, you can use:

find crawl_dir/ -type f -print0 | xargs -n1 -0 -P32 ./process
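To make the pattern concrete, here is a self-contained toy run in which wc -w stands in for ./process (the directory contents and the word-count worker are assumptions for illustration, not Dziuba's actual job):

```shell
# Toy version of the pipeline above: wc -w plays the role of ./process.
mkdir -p crawl_dir
printf 'hello world\n' > crawl_dir/page1.html
printf 'taco bell programming\n' > crawl_dir/page2.html

# -print0 and -0 keep filenames with spaces intact; -n1 hands each worker
# exactly one file; -P4 runs four workers in parallel (Dziuba uses -P32).
find crawl_dir/ -type f -print0 | xargs -n1 -0 -P4 wc -w
```

Each worker prints a per-file word count; piping the result through something like awk '{s+=$1} END {print s}' would total them.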

“It is a viable way to deal with massive data problems, at least for one-off jobs,” big data expert and ReadWriteWeb contributor Pete Warden says of Dziuba’s Taco Bell programming concept. “You’re trading off the ability to manage and tightly control the process against development speed.”

The question posed by a new generation of news readers, who now depend on online sources more than any other, is whether the editorial process for deciding the precedence of articles in a publication – for deciding what you read, and when you read it – still matters. In a world full of thousands of “sources,” some of them actually…

Calais, a project sponsored by Reuters, offers a few handy plugins that let you use its API to auto-tag all the posts on your blog (see our coverage). It goes through your content, extracts the relevant keywords, and adds them as tags in your CMS.

But Open Calais isn’t open source. Here are a few open source tools you can use to extract key…
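In the same Taco Bell spirit, the crudest possible keyword extractor is itself just a shell pipeline: a word-frequency count (a toy sketch, not one of the open source tools referred to above; post.txt and its contents are made up for illustration):

```shell
# Toy keyword extraction in pure shell: most frequent words in a post.
# post.txt and its contents are placeholders.
printf 'Taco Bell programming uses taco bell ingredients\n' > post.txt

# Split on non-letters, lowercase, count, and rank; real extraction tools
# add stemming and stop-word removal on top of exactly this idea.
tr -cs 'A-Za-z' '\n' < post.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn | head -5
```

The top of the ranking is the post's rough tag set; against real prose you would first filter out stop words like “the” and “a.”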