Here are the straight-off-the-camera pictures from the Galapagos and time in and around Quito. At some point I'll filter those down to the best 50 or so and add a writeup about the trip. Short version... it was incredible!

Apache Hive is a very powerful tool for processing data stored in Apache Hadoop. Structured and unstructured data can be accessed, processed, and manipulated using a SQL-like query language. This architecture allows anyone with reasonable SQL knowledge to write complex jobs with little to no knowledge of Hadoop, HDFS, or Hive internals.

-- Create a summary of purchases by product and day
SELECT day, product, count(*) as purchases, sum(revenue) as revenue, avg(revenue) as average_revenue_by_purchase
FROM purchases
WHERE day between '2014-01-01' and '2014-01-31'
GROUP BY day, product
ORDER BY product, day;

With only SQL knowledge, very powerful data extractions can be performed, but SQL alone can be limiting.

Enter UDFs

Hive supports User-Defined Functions (UDFs) as a way to add more complex capabilities than are available in SQL. There is a good list of included UDFs. Here at LivingSocial we have a few of our own in our HiveSwarm project, and there are a few other bundles we like: Klout's Brickhouse and stewi2's UDFs. These UDFs greatly expand the capabilities of Hive by adding features not available in standard SQL.

Developing new UDFs requires knowledge of the Hive API and Java. This is fine for Hadoop developers, but is a significant barrier for non-Java developers.

Enter Scripted UDFs

Java supports running code in other languages through the javax.script API. Our HiveSwarm project has a new scripted UDF that makes use of this, so UDFs can be written in languages other than Java. As a bonus, this allows scripts to be placed directly in the Hive SQL instead of being compiled and jarred before they can be run.
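As a rough sketch of what a scripted UDF body looks like: the script just defines an evaluate function that gets invoked for each row with the column values passed to the UDF. Outside of Hive it is plain Ruby, so it can be tested directly (the query shown in the comment is illustrative, not a real table):

```ruby
# Minimal scripted UDF body: evaluate is called once per row with the
# column values that were passed as arguments to the UDF in the query.
def evaluate(s)
  s.to_s.reverse
end

# Locally it is just a Ruby function, so it can be exercised directly:
puts evaluate("hive")  # prints "evih"
```

In a query this script would be passed to scriptedUDF as a string, along with the language name, the Hive return type, and the column arguments, in the same argument order used in the fuller example below.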

The following example shows how a scripted UDF can be used to compute data that would normally be difficult to generate with SQL alone.

create temporary function scriptedUDF as 'com.livingsocial.hive.udf.ScriptedUDF';
-- Gather complex data combining groups and individual rows without joins
select person_id, purchase_data['time'], purchase_data['diff'],
purchase_data['product'], purchase_data['purchase_count'] as pc,
purchase_data['id_json']
from (
select person_id, scriptedUDF('
require "json"
def evaluate(data)
# This gathers all the data about purchases by person in one place so
# complex information can be gathered while avoiding complex joins
# Note: In order for this to work all the data passed into
# scriptedUDF for a row needs to fit into memory
tmp = [] # convert things over to a ruby array
tmp.concat(data)
tmp.sort_by! { |a| a.get("time") } # for the time differences
last = 0
tmp.map{ |row|
# Compute the time difference between purchases and add the total
# purchase count per person
t = row["time"]
# The parts that would be much more difficult to generate with SQL
row["diff"] = t - last
row["purchase_count"] = tmp.length
row["first_purchase"] = tmp[0]["time"]
row["last_purchase"] = tmp[-1]["time"]
# This shows that built-in libraries are available
row["id_json"] = JSON.generate({"id" => row["id"]})
last = t
row
}
end', 'ruby', 'array<map<string,string>>',
-- gather all the data about purchases by people so it can all be
-- passed into the evaluate function
bh_collect(map( -- Note, bh_collect is from Klout's Brickhouse
-- and allows collecting any type,
-- see https://github.com/klout/brickhouse/
'id', id, -- the script's id_json field needs an id per purchase (column name assumed)
'time', purchase_time,
'product', product_id)) ) as all_data
from purchases
group by person_id
) foo
-- explode the data back out so it is available in flattened form
lateral view explode(all_data) bar as purchase_data
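Because the embedded script is plain Ruby, its logic can be debugged outside Hive before being pasted into a query. The sketch below runs the same evaluate logic on made-up sample rows standing in for what bh_collect would hand in for one person (the field names match the query; the values are purely illustrative):

```ruby
require "json"

# Same logic as the embedded script: given all of one person's purchases,
# compute per-row time differences plus per-person aggregates.
def evaluate(data)
  tmp = [] # convert things over to a ruby array
  tmp.concat(data)
  tmp.sort_by! { |a| a["time"] } # for the time differences
  last = 0
  tmp.map do |row|
    t = row["time"]
    row["diff"] = t - last
    row["purchase_count"] = tmp.length
    row["first_purchase"] = tmp[0]["time"]
    row["last_purchase"] = tmp[-1]["time"]
    row["id_json"] = JSON.generate({ "id" => row["id"] })
    last = t
    row
  end
end

# Illustrative stand-in for one person's collected purchase data
rows = evaluate([
  { "id" => 2, "time" => 250, "product" => "b" },
  { "id" => 1, "time" => 100, "product" => "a" },
])
puts rows.map { |r| r["diff"] }.inspect # prints [100, 150]
```

Inside Hive the same rows arrive as Java maps rather than Ruby hashes, but JRuby lets the script index them the same way, so this local version behaves like the embedded one.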

Last weekend I competed in the Musselman Triathlon. This was my second half Ironman distance triathlon and it was a huge improvement over my first one. Before I get to my results a little about the race.

Everything race-wise went well. The only minor things that went wrong were forgetting to start my watch on time, forgetting to take off my gloves in T2, and some minor cramping on the run. All of those are things I can fix. I think I'm ready to go for the big race in a few weeks.

I'm not a big fan of running. The last few weeks I've been running with Zombies, Run!. This has made running much more enjoyable.

The story in the game is a great distraction on a short or long run. I run with the zombie chases enabled which adds an extra level of excitement that keeps you on your toes listening for the "Zombies Detected" warning.

On a camping trip to Shenandoah National Park this weekend I got some much-needed photography practice. It was hardly fair out there; the deer are very tame, and I was able to get very close for these shots. There are a few more shots up on Flickr.

The retina display got me. I could ignore the faster processor and the nifty cover of the iPad 2 and stick with my first-generation one; I did the same with the iPhone 3G and 3GS. I wasn't sure how much of a difference the retina display would make, but the iPhone 4 got me hooked. I'm looking forward to not being able to see the pixels on the new iPad.

I've been trying out groups on LinkedIn, but for Hadoop and Java the posts are dominated by recruiters and minor flame wars. The Hadoop and associated technology mailing lists are far more informative and useful. It looks like Twitter is another place to see what's going on in Hadoop land. I need to spend more time there.

Whenever I'm looking for a new job I'll head back to LinkedIn. Until then I need something better.

I'm moving everything over from my old host to Squarespace. I'm trying to get in the habit of posting more and this move will make it a lot easier.

I had a lot of old pictures up on my old site and I haven't figured out what to do with them. Most are not that good, but there are a few I really want to keep. At some point I'll either move them in here or switch to Flickr or Picasa to host them.