Gists:

BTW, the double nested loop here is about 25% faster than two smaller loops with heavy vectorization. I'm not sure why, but I occasionally find that explicit loops are actually more efficient than big matrix calculations.
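
For anyone who wants to check a comparison like this themselves, here is a minimal timing sketch. It is not the code from the gists: the language (Python), the function names, and the toy computation (a sum of squared pairwise differences) are all assumptions made for illustration. The "staged" version only stands in for a vectorized formulation by building a full intermediate row before reducing it, so which version wins, and by how much, will depend on the language, library, and problem size.

```python
import timeit


def pairwise_sum_nested(xs, ys):
    """Double nested loop: accumulate directly, with no intermediate storage."""
    total = 0
    for x in xs:
        for y in ys:
            d = x - y
            total += d * d
    return total


def pairwise_sum_staged(xs, ys):
    """Two single-level loops, each handling a whole row at a time.

    Hypothetical stand-in for a vectorized version: each pass
    materializes a complete intermediate list before the next step.
    """
    rows = []
    for x in xs:                      # loop 1: build one row of differences per x
        rows.append([x - y for y in ys])
    total = 0
    for row in rows:                  # loop 2: square and reduce each stored row
        total += sum([d * d for d in row])
    return total


def time_it(fn, *args, repeat=5, number=10):
    """Return the best observed time per call, in seconds."""
    runs = timeit.repeat(lambda: fn(*args), repeat=repeat, number=number)
    return min(runs) / number


if __name__ == "__main__":
    xs = list(range(300))
    ys = list(range(300))
    # Confirm both versions agree before comparing their speed.
    assert pairwise_sum_nested(xs, ys) == pairwise_sum_staged(xs, ys)
    print("nested loop:", time_it(pairwise_sum_nested, xs, ys))
    print("staged:     ", time_it(pairwise_sum_staged, xs, ys))
```

Taking the minimum over several repeats cuts down on noise from other processes, and checking that both versions return the same value first avoids timing two computations that are not actually equivalent.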