2016-07-31T23:57:38-07:00http://earlh.com/blog/Octopress2015-08-08T13:55:38-07:00http://earlh.com/blog/2015/08/08/splitting-audio-with-ffmpegHere’s a quick utility to use a set list and ffmpeg to split single audio files into multiple tracks. It splits audio files via a setlist then sets the song name, artist, album id3 tags. The script is crude, but it’s a quick start.

]]>2015-06-13T12:29:49-07:00http://earlh.com/blog/2015/06/13/vi-swapping-order-of-counts-and-labelsA note to myself: if you have a table of counts and labels, perhaps created by something | something | uniq -c, you can swap the order of the labels and the counts / change the order of columns in vi via the following regex.


1 labelone
2 labeltwo
3 labelthree
99 label99

highlight in visual mode with V then run the following regex: s/\(\d\+\)\s\+\([a-zA-Z0-9_]*\)/\2 \1/


labelone 1
labeltwo 2
labelthree 3
label99 99

or turn them into the correct format for a python dict via s/\(\d\+\)\s\+\([a-zA-Z0-9_]*\)/'\2': \1,/


'labelone': 1,
'labeltwo': 2,
'labelthree': 3,
'label99': 99,
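For use outside vi, the same two substitutions can be sketched with Python's `re` module. This is only an illustration: the pattern is a direct translation of the vi regex above, and the function names `swap` and `as_dict_entry` are made up for this example.

```python
import re

# Python spelling of the vi pattern \(\d\+\)\s\+\([a-zA-Z0-9_]*\)
PATTERN = re.compile(r"(\d+)\s+([a-zA-Z0-9_]*)")


def swap(line):
    """Turn '3 labelthree' into 'labelthree 3' (first match only, like :s without g)."""
    return PATTERN.sub(r"\2 \1", line, count=1)


def as_dict_entry(line):
    """Turn '3 labelthree' into "'labelthree': 3," for pasting into a dict literal."""
    return PATTERN.sub(r"'\2': \1,", line, count=1)


rows = ["1 labelone", "2 labeltwo", "3 labelthree", "99 label99"]
swapped = [swap(row) for row in rows]
entries = [as_dict_entry(row) for row in rows]
print("\n".join(swapped))
print("\n".join(entries))
```

As in vi, only the matched part of each line is rewritten, so any leading whitespace from `uniq -c` is left in place.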

]]>2015-06-11T18:55:00-07:00http://earlh.com/blog/2015/06/11/shell-utilities-for-data-analysisQuick utilities to help with data analysis from the shell:

Print numbered column names of a csv or tsv. You can specify a file or it will read from stdin. It will also guess the separator, whichever of tab or comma is more common; or you may specify with --separator. This is particularly useful if you want to use awk to select columns.

#!/usr/bin/python
# if you use macports, you probably want the first line to be exactly #!/opt/local/bin/python
# Copyright 2015 Earl Hathaway rblog Ray at Bans earlh dot com (take my sunglasses off to email me)
# License: The author or authors of this code dedicate any and all copyright interest in this code to the public domain.
#
# print numbered column names or headers from a file or stdin if present with an optional field separator
# tested to work with python from 2.7 to 3.4
from __future__ import print_function

import argparse
import os.path
import sys

# if stdin is not a tty, something is being piped in
stdin = not sys.stdin.isatty()

parser = argparse.ArgumentParser(description='print numbered column headers')
parser.add_argument('file', nargs='?', help='filename (default: stdin if not a tty)')
parser.add_argument('--separator', dest='separator', nargs=1, help='specify the field separator (default: whichever of comma or tab is more common)')
parser.add_argument('--python_dict', dest='pydict', action="store_true", help='emit a python dict?')
args = parser.parse_args(sys.argv[1:])

if args.file is not None and not os.path.isfile(args.file):
    print('File "%s" does not exist' % args.file)
    sys.exit(0)

# only the header row is needed; piped input wins over a named file
first = None
if stdin:
    first = sys.stdin.readline()
elif args.file is not None:
    with open(args.file, 'r') as f:
        first = f.readline()
else:
    print('no file specified and nothing on stdin')
    parser.print_help()
    sys.exit(0)

# guess the separator unless one was given; ties go to tab
sep = None
if args.separator is None:
    n_comma = first.count(',')
    n_tabs = first.count('\t')
    sep = "\t" if n_tabs >= n_comma else ","
else:
    sep = args.separator[0]

fields = first.split(sep)

# emit a python dict to copy into code; should be zero based
if args.pydict:
    pydict = '{' + (', '.join(['\'%s\': %d' % (val.strip(), idx) for idx, val in enumerate(fields)])) + '}'
    print(pydict)
    sys.exit(0)

# pad every index to the width of the largest one so the fields don't stagger
width = len(str(len(fields)))
fmt = ' %%%dd %%s' % width
for idx, val in enumerate(fields):
    print(fmt % (idx + 1, val.strip()))

I recently switched to fastmail in lieu of gmail, mostly because I increasingly dislike google’s stance on privacy, their integration between products, and their ongoing updates to gmail. I unfortunately updated gmail on my phone, and their new material design ethos was designed by an idiot who thinks that they should have whitespace everywhere, wasting tons of space already in short supply. I now can only see 5.5 messages in the inbox view, whereas I used to be able to see 8, an incredibly annoying change in the most important screen. So I switched.

Positives

gmail shrunk the view window on android for some stupid flat design rationale; they appear to assume everyone reads email on a 6 inch phone

Negatives

fastmail pretends to be a gmail style email client where the unit of manipulation is a conversation, not a message. But the underlying message orientation peeks through in many cases.

When deleting a conversation, it has repeatedly asked if I want to delete the entire conversation (what else would I want?) and had a Yes/No for don’t ask me again. I’ve clicked “don’t ask again” at least 3 times. It doesn’t take.

if you archive a conversation, the sent emails also move to archive out of sent. This is wrong.

Settings feels like my first javascript project.

routing rules have to be very simple and sometimes don’t work.

The UI for setting up routing rules is shit; you have to add them, click, add, then scroll to the top of a very long page and click “apply all changes” for the rules to take (yes, I missed that while porting rules from my old webmail and had to redo 40+ rules). It’s essentially two-phase commit ala git; not at all what I expected for a webmail ui.

The rules don’t work as you would expect: eg messages from “a@b.com” do not match “sender ends with” “b.com”.

Rules can only filter on one thing at a time — no compound rules on eg sender and subject. When you create a rule, it doesn’t offer to apply to existing messages in the inbox.

Rules can’t use “or”. So if you filter on receiver, you can’t say a@b.com or b@b.com or c@b.com. Instead, you have to have one rule per each. By the time you have 100+ of these, it’s damn annoying.

spam filtering is crappy:

When you mark something as not spam, it is delivered to your inbox and skips rules.

there’s no ability to sort by spam score. Hopefully the most likely nonspam would have the lowest score, so it would be convenient to sort by that to find nonspam.

the spam filter doesn’t learn: I’ve had to mark a loan payment confirmation email as not-spam every single month I’ve used fastmail

It sometimes loses the send button while composing messages.

No option to “filter emails like this”; instead, you have to copy and paste eg the address you want to filter into a screen 3 clicks away.

By default, it doesn’t load images in html email. There is a link that tries to load the images in the email you’re viewing; it works perhaps 2/3 of the time.

The rich editor is crap.

For just one of a long list: paste tsv data in there; it strips all the tabs. Awesome. So a b c pastes as abc. Wat?

the mobile site on firefox lags typing like 10+ seconds if you have a quoted reply in the message box. It’s strictly amateur hour.

The application disables access to files on mobile phones even after requesting the desktop site in firefox. Surprise!

Potential Dealbreakers:

They attempt to monetize security in an incredibly stupid way. If you set up two-factor authentication to text a code to your phone, they charge for the sms messages (0.12 each!) even on $40/year accounts. That’s just chintzy. Better yet, because they’re run by cheap dicks, purchased sms credits expire after a year!!! When I saw that it felt like purchasing a prepaid cellphone at a gas station level cheap. They’re seriously pricing at 1600% of the pricing twilio has on their web page for joe-random-user, not even considering volume discounts. Monetizing security makes you an asshole.

Their calendar implementation doesn’t understand meeting requests from Outlook. For example, I got a meeting request for 3pm PDT (sent as 2200 Greenwich; see excerpt from the calendar invite below) that Fastmail interpreted as 2pm PDT / 10pm BST. What on earth? Exchange is only the most common professional calendar server; why would you assume fastmail interoperates with Outlook?


DTSTART;TZID=Greenwich Standard Time:20150729T220000
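The arithmetic behind that complaint can be checked with a small sketch. It assumes that "Greenwich Standard Time" means UTC+0 with no daylight saving (the "2200 Greenwich" reading above), and it hard-codes the late-July offsets for BST and PDT instead of using a timezone database. It says nothing about how Fastmail actually parses invites; it only shows that reading the same wall-clock time as UK local time reproduces the one-hour error.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets valid for late July 2015 (assumption: hard-coded, no tz database).
UTC = timezone.utc
BST = timezone(timedelta(hours=1), "BST")
PDT = timezone(timedelta(hours=-7), "PDT")

# Wall-clock time from the invite: DTSTART ... 20150729T220000
start = datetime(2015, 7, 29, 22, 0)

# Intended reading: 22:00 at UTC+0.
as_greenwich = start.replace(tzinfo=UTC).astimezone(PDT)
# Misreading: 22:00 UK local time, which is BST in July.
as_london = start.replace(tzinfo=BST).astimezone(PDT)

print("read as UTC+0:", as_greenwich.strftime("%H:%M %Z"))
print("read as BST:  ", as_london.strftime("%H:%M %Z"))
```

The first reading lands at 3pm PDT, the second at 2pm PDT, which matches the discrepancy described above.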

In summary, there are just a lot of annoyances that make me assume the devs don’t use their own product or they’d fix it out of sheer annoyance. But they don’t sell your information, or decide to shrink the number of messages viewable in your inbox in order to conform to some stupid corporate design ethos.

]]>2015-01-18T22:51:52-08:00http://earlh.com/blog/2015/01/18/regression-questions-logistic-01Assume we have a logistic regression of the form $\beta_0 + \beta_1 x$, and for value $x_0$ we predict success probability $p(x_0)$.
Which of the following is correct?

Assume we run a logistic regression on the 1-dimensional data below. What happens?

a) $-\infty < \beta_0 < \infty$; $\beta_1 \rightarrow \infty$

b) $\beta_0 = 0$, $\beta_1 = 0$

c) $\beta_0 = 0$; $\beta_1 \rightarrow -\infty$

d) none of the above

]]>2015-01-17T22:51:52-08:00http://earlh.com/blog/2015/01/17/regression-questions-coin-teaserThis is a straightforward question that elucidates whether you understand regression, particularly the ceteris paribus interpretation of multiple regression.

let $Y$ be the total value of change in your pocket;

let $X_1$ be the total number of coins;

let $X_2$ be the total number of pennies, nickels, and dimes.

Now, regress $Y$ on $X_1$ or $Y$ on $X_2$ alone. Both $\beta_1$ and $\beta_2$ would be positive.

If you regress $Y$ on $X_1$ and $X_2$ together (ie $Y \sim X_1 + X_2$), what are the signs of $\beta_1$ and $\beta_2$?

Consider holding $X_2$ constant: if $X_1$ increases by 1 while the number of pennies, nickels, and dimes stays the same, the extra coin must be a quarter, so $Y$ surely increases. Therefore $\beta_1$ is positive.

Now consider holding $X_1$ constant and increasing $X_2$. If the number of pennies, nickels, and dimes increases while the total number of coins stays constant, you’re replacing quarters with a lower valued coin. Thus increasing $X_2$ decreases $Y$, so it is entirely possible that $\beta_2$ is negative.
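The sign flip can be checked numerically. The sketch below is a simulation under assumptions that are not in the original post: pockets hold only pennies, nickels, dimes, and quarters, and the count of each is drawn uniformly from 0 to 10. The small `ols` helper solves the normal equations in pure Python so the example needs no third-party library.

```python
import random


def ols(rows, ys):
    """Least-squares coefficients via the normal equations (Gauss-Jordan elimination)."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, ys))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting: bring the largest entry in this column onto the diagonal.
        pivot = max(range(col, k), key=lambda i: abs(a[i][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for i in range(k):
            if i != col:
                factor = a[i][col] / a[col][col]
                a[i] = [u - factor * v for u, v in zip(a[i], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


random.seed(0)
VALUES = {"penny": 1, "nickel": 5, "dime": 10, "quarter": 25}  # cents

# Assumed data-generating process: 0 to 10 of each coin, independently.
pockets = [{coin: random.randint(0, 10) for coin in VALUES} for _ in range(500)]
y = [sum(VALUES[coin] * n for coin, n in p.items()) for p in pockets]
x1 = [sum(p.values()) for p in pockets]                        # all coins
x2 = [p["penny"] + p["nickel"] + p["dime"] for p in pockets]   # non-quarters

_, b1_alone = ols([[1, a] for a in x1], y)
_, b2_alone = ols([[1, b] for b in x2], y)
_, b1, b2 = ols([[1, a, b] for a, b in zip(x1, x2)], y)

print("Y ~ X1:      beta1 = %.2f" % b1_alone)
print("Y ~ X2:      beta2 = %.2f" % b2_alone)
print("Y ~ X1 + X2: beta1 = %.2f, beta2 = %.2f" % (b1, b2))
```

Under these assumptions both single-variable slopes come out positive, while in the joint regression $\beta_1$ stays positive and $\beta_2$ turns negative, as the ceteris paribus argument predicts.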

Updated 26 August 2015.

]]>2014-12-01T17:37:31-08:00http://earlh.com/blog/2014/12/01/interview-questions-in-rPreviously, I wrote about a common interview question: given an array of words, output them in decreasing frequency order, and I provided solutions in java, java8, and python.

Here’s the reason I love R: this can be accomplished in 3 lines of code.

java8 also massively cleans up some common operations. A common interview question is given an array or list of words, print them in descending order by count, or return the top n sorted by count descending. A standard program to do this may go like this: create a map from string to count; reverse the map to go from count to array of words with that count, then descend to the correct depth.
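The standard approach described above (count, invert the map, walk the counts from the top) can be sketched in Python. This is an illustration, not the solution from the earlier post; the function name `top_words` and the alphabetical tie-break are assumptions.

```python
from collections import Counter, defaultdict


def top_words(words, n=None):
    """Return (word, count) pairs in descending count order, optionally only the top n."""
    counts = Counter(words)                  # map from word to count
    by_count = defaultdict(list)             # reversed map: count -> words with that count
    for word, count in counts.items():
        by_count[count].append(word)

    ranked = []
    for count in sorted(by_count, reverse=True):   # descend through the counts
        for word in sorted(by_count[count]):       # alphabetical tie-break (an assumption)
            ranked.append((word, count))
    return ranked if n is None else ranked[:n]


words = "the cat and the hat and the bat".split()
print(top_words(words, 2))
```

In practice `Counter(words).most_common(n)` does the same job in one call; the longer form is shown only because it mirrors the steps in the text.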

]]>2014-08-21T14:42:53-07:00http://earlh.com/blog/2014/08/21/probability-problems-coin-flips-01You have an urn with 10 coins in it: 9 fair, and one that is heads only. You draw a coin at random from the urn, then flip it 5 times. What is the probability that you get a head on the 6th flip given you observed head on each of the first 5 flips?

Let $H_i$ be the event we observe head on the $i$th flip, and let $C_i$ be the event we draw the $i$th coin, $i = 1,…,10$.
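One way to finish the calculation is to condition on which kind of coin was drawn: $P(H_6 \mid H_1,\dots,H_5) = P(H_1,\dots,H_6) / P(H_1,\dots,H_5)$, where each joint probability is a mixture over the coins. The sketch below is not necessarily the derivation the post went on to give; it lumps the nine fair coins together instead of tracking each $C_i$ separately, and uses exact fractions to avoid rounding.

```python
from fractions import Fraction

P_TRICK = Fraction(1, 10)   # probability of drawing the heads-only coin
P_FAIR = Fraction(9, 10)    # probability of drawing any one of the nine fair coins


def p_all_heads(n):
    """P(first n flips are all heads), mixing over the coin that was drawn."""
    return P_TRICK * 1 + P_FAIR * Fraction(1, 2) ** n


# P(H6 | H1..H5) = P(H1..H6) / P(H1..H5)
posterior = p_all_heads(6) / p_all_heads(5)
print(posterior, float(posterior))
```

This gives 73/82, or about 0.89: five heads in a row make it much more likely that the heads-only coin was drawn.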

what on earth? It turns out the winner is this beautiful bit of syntax:


bool learn_tree(const float* const* target, unsigned int num_classes);

beautiful.

So for all future googlers, this is how you declare const double arrays or const multidimensional arrays in c++.

]]>2014-03-26T17:52:09-07:00http://earlh.com/blog/2014/03/26/splitting-files-with-awkTo split files (eg for test / train splits or k-folds) without having to load into R or python, awk will do a fine job.

For example, to crack into 16 equal parts using modulus to assign rows to files:

And finally, if your data file has a header that you don’t want to end up in a random file, you can dump the header row into both files, then tell your awk script to append (and use tail to skip the header row)
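The row-assignment logic can be sketched as follows. The post does this in awk; Python is used here purely for illustration, and the function names and the `path.N` output naming are hypothetical. The file is streamed one line at a time, so nothing is loaded into memory.

```python
def assign_rows(lines, parts, has_header=False):
    """Yield (part_index, line); a header line is yielded once for every part."""
    lines = iter(lines)
    if has_header:
        header = next(lines, None)
        if header is not None:
            for part in range(parts):
                yield part, header
    # Modulus assignment: data row i goes to part i % parts.
    for row_number, line in enumerate(lines):
        yield row_number % parts, line


def split_file(path, parts, has_header=False):
    """Write path.0 ... path.(parts-1); the output naming is hypothetical."""
    outputs = [open("%s.%d" % (path, i), "w") for i in range(parts)]
    try:
        with open(path) as source:
            for part, line in assign_rows(source, parts, has_header):
                outputs[part].write(line)
    finally:
        for handle in outputs:
            handle.close()


demo = list(assign_rows(["id,value\n", "1,a\n", "2,b\n", "3,c\n"], 2, has_header=True))
print(demo)
```

Calling `split_file("data.csv", 16, has_header=True)` would give 16 roughly equal files, each starting with the header row.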

For googlers who want to move from wordpress to octopress, here’s how I moved 70-odd posts with minimal pain.

1 – Get thomasf’s excellent python script (accurately named exitwp) that converts wordpress posts to octopress posts. This will create one octopress post per wordpress post in the source directory.

2 – I simultaneously moved urls from blog.earlh.com to earlh.com/blog so I needed to 301 all the old posts. I did that by getting this awesome wordpress post exporter script contributed by Mike Schinkel. I curled that to create a list of urls to forward, then built a tsv of pairs of oldurl\tnewurl. Then the awk script below will print nginx forward rules:

Add them to your site nginx.conf file inside the server configuration block.
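As a rough idea of what such a generator does, here is a sketch that turns oldurl\tnewurl pairs into nginx redirect rules. It is not the awk script from the post: the choice of exact-match `location = ... { return 301 ...; }` blocks is an assumption about the rule style, the function names are made up, and the old url in the demo is hypothetical.

```python
from urllib.parse import urlsplit


def nginx_redirect(old_url, new_url):
    """Build one exact-match location block that 301s the old path to the new url."""
    path = urlsplit(old_url).path or "/"
    return "location = %s { return 301 %s; }" % (path, new_url)


def rules_from_tsv(lines):
    """Yield one rule per non-empty 'oldurl<TAB>newurl' line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        old_url, new_url = line.split("\t")
        yield nginx_redirect(old_url, new_url)


# Hypothetical old url; the new url follows the earlh.com/blog layout.
pairs = ["http://blog.earlh.com/old-post/\thttp://earlh.com/blog/2013/06/17/moving-to-octopress\n"]
rules = list(rules_from_tsv(pairs))
print("\n".join(rules))
```

Paths containing spaces or other special characters would need quoting before being pasted into an nginx config; the sketch does not handle that.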

I’ll update with solutions for better image embedding.

]]>2014-03-24T12:21:00-07:00http://earlh.com/blog/2014/03/24/c-plus-plus-is-horrificI’m poking at some c++ after not touching it for a decade. c++11 has apparently gotten roughly as capable as java pre 2000; it now can create threads! But the error messages. Oh, the error messages

so yeah, you can’t copy thread objects, enforced by std::thread having a deleted copy constructor. Still, the amount of knowledge it takes to translate from the error message to the error is pretty amazing.

]]>2014-03-21T12:59:00-07:00http://earlh.com/blog/2014/03/21/replacing-sort-|-uniqA code snippet: when poking at columnar data in the shell, you’ll often find yourself asking questions like what are the unique values of a particular column, or the unique values and their counts. R would accomplish this via unique or table, but if your data is largish it may be quite annoying to load into R. I often use bash to quickly pick out a column, ala

$ cat data.csv | awk -F, '{print $8}' | sort | uniq -c | sort -r -n

In order: bash cats my data, tells awk to print just column 8 using , as the separator field, sorts all the data so that I can use uniq, asks uniq to print the counts and then the unique strings, then sorts by the counts descending (-n interprets as a number and -r sorts descending). The obvious inefficiency here is if your data is a couple of gb, you have to sort in order for uniq to work. Instead, you can add the script below to your path and replace the above with:


$ cat data.csv | awk -F, '{print $8}' | count

not only is this a lot less typing, but it will be significantly faster since you don’t have to hold all the data in ram and sort it.
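A minimal sketch of what a `count` script along these lines could look like, assuming it reads lines from stdin and tallies them in a hash table. This is not the author's script; the output format only roughly imitates `uniq -c`.

```python
import io
import sys
from collections import Counter


def count_lines(lines):
    """Tally identical lines; return (value, count) pairs, highest count first."""
    counts = Counter(line.rstrip("\n") for line in lines)
    return counts.most_common()


def main(stream=sys.stdin):
    """Print 'count value' lines, most frequent first."""
    for value, n in count_lines(stream):
        print("%7d %s" % (n, value))


# Demo on an in-memory stream; a real script would call main() on sys.stdin.
main(io.StringIO("red\nblue\nred\ngreen\nred\nblue\n"))
```

Memory use grows with the number of distinct values, not the number of rows, so no sort over the full data is needed. To use it as the `count` command above, save it as an executable file on your path and replace the demo line with a plain `main()` call.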

]]>2013-06-30T20:58:00-07:00http://earlh.com/blog/2013/06/30/hiring-software-engineersI perpetually see employers, on hacker news and elsewhere, complaining about difficulty hiring. I haven’t had such issues, so a (perhaps bold) guide to hiring software engineers:

are you paying market salaries?

are you really paying market salaries, or are employees supposed to join your company because you’re a special snowflake?

even if you are paying market, why should an employee go to your firm? What is the upside to them for leaving a boss and company that they know? Because it would be really convenient for you, the hirer, is not a good answer.

do you make the interviewing process decent, or do you scatter caltrops in front of potential employees?

good employees do not need to crash study then regurgitate graph algorithms that your company never uses on the whiteboard. They also have jobs and value their vacation time and don’t care to spend a week consulting for you.

how long does it take you to respond to resumes that come in? You should be able to say yes/no/maybe within 2 business days. Do your recruiters / interviewers actually read the cover letters / resumes? Last time I changed jobs, a big sf / yc startup let my first interviewer roll into the interview room just shy of 20 minutes late without having read my resume. That’s a complete dick move, and it’s part of why I turned them down.

when potential employees send you github links, do you have an engineer actually bloody look at them (almost never in my experience)?

do you actually expend effort to meet potential employees and grow a bunch of warm leads, or do you wait until 3 weeks before you want someone to start then gripe because you can’t convert cold leads in 1 week plus a 2 week resignation period for their current employer?

do you use shit software like that jobvite bullshit that badly ocrs then expects me to hand proof their shitty ocr job, or do you directly accept pdf resumes?

for the love of god, I do not have a copy of ms word and wouldn’t take one if it were free. I will not put my resume into word format.

do you take some pains to grow employees?

do you hire people out of university? Take a chance on people?

Like one of my former employers, do you do a good job hiring new grads from schools besides stanford / berkeley / mit / cmu, but then 18 months in after employees have demonstrated their value refuse to bring them up to market rates and lose them?

when you send out offers, do you actually put out a good offer or do you throw numbers out that are 10% or more under your ceiling then expect employees to negotiate hard with you? I just turned down an sf startup because they did this; the ceo who successfully hired me said, “Earl: when I was an employee I hated negotiating, so I’m going to make you a great offer. This also means I’m not going to negotiate.” And you know what? It was a great offer, and I said yes the next morning. It also avoids starting your new job after a confrontational exercise.

if you have recruiters contacting people, do you have them make clear they’re internal not external?

do your recruiters actually read peoples’ linkedin profiles before contacting them? I used rails a bit 3 employers ago and had to remove that word from my profile because I got spammed with rails stuff.

do your job postings on linkedin, craigslist, message boards, and your website tell potential employees why they should work for you, or like the vast majority, are they simply a long list of desiderata?

just like the easiest sale is an upsell to a customer you have, the easiest recruit is the good employee you already have that you keep happy and prevent from leaving

like Rand says, do you know off the top of your head the career goals of your employees? What are you doing to help them get there?

do you give your employees raises to keep them at or above market, or can they get a $20k raise by swapping companies? If that raise is on offer, exactly why should they continue to work for you?

on that note… 0.2% of an A round company isn’t golden handcuffs. It’s more like paper handcuffs.

]]>2013-06-17T22:25:00-07:00http://earlh.com/blog/2013/06/17/moving-to-octopressI finally got tired of wordpress, and I’m trying out octopress. If you have a blog, I think you would probably also be happier using octopress.

Reasons to switch:

wordpress doesn’t just work; it takes endless supervision. This is complicated by the seeming inability to get only security related updates, so I always feel forced to stay on the version treadmill for security reasons. This frequently breaks plugins.

wordpress treats security as an afterthought. For instance, the first thing you may think of is locking down the wp-admin directory to your home ip, but last time I tried, this broke the site.

php is security-hole ridden junk. This seems to have improved over the years but I’m still a little uneasy about using it, whereas only serving static html should be very secure and only require a single serving program, making it much easier to keep up with security.

serving php with nginx is fragile; there are multiple ways to set it up and none of them seem to work particularly reliably

there’s always 10 plugins to do any given task, none of which fully work or fully integrate with wordpress. I tried to get markdown working with wordpress last weekend and it was a nightmare

wordpress is slow, and caching plugins are brittle; serving static html from octopress should be lightning fast.

octopress comes with a bunch of nice features like syntax highlighting that doesn’t require loading 15 different javascript files ala the syntax highlighter I’m using

Reasons I want to switch:

I really like the idea of serving a static site and deploying with rsync

I like using git to version my site and vim to write posts

I will miss comments, but I hope people will email instead. That said, of the nearly 20,000 comments my site has received I believe fewer than thirty weren’t spam. In fact, wordpress has a whole cottage industry selling a comment spam control tool called Akismet created to fix how easy wordpress makes comment spam.

]]>2013-05-02T23:58:59-07:00http://earlh.com/blog/2013/05/02/modifying-the-number-of-mappers-or-reducers-on-a-running-emr-clusterAmazon emr unfortunately doesn’t give you an easy way to change the number of mappers and reducers on a running cluster. To do so before booting the cluster, add