Counting Frequencies of Frequencies

By eric

2010-08-16

Lots of people forget how useful the core utilities are (the standard command-line tools you can run from Bash). I am pretty guilty of it myself at times, since Perl, Ruby, and Python make it so quick and easy to process items from the command line. However, each of them loads up an entire interpreter. For simple text processing, the coreutils are usually the better fit.

I’ll give you the specific example I had to deal with, but I’m sure it can be extrapolated and reused for other purposes. I have lists of the email addresses that received mailings over the past 7 days. I gather them with a find command, since the address files are created nightly. They are laid out with one email address (and some other metadata) per line. You’ll notice the awk in the find command; that’s just to extract the email address from each line to make things easier to work with.

The overall goal here is to find out how many times people were emailed over the past 7 days. Another way of saying it: how many people received 1, 2, 3, 4, etc. mailings over the past week? This is an exercise in aggregation.

First, here is everything in its entirety and then I will proceed to go through it all piece by piece.

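1) Find all mailing list master files (-name) that aren’t archived (-maxdepth) and that were created in the last 7 days (-ctime), and print the email address from each line (-exec awk) to a tmp file. Strictly speaking, -ctime tests the inode change time, but for files written once per night that is effectively the creation time. The output needs to be appended to the tmp file rather than overwriting it, so as not to clobber the emails added from the previous day’s file.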
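As a rough sketch of what a step like this can look like, here is a self-contained version that builds its own sample data. The directory layout, the master-*.txt file names, and the assumption that the email address is the first whitespace-separated field are all made up for illustration; they are not the real setup.

```shell
# Hypothetical layout: one master file per night, old ones under archive/.
workdir=$(mktemp -d)
mkdir -p "$workdir/lists/archive"
printf 'a@example.com 1001\nb@example.com 1002\n' > "$workdir/lists/master-mon.txt"
printf 'a@example.com 1003\nc@example.com 1004\n' > "$workdir/lists/master-tue.txt"
printf 'z@example.com 0001\n' > "$workdir/lists/archive/master-old.txt"

tmpfile="$workdir/emails.tmp"
: > "$tmpfile"   # start from an empty tmp file

# -maxdepth 1 keeps find out of archive/.
# -ctime -7 matches files whose status changed less than 7 days ago.
# awk prints the first field (assumed here to be the email address).
# >> appends, so each file's addresses are added instead of clobbered.
find "$workdir/lists" -maxdepth 1 -type f -name 'master-*.txt' -ctime -7 \
    -exec awk '{ print $1 }' {} \; >> "$tmpfile"

# find does not promise any particular file order, so sort for display.
sort "$tmpfile"
```

With the sample data above, the tmp file ends up with four addresses: a@example.com twice, then b@example.com and c@example.com once each. The archived z@example.com never shows up because of -maxdepth 1.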

2) The next piece does a word frequency count of the emails and prints the result to STDOUT in two columns. That is to say, the first column is a count of how many emails the address in the second column has received over the past 7 days. Now we know how many emails each individual has received, and we need to aggregate again.
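A minimal sketch of this kind of frequency count, assuming the tmp file holds one address per line, is the familiar sort piped into uniq -c. The sample addresses here are invented so the snippet runs on its own.

```shell
# Hypothetical tmp file: one email address per line, duplicates included.
tmpfile=$(mktemp)
printf '%s\n' \
    a@example.com b@example.com a@example.com \
    c@example.com a@example.com b@example.com > "$tmpfile"

# sort puts identical addresses next to each other; uniq -c then collapses
# each run into a single line prefixed with the number of occurrences.
counts=$(sort "$tmpfile" | uniq -c)
printf '%s\n' "$counts"
```

For this sample, a@example.com comes out with a count of 3, b@example.com with 2, and c@example.com with 1. The sort matters: uniq only collapses adjacent duplicate lines, so unsorted input would produce several partial counts for the same address.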