I wanted to extract all the email addresses in my GMail account. Sadly, GMail does not have this facility; it only lets you export the addresses you have written emails to. So I tried http://vallery.net/gmail/ about a week ago, but I am yet to get the results!! So I decided to get my hands dirty over the weekend.

I tried some Perl modules, but the installation was not clean (needs 'make', failed tests, and what not!!!). So I moved to PHP. I installed php5-cli and php5-imap on my Ubuntu, and added the line 'extension=imap.so' to /etc/php5/cli/php.ini

And then it was just a matter of playing around with the API described at http://in2.php.net/imap

This script scans all the emails in the 'All Mail' label of GMail (which includes all the emails in your account, even archived ones, but not the Spam and Trash labels). For each mail, it extracts the TO, CC, BCC, etc. fields (all those fields which may contain an email address) and prints the output on the screen in the following format:

The & is the separator between the fields. The first field shows which part of the email the address was extracted from. The second field shows the timestamp of the email message. The third field shows the email address, and the last field shows the name associated with that address.

Note that the standards are quite flexible, so only the first two fields are guaranteed to be present; the remaining fields can be empty.
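A minimal Python sketch of reading one line of that output, assuming exactly the four '&'-separated fields described above. The names Record and parse_line are hypothetical, and the sample values in the comments are made up for illustration. Splitting at most three times keeps any '&' that appears inside a display name, and missing trailing fields are padded with empty strings.

```python
from typing import NamedTuple


class Record(NamedTuple):
    source: str     # which header the address came from (TO, CC, ...)
    timestamp: str  # kept as text; the exact timestamp format is not assumed
    address: str    # may be empty
    name: str       # may be empty


def parse_line(line: str) -> Record:
    """Split one output line into its four fields."""
    # Split at most 3 times so an '&' inside the name stays in the name.
    parts = [part.strip() for part in line.rstrip("\n").split("&", 3)]
    # Only the first two fields are guaranteed; pad the rest.
    parts += [""] * (4 - len(parts))
    return Record(*parts)


# Illustrative input only; not real output of the script.
example = parse_line("to&2008-01-01&alice@example.com&Alice & Bob")
```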

You can take the output of this script and process it any which way you wish. You can use the output to determine who you have talked to most, or what the frequency of your conversations with a person was, etc. For example, I run the output file through a series of tools to get all unique addresses like so:
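As a rough Python equivalent of that kind of post-processing (a sketch, not the tool chain mentioned above), the following counts how often each address appears, assuming the address is the third '&'-separated field. The function name address_counts is hypothetical, and lower-casing addresses before counting is a choice made here, not something the script does.

```python
from collections import Counter


def address_counts(lines):
    """Count occurrences of each non-empty address (third field)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("&", 3)
        if len(fields) < 3:
            continue  # address field is absent on this line
        address = fields[2].strip().lower()
        if address:
            counts[address] += 1
    return counts


# Possible use on a saved output file (file name is an assumption):
#   with open("addresses.txt") as f:
#       for address, n in address_counts(f).most_common():
#           print(n, address)
```

The keys of the returned Counter are the unique addresses; most_common() also answers the "who have I talked to most" question.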

The script can be customized for other IMAP providers, accounts, or folders by changing the first 6 lines of code in the script (after the license strip).

Caveats:

1.) You should have IMAP enabled in your GMail > Settings > Forwarding and POP/IMAP section.

2.) If the last two lines of the output look like:

Warning: imap_headerinfo(): Bad message number in mine_emails_addrs_from_imap.php on line 31
empty header found at 18109

that means the extraction is complete.

But if you see only the 'empty header...' line, that means the connection was broken, or something else happened, and the extraction was not completed. You need to pick the number in the last line (18109 in this case) and provide it to the script as its only argument, so the script will start from that message and not redo the whole thing (which may cause it to fail again somewhere). You need to repeat this until you see the Warning message in the last-but-one line.
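The check described above can be sketched in Python as follows. This is only an illustration of the rule, assuming the two message texts appear exactly as quoted; the function name resume_point is hypothetical. It returns None when the run completed, or the message number to pass to the script on the next run.

```python
import re

EMPTY_HEADER = re.compile(r"empty header found at (\d+)")
BAD_MESSAGE = "Bad message number"


def resume_point(output_lines):
    """Return the message number to restart from, or None if the run completed."""
    lines = [line.strip() for line in output_lines if line.strip()]
    if not lines:
        raise ValueError("no output to inspect")
    match = EMPTY_HEADER.search(lines[-1])
    if match is None:
        raise ValueError("last line is not an 'empty header' marker")
    # Completed run: the warning shows up in the last two lines.
    if BAD_MESSAGE in " ".join(lines[-2:]):
        return None
    # Interrupted run: restart from the reported message number.
    return int(match.group(1))
```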

Using this script, I am thinking of providing a service similar to vallery.net's, but with more transparency and better response times. Let me know if you really need it.

As I said in my previous post, I am getting rid of my Ubuntu VM (Gutsy Gibbon running inside VirtualBox), so I am posting another script that I think will be useful in some situations.

Here's a little background. At the place where I am consulting (Hi5.com), we need to perform rsync on a huge directory tree. And since we want this operation to be as fast as possible, the first measure the guys there took was to use the rsync protocol, and not rsync-over-ssh; that's a great speed booster.

Next, they (actually, Kenny Gorman) devised three scripts that we need to run one after the other: one to generate a list of all files in the directory we want to copy, a second to split that list into 4 equal pieces, and a third to actually run these 4 pieces (batches) in the background, in parallel.

The problem with this approach is that some batches finish quickly, because the files those batches are rsyncing are smaller than the files that other batches are working on. The result: we start with 4 parallel rsync commands, but somewhere down the line only one or two of them are running. We lose parallelism quite quickly, and end up waiting for the batch(es) containing large files, each of which is processing its files in sequential order.

So, I got to work trying to parallelize a bunch of commands that are placed in a file. This script reads lines from its standard-input-stream (stdin) and executes those lines using the shell. At any time, it will run only a specified number of commands, and wait for them to finish. As soon as one of the running commands finishes, this script reads the next line from stdin and executes that.
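A minimal Python sketch of this scheduling idea (not the script itself): keep at most 'degree' commands running, and start the next one as soon as a slot frees up. The names run_parallel and _reap and the polling interval are assumptions made for this illustration. Unlike the script described here, the sketch gives each child an empty stdin so that children cannot swallow the command list when it is read from stdin.

```python
import subprocess
import time


def _reap(running, exit_codes):
    """Collect exit codes of finished processes; return those still running."""
    still_running = []
    for proc in running:
        code = proc.poll()
        if code is None:
            still_running.append(proc)
        else:
            exit_codes.append(code)
    return still_running


def run_parallel(commands, degree=4, poll_interval=0.05):
    """Run shell commands with at most `degree` of them active at once.

    Returns the exit codes in completion order (not input order).
    """
    running, exit_codes = [], []
    for line in commands:
        command = line.strip()
        if not command:
            continue  # skip blank lines
        running = _reap(running, exit_codes)
        while len(running) >= degree:
            time.sleep(poll_interval)  # wait for a slot to free up
            running = _reap(running, exit_codes)
        # Assumption: children get an empty stdin (see lead-in).
        running.append(
            subprocess.Popen(command, shell=True, stdin=subprocess.DEVNULL)
        )
    for proc in running:
        exit_codes.append(proc.wait())
    return exit_codes
```

Calling run_parallel(sys.stdin, degree=4) would mimic reading the commands from standard input. Because a new command starts whenever any running one finishes, a few large jobs no longer leave the other slots idle the way fixed batches do.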

I have also added the ability to change the degree of parallelism while this script is running. Just create a file named 'degree' in /tmp/parallel.$PID/ and put a number in there, denoting the new degree of parallelism. This is quite useful for tweaking the degree of parallelism depending on your system load.
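One way such a control file could be read is sketched below in Python. The directory layout follows the description above, but the function name read_degree and the rule of ignoring missing, non-numeric, or non-positive values are assumptions of this sketch, not documented behaviour of the script.

```python
import os


def read_degree(control_dir, current):
    """Return the degree stored in <control_dir>/degree, or `current` if unusable."""
    path = os.path.join(control_dir, "degree")
    try:
        with open(path) as f:
            value = int(f.read().strip())
    except (OSError, ValueError):
        return current  # file missing, unreadable, or not a number
    return value if value > 0 else current


# A runner could call this before starting each command, for example:
#   degree = read_degree("/tmp/parallel.%d" % os.getpid(), degree)
```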

I have made no special efforts in redirecting the stdin/stdout/stderr of the commands that are read and executed by this script. So, if you wish to record the progress of this script, or wish to store away your commands' output, just redirect this script's streams and save them.

An example usage of this script is to remove all the files under a directory, in parallel (although it is a very bad example for such a simple task):

I have admitted on more than one occasion that I am a Windows fan; yes, even after using Vista! But when I got my new laptop, on which I installed Vista Business on my own, I tried to push myself into using Ubuntu; I'll leave blogging about that experience for some other post (on my RNFs). That was a long time ago (2 months to be precise) and this post is about something else.

I encountered too many network disconnections on Ubuntu. I noticed that the wireless indicator on my laptop would just go off after using Ubuntu for a while. The only work-around I had to bring the connection back was to restart the OS! As I was very committed to using Ubuntu at any cost, I dug around the internet and found some clues. A little while later I developed this script.

What this script does is use the utility that is installed with the Intel (restricted) wireless driver to check if the driver is still running; if it is not, then it starts it, and if it is already running, it will kill and restart it. It worked like magic for me for the week that I used Ubuntu after this.

So I finally got around to implementing one of my ideas (which I don't get to do very often!). The idea was posted here: http://gurjeet-rnf.blogspot.com/2008/05/ts.html

I first thought of implementing it in C, and thought that I'd use the time-tested code from the postgres sources. I wanted to implement the code in C for performance reasons, but then it looked a bit complex to extract PG's code and make it work independently.

So I cooked up a simple shell script that uses the standard 'date' command to get us what we want. Here it is:

Since I have a soft spot for Windows, and since this shell script cannot be easily used on the Windows platform, I am working on a new binary that will be based on the 'date' command and work natively on Windows.