Tuesday, June 15, 2010

Ubuntu Tricks: Text File Manipulation

You've got a text file that is several million lines long. Each line is a filename, and each file needs to be either downloaded or uploaded somewhere.

Assuming you have a script that reads the input text file serially, one line at a time, what happens when the system kicks out an error? Start from the beginning? That's what happened to me at first, while I looked for solutions. Letting it go this way meant waiting three hours or more each time. That's not very optimal.

At the point of error it was possible to see the filename that had just been processed, or was still being processed, so that line should be the resume point. I noted it down.

What do we need to do? Trim the file from the top using sed and write the result out to a secondary file.

How?

sed knows about line numbers, but I haven't dug deep enough yet to make it do the string search at the same time. The best way that has worked for me so far is to find the line number first with grep, like so: cat FILE | grep -n PATTERN

Example:

$ cat files2009.txt | grep -n meatloaf.bin

4236321:meatloaf.bin
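
One caveat with the search above: grep treats the dot in meatloaf.bin as "any character" and also matches substrings, so a name like big-meatloaf.bin would hit too. Here is a minimal sketch of a stricter lookup, assuming one filename per line. The sample file and the variable names (workdir, resume_line) are made up for illustration and are not from the original job.

```shell
# Build a tiny stand-in for files2009.txt (hypothetical sample data).
workdir=$(mktemp -d)
printf '%s\n' alpha.bin big-meatloaf.bin meatloafXbin meatloaf.bin omega.bin > "$workdir/files.txt"

# -n prints line numbers, -F treats the pattern as a fixed string,
# -x requires the whole line to match. head keeps only the first hit
# in case the name is listed twice; cut keeps just the line number.
resume_line=$(grep -n -F -x -- 'meatloaf.bin' "$workdir/files.txt" | head -n 1 | cut -d: -f1)
echo "$resume_line"
```

With the sample data this should print only the number of the exact meatloaf.bin line, skipping the two lookalikes above it.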

So, we need to cut until line 4,236,321 (or perhaps a few lines before that, your choice). In case you haven't been tracking your percentage done, that number also tells us that we've got fewer than 800K files to go before we're done.
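
If you'd rather let the shell do that arithmetic, something like the following should work. It is only a sketch: the 1000-line sample file and the resume line of 764 are invented so the numbers are easy to check, and the variable names are mine.

```shell
# Hypothetical sample: 1000 filenames, one per line.
workdir=$(mktemp -d)
seq 1 1000 | sed 's/$/.bin/' > "$workdir/files.txt"

resume_line=764                          # assumed value, as reported by grep -n
total=$(wc -l < "$workdir/files.txt")    # total number of lines in the list

# Integer arithmetic: lines still to process, and rough percentage done.
remaining=$((total - resume_line))
percent=$((100 * resume_line / total))
echo "$remaining lines left, about $percent% done"
```

Reading the file through `<` keeps wc from printing the filename next to the count.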

Time to use sed, which understands line-number ranges as well as regular expressions. For our purposes we'll be using: sed -e 'START,ENDd'

Example:

$ sed -e '1,4236321d' files2009.txt > new2009.txt

The above means: delete lines "1" through "4236321" (inclusive) from the contents of files2009.txt and write the rest to new2009.txt. The original file is left untouched.
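
Putting the two steps together might look like the sketch below. It assumes you want to keep the line you found, so that the file that was in flight when the error hit gets retried. The sample data and the names resume_line and last_done are hypothetical.

```shell
# Hypothetical sample list standing in for files2009.txt.
workdir=$(mktemp -d)
printf '%s\n' a.bin b.bin meatloaf.bin c.bin d.bin > "$workdir/files.txt"

# Step 1: find the line number of the exact filename.
resume_line=$(grep -n -F -x -- 'meatloaf.bin' "$workdir/files.txt" | head -n 1 | cut -d: -f1)

# Step 2: delete everything before that line, so meatloaf.bin is retried.
last_done=$((resume_line - 1))
if [ "$last_done" -ge 1 ]; then
  sed -e "1,${last_done}d" "$workdir/files.txt" > "$workdir/new.txt"
else
  # Nothing to trim (match on line 1, or no match): keep the whole list.
  cp "$workdir/files.txt" "$workdir/new.txt"
fi
cat "$workdir/new.txt"
```

The guard is there because a range like 1,0 is not an empty range to sed; it would still delete line 1. If you only ever trim from the top, tail -n +N FILE (print from line N onward) is another way to get the same result.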

You can now imagine how to trim blocks of lines from the middle: just grep for the starting and ending lines of the block, then give sed those two line numbers instead of starting at 1.
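
Trimming a block from the middle could look roughly like this. Again a sketch under assumptions: the filenames c.bin and e.bin mark the first and last lines of the block to drop, and all names and data here are invented for the example.

```shell
# Hypothetical sample list.
workdir=$(mktemp -d)
printf '%s\n' a.bin b.bin c.bin d.bin e.bin f.bin > "$workdir/files.txt"

# Look up the line numbers of the first and last entries of the block.
start=$(grep -n -F -x -- 'c.bin' "$workdir/files.txt" | head -n 1 | cut -d: -f1)
end=$(grep -n -F -x -- 'e.bin' "$workdir/files.txt" | head -n 1 | cut -d: -f1)

# Delete lines start..end (inclusive) and write what is left to a new file.
sed -e "${start},${end}d" "$workdir/files.txt" > "$workdir/trimmed.txt"
cat "$workdir/trimmed.txt"
```

With the sample data, the lines before c.bin and after e.bin survive and everything in between is dropped.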