Tag Archives: slurm

This post contains a quick overview of useful command line tools for research found in standard bash shells. For the sake of brevity, I will only include essential information here, and link to other informative pages where available.

awk

awk (also implemented in newer versions nawk and gawk) is a utility to manipulate text files. It’s syntax is much simpler than perl or python but still enables the fast writing of scripts to process files by lines and columns.

examples

For each line in the file input.txt, print the first and third column separated by a tab. Store the result in output.txt

awk '{ print $1 “\t" $3 }' input.txt > output.txt

Replace each column in file input.txt with its absolute value and print out each line to output.txt.

links

screen

screen is an invaluable tool for creating virtual consoles that (1) keeps sessions active through network failure, e.g. when using secure shells, (2) connect to your session from different locations, and (3) run a long process without maintaining an active shell. Also see the alias command for a useful screen alias.

examples

Open a new screen console, list all available screen sessions, and reattach to screen with a particular ID

links

find

A tool for finding files. Can be chained with other tools for powerful pipelines. Useful options are –name to find files by filename, -wholename to find files by filename and path, -maxdepth descends down to at most this level. The -exec option is VERY useful.

examples

links

paste/join/cat

Tools for combining files.
paste merges files line by line.
join merges two files on a common field.
cat concatenates a number of files one after the other. Can also be used to print a file to standard input.

examples

paste the lines of input1.txt and input2.txt together separating them with a space

paste –d " " input1.txt input2.txt

join two files input1.txt and input2.txt by the first field of both files

uname -a: get information about currently logged in machine. Related: echo $0 to print your interpreter.

md5sum: checksum files. Verify that some important files you downloaded are genuine.

A worked-through exercise

Now that we’ve learned the basics of these commands, let’s put them all together.
You will have to use what you have learned to work out solutions to the tasks in bold. Note that there are many ways to solve each problem.

The scenario

Due to the recent discovery that the Brontosaurus may indeed be a genus of dinosaur, the NSF has reallocated all of its funding to dinosaur research. A collaboration between leading archaeologist Dr. Li-Fang (aka the iron fist of Taiwan) and crazed molecular biologist Dr. Bianca resulted in the extraction of DNA samples from several Brontosauri fossils. After characterizing the set of variants in the sample, a variant call format (VCF) file was generated and uploaded to the cloud. During upload, a deranged hacker and former world-class sprinter named Greg, who has a personal vendetta against the Brontosaurus, corrupted the text file. We must clean this file using the tools described in this tutorial.

First we grep the double # lines of the header and store it in tmp. We then grep the non header lines, paste them with the new dinosaur sequence using a space delimter, and concatenate with our header temp file.