Wednesday, March 30, 2016

Linux shell notes

This post collects assorted Bash scripts, with some explanation of how they work. Some don't rely on Bash-specific features and will probably work in many other shells as well. I intend to keep adding to this list over time.

Count instances of a specific character in a file:
(In the example here, I count the number of sequences in a FASTA file. Sequence names all begin with '>', and that is usually the only place that character appears in a FASTA file.)
tr -cd '>' < sequences.fasta | wc -c

grep -c '^>' sequences.fasta
The first one feeds the FASTA file to the translate (tr) program on standard input. The -c option tells tr to operate on the complement of the given set, that is, every character except '>', and the -d option tells it to delete those characters, so the output of tr is a stream of '>' characters. wc -c counts the bytes in that stream, which is the number of '>' characters in the FASTA file and therefore, usually, the number of sequences in the file.

In the second example, the -c option tells grep to output the number of lines that match the query regex. The regex '^>' matches a '>' only at the beginning of a line. Notably, grep -c counts each matching line once, no matter how many times the character occurs on that line. To count every occurrence of '>', use the -o option, which prints each match on its own line, and count the lines:

grep -o '>' sequences.fasta | wc -l

Interestingly, the tr and grep -c commands seem to run at about the same speed. I expected grep to be slower since it uses regular expressions. That they run at the same speed leads me to believe that both are bound by disk I/O. Because they are similar in speed but grep is more powerful, I would recommend the grep construction.
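To see how the three counts differ, here is a small sketch that can be run as-is. The sample file contents and the function names (count_gt_bytes, count_header_lines, count_gt_matches) are made up for illustration; only the tr, grep, and wc invocations come from the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper names; each wraps one of the commands discussed above.

# Count every '>' byte in the file.
count_gt_bytes() {
    tr -cd '>' < "$1" | wc -c
}

# Count lines that start with '>' (one per FASTA record).
count_header_lines() {
    grep -c '^>' "$1"
}

# Count every '>' occurrence; grep -o prints one match per line.
count_gt_matches() {
    grep -o '>' "$1" | wc -l
}

# Made-up sample: three records, one header containing an extra '>'.
sample=$(mktemp)
printf '>seq1\nACGT\n>seq2 a>b\nGGCC\n>seq3\nTTAA\n' > "$sample"

echo "tr -cd | wc -c:  $(count_gt_bytes "$sample")"
echo "grep -c:         $(count_header_lines "$sample")"
echo "grep -o | wc -l: $(count_gt_matches "$sample")"

rm -f "$sample"
```

On this sample the tr and grep -o counts include the stray '>' in the second header, while grep -c counts only the three header lines.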

Convenient output from ls:
ls -alth
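Of course the best options depend on the situation, but I find this to be a good general-purpose set for ls: -a includes hidden files, -l uses the long listing format, -t sorts by modification time (newest first), and -h prints sizes in human-readable units.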

List processes along with the command used to execute the process:
ps -af
-a shows all processes associated with a terminal. -f shows the full listing for each process (the command-line arguments in addition to the process name).
To see all processes currently running, use ps -A.

Simple multi-threading with GNU parallel:
GNU parallel makes it easy to parallelize any Bash script. The -j option specifies how many calls to have active at one time. The positional parameter tells it which command to run. The items listed after ::: are the arguments to pass to that command, one per call. For example, the script:

parallel -j 2 thread ::: instance_1_argument instance_2_argument instance_3_argument

will call the function "thread" a total of three times, such that no more than 2 calls are running concurrently. First it will call thread(instance_1_argument), then thread(instance_2_argument). When one of those finishes, it will call thread(instance_3_argument).
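For GNU parallel to be able to call a Bash function such as thread, the function generally has to be exported first with export -f. A minimal sketch, assuming GNU parallel is installed and the script runs under Bash; the body of thread here is a made-up placeholder, not the original worker:

```shell
#!/usr/bin/env bash
# Hypothetical worker; a real one would do the actual per-item work.
thread() {
    echo "processed $1"
}

# Export the function so the shells started by GNU parallel can see it.
export -f thread

if command -v parallel >/dev/null 2>&1; then
    # At most 2 jobs at a time, one call per argument after :::
    parallel -j 2 thread ::: instance_1_argument instance_2_argument instance_3_argument
else
    echo "GNU parallel is not installed" >&2
fi
```

Output lines appear in the order the jobs finish, which need not match the argument order; adding -k to the parallel call keeps the output in input order.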

Sort file, but skip the header:
#The default for awk, if no action is given, is to print the whole line; that's what it does for line 1 (where NR==1).
#If NR > 1, awk does the action stated there, which is to output the whole line (print $0), but instead of printing it
#to stdout, it pipes it to sort. Sort then sorts it by the first field. LANG=en_EN is there so that sort and join (which appears later) are consistent.
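A command matching that description might look like the sketch below. This is a reconstruction under assumptions, not necessarily the original script: the function name sort_keep_header and the sample table are made up, and LC_ALL=C is substituted for LANG=en_EN so that the sort order is predictable on any machine.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the header line unchanged, then the
# remaining lines sorted by their first field.
sort_keep_header() {
    # "NR == 1" has no action, so awk prints line 1 as-is.
    # For every later line, print is redirected into a sort process.
    awk 'NR == 1; NR > 1 { print $0 | "LC_ALL=C sort -k1,1" }' "$1"
}

# Made-up sample table with a header row.
table=$(mktemp)
printf 'name value\npear 3\napple 1\nfig 2\n' > "$table"
sort_keep_header "$table"
rm -f "$table"
```

The header reaches the output first because awk prints it before the sort process is started, and sort only emits its result once awk closes the pipe at the end of input.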