Wednesday, November 03, 2010

Some Unix/Perl one-liners for Bioinformatics

1. $ wc -l : count the number of lines in a file.
2. $ ls | wc -l : count the number of files in a directory.
3. $ tac : print the file in reverse order, i.e. last line first, first line last.
4. $ rev : reverse the characters of each line in the file.
5. $ sed 's/.$//' or sed 's/^M$//' or sed 's/\x0D$//' : convert a DOS file into Unix mode. The first form assumes every line ends in a carriage return; in the second, ^M is typed as Ctrl-V Ctrl-M; the third needs a sed that understands \x escapes, such as GNU sed.
6. $ sed "s/$/`echo -e \\\r`/" or sed 's/$/\r/' : convert Unix newlines into DOS newlines (the second form needs GNU sed).
7. $ awk '1; { print "" }' : double-space a file.
8. $ awk '{ total = total + NF }; END { print total+0 }' : print the number of words in a file.
9. $ sed '/^$/d' or grep '.' : delete all blank lines in a file.
10. $ sed '/./,$!d' : delete all blank lines at the beginning of the file.
11. $ sed -e :a -e '/^\n*$/{$d;N;ba' -e '}' : delete all blank lines at the end of the file.
12. $ sed -e :a -e 's/<[^>]*>//g;/
13. $ sed 's/^[ \t]*//' : delete all leading spaces and tabs on each line (\t inside brackets is a GNU sed extension).
14. $ sed 's/[ \t]*$//' : delete all trailing spaces and tabs on each line.
15. $ sed 's/^[ \t]*//;s/[ \t]*$//' : delete both leading and trailing spaces and tabs on each line.
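As a rough illustration of how a few of the clean-up one-liners above can be chained, here is a minimal sketch. The sample file contents are made up for the demonstration, and [[:blank:]] is swapped in for [ \t] so that the trimming step does not depend on GNU sed.

```shell
#!/bin/sh
# Hypothetical sample file: DOS line endings, blank lines, and stray whitespace.
tmp=$(mktemp)
printf '\r\n  >seq1 \r\nACGT\t\r\n\r\n' > "$tmp"

# Item 1: count lines (reading from stdin keeps the file name out of the output).
n_lines=$(wc -l < "$tmp" | tr -d ' ')

# Items 5, 9 and 15 chained in one pipeline:
# strip the trailing CR, drop empty lines, trim leading/trailing blanks.
# [[:blank:]] replaces [ \t] here because \t inside brackets is a GNU extension.
cleaned=$(sed 's/.$//' "$tmp" | sed '/^$/d' | sed 's/^[[:blank:]]*//;s/[[:blank:]]*$//')

echo "lines: $n_lines"
printf '%s\n' "$cleaned"
rm -f "$tmp"
```

The order matters: the blank-line filter only sees truly empty lines once the carriage returns have been stripped.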

2.2 Working with Patterns/numbers in a sequence file

16. $ awk '/Pattern/ { n++ }; END { print n+0 }' : print the total number of lines containing the word Pattern.
17. $ sed 10q : print the first 10 lines.
18. $ sed -n '/regexp/p' : print the lines that match the pattern.
19. $ sed '/regexp/d' : delete the lines that match the regexp.
20. $ sed -n '/regexp/!p' : print the lines that do not match the pattern.
21. $ sed '/regexp/!d' : delete the lines that do NOT match the regular expression.
22. $ sed -n '/^.\{65\}/p' : print lines that are 65 characters or longer.
23. $ sed -n '/^.\{65\}/!p' : print lines that are shorter than 65 characters.
24. $ sed -n '/regexp/{g;1!p;};h' : print the line before each pattern match (not the matching line itself).
25. $ sed -n '/regexp/{n;p;}' : print the line after each pattern match.
26. $ sed -n '/^.\{65\}/ {g;1!p;};h' < sojae_seq > tmp : print the names of the sequences that are 65 nucleotides or longer (assuming each name and each sequence occupies a single line).
27. $ sed -n '/regexp/,$p' : print from the first line matching the regular expression to the end of the file.
28. $ sed -n '8,12p' : print lines 8 to 12 (inclusive).
29. $ sed -n '52p' : print only line number 52.
30. $ sed '/pattern1/,/pattern2/d' < inputfile > outfile : delete all lines from a line matching pattern1 through the next line matching pattern2 (inclusive).
31. $ sed '20,30d' < inputfile > outfile : delete lines 20 through 30 (inclusive).
32. $ awk '/baz/ { gsub(/foo/, "bar") }; { print }' : substitute foo with bar in lines that contain 'baz'.
33. $ awk '!/baz/ { gsub(/foo/, "bar") }; { print }' : substitute foo with bar in lines that do not contain 'baz'.
34. $ grep -i -B 1 'pattern' filename > out : print each sequence containing the pattern, together with its name line, matching case-insensitively (make sure the sequence name and the sequence each occupy a single line).
35. $ grep -i -A 1 'seqname' filename > out : print the sequence name as well as the sequence into the file 'out'.
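To see how some of the pattern one-liners above behave on a sequence file, here is a small sketch run against a made-up FASTA file with one name line and one sequence line per record. The sequence names, the motif, and the threshold of 8 (instead of 65) are assumptions chosen only to keep the example short.

```shell
#!/bin/sh
# Hypothetical two-line-per-record FASTA file (names and sequences are made up).
fa=$(mktemp)
printf '>seq1\nACGTACGT\n>seq2\nTTTT\n>seq3\nacgtTTAA\n' > "$fa"

# Item 16: count lines containing ACGT (case-sensitive, so seq3 is not counted).
n_hits=$(awk '/ACGT/ { n++ }; END { print n+0 }' "$fa")

# Item 34: case-insensitive match plus the name line before it.
# grep prints "--" between non-adjacent groups, so that separator is filtered out.
records=$(grep -i -B 1 'acgt' "$fa" | grep -v '^--$')

# Item 26 with a threshold of 8 instead of 65: names of sequences of 8+ characters.
long_names=$(sed -n '/^.\{8\}/{g;1!p;};h' "$fa")

# Item 31: delete lines 3 to 4, which removes the seq2 record.
without_seq2=$(sed '3,4d' "$fa")

printf '%s\n' "$n_hits" "$records" "$long_names" "$without_seq2"
rm -f "$fa"
```

Note that the length test in item 26 would also fire on a name line that happens to be long enough, so it is only reliable when name lines are shorter than the threshold.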

3.1 Error Checking and data handling:

38. $ awk '{ print NF ":" $0 } ' : print the number of fields of each line, followed by the line.
39. $ awk '{ print $NF }' : print the last field of each line.
40. $ awk 'NF > n' : print every line with more than 'n' fields.
41. $ awk '$NF > n' : print every line where the last field is greater than n.
42. $ awk '{ print $2, $1 }' : print just the first 2 fields of a data file, in swapped order.
43. $ awk '{ temp = $1; $1 = $2; $2 = temp; print }' : print every line with its first 2 fields swapped and the remaining fields in their original order.
44. $ awk '{ for (i=NF; i>0; i--) printf("%s ", $i); printf ("\n") }' : print all the fields of each line in reverse order.
45. $ awk '{ $2 = ""; print }' : blank out the 2nd field in each line (the separators around it remain).
46. $ awk '$5 == "abc123"' : print each line where the 5th field is equal to 'abc123'.
47. $ awk '$5 != "abc123"' : print each line where the 5th field is NOT equal to 'abc123'.
48. $ awk '$7 ~ /^[a-f]/' : print each line whose 7th field matches the regular expression (here, begins with a letter from a to f).
49. $ awk '$7 !~ /^[a-f]/' : print each line whose 7th field does NOT match the regular expression.
50. $ cut -f n1,n2,n3.. inputfile > outputfile : extract columns (fields) n1, n2, n3, ... from the input file and write them to the output file. If the delimiter is something other than TAB, pass it with -d, e.g. cut -d ',' -f n1,n2.. inputfile > out
51. $ sort -n -k 2,2 -k 4,4 file > fileout : sort numerically on column 2, then on column 4 to break ties. If -n is not specified, sort compares lexicographically (by ASCII value in the C locale).
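The field-handling commands above combine naturally into a small check-then-process pipeline. The following is a sketch under assumptions: the table, its column meanings, and the deliberately incomplete last row are invented for the example.

```shell
#!/bin/sh
# Hypothetical tab-delimited table: name, score, chromosome, count.
# The last row is deliberately missing its fourth field.
tab=$(mktemp)
printf 'geneA\t30\tchr1\t5\ngeneB\t4\tchr2\t9\ngeneC\t30\tchr1\t2\ngeneD\t7\tchr3\n' > "$tab"

# Item 40 with n = 3: keep only rows that have more than 3 fields.
complete=$(awk 'NF > 3' "$tab")

# Item 51: numeric sort on column 2, ties broken numerically on column 4.
sorted=$(printf '%s\n' "$complete" | sort -n -k 2,2 -k 4,4)

# Item 50: keep columns 1 and 4 of the sorted rows (TAB is cut's default delimiter).
picked=$(printf '%s\n' "$sorted" | cut -f 1,4)

# Item 42: print the first two fields of each complete row, swapped.
swapped=$(printf '%s\n' "$complete" | awk '{ print $2, $1 }')

printf '%s\n' "$picked" "$swapped"
rm -f "$tab"
```

Filtering on the field count first keeps the malformed row out of the sort and cut steps, where a missing column would otherwise go unnoticed.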