So, in my last post, I was stoked to find a way to filter FASTA files by a minimum sequence length using awk. However, that little one-liner fails miserably if your FASTA file isn’t formatted “correctly.” It turns out that the FASTA file I was working on (which was generated by a SOAP de novo assembly using iPlant) was technically formatted “correctly,” in that it had the necessary features to be called a FASTA file. It looked like this:

For all intents and purposes, that is a proper FASTA file. However, that formatting is problematic when using a program like sed or awk because both of those programs examine files by line. So, when I would run my awk one-liner to filter out all sequences (in this specific case, contigs) greater than a specified length,

I wasn’t getting any results that had sequences longer than 200bp because of the line breaks within each individual sequence!

So, how did I solve this issue?

Of course, there are programs/scripts that are able to handle instances like this. In fact, iPlant even has a sequence length restriction App built-in, so I don’t even need to take my SOAP FASTA file out of iPlant at all. However, I am trying to learn sed and awk, so I turned this problem into a learning exercise. Additionally, there may be times when I’m working with FASTA files outside of iPlant and may need/want a tool to manipulate a FASTA file without having to rely on a special script, program or website.

Unfortunately, the solution was WAY beyond anything I could end up finding on my own. I did find a way to remove “newlines” (indicated as ‘n’ in sed and awk) using sed and/or translate (“tr” command in Terminal). Here’s the sed command:

$sed ':a;N;$!ba;s/\n//g'

I still don’t fully follow how this works, but I do know that this part

s/\n//g

means that when a newline is encountered,

\n

substitute

s/

it with nothing throughout the entire file

/g

The remainder of the command is complicated and involves labels and branching and other stuff I don’t fully follow yet.

Using the built-in translate command in Terminal is MUCH cleaner and MUCH easier to understand for newbs like me:

$tr -d '\n'

This means use translate

tr

to delete all newlines

-d '\n'

The above are great for general usage, but they don’t address the idea of how to ignore certain lines. I ended up turning to the amazing folks at StackOverflow for help and they came up with solutions for both sed and awk amazingly quickly; it was great!

Each individual sequence is preceded by a sequence identifier line. This identifier line is always indicated by a “>” at the beginning of this line.

Here’s a quick explanation of how it works, as I currently understand it:

!/^>/ {next}

– If a line (i.e. record) begins with a “>”, go to the next line (record).

{getline seq}

– “getline” reads the next record and assigns the entire record to a variable called “seq”

length(seq) >= 200

– If the length of the “seq” record is greater than, or equal to, 200 then…

{print $0 "\n" seq}

– Print all records ($0) of the variable “seq” in the file that matched our conditions, each on a new line (“\n”)

Important note: this will only work on sequences that exist on a single line in the file. If the sequence wraps to multiple lines, the code above will not work. You can fix your FASTA files so that the sequences for each entry exist on single lines: