We're looking for long answers that provide some explanation and context. Don't just give a one-line answer; explain why your answer is right, ideally with citations. Answers that don't include explanations may be removed.

+1 For a sane solution when it's lots of data, and not just a handful of bytes. The files are in the disk cache though, aren't they?
–
Daniel Beck♦Oct 10 '12 at 18:24

2

The neat thing is that it has a complexity of O(N) in processing and O(1) in memory. The pipe-based solutions usually have O(N log N) in processing (or even O(N^2), because of the sort) and O(N) in memory.
–
queueoverflowOct 10 '12 at 19:54

68

You are stretching the definition of "command line" quite a bit, though.
–
gerritOct 10 '12 at 20:42

+1 I've been using grep for 25 years and didn't know about -o.
–
LarsHOct 10 '12 at 19:28

9

@JourneymanGeek: The problem with this is that it generates a lot of data that is then forwarded to sort. It would be cheaper to let a program parse each character. See Dave's answer for an O(1)-instead-of-O(N) memory complexity answer.
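Dave's answer isn't reproduced in this thread, but the single-pass, constant-memory idea the comment refers to can be sketched in awk (this is my reconstruction of the idea, not Dave's exact code; empty `FS` for per-character splitting is a gawk/mawk extension, and `FILE` is a placeholder for your input file):

```shell
# One pass over the input: O(N) time, O(1) memory
# (the count table is bounded by the alphabet size).
# FS = "" (split each line into single characters) is a gawk/mawk extension.
awk 'BEGIN { FS = "" }
     { for (i = 1; i <= NF; i++) count[$i]++ }
     END { for (c in count) print count[c], c }' FILE
```

Unlike the pipe solutions, nothing here grows with the input: the per-character tallies live in one small array.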
–
queueoverflowOct 10 '12 at 19:52

The key is knowing about the -o option for grep. This splits the match up, so that each output line corresponds to a single instance of the pattern, rather than the entire line for any line that matches. Given this knowledge, all we need is a pattern to use, and a way to count the lines. Using a regex, we can create a disjunctive pattern that will match any of the characters you mention:
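The command itself isn't shown above; based on the description, a reconstruction (the exact pattern is my assumption, and `FILE` stands for your input file) would be an alternation like:

```shell
# -o prints each match on its own line; sort | uniq -c then tallies them.
# \| alternation in basic regular expressions is a GNU grep feature.
grep -o 'A\|C\|G\|T\|N\|-' FILE | sort | uniq -c
```

As a commenter notes further down, a bracket expression such as `[ATCGN-]` does the same job more readably.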

Addendum: If you want to total the number of A, C, G, N, T, and - characters in a file, you can pipe the grep output through wc -l instead of sort | uniq -c. There's lots of different things you can count with only slight modifications to this approach.
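Assuming the same character set as above (`FILE` again a placeholder), the total-count variant would look like:

```shell
# One match per output line, so counting lines counts total occurrences
# of any of the six characters.
grep -o '[ACGNT-]' FILE | wc -l
```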

@JourneymanGeek: Learning regex is well worth the trouble, since it's useful for so many things. Just understand its limitations, and don't abuse the power by attempting things outside the scope of regexes' capabilities, like trying to parse XHTML.
–
crazy2beOct 10 '12 at 15:17

20

grep -o '[ATCGN-]' could be a bit more readable here.
–
sylvainulgOct 10 '12 at 15:45

After using UNIX for a couple of years, you get very proficient at linking together a number of small operations to accomplish various filtering and counting tasks. Everyone has their own style: some like awk and sed, some like cut and tr. Here's the way I would do it:
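The command isn't shown here; from the description that follows, a reconstruction for a single character (I'm using 'T' and a placeholder `FILE` as examples) would be:

```shell
# grep -o prints one match per line; wc -w counts each match as a word.
grep -o 'T' FILE | wc -w
```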

grep searches the given file(s) for the specified text, and the -o option tells it to print only the actual matches (i.e. the characters you were looking for), rather than the default, which is to print each line in which the search text was found.

wc prints the byte, word and line counts for each file, or in this case, the output of the grep command. The -w option tells it to count words, with each word being an occurrence of your search character. Of course, the -l option (which counts lines) would work as well, since grep prints each occurrence of your search character on a separate line.

To do this for a number of characters at once, put the characters in an array and loop over it:
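The loop itself isn't shown; a version consistent with the description (bash; the character list comes from the question, `FILE` is a placeholder) might look like:

```shell
#!/bin/bash
# One grep pass per character: simple, but reads the file once per character.
chars=(A C G N T -)
for c in "${chars[@]}"; do
    printf '%s: ' "$c"
    # -e protects the '-' pattern from being mistaken for an option
    grep -o -e "$c" FILE | wc -l
done
```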

The downside of this approach, as user Journeyman Geek notes below in a comment, is that grep has to be run once for each character. Depending on how large your files are, this can incur a noticeable performance hit. On the other hand, when done this way it's a bit easier to quickly see which characters are being searched for, and to add/remove them, as they're on a separate line from the rest of the code.

I think any decent implementation avoids sort. But because it's also a bad idea to read everything 4 times, I think one could somehow generate a stream that goes through 4 filters, one for each character, each filtering out its own character, with the stream lengths also somehow being counted.
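One way that vague idea could be realized (a speculative sketch of mine, using bash process substitution; `FILE` is a placeholder) is to fan a single read of the file out to four concurrent counters:

```shell
#!/bin/bash
# Read the file once and tee it into four concurrent counting pipelines.
# The counters finish in no particular order, so each writes its result
# to its own file rather than to the shared stdout.
tee >(grep -o 'A' | wc -l > count.A) \
    >(grep -o 'C' | wc -l > count.C) \
    >(grep -o 'G' | wc -l > count.G) \
    >(grep -o 'T' | wc -l > count.T) < FILE > /dev/null
sleep 1   # crude: give the substituted processes time to finish writing
grep '' count.A count.C count.G count.T   # prints filename:count pairs
```

This keeps the single-pass property over the input, though each filter still scans its copy of the stream.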

I didn't know about uniq or about grep -o, but since my comments on @JourneymanGeek and @crazy2be had such support, maybe I should turn it into an answer of its own:

If you know there are only "good" characters (those you want to count) in your file, you can go for

grep -o . YourFile | sort | uniq -c

If only some characters must be counted and others not (e.g. separators):

grep -o '[ACTGN-]' YourFile | sort | uniq -c

The first one uses the regular expression wildcard ., which matches any single character. The second one uses a 'set of accepted characters', with no specific order, except that - must come last (A-C would be interpreted as 'any character between A and C'). Quotes are required in that case so that your shell does not try to expand the pattern against single-character file names, if any (and produce a "no match" error if there are none).

Note that sort also has a -u (unique) flag so that it only reports things once, but no companion flag to count duplicates, so uniq is indeed mandatory.
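To illustrate the difference:

```shell
# sort -u deduplicates but drops the counts:
printf 'A\nA\nC\n' | sort -u         # A and C, one per line, no counts
# sort | uniq -c keeps a count for each distinct line:
printf 'A\nA\nC\n' | sort | uniq -c  # "2 A" and "1 C", with leading padding
```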

If circumstances permit, compare the sizes of files with small, known character counts against one with no characters to get an offset, and then just count bytes.

Ah, but the tangled details:

Those are all ASCII characters, one byte each. Files may of course have extra metadata prepended for a variety of things used by the OS and the app that created them. In most cases I would expect this to take up the same amount of space regardless of content, but I would try to maintain identical circumstances when you first test the approach, and then verify that you have a constant offset before not worrying about it. The other gotcha is that line breaks typically involve two ASCII whitespace characters, and any tabs or spaces would be one byte each. If you can be certain such characters will be present and there's no way to know how many beforehand, I'd stop reading now.

It might seem like a lot of constraints but if you can easily establish them, this strikes me as the easiest/best performing approach if you have a ton of these to look at (which seems likely if that's DNA). Checking a ton of files for length and subtracting a constant would be gobs faster than running grep (or similar) on every one.

If:

These are simple unbroken strings in pure text files

They are in identical file types created by the same vanilla, non-formatting text editor like SciTE (pasting is okay as long as you check for spaces/returns) or some basic program somebody wrote

And Two Things That Might Not Matter But I Would Test With First

The file names are of equal length

The files are in the same directory

Try Finding The Offset By Doing the Following:

Compare an empty file, one with a few easily-human-counted characters, and one with a few more characters. If subtracting the empty file's size from each of the other two gives you byte counts that match the character counts, you're done: just check file lengths and subtract the empty-file amount. If you want to try to handle multi-line files, note that most Windows editors attach two special one-byte characters per line break, one of which tends to be ignored on other systems, so in that case you'd have to at least grep for whitespace characters, at which point you might as well do it all with grep.
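If the experiment shows the "offset" is really just the line-break bytes (one LF per line, as on Unix, where an empty text file is 0 bytes), the approach reduces to a subtraction; a sketch under that assumption (`FILE` is a placeholder):

```shell
# Character count = total bytes minus one line-break byte per line.
# Assumes Unix (LF-only) line endings and single-byte characters.
bytes=$(wc -c < FILE)
lines=$(wc -l < FILE)
echo $((bytes - lines))
```

This touches only file metadata plus one streaming pass for the line count, which is why it can beat running grep over every file.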