newbie1991 has asked for the
wisdom of the Perl Monks concerning the following question:

Hello Monks!
I'm trying to check whether the current array matches a preset pair, and to store the number of occurrences of that pair in one element of a matrix.
I've tried the following grep statement and it's not working. I think I know why (evaluating to true and false will not increment the counter each time, just replace it), but can someone help me out with what to do instead?

Essentially I want the [0][0] element to count how many times AA appears in my input sequence. I have already segmented it up into pairs (hence the @dipept, it only has 2 elements).
As always, I'm just starting out and all help is appreciated. Thank you. :)
PS : I'm a little shaky with hashes so if you suggest hashes please do expand on it a little.

You say the approach you are using is "not working", leaving us to guess what data you are using and what result you expect.

My guesses are that either you want to count the number of strings in a dataset (i.e., an array) in which a pattern occurs at least once, or you want to count the total occurrences of a pattern in all strings in a dataset.

The code you posted seems to serve for the first purpose. A variation using map seems to take care of the other. (Note that the pattern occurs twice in 'zAAzAAz'.) In neither case is the original dataset changed.
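A minimal sketch of the two interpretations, assuming a small made-up dataset (the original post's data is not shown, so `@data` here is hypothetical). grep in scalar context counts the strings that match at least once; map in scalar context counts every element its block produces, which with m//g means every non-overlapping match.

```perl
use strict;
use warnings;

# Hypothetical sample data; the dataset from the original post is unknown.
my @data = ('zAAzAAz', 'AA', 'xyz');

# grep in scalar context: how many strings contain the pattern at least once.
my $strings_with_match = grep { /AA/ } @data;

# map in scalar context: total number of elements the block generated,
# i.e. every non-overlapping match of the pattern across all strings.
my $total_matches = map { /AA/g } @data;

print "strings containing AA: $strings_with_match\n";   # 2
print "total occurrences:     $total_matches\n";        # 3
print "dataset unchanged:     @data\n";
```

Note that m//g on its own does not see overlapping matches, so 'AAA' would count once here.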

I'm counting total occurrences of a pattern in the dataset.
Input in array format is M H D L N with each element being one letter.
Output should count how many times MH, HD, DL, etc. appear. The input is MUCH longer (it's an amino acid sequence).
And yes, overlaps are considered. AAA has 2 matches.
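One way to get overlapping counts, sketched under the assumption that the single-letter array is first joined back into a string. The helper name `count_overlapping` is made up for this illustration. A lookahead matches without consuming characters, so the search resumes one position later and 'AAA' yields two hits.

```perl
use strict;
use warnings;

# Count occurrences of $pair in $seq, overlaps included.
# (?=...) is a zero-width lookahead: it matches without consuming
# characters, so the next attempt starts one position further on.
sub count_overlapping {
    my ($seq, $pair) = @_;
    my $count = () = $seq =~ /(?=\Q$pair\E)/g;
    return $count;
}

my @letters = qw(M H D L N);       # one letter per element, as described
my $seq     = join '', @letters;   # "MHDLN"

print "$_: ", count_overlapping($seq, $_), "\n" for qw(MH HD DL LN);
print "AA in AAA: ", count_overlapping('AAA', 'AA'), "\n";
```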

tobyink's version improves clarity by generating a list of counts and summing them (m//g makes a list of matches, and that list in scalar context gives the count). map is used to make a list, which feels good :)

You rely on m//g generating a list of matches, and on map in scalar context returning a count.

Is there a performance advantage? A penalty? Was map in scalar context as expensive as map in void context (before Perl v5.8.1)?
It doesn't really matter, as the reason for using map over foreach is clarity/brevity/tradition.
The basic intent echoed in all the manuals and books is: map for transforming lists, grep for filtering lists, foreach (for) for counting (iterating).
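The two styles under discussion might look roughly like this. The exact code from the earlier replies is not reproduced in this thread, so this is only a guess at its shape, and `@data` is hypothetical sample data.

```perl
use strict;
use warnings;
use List::Util qw(sum);

my @data = ('zAAzAAz', 'AA', 'xyz');   # hypothetical sample data

# map as a transformation: one match count per string, then sum them.
# The leading 0 keeps sum from returning undef for an empty list.
my $via_map = sum( 0, map { my $n = () = /AA/g; $n } @data );

# foreach as iteration: the same total with an explicit counter.
my $via_loop = 0;
for my $str (@data) {
    $via_loop++ while $str =~ /AA/g;
}

print "map + sum: $via_map\n";    # 3
print "foreach:   $via_loop\n";   # 3
```

Both count non-overlapping matches only; which reads better is largely a matter of taste.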

I simply didn't think about doing it that way. I could probably count the number of times I've used map in a scalar context on my fingers.

That said, given that the original question was about gene sequences, which can be pretty long strings, it seems preferable to avoid the creation of a temporary array containing all matches. Whether that's an important concern or not depends on many factors (typical number of matches expected; typical length of matched strings; etc).
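A sketch of a count that never builds a list of matches, assuming the sequence is held in one string. It uses index rather than a regex; the sub name `count_pair` is invented for this example.

```perl
use strict;
use warnings;

# Count occurrences of $pair in $seq without collecting the matches.
sub count_pair {
    my ($seq, $pair) = @_;
    my $count = 0;
    my $pos   = 0;
    while ( ( $pos = index $seq, $pair, $pos ) >= 0 ) {
        $count++;
        $pos++;    # step by one, not by length($pair), so overlaps are seen
    }
    return $count;
}

print count_pair('ASDTDAAFRASEQSAAAFDG', 'AA'), "\n";   # 3
```

Memory use stays constant no matter how many matches there are, at the cost of a slightly longer loop.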

I have a good idea of what you want to do and the data you're dealing with (I do quite a lot of bioinformatics-based work).

Anyway, I've written two scripts for you to look at. I've kept the code simple and commented, so you should be OK with it. Over a whole chromosome this might not be as fast as it could be, but it should be OK.

Firstly, you won't want to split the sequence into an array of pairs unless you are absolutely sure you aren't going to miss a count when an acid occurs an odd number of times in a row. For example, FAAAD would be split into FA-AA-D? or ?F-AA-AD, so you would only count AA once, whereas it actually contains 2 pairs.

This is the first script. It will find only AA pairs and count them. The sequence is ASDTDAAFRASEQSAAAFDG (it's in the code), so the number of AAs should be 3.
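The script itself did not survive in this copy of the thread. A minimal sketch of the idea, assuming a two-letter window that slides along the string one position at a time, could look like this:

```perl
use strict;
use warnings;

my $string = 'ASDTDAAFRASEQSAAAFDG';
my $count  = 0;

# Look at every two-letter window: positions 0-1, 1-2, 2-3, ...
# The last window starts at length - 2.
for my $i ( 0 .. length($string) - 2 ) {
    $count++ if substr( $string, $i, 2 ) eq 'AA';
}

print "Number of AA pairs: $count\n";   # 3
```

The matches are at DAAF (one pair) and SAAAF (two overlapping pairs).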

This is the second script; it's a bit more complicated. Instead of only counting the number of AAs, it counts all pairs. It creates the hash entries on the fly, and if it encounters a pair it has already created, it just increments the value. Just to note: for the 22 possible amino acids, the number of possible pairs is much higher (22 x 22 = 484).
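Again the original script is missing here, so the following is only a sketch of the approach described. A hash maps a key (here the two-letter pair) to a value (how often it has been seen), which is why no 484-element table needs to be set up in advance.

```perl
use strict;
use warnings;

my $string = 'ASDTDAAFRASEQSAAAFDG';
my %acids;    # key: two-letter pair, value: number of times seen

for my $i ( 0 .. length($string) - 2 ) {
    my $pair = substr $string, $i, 2;
    if ( exists $acids{$pair} ) {
        $acids{$pair}++;       # seen before: bump the count
    }
    else {
        $acids{$pair} = 1;     # first sighting: create the entry
    }
}

# Print every pair that occurred, in alphabetical order.
print "$_ => $acids{$_}\n" for sort keys %acids;
```

For this 20-letter sequence there are 19 windows and 15 distinct pairs; AA is seen 3 times, AS and AF twice each.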

I have no bioinformatic background, but I'd like to offer a couple of comments on your code, specifically the version that counts overlapping letter pairs (would 'digrams' be an appropriate term for these?).

Because it is not necessary to check for the existence of a hash key before incrementing its value (due to autovivification), the body of this for-loop can be reduced to a single statement: ++$acids{ substr $string, $i, 2 }
This will almost certainly yield a speed benefit.

Alternatively, in 5.10+ versions of Perl, the entire for-loop can be replaced by a single regex (tested): $string =~ m{ (?= (..) (?{ ++$pairs2{$^N} }) (*FAIL)) }xms;
This may or may not increase speed; you will have to Benchmark this for yourself.
The alternate regex m{ (?= .. (?{ ++$pairs2{${^MATCH}} }) (*FAIL)) }xmsp also works (note the additional /p regex modifier) and may be slightly faster because no capturing group is used. Again, Benchmark-ing will tell the tale.
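A quick way to convince yourself that the single-statement loop and the first regex give the same counts, sketched on the sample sequence from earlier in the thread. It assumes a reasonably modern Perl (5.10+ for (*FAIL); lexical variables inside (?{ }) blocks are only reliable from 5.18 on).

```perl
use strict;
use warnings;

my $string = 'ASDTDAAFRASEQSAAAFDG';

# Loop version: autovivification creates each key on its first increment.
my %acids;
++$acids{ substr $string, $_, 2 } for 0 .. length($string) - 2;

# Regex version: the code block runs for every position where two
# characters are available, then (*FAIL) forces the engine to move on.
my %pairs2;
$string =~ m{ (?= (..) (?{ ++$pairs2{$^N} }) (*FAIL) ) }xms;

# Compare the two hashes key by key.
my $same = keys(%acids) == keys(%pairs2);
for my $pair ( keys %acids ) {
    $same &&= ( $pairs2{$pair} // 0 ) == $acids{$pair};
}

print $same ? "identical counts\n" : "counts differ\n";
```

This only checks agreement, not speed; timing the two is still a job for Benchmark.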

So I went back to the start after posting this and essentially came up with your first option. I'm glad you recommended it as well, because I was worried that it was childish/too roundabout.
Thanks so much everyone. :)