Hello! I'm new to Perl, but I need to write a script to analyse some DNA sequence files. I have an input file like this: > 1 atggggagcgattt... > 10 atgggccat... .... I'd like the output to contain only the number from each line.

Hi Bill! I tried but this was the message I obtained: Can't modify concatenation (.) or string in scalar assignment at -e line 1, near "q() unless" Unmatched ( in regex; marked by <-- HERE in m/\D*?( <-- HERE / at -e line 1.

I also tried replacing " with ', but then I got only the second message: Unmatched ( in regex; marked by <-- HERE in m/\D*?( <-- HERE / at -e line 1.

I almost have what I want, but the loop doesn't seem to run: I get only the number from the first line in the output. Let me explain better. I have a file like this: > 0 agtttatcg... aggttttccgg... > 2 ggttaattggcc aaaggttccgtatacg.... I want to copy only the numbers to the output. If I run the script you suggested, I get this: 0 1 CATCAAAATAATCATGTATTAGTAAAAGTTTAGTAAAAAATACTAAAACTATTGACAATTCAAACTAATACTTGTATAATGGAAGCGTATTCAAAAAATAACAGGAGGTTCTCATAATGAGAAAATCTAACGTTCAGATGAAGTCTCGTCTATCCTATGCAGCGGGTGCTTTTGGTAACGACGTCTACTATGCAACGTTGTCAACATACTTTATT > 10 CATCAAGATAACCATGTATTAGTAAAATTTTAGTAAAAAACACTGAAATTATTGACTGCATAAACCAATTTTCATATAATGTAAACGTATTCAAATAATAGGAGGTTTCCGAAATGGAAAAATCTAAAGGTCAGATGAAGTCTCGTTTATCCTACGCAGCTGGTGCTTTTGATAACGACGTCTTCTATGCAACCTTGTCAACATTACTTTATC > 100 CATCAAAATAATCATGTATTAGTAAAAGTTTAGTAAAAATACTAAAACTATTGACAATTCAAACTAATACTTGTATAATGGAAGCGTATTCAAAAAATGACAGGAGGTTCTCATAATGAGAAAATCTAACGTTCAGATGAAGTCTCGTCTATCCTATGCAGCGGGTGCTTTTGGTAACGACGTCTTCTATGCAACGTTGTCAACATACTTTATT .... and so on...

If I run the script I wrote, I get only the first number, so it does not go on to the other lines. Do you know why?

You may want to post part of the data file as dumped by the od command, e.g.: [root@099-91-RKB-2 ~]# od -c rep_set_ass_tax.fna

Code

0000000 > 0 \r \n C A T C A A A A T A
0000020 A T C A T G T A T T G G T A A A
0000040 A G T T T A G T A A A A A T A C
0000060 T A A A A C T A T T G A C A A T
0000100 T C A A A C T A A T A C T T G T
0000120 A T A A T G G A A G C G T A T T
0000140 C A A A A A A T A A A C A G G A
0000160 G G T T C T C A T A A T G A G A
0000200 A A A T C T A A C G T T C A G A
0000220 T G A A G T C T C G T C T A T C
0000240 C T A T G C A G C G G G T G C T
0000260 T T T G G T A A C G A C G T C T
0000300 T C T A T G C A A C G T T G T C
0000320 A A C A T A C T T T A T T \r \n >
0000340 1 \r \n C A T C A A A A T A A
0000360 T C A T G T A T T A G T A A A A
0000400 G T T T A G T A A A A A A T A C
0000420 T A A A A C T A T T G A C A A T
0000440 T C A A A C T A A T A C T T G T
0000460 A T A A T G G A A G C G T A T T
0000500 C A A A A A A T A A C A G G A G
0000520 G T T C T C A T A A T G A G A A
0000540 A A T C T A A C G T T C A G A T
0000560 G A A G T C T C G T C T A T C C
0000600 T A T G C A G C G G G T G C T T

But when I run it, I get just 0 in the output file... When I run od -c rep_set_ass_tax.fna, I get:

close $out;MacQIIME Mac-Pro-di-Francesca:cartella senza titolo $ od -c rep_set_ass_tax.fna 0000000 > 0 \r C A T C A A A A T A A 0000020 T C A T G T A T T G G T A A A A 0000040 G T T T A G T A A A A A T A C T 0000060 A A A A C T A T T G A C A A T T 0000100 C A A A C T A A T A C T T G T A 0000120 T A A T G G A A G C G T A T T C 0000140 A A A A A A T A A A C A G G A G 0000160 G T T C T C A T A A T G A G A A 0000200 A A T C T A A C G T T C A G A T 0000220 G A A G T C T C G T C T A T C C 0000240 T A T G C A G C G G G T G C T T 0000260 T T G G T A A C G A C G T C T T 0000300 C T A T G C A A C G T T G T C A 0000320 A C A T A C T T T A T T \r > 1 0000340 \r C A T C A A A A T A A T C A 0000360 T G T A T T A G T A A A A G T T 0000400 T A G T A A A A A A T A C T A A 0000420 A A C T A T T G A C A A T T C A 0000440 A A C T A A T A C T T G T A T A 0000460 A T G G A A G C G T A T T C A A 0000500 A A A A T A A C A G G A G G T T 0000520 C T C A T A A T G A G A A A A T 0000540 C T A A C G T T C A G A T G A A 0000560 G T C T C G T C T A T C C T A T 0000600 G C A G C G G G T G C T T T T G 0000620 G T A A C G A C G T C T A C T A 0000640 T G C A A C G T T G T C A A C A 0000660 T A C T T T A T T \r > 1 0 \r 0000700 C A T C A A G A T A A C C A T G 0000720 T A T T A G T A A A A T T T T A 0000740 G T A A A A A A C A C T G A A A 0000760 T T A T T G A C T G C A T A A A 0001000 C C A A T T T T C A T A T A A T 0001020 G T A A A C G T A T T C A A A T 0001040 A A T A G G A G G T T T C C G A 0001060 A A T G G A A A A A T C T A A A 0001100 G G T C A G A T G A A G T C T C 0001120 G T T T A T C C T A C G C A G C 0001140 T G G T G C T T T T G A T A A C 0001160 G A C G T C T T C T A T G C A A 0001200 C C T T G T C A A C A T T A C T 0001220 T T A T C \r > 1 0 0 \r C A T 0001240 C A A A A T A A T C A T G T A T 0001260 T A G T A A A A G T T T A G T A 0001300 A A A A T A C T A A A A C T A T 0001320 T G A C A A T T C A A A C T A A 0001340 T A C T T G T A T A A T G G A A 0001360 G C G T A T T C A A A A A A T 
G 0001400 A C A G G A G G T T C T C A T A 0001420 A T G A G A A A A T C T A A C G 0001440 T T C A G A T G A A G T C T C G 0001460 T C T A T C C T A T G C A G C G 0001500 G G T G C T T T T G G T A A C G 0001520 A C G T C T T C T A T G C A A C 0001540 G T T G T C A A C A T A C T T T 0001560 A T T \r > 1 0 0 0 \r C A T C 0001600 A A A A T A A T C A T G T A T T 0001620 A G T A A A A G T T T A G T A A 0001640 A A A A T A C T A A A A C T A T 0001660 T G A C A A T T C A A A C T A A 0001700 T A C T T G T A T A A T G G G A 0001720 G C G T A T T C A A A A A A T A 0001740 A C A G G A G G T T C T C A T A 0001760 A T G A G A A A A T C T A A C G 0002000 T T C A G A T G A A G T C T C G 0002020 T C T A T C C T A T G C A G C G 0002040 G G T G C T T T T G G T A A C G 0002060 A C G T C T T C T A T G C A A C 0002100 G T T G T C A A C A T A C T T T 0002120 A T T \r > 1 0 0 1 \r C A T C 0002140 A A G A T A A C C A T G T A T T 0002160 G G T A A A A T T T T A G T A A 0002200 A A A A C A C T G A A A T T A T 0002220 T G A C T G C A T A A A C C A A 0002240 T T T T C A T A T A A T G T A A 0002260 A C G T A T C C A A A T A A T A 0002300 G G A G G T T T C C G A A A T G 0002320 G A A A A A T C T A A A G G T C 0002340 A G A T G A A G T C T C G T T T 0002360 A T C C T A C G C A G C T G G T 0002400 G C T T T T G G T A A C G A C G 0002420 T C T T C T A T G C A A C C T T 0002440 G C C A A C A T A C T T T A T C 0002460 \r > 1 0 0 2 \r C A T C A A A 0002500 A T A A T C A T G T A T T A G T 0002520 A A A A G T T T A G T A A A A A 0002540 A T A C T A A A A C T A T T G A 0002560 C A A T T C A A A C T A A T A C 0002600 T T G T A T A A T G G A A G C G 0002620 T A T T C A A A A A A A T A A C 0002640 A G G A G G T T C T C A T A A T 0002660 G A G A A A A T C T A A C G T T 0002700 C A G A T G A A G T C T C G T C 0002720 T A T C C T A T G C A G C G G G 0002740 T G C T T A T G G T A A C G A C 0002760 G T C T T C T A T G C A A C G T 0003000 T G T C A A C A T A C T T T A T 0003020 T \r > 1 0 0 3 \r C A T C A A 
0003040 A A T A A T C A T G T A T T A G 0003060 T A A A ..............

1446640 A T G C A A C C T T G T C A A C 1446660 A T A C T T T A T C \r > 9 9 4 1446700 \r C A T C A A A A T A A T C A 1446720 T G T A T T A G T A A A A G T T 1446740 T A G T A A A A A T A C T A A A 1446760 A C T A T T G A C A A T T C A A 1447000 A C T A A T A C T T G T A T A A 1447020 T G G A A G C G T A T T C A A A 1447040 A A A T A A C A G G A G G T T C 1447060 T C G T A A T G A G A A A A T C 1447100 T A A C G T T C A G A T G A A G 1447120 T C T C G T C T A T C C T A C G 1447140 C A G C G G G T G C T T T T G G 1447160 T A A C G A C G T C T T C T A T 1447200 G C A A C G T T G T C A A C A T 1447220 A C T T T A T T \r > 9 9 5 \r 1447240 C A T C A A A A T A A T C A T G 1447260 T A T T A G T A A A A G T T T A 1447300 G T A A A A A T A C T A A A A C 1447320 T A T T G A C A A T T C A A A A 1447340 C T A A T A C T T G T A T A A T 1447360 G G A A G C G T A T T C A A A A 1447400 A A T A A C A G G A G G T T C T 1447420 C A T A A T G A G A A A A T C T 1447440 A A C G T T C A G A T G A A G T 1447460 C T C G T C T A T C C T A T G C 1447500 A G C G G G T G C T T T T G G T 1447520 A A C G A C G T C T T C T A T G 1447540 C A A C G T T G T C A A C A T A 1447560 C T T T A T T \r > 9 9 6 \r C 1447600 A T C A A A A T A A T C A T G T 1447620 A T T A G T A A A A G T T T A G 1447640 T A A A A A A T A C T A A A A C 1447660 T A T T G A C A A T T C A A A C 1447700 T A A T A C T T G T A T A A T G 1447720 G A A G C G T A T T A C A A A A 1447740 A A T A A C A G G A G G T T C T 1447760 C A T A A T G A G A A A T C T A 1450000 A C G T T C A G A T G A A G T C 1450020 T C G T C T A T C C T A T G C A 1450040 G C G G G T G C T T T T G G T A 1450060 A C G A C G T C T T C T A T G C 1450100 A A C G T T G T C A A C A T A C 1450120 T T T A T T \r > 9 9 7 \r C A 1450140 T C A A G A T A A C C A T G T A 1450160 T T A G T A A A A T T T T A G T 1450200 A A A A A A C A C T G A A A T T 1450220 A T T G A C T A C A T A A A C C 1450240 A A T T T T C A T A T A G T G T 1450260 A A A C G T A T T C A A A T A A 1450300 
T A G G A G G T T T C C G A A A 1450320 T G G A A A A A T C T A A A G G 1450340 T C A G A T G A A G T C T C G T 1450360 T T A T C C T A C G C A G C T G 1450400 G T G C T T T T G G T A A C G A 1450420 C G T C T T C T A T G C A A C C 1450440 T T G T C A A C A T A C T T T A 1450460 T C \r > 9 9 8 \r C A T C A A 1450500 A A T A A T C A T G T A T T A G 1450520 T A A A A G T T T A G T A A A A 1450540 A A T A C T A A A A C T A T T G 1450560 A C A A T T C A A A C T A A T A 1450600 C T T G T A T A A T G G A A G C 1450620 G T A T T C A A A A A A T A A T 1450640 A G G A G G T T C T C A T A A T 1450660 G G G A A A A T C T A A C G T T 1450700 C A G A T G G A G T C T C G T C 1450720 T A T C C T A T G C A G C G G G 1450740 T G C T T T T G G T A A C G A C 1450760 G T C T T C T A T G C A A C G T 1451000 T G T C A A C A T A C T T T A T 1451020 T \r > 9 9 9 \r C A T C A A G 1451040 A T A A C C A T G T A T T A G C 1451060 A A A A T T T T A G T A A A A A 1451100 A C A C T G A A A T T A T T G A 1451120 C T G C A T A A A C C A A T T T 1451140 T C A T A T A A T G T A A A C G 1451160 T A T T C A A A T A A T A G G A 1451200 G G T T T C C G A A A T G G A A 1451220 A A A T C T A A A G G T C A G A 1451240 T G A A G T C T C G T T T A T C 1451260 C T A C G C A G C T G G T G C T 1451300 T T T G G T A A C G A C G T C T 1451320 T C T A T G C A A C C T T G T C 1451340 A A C A T A C T T T A T C

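It looks like the problem comes from the fact that the file was produced with (old-style) Mac OS line endings and you are processing it with a Perl that expects Unix/Linux line endings.

In your file the end-of-line character is a bare \r (carriage return), as the od output shows, while under Linux or Unix it is \n (line feed). Perl therefore never sees an end of line and reads the whole file as one single line.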
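To illustrate why only the first number comes out, here is a small sketch. The sample string is made up to mimic the od dump above (it is not the real file), and it is read through an in-memory filehandle so that nothing on disk is needed. With the default record separator, Perl finds no \n and hands back everything as one record:

```perl
use strict;
use warnings;

# Hypothetical sample mimicking the od dump: records end with a bare \r.
my $data = ">0\rCATCAAAATA\r>1\rCATCAAGATA\r>10\rCATCAAAATA\r";

# In-memory filehandle, so the sketch needs no external file.
open my $in, '<', \$data or die "Cannot open in-memory file: $!";

my @lines = <$in>;    # the default separator $/ is "\n"
close $in;

printf "Records read with the default separator: %d\n", scalar @lines;

# A match that runs once per record therefore sees only the first number.
my ($first) = $lines[0] =~ /(\d+)/;
print "First number found: $first\n";
```

For this sample the record count is 1 and the first number found is 0, which would match the symptom of getting just 0 in the output file.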

You can either reprocess your file with a command like this one:

Code

perl -pi -e 's/\r/\n/g;' my_file.txt

prior to running your script, to put the file into a Unix/Linux format. This is usually the type of thing I do when I encounter problems between Dos/Unix/Mac end-of-line formats.

(The reason I usually prefer to change the data rather than the program is that I usually run into this type of problem when someone accidentally changes the format of a file, for example by editing a Unix-generated file under Windows. In such a case I usually do not want to change my program, but prefer to make the data consistent with the program, because I know that if I change the program, next time I will probably have the opposite problem. When possible, though, I write my programs so that they can handle both kinds of input without any problem.)
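As a rough sketch of what handling any kind of input can look like in Perl: read the text in one go and split it on all three line-ending conventions. The helper name split_lines and the sample strings are made up for illustration, and slurping the whole file is an assumption that only suits files that fit in memory.

```perl
use strict;
use warnings;

# Hypothetical helper: split raw text into lines whatever the line ending is.
sub split_lines {
    my ($text) = @_;
    # \r\n is listed first so a DOS ending is not counted as two line breaks.
    return split /\r\n|\r|\n/, $text;
}

my @dos  = split_lines(">0\r\nCAT\r\n>1\r\nGGA\r\n");   # DOS/Windows
my @mac  = split_lines(">0\rCAT\r>1\rGGA\r");           # old Mac OS
my @unix = split_lines(">0\nCAT\n>1\nGGA\n");           # Unix/Linux

# Each of the three inputs yields the same four lines.
print scalar(@$_), " lines: @$_\n" for \@dos, \@mac, \@unix;
```

On Perl 5.10 or later the pattern could probably be shortened to \R, which matches a generic line break.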

Or you can redefine the input record separator to be the Mac end-of-line character within your script (near the top of the script) with the following command (similar or identical to the one suggested above by Fishmonger):
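The command itself is missing from this post as it appears here. As a minimal sketch, assuming it sets Perl's input record separator $/ to "\r", it could look like the following. The sample data and the @ids array are made up for illustration, and the regex is only one possible way to pick the number out of a header line:

```perl
use strict;
use warnings;

# Hypothetical sample in the same shape as the od dump: ">ID\rSEQUENCE\r".
my $data = ">0\rCATCAAAATA\r>1\rCATCAAGATA\r>10\rCATCAAAATA\r";
open my $in, '<', \$data or die "Cannot open in-memory file: $!";

my @ids;
{
    local $/ = "\r";    # a bare carriage return now ends each record
    while ( my $line = <$in> ) {
        chomp $line;    # chomp removes the current value of $/
        # Keep the number from header lines only (those starting with ">").
        push @ids, $1 if $line =~ /^>\s*(\d+)/;
    }
}
close $in;

print "$_\n" for @ids;    # 0, 1 and 10 for the sample above
```

Setting $/ with local inside a block keeps the change from leaking into the rest of the script; a plain assignment near the top of the script works too if every file it reads uses \r.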

Great!!! It works! Many thanks to everyone for the help!! I have another question: do you know how to add the same string to each line of the file? I mean, I want to add "k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcusthermophilus" next to each number in the file, and it should be tab-delimited. Like this: 1 S.thermophilus 10 S.thermophilus.... and so on...
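One possible way to do this, sketched under assumptions: the numbers are already available one per entry (the @ids list below is made up), and the string to append is the taxonomy string from the question. Each number is joined to the string with a tab character:

```perl
use strict;
use warnings;

# The string to append, copied from the question above.
my $taxonomy = 'k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;'
             . 'f__Streptococcaceae;g__Streptococcus;s__Streptococcusthermophilus';

# Hypothetical input: the numbers extracted in the previous step.
my @ids = ( 0, 1, 10, 100 );

# Build one tab-delimited row per number.
my @rows = map { "$_\t$taxonomy" } @ids;

print "$_\n" for @rows;
```

In a real script the rows would be printed to the output filehandle instead of standard output.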

I normally do that with either dos2unix file.txt or unix2dos file.txt depending on which direction I need.

Yes, these are handy, but I am not sure that they work with files coming from a Mac, which are neither DOS-like nor Unix-like. Besides, the dos2unix and unix2dos utilities are usually available under Linux (and/or bash) but not always under Unix/ksh. For example, I recently had to make a dos2unix alias under AIX, aliasing dos2unix to "perl -pi -e 's/\r//g'", because the command did not exist natively under AIX.