@Mat: Have you tried it with Unicode chars like \u0a0a ਊ or \u090a ऊ? That's the only time the problem shows itself. My file has 532 such chars.
–
Peter.O May 31 '12 at 12:52

Ah, no, those get miscounted. To fix your script, I guess you should count each 0a that is not "legitimate" (xx0a doesn't get counted, and 0a0a only counts for one, if I understand it correctly).
–
Mat May 31 '12 at 13:13

There is nothing wrong with my script; it works fine. The tally error comes from wc (and awk's count of NR is out by a further 1). The above script's line count is the same as the one shown in emacs. I'm just trying to find a less clumsy way of counting lines in a UTF-16LE/CR-LF file (with a BOM, in this case, if that makes a difference).
–
Peter.O May 31 '12 at 13:15
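
For reference, the wc miscount discussed above can be reproduced with a tiny hand-built file. This is only a sketch: the sample bytes and the variable names (sample, raw_count) are made up for illustration, and it assumes a printf that accepts octal escapes, including \000.

```shell
#!/bin/sh
# Build a one-line UTF-16LE file byte by byte (hypothetical sample data).
# Octal escapes: \377\376 = BOM, \012\012 = U+0A0A, \015\000 = CR, \012\000 = LF.
sample=$(mktemp)
printf '\377\376\012\012\015\000\012\000' > "$sample"

# wc -l counts every 0x0a byte, so both bytes of U+0A0A are counted
# in addition to the low byte of the real line feed.
raw_count=$(wc -l < "$sample" | tr -d ' ')
echo "wc -l on raw UTF-16LE: $raw_count"

rm -f "$sample"
```

The file holds a single line, yet it contains three 0x0a bytes, which is why the raw tally runs high.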

Good, thanks. It works well. I'm still on the outer fringes of perl, so it's a good learning example for me. It seems as simple as telling perl to read the file as UTF-16. I like that. This method involves no manipulation of the data (a definite plus), and Warren's method is very simple to write (no scripting is also a definite plus), so both answers are great.
–
Peter.O May 31 '12 at 14:06

Thanks Warren. I haven't got either d2u or dos2unix in my Ubuntu repo, but in all this I've just discovered a similar solution. It's much the same, though slightly different, so here it is for general reference: <"$file" recode UTF-16LE..UTF-8 |wc -l ... recode uses the iconv libraries, and has added the concept of surfaces, which I'm finding to be quite handy.
–
Peter.O May 31 '12 at 13:47
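
A rough iconv equivalent of the recode pipeline, for anyone without recode installed. This is a sketch under assumptions: the two-line sample file is invented for the demonstration, iconv is assumed to be on the PATH, and tr -d '\r' is used here as a stand-in for dos2unix rather than the tool itself.

```shell
#!/bin/sh
# Hypothetical sample: two CR-LF lines in UTF-16LE with a BOM.
# Line 1 is U+0A0A (bytes 0a 0a), line 2 is U+090A (bytes 0a 09).
file=$(mktemp)
printf '\377\376\012\012\015\000\012\000\012\011\015\000\012\000' > "$file"

# Decode to UTF-8 first, so that 0x0a only ever appears as a real line feed.
utf8_count=$(iconv -f UTF-16LE -t UTF-8 < "$file" | wc -l | tr -d ' ')

# Stripping CR (stand-in for dos2unix) leaves the count unchanged,
# because wc -l only looks at LF.
lf_only_count=$(iconv -f UTF-16LE -t UTF-8 < "$file" | tr -d '\r' | wc -l | tr -d ' ')

echo "after iconv: $utf8_count, after iconv + tr: $lf_only_count"
rm -f "$file"
```

Both counts should come out as 2 for this sample, matching the real number of lines.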

@Peter.O: Really? sudo apt-get install dos2unix worked just fine here on my 11.10 box. This version of dos2unix works in pipelines. Since iconv(1) appears to be installed by default, that should be all you need.
–
Warren Young May 31 '12 at 14:20

@Peter.O: I just upgraded that box to 12.04, and dos2unix is still in the stock package repo. Then I checked an old stable 8.04 box, and there's not a package called dos2unix, but there is a package tofrodos which includes a fromdos command symlinked to dos2unix, which works in a pipeline.
–
Warren Young May 31 '12 at 15:17

For interest's sake, I just now installed a fresh Ubuntu 11.10 as a VM, and dos2unix doesn't show up in that default system, either. The only tab completion for dos is dosfs. Oh well, for a simple line count it doesn't matter whether the line endings are CR-LF or just LF, so long as the file is UTF-8 (which means that there are no extraneous \x0a bytes floating about).
–
Peter.O May 31 '12 at 16:06