Now admittedly, my code is probably clunky and whatnot, but I would assume
that this would be the model one would follow for splitting a
file into multiple lines. My question has two parts: why would one use <> when
it's so slow relative to read, and why hasn't <> been implemented
in such a fashion that it takes advantage of read's quickness?
Cluka

I think at least one reason is that <> is line-oriented, in the sense that it scans the data for the next line separator and returns everything before that (well, since the last line separator). On the other hand, read is block-oriented. You tell it how big a block you want, and it reads in that many bytes. It doesn't look at or scan through the data like <> does. So it depends on how much structure you want. If you want the next line of text, <> does that for you, at the cost of a little speed. If you just want the next n-byte chunk, read is faster. You could try to implement <> with read, but what you'd end up doing is reading in some "reasonable"-size chunk and scanning through it for the line separator, throwing the rest away or maybe needing to read the next chunk to find the end of line. And that method doesn't really have any advantages over just using <> in the first place.
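Here is a tiny sketch of that difference, using an in-memory filehandle so it's self-contained (a real file behaves the same way; the sample data is made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $data = "first line\nsecond line\n";

# Line-oriented: <> scans for the next "\n" and returns one whole line.
open my $fh, '<', \$data or die "open: $!";
my $line = <$fh>;                   # "first line\n"
close $fh;

# Block-oriented: read() grabs exactly N bytes, no scanning at all.
open $fh, '<', \$data or die "open: $!";
read($fh, my $block, 8);            # "first li" -- just 8 raw bytes
close $fh;

print "line:  $line";
print "block: $block\n";
```

The line came back terminated and complete; the block came back as exactly 8 bytes, mid-word, because read never looked at the contents.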

Actually, after I posted the above message I started work on
a module that implements <> (via overloading) using read.
It still beats the socks off of the traditional <> and gives
the "line-oriented" feel back to the user. (The module essentially
reads in an 8KB block and feeds lines to the user until it needs
to read another block...)
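A minimal sketch of that buffered technique (not the poster's actual module, just an illustration of "read a block, hand out lines"):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns a closure that yields one line per call, refilling an
# internal buffer with 8KB read()s as needed.
sub make_line_reader {
    my ($fh) = @_;
    my $buf = '';
    return sub {
        my $nl;
        # Refill the buffer until it holds a newline or the file ends.
        while (($nl = index($buf, "\n")) < 0) {
            my $got = read($fh, $buf, 8192, length $buf);
            defined $got or die "read: $!";
            last if $got == 0;               # EOF
        }
        if ($nl >= 0) {                      # note: >= 0, index can return 0!
            my $line = substr($buf, 0, $nl + 1);
            substr($buf, 0, $nl + 1) = '';
            return $line;
        }
        return undef unless length $buf;     # true EOF
        my $line = $buf;                     # last line had no trailing "\n"
        $buf = '';
        return $line;
    };
}

# Usage (in-memory filehandle; note the empty line and missing final "\n"):
my $data = "alpha\n\nbeta";
open my $fh, '<', \$data or die "open: $!";
my $next_line = make_line_reader($fh);
my @lines;
while (defined(my $l = $next_line->())) { push @lines, $l }
close $fh;
```

The `>= 0` test and the final-fragment handling matter; both come up as bugs later in this thread.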

There is good prior discussion at File reading efficiency and other surly remarks. The short answer is <> has to be slower because it does something more complicated. But on some platforms and versions of Perl it is unreasonably slow, and that has to do with external stdio libraries that it relies on.

Benchmarking is a complex thing. See podmaster's node showing that <> version is much faster for him and that a simple change makes it faster still. More on this later. (Oh, and your code is badly broken.)

I stand by a previous statement of mine: I consider perl to be broken if it can't internally implement a faster version of "read a block-at-a-time and split into lines" than you can implement it by writing Perl code. After all, if it can't, we'd be better off replacing the <> implementation with some external module implemented purely in Perl (which could then be converted to C since it all ends up in C anyway, and then optimized, etc.).

But the fact is that Perl is broken. Perl went out of its way to make <> fast. It did so by doing some "interesting" tricks which meant that, in Perl, <> was sometimes faster than fgets() in C. The problem with this was that these tricks didn't work so well in all cases.

It is a bit like what was written in the original node. Now you've got some complex, hard-to-maintain code that is a bit faster than some very simple, portable code. When things change (like Linux or PerlIO), suddenly you end up with big, complex code that is also slow. This describes what happened to Perl and it also describes what some are trying to do to "fix" it.

If you need lines from a file, then use <>. If that ends up being uncomfortably slow, then you might want to look into doing it a different way, like our opening example. But don't "optimize" before you need to.

And here are my results for the benchmarks. I noticed that you cheated by letting your big code use binmode (which saves the C libraries some translation work on some platforms), which means that your replacement code doesn't even give the same results. So I fixed that (which, on my platform, makes about a 20% difference in speed).

Then I checked for other bugs in your code. And this is why you don't get so obsessed about speed! Great, you have code that you think is a lot faster, but I've already found two bugs in it (make that 3, if there is no final newline). Put a lot more effort into getting the code correct and worry a lot less about how fast it runs.

So I rewrote your block-at-a-time code because I thought I saw some places where I could make it faster. (:

And this is when I found the fourth bug! And this was a big one, that completely invalidates the speed tests for the input file I was using.

The reason your block-at-a-time code is so much faster is probably that it says $i > 0 instead of $i >= 0, which means that it manages to read only a fraction of the total number of lines.

My question has two parts: why would one use <> when it's so slow relative to read,

Because efficiency isn't always the goal. If you wanted something to be really fast, you'd probably be better off choosing another language. <> is very convenient for writing straightforward and readable code.

and why hasn't <> been implemented in such a fashion that it takes advantage of read's quickness?

It depends on the degree of efficiency required and the size of the data one is working with.
When thinking of Perl in the mindset of the original goal of the language, the <> method makes perfect sense (easy things easy). You read in a small text file and either report on it or make a small change. But when you look at reading in a 56MB log file that someone forgot to put newline characters in (that really happened), then doing it using the read method makes more sense (hard things possible).
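For that "hard things possible" case, a fixed-size-chunk loop keeps memory bounded no matter how long the "lines" are. A sketch (the data and chunk size are made up; a real 56MB file would use a much larger chunk):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pretend this is a huge file with no newlines at all.
my $data = "AAAABBBBCCCC";
open my $fh, '<', \$data or die "open: $!";
binmode $fh;                         # raw bytes, no text-mode translation

my $total = 0;
while (my $got = read($fh, my $chunk, 4)) {
    # Process $chunk here -- e.g. split fixed-width records.
    $total += $got;
}
close $fh;
print "processed $total bytes\n";
```

With <> and no newlines in the file, the whole 56MB would come back as one "line"; the chunk loop never holds more than one buffer's worth.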

In reply to Tye's comments...
I appreciate the in-depth commentary. This is where I'm
going to learn some real perl.
To give some background that might clear up why I wrote
the code the way I did: I'm running perl on a Win98 system.
I'm writing a piece of code that accepts a collection of
text files from a legacy database along with a configuration
file and generates a series of reports in MS Excel.
The text files are comma-delimited and in a specific, reliable
format (for example, no '\n' at the end of file), allowing for
some of the assumptions I made. I tested the code above
on several of these files and achieved the correct results each
time.
The reports are extremely time-sensitive, and routinely sum to
over 40MB of data. I need the routines to be fast and quick -
although I take to heart Tye's comment that fast code is less
important than correct code.
One question I still have: what effect does binmode have
on the data? In WinBlows it looks as though I still end up
with text in my final array, regardless of whether I use binmode.
In that case, switching to binmode and gaining the speed
increase seems reasonable.
Thanks.

The difference binmode makes in DOS and Windows is crucial! Without binmode, all routines have to do (roughly speaking) two things: 1. did we just read a 0D 0A (CR LF) pair? (if yes, convert it to '\n'); 2. did we just read end-of-file? (if yes, stop). In binmode, all we care about is end-of-file. Even end-of-file can cause a problem if there is an embedded ^Z in the file (the original DOS end-of-file mark, ignored in binmode). And since these are implemented in the OS at root (a thin wrapper in the 'C' library), the distinction is important...
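Roughly, this is what the DOS/Windows text layer does to your data before you ever see it; a sketch of that translation done by hand (in binmode, none of this happens and you get the raw bytes):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Raw bytes as they sit on disk: CR LF line endings and an embedded ^Z.
my $raw = "one\r\ntwo\r\n\x1Ajunk after eof";

(my $text = $raw) =~ s/\x1A.*//s;    # text mode stops at ^Z (0x1A)
$text =~ s/\x0D\x0A/\n/g;            # CR LF -> LF

print $text;                         # "one\ntwo\n"
```

That scan-and-shuffle on every buffer is exactly the "translation work" that skipping binmode pays for.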

--hsm

"Never try to teach a pig to sing...it wastes your time and it annoys the pig."

Did you read binmode? (Granted, it is rather inaccurate and shows that the author doesn't understand the point -- click on one of the links to more modern versions of the document for much better text.) It says "Files that are not in binary mode have CR LF sequences translated to LF on input", which is accurate for Win32 systems. Checking for such takes some time. Actually having to fix that requires that all of the text in the buffer after any CRs needs to be "moved up", which takes even more time. Checking the source for the standard Microsoft C run-time library, I see that the standard:

method is used to avoid multiple "move" operations, which means that the cost of moving is incurred even if no CRs are found [ and also makes for much simpler code that is easier to "get right" than if we tried to switch to strchr() and memmove() to allow assembly-language constructs to search the string and to move the bytes (: ].
- tye (or ldad $54796500 if you're in a hurry)

What if your data contains an empty line? At that point, index $block, "\n"
will at some point return 0. From there on, the loop's condition ($i > 0) will always be false, $left will grow until the end of the file, and the data will never be pushed onto the array @lines.

What if the file data doesn't end with "\n"? Again, the last line is disregarded.
At the very least, push the contents of $left onto the array if it has a length > 0.

The code below behaves better in both regards, though I haven't benchmarked it; it could be that your suspiciously excellent results are indeed partly caused by the presence of an empty line.
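A sketch of a splitter fixed in both regards (this is an illustration, not the code from the thread): note the >= 0 comparison, since index() returns 0 for an empty line at the front of the buffer, plus the final push for a file with no trailing newline.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data with an empty line and no final "\n".
my $data = "first\n\nlast";
open my $fh, '<', \$data or die "open: $!";

my @lines;
my $left = '';
while (read($fh, my $block, 8192)) {
    $block = $left . $block;
    my $i;
    while (($i = index($block, "\n")) >= 0) {   # >= 0, not > 0
        push @lines, substr($block, 0, $i);     # line without its "\n"
        substr($block, 0, $i + 1) = '';
    }
    $left = $block;                  # partial line; wait for more data
}
push @lines, $left if length $left;  # final line without a newline
close $fh;
```

Run against the sample data this yields three lines, including the empty one in the middle, where the $i > 0 version would silently stop after the first.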

Do your test cases include lines that are 15000 characters each, or just test cases where the lines are 50 characters each? It's going to make a difference whether one read() gets 163 lines or you need two read()s to get one line.

Excellent points! Several people have pointed out very egregious
logic errors in the initial code. Some of them are not
errors in the sense that I know the specifics of the files I
am using (no blank lines, no '\n' at the end of the file).

I really like blogan's comment. Does <$fh> hit the disk each
time, or is it reading from a cached block? Does anyone know?

I guess my initial point was flawed for the general case, but
I can reformulate it to a better, stronger statement:

If you know certain aspects of the files you are reading
(e.g. average line size, whether there are blank lines in the file, etc.)
you could implement a bare-bones, lightning-fast read method
that outperforms the traditional <$fh>. But for a basic, system-independent
file-reader, <$fh> is a strong contender.

When putting a smiley right before a closing parenthesis, do you:

Use two parentheses: (Like this: :) )
Use one parenthesis: (Like this: :)
Reverse direction of the smiley: (Like this: (: )
Use angle/square brackets instead of parentheses
Use C-style commenting to set the smiley off from the closing parenthesis
Make the smiley a dunce: (:>
I disapprove of emoticons
Other