On Apr 21, 2010, at 5:29 PM, Martin.Zinser@deutsche-boerse.com wrote:
> If you open a text file with Carriage return carriage control for output
> (based off an existing file) and populate the new file with longer
> records, at some point gratuitous line breaks are added to the file.
Finally getting back to this after six months. And I think I have a
solution. To review, what happens when you use the Perl "open"
operator is that it calls into its own buffered I/O layer named
"perlio" which sits on top of another layer called "unixio" which is
implemented in terms of the CRTL read/write functions. This
arrangement was new in about 5.6 but became the default in 5.10, and
that's where we started seeing the problem Martin describes on VMS.
The problem is that while the perlio layer is buffered, the unixio
layer is not. When the buffer in the perlio layer fills up, it
triggers a flush to the lower layer. That flush causes a write() in
the unixio layer, which goes all the way to disk. If you are writing
to a record-oriented file, you'll likely introduce an extra record
boundary unless you had the extreme good fortune to hit the end of a
line at the same time you hit the end of the buffer. Part of the
problem is that the buffer in the perlio layer is hard-wired to 4K.
With a larger buffer, you would typically not see as many extra
records, but you would still see them.
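To make the failure mode concrete, here is a toy model of the two
layers. It is only a sketch under stated assumptions, not perlio.c or
the CRTL: the names (lower_write, upper_write, SKETCH_BUFSIZ) are
hypothetical, the buffer is 16 bytes rather than 4K, and the lower
layer is assumed to end a record at each newline and also at the end of
any write() that stops mid-line.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_BUFSIZ 16   /* tiny stand-in for the 4K perlio buffer */

/* Hypothetical model of the unbuffered lower layer on a record file. */
struct lower_layer {
    size_t records;        /* records created so far */
};

/* Assumption: each newline ends a record, and a write() that stops
 * mid-line ends one too (the gratuitous boundary). */
static void lower_write(struct lower_layer *lo, const char *buf, size_t len)
{
    size_t i;
    if (len == 0)
        return;
    for (i = 0; i < len; i++)
        if (buf[i] == '\n')
            lo->records++;
    if (buf[len - 1] != '\n')
        lo->records++;
}

/* Hypothetical model of the buffered upper layer. */
struct upper_layer {
    char buf[SKETCH_BUFSIZ];
    size_t used;
    struct lower_layer *lo;
};

static void upper_flush(struct upper_layer *up)
{
    lower_write(up->lo, up->buf, up->used);
    up->used = 0;
}

/* Block buffering: flush only when the buffer is completely full. */
static void upper_write(struct upper_layer *up, const char *data, size_t len)
{
    while (len > 0) {
        size_t room = SKETCH_BUFSIZ - up->used;
        size_t n = len < room ? len : room;
        memcpy(up->buf + up->used, data, n);
        up->used += n;
        data += n;
        len -= n;
        if (up->used == SKETCH_BUFSIZ)
            upper_flush(up);
    }
}

/* Number of records the model produces for the given text. */
static size_t records_block_buffered(const char *text)
{
    struct lower_layer lo = { 0 };
    struct upper_layer up;
    up.used = 0;
    up.lo = &lo;
    upper_write(&up, text, strlen(text));
    upper_flush(&up);      /* flush whatever is left, as at close time */
    return lo.records;
}
```

In this model, two 12-byte lines ("0123456789A\n0123456789B\n") come
out as three records, because the 16-byte buffer fills four bytes into
the second line. A single line that ends exactly on the buffer boundary
comes out as one record.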
It turns out the perlio layer has some knobs and switches on it, and
one of them is a "line buffering" option. If this option is enabled,
then the flush to the lower layer happens whenever a newline character
appears in the data. As long as your lines are shorter than the
length of the buffer, you write them out whole, which empties the
buffer in the upper layer making room for more data, and everything is
peachy.
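The effect of line buffering can be sketched with a small self-contained
model. This is an illustration under assumptions, not the real PerlIO
code: the lb_* names and the 16-byte LB_BUFSIZ are hypothetical, and the
lower layer is assumed to end a record at each newline and at the end of
any write that stops mid-line.

```c
#include <stddef.h>
#include <string.h>

#define LB_BUFSIZ 16       /* tiny stand-in for the real buffer size */

/* Hypothetical buffered stream sitting on a record-oriented file. */
struct lb_stream {
    char buf[LB_BUFSIZ];
    size_t used;
    int linebuf;           /* nonzero: flush whenever a newline is buffered */
    size_t records;        /* records the modeled lower layer has produced */
};

/* Modeled lower-layer write: every newline ends a record, and data
 * that does not end in a newline becomes a record of its own. */
static void lb_flush(struct lb_stream *s)
{
    size_t i;
    if (s->used == 0)
        return;
    for (i = 0; i < s->used; i++)
        if (s->buf[i] == '\n')
            s->records++;
    if (s->buf[s->used - 1] != '\n')
        s->records++;
    s->used = 0;
}

/* Buffer one byte; flush when full or, if line buffering is on,
 * when the byte is a newline. */
static void lb_putc(struct lb_stream *s, char c)
{
    s->buf[s->used++] = c;
    if (s->used == LB_BUFSIZ || (s->linebuf && c == '\n'))
        lb_flush(s);
}

/* Number of records the model produces for text in the given mode. */
static size_t lb_count_records(const char *text, int linebuf)
{
    struct lb_stream s;
    size_t i, len = strlen(text);
    s.used = 0;
    s.linebuf = linebuf;
    s.records = 0;
    for (i = 0; i < len; i++)
        lb_putc(&s, text[i]);
    lb_flush(&s);          /* final flush, as at close time */
    return s.records;
}
```

With two 12-byte lines, the model gives three records when linebuf is 0
and two when it is 1. A 21-byte line still splits into two records even
with linebuf set, which is the "lines shorter than the buffer" caveat.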
So, where and how to enable this line buffering? Here's my proposed
patch:
--- perlio.c;-0 2010-10-21 07:58:15 -0500
+++ perlio.c 2010-11-02 21:32:41 -0500
@@ -3758,6 +3758,22 @@ PerlIOBuf_open(pTHX_ PerlIO_funcs *self,
*/
PerlLIO_setmode(fd, O_BINARY);
#endif
+#ifdef VMS
+#include <rms.h>
+ /* Enable line buffering with record-oriented regular files
+ * so we don't introduce an extraneous record boundary when
+ * the buffer fills up.
+ */
+ if (PerlIOBase(f)->flags & PERLIO_F_CANWRITE) {
+ Stat_t st;
+ if (PerlLIO_fstat(fd, &st) == 0
+ && S_ISREG(st.st_mode)
+ && (st.st_fab_rfm == FAB$C_VAR
+ || st.st_fab_rfm == FAB$C_VFC)) {
+ PerlIOBase(f)->flags |= PERLIO_F_LINEBUF;
+ }
+ }
+#endif
}
}
}
[end]
This is right after the perlio layer has called down to the unixio
layer to get the file open. We have an fd, so we can do an fstat() on
that and retrieve the record format from the VMS-specific bits of the
stat structure. Then I check to see if it's a regular file (not a
device like a mailbox that may need to carry binary data) and that the
record format is either variable or variable with fixed control. If
these conditions are met, I enable the line buffering option on that
filehandle.
I have tested this and it works for situations similar to Martin's
original report, and it does not introduce any new test failures in
the test suite. But what situations, if any, does this break? I'm
assuming that if the record format is FAB$C_VAR or FAB$C_VFC, the
records will never contain binary data with embedded newlines. Is
that true? What other assumptions am I making that I shouldn't?
________________________________________
Craig A. Berry
mailto:craigberry@mac.com
"... getting out of a sonnet is much more
difficult than getting in."
Brad Leithauser