bobf has asked for the
wisdom of the Perl Monks concerning the following question:

I wrote a program that is being run on different platforms (Linux, Windows, and Mac). I tried to write it to be as system-independent as I could (using File::Spec for paths, etc), but recently someone reported a bug. It turns out that she was creating one of the input files on a Mac, then transferring it to a Windows machine and running the program (I didn't think of that...). The error occurred when the program tried to read the input file line-by-line. I presume that since the program was run on a Windows machine, the input record separator ($/) was set to the Windows newline (\015\012). The input file was created on a Mac, though, so it had newlines of \015. As a result, the file got slurped and things turned ugly.

One option is to check $^O, but if I understand it correctly, that only reports the system the program is running on, which (as this bug demonstrates) is not necessarily the system that created the file.

Another option is to use a regex to detect the newline character(s) in the file. I think this would require slurping the whole file, then doing something like if( $file =~ m/\015$/ ) (which assumes the file ends with a newline) or if( $file =~ m/\015(?!\012)/ ) (which doesn't), setting $/ according to what matched, and re-reading the file line-by-line.
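A sketch of that detection approach (this is my own code, not the poster's; the sub name guess_newline is mine, and it assumes the whole file fits in memory):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Guess a file's newline convention by inspecting its raw bytes.
# Returns "\015\012" (Windows), "\015" (old Mac), or "\012" (Unix).
sub guess_newline {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    binmode $fh;                    # raw bytes; no CRLF translation
    my $data = do { local $/; <$fh> };
    close $fh;
    $data = '' unless defined $data;            # empty file
    return "\015\012" if $data =~ /\015\012/;   # CRLF: Windows
    return "\015"     if $data =~ /\015/;       # bare CR: old Mac
    return "\012";                              # LF: Unix (default)
}

# Then re-read line-by-line with $/ set to what we found.
if ( my $file = shift @ARGV ) {
    local $/ = guess_newline($file);
    open my $in, '<', $file or die "Can't open $file: $!";
    binmode $in;
    while ( my $line = <$in> ) {
        chomp $line;    # chomp strips whatever $/ now is
        # ... process $line ...
    }
    close $in;
}
```

Note the CRLF test has to come first, since a Windows file would also match the bare-CR pattern.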

A third option is to preprocess the input file to convert all newline characters to the current system's newline character. I experimented a little, and I think this will work:
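(The poster's code sample didn't survive in this copy of the thread; below is a sketch of what such a conversion might look like. The sub name normalize_newlines is mine.)

```perl
use strict;
use warnings;

# Rewrite a file in place so every CRLF, CR, or LF becomes the local "\n".
sub normalize_newlines {
    my ($path) = @_;
    open my $in, '<', $path or die "Can't read $path: $!";
    binmode $in;                    # read raw bytes
    my $data = do { local $/; <$in> };
    close $in;
    $data = '' unless defined $data;

    # CRLF must come first in the alternation, or it would be split in two.
    ( my $fixed = $data ) =~ s/\015\012|\015|\012/\n/g;

    if ( $fixed ne $data ) {        # rewrite only if something changed
        # Text-mode output: "\n" becomes the local convention
        # (e.g. CRLF on Windows) automatically.
        open my $out, '>', $path or die "Can't write $path: $!";
        print $out $fixed;
        close $out;
    }
}
```

Comparing before writing means a file that is already in local form is left untouched, which addresses the "extra overhead on every run" worry below.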

I think this is my favorite solution, but it seems like a lot of extra overhead for each input file since the conversion only needs to occur once (assuming the input file is not then moved to another OS).

Are there better ways of handling this?

Thanks!

Comment on Newlines: reading files that were created on other platforms

Since \015 and \012 are not used for anything but newlines, why not just look through the file until you see a \015 or \012? If a \015 is found and followed by a \012, you know it's Windows. A \015 without a \012 means it's Mac, and a \012 by itself is Unix/Linux.

If you are worried about huge file sizes (or you have to handle a stream), you can read the file a byte/char at a time until you hit one of the two newline characters, then process the line up to that point.
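A sketch of that byte-at-a-time detection (my code, not the poster's; detect_stream_newline is a name I made up):

```perl
use strict;
use warnings;

# Detect the newline style from a stream without slurping:
# read one byte at a time until the first \015 or \012 appears.
sub detect_stream_newline {
    my ($fh) = @_;
    binmode $fh;
    while ( read( $fh, my $byte, 1 ) ) {
        if ( $byte eq "\015" ) {
            # Peek one more byte to tell CRLF from bare CR.
            my $n = read( $fh, my $next, 1 );
            return "\015\012" if $n && $next eq "\012";
            return "\015";
        }
        return "\012" if $byte eq "\012";
    }
    return;    # no newline found (single-line or empty input)
}
```

Note that detection consumes the bytes it reads, so on a seekable handle you would want to seek back to the start (or buffer what you've read) before processing lines for real.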

I lost my temper over the same problem quite recently and resorted to barbaric (bordering on, if not actually, "not entirely a Perl solution") means. If you intend to display the output as HTML, as I did, what I did might be of help:

I simply wrapped the entire "offending" input file text in <pre></pre> tags and displayed it in an iframe on the same page. This preserved the layout of the page while showing the text in a separate section. It also allowed for a menu of the files, created only once when the page is loaded. Clicking a menu item displays each file in the iframe without reloading the entire page (unless you need to reload, in case someone updates the directory the menu is built from). You still need a Perl script (aha!) to read the files and spew them into the iframe as HTML.
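A minimal sketch of the <pre>-wrapping idea (the sub name file_as_pre is mine): HTML parsing normalizes CR and CRLF to LF, so inside <pre> all three conventions display as line breaks, which is what sidesteps the newline question here.

```perl
use strict;
use warnings;

# Read a file and emit it as HTML inside <pre> tags, escaping the
# HTML metacharacters so the text displays verbatim.
sub file_as_pre {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    my $text = do { local $/; <$fh> };
    close $fh;
    $text = '' unless defined $text;
    for ($text) {
        s/&/&amp;/g;    # must come first, or it would re-escape the others
        s/</&lt;/g;
        s/>/&gt;/g;
    }
    return "<pre>$text</pre>";
}
```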

Be warned, however, that I haven't tested this on a Mac (which are virtually non-existent in my direct social environment).

Is the person using OS X or the older Mac OS? OS X is BSD Unix underneath, so it should have just the newline "\n".

Really, though, the problem is probably in the FTPing. I've run into this before, more times than I can count, and the individual needs to be sure to transfer the file in ASCII mode. Any FTP program worthy of the name should know to handle that. In fact, by correcting the user, you might save problems later on another mixed-platform project.

Nice theory, but unfortunately incomplete. To maintain backward compatibility with the previous decade-and-change worth of documents, GUI apps on OS X (e.g. TextEdit) generally use the Mac line-ending convention (CR), while BSD-derived command-line programs (e.g. perl) use the Unix one (LF). This could be charitably described as a mess.

I agree with your solution, though—make FTP work right, and the problem should go away. There might also be a more targeted fix for this particular problem, but absent more details on the nature of the program, it's going to be tough to say anything very helpful.

If God had meant us to fly, he would *never* have given us the railroads. --Michael Flanders

Unix variants support the file command to determine the file type. I don't know if there is a similar Windows command, but there probably is.
If so, can you simply test each input file to determine its type, and adjust the input record separator for that file?

When reading text of an unknown source that is likely to mix line endings (I find this in the html source of quite a few websites and in WinNT error messages), I tend to preprocess with "tr/\15\12/\n/s;"
and then carry on as if nothing was amiss. This does, however, depend _entirely_ on what you want to do with the rest of the data. To demonstrate its usefulness...
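(The demonstration that followed didn't survive in this copy of the thread; here is a small sketch of the tr/// preprocessing in action. Note that the /s modifier squashes runs of translated characters, which is what turns "\015\012" into a single "\n"—but it also collapses blank lines, which may or may not be acceptable for your data.)

```perl
use strict;
use warnings;

# Mixed Windows (CRLF), Mac (CR), and Unix (LF) endings in one string.
my $text = "one\015\012two\015three\012four";

# tr///: both \15 and \12 map to "\n" (the replacement list is padded
# with its last character); /s squashes adjacent translated characters.
( my $clean = $text ) =~ tr/\15\12/\n/s;

print $clean;    # one\ntwo\nthree\nfour
```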

Preprocessing input files is not something to be afraid of. Of course, it depends on the flow of your data but it is sometimes more efficient to split the job into two distinct, simpler parts than to try and code one more complex solution.

It seems as if you have (approximately) a set of client systems running something like a data-entry or data-processing system, and you are taking the results from these clients and loading/processing them into some master program. Even if you are just swapping files among systems, a preprocessing step can be very helpful (your mileage may vary). The benefit is that, at a certain point in the flow of data, the data all looks exactly the same regardless of the originating/destination platform. That can greatly simplify further processing downstream of that point.

It's worth looking at more closely.

PJ

use strict; use warnings; use diagnostics;

I'll have to agree with PJ and adamk here. Checking for each type of newline ending on every read would be painful, but as a pre-processing step it might actually speed up overall processing. Another suggestion (not that I'm aware of how the input file is obtained): when creating the input file, use a specific newline of your choice so the format is uniform, and use an output method that determines what to emit when the program runs.

I.e., program 1 (or section 1 of the program) takes user input, uses \015 as a standard newline instead of a system-specific one, and creates the input file in that format. When the input file is used, the program then emits the system-specific newline for "display" of the file.
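A sketch of that suggestion (my code; the sub names and the $SEP choice are illustrative): always write the intermediate file with one fixed separator, and translate only at the edges.

```perl
use strict;
use warnings;

my $SEP = "\015";    # canonical separator for stored files (per the post)

# Write lines with the fixed separator, regardless of platform.
sub write_canonical {
    my ( $path, @lines ) = @_;
    open my $fh, '>', $path or die "Can't write $path: $!";
    binmode $fh;     # keep Perl from translating anything on output
    print $fh map { $_ . $SEP } @lines;
    close $fh;
}

# Read them back by setting $/ to the same fixed separator.
sub read_canonical {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't read $path: $!";
    binmode $fh;
    local $/ = $SEP;
    my @lines = <$fh>;
    chomp @lines;    # chomp strips $/, i.e. "\015"
    close $fh;
    return @lines;
}
```

Because both sides agree on $SEP, the file round-trips identically no matter which platform wrote it; only the final "display" step needs to know the local convention.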

"I have said, Ye are gods; and all of you are children of the most High." - Psalms 82:6

To clarify: the input files are just text files created (and edited) by the user. It is possible for a file to be created (or edited) on one platform and then moved to another to be processed, but I expect this to be a rare occurrence. That said, it already happened. :)

After reading the replies here and thinking about this a little more, I think preprocessing is the best way to go in this case. My initial reluctance towards this approach stems only from the fact that each input file will be reprocessed every time the program is run. Thinking out loud: I would be very surprised if any one input file exceeded 1 MB, so slurping it is not an issue with respect to memory. Maybe I could save time by only writing the file back out if a newline character did not match \n (figuring out how to do that will be the next step - perhaps using binmode to make sure they don't get converted to \n on input (per monkey_boy's suggestion), or something like msemtd's example). I will also look into adamk's File::LocalizeNewlines module.

periapt and Drgan summed up the implementation and advantage of this approach quite nicely. I'll write a sub to preprocess the files and see how it goes.