Examining and Manipulating Cross-Platform Text Files

Have you ever had to transfer files over FTP between Windows and Linux systems and had to deal with an administrator who just could not comprehend why the files he or she is giving you aren’t coming out right?

Maybe they are transferring the files to you and they keep showing up with no line breaks. The problem, of course, is pretty simple on the surface. Windows uses Carriage Return (Char(13)) and Line Feed (Char(10)) together to represent line breaks, while Linux and most other Unix-like operating systems use only LF. Of course, if you’re not very lucky, you are dealing with an OS or an import program that expects something even weirder, like RS (Char(30)).
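To make the difference concrete, here is a minimal sketch that builds the same two-line text with each style of line ending and prints the byte values. The variable names are just for illustration.

```powershell
# The same two-line text with Windows (CR/LF) and Linux (LF) line endings.
$windowsText = "one`r`ntwo"
$linuxText   = "one`ntwo"

# Convert each string to its ASCII bytes so the line-ending bytes are visible.
$windowsBytes = [System.Text.Encoding]::ASCII.GetBytes($windowsText)
$linuxBytes   = [System.Text.Encoding]::ASCII.GetBytes($linuxText)

$windowsBytes -join ' '   # 111 110 101 13 10 116 119 111
$linuxBytes   -join ' '   # 111 110 101 10 116 119 111
```

The only difference is the extra 13 in front of the 10, and that single byte is what all the arguing is usually about.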

So yeah, the problem is simple to understand, and usually simple to fix. Use ASCII Transfer mode in FTP instead of Binary and most of the time the problem goes away. But what do you do when the admin on the other side doesn’t believe you that the file is fine from your perspective? Or maybe they insist that they DID FIX IT!!! How do you convince someone that their encoding is wrong, or that yours is fine? Maybe it’s not worth the struggle. How do you fix a file so that they can consume it no matter what?

With UTF-32 it gets a little more complicated. We start with the byte order mark and then have a bunch of unused bytes between each letter. Our Char 10 and 13 are still there though. We’ve taken up more space, but this is still a fairly plain Windows file.
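If you want to see that layout without creating a file, something like the following sketch should do it. It assumes little-endian UTF-32, which is what .NET’s built-in UTF32 encoding produces, and the sample text is made up.

```powershell
# Little-endian UTF-32 as provided by .NET; the sample text "Hi" is arbitrary.
$utf32 = [System.Text.Encoding]::UTF32

# Byte order mark followed by the encoded text with a Windows line ending.
$bytes = $utf32.GetPreamble() + $utf32.GetBytes("Hi`r`n")

# Render every byte as two hex digits.
$hex = ($bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
$hex   # FF FE 00 00 48 00 00 00 69 00 00 00 0D 00 00 00 0A 00 00 00
```

The first four bytes are the byte order mark, and each character after that takes four bytes, with 0D (13) and 0A (10) still marking the line break.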

Lastly, we see what the bytes look like in a file that looks fine to the Linux admin but looks like just a blob of text to us. This simple file is easy to fix manually, but if you’re trying to set up automated data imports on a Windows system, this can be a real pain.
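One way to check which line endings a file really contains is to dump it with Format-Hex and count the CR and LF bytes. This is only a sketch: the temp file name and its contents are hypothetical, and Format-Hex is my choice of tool here, available in PowerShell 5.0 and later.

```powershell
# Hypothetical sample file with Linux (LF only) line endings.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'LinuxSample.txt'
[System.IO.File]::WriteAllText($path, "alpha`nbeta`ngamma`n")

# Hex dump of the file; look for 0A with no 0D in front of it.
Format-Hex -Path $path

# Count the raw CR (13) and LF (10) bytes as a quick sanity check.
$bytes   = [System.IO.File]::ReadAllBytes($path)
$crCount = @($bytes | Where-Object { $_ -eq 13 }).Count
$lfCount = @($bytes | Where-Object { $_ -eq 10 }).Count
"CR bytes: $crCount, LF bytes: $lfCount"   # CR bytes: 0, LF bytes: 3
```

Matching CR and LF counts suggest Windows line endings, while LF bytes with no CR bytes suggest a Linux-style file. Numbers like these are a lot harder to argue with than "it looks fine on my end".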

Fixing the File

So now that we’ve seen how we can inspect the file, what can we do if the admin on the other end just doesn’t know how to fix this? And by the way, this doesn’t always mean they are incompetent. I’ve been told by a very smart admin that getting this right when transferring in and out of AIX is just hard.

That last command really shows us how we can make the other guy’s life easier for very little effort on our part. If you aren’t super familiar with PowerShell, it’s worth looking at exactly how it works.

Get-Content reads a file’s content, but it will break up each line into a discrete string object, stripping its line endings in the process. The syntax forces the entire file to be processed at once, and the newly created array of string objects is handed off to the -join operator. We join by char 10 in this case to give us Linux line endings. We pass that resulting string off to Set-Content, choosing ASCII as our encoding (encoding can be whatever the recipient wants) and making sure we use -NoNewline so we don’t get a Windows line ending appended at the very end of the file. Now you can do a binary file transfer and the Linux system is happy.
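Put together, the steps described above can be sketched like this. The file names and sample content are hypothetical, and wrapping Get-Content in parentheses is my assumption about the syntax that forces the whole file to be read first.

```powershell
# Hypothetical source and target files in the temp directory.
$source = Join-Path ([System.IO.Path]::GetTempPath()) 'WindowsFile.txt'
$target = Join-Path ([System.IO.Path]::GetTempPath()) 'LinuxFile.txt'

# Create a sample file with Windows (CR/LF) line endings.
[System.IO.File]::WriteAllText($source, "first`r`nsecond`r`nthird`r`n")

# Read every line, join the lines with LF, and write the result without
# letting Set-Content append a line ending of its own.
(Get-Content $source) -join [char]10 |
    Set-Content $target -Encoding ASCII -NoNewline

# Verify: the target should contain LF bytes and no CR bytes.
$targetBytes = [System.IO.File]::ReadAllBytes($target)
$targetBytes -contains 13   # False
$targetBytes -contains 10   # True
```

One thing to keep in mind is that -join only puts the separator between lines, so the output does not end with a final LF. If the recipient expects one, you would have to append it yourself before writing.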

Need to terminate lines with a “~”? Yeah, I’ve seen it. Just use -join [char]126. Any crazy line terminator they want, you can provide.
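As a quick illustration with made-up record values, joining in memory shows what the recipient would get:

```powershell
# Hypothetical records standing in for the lines of a file.
$lines = 'REC1', 'REC2', 'REC3'

# Char 126 is the tilde; -join places it between the records.
$separated = $lines -join [char]126
$separated    # REC1~REC2~REC3

# If every record, including the last, must end with the terminator,
# append one more after joining.
$terminated = ($lines -join [char]126) + [char]126
$terminated   # REC1~REC2~REC3~
```

Which of the two forms you need depends on whether the other side treats the character as a separator or as a terminator, so it is worth asking before you send the file.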

This also gives us insight into how to fix Linux line endings that they can’t figure out how to fix for us.

Get-Content c:\BrokenFile.txt | Set-Content c:\FixedFile.txt

In this case we take advantage of the fact that while many older Windows programs adhere slavishly to Windows CR/LF line endings, PowerShell really does attempt to be smarter, so it has been designed such that many cmdlets like Get-Content understand Linux line endings by default. Again, it strips the line endings as it breaks the file’s lines into an array of strings. As those string objects are passed on to Set-Content though, it adds them to the file one at a time, but this time it uses the standard Windows line endings, and just like that, a file that just a second ago looked like one long line of gibberish is fixed.
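If you want to convince yourself that this really works, a small sketch like the one below can help. The temp file names are hypothetical, and the CR count in the comment assumes you are running on Windows, where Set-Content writes CR/LF.

```powershell
# Hypothetical broken and fixed files in the temp directory.
$broken = Join-Path ([System.IO.Path]::GetTempPath()) 'BrokenFile.txt'
$fixed  = Join-Path ([System.IO.Path]::GetTempPath()) 'FixedFile.txt'

# Create a sample file with Linux (LF only) line endings.
[System.IO.File]::WriteAllText($broken, "first`nsecond`nthird`n")

# Get-Content splits on LF; Set-Content writes each line back out
# with the platform's own line ending.
Get-Content $broken | Set-Content $fixed

# Count the lines and the CR bytes in the fixed file.
$fixedLines = @(Get-Content $fixed)
$fixedCr    = @([System.IO.File]::ReadAllBytes($fixed) | Where-Object { $_ -eq 13 }).Count
"Lines: $($fixedLines.Count), CR bytes: $fixedCr"   # on Windows: Lines: 3, CR bytes: 3
```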

One last thing, just to save you a minute of frustration: notice that I did not write to the same file that I read the data from. When PowerShell starts reading a file and breaking the lines into string objects, the first of those strings will reach Set-Content before Get-Content has actually finished reading the file. If you want to convert a file in place, you will have to stage the data somewhere else first, or Set-Content will just encounter a file that is still open for reading and throw an error that it can’t write to the file because it’s still locked.
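One way to stage the data is simply to hold it in memory. The sketch below assumes the file is small enough for that, and the file name is hypothetical.

```powershell
# Hypothetical file to convert in place.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'InPlace.txt'
[System.IO.File]::WriteAllText($path, "first`nsecond`nthird`n")

# The parentheses make Get-Content read the whole file and release it
# before any line is sent down the pipeline to Set-Content.
(Get-Content $path) | Set-Content $path

# Read the file back to confirm the lines survived the round trip.
$inPlaceLines = @(Get-Content $path)
$inPlaceLines.Count   # 3
```

For very large files, holding everything in memory may not be practical. In that case, writing to a temporary file and then replacing the original with it is probably the safer approach.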