Writing Portable Shell Scripts on z/OS UNIX

z/OS UNIX

z/OS UNIX is a facet of z/OS that makes the operating system more approachable to those of us who weren’t brought up around 3270 terminals. Amongst other things, it provides support for remoting into the z/OS mainframe under a choice of UNIX shells (a simple Bourne-style shell or tcsh), access to a hierarchical file system, and a handful of Unix tools like ls, vi and grep (which work fine, as long as you don’t expect them to behave quite the same way as on other Unix systems). Like the rest of the operating system, z/OS UNIX is an EBCDIC environment.

EBCDIC Variants

My previous post brought up the topic of variant characters in EBCDIC. In summary, the characters shown below can be represented by different byte values depending on the locale under which a user is running:


\ ^ ~ ! [ ] { } # | ` $ @

This raises the following question: how can you distribute a z/OS UNIX shell script that will “just work” for any z/OS user, given that entities like the comment character (#), hashbang (#!), pipes (|) and the variable marker ($) might need to be represented by different bytes in different locales?

Changing locale in z/OS UNIX

To experiment with different EBCDIC variants in z/OS UNIX, you will need to change locale. A word of warning: always start a new shell after changing locale on z/OS. If you don’t, behaviour is unspecified. I’ve noticed that tcsh in particular behaves very oddly after setting the locale but before starting a new shell (e.g. for some operations the new locale is used, for others the old locale is used).
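One way to follow that advice is to never change the locale of the running shell at all, and instead start a child shell with the locale already set. The sketch below is only an illustration: run_in_locale is a hypothetical helper, and the locale name in the usage comment is an assumption (check `locale -a` for the names installed on your system).

```shell
# Hypothetical helper: run a command string in a brand-new shell whose
# locale is set from the start, leaving the calling shell untouched.
#   $1 = locale name, $2 = command string
run_in_locale() {
  LC_ALL="$1" sh -c "$2"
}

# Example (the locale name is an assumption, see `locale -a`):
#   run_in_locale Fr_FR.IBM-297 'locale charmap'
```

Because the new locale only ever exists in the child process, there is no window in which a single shell is running with a mixture of old and new settings.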

How the Shell Transcodes Scripts

The core of the z/OS shell expects script data to be encoded in IBM-1047. If the active charset is not IBM-1047, then the script is internally transcoded to IBM-1047 before it is executed. The transcode operation is interesting. It consists of modifying a very specific set of bytes in the original script: only those byte values assigned to variant characters will be changed to their IBM-1047 equivalents. You can display the variant byte values for the current charset with the command locale -ck LC_SYNTAX. For example, in an IBM-297 locale:

So, with IBM-297 (French variant), “number_sign” (#) is a variant character, represented with byte value 0xb1. Therefore, before a script is executed in an IBM-297 environment, all occurrences of byte 0xb1 will be transcoded to the IBM-1047 representation of “number_sign”, which is 0x7b.

Interestingly, 0x7b is not listed as a value for a variant char in IBM-297. This means that occurrences of byte 0x7b in a script are left unchanged. If a script already contains the IBM-1047 byte value for #, it is left as-is and treated as a comment character by the shell.
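The effect on a single character can be mimicked with tr. This is an illustration of the byte mapping only, not how the shell implements it. It assumes the IBM-297 number_sign byte is 0xb1 and the IBM-1047 one is 0x7b, and map_number_sign and hex are hypothetical helper names. Since it works on raw bytes, it can be tried on any system.

```shell
# Mimic the IBM-297 -> IBM-1047 rewrite for number_sign only:
# byte 0xb1 (octal 261) becomes 0x7b (octal 173); 0x7b passes through.
map_number_sign() {
  LC_ALL=C tr '\261' '\173'
}

# Print a byte stream as hex digits without spaces or newlines.
hex() {
  od -An -tx1 | tr -d ' \n'
}

# One IBM-297 number_sign followed by one IBM-1047 number_sign:
printf '\261\173' | map_number_sign | hex    # prints 7b7b
```

Both input bytes end up as 0x7b, which is why a # that is already encoded in IBM-1047 survives in an IBM-297 environment.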

It turns out this holds for # under pretty much every EBCDIC flavour and also applies to $ (0x5b). The implication is that a rudimentary shell script (with comments and variables – but with no other variant chars like pipe, aka vertical_line, or backtick, aka grave_accent), can be written in IBM-1047, and will execute correctly in any EBCDIC environment. The hashbang (#!) is handled by the kernel exec/spawn processing, which expects IBM-1047 – the shell simply sees the #! line as a comment. Therefore, the hashbang can safely be left as IBM-1047 too.

Transcoding on the Fly

We can use this characteristic of EBCDIC variants to create a simple “top level” script in IBM-1047 that dynamically transcodes more elaborate scripts from a known charset to the current charset, and then executes them.

For example, here’s a script that would not run in all EBCDIC environments when encoded in IBM-1047, because it uses back-slashes, back-ticks and pipes:

Let’s save that script in UTF-8 as foo.sh.utf8 (we could pick any encoding – I’m using UTF-8 here to distinguish it from any of the EBCDIC flavours we might want to transcode it to).

Next, let’s create the script that the user actually invokes. It transcodes foo.sh.utf8 to the current character set (obtained with the command locale charmap) if this has not already been done, then executes it.


#!/bin/sh

# Obtain absolute script path so this works
# when invoking script from another directory.
# Avoid using backticks.
PRG="$0"
while test -h "$PRG"; do
  PRG=$(readlink "$PRG")
done
SCRIPT_DIR=$(dirname "$PRG")
SCRIPT_DIR=$(cd "$SCRIPT_DIR" && pwd)
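The listing above only covers locating the script directory. What follows is a minimal sketch of the transcode-and-run step described earlier, not the original code. It assumes iconv is available and recognises both charset names. The helper name transcode_if_needed and the foo.sh.<charset> naming of the output file are hypothetical. To stay free of variant characters other than # and $, the function body is a subshell in parentheses rather than a brace group, and it uses test and if/else instead of square brackets and the exclamation mark.

```shell
# Hypothetical helper: make sure a copy of SRC (stored as UTF-8) exists
# in CHARSET, then print the path of that copy.
#   $1 = path ending in .utf8, $2 = target charset name
transcode_if_needed() (
  src="$1"
  charset="$2"
  dir=$(dirname "$src")
  base=$(basename "$src" .utf8)
  out="$dir/$base.$charset"
  if test -f "$out"; then
    # Already transcoded for this charset.
    echo "$out"
  elif iconv -f UTF-8 -t "$charset" "$src" > "$out"; then
    echo "$out"
  else
    # Do not leave a partial file behind if the conversion failed.
    rm -f "$out"
    exit 1
  fi
)

# Possible use once SCRIPT_DIR has been computed:
#   CHARSET=$(locale charmap)
#   TARGET=$(transcode_if_needed "$SCRIPT_DIR/foo.sh.utf8" "$CHARSET") && sh "$TARGET"
```

Keeping one transcoded copy per charset means users in different locales can share the same directory without overwriting each other's copies. The sketch only checks whether the copy exists, so a changed foo.sh.utf8 is not picked up until the stale copies are deleted.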

Should the foo.sh references in the 2nd half be $PRG references? I guess for added bonus marks you could check whether the transcoded version was older than the original and transcode if the user had updated the script!

I had no idea this was a problem – I think so far I’ve only ever sent scripts between Linux and MVS USS using ASCII to EBCDIC. One to store away for the future, though!

Is there any way to dynamically evaluate the LC_SYNTAX value of a character for the current code page and avoid converting the script? Or is there a way to convert JUST one line in the script that has variant characters in it?