Monday, January 21, 2013

Mainframe Information Representation and Storage

How Mainframe computers store data

On computers, data is stored in the form of bits: 0s and 1s (binary). A character such as A, B, C, ..., Z is represented by a group of 8 bits, called a byte. When you press a key on the keyboard, the keyboard sends a pattern of eight bits down the cable.

Every key is represented by a unique combination of 0s and 1s. Because we use 8 bits to store a character, a total of 2^8 = 256 patterns are possible. The designers of the IBM mainframe assigned a unique 8-bit pattern to each character. This scheme of representing characters and data on mainframe computers is called Extended Binary Coded Decimal Interchange Code (EBCDIC). Every character occupies one byte of computer memory.
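To see these 8-bit patterns for yourself, here is a minimal Python sketch. It assumes EBCDIC code page 037 (the US/Canada variant, available in Python as the "cp037" codec); EBCDIC has several variants, and this post does not say which one a given system uses.

```python
# Assumption: code page 037 is used as a representative EBCDIC variant.
text = "HELLO"
ebcdic = text.encode("cp037")

# Print each character with its EBCDIC byte in hex and as an 8-bit pattern.
for ch, byte in zip(text, ebcdic):
    print(f"{ch} -> 0x{byte:02X} -> {byte:08b}")

# With 8 bits per character, this many distinct patterns are possible.
pattern_count = 2 ** 8
```

Each character of HELLO comes out as exactly one byte, which is why the length of a field in characters is also its length in bytes.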

Fields, records, files and datasets

Financial institutions like banks are computerized. A bank may store data about its customers on a central computer. This would include details such as the full name of the customer, his residential address, his social security number, his contact details and his account number. A data item such as the customer name consists of a sequence, or group, of characters. A group of characters that represents a data item is called a field. For example, the customer name is one field, the customer address is another field and so forth.

Related fields that describe one entity, such as the customer name, the address and the account number, together make up a record. A record represents data for a single instance. A customer record tells you everything about one customer. A record may be divided into several fields.

A record has a length. Assume that the customer name field can be 10 characters long, the customer address can have up to 30 characters, the contact details field can span 10 characters, the social security number takes 9 digits and the account number up to 21 digits. Recall that each character on the mainframe computer occupies one byte of memory space. The length of each customer record is then the sum of the field sizes, 10 + 30 + 10 + 9 + 21 = 80 bytes. Hence, the record of a customer occupies 80 bytes in computer storage.
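The arithmetic above can be sketched in a few lines of Python. The field names and sample values below are made up for illustration; only the field widths come from the example.

```python
# Field widths in bytes, taken from the example above (one byte per character).
# The field names themselves are hypothetical.
CUSTOMER_LAYOUT = [
    ("name", 10),
    ("address", 30),
    ("contact", 10),
    ("ssn", 9),
    ("account", 21),
]

# The record length is simply the sum of the field widths.
RECORD_LENGTH = sum(width for _, width in CUSTOMER_LAYOUT)

def build_record(values):
    """Pad (or truncate) each field to its fixed width and join the fields."""
    return "".join(
        str(values.get(field, "")).ljust(width)[:width]
        for field, width in CUSTOMER_LAYOUT
    )

def split_record(record):
    """Slice a fixed-width record back into its named fields."""
    fields, offset = {}, 0
    for field, width in CUSTOMER_LAYOUT:
        fields[field] = record[offset:offset + width].rstrip()
        offset += width
    return fields

record = build_record({"name": "RAMESH", "ssn": "123456789"})
print(RECORD_LENGTH, len(record))
```

Because every field sits at a fixed offset, a program can pick out any field of a record by position alone, without separators between the fields.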

If a collection of records is stored together, say 1000 customer records of the bank, it is called a file. A file, then, is just a sequence of records. IBM mainframes use the term dataset instead of file.

Generally, datasets (files) are stored on computer storage devices. On the mainframe, there are two storage devices commonly used – disk and tape.

The concept of fields, records and files (datasets)

Mass Storage devices

Mass storage devices are used to store data permanently. They are non-volatile: they do not lose their contents even when the electrical power is cut off. They also have high capacity and can store huge volumes of data.

The data stored on a storage device can be accessed by two methods – Sequential Access and Direct (Random) access.

Sequential Access

Sequential access means that the mainframe system must search the storage device from the beginning until the desired data is found. This is like playing a Bollywood music cassette on your Sony Walkman. The audio cassette of the Hindi film Jab Tak Hai Jaan had five songs: Challa ki, Saans me teri sung by Shreya Ghoshal, Heer heer by Harshdeep Kaur, Jiya Re by Neeti Mohan and the Jab Tak Hai Jaan title track. Now, if you want to listen to the title track, you must start at the first song and go past the second, third and fourth songs until you reach the last one. The most common storage device that allows sequential access is a magnetic tape.

Direct (Random) Access

Direct access implies that the mainframe system can directly locate the desired data on the storage device. This is like reading a topic in a reference book. If I would like to read about optical isomers in the Organic Chemistry textbook by Morrison and Boyd, I'd look up the entry for optical isomers in the index. In the index, the entry has the page number 257 against it. I can jump directly to page 257 of the book and read about optical isomers at length. I don't have to read the book from cover to cover. The most common device that allows direct access is a magnetic disk.
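The difference between the two access methods can be sketched in Python. This is only an illustration: the byte string stands in for a storage device, and the record layout, the keys and the in-memory index are all made up for the example.

```python
RECORD_LENGTH = 80

def make_record(key):
    """Build an 80-byte record whose first 10 bytes hold the key (hypothetical layout)."""
    return key.ljust(RECORD_LENGTH).encode("ascii")

keys = ["CUST0001", "CUST0002", "CUST0003", "CUST0004", "CUST0005"]
storage = b"".join(make_record(k) for k in keys)

def sequential_read(storage, wanted):
    """Tape style: start at the beginning and pass over every record until a match."""
    reads = 0
    for offset in range(0, len(storage), RECORD_LENGTH):
        reads += 1
        record = storage[offset:offset + RECORD_LENGTH]
        if record[:10].decode("ascii").rstrip() == wanted:
            return record, reads
    return None, reads

# Disk style: an index (like the index of a book) maps each key to its position.
index = {key: i * RECORD_LENGTH for i, key in enumerate(keys)}

def direct_read(storage, wanted):
    """Jump straight to the record using the index: a single read."""
    offset = index[wanted]
    return storage[offset:offset + RECORD_LENGTH], 1

seq_record, seq_reads = sequential_read(storage, "CUST0005")
dir_record, dir_reads = direct_read(storage, "CUST0005")
print(seq_reads, dir_reads)
```

Both functions return the same record, but the sequential search has to pass over four other records first, just like fast-forwarding past four songs to reach the fifth.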

Magnetic recording and playback

The physics of electromagnetism says that when a magnetic material is moved past a magnetic field (usually created by an electromagnet), it can be magnetised. Likewise, by Faraday's law of induction, when a coil (wire) moves relative to a magnetic field, an electric voltage (signal) is induced in the coil.

Usually, the surface of a storage medium is coated with a magnetic material like ferric oxide. At a microscopic level, the oxide coating has many tiny poles (magnets) which are randomly oriented. The head is an electromagnetic coil that receives computer data (bit 0 or 1) in the form of electric pulses. The coil creates a magnetic field around it.

While recording data, the surface of the medium is moved past the head (coil). Variations in the electric pulses cause variations in the magnetic field created by the head. The tiny poles (magnets) on the surface of the storage medium are magnetised (aligned), and they orient themselves N-S or S-N depending on whether the bit is 0 or 1. Playback is the reverse: the N-S or S-N orientation of the poles on the storage medium is sensed by the head as bits 0 and 1.

Magnetic Tape

A standard magnetic tape consists of a ½ inch wide plastic ribbon coated with a magnetic substance, ferric oxide. On the mainframe, each character (A-Z, a-z, 0-9) is represented by a unique 8-bit pattern (1 byte). The magnetic tape has nine tracks: eight tracks to store the eight bits of a character, plus one track for the parity bit, which is used for error detection. The figure below illustrates how the string HELLO is stored on a magnetic tape.

How information is stored on a magnetic tape

Data can be stored on and retrieved from a magnetic tape only sequentially, so it is not possible to get quick access to a particular piece of data. However, magnetic tapes are very cheap compared to other storage media, and large amounts of data can be stored on tape. It is the preferred medium for taking backups and archiving old data.
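The nine-track layout described above (eight data bits plus one parity bit per character) can be sketched in Python. Two assumptions are made here that the post does not state: the characters are encoded with EBCDIC code page 037, and the parity is odd, meaning the parity bit is chosen so that the nine bits together contain an odd number of 1s.

```python
def odd_parity_bit(byte):
    """Return the bit that makes the total count of 1s (data + parity) odd."""
    ones = bin(byte).count("1")
    return 0 if ones % 2 == 1 else 1

def tape_frames(text):
    """Yield (character, eight data bits, parity bit) for each character."""
    # Assumption: cp037 stands in for the EBCDIC variant in use.
    for ch, byte in zip(text, text.encode("cp037")):
        yield ch, f"{byte:08b}", odd_parity_bit(byte)

frames = list(tape_frames("HELLO"))
for ch, bits, parity in frames:
    print(ch, bits, parity)
```

When the tape is read back, the drive can recount the 1s in each nine-bit frame. If the count is not odd, a bit was damaged, which is how the parity track detects errors.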

Blocks and Inter-block gap (IBG)

When you visit the mall to buy groceries for the coming month, you usually buy grain, cooking oil, flour and other non-perishable goods in bulk. In a single trip, you'd purchase items in sufficient quantity, so that stocks last at least a month and you don't have to make a second trip. Buying in bulk is economical.

A block is a contiguous group of records on a storage device. Media such as disks or tapes are said to be block-oriented devices, as opposed to record-oriented. On these devices, a group of records is stored together as a block. The figure below illustrates how records are blocked. Consecutive blocks have a gap between them called the Inter-block gap (IBG).

On the mainframe system, a block is the basic unit of data transfer. During a single READ or WRITE operation, 1 block is transferred from the storage device.

Blocks are described by their block size. It is up to the programmer to determine the block size (length). Say I choose a block size of 800 bytes while storing customer records. Each customer record is 80 bytes long. Then, the number of records in each block is

Blocking factor = Block size / Record length = 800 / 80 = 10

The blocking factor is 10 records per block. As a result, you end up transferring 10 customer records in a single READ or WRITE. This indeed helps. Say you were generating monthly account statements for all the customers. Having processed one customer record, the likelihood of processing the neighbouring records (customer 2, customer 3, … and so forth) is very high. It is relatively cheap to get a block of 10 records in a single READ, rather than executing ten READs and getting only one record at a time.
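Here is a minimal Python sketch of the blocking arithmetic above. The function names are made up for illustration, and the count of 1000 records is just the sample bank file mentioned earlier.

```python
import math

def blocking_factor(block_size, record_length):
    """Number of whole records that fit in one block."""
    return block_size // record_length

def reads_needed(record_count, block_size, record_length):
    """READ operations required to fetch every record, one block per READ."""
    return math.ceil(record_count / blocking_factor(block_size, record_length))

# 1000 customer records of 80 bytes each.
unblocked = reads_needed(1000, 80, 80)    # one record per block
blocked = reads_needed(1000, 800, 80)     # ten records per block

print(blocking_factor(800, 80), unblocked, blocked)
```

With a blocking factor of 10, the same file is read with one tenth of the I/O operations.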

There is a trade-off in choosing the block size. A small block size degrades performance, because more READ or WRITE operations are needed to move the same amount of data. A very large block size reduces the number of operations and boosts performance, but each READ or WRITE then takes longer to transfer its data. Choosing the optimum block size for your datasets is therefore important. Many mainframe shops install a software product that determines the best block size for your file.

Advantages of blocking

1. Fewer I/O operations are needed because a single READ moves an entire block containing several records.

Disadvantages of blocking

1. Tiny software programs called access method routines block and de-block the data. This is an overhead.

Magnetic disk

A magnetic disk resembles a phonograph vinyl record. Here, the tracks are laid out in a circular shape. A single disk, known as a platter, has several concentric tracks. Data is stored on both sides of the platter.

In the IBM 3390 DASD drives used in today's mainframe systems, eight such disks are stacked to form a disk pack or volume. When the drive is in operation, these platters revolve at a very high speed around an axis called the spindle (not shown in the diagram).

Magnetic Disk unit organization

An arm (actuator) has READ/WRITE heads. There is one READ/WRITE head for each surface of a platter. But, how is a particular track located? To seek the desired track, the arm (actuator) moves the READ/WRITE heads from the outermost track towards the center of the disk. Note that, all the READ/WRITE heads move as one unit. So, if the arm moves the head to Track 150, all the READ/WRITE heads are positioned at Track 150 on their respective surfaces. The same track on each of the surfaces can be imagined to form a virtual cylinder.

Eight disks have 16 surfaces. One of the surfaces is used to record control information. As there are 15 surfaces on which data is recorded, 1 cylinder is equal to 15 tracks of storage space.

The READ/WRITE heads travel as a single unit, so an entire cylinder can be transferred without any arm movement. As a consequence, storage space on the DASD drive is not filled up surface by surface. Instead, it is filled up cylinder by cylinder.
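The cylinder arithmetic above can be sketched in Python. One figure below is not from this post and is labelled as an assumption: 56,664 bytes is the commonly quoted capacity of a single 3390 track. The locate function is a hypothetical helper that shows what "filling up cylinder by cylinder" means.

```python
PLATTERS = 8
SURFACES = PLATTERS * 2            # data can be recorded on both sides
DATA_SURFACES = SURFACES - 1       # one surface holds control information
TRACKS_PER_CYLINDER = DATA_SURFACES

# Assumption: commonly quoted capacity of one 3390 track, in bytes.
BYTES_PER_TRACK = 56_664
BYTES_PER_CYLINDER = TRACKS_PER_CYLINDER * BYTES_PER_TRACK

def locate(relative_track):
    """Map a running track number to (cylinder, head), filling cylinder by cylinder."""
    return divmod(relative_track, TRACKS_PER_CYLINDER)

print(TRACKS_PER_CYLINDER, BYTES_PER_CYLINDER)
print(locate(14), locate(15))
```

Tracks 0 to 14 all lie in cylinder 0, so they can be read one after another just by switching heads. Only at track 15 does the arm have to move to the next cylinder.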

Forming dataset names

On a mainframe computer, there would be thousands of files (datasets). There must be a unique name for every dataset. Mainframes support very long dataset names. On mainframes, a dataset name can be up to 44 characters long, in this format:

XXXXXXXX.XXXXXXXX.XXXXXXXX.XXXXXXXX.XXXXXXXX

A dataset name is made up of several segments, or qualifiers, separated by periods. Each qualifier can be up to eight characters in length. A qualifier must start with an uppercase letter (or one of the national characters #, @ and $).

Generally, it is a good practice to give meaningful names to your dataset. For example, if you are storing Employees data in a file, you can name it as EMPLOYEE.DATA.

The segments of a dataset name are called qualifiers, and the operating system uses them to keep track of groups of datasets. The first part of the dataset name is called the High-level Qualifier (HLQ).

Generally, when you access or log-in to Mainframes, you are given a TSO USER-ID, just as on Windows PC, you need a user-id to login. Most professionals or software engineers who work on a mainframe have a TSO user-id and password to access the mainframe computer.

When you use a TSO-id, a special requirement applies to most datasets (files) that belong to you. Suppose your TSO-id is AGY0157. Then all your datasets should have the High-level qualifier AGY0157. Thus, the name of the files that belong to you should start with AGY0157. For example, the name of Employees file would be AGY0157.EMPLOYEE.DATA. In fact, you can identify the files that belong to you, your application or your system by looking at the High-level Qualifier of the file.
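The naming rules and the High-level Qualifier convention described above can be sketched in Python. The pattern below is a simplified reading of the rules, stated as an assumption: each qualifier is 1 to 8 characters, starts with an uppercase letter or a national character (#, @, $), and continues with uppercase letters, digits, national characters or hyphens. The function names are made up for illustration.

```python
import re

# Assumption: simplified qualifier rule (first character A-Z, #, @ or $;
# remaining characters A-Z, 0-9, #, @, $ or hyphen; at most 8 in total).
QUALIFIER = re.compile(r"[A-Z#@$][A-Z0-9#@$-]{0,7}")

def is_valid_dsname(name):
    """Check the overall 44-character limit and every qualifier."""
    if not 1 <= len(name) <= 44:
        return False
    return all(QUALIFIER.fullmatch(q) for q in name.split("."))

def high_level_qualifier(name):
    """The first qualifier of a dataset name."""
    return name.split(".")[0]

print(is_valid_dsname("AGY0157.EMPLOYEE.DATA"))
print(high_level_qualifier("AGY0157.EMPLOYEE.DATA"))
```

Five qualifiers of eight characters each, joined by four periods, give exactly the 44-character maximum shown in the format earlier.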

Security software products like RACF, CA TOP-SECRET are generally installed at a mainframe shop. These products can be used to control access to files. For example, you may grant read-only access to the file AGY0157.EMPLOYEE.DATA to others. Thus, RACF would then prevent others from making changes to your dataset.

Sequential Datasets

Sequential datasets can be likened to a music cassette, on which songs are stored one after the other. When you play the cassette, you listen to music; one song and then the next song and so on … till you reach the end of the cassette. You cannot directly jump-to, or fly to the fifth song, or last song. Thus, a music cassette tape is accessed sequentially.

On the same lines, records in sequential files are stored one after the other in a series. In a sequential file, you need to read through all the records one by one, step-by-step till you reach the desired record.

Thus, a sequential dataset is the simplest form of dataset. On the mainframe z/OS operating system, a sequential dataset is known as a PS dataset, where PS stands for Physical Sequential.

Creating a new dataset

ISPF organises most common functionalities that you need to perform on Mainframes – like creating a new file/folder, editing a file, searching for files, taking backup etc. in the form of menus, just like a Nokia Cellular Phone.

To create a new file, you have to go to the menu 3.2 in ISPF. You enter the dataset name here on ISPF Screen, select the option ===> A (Allocate) and press Enter.

On a mainframe system, every disk volume or pack is identified by a unique six-character code. This is called the volume serial. For example, the mainframe system I am connected to has several 3390 DASD volumes: S4RES1, S4RES2, OS39M1, S4DB21 and so forth.

System administrators assign a generic name to a group of storage devices. For example, all the DASD volumes together may be collectively assigned the name SYSDA. All the tape cartridges could be collectively assigned the name TAPE. The logical name for a group of devices is referred to as the UNIT.

Primary and Secondary Space

The z/OS operating system is a miser. It doesn't give away a single byte of storage for free. When you request a new mainframe file, you must declare how many bytes, tracks or cylinders of storage space you need for your file.

These space projections (how much space would the dataset need) of the new dataset are in terms of primary space and secondary space. Let’s say, I request 10 tracks of primary space and another 10 tracks of secondary space for my new dataset SYSADM.EMPLOYEE.DATA. Take a look at figure below.

To honour your request, the zOS operating system sets aside ten tracks of primary space for your dataset on the storage device. Once this request is satisfied, you’d get a Data set allocated message on your screen.

As time passes, you store more and more data in your file. Ultimately, the dataset becomes full and has no empty space left. Now, the secondary space comes into the picture. z/OS allocates ten tracks of secondary space to the dataset. The dataset is then 10 + 10 = 20 tracks large. The dataset is said to extend itself. Over a period of time, if the dataset again becomes full, z/OS allocates yet another 10 tracks of secondary space. So, it becomes 20 + 10 = 30 tracks big. In this way, z/OS allows up to 15 extents of secondary space.

Thus, when you create a new dataset, z/OS tries to allocate the primary space. If the dataset fills up, z/OS tries to allocate additional space in increments of the secondary space. A dataset can extend up to 15 times, each time by the secondary space quantity.
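The growth of a dataset through its extents can be sketched in Python. This is a simplified model that takes the limit of 15 secondary extents from the text above and assumes the primary space is satisfied in a single piece. The function name is made up for illustration.

```python
MAX_SECONDARY_EXTENTS = 15  # limit quoted in the text above

def allocated_tracks(primary, secondary, extensions):
    """Total tracks after the dataset has extended itself `extensions` times."""
    if not 0 <= extensions <= MAX_SECONDARY_EXTENTS:
        raise ValueError("a dataset can extend between 0 and 15 times")
    return primary + secondary * extensions

# 10 tracks primary, 10 tracks secondary, as in the example.
sizes = [allocated_tracks(10, 10, n) for n in range(3)]
max_size = allocated_tracks(10, 10, MAX_SECONDARY_EXTENTS)

print(sizes, max_size)
```

With 10 tracks of primary and 10 tracks of secondary space, the dataset in this model can never grow beyond 160 tracks, so it pays to estimate both quantities sensibly.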

A fixed length file is one in which each record occupies the same fixed number of bytes. For example, in the INPUT DATA file below, each record is 50 bytes long. Every record has the same fixed length. The SYSADM.INPUT.DATA file is then said to be fixed length. Its record format is FB, which stands for Fixed Blocked.

A variable length file is one in which the records vary in length. As an example, a customer of a bank might hold two or more accounts. In such cases, the customer record is 100 bytes large, whereas if he has just 1 primary account, the record would be 80 bytes large. Hence, not all of the records in the customer file are of the same length. The customer file would then be a variable length file. Its record format would be VB, which stands for Variable Blocked.
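The practical difference between the two record formats can be sketched in Python. With fixed length records, a program finds record boundaries by arithmetic alone. With variable length records, each record must carry its own length. The 2-byte length prefix used below is a simplification chosen for illustration, and is not the actual on-disk layout that z/OS uses for VB datasets.

```python
import struct

def split_fixed(data, record_length):
    """Fixed length: every record starts at a multiple of record_length."""
    return [data[i:i + record_length] for i in range(0, len(data), record_length)]

def pack_variable(records):
    """Variable length: prefix each record with a 2-byte big-endian length.

    Assumption: this prefix is a simplified stand-in, not the real z/OS format.
    """
    return b"".join(struct.pack(">H", len(r)) + r for r in records)

def split_variable(data):
    """Walk the data, reading each length prefix to find the next record."""
    records, offset = [], 0
    while offset < len(data):
        (length,) = struct.unpack_from(">H", data, offset)
        offset += 2
        records.append(data[offset:offset + length])
        offset += length
    return records

fixed = split_fixed(b"A" * 50 + b"B" * 50, 50)
variable = split_variable(pack_variable([b"X" * 80, b"Y" * 100]))

print([len(r) for r in fixed], [len(r) for r in variable])
```

The variable length version spends a few extra bytes per record on the length, but it avoids padding every customer record out to the 100 bytes needed by the largest one.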

Take a look at the figure below. I have entered Record Format as FB, Record length = 80 bytes and block size = 800 bytes for the SYSADM.EMPLOYEE.DATA file.

When you create a sequential dataset (PS), it occupies a large amount of storage space (at least one track, which is roughly 50,000 bytes on a 3390 disk). Sometimes, for trivial tasks that store only a few records, you would not use the entire space available in a sequential file. Thus, most of the space in the sequential file is wasted, and nobody else can use it.

IBM provides a way to split-up or partition the space in a sequential data-set. IBM Software engineers invented the Partitioned Dataset, often called PDS. Each part is called a member. A Partitioned Dataset can have many members (parts). Each member behaves like a sequential file on its own.

How does the PDS then keep track of its members? A Partitioned Dataset (PDS) includes a directory (like a telephone directory), a kind of journal in which the z/OS operating system keeps track of the members of the PDS. The members of a PDS can be scattered haphazardly across the storage space. So, you consult the directory to find the members of the PDS. Just as the index of a book tells you the keyword you are looking for and the page number where it can be found, the directory tells you the member name and the storage address where you can find it. Thus, the directory maintains the list of member names and pointers to the actual physical place in storage where they are kept.

Not just this, but the directory also keeps statistics about each member: when the member was created, when it was last modified, how much space it occupies and the TSO user-id that created it, amongst other things.
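The role of the directory can be sketched with a toy model in Python. This is only an illustration of the idea of a directory of names and pointers. The class, its methods and the member names are all made up, and the real PDS directory format is not modelled here.

```python
class PartitionedDataset:
    """Toy model: one data area plus a directory of member name -> (offset, length)."""

    def __init__(self):
        self.data = bytearray()
        self.directory = {}

    def add_member(self, name, content):
        """Append the member's data and record where it starts in the directory."""
        self.directory[name] = (len(self.data), len(content))
        self.data.extend(content)

    def read_member(self, name):
        """Consult the directory, then jump straight to the member's data."""
        offset, length = self.directory[name]
        return bytes(self.data[offset:offset + length])

    def member_list(self):
        """Member names in alphabetical order."""
        return sorted(self.directory)

pds = PartitionedDataset()
pds.add_member("PAYROLL", b"payroll source")
pds.add_member("BILLING", b"billing source")

print(pds.member_list())
print(pds.read_member("BILLING"))
```

Each member is read like a small sequential file, but finding it takes only a directory lookup, much like finding a page number in the index of a book.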

On the Windows Operating System, we generally keep related files together in a folder. On the same lines, you keep related members together in one PDS. What a folder is to Windows OS, a Partitioned dataset(PDS) is to z/OS. Folders contain many related files. A PDS contains many related members.

When creating this special type of file, a Partitioned Dataset, you need to determine how big its directory is going to be. The bigger the directory, the more members (partitions) it will support. You express this in terms of directory blocks. Generally, 1 directory block can store information about roughly 5 members. The directory size always remains fixed; the directory can't grow bigger later. So you must specify the number of directory blocks carefully, and you have to do it in advance.
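The sizing rule above can be turned into a quick estimate in Python. The figure of 5 members per directory block is the rule of thumb from the text, not an exact value, and the function names are made up for illustration.

```python
import math

MEMBERS_PER_DIRECTORY_BLOCK = 5  # rule of thumb quoted in the text above

def directory_blocks_needed(expected_members):
    """Smallest number of directory blocks that can describe this many members."""
    return math.ceil(expected_members / MEMBERS_PER_DIRECTORY_BLOCK)

def max_members(directory_blocks):
    """Approximate number of members a directory of this size can hold."""
    return directory_blocks * MEMBERS_PER_DIRECTORY_BLOCK

print(directory_blocks_needed(48), max_members(10))
```

Since the directory cannot grow later, it is safer to size it for the number of members you expect the PDS to hold eventually, not just the number you have today.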

To sum up, a Partitioned Dataset (PDS) is a specialised type of sequential file, which is divided into members, each member being a sequential dataset by itself. Thus, a PDS acts like a library which houses several related files.

Specifying whether file is sequential or partitioned

A sequential PS file doesn't have any partitions or members (0 members). So, it doesn't have a directory to look things up in. Thus, the space allocated to a directory, in terms of Directory Blocks, for a sequential PS file is zero (00).

A Partitioned Dataset (PDS), on the other hand, can contain 1 or more members. So, it needs a directory to record the whereabouts (location) of its members, which are scattered throughout its storage space. You need to allocate space, in terms of Directory Blocks, to the directory of a Partitioned Dataset (PDS). Thus, Directory Blocks for a PDS should be some non-zero number, say 5 blocks or 10 blocks.

Suppose you are creating a new File SYSADM.EMPLOYEE.DATA. You specify the Directory Blocks on the ISPF Menu 3.2 screen, as follows -

1. If you want the file SYSADM.EMPLOYEE.DATA to be a Physical Sequential file (PS), then set the Directory Blocks field to 0.

2. On the other hand, if you want the file SYSADM.EMPLOYEE.DATA to be a Partitioned Dataset (PDS), having 1 or many members and a directory, set the Directory Blocks field to 1, 2, 3 or any other non-zero number of blocks. In the figure below, I have put Directory Blocks = 10 blocks.

Sooner or later, you will want to find the files you've stored data in. Finding a file on the z/OS operating system is done using ISPF Menu 3.4, called the Dataset List Utility. Go to the ISPF Menu 3.4 screen, type the name of the dataset that you want to search for and press the Enter key. For example, if you want to find the file SYSADM.EMPLOYEE.DATA, then type SYSADM.EMPLOYEE.DATA in the Dsname Level field and press Enter. Take a look at the figure below.

Enter one or both of the parameters below:
   Dsname Level . . . SYSADM.EMPLOYEE.DATA
   Volume serial  . .

Data set list options
   Initial View . . . 1   1. Volume    Enter "/" to select option
                          2. Space     /  Confirm Data Set Delete
                          3. Attrib    /  Confirm Member Delete
                          4. Total     /  Include Additional Qualifiers

When the data set list is displayed, enter either:
   "/" on the data set list command field for the command prompt pop-up,
   an ISPF line command, the name of a TSO command, CLIST, or REXX exec, or
   "=" to execute the previous command.

On pressing Enter, you will find the list of matching files displayed before you. Note that the Dataset List Utility (DSLIST) works by searching for all files that match a pattern. So, when you type SYSADM.EMPLOYEE.DATA, what you actually specify is a pattern.

All files matching this pattern will be displayed. So, if there is a file – SYSADM.EMPLOYEE.DATA.MASTER, it also matches the pattern. Thus, it will also be listed.

Enter one or both of the parameters below:
   Dsname Level . . . SYSADM.DEMO.*
   Volume serial  . .

Data set list options
   Initial View . . . 1   1. Volume    Enter "/" to select option
                          2. Space     /  Confirm Data Set Delete
                          3. Attrib    /  Confirm Member Delete
                          4. Total     /  Include Additional Qualifiers

When the data set list is displayed, enter either:
   "/" on the data set list command field for the command prompt pop-up,
   an ISPF line command, the name of a TSO command, CLIST, or REXX exec, or
   "=" to execute the previous command.

All datasets like SYSADM.DEMO.SRCLIB, SYSADM.DEMO.JCLLIB that begin with the name SYSADM.DEMO are listed.
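The pattern matching behaviour described above can be sketched in Python. This is a simplified model and not the actual DSLIST algorithm: it compares names qualifier by qualifier, treats a lone * as matching any one qualifier, and lets a dataset have extra qualifiers after the pattern ends. Partial wildcards inside a qualifier are not handled. The catalog list and the function names are made up for illustration.

```python
def matches(pattern, dsname):
    """Simplified model of a DSLIST-style match.

    Assumptions: '*' matches exactly one whole qualifier, and a dataset
    may have additional trailing qualifiers beyond the pattern.
    """
    p_parts = pattern.split(".")
    d_parts = dsname.split(".")
    if len(d_parts) < len(p_parts):
        return False
    return all(p == "*" or p == d for p, d in zip(p_parts, d_parts))

# Hypothetical list of datasets on the system.
catalog = [
    "SYSADM.EMPLOYEE.DATA",
    "SYSADM.EMPLOYEE.DATA.MASTER",
    "SYSADM.DEMO.SRCLIB",
    "SYSADM.DEMO.JCLLIB",
    "SYSADM.INPUT.DATA",
]

def dslist(pattern):
    """Return every dataset in the catalog that matches the pattern."""
    return [name for name in catalog if matches(pattern, name)]

print(dslist("SYSADM.EMPLOYEE.DATA"))
print(dslist("SYSADM.DEMO.*"))
```

In this model, typing the full name SYSADM.EMPLOYEE.DATA still lists SYSADM.EMPLOYEE.DATA.MASTER, because the longer name begins with the same three qualifiers.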
