Frankenstein's Fix: the Mysterious Data Packet

Who knew that using a laptop to upload data to a mainframe was so ahead of its time?

In the 1980s I worked for a computer company that manufactured multi-user time-shared computers. In those days, a system consisted of several 10"x12" circuit boards filled with hundreds of TTL and CMOS chips.

I helped design and build a laptop-based system to keep track of circuit boards and parts in the stockroom. (There were thousands of different boards and parts in inventory.) The system used a laptop computer with a bar code scanner and software we developed for the laptop. An uploader program was also developed for a mainframe computer that would receive the data from the laptop for processing. To take inventory, everything in the stockroom would be scanned with the bar code scanner plugged into the laptop. The data would then be sent from the laptop to the mainframe using the uploader program.

The development process went well. We did several tests bar-coding data into the laptop and uploading it to the mainframe. We also did several dry runs where we bar-coded and uploaded most of the inventory -- a task that took several hours for each dry run.

Finally, the big day came to take inventory. Things started out well. We scanned the entire inventory into the laptop, connected to the mainframe, and started the uploader program. After about 45 minutes, the uploader program froze up. At first I thought this was a glitch, so I restarted the uploader program. Again, the program froze up after about 45 minutes. We had uploaded large amounts of data to the mainframe many times during testing with no problem, so why would the program freeze up now and why after about 45 minutes? At this point, several megabytes and several million packets had been sent to the mainframe.

After more investigation, I found that the uploader stalled on the exact same data packet each time. What was so special about that packet? Was there something in its data that caused the mainframe to stall? Further investigation showed that the data did not matter -- the system always froze at the same position in the stream no matter what the data was. So what was special about that position?

After much more investigation, the problem turned out to be the I/O processor (IOP) in the mainframe. As data packets arrive, the IOP stores each one in its own buffer until the mainframe can find a buffer in main memory and empty the IOP buffer. Each time the upload began, mainframe memory was empty (no other processes were running during the upload), so the operating system (OS) could find a memory buffer and empty the IOP buffer before the next packet arrived from the laptop. After about 45 minutes, all the free memory in the mainframe was gone. The OS, a virtual-memory system, then had to flush a buffer to disk to make room for the new data from the IOP -- and that takes more time.

It turns out that the IOP buffer was the same size as the data packets, so before the IOP could accept a new packet from the laptop, the previous packet had to be moved out of IOP memory into system memory. (Remember, it was the 1980s, when memory was more expensive and less plentiful.) Because of a bug in the IOP firmware, the IOP did not send a "not ready for data" signal to the laptop, so the next packet from the laptop overwrote the packet already in IOP memory. This caused the IOP to lock up (another bug), halting the entire upload. Each time the upload was restarted, the OS would empty memory, giving the uploader the same amount of free memory each time -- which is why the process froze on the same packet every time.
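The failure mode is easy to reproduce in miniature. Here is a Python sketch -- all timing numbers are invented for illustration, not the actual system's -- of a single-packet IOP buffer with no flow control. The drain slows once "free memory" runs out, and the overrun always lands on the same packet index, just as the upload always froze at the same point:

```python
# Toy simulation of the IOP overrun. The laptop (producer) sends a packet
# every SEND_INTERVAL time units; the mainframe drains the single-packet
# IOP buffer in DRAIN_FAST units while free memory lasts, and DRAIN_SLOW
# units once the OS must page buffers to disk. All numbers are made up.

SEND_INTERVAL = 10       # time between packets from the laptop
DRAIN_FAST = 8           # drain time while free memory is available
DRAIN_SLOW = 15          # drain time once the OS must flush to disk
FREE_MEMORY_PACKETS = 5  # packets that fit before free memory runs out

def first_overrun_packet():
    """Return the index of the packet that overwrites the IOP buffer."""
    drained_at = 0  # time at which the IOP buffer last became empty
    for n in range(1, 1000):
        arrival = n * SEND_INTERVAL
        if arrival < drained_at:
            return n  # previous packet still in the buffer: overrun, lockup
        drain = DRAIN_FAST if n <= FREE_MEMORY_PACKETS else DRAIN_SLOW
        drained_at = arrival + drain
    return None  # sender paced slowly enough: no overrun

print(first_overrun_packet())
```

Because the model is fully deterministic -- nothing depends on the packet contents -- it fails on the same packet number every run, which is exactly the symptom described above.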

Talking to the engineers who designed the system and the IOP, they mentioned that they had not expected large amounts of data to be sent to the system at high speed over a sustained period. These were terminal-based systems, designed for people entering data from a keyboard, not for uploads from a laptop. The engineers simply had not expected anyone to do this. It's interesting how the intended use of a system can drive design decisions that cause it to fail when it's used in unexpected ways -- or can keep you from thinking about certain failure modes at all.

The company decided that fixing the IOP hardware and upgrading all the fielded systems would be too expensive, since customers didn't normally upload data from a laptop. To fix the problem, we modified the laptop software to insert a small time delay between data packets, giving the mainframe enough time to empty the IOP buffer before the next packet arrived. It was not the best solution, but it worked. This is an example of a software person writing software to get around a hardware problem -- which happens more often than you might think. Anyone remember the old serial-port fix for the 8086 processors in the early PCs? You had to put small time delays between the IN and OUT instructions in the comm-port code to get it to work with some UARTs.
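The workaround amounts to open-loop flow control: since the receiver can't say "not ready," the sender paces itself. A minimal Python sketch -- the original ran on the laptop in whatever language we used then, and the delay value here is an assumed placeholder, not the actual figure:

```python
import time

# Hypothetical sketch of the pacing workaround. PACKET_DELAY is an assumed
# tuning value chosen so the mainframe can always empty the IOP buffer
# before the next packet arrives; the real delay was found by experiment.

PACKET_DELAY = 0.05  # seconds between packets (illustrative)

def send_paced(packets, send_one, delay=PACKET_DELAY):
    """Send packets with a fixed inter-packet delay (open-loop flow control)."""
    for pkt in packets:
        send_one(pkt)       # transmit one packet to the mainframe
        time.sleep(delay)   # give the IOP time to drain before the next one
```

The weakness of this approach is that the delay must be sized for the receiver's worst case (paging to disk), so every packet pays that cost -- the price of fixing hardware in software.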

About author Frank Rose: "I have over 30 years of experience as a software developer (both application software and embedded software) and an analog/digital circuit designer. Today, I work as a digital designer developing FPGA programs using VHDL and designing analog/digital circuits for L-3 Power Paragon in Anaheim, CA."

I've seen that problem several times over the years, and it's always been a timing problem related to interactions of hardware design and software design.

One occurred when one side didn't have enough time to process received data before the other side requested to send more. Things ended in a loop where one side sent an ENQ (Enquiry, a request to send data) and the other side only had time to send a WACK (Wait ACKnowledge) before the first side repeated the ENQ, and so on ad infinitum. I don't remember how they fixed that one.

A second occurred when a new minicomputer-based (probably all TTL) RJE (Remote Job Entry) terminal kept dropping data. The vendor had their top programmer on site trying to fix it. You could recognize him by the flannel shirt, blue jeans, and lost look on his face as he travelled between the RJE and a card punch and back, for what seemed like weeks on end. We were finally asked to come in with a datascope, and saw that as soon as the RJE sent the acknowledgment for the last block of data, the mainframe shoved another ENQ down its throat -- but the RJE didn't see it because it was still processing the last block it received. It turns out the mainframe programmers had set the mainframe FEP (Front-End Processor) for full duplex, figuring it was faster and more efficient, and both modems were set for constant carrier. The RJE, however, needed to be set for half duplex because it needed breathing room. It was fixed when we convinced the system programmers to set the mainframe for half duplex and we reset the modems for switched carrier with a 250 ms turn-on time. With a quarter-second delay the RJE was happy, and so was its programmer!
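The same pattern again: the fix works because the carrier turn-on delay exceeds the receiver's busy time. A toy Python model, with invented timings, of why the 250 ms turnaround mattered on an unbuffered line:

```python
# Toy model of the RJE turnaround problem. After ACKing a block, the RJE
# needs PROCESS_TIME before it can listen again; an ENQ that arrives
# sooner is simply lost on the line. PROCESS_TIME is an invented figure.

PROCESS_TIME = 0.20  # seconds the RJE spends digesting the last block

def enq_seen(turnaround_delay):
    """True if an ENQ sent turnaround_delay seconds after the ACK is heard."""
    return turnaround_delay >= PROCESS_TIME

print(enq_seen(0.0))   # full duplex: ENQ right after the ACK -> lost
print(enq_seen(0.25))  # half duplex + 250 ms carrier turn-on -> heard
```

Half duplex didn't make the link slower in any way that mattered here; it made the timing honest.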

A third one was also a timing issue, only this time a microprocessor-based barcode reader was stuffing things down the mainframe's throat too fast. The group in charge of furniture inventory bought this neat little barcode reader to scan barcodes on furniture and send the data to the mainframe all in one fell swoop. It worked great at the demo at another agency, which used a larger IBM mainframe than we had. On our system the mainframe kept dropping data.

Part of the problem was that the barcode reader blurted out all 64K bytes of data without stopping, and the vendor had set the end of record as a carriage return. The mainframe was set to sense a carriage return as end of data, so it would terminate the read and go on to the next step in its program. Ours was a slower mainframe, so by the time it got its act together and hung another read up, several records had gone past and into the bit bucket. The other part of the problem was that the application programmer read a record and then went on to process it, including disk access, before reading the next record -- which took a lot of time relative to the datacomm line speed.

It was fixed by changing the barcode reader to terminate each record with a Record Separator (RS) character and not send a carriage return until all 64K had been sent (many teeth were pulled as we interrogated the vendor programmer over the phone). Then one of our datacomm systems programmers (to whom a macro-assembler was a high-level language) set the mainframe to continuously read data into a gargantuan buffer until it read the final carriage return that terminated the read. The next step was to split out the records based on the RS character and then pass that list to the remainder of the program.
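The record-handling scheme that fixed it can be sketched like this -- a Python illustration of the idea, obviously not the original macro-assembler:

```python
# Sketch of the fixed framing: records are separated by RS, and a single
# carriage return terminates the whole 64K transmission. The mainframe
# reads everything into one big buffer, then splits on RS afterward.

RS = "\x1e"  # ASCII Record Separator
CR = "\r"    # final carriage return ends the entire transmission

def split_records(stream):
    """Buffer the whole transmission up to the final CR, then split on RS."""
    data, _, _ = stream.partition(CR)  # one continuous read to the final CR
    return [rec for rec in data.split(RS) if rec]

print(split_records("chair-001" + RS + "desk-042" + RS + "lamp-007" + CR))
```

The key design point is the separation of concerns: read first at line speed, process later at disk speed -- so the slow per-record work can no longer cause the line to drop data.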

"You have to understand how a starship operates." -- Capt. Kirk, Star Trek II: The Wrath of Khan.