Friday, March 19, 2010

Software glitches!

In my posting of last week I depicted a fuse box that had been reworked to bypass those pesky fuses that kept popping. In that last post I observed how frustrating it can be when services fail and we can no longer do what we had planned on doing. However, while I could sympathize with whoever it was that eliminated the need for fuses, I also suggested that this was not one of the finest examples of thinking outside the box. Unfortunately, there are other times when we just need a temporary fix, and the photo above is of just one example – the collapsing awning held in place by two appropriately positioned crutches!

This picture was taken by the same contractor who came across the fuse box and was kept as a reminder of poor decision making. The symbolism of propping up something with crutches didn’t escape me either. There have been many times when I would have gladly accepted a pair of crutches to help me rectify a deteriorating circumstance, irrespective of the foolishness that it may have represented. Also symbolic is the real estate agent’s lock box attached to the right side railing – I have to wonder how enthusiastic any prospective buyer would be to enter the dwelling once they had noticed this quick fix!

This weekend saw me back on the race track in the Corvette, continuing with my driver education and laying down laps. The venue was the Auto Club Speedway in Fontana, California. Only a few weeks before, NASCAR had held its second event of the year on this track, and it is, without doubt, the premier circuit we visit each year. Its high banking, long straights, and a demanding infield road course thrown in for good measure make the track quite challenging. It is a circuit that puts very high demands on any car, and it’s not for the faint of heart to roll out of the pits and stand on the gas pedal. Speed builds rapidly on this track, and with concrete barriers everywhere it’s not all that forgiving.

Three laps into my first session the automatic gearbox elected to stop shifting. It’s happened a couple of times before, and I can usually free it by returning the selector to the full automatic position and forgoing the steering wheel mounted paddle shifters. Not this time – the car remained firmly stuck in third gear. Exactly the same thing happened during the second session and I was very frustrated. Unlike previous occasions, however, the dreaded “check engine” light didn’t come on and there were no codes generated for later analysis.

“Gremlins!” I was told by Dave, my local GM service manager. “We checked the database and there are no reported symptoms, and without any error codes stored away in the car’s computer, there’s nothing to look at.” Repeating the first observation, Dave then added, “I would suggest you have an intermittent gremlin, and that there’s likely a bug in the transmission software!” The fix for this, and it’s been done twice before, is to flush the memory and reload the base program, so next time the car is back for a service we will have to go through the process one more time. Unfortunately, the temporary fix is to just drive the car in full automatic mode – and for a track-ready Corvette, that’s pretty close to having it hobble around on a pair of wooden crutches.

Software glitches?

When I came to Tandem Cupertino as a Program Manager it was my first time on a major computer vendor’s campus, and the number of development teams working on Tandem hardware and software was a real eye-opener. Coming from much smaller companies with fewer than a hundred employees, I found seeing literally thousands of developers engaged in implementing everything from operating systems to compilers to database and transaction processing infrastructure software a little overwhelming at first. Cross-functional core teams kept a semblance of order through all the chaos, and attending beer busts on Friday was always as much about ensuring visibility for your program as it was about the beer and popcorn!

The fault tolerant Tandem was not a product anyone wanted to see crash. A “downed” Tandem was cause for immediate executive concern, and Tandem’s Critical Account team had been established to monitor any customer situation where the Tandem systems were experiencing difficulties. Providing a workaround or a temporary fix wasn’t unheard of, and many a time I witnessed the frenetic activity surrounding the quick generation of such a fix. Unlike other systems, Tandem was designed to fail over whenever it suspected any individual processor was experiencing difficulties, and for most situations this worked very well. In talking with field engineers I often heard how customers had been running with a processor offline, its workload picked up by other processors, without the customer being aware of any problems.

The Quality Assurance (QA) teams within Tandem development were among the most diligent technical staff on campus and they delighted in nothing more than beating the very stuffing out of any newly developed product. They took a lot of pride in ensuring that products that made it out of QA rarely generated failures in the field. But with so many products, Tandem still needed a way to ensure conflicts and incompatibilities didn’t arise among combinations of these products, particularly when they were layered to form complete stacks. Did a Tandem stack of SNAX, with Pathway, TMF and NS SQL, all work as specified, and could EMS events generated out of each layer in the stack provide the necessary insight as to what was happening above and below the layer? Did it all hang together and work in harmony?

To ensure quality across the whole system Tandem invested in building the Gremlin Test Center. Under the management of John Merrick, I recall, this included a number of Tandem systems configured with different releases of the OS and stacks – some SNAX, others TCP/IP, some WAN-centric, some LAN-centric, and each with different combinations of Enscribe, SQL, etc. The test center represented a sizeable investment for Tandem and it had been provided with a number of working solutions to further ensure the OS, infrastructure, middleware, and all the supporting management tools worked well together. There was little that ever frightened any Program Manager more than being advised that his program was scheduled next for tests on Gremlin!

In today’s heavily consumer-focused technology marketplace this level of testing has proved too expensive to maintain, and younger generations have grown up completely at ease with situations like the dreaded blue screen of death, a la Microsoft. Modern development tools have certainly cut down on the number of deterministic bugs that make it through the development cycle. Test tools have become very sophisticated and as the adoption of industry-standard components increases, so too does the access to multiple test tools.

Forward-thinking software houses these days don’t leave the responsibility of catching every bug with the QA group – just as with Gremlin in the past, I am seeing these companies pass tested solutions, proclaimed “QA OKed,” to support organizations for a period of intense “thrashing!” Off-the-wall usage scenarios can often uncover some of the hardest-to-find non-deterministic bugs! Maybe not quite as regimented as was the case with testing on Gremlin, but effective all the same. Anything at all that can be done to ensure bugs do not make it into releases and onto customers’ production systems is aggressively pursued, and customers are quick to recognize those companies that take these extra steps.

Even the best software houses will never find every bug in a complex software offering. And customers will never receive the “bug-free” release that they may believe is only one or two releases away! In the late ’90s a customer did go so far as to suggest that they would prefer to wait for the bug-free release, and I was called upon to explain in detail why this was unlikely to ever happen. But today, a decade and a half later, even as software houses have become more proficient in weeding out troublesome bugs before they ever make it into a release, many customers harbor a dimly-lit hope that bugs are now a thing of the past.

It’s a testament to the effort exerted by the development teams at Tandem that the quality of NonStop products was as high as it was. With thousands of systems deployed, there were only ever a handful that experienced difficulties at any given time – mostly from a combination of new solutions, untried interfaces, and variations in regional networking protocols. It was reassuring to know that when such failures occurred Tandem’s Critical Accounts team would get together to find a way to provide a quick fix or workaround that would at least prop up the customer’s system “with crutches” until a permanent solution was ready.

I’m not all that sure that I will be able to shake loose the gremlin in my car’s transmission, or be free from worrying about having just one gear. Thank goodness that, driving an American car, all I have to remember is that there is essentially an , , sequence for restarting my transmission’s processor and that can make me mobile in no time at all! Well, sort of … and, with no disrespect to other car manufacturers, at least the gremlins left my gas pedal and brakes alone!