Since the release of Windows NT Service Pack 4, the engineering
team had been hard at work making sure the new product
would be compatible with all the current hardware standards
and mission-critical applications in use throughout the
enterprise. After months of compatibility testing in the
lab, the project was finally passed over to the software
distribution team for global deployment. Microsoft Systems
Management Server pushed the job out to the workstations
with few problems; it was now time to focus on the servers.
The site in Bangor, Maine was the first to deploy—and
thus the first to witness the “blue screen” boot failures
on some of the older Compaq servers.

Efforts made in the controlled safety of the lab often
don’t yield the same results when we apply them to a
production environment. Despite our best
efforts to look at the task at hand from every angle,
we tend to run into problems that cause us to be up all
night racking our brains about where we diverged from
the beaten path. When the problem is finally solved, we
often find either that the issue was caused by some
incompatibility that was well known (except to us) or
that the lab configuration didn’t accurately reflect our
production environment.

Many of the horror stories we hear regarding production
environment failures during deployments stem not from
a lack of knowledge or skill, but from some divergence
from what was expected. Microsoft’s claim that Service
Pack 4 (SP4) was a simple upgrade shouldn’t have freed
you from having to test the product in your environment.
You may have applied SP4 successfully to your desktop,
but when you applied it to the file server hosting all
of the executives’ home directories, you witnessed the
blue screen of death. After standing in the data center
scratching what remains of your hair for the balance of
the night and trying everything under the sun short of
voodoo, you receive the dreaded call. It’s the director
of IT, asking, “Why can’t I access my home directory?”
You don’t really want to tell him that you never tested
it on this hardware platform, do you?

For those who don white lab coats for a living, existence
is dependent not upon work done in the lab, but on the
ability to repeat experiments successfully on demand.
Successful scientists maintain pristine laboratories and
document every step of every process they perform to assure
that their results will be repeatable if success is attained.
If a scientist believes she found the cure for cancer,
wouldn’t it be a shame if the results were unrepeatable?
Did she really find the cure if she can’t repeat the findings
of the experiment?

When we explain to our peers, customers, and bosses that
a procedure worked in the lab but doesn’t work as planned
in the production environment, our credibility is put
at stake. As technologists, we’re typically a financial
liability to an organization, unless we work for a contracting
firm whose business is to sell our services. We rarely
make any money for the organization; instead, we must
justify our existence within the enterprise by the value
our work adds to existing business processes. We build
solutions that enable business users to do their work
more efficiently, allowing them to spend more time on
the profit-generating business processes rather than on
the tools needed for the job.

Avoiding TechnoDarwinism

If you prefer to fly by the seat of your pants rather
than apply some basic scientific principles to your work,
Darwin’s theory of natural selection will work against
you within your organization. Quite simply, those who
test products and changes before rolling them into production
stand a higher chance of continued employment. Conversely,
those who take their chances by deploying a product in a
production environment without testing it quickly fall
victim to natural selection. These are the individuals
often “selected” to leave the organization after failing
to grasp the importance of applying scientific principles
to their work.

In any well-devised deployment plan, there should always
be time reserved for research and testing. But when things
run late, lab time is usually the first item to get cut.
Most project managers seem to think that the week of testing
you entered on your deployment project plan is merely
a code word to describe the extra time added to every
project plan to accommodate our inability to accurately
predict the unknown. They immediately target this seemingly
bogus entry for deletion or reduction from the project
plan.

Inevitably, once you move your project from the development
domain into production, a host of unforeseen circumstances
keeps you from seeing daylight for the next few days.
This prevents the project from completing anywhere near
the milestone set by the project manager, raising questions
as to whether or not it was truly worth it to cut out
that week of pre-production testing.

All too often, the work we do is so new or unique that
we can’t accurately estimate the time we’ll need or the
obstacles we’ll encounter along the way. Did the NASA
scientists accurately estimate the time or money required
to put the first man on the moon? The moon landing proved
to be an event that NASA would repeat, and inevitably,
the knowledge gained from the first mission would benefit
the time and resource estimates for subsequent missions.
Armed with a bit of knowledge learned from our own lab
experiments, we too can begin to benefit from our previous
experiences.

For systems administrators, there’s often little reason
why we can’t practice in a non-critical environment to
prepare ourselves for the pitfalls that may lie ahead
in the upgrade. That’s not to say that every upgrade, migration,
and deployment will go smoothly if we practice it once
or twice in the lab—there will always be unforeseeable
problems. But generally speaking, significant amounts
of practice beforehand will yield a better success ratio
for our efforts than if we just give it a try and see
what transpires.

The time to research incompatibility issues, test changes
to the environment, and devise disaster plans isn’t after
the event occurs, but long before. If you work in an environment
where you feel you should be donning a fire helmet most
days, you’re already familiar with the dangers of avoiding
a proactive approach to problem solving. Those who are
constantly in a reactive state have no time to prepare
technologies that will increase competitive advantages
for the enterprise. Considering the increasing role of
technology in today’s super-competitive market, even entire
organizations can easily fall victim to the selective
nature of TechnoDarwinism.

A Few Guidelines

To help ensure that efforts in the lab are indeed useful,
consider the following guidelines.

Standardize the User Environment

Too many enterprises lack strict standards for the user
environment. Instead, they let machines exist with varying
directory structures, office automation suites, hardware
platforms, and even operating systems. Because we’re generally
financial liabilities to most organizations, we must find
ways to reduce the cost of supporting machines in the
environment to justify our continued existence. If each
machine is different, there’s no way to benefit from the
economies of scale that we’d enjoy in large enterprise
environments. While a discussion on the importance of
enterprise standards is well outside the scope of this
article, organizations that lack a strict policy on hardware
and software standards are destined to drive IT support
costs significantly higher than truly necessary. Without
a normalized environment, we have no way to predict whether
the results derived in the lab can be re-created in a
production environment.

Research Known Incompatibilities Before
Trying to Change Production Environments

The inability of certain Compaq servers to boot Windows
NT successfully after installation of Service Pack 4 is
well documented on Compaq’s Web site, but we most likely
didn’t find that out until after the blue screen appeared.
All too often, bonus-protecting managers insist that a
deployment be done by some arbitrary date, leaving us
with little time to perform the required testing or research.
A simple visit to Compaq’s Web site could have saved us
hours of downtime (the very thing that kills the manager’s
bonus) and kept us from having to answer the dreaded queries
from senior management about how this could have happened.

By visiting the Compaq site before the upgrade, we would
have learned that there’s a known incompatibility between
firmware v.1.36 and below on SMART/2P and SMART/2E array
controllers and Microsoft Windows NT Service Pack 4. Armed
with such knowledge, we could have applied SSD 2.08 (as
per the guidance of the Customer Advisory) while we had
the scheduled downtime. Had we taken a single proactive
step to gather more information regarding the task at
hand, the SP4 installation on the server might have succeeded.
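
A check like this can be scripted so it runs against every
server before the upgrade window. The following is a minimal
sketch, assuming a hand-built inventory: the Controller type,
the server names, and the idea of comparing dotted firmware
versions component by component are illustrative assumptions,
not part of Compaq’s advisory. Only the affected models and
the 1.36 threshold come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Controller:
    server: str    # hypothetical server name
    model: str     # array controller model, e.g. "SMART/2P"
    firmware: str  # dotted version string, e.g. "1.36"


# Taken from the incompatibility described above: firmware 1.36 and
# below on these controllers is a problem for SP4.
AFFECTED_MODELS = {"SMART/2P", "SMART/2E"}
MAX_AFFECTED_FIRMWARE = (1, 36)


def parse_version(text: str) -> tuple[int, ...]:
    """Turn "1.36" into (1, 36) so versions compare numerically.

    Assumption: each dotted component is a plain integer.
    """
    return tuple(int(part) for part in text.split("."))


def at_risk(inventory: list[Controller]) -> list[Controller]:
    """Return the controllers that match the known incompatibility."""
    return [
        c for c in inventory
        if c.model in AFFECTED_MODELS
        and parse_version(c.firmware) <= MAX_AFFECTED_FIRMWARE
    ]


# Hypothetical inventory, for illustration only.
inventory = [
    Controller("BANGOR-FS1", "SMART/2P", "1.36"),
    Controller("BANGOR-FS2", "SMART/2E", "1.40"),
    Controller("BANGOR-FS3", "SMART/2P", "1.20"),
]

for c in at_risk(inventory):
    print(f"{c.server}: {c.model} firmware {c.firmware} must be updated before SP4")
```

In this sample, BANGOR-FS1 and BANGOR-FS3 are flagged. The
value of the script lies less in the comparison itself than
in forcing us to write the advisory down as data before the
deployment starts.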

Document All Procedures Performed
in the Lab Environment

The most important way to increase the repeatability
of your work in the lab is to make sure you document every
step of the process, no matter how trivial it may seem.
Our notes must be so detailed that a third party can easily
re-create our work without our involvement.

It’s also essential that you have a peer (or a QA group,
if your organization has one) review your documentation.
As authors, we have a tendency to make assumptions that
we may not clearly document in the text.

Create Identical Lab and Production
Environments

If we hope to gain any useful data from our lab experiments,
the lab must closely resemble the production environment
for the task at hand. For example, if we want to simulate
the interaction of an application across domain trusts,
we must first establish a similar environment to what
we have in production. While it’d be ideal to match every
aspect of the production environment in the lab, this
is often cost-prohibitive. Instead, we may be able to
stand in for the 10 servers making up the domain architecture
with decommissioned desktops and servers, and so simulate
the interaction of our product in a multi-domain environment.
The same is true for testing driver updates, hot fixes,
and other system-level software changes to hardware. This
includes making sure that the firmware revisions, drivers,
card locations, memory, processor count, etc. in the lab
equipment match what’s being used in production.
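
One way to keep the lab honest is to record both
configurations as data and compare them mechanically. The
sketch below assumes each machine has been inventoried into
a simple dictionary; the setting names and values are
hypothetical, and how the inventory is gathered is left open.

```python
def config_drift(lab: dict[str, str], production: dict[str, str]) -> dict[str, tuple]:
    """Return every setting that differs or exists on one side only.

    Each entry maps the setting name to (lab value, production value);
    a value of None means the setting is missing on that side.
    """
    drift = {}
    for key in sorted(lab.keys() | production.keys()):
        lab_value = lab.get(key)
        prod_value = production.get(key)
        if lab_value != prod_value:
            drift[key] = (lab_value, prod_value)
    return drift


# Hypothetical inventories, for illustration only.
lab_server = {
    "array_firmware": "1.40",
    "nic_driver": "4.2",
    "memory_mb": "256",
    "processors": "2",
}
production_server = {
    "array_firmware": "1.36",
    "nic_driver": "4.2",
    "memory_mb": "512",
    "processors": "2",
    "array_slot": "3",
}

for setting, (lab_value, prod_value) in config_drift(lab_server, production_server).items():
    print(f"{setting}: lab={lab_value} production={prod_value}")
```

With these sample values the report lists array_firmware,
array_slot, and memory_mb. An empty report is the goal before
any lab result is trusted.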

Each application installed on a machine wants to install
its own DLLs in the system directory, and the
latest version of MDAC installed with Office 2000 may
just break the critical database application the primary
user runs each day. Without significant testing in a lab
that mirrors your standardized production environment,
you can’t provide any assurances (beyond mere guesswork)
to those who count on you that your efforts will be truly
successful.

Use Scripting Methods to Improve Repeatability
of Results

One of the best ways to make sure you can repeat complex
operations is to write a script to perform the upgrade.
Once the script runs the way you want it to, it can be
easily run in the production environment to duplicate
your efforts exactly. This is especially useful when trying
to apply complex NTFS permissions, create users or groups,
or modify the Registry. Scripts also help assure that
the environment has been initialized to a known state
for each test we perform, which is essential for garnering
valid data from our experiments.
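
The idea of resetting to a known state before each test can
be sketched as follows. This is an illustration only: the
baseline files, the stand-in apply_upgrade step, and the use
of a scratch directory in place of a real server are all
assumptions made to keep the example self-contained.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical known-good starting state for every trial.
BASELINE = {"app.ini": "version=1\n", "users.txt": "alice\nbob\n"}


def reset_environment(root: Path) -> None:
    """Wipe the test area and rebuild it from the baseline."""
    if root.exists():
        shutil.rmtree(root)
    root.mkdir(parents=True)
    for name, content in BASELINE.items():
        (root / name).write_text(content)


def apply_upgrade(root: Path) -> None:
    """The change under test; here a stand-in that bumps a config value."""
    config = root / "app.ini"
    config.write_text(config.read_text().replace("version=1", "version=2"))


def run_trial(root: Path) -> dict[str, str]:
    """Reset, apply the change, and return the resulting state."""
    reset_environment(root)
    apply_upgrade(root)
    return {p.name: p.read_text() for p in sorted(root.iterdir())}


with tempfile.TemporaryDirectory() as scratch:
    root = Path(scratch) / "lab"
    first = run_trial(root)
    second = run_trial(root)

print("repeatable:", first == second)
```

Because every trial starts from the same baseline, two runs
of the same script yield the same end state. A difference
between runs would point to a step that depends on something
the script does not control.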

Using the Active Directory Service Interfaces (ADSI)
with our favorite programming language, we can perform
almost any Windows NT, Windows 2000, Exchange, IIS, or
Novell administrative function programmatically. This
is useful not only for developing scripts that re-create
our lab actions in a production environment, but also for
building powerful scripts, in Visual Basic and ADSI for
example, that re-create the production user domain SAM in
our lab environment.
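
The planning half of such a script can be sketched without
touching a directory service at all. The example below makes
no ADSI calls; it assumes the accounts and group memberships
have already been exported from each domain into dictionaries,
and the account and group names are hypothetical. Carrying
out the resulting actions would be the job of ADSI or a
similar interface.

```python
def plan_lab_sync(production: dict[str, set[str]], lab: dict[str, set[str]]) -> list[str]:
    """List the actions needed to make lab accounts match production.

    Both arguments map a user name to the set of groups it belongs to.
    """
    actions = []
    for user in sorted(production.keys() - lab.keys()):
        actions.append(f"create user {user}")
    for user in sorted(lab.keys() - production.keys()):
        actions.append(f"remove user {user}")
    for user in sorted(production):
        current = lab.get(user, set())
        for group in sorted(production[user] - current):
            actions.append(f"add {user} to {group}")
        for group in sorted(current - production[user]):
            actions.append(f"remove {user} from {group}")
    return actions


# Hypothetical exports, for illustration only.
production_accounts = {
    "asmith": {"Domain Users", "Finance"},
    "bjones": {"Domain Users"},
}
lab_accounts = {
    "asmith": {"Domain Users"},
    "tempuser": {"Domain Users"},
}

for action in plan_lab_sync(production_accounts, lab_accounts):
    print(action)
```

Producing the plan as a plain list first means it can be
reviewed, and kept with the lab notes, before anything is
changed.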

To help increase your chances of success when implementing
changes in your production environment, here are some
steps to follow:

If you’re operating in a non-standardized environment,
seize the opportunity to implement standards when performing
a major upgrade to the enterprise (such as Windows 2000).

Research potential known incompatibilities for the
software or hardware you’re about to install.

Re-create the elements of the production environment
that will be affected by your changes in a non-critical
environment or isolated network.

Document your experiences and lab procedures with
meticulous detail.

Script procedures in the lab environment where possible
to guarantee the same procedure will be followed when
it’s moved to production. Whether it’s being used to
initialize the environment during the testing or to
perform the actual task at hand, scripting can help
assure consistent results.

Test the impact of a new application or system update
with all critical applications. Simply logging into
the client isn’t an adequate test for most deployments.

Have a third party validate your documentation to
make sure it can be reproduced without your intervention.

The next time you avert a major system outage because
you found the problem and resolution before the change
was implemented in a production environment, raise a glass
to the parents of scientific thought for their contribution
to your success.