Category Archives: Best Practices

Few things are more universally desired in IT than High Availability (HA) solutions. I mean really, say those words and any IT Pro will instantly say that they want that. HA for their servers, their apps, their storage and, of course, even their desktops. If there was a checkbox next to any system that simply said “HA”, why wouldn’t we check it? We would, of course. No one voluntarily wants a system that fails a lot. Failure bad, success good.

First, though, we must define HA. HA can mean many things. At a minimum, HA must mean that the availability of the system in question must be higher than “normal”. What is normal? That alone is hard enough to define. HA is a loose term, at best. In the context of its most common usage, though, which is common applications running on normal enterprise hardware I would offer this starting point for HA discussions:

Normal or Standard Availability (SA) would be defined as the availability from a common mainline server running a common enterprise operating system running a common enterprise application in a best practices environment with enterprise support. Good examples of this might include Exchange running on Windows Server running on the HP Proliant DL380 (the most common mainline commodity server.) Or for BIND (the DNS server) running on Red Hat Enterprise Linux on the Dell PowerEdge R730. These are just examples to be used for establishing a rough baseline. There is no great way to measure this, but with a good support contract and rapid repair or replacement in the real world, reliability of a system of this nature is believed to be between four and five nines of reliability (99.99% uptime or higher) when human failure is not included.

High Availability (HA) should be commonly defined as having an availability significantly higher than that of Standard Availability. Significantly higher should be a minimum of one order of magnitude hincrease. So at least five nines of reliability and more likely six nines. (99.9999% uptime.)

Low Availability (LA) would be commonly defined as having an availability significantly lower than that of Standard Availability with significantly, again, meaning at least one order of magnitude. So LA would typically be assumed to be around 99% to 99.9% or lower availability.

Measurement here is very difficult as human factors, environmental and other play a massive role in determining the uptime of different configurations. The same gear used in one role might achieve five nines while in another fail to achieve even one. The quality of the datacenter, skill of the support staff, rapidity of parts replacement, granularity of monitoring and a multitude of other factors will affect the overall reliability significantly. This is not necessarily a problem for us, however. In most cases we can evaluate the portions of a system design that we control in such a way that relative reliability can be determined so that at least we can show that one approach is going to be superior to another in order that we can then leverage well informed decision making even if accurate failure rate models cannot be easily built.

It is important to note that other than providing a sample baseline set of examples from which to work there is nothing in the definitions of high availability or low availability that talk about how these levels should be achieved – that is not what the terms mean. The terms are resultant sets of reliability in relation to the baseline and nothing else. There are many ways to achieve high availability without using commonly assumed approaches and practically unlimited ways to achieve low availability.

Of course HA can be defined at every layer. We can have HA platforms or OS but have fragile applications on top. So it is very important to understand at what level we are speaking at any given time. At the end of the day, a business will only care about the high availability delivery of services regardless of how it is achieved, or where. The end result is what matters not the “under the hood” details of how it was accomplished or, as always, the ends justify the means.

It is extremely common today for IT departments to become distracted by new and flashy HA tools at the platform layer and forget to look for HA higher and lower in the stack to ensure that we provide highly available services to the business; rather than only looking at the one layer while leaving the business just as vulnerable, or moreso, than ever.

In the real world, though, HA is not always an option and, when it is, it comes at a cost. That cost is almost always monetary and generally comes with extra complexity as well. And as we well know, any complexity also carries additional risk and that risk could, if we are not careful, cause an attempt to achieve HA actually fail and might even leave us with LA or Low Availability.

Once we understand this necessary language for describing what we mean, we can begin to talk about when high availability, standard availability and even low availability may be right for us. We use this high level of granularity because it is so difficult to measure system reliability that getting too detailed becomes valueless.

Conceptually, all systems come with risk of downtime and nothing can be always up, that’s impossible. Reliability costs money, generally, all other things being equal. So to determine what level of availability is most appropriate for a workload we must determine the cost of risk mitigation (the amount of money that it takes to change the average amount of downtime) and compare that against the cost of the downtime itself.

This gets tricky and complicated because determining cost of downtime is difficult enough, then determining the risk of downtime is even more difficult. In many cases, downtime is not a flat number, but it might be. This cost could be expressed as $5/minute or $20K/day or similar. But an even better tool would be to create a “loss impact curve” that shows how money is lost over time (within a reasonable interval.)

For example, a company might easily face no loss at all for the first five minutes with slowly increasing, but small, losses until about four hours when work stops because people can no longer go to paper or whatever and then losses go from almost zero to quite large. Or some companies might take a huge loss the moment that the systems are down but the losses slowly dissipate over time. Loss might only be impactful at certain times of day. Maybe outages at night or during lunch are trivial but mid morning or mid afternoon are major. Every company’s impact, risk and ability to mitigate that risk are different, often dramatically so.

Sometimes it comes down to the people working at the company. Will they all happily take needed bathroom, coffee, snack or even lunch breaks at the time that a system fails so that they can return to work when it is fixed? Will people go home early and come in early tomorrow to make up for a major outage? Is there machinery that is going to sit idle? Will the ability to respond to customers be impacted? Will life support systems fail? There are countless potential impacts and countless potential ways of mitigating different types of failures. All of this has to be considered. The cost of downtime might be a fraction of corporate revenues on a minute by minute basis or downtime might cause a loss of customers or faith that is more impactful than the minute by minute revenue generated.

Once we have some rough loss numbers to deal with we at least have a starting point. Even if we only know that revenue is ~$10/minute and losses are expected to be around ~$5/minute we have a starting point of sorts. If we have a full curve or a study done with some more detailed numbers, all the better. Now we need to figure out roughly what our baseline is going to be. A well maintained server, running on premises, with a good support contract and good backup and restore procedures can pretty easily achieve four nines of reliability. That means that we would experience about five hours of downtime every five years. This is actually less than the generally expected downtime of SA in most environments and potentially far less than expected levels in excellent environments like high quality datacenters with nearby parts and service.

So, based on our baseline example of about five hours every five years we can figure out our potential risk. If we lose about ~$5/minute and we expect roughly 300 minutes of downtime every five years we looking at a potential loss of $1,500 every half decade.

That means that at the most extreme we could never spend $1,500 to mitigate that risk, that would be financially absurd. This happens for several reasons. One of the biggest is that this is only a risk, spending $1,500 to protect against losing $1,500 makes little sense, but it is a very common mistake to make when people do not analyze these numbers carefully.

The biggest factor is that any mitigation technique is not completely effective. If we manage to move our four nines system to a five nines system we would reduce only 90% of the average downtime moving us from $1,500 of loss to $150 of loss. If we spent $1,500 for that reduction, the total “loss” would still be $1,650 (the cost of protection is a form of financial loss.) The cost of the risk mitigation combined with the anticipated remaining impact when taken together must still be lower than the anticipated cost of the risk without mitigation or else the mitigation itself is pointless or actively damaging.

Many may question why the total cost of risk mitigation must be lower and not simply equal as surely, that must mean that we are at a “risk break even” point? This seems true on the surface, but because we are dealing with risk this is not the case. Risk mitigation is a certain cost- financial damage that we take up front in the hopes of reducing losses tomorrow. But the risk for tomorrow is a guess, hopefully a well educated one, but only a guess. The cost today is certain. Taking on certain damage today in the hopes of reducing possible damage tomorrow only makes sense when the damage today is small and the expected or possible damage tomorrow is very large and the effectiveness of mitigation is significant.

Included in the idea of “certain cost of front” to reduce “possible cost tomorrow” is the idea of the time value of money. Even if an outage was of a known size and time, we would not spend the same money today to mitigate it tomorrow because our money is more valuable today.

In the most dramatic cases, we sometimes see IT departments demanding tens or hundreds of thousands of dollars be spent up front to avoid losing a few thousand dollars, maybe, sometime maybe many years in the future. A strategy that we can refer to as “shooting ourselves in the face today to avoid maybe getting a headache tomorrow.”

It is included in the concept of evaluating the risk mitigation but it should be mentioned specifically that in the case of IT equipment there are many examples of attempted risk mitigation that may not be as effective as they are believed to be. For example, having two servers that sit in the same rack will potentially be very effective for mitigating the risk of host hardware failure, but will not mitigate against natural disasters, site loss, fire, most cases of electrical shock, fire suppression activation, network interruptions, most application failure, ransomware attack or other reasonably possible disasters.

It is common for storage devices to be equipment with “dual controllers” which gives a strong impression of high reliability, but generally these controllers are inside a single chassis with shared components and even if the components are not shared, often the firmware is shared and communications between components are complex; often leading to failures where the failure of one component triggers the failure of another – making them quite frequently LA devices rather than SA or the HA that people expected when purchasing them. So it is very critical to consider if the risk mitigation strategy will mitigate which risks and if the mitigation technique is likely to be effective. No technique is completely effective, there is always a chance for failure, but some strategies and techniques are more broadly effective than others and some are simply misleading or actually counter productive. If we are not careful, we may implement costly products or techniques that actively undermine our goals.

Some techniques and products used in the pursuit of high availability are rather expensive, which might include buying redundant hardware, leasing another building, installing expensive generators or licensing special software. There are low cost techniques and software as well, but in most cases any movement towards high availability will result in a respectively large outlay of investment capital in order to achieve it. It is absolutely critical to keep in mind that high availability is a process, there is no way to simply buy high availability. Achieving HA requires good documentation, procedures, planning, support, equipment, engineering and more. In the systems world, HA is normally approached first from an environmental perspective with failover power generators, redundant HVAC systems, power conditioning, air filtration, fire suppression systems and more to ensure that the environment for the availability is there. This alone can often make further investment unnecessary as this can deliver incredible results. Then comes HA system design ensuring that not just one layer of a technology stack is highly available but that the entire stack is allowing for the critical applications, data or services to remain functional during as much time as possible. Then looking at site to site redundancy to be able to withstand floods, hurricanes, blizzards, etc. Of course there are completely different techniques such as utilizing cloud computing services hosted remotely on our behalf. What matters is that high availability requires broad thinking and planning, cannot simply be purchased as a line item and is judged by the ability to return a risk factor providing a resulting uptime or likelihood of uptime much higher than a “standard” system design would deliver.

What is often surprising, almost shocking, to many businesses and especially to IT professionals, who rarely undertake financial risk analysis and who are constantly being told that HA is a necessity for any business and that buying the latest HA products is unquestionably how their budgets should be spent, is that when the numbers are crunched and the reality of the costs and effectiveness of risk mitigation strategies are considered that high availability has very little place in any organization, especially those that are small or have highly disparate workloads. In the small and medium business market it is almost universal to find that the cost and complexity (which in turn brings risk, mostly in the form of a lack of experience around techniques and risk assessment) of high availability approaches is far too costly to ever offset the potential damage of the outage from which the mitigation is hoped to protect. There are exceptions, of course, and there are many businesses for which high availability solutions are absolutely sensible, but these are the exception and very far from being the norm.

It is also sensible to think of the needs for high availability to be based on a workload basis and not department, company or technology wide. In a small business it is common for all workloads to share a common platform and the need of a single workload for high availability may sweep other, less critical, workloads along with it. This is perfectly fine and a great way to offset the cost of the risk mitigation of the critical workload through ancillary benefit to the less critical workloads. In a larger organization where there is a plethora of platform approaches used for differing workloads it is common for only certain workloads that are both highly critical (in terms of risk from downtime impact) and that are practically mitigated of risk (the ability to mitigate risk can vary dramatically between different types of workloads) to have high availability applied to them and other workloads to be left to standard techniques.

Examples of workloads that may be critical and can be effectively addressed with high availability might be an online ordering system where the latency created by multi-regional replication has little impact on the overall system but losing orders could be very financially impactful should a system fail. An example of a workload where high availability might be easy to implement but ineffectual would be an internal intranet site serving commonly asked HR questions; it would simply not be cost effective to avoid small amounts of occasional downtime for a system like this. An example of a system where risk is high but the cost or effectiveness of risk mitigation makes it impractical or even impossible might be a financial “tick” database requiring massive amounts of low latency data to be ingested and the ability to maintain a replica would not only be incredibly costly but could introduce latency that would undermine the ability of the system to perform adequately. Every business and workload is unique and should be evaluated carefully.

Of course high availability techniques can be actioned in stages; it is not an all or nothing endeavor. It might be practical to mitigate the risk of system level failure by having application layer fault tolerance to protect against failure of system hardware, virtualization platform or storage. But for the same workload it might not be valuable to protect against the loss of a single site. If a workload only services a particular site or is simply not valuable enough for the large investment needed to make it fail between sites it could easily fall “in the middle.” It is very common for workloads to only implement partially high availability solutions, often because an IT department may only be responsible for a portion of them and have no say over things like power support and HVAC, but probably most common because some high availability techniques are seen as high visibility and easy to sell to management while others, such as high quality power and air conditioning, often are not even though they may easily provide a better bang for the buck. There are good reasons why certain techniques may be chosen and not others as they affect different risk components and some risks may have a differing impact on an individual business or workload.

High availability requires careful thought as to whether it is worth considering and even more careful thought as to implementation. Building true HA systems requires a significant amount of effort and expertise and generally substantial cost. Understanding which components of HA are valuable and which are not requires not just extensive technical expertise but financial and managerial skills as well. Departments must work together extensively to truly understand how HA will impact an organization and when it will be worth the investment. It is critical that it be remembered that the need for high availability in an organization or for a workload is anything but a foregone conclusion and it should not be surprising in the least to find that extensive high availability or even casual high availability practices turn out to be economically impractical.

In many ways this is because standard availability has reached such a state that there is continuously less and less risk to mitigate. Technology components used in a business infrastructure, most notably servers, networking gear and storage, have become so reliable that the amount of downtime that we must protect against is quite low. Most of the belief in the need for knee jerk high availability comes from a different era when reliable hardware was unaffordable and even the most expensive equipment was rather unreliable by modern standards. This feeling of impending doom that any device might fail at any moment comes from an older era, not the current one. Modern equipment, while obviously still carrying risks, is amazingly reliable.

In addition to other risks, over-investing in high availability solutions carries financial and business risks that can be substantial. It increases technical debt in the face of business uncertainty. What if the business suddenly grows, or worse, what if it suddenly contracts, changes direction, gets purchased or goes out of business completely? The investment in the high availability is already spent even if the need for its protection disappears. What if technology or location change? Some or all of a high availability investment might be lost before it would have been at its expected end of life.

As IT practitioners, evaluating the benefits, risks and costs of technology solutions is at the core of what we do. Like everything else in business infrastructure, determining the type of risk mitigation, the value of protection and how much is financially proper is our key responsibility and cannot be glossed over or ignored. We can never simply assume that high availability is needed, nor that it can simply be skipped. It is in analysis of this nature that IT brings some of its greatest value to organizations. It is here that we have the potential to shine the most.

Like this:

Traditionally Long Term Support operating system releases have been the bulwark of enterprise deployments. This is the model used by IBM, Oracle, Microsoft, Suse and Red Hat and has been the conventional thinking around operating systems since the beginning of support offerings many decades ago.

It has been common in the past for both servers and desktop operating system releases to follow this model, but in the Linux space specifically we began to see this get shaken up where less formal products were free to experiment with more rapid, unsupported or simply unstructured releases. In the primary product space, openSuse, Fedora and Ubuntu all provided short term support offerings or rapid release offerings. Instead of release cycles measured in years and support cycles closing in on a decade they shorted release cycles to months and support to just months or a few years at most.

In the desktop space, getting new features and applications sooner, instead of focusing primarily on stability as was common on servers, often made sense and brought the added benefit that new technologies or approaches could be tested on faster release cycle products before being integrated into long term support server products. Fedora, for example, is a proving ground for technologies that will, after proving themselves, make their way into Red Hat Enterprise Linux releases. By using Fedora, end users get features sooner, get to learn about RHEL technologies earlier and Red Hat gets to test the products on a large scale before deploying to critical servers.

Over time the stability of short term releases has improved dramatically and increasingly these systems are seen as viable options for server systems. These systems get newer enhancements, features and upgrades sooner which is often seen as beneficial.

A major benefit of any operating system is their support ecosystem, including the packages and libraries that are supported and provided as part of the base operating system. With long term releases, we often see critical packages aging dramatically throughout the life of the release which can cause problems with performance, compatibility and even security in extreme cases. This obviously forces users of long term release operating systems to choose between continuing to live with the limitations of the older components or to integrate new components themselves which often breaks the fundamental value of the long term release product.

Because the goal of a long term release is to have stability and integration testing, replacing components within the product to “work around” the limitations of an LTS means that those components are not being treated in an LTS manner and that integration testing from the vendor is no longer happening, most likely, or if it is, not to the same degree. In effect, what happens is that this becomes a self-built short term release product but with legacy core components and less oversight.

In reality, in most respects, doing this is worse than going directly to a short term release product. Using a short term or rapid release product allows the vendor to maintain the assumed testing and integration, just with a faster release and support cycle, so that the general value of the long term release concept is maintained and with all components of the operating system, rather than just a few, being updated. This allows for more standardization, industry testing and shared knowledge and integration than with a partial LTS model.

Maybe the time has come to rethink the value of long term support for operating systems. For too long, it seems, the value of this approach was simply assumed and followed, and certainly it had and has merits; but the operating system world has changed since this approach was first introduced. The need for updates has increased while the change rates of things like kernels and libraries have slowed dramatically. More powerful servers have moved compatibility higher up the stack and instead of software being written to an OS it is often written for a specific version of a language or run time or other abstraction layer.

Shorter release cycles means that systems get features, top to bottom, more often. Updates between “major” releases are smaller and less impactful. Changes from updates are more incremental, providing a more organic learning and adaptation curve. And most importantly the need for replacing system components that are carefully tested and integrated with third party provided versions becomes, effectively, unheard of.

Stability for software vendors remains a value for long term releases and will cause there to be a need for the use of long term releases for a long time to come. But for the system administrator, the value to this approach seems to be decreasing and, I feel personally, has found an inflection point in recent years. It used to seem expected and normal to wait two or three years for packages to be updated, but today this feels unnecessarily cumbersome. It seems increasingly common that higher level components are built with a requirement of newer underlying components; an expectation that operating systems will either be more current or that portions of the OS will be updated separately from the rest.

A heavy reliance on containerization technologies may reverse this trend in some ways, but in ways that always reduce the value of long term releases at the same time. Containerization reduces the need for extensive capabilities in the base operating system making it easier and more effective to update more frequently for improved kernel, filesystem, driver and container support while leaving libraries and other dependencies in the containers allowing applications that need long term support dependencies to be met in that way and applications that can benefit from newer components to be addressed in that manner.

Of course virtualization has played a role in reducing the value of long term support models by making rapid recovery and duplication of systems trivial. Stability that we’ve needed long term support releases to address is partially addressed by the virtualization layer; hardware abstraction improves driver stability in very important ways. In the same vein, devops style support models also reduce the need for long term support and make server ecosystems more agile and flexible. Trends in system administration paradigms are tending to favour more modern operating systems.

Time will tell if trends continue in the direction that they are headed. For myself, this past year has been an eye opening one that has seen me move my own workloads from a decade of staunch support for very long term support products to rapid release ones and I must say, I am very happy with the change.

Like this:

We get this all of the time in IT, a vendor tells us that a system cannot be virtualized. The reasons are numerous. On the IT side, we are always shocked that a vendor would make such an outrageous claim; and often we are just as shocked that a customer (or manager) believes them. Vendors have worked hard to perfect this sales pitch over the years and I think that it is important to dissect it.

The root cause of problems is that vendors are almost always seeking ways to lower costs to themselves while increasing profits from customers. This drives a lot of what would otherwise be seen as odd behaviour.

One thing that many, many vendors attempt to do is limit the scenarios under which their product will be supported. By doing this, they set themselves up to be prepared to simply not provide support – support is expensive and unreliable. This is a common strategy. It some cases, this is so aggressive that any acceptable, production deployment scenario fails to even exist.

A very common means of doing this is to fail to support any supported operating system, de facto deprecating the vendor’s own software (for example, today this would mean only supporting Windows XP and earlier.) Another example is only supporting products that are not licensed for the use case (an example would be requiring the use of a product like Windows 10 be used as a server.) And one of the most common cases is forbidding virtualization.

These scenarios put customers into difficult positions because on one hand they have industry best practices, standard deployment guidelines, in house tooling and policies to adhere to; and on the other hand they have vendors often forbidding proper system design, planning and management. These needs are at odds with one another.

Of course, no one expects every vendor to support every potential scenario. Limits must be applied. But there is a giant chasm between supporting reasonable, well deployed systems and actively requiring unacceptably bad deployments. We hope that our vendors will behave as business partners and share a common interest in our success or, at the very least, the success of their product and not directly seek to undermine both of these causes. We would hope that, at a very minimum, best effort support would be provided for any reasonable deployment scenario and that guaranteed support would be likely offered for properly engineered, best practice scenarios.

Imagine a world where driving the speed limit and wearing a seatbelt would violate your car warranty and that you would only get support if you drove recklessly and unprotected!

Some important things need to be understood about virtualization. The first is that virtualization is a long standing industry best practice and is expected to be used in any production deployment scenario for services. Virtualization is in no way new, even in the small business market it has been in the best practice category for well over a decade now and for many decades in the enterprise space. We are long past the point where running systems non-virtualized is considered acceptable, and that includes legacy deployments that have been in place for a long time.

There are, of course, always rare exceptions to nearly any rule. Some systems need access to very special case hardware and virtualization may not be possible, although with modern hardware passthrough this is almost unheard of today. And some super low latency systems cannot be virtualized but these are normally limited to only the biggest international investment banks and most aggressive hedgefunds and even the majority of those traditional use cases have been eliminated by improvements in virtualization making even those situations rare. But the bottom line is, if you can’t virtualize you should be sad that you cannot, and you will know clearly why it is impossible in your situation. In all other cases, your server needs to be virtual.

Is it not important?

If a vendor does not allow you to follow standard best practices for healthy deployments, what does this say about the vendor’s opinion of their own product? If we were talking about any other deployment, we would immediately question why we were deploying a system so poorly if we plan to depend on it. If our vendor forces us to behave this way, we should react in the same manner – if the vendor doesn’t take the product to the same degree that we take the least of our IT services, why should we?

This is an “impedance mismatch”, as we say in engineering circles, between our needs (production systems) and how the vendor making that system appears to treat them (hobby or entertainment systems.) If we need to depend on this product for our businesses, we need a vendor that is on board and understands business needs – has a production mind set. If the product is not business targeted or business ready, we need to be aware of that. We need to question why we feel we should be using a service in production, on which we depend and require support, that is not intended to be used in that manner.

Is it supported? Is it being tested?

Something that is often overlooked from the perspective of customers is whether or not the necessary support resources for a product are in place. It’s not uncommon for the team that supports a product to become lean, or even disappear, but the company to keep selling the product in the hopes of milking it for as much as they can and bank on either muddling through a problem or just returning customer funds should the vendor be caught in a situation where they are simply unable to support it.

Most software contracts state that the maximum damage that can be extracted from the vendor is the cost of the product, or the amount spent to purchase it. In a case such as this, the vendor has no risk from offering a product that they cannot support – even if charging a premium for support. If the customer manages to use the product, great they get paid. If the customer cannot and the vendor cannot support it, they only lose money that they would never have gotten otherwise. The customer takes on all the risk, not the vendor.

This suggests, of course, that there is little or no continuing testing of the product as well, and this should be of additional concern. Just because the product runs does not mean that it will continue to run. Getting up and running with an unsupported, or worse unsupportable, product means that you are depending more and more over time on a product with a likely decreasing level of potential support, slowly getting worse over time even as the need for support and the dependency on the software would be expected to increase.

If a proprietary product is deployed in production, and the decision is made to forgo best practice deployments in order to accommodate support demands, how can this fit in a decision matrix? Should this imply that proper support does not exist? Again, as before, this implies a mismatch in our needs.

Is It Still Being Developed?

If the deployment needs of the software follow old, out of date practices, or require out of date (or not reasonably current software or design) then we have to question the likelihood that the product is currently being developed. In some cases we can determine this by watching the software release cycle for some time, but not in all cases. There is a reasonable fear that the product may be dead, with no remaining development team working on it. The code may simply be old, technical debt that is being sold in the hopes of making a last, few dollars off of an old code base that has been abandoned. This process is actually far more common than is often believed.

Smaller software shops often manage to develop an initial software package, get it on the market and available for sale, but fail to be able to afford to retain or restaff their development team after initial release(s). This is, in fact, a very common scenario. This leaves customers with a product that is expected to become less and less viable over time with deployment scenarios becoming increasingly risky and data increasing hard to extricate.

How Can It Be Supported If the Platform Is Not Supported?

A common paradox of some more extreme situations is software that, in order to qualify as “supported”, requires other software that is either out of support or was never supported for the intended use case. Common examples of this are requiring that a server system be run on top of a desktop operating system or requiring versions of operating systems, databases or other components, that are no longer supported at all. This last scenario is scarily common. In a situation like this, one has to ask if there can ever be a deployment, then, where the software can be considered to be “supported”? If part of the stack is always out of support, then the whole stack is unsupported. There would always be a reason that support could be denied no matter what. The very reason that we would therefore demand that we avoid best practices would equally rule out choosing the software itself in the first place.

Are Industry Skills and Knowledge Lacking?

Perhaps the issue that we face with software support problems of this nature are that the team(s) creating the software simply do not know how good software is made and/or how good systems are deployed. This is among the most reasonable and valid reasons for what would drive us to this situation. But, like the other hypothesis reasons, it leaves us concerned about the quality of the software and the possibility that support is truly available. If we can’t trust the vendor to properly handle the most visible parts of the system, why would we turn to them as our experts for the parts that we cannot verify?

The Big Problem

The big, overarching problem with software that has questionable deployment and maintenance practice demands in exchange for unlocking otherwise withheld support is not, as we typically assume a question of overall software quality, but one of viable support and development practices. That these issues suggest a significant concern for long term support should make us strongly question why we are choosing these packages in the first place while expecting strong support from them when, from the onset, we have very visible and very serious concerns.

There are, of course, cases where no other software products exist to fill a need or none of any more reasonable viability. This situation should be extremely rare and if such a situation exists should be seen as a major market opportunity for a vendor looking to enter that particular space.

From a business perspective, it is imperative that the technical infrastructure best practices not be completely ignored in exchange for blind or nearly blind following of vendor requirements that, in any other instance, would be considered reckless or unprofessional. Why do we so often neglect to require excellence from core products on which our businesses depend in this way? It puts our businesses at risk, not just from the action itself, but vastly moreso from the risks that are implied by the existence of such a requirement.

Like this:

We should set the stage by looking at some historical context around GUIs and their role within the world of systems administration.

In the “olden days” we did not have graphical user interfaces on any computers at all, let alone on our servers. Long after GUIs began to become popular on end user equipment, servers still did not have them. In the 1980s and 1990s the computational overhead necessary to produce a GUI was significant in terms of the total computing capacity of a machine and using what little that there was to produce a GUI was rather impractical, if often not even completely impossible. The world of systems administration grew up in this context, working from command lines because there was no other option available to us. It was not common for people to desire GUIs for systems administration, perhaps because the idea had not occurred to people yet.

In the mid-1990s Microsoft, along with some others, began to introduce the idea of GUI-driven systems administration for the entry level server market. At first the approach was not that popular as it did not match how experienced administrators were working in the market. But slowly, as new Windows administrators and to some degree as Novell Netware administrators, began to “grow up” with access to GUI-based administration tools there began to be an accepted place in the server market for these systems. In the mid to late 1990s the UNIX and other non-Windows servers completely dominated the market. Even VMS was a major player still and on the small business and commodity server side Novell Netware was the dominant player mid-decade and still a very serious contender late in the decade. Netware offered a GUI experience but one that was very light and should probably be considered only “semi-GUI” in comparison to Windows NT’s rich GUI experience offered by at least 1996 and to some degree earlier with the NT 3.x family, although Windows NT was only just finding its place in the world before NT 4’s release.

Even at the time, the GUI-driven administration market remained primarily a backwater. Microsoft and Windows still had no major place on the server side but was beginning to make inroads via the small business market where their low cost and easy to use products made a lot sense. But it was truly the late 1990s panic and market expansion brought on by the combination of the Y2K scare, the dotcom market bubble and excellent product development and marketing by Microsoft that a significant growth and shift to a GUI-driven administration market occurred.

The massive expansion of the IT market in the late 1990s meant that there was not enough time or resources to train new people entering IT. The learning curve for many systems, including Solaris and Netware, was very steep and the industry needed a truly epic number of people to go from zero to “competent IT professional” faster than it was possible to do with the existing platforms of the day. The market growth was explosive and there was so much money to be made working in IT that there were no available resources to effectively train new people who needed to be coming into IT as anyone qualified to handle educational duties was also able to earn so much more working in the industry rather than working in education. As the market grew, the value of mature, experienced professionals became extremely high, as they were more and more rare in the ever expanding field as a whole.

The market responded to this need in many ways but one of the biggest ones was to fundamentally change how IT was approached. Instead of pushing IT professionals to overcome the traditional learning curves and develop the needed skills to effectively manage the systems that were on the market at the time, the market changed which tools that they were using to accommodate less experienced and less knowledgeable IT staff. Simpler and often more expensive tools often with GUI interfaces began to flood the market allowing those with less training and experience to at least begin to be useful and productive almost immediately even without ever having seen a product previously.

This change coincided with the natural advancement of the performance of computer hardware. It was during this era that for the first time the power of many systems was such that while the GUI still made a rather significant impact to performance, the lower cost of support staff and speed at which systems could be deployed and managed generally offset this loss of computing capacity taken by the GUI. The GUI rapidly became a standard addition to systems that just a few years before would never have seen one.

To improve the capabilities of these new IT professionals and to rush them into the marketplace the industry also shifted heavily towards certifications, more or less a new innovation at the time, which allowed new IT pros, often with no hands on experience of any kind, to establish some degree of competence and to do so commonly without needing any significant interaction or investment from existing IT professionals like university programs would require. Both the GUI-based administration market, as well as the certification industry, boomed; and the face IT significantly changed.

The result certainly was a flood of new, untrained or lightly trained IT professionals entering the market at a record pace. In the short term this change work for the industry. The field went from dramatically understaffed to relatively well staffed years faster than it could have done so otherwise. But it did not take long before the penalties for this rapid uptake of new people began to appear.

One of the biggest impacts to the industry was that there was an industry-wide “baby boom” with all of the growing pains that that would entail. An entire generation of IT professionals grew up in the boot camps and rapid “certification training” programs of the late 1990s. This resulted in a long term effect of the rules of thumb and general approaches common in that era becoming often codified to the point of near religious belief in a way that previous, as well as later, approaches would not. Often, because education was done quickly and shallowly, many concepts had to be learned by rote without an understanding of the fundamentals behind them. As the “Class of 1998” grew into the senior IT professionals in their companies over time, they became the mentors of new generations and that old rote learning has very visibly trickled down through similar approaches in the years since, even long after the knowledge is outdated or impractical and in many cases it has been interpreted incorrectly and is wrong in predicable ways even for the era from which it sprang.

Part of this learning of the era was a general acceptance that GUIs were not just acceptable but that they were practical and expected. The baby boom effect meant that there was little mentorship from the former era and previously established practices and norms were often swept away. The baby boom effect meant that the industry did not exactly reinvent itself as much as it simple invested itself. Even the concept of Information Technology as a specific industry unto itself took its current form and took hold in the public consciousness during this changing of the guards. Instead of being a vestige or other departments or disciplines, IT came into its own; but it did so without the maturing and continuity of practices that would have existed with more organic growth leaving the industry in possibly a worse position than it might have been would it have developed in a continuous fashion.

The lingering impact of the late 1990s IT boom will be felt for a very long time as it will take many generations for the trends, beliefs and assumptions of that time period to finally be swept away. Slowly, new concepts and approaches are taking hold, often only when old technologies disappear and knew ones are introduced breaking the stranglehold of tradition. One of these is the notion of the GUI being the dominant method by which systems administration is accomplished.

As we pointed out before, the GUI at its inception was a point of differentiation between old systems and the new world of the late 1990s. But since that time GUI administration tools have become ubiquitous in their availability. Every significant platform has and has long had graphical administration options so the GUI no longer sets any platform apart in a significant way. This means that there is no longer any vendor with a clear agenda driving them to push the concept of the GUI. The marketing value of the GUI is effectively gone. Likewise, not only did systems that previously lacked a strong GUI nearly all develop one (or more) but the GUI-based systems that did not have strong command line tools went back and developed those as well and developed new professional ecosystems around them. The tide most certainly turned.

Furthermore, over the past nearly two decades the rhetoric of the non-GUI world has begun to take hold. System administrators working from a position of a mastery of the command line, on any platform, generally outperform their counterparts leading to more career opportunities, more challenging roles and higher incomes. Companies focused on command line administration find themselves with more skilled workers and a higher administration density which, in turn, lowers overall cost.

This alone was enough to make the position of the GUI begin to falter. But there was always the old argument that GUIs, even in the late 1990s, used a small amount of system resources and only added a very small amount of additional attack surface. Even if they were not going to be used, why not have them installed “just in case.” As CPUs got faster, memory got larger, storage got cheaper and as system design improved the impact of the GUI became less and less so this argument of having GUIs available got stronger. Especially strong was the proposal that GUIs allowed junior staff to do tasks as well making them more useful. But it was far too common for senior staff to retain the GUI as a crutch in these circumstances.

With the advent of virtualization in the commodity server space, this all began to change. The cost of a GUI became suddenly noticeable again. A system running twenty virtual machines would suddenly use twenty times the CPU resources and twenty times the memory and twenty times the storage capacity of a single GUI instance. The footprint of the GUI was noticeable again. As virtual machine densities began to climb, so did the relative impact of the GUI.

Virtualization gave rise to cloud computing. Cloud computing increased virtual machine deployment densities and exposed other performance impacts of GUIs, mostly in terms of longer instance build times and more complex remote console access. Systems requiring a GUI began to noticeably lag behind their GUI-less counterparts in adoption and capabilities.

But the far bigger factor was the artifact of cloud computing’s standard billing methodologies. Because cloud computing typically exposes per-instance costs in a raw, fully visible way IT departments had no means of fudging or overlooking the costs of GUI deployments whose additional overhead would often even double the cost of a single cloud instance. Accounting would very clearly see bills for GUI systems costing far more than their GUI-less counterparts. Even non-technical teams could see that the cost of GUIs was adding up even before considering the cost of management.

This cost continues to increase as we move towards container technologies where the scale of individual instances becomes small and smaller means that the relative overhead of the GUI becomes more significant.

But the real impact, possibly the biggest exposure of the issues around GUI driven systems is the industry’s move towards the DevOps system automation models. Today only a relatively small percentage of companies are actively moving to a full cloud-enabled, elastically scalable DevOps model of system management but the trend is there and the model leaves GUI administrators and their systems completely behind. With DevOps models direct access to machines is no longer a standard mode of management and systems have gone even farther than working solely from the command line to being built completely in code meaning that not only do system administrators working in the DevOps world need to interact with their systems at a command line but they must do so programmatically.

The market is rapidly moving towards fewer, more highly skilled systems administrators working with many, many more servers “per admin” than in any previous era. The idea that a single systems administrator can only manage a few dozen servers, a common belief in the GUI world, has long been challenged even in traditional “snowflake” command line systems administration with numbers easily climbing into the few hundred range. But the DevOps model or similar automation models take those numbers into the thousands of servers per administrator. The overhead of GUIs is becoming more and more obvious.

As new technologies like cloud, containers and DevOps automation models become pervasive so does the natural “sprawl” of workloads. This means that companies of all sizes are seeing an increase in the numbers of workloads that need to be managed. Companies that traditionally had just two or three servers today may have ten or twenty virtual instances! The number of companies that need only one or two virtual machines is dwindling.

This all hardly means that GUI administration is going to go away in the near, or even the distant, future. The need for “one off” systems administration will remain. But the ratio of administrators able to work in a GUI administration “one off” mode versus those that need to work through the command line and specifically through scripted or even fully automated systems (a la Puppet, Chef, Ansible) is already tipping incredibly rapidly towards non-GUI system administration and DevOps practices.

What does all of this mean for us in the trenches of the real world? It means that even roles, such as small business Windows administration, that traditionally have had little or no need to work at the command line need to reconsider the dependence on the local server GUI for our work. Command line tools and processes are becoming increasingly powerful, well known and how we are expected to work. In the UNIX world the command line has always remained and the need to rely on GUI tools would almost always be seen as a major handicap. This same impression is beginning to apply to the Windows world as well. Slowly those that rely on GUI tools exclusively are being seen as second class citizens and increasingly relegated to more junior roles and smaller organizations.

The improvement in scripting and automation tools also means that the value of scale is getting better so that the cost to administer small numbers of servers is becoming very high on a per workload basis which means that there is a very heavy encouragement for smaller companies to look towards management consolidation through the use of outside vendors who are able to specialize in large scale systems management and leverage scripting and automation techniques to bring their costs more in line with larger businesses’ costs. The ability to use outside vendors to establish scale or an approximation to it will be very important, over time, for smaller businesses to remain cost competitive in their IT needs while still getting the same style of computing advantages that larger businesses are beginning to experience today.

It should be noted that happening in tandem with this industry shift towards the command line and automation tools is the move to more modern, powerful and principally remote GUIs. This is a far less dramatic shift but one that should not be overlooked. Tools like Microsoft’s RSAT and Server Administrator provide a GUI view that is leveraging command line and API interfaces under the hood. Likewise Canonical’s Ubuntu world now has Landscape. These tools are less popular in the enterprise but are beginning to make the larger SMB market able to maintain a GUI dependency while also managing a larger set of server instances. The advancement in these types of GUI tools may be the strongest force slowing the adoption of command line tools across the board.

Whether we are interested in the move from the command line, to GUIs and back to the command line as an interesting artifact of the history of Information Technology as an industry or if we are looking at this as a means to understanding how systems administration is evolving as a career path or business approach for our own uses it is good for us to appreciate the factors that caused it to occur and why the ebb and flow of the industry is now taking us back out to the sea of the command line once again. By understanding these forces we can more practically asses where the future will take us, when the tide may again change, how to best approach our own careers or decide on both technology and human talent for our organizations.