How my experience differed from all those online blogs and HowTos which explain how things just work. There are also some entries where I'm just trying to remember some "stuff" which seemed important or cool (in a minor way) at the time.

Sunday, February 21, 2010

So in part 1, I promised to discuss some of the "soft" requirements to be considered when preparing for a recovery from disaster.

The following documentation should be held electronically. These days it is probably more difficult not to hold documentation electronically.

Documentation of your hardware maintenance contracts
The purpose of this is not to fix your lost equipment, but to immediately take the lost servers off the contract and to later update it with their replacements. This might not seem like a priority, but the larger the company, the larger the cost saving from performing this action. Should you manage to recover some of your servers, you'll need this information to raise support calls.

This action presupposes that you have negotiated your contract so that you can add and remove items during the life of the contract. If you haven't already, you should start doing so from your next renewal date.

Your policies & procedures.
Some might argue that a clean slate is the perfect opportunity to start again. As creating these documents isn't the most fun activity in the known universe, I have some sympathy with this idea. However, it really should be resisted. It will have taken you considerable time to develop those policies & procedures. Some policies might need tweaking, some might need obsoleting, but they are a gold mine of information about your environment.

Good Supplier Relationships
Another "soft" requirement. If it isn't obvious why this is a requirement for disaster recovery, consider that, on the Sunday morning I mentioned in part 1, by 11:00am one of our two main hardware suppliers had:

opened their offices

provided us with internet access

provided us with hardware that had been purchased by and for someone else!

provided us with lab space; phones; electricity; etc

and were arranging with their security company to allow us to stay through the night whilst we worked at building and recovering our backup system from the backup tapes and some installation media.

(Obviously, we did later pay for the hardware. Whether the original purchasers were ever told, I do not know.)

Of course, a good supplier relationship is not something that can be magicked out of a hat first thing on the morning of your disaster. Good supplier relationships are an ongoing concern. That doesn't mean that you overpay for goods and services. That isn't a good relationship. That is being a doormat. It also doesn't mean screwing them over on every deal. It does mean being open with them. Working with them over a long time so that they understand your requirements; that sometimes a quote doesn't lead to a purchase; that company rules require that you get quotes from other suppliers too!

Offsite storage for the backup tapes and all the other documentation above
Offsite storage for your backup tapes is fairly standard. But how frequently do the tapes go offsite? It needs to be daily during the week! If your company is large enough to be able to afford weekend shifts, then you might also want to investigate weekend pickups as well.

Every month on the first day of the month my admin server sends me an email. That email reminds me to burn all the latest documentation from the intranet site down onto a DVD. That DVD then stays in my laptop bag until the next month.
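That reminder is nothing fancy - just a scheduled job. As a sketch (the address and wording here are made up), a crontab entry on the admin server along these lines would do it:

```shell
# Hypothetical crontab entry: at 08:00 on the 1st of each month,
# mail a reminder to burn the latest intranet documentation to DVD.
0 8 1 * * echo "Burn the latest intranet docs to DVD" | mail -s "Monthly DR DVD reminder" admin@example.com
```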

Some of the documentation is stored in a number of Lotus Notes databases. These are even more ideal from a DR perspective. (It is a shame that IBM have made a hash of marketing Lotus Notes - some of its features are ideal for enterprises of any size. But that is a story for another blog post.) You can just make a local replica of the database onto your PC. Whatever the contents of that database. And it can be kept in step via replication, which can be as often as you like. Or never after the initial replication.

At a time when my company operated a campus of multiple buildings, and indeed multiple sites, a firesafe in one of the other buildings was considered offsite.

How to recover your environment.
Given your backup tapes and an empty room, would you know where, or how, to start rebuilding your environment? This is a question Joel Spolsky covered quite cogently in a post just before Christmas. Doing the backup is part of the bread and butter of the job. But so should be the restore.

Even with all the knowledge of your environment that you should have documented, the answer to the question of which servers to restore first will be similar to the start-up order of your datacentre. Similar, but unlikely to be exactly the same.

Of course, the shutdown and startup orders will be part of the documentation listed under "Description of Inter-relationships" described in part 1.

Most "old lags" in IT will have a good idea of which systems need to be restored first and how to do so. Hopefully, there will have been an exercise in how to restore individual servers and systems over time.

In part 1 of this series of posts on disaster recovery, I listed some/most of the information you should keep for each server. However, I missed some items out:
It is just about essential to list what is backed up & how to recover the server with that dataset.
And when you have recovered your server, how do you know it is recovered? How can you prove it has been recovered successfully?

Document a series of tests that will exercise the functionality of the server/system fully or at least to some acceptable (to you and justifiable to others) level of completeness. Generally this information is referred to as return to service information (RTS).
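Even a couple of scripted checks are better than nothing. This is only a sketch - the two checks here are trivial stand-ins, and a real RTS document would list the application-level tests agreed with the business - but it shows the shape of the thing:

```shell
#!/bin/sh
# Minimal sketch of a scripted return-to-service (RTS) check.
# rts_check runs a command and reports PASS/FAIL for the description.
rts_check() {
    desc="$1"; shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS: $desc"
    else
        echo "FAIL: $desc"
    fi
}

# Stand-in checks; replace with the real tests for the system in question.
rts_check "scratch filesystem is writable" touch /tmp/.rts_probe
rts_check "local name resolution works"    getent hosts localhost
```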

Knowledge of the company's insurance policy
This might not be regarded as an IT responsibility. In some companies this might be a site services or facilities management responsibility or a Financial or Legal Dept. concern. In smaller companies, the office manager might be responsible.
In fact, I'd agree it isn't an IT responsibility, or shouldn't be. But if you are responsible for the company's IT infrastructure, you should make yourself aware of whether your company actually has Critical Incident Insurance or whether your company is large enough to carry the risk itself.

The answer will help you prepare. If the replacement cost of your infrastructure is US$2million and your company has no insurance, then the business should know that up to that amount will have to be found in a disaster.

If the company does have insurance, then it is necessary to keep that policy up to date with the value of the company's infrastructure.

Multiple sites
In theory, having multiple sites should enable you to provide resilience through replication of information to the other sites. It depends upon your level of risk and the budget available to you whether you implement replication.

But it is possible to mitigate a lot of risk through data replication between sites. At one stage, there were only the UNIX utilities rdist and, later, rsync to accomplish the task. But they work at the file level. Then a lot of companies worked out how to accomplish this task at the block level. NetApp were possibly the first - the first I was aware of anyway - but it now appears to be a common facility in every vendor's repertoire.

Both free and paid-for DBMSs offer varieties of replication: master/slave and master/master. One of the best database replication mechanisms seems to be that used by Lotus Notes. But Lotus Notes isn't suitable for all applications. Plus IBM doesn't seem to have known how to market it. Actually, IBM frequently doesn't appear to know how to market anything. Anyway, it should be possible to set up your database applications to be location independent.

Well, that was part 2. After part 1, I stated there would be an additional two parts. Whilst finalising this part, I realised there were some issues I had overlooked. There may well be a part 4. It depends upon how part 3 goes.

Saturday, February 20, 2010

This post possibly falls into the "stating the bleeding obvious" category! However,...

Reducing server count isn't just about reducing the amount of hardware. But, if you have virtualised a bunch of servers, will you be so keen to look into combining their functions?
There still can be value in combining the functions of two virtual servers. There are the management benefits of fewer servers to manage and if you have software that is licensed on a per server basis, then there will be financial benefits too.

Eventually, you will develop an architectural dependence on servers that are virtualised. If your mail relay is virtualised and your payroll system isn't, then should your payroll system need to send email, a lot of the infrastructure needed for your virtualisation environment will need to be up and running too.
In most cases the virtual environment will be up and running. However, there are periods when you may be recovering from a disaster when that isn't the case.

Are these reasons not to virtualise?

Of course, not!
But they are factors to consider and factor in to your decision making.

Tuesday, February 16, 2010

This is really useful information. All pulled together in one place. Some of it might seem common sense, but as has frequently been said it is surprising how uncommon common sense can be.

As I wrote in my previous post about recovering from a disaster, documentation on your environment is vital. And these two links below provide that.

I've only just noticed that RVTools has been considerably updated. This is an excellent tool, and the more so for being free. I feel a bit guilty about not being in a position to donate some of my employer's money via PayPal.

Actually, the new feature that I'm most excited by is one that has been there for several versions now. Specifically, the ability to export all the information in csv format. And to do that from the command line!
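For the curious, the command-line export looks roughly like the following. I'm quoting the switches from memory of the RVTools documentation, so treat the exact flags (and names) as an assumption and check the current manual before relying on them:

```shell
# Flags quoted from memory of the RVTools docs - verify before use.
# Connects to vCenter and dumps every tab to csv files in one go.
RVTools.exe -s vcenter.example.com -u svc_rvtools -p secret -c ExportAll2csv -d D:\inventory
```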

I was browsing sunfreeware and came across sys_diag, a script written by Todd Jobson. For documenting the state of a Solaris server, it looks damn comprehensive. I intend to use it on the Solaris systems at my work and save the results into our database of server information.

Saturday, February 13, 2010

Ever since I have been involved in IT, management have been concerned with producing a disaster recovery plan. Inevitably, the hearts, morale and motivation of most of the staff involved have sunk deeper than the bottom of the Mariana Trench.

Management have inevitably sought a tome large enough for any doorstep - your basic shelf-ware, in fact.

However, having lived and worked through as big a disaster as any in the UK - I took a call from my boss at 7:00am one Sunday morning telling me "the building is gone" - this is what I've learnt on the job: both what we had done beforehand and some things I wish we had done.

Some of the requirements covered over the initial three posts I'll be making on this topic could be considered "soft" requirements, as in soft systems methodology, as they aren't focused on hardware, software or physical artifacts. And some of these sections could be the basis of a major article or blog posting. Heck! There are any number of books on Amazon on the subject of Disaster Recovery for IT systems.

Now, to start with, you need to know what you have and therefore what you might have lost.

So have a Full Hardware Inventory
List your servers:

make

model

number of cpus

amount of RAM

number and type of NICs

number and type of HBAs

internal storage

external storage

OS Version

IP Addresses

MAC Addresses

hostid

purchase order number

purchase date

maintenance contract number

serial number

asset tag

etc

The internal and external storage descriptions should be stated both in terms of the disks used and the partitions built on top of them. It is also necessary to document the RAID schemes used and whether RAID is via the hardware or software.

List all the networking equipment: routers, switches, firewalls, load balancers, and network appliances, whether for web caching, spam filtering, mail relays, DNS, etc. There will be a reason for making any configuration change to these systems. If you've opened a port through your firewall, there will be a very good business reason for doing so. In some geographies, e.g. the US, you will be required to record and keep that information for auditing and compliance purposes in a non-editable format. You can usually make comments against your firewall rules - at least you can with Checkpoint - and it is good practice to give each firewall change request a unique reference number to include in the comment on the rule. Then it is possible not only to take a firewall change request and find out if rules exist for it, but also to trace back from a rule to the request that caused its creation.
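On Linux the same trick is available via iptables' comment match. The rule and reference number below are hypothetical, just to show the round trip from change request to rule and back (both commands need root to run):

```shell
# Tag the rule with the change request reference when it is created:
iptables -A INPUT -p tcp --dport 8443 -m comment --comment "FW-CR-2041" -j ACCEPT

# Later, trace from the change request back to the live rule:
iptables -L INPUT -v -n --line-numbers | grep FW-CR-2041
```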

List all the Facilities "stuff", i.e. Air conditioning units, UPS, racking. Depending upon your company and the extent of your loss, then you might also list additional items like printers, scanners, photocopiers and multi-function devices. Perhaps also webcams. Although webcams aren't often allowed in the office anymore.

Basically, you want as much detail as possible. You aren't going to want to purchase the exact self same hardware, but this information will inform any purchasing decisions. You'll understand what your required processing and storage capacities are or were.
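Much of this can be harvested from the live machines rather than typed in by hand. A sketch for a Linux host (the Solaris equivalents would be prtconf, prtdiag and psrinfo; the field names are my own, to be matched to your inventory database):

```shell
#!/bin/sh
# Harvest part of the hardware inventory from a running Linux host.
echo "hostname: $(hostname)"
echo "os:       $(uname -sr)"
echo "cpus:     $(grep -c ^processor /proc/cpuinfo)"
echo "ram_kb:   $(awk '/MemTotal/ {print $2}' /proc/meminfo)"
echo "nics:     $(ls /sys/class/net | tr '\n' ' ')"
```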

Additionally, you'll also need a Full Server Listing
With virtualisation in whatever guise becoming almost mandatory, the difference between a Full Hardware Inventory and a Full Server Listing will be clear. It is still necessary to document the information listed in the previous section for each server.

Full Software Inventory
If you have been following ITIL you should have a Definitive Software Library (DSL), which will contain all your required OS and application software installation media and any significant updates. Even where a DSL exists, individual engineers will have installation media. Unless you have very fussy application software, it is probably not necessary to record individual patch levels for the various components of the OS. Although documenting any patches or software updates that should never be applied might well be valuable: e.g. the company I work for still has to use IE6 (*sigh*), allegedly because of SAP, so the Microsoft updates for IE7 and IE8 are blocked from download.

I like to cross-reference systems against software in addition to software against systems. A spreadsheet isn't always the best mechanism to maintain the referential integrity; a database designed to support such references and searching is a lot more useful.

Review of Inventory Requirements
The information listed above shouldn't just be kept for DR. It can/should be used to:

generate your annual hardware maintenance requirements

generate your annual software maintenance requirements.

determine candidates for hardware upgrades

determine candidates for OS and application software upgrades

identify assets during audits

monitor for capacity planning

monitor the software for feature review, i.e. if you are about to renew the maintenance of a tool would you be better off with a new utility? e.g. we have used vRanger Pro for a couple of years but apparently Veeam Backup and Recovery is now a better product.

identify staff training requirements

identify staff hiring requirements

Consequently, this information should always be gathered or generated and kept up to date. And stored both on and securely offsite.

Now your monitoring system will almost certainly be saving the system state for capacity analysis and planning. In any large enterprise, the realistic timeframes of interest are the last quarter and the last year. You have to plan your spending a year ahead. Capital and expense spending for each quarter will be reviewed quarterly. Your documentation, monitoring and planning should be reviewed over the same periods.

Description of Inter-relationships, i.e. identification of systems
So you know your hardware, your servers, your applications; now how do they hang together as systems? If this has been mismanaged in the past, then Tideway or someone similar will sell you some software and may even come in and perform a network discovery for a fee. However, up to a certain size, you should be able to accomplish most of the same yourself. Especially if you undertake this task as you go along: every time you add a new element to your infrastructure, or you simplify something, document it. Thoroughly!

However, systems may extend further than you initially consider.
The IBM Rational software configuration management tool, ClearCase, can have the following components: VOB servers, view servers, build servers, registry servers and licence servers. Some servers may have more than one function. However, ClearCase is dependent upon an OS for security and id management. So, in a Windows environment it is dependent on the AD, and in a UNIX environment it is dependent upon NIS, NIS+, LDAP or similar. And in a multi-platform development environment it is dependent on both. In many installations ClearCase will also be teamed with ClearQuest, IBM Rational's defect tracking system, and sometimes with a requirements management system like DOORS or RequisitePro. These integrations and others then extend the system to database and web servers, as well as client programs running on engineers' desktops. All these interconnections have to be documented.

Networking Information
The company I work for has been allocated 8 Class B and a further 41 Class C networks.
I do not say that to gloat. Although,... I am aware of Google's work to popularise IPv6, where the main finding was that if each network node had its own unique address everything became easier to address. Well, with sufficient IPv4 IP addresses you can still do that!
With that many IP addresses, the use and disposition of those networks and the addresses within them must be documented and mapped. There are any number of network management tools available; three FOSS choices are Nagios, Cacti and Zenoss, and there are many others. If you want to pay, you also have any number of choices.

The previous tools are mostly about performance and alerting. There is also a requirement for documenting the network architecture, the mapping of subnets to sites, and the actual use of IPs within those subnets. Again there is a choice of paid and "free" software on offer. VitalQIP is a very solid piece of software, but requires management and oversight. I have heard of it being used as a mechanism for enabling a helpdesk to allocate static IP addresses, "freeing up valuable resources for other tasks". A free alternative might be IPPLAN, but there are many choices. If you are running Microsoft Active Directory, then you will have subnet-to-site mapping information within the Sites and Services section of the directory.

It is an historical curiosity of the company that the team that controlled the EMEA and APR regions arranged that the routers on all the subnets were always on IP address .20, i.e. 10.10.10.20, 10.10.11.20, etc. Whereas in the US, it was always IP address .1, i.e. 10.10.10.1, 10.10.11.1, etc. Whilst recovering from a disaster, such decisions can be revisited. Standardisation of this kind of detail across an enterprise is always to your advantage.

OK, that's enough for now.

In Part 2, I'll cover some of the more "soft" requirements.

In Part 3, I wrap up considering the human element and make some recommendations.

Thursday, February 11, 2010

Yesterday, I upgraded to the latest version of VirtualBox - it's my favourite virtualisation product for the laptop/desktop market. I like to stay up to date.

Suddenly the shared folders weren't!

The virtual network was still working, so that wasn't the problem.

A quick

lsmod | grep vboxsf

produced no result. But entering

modprobe vboxvfs

returned the error message:

FATAL: Module vboxvfs not found.

Aha! Something I could use. I reckoned that something had happened to the VirtualBox guest additions. Which doesn't completely make sense: I could see that they might have become out of date, but not that they would "disappear". Let's try re-adding those additions.

I selected Install Guest Additions from the Devices menu and a CD-ROM image appeared on the VM desktop. A quick

Monday, February 8, 2010

I've been adding some additional servers into my Zenoss system. As most of my Linux boxes are running CentOS Linux, adding them in is no problem.

But I have one grml system. Why different? I had downloaded an appliance from the vmware community site. I needed a bugzilla system - there was one ready built - it seemed like a good idea at the time!

So,...
I had checked that snmp was installed on the system.
I had edited the /etc/snmp/snmpd.conf file to include the settings I wanted.
I had started the snmpd service.

But I couldn't perform an snmpwalk from my zenoss server and couldn't add it as a device into Zenoss!

Some digging with netstat revealed that port 161 was only listening on localhost/127.0.0.1! Which was confusing, because I hadn't configured it to be like that.

Where was it picking that up from?

A quick google found these two sites, HowToForge & DebianWiki, with essentially the same information.

There are some default settings in /etc/default/snmpd, including the value of the environment variable SNMPDOPTS.
Out of the box, it is set to
SNMPDOPTS='-Lsd -Lf /dev/null -u snmp -I -smux -p /var/run/snmpd.pid 127.0.0.1'
which I changed to
SNMPDOPTS='-Lsd -Lf /dev/null -u snmp -I -smux -p /var/run/snmpd.pid'

I restarted the snmpd service and checked the result from netstat -an
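For reference, the checks looked something like this (the host name and community string below are placeholders for whatever your snmpd.conf defines):

```shell
service snmpd restart
# snmpd should now be bound to 0.0.0.0:161, not 127.0.0.1:161
netstat -an | grep ':161 '
# ...and then from the zenoss server:
snmpwalk -v 2c -c public grml-box.example.com system
```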

Huzzah!

I checked the snmpwalk and saw screeds of stuff fly up my console. And lastly, I successfully added the box into zenoss.

Sunday, February 7, 2010

By default, when you create a Virtual Center Server (vCenter Server) installation, the local Administrators group of the PC you install on becomes the Administrators group of the application. At my work, the result of that is that the AD Domain Admins group is the administrators group.

But that doesn't have to be the case

The following is an extract from a VMware resource on Virtual Center Server permissions.
• VirtualCenter runs as a user that requires local administrator privilege and must be installed by a local administrative user. However, to limit the scope of administrative access, avoid using the Windows Administrator user to operate VirtualCenter after you install it. Instead, use a dedicated VirtualCenter administrator account. To do so, take the following steps:
1. Create an ordinary user account that will be used to manage VirtualCenter, for example, the VI Admin user. Make sure that this user does not belong to any local groups, such as Users or Administrators. This precaution ensures that any future role assignments involving a local group does not inadvertently affect this account.
2. In VirtualCenter, log on as the Windows Administrator, then grant the role of Administrator (that is, the global VirtualCenter administrator) to the newly created account on the top-level Hosts and Clusters folder.
3. Log out of VirtualCenter, then make sure you can log in to VirtualCenter as the new user and that this user is able to perform all tasks available to a VirtualCenter administrator.
4. Remove the permissions in VirtualCenter for the local Administrators group.
By configuring accounts in this way, you avoid automatically giving administrative access to domain administrators, who typically belong to the local Administrators group. You also provide a way of getting into VirtualCenter when the domain controller is down, because the local VirtualCenter administrator account does not require remote authentication.

There certainly can be a benefit to adding some local users to the administrators group on the Virtual Center Server - which is what we had originally done. If something happens to the AD, you will still be able to access the VMware environment. (Although, hopefully, some thought will be given to recovering AD.) Those accounts were then automatically inherited as Administrators. However, it is just as easy to create another group within the domain and then grant that group the Administrator role from the top level down - ensuring that the permission propagates down to child nodes.

I almost called this post "Deskilling in IT #2", but I've changed my mind. My previous blog about deskilling dealt more with my change in attitude to the way to build and deploy customised servers. With this post, it is really about the manner in which the web interface enables complex server configuration without knowledge of the underlying command-line options.

To set the scene, I had an rPath MediaWiki appliance which was configured to provide 4 separate wikis. The conary package management system had stopped working some time ago - possibly something I did - and every now and again it was filling up the / partition (the only partition). I couldn't add SNMP to it - well, not easily. So remote management was fraught!

To resolve the issues, I built a new CentOS Linux v5.4 server with Apache and MySQL installed; configured SNMP so it could be added into our Zenoss monitoring system; copied the last night's mysqldump backup files across to the new server; copied the webpages across; copied the appropriate .conf files to /etc/httpd/conf.d/; installed webmin.
Through webmin, I then:

created new databases for each of the wikis;

added the individual database users for the wikis;

added the database access rules for each of the database users

loaded the tables and data from the previous night's backup

started the Apache service

All that took only a few minutes. And it was up and working.
Obviously starting the Apache service was simply a "service httpd start" with possibly a "chkconfig httpd on" beforehand.
The database actions are not beyond my wit to discover, either by reading the documentation on the MySQL website or indeed in terms of going through the source code for webmin and the mysql module. But with webmin, I did not have to either look them up online or in my notebook.
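For the record, what webmin did on my behalf boils down to something like the following (the database, user, password and dump-file names are made up; repeat per wiki):

```shell
mysql -u root -p -e "CREATE DATABASE wiki_docs;"
mysql -u root -p -e "GRANT ALL ON wiki_docs.* TO 'wikiuser'@'localhost' IDENTIFIED BY 'secret';"
mysql -u root -p wiki_docs < wiki_docs-dump.sql
service httpd start
```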

Perhaps the only thing that seemed harder to configure through webmin is the iptables-based firewall. I much prefer to just edit the /etc/sysconfig/iptables file and enter "service iptables restart". Maybe that is because the edits I make are generally pretty simple, involving only copying and pasting lines and then changing the port number.
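By way of illustration, the sort of edit I mean: duplicate an existing ACCEPT line in /etc/sysconfig/iptables and change only the port (the chain name below is the stock CentOS 5 one; 10000 is webmin's port).

```shell
# Lines as they appear in /etc/sysconfig/iptables - the second is a
# copy of the first with only the port changed:
#   -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
#   -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 10000 -j ACCEPT
# Then apply the change (needs root):
service iptables restart
```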

Webmin makes it easier to administer many parts of *nix systems if you already know what you are doing. If you do not know what you are doing, it is just as difficult as previously.

I like webmin's facility to create links from one webmin installation to one or more other installations. My Zenoss server is also an SSH jump server to other *nix systems, particularly those in the local DMZ. Making it also a console for the webmin consoles is just a logical next step. Although, to increase security even further, I'm considering tunnelling the access to the webmin port 10000 through SSH.
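The tunnel itself would be a one-liner (the host name here is hypothetical): forward a local port to the jump server's webmin port, then point the browser at the local end.

```shell
# Forward local port 10000 to webmin on the jump server...
ssh -L 10000:localhost:10000 admin@jumpserver.example.com
# ...then browse to https://localhost:10000/ on the local machine.
```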