As a follow-up to the recent post regarding the investment in the platforms, I want to provide you with some more details about the Mail Storage Upgrade that we are currently working on.

As some of you will be only too aware, we have recently purchased two Sun 5310 head units, each with three trays of disks. With the return of the unit from the data recovery company, we are now pushing ahead with the implementation of this new platform to replace the NetApp system we have been using for mail storage.

I wanted to keep you all informed of the specific plans for how we will be implementing the new platform, so here is the plan as it stands now:

Tomorrow morning we will be making some changes to the network in our Internet House and Fieldhouse Way facilities to optimize the connectivity between the devices. This will consist of the following changes:

1) Building a specific VLAN for the replication of data between the two sites.
2) Increasing the availability of the servers by bonding the network interfaces for fail-over rather than throughput.
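
For illustration, fail-over bonding of the sort described in (2) is usually configured as "active-backup" mode on Linux. The fragment below is a rough sketch only; the interface names, file paths and driver options are assumptions, not our actual configuration:

```text
# /etc/modprobe.conf (sketch): load the bonding driver in
# active-backup mode, i.e. one NIC carries the traffic and the
# other takes over on link failure; miimon=100 polls link state
# every 100 ms.
alias bond0 bonding
options bonding mode=active-backup miimon=100

# /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise for
# eth1): both physical interfaces are enslaved to bond0.
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```

Active-backup gives no extra throughput, but the server keeps its network connection if a cable, switch port or NIC fails, which is exactly the trade-off described above.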

Following the successful completion of that work we are going to set up mirroring between the two sites. This should be completed tomorrow afternoon.

If you recall, when we were implementing the system before, we ran into a problem with the responsiveness of the servers. We believe this issue was resolved by upgrading to the latest stable release of the firmware and increasing the number of NFS threads the system could use. However, we want to be completely sure that this is the case before we migrate any live customer data onto the platform. So, starting on Friday morning, we are going to run some stress tests on the boxes. The tests will consist of the following steps:

- Using several servers, write lots of small files to the NAS.
- Increase the number of files being written over time, to levels that reflect how the system would be used if there was live data on the platform.
- Monitor the performance of the NAS.
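
As a rough illustration of the test above, the sketch below writes many small files from concurrent workers and tracks the worst write latency. It writes to a local temporary directory standing in for the NAS mount, and the worker and file counts are purely illustrative:

```python
import os
import tempfile
import threading
import time

def writer(root, worker_id, n_files, size=2048):
    """Write n_files small files, recording the latency of each write."""
    latencies = []
    payload = b"x" * size
    for i in range(n_files):
        path = os.path.join(root, f"worker{worker_id}-{i:05d}.msg")
        start = time.monotonic()
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the write through to disk/NAS
        latencies.append(time.monotonic() - start)
    return latencies

def stress(root, n_workers=4, n_files=50):
    """Run several concurrent writers and report the worst latency seen."""
    results = []
    def run(wid):
        results.extend(writer(root, wid, n_files))
    threads = [threading.Thread(target=run, args=(w,)) for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return max(results)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        worst = stress(root)
        print(f"worst write latency: {worst * 1000:.1f} ms")
```

In the real test the worker count would be ramped up over time while watching NFS thread usage and response times on the NAS itself.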

If we see a repeat of the same issue, we will revisit the current problem we have open with Sun, update it and begin working with them to resolve the issue.

As long as we are comfortable that the performance of the platform is as good as we expect, we will submit a change control request for Director-level sign-off so that we can begin the migration of live data to the platform. If all goes to plan, we will commence that migration on Tuesday morning of next week (22nd August).

Initially the migration will start with a very small number of test users. None of the original data on the NetApps will be deleted, so there is no risk of data loss. As long as we are still happy that things are running smoothly, we will start moving users over in earnest. We intend to have one migration process running for each VISP, and expect the migration to take around two weeks to complete.
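
To give a feel for the per-VISP approach, here is a minimal sketch. The function names, directory layout and copy mechanism are illustrative rather than our actual migration tooling; note that the source mailboxes are only ever read, never deleted:

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def migrate_visp(src_root: Path, dst_root: Path, visp: str) -> int:
    """Copy one VISP's mailboxes to the new platform, leaving the
    source untouched so nothing is lost if we have to roll back."""
    copied = 0
    for mailbox in (src_root / visp).rglob("*"):
        if mailbox.is_file():
            target = dst_root / visp / mailbox.relative_to(src_root / visp)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(mailbox, target)  # copy, preserving timestamps
            copied += 1
    return copied

def migrate_all(src_root: Path, dst_root: Path, visps):
    """One migration worker per VISP, run concurrently."""
    with ThreadPoolExecutor(max_workers=len(visps)) as pool:
        return dict(zip(visps, pool.map(
            lambda v: migrate_visp(src_root, dst_root, v), visps)))
```

Running one worker per VISP keeps each provider's migration independent, so a problem with one VISP's data does not hold up the others.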

It is also worth mentioning what the checkpoint policy is going to be on the new platform. We are going to have a checkpoint every 6 hours a day, 00:00; 06:00; 12:00; 00:00, and each checkpoint will be held for 23 hours, so we can roll back any change to the system over the last day.

I know some of you will be asking about a full back-up of the data, so here is the reasoning behind the system design we have gone with. With checkpoints in place we can roll back any change other than a full volume deletion within the last 23 hours. Added to this, we will have the servers mirrored across two distinct geographical locations. For a failure like the recent one to happen again, we would have to lose one of the mirrored servers and then delete the remaining volumes. What we are building here is the same setup we had on the NetApp system, which has been in place for two and a half years.
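
As a rough model of what that retention policy gives you, the sketch below works out which checkpoints are still held at a given moment, assuming checkpoints every six hours (the exact times here are illustrative) and a 23-hour hold:

```python
from datetime import datetime, timedelta

CHECKPOINT_HOURS = (0, 6, 12, 18)   # assumed schedule; times illustrative
HOLD = timedelta(hours=23)

def live_checkpoints(now: datetime):
    """Return the checkpoint timestamps still held at time `now`:
    every scheduled checkpoint taken in the last 23 hours."""
    held = []
    day = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # Look back over yesterday and today for scheduled checkpoints.
    for d in (day - timedelta(days=1), day):
        for h in CHECKPOINT_HOURS:
            cp = d + timedelta(hours=h)
            if cp <= now and now - cp <= HOLD:
                held.append(cp)
    return held
```

At any moment you therefore have three or four restore points spread over the previous day, with the oldest one up to 23 hours old.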

Backing up a mail platform as busy as ours is never going to be an easy task, given the amount of transient data and the sheer volume of data that needs copying to another location. However, there is a debate going on internally about back-ups in general: what we should back up, how often, and to what media. As that debate progresses I will share the outcomes with you.

> It is also worth mentioning what the checkpoint policy is going to be on the new platform. We are going to have a checkpoint every 6 hours a day, 00:00; 06:00; 12:00; 00:00, and each checkpoint will be held for 23 hours, so we can roll back any change to the system over the last day.

If you're going to have a checkpoint every 6 hours a day, I think you mean 00:00; 06:00; 12:00; 18:00 (not 00:00 again)?

> I know some of you will be asking about a full back-up of the data, so here is the reasoning behind the system design we have gone with. With checkpoints in place we can roll back any change other than a full volume deletion within the last 23 hours. Added to this, we will have the servers mirrored across two distinct geographical locations. For a failure like the recent one to happen again, we would have to lose one of the mirrored servers and then delete the remaining volumes. What we are building here is the same setup we had on the NetApp system, which has been in place for two and a half years.

Keeping a checkpoint for 23 hours assumes that PlusNet will spot any data corruption reasonably quickly although I appreciate more disk space would be required to keep checkpoints for longer. Presumably the system that PlusNet intends to use has been fully tested and is fully supported by Sun (wasn't there some mention of customisation to the file system that made the data recovery more difficult ... ?).

Of course we would hope that PlusNet have learnt from their last mistake so that it won't happen again. Designing a system to be able to cope with a known mistake is fine but it's how it copes with an unknown mistake that truly tests it. I assume that all aspects of the system have been thoroughly investigated by PlusNet and that each Sun server is as fault-tolerant and resilient as it can be ... ?

> Backing up a mail platform as busy as ours is never going to be an easy task, given the amount of transient data and the sheer volume of data that needs copying to another location. However, there is a debate going on internally about back-ups in general: what we should back up, how often, and to what media. As that debate progresses I will share the outcomes with you.

I still think that the cost of suitable hardware (servers, tape/alternative devices) and media is money well spent in comparison to lost/unhappy customers, bad publicity, etc. and therefore PlusNet should just do it.

Perhaps it would be worthwhile investigating how hosting/forwarding companies (like 1&1) handle their advertised daily backups?

> wasn't there some mention of customisation to the file system that made the data recovery more difficult ... ?).

As I understand it, this was Sun's own customisation to a file system which, being fairly new to market, the data recovery company hadn't seen before. I wouldn't think that PN would undertake to customise a file system on a Sun bit of kit, to be honest.

What plans does PlusNet have to ensure that each and every one of these backup checkpoints is actually usable for restoring data?

As I understand it, the email outage and loss of email data was due to an engineer accidentally formatting the disc on a live server instead of a trial server. Embarrassing, but not fatal - if you have a recent backup that you can restore. The fact that you spent so long trying to get the data recovered off the disc suggests that you didn't have such a backup.

By the way, I wasn't affected because I always download emails to my PC (using POP) instead of leaving them on the server and using IMAP. That's because I've suffered too many times in the past at work with server-held data (e.g. MS Outlook and MS Exchange Server), where you lose access to all emails (historic sent and received ones as well as new ones) if the server, or your access to it, goes down. Not being able to send and receive new emails is bad; not being able even to look at historic emails is much worse.

OIC. In which case, ahem, were Sun involved in the data recovery process, and if not, why not?

Another question for the E-mail Resiliency and Stability Board, methinks!

Simon

Sun were involved at the very beginning of the data recovery, but said they could help us no further, so recommended getting a specialist firm involved.

I think you'd better be asking Sun for your money back for the nice new bits of kit you have just got, then, and put a note about them on your internal accounting system so that they can be avoided as a supplier in future...?