How Dave Does It: Taking a Pulse on your GroupWise System

As a GroupWise administrator, you may be asked questions by management or clients that, on the surface, do not seem all that important. Sometimes it's hard to see the "hidden meaning" behind these questions, but the answers often help managers and IT directors project future costs and project timelines. Other times the answers provide detailed information used to design network infrastructure upgrades or to plan system and human resources. And sometimes the questions come at a critical time, when identifying the root cause or historical origin of a particular problem is crucial. Have you ever been asked any of these questions:

"How many times has this problem happened before?"

"How long does it take to identify and resolve problems?"

"What is the actual uptime of the GroupWise system?"

"What is the busiest Post Office?"

"How many more users can we add to a specific Post Office?"

"How much disk space will be needed next year for GroupWise?"

If so, did you have the answers? Having these answers, and even more data, can really help GroupWise administrators keep a better "pulse" on the GroupWise system. Identifying problems and patterns, and projecting growth requirements and the need for additional infrastructure, can all be factors in creating an extremely reliable and robust GroupWise environment, along with bolstering your career. After all, avoiding end-user impact is part of your job as an administrator - right? So what can you do to take the system to the "next level"? What should you be watching and tracking that can help you with problem resolution and avoidance? Well, I have some things that I recommend you keep track of, and some ideas on how to use the information to the benefit of "everyone" - end users, support staff, management and yourself.

Visit your doctor regularly

DEFINITION - Routine Checkup: To check the condition or health of the body's systems and organs, a doctor conducts a physical examination at each regular checkup. If the examination turns up a symptom, the doctor concentrates there, which makes the disease easier to trace.

In comparison, the same methodology we use for our own physical and mental health can be applied to supporting a GroupWise system. Reviewing the various parts of the GroupWise system (POAs, MTAs, Gateways, servers, and other integrated components) periodically will allow you to identify patterns, both good and bad, that can be analyzed to identify ways to increase successes and reduce failures. Of course, there is no hard and fast rule for the number of times the entire system should be reviewed, but in most cases this should be at least twice a year. If you can find the time to run through the processes covered throughout the rest of this article more frequently, you will most certainly gain a more detailed understanding of what is going on in your environment.

Let me provide you with some examples based on my own experience. By using these processes, some obvious problems were avoided, and some less obvious issues were identified and dealt with accordingly:

End-user impact was avoided because a number of GroupWise volumes were expanded before they ran out of disk space.

Five servers were removed from the environment because usage of the services they provided had dropped off.

Accurate timings were identified for servers/services, indicating when outages/changes would be least impactful to end-users.

Resource needs for server hardware, disk space and infrastructure can be predicted a year ahead of time - which assists companies in financial planning.

New project and technology needs have been identified and addressed to increase the overall reliability of the system.

So do any of these items sound like something that would help you or your organization run a more efficient GroupWise system? I would think the majority of administrators would consider these key factors in managing GroupWise - after all, these items range from cost savings in many facets (fewer servers = saved administration/support time) through more positive end-user experiences (avoiding outages due to problems, and scheduling maintenance/upgrades during appropriate times).

Your Medical History

To keep your own physical and mental health in check, you need relevant data over a period of time to accurately see trends and abnormalities. As I mentioned before, the same applies to your GroupWise system. The first two components required here revolve around capturing that history. Novell provides one such component "out of the box" with GroupWise: GroupWise Monitor. This application has evolved from something that was once very complex and painful to work with into an extremely useful tool - in so many ways. If you haven't already installed and configured it, I recommend that you stop reading now and go install it - come back to this article later. If you've already got it installed, tune the threshold alerts and configure the notifications until you're no longer bombarded with alerts, so that the alerts you do receive are useful. Once you have them tuned, these notifications are a very easy way to start capturing the "medical history" that you need for future data compilation and analysis. Save each alert that you receive. Over time you'll start to see patterns that are easy to identify and that prove very useful. Below is a general recommendation for thresholds, if you're unfamiliar with the product.

I actually take alert retention a step further: I keep the alert data in a spreadsheet (agent name, alert description and date/time), and I've added additional information such as root cause, whether there was an outage and whether there was end-user impact.
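
For anyone who would rather keep this log in a plain CSV file than in a spreadsheet application, the following is a minimal Python sketch of the same record layout. The column names, the AlertRecord class and the agent name in the usage note are assumptions made up for illustration; nothing here is produced by GroupWise Monitor itself.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields
from datetime import datetime


@dataclass
class AlertRecord:
    """One saved alert plus the follow-up notes described above (hypothetical layout)."""
    agent_name: str
    description: str
    received: datetime
    root_cause: str = ""
    outage: bool = False
    user_impact: bool = False


def write_alerts(records, stream):
    """Write alert records to an open text stream as CSV, header first."""
    names = [f.name for f in fields(AlertRecord)]
    writer = csv.DictWriter(stream, fieldnames=names)
    writer.writeheader()
    for rec in records:
        row = asdict(rec)
        # Store the timestamp in a sortable text form.
        row["received"] = rec.received.isoformat(sep=" ")
        writer.writerow(row)


def read_alerts(stream):
    """Read records back from CSV text written by write_alerts."""
    records = []
    for row in csv.DictReader(stream):
        records.append(AlertRecord(
            agent_name=row["agent_name"],
            description=row["description"],
            received=datetime.fromisoformat(row["received"]),
            root_cause=row["root_cause"],
            outage=row["outage"] == "True",
            user_impact=row["user_impact"] == "True",
        ))
    return records
```

To use it, open a file with open("alerts.csv", "w", newline="") and pass the file object as the stream; the file name is, again, only an example.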

The second component to capturing that history also utilizes the GroupWise Monitor as part of a larger process. This portion is a little more advanced, and you may need to come up with something creative on your own to accomplish the same thing. I use a Visual Basic script that runs at scheduled intervals (via the Microsoft scheduler), grabs all of the data displayed on the agent screen of the GroupWise Monitor (for each agent) and drops that data into a database for future analysis. This process executes every hour, providing me with 24 snapshots each day for each agent - sure, that's a lot of data, but it's really useful information to have. Because I have all of this data, at any given time I can run a query on a multitude of criteria: agent type, agent name, date/time, user connections, up-time, and the list goes on and on.
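
The storage half of such a process can be sketched in a few lines of Python with the standard sqlite3 module. This is not the Visual Basic script described above, only an illustration under stated assumptions: the table layout, the column names and the agent names are hypothetical, and the step that actually reads the values from the Monitor agent screen is left out because it depends on your environment.

```python
import sqlite3
from datetime import datetime

# Hypothetical table: one row per agent per hourly snapshot.
SCHEMA = """
CREATE TABLE IF NOT EXISTS agent_snapshot (
    taken_at     TEXT NOT NULL,
    agent_name   TEXT NOT NULL,
    agent_type   TEXT NOT NULL,
    connections  INTEGER,
    uptime_hours REAL,
    free_disk_mb REAL,
    PRIMARY KEY (taken_at, agent_name)
)
"""


def open_store(path=":memory:"):
    """Open (or create) the snapshot database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_snapshot(conn, taken_at, agent_name, agent_type,
                  connections, uptime_hours, free_disk_mb):
    """Store one snapshot; a repeat for the same hour and agent replaces the old row."""
    conn.execute(
        "INSERT OR REPLACE INTO agent_snapshot VALUES (?, ?, ?, ?, ?, ?)",
        (taken_at.isoformat(sep=" "), agent_name, agent_type,
         connections, uptime_hours, free_disk_mb),
    )
    conn.commit()


def peak_connections(conn, agent_type="POA"):
    """Return (agent_name, highest connection count) pairs, busiest first."""
    cur = conn.execute(
        "SELECT agent_name, MAX(connections) AS peak FROM agent_snapshot "
        "WHERE agent_type = ? GROUP BY agent_name "
        "ORDER BY peak DESC, agent_name",
        (agent_type,),
    )
    return cur.fetchall()
```

With a table like this in place, a question such as "What is the busiest Post Office?" becomes a single query, which is what peak_connections illustrates.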

NOTE: I should mention that there are third-party products out there that can assist with capturing this same data. At the time I was working to implement a solution, it was beneficial for me to have something "in-house" that could be customized completely and enhanced by internal programmers, which has really taken what I started to the "next level".

The Examination

Every now and then you have to visit your doctor to make sure you don't need your oil changed or spark plugs replaced - wait! Wrong analogy!

As I've mentioned, periodically you need to check things out - the overall health check of your GroupWise system is important and shouldn't be put off. Schedule the time in advance and plan for it just as you would anything else.

The first place to start your examination is with the data captured over time, but you need to understand this is only a small portion of what needs to be done. The overall health check on a system can take anywhere from a few days to a couple of weeks depending on the size of the system. As an example, this process in a large system can take two people an entire week to complete. So let's get started -

I suggest you begin by breaking down the threshold alerts received from the GroupWise Monitor into something that you can use. I say this because, depending on how you have tuned your thresholds, you could have a few thousand alerts over a six-month period. Try sorting them in different orders - by agent, by time, by date. If you've gone through the extra steps of keeping a spreadsheet, sort the alerts on root cause - that's a great way to identify problems and measure the impact they are having on the GroupWise system.
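
If the alert data lives somewhere a script can read it, the same sorting and counting can be sketched in Python. The dictionary keys ("agent", "received", "root_cause") are assumed names for the spreadsheet columns, not anything defined by GroupWise Monitor.

```python
from collections import Counter


def count_by(alerts, key):
    """Count alerts per distinct value of one column, most frequent first."""
    return Counter(alert[key] for alert in alerts).most_common()


def sort_alerts(alerts, *keys):
    """Return the alerts sorted by one or more columns, in the order given."""
    return sorted(alerts, key=lambda alert: tuple(alert[k] for k in keys))


# Hypothetical rows standing in for a saved alert spreadsheet.
alerts = [
    {"agent": "PO1-POA", "received": "2006-03-02 08:00", "root_cause": "disk space"},
    {"agent": "PO2-POA", "received": "2006-03-01 09:00", "root_cause": "disk space"},
    {"agent": "PO1-POA", "received": "2006-03-01 07:00", "root_cause": "network"},
]

for cause, total in count_by(alerts, "root_cause"):
    print(cause, total)
```

Counting by root cause gives a quick ranking of what is hurting the system most; sorting by agent and then by time makes repeat offenders easy to spot.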

I then suggest you review your more detailed historical data. Review and perform trending analysis for things like:

User connections - this can tell you which Post Offices are most heavily used and during what times of day.

Disk space (free or available) - this can be tracked to explain anomalies in performance and/or predict future requirements.

Log file configuration - size and date settings can be captured by this type of tool, and making sure you can always go back "x" number of days is really handy.

Agent versions - identifying a POA that is running older or newer versions of code can lead you in many directions: fix it to standardize, identify it as a cause for problems, or compare it to other versions having more or fewer problems.

WebAccess peak users - this identifies times of heaviest usage and helps determine system requirements.

All of this data allows you to verify that your system is configured and functioning the way you think it should be, or to identify items that need to be adjusted. In the end, some really meaningful data analysis can be done, providing the ability to produce some great reports - which is, of course, a secondary benefit - but can be useful when showing management an accurate depiction of the system.
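
As one example of trending, a question like "How much disk space will be needed next year for GroupWise?" can be approximated by fitting a straight line through the disk usage samples and extending it. The Python sketch below assumes growth is roughly linear, which may not hold for your system, and the function names are made up for this illustration.

```python
from datetime import date


def fit_line(points):
    """Least-squares straight line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def project_used_mb(samples, target):
    """Project used disk space (MB) on a target date.

    samples is a list of (date, used_mb) pairs taken from the history.
    """
    start = min(day for day, _ in samples)
    points = [((day - start).days, used) for day, used in samples]
    slope, intercept = fit_line(points)
    return slope * (target - start).days + intercept
```

Feeding it one sample per week or month from the snapshot history and a target date a year out gives a first estimate; treat the result as a starting point for a conversation about budget, not as a guarantee.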

The Bloodwork

So, after all of that you may think that you're done - but you're just getting started. The second portion of health checks is a detailed analysis of the remaining elements of the system.

Config Analysis:
Assuming you're running NetWare servers, capturing the configuration of each server and comparing it over time can be useful. Even the most proficient engineers can make mistakes when editing files and configurations. I keep a Config.txt for each server and compare it each time I go through this process (with an automated tool, of course - try the NetWare Config Analyzer) to see what code and what startup files have been modified. If I see changes, I check into why they were made.
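
Where a dedicated comparison tool is not at hand, the basic idea of comparing two saved configuration reports can be sketched with Python's standard difflib module. This is only an illustration of the comparison step, not a replacement for the tool mentioned above; the labels and the sample settings are invented for the example.

```python
import difflib


def config_changes(old_text, new_text, old_label="previous", new_label="current"):
    """Return unified-diff lines between two saved configuration reports."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    ))


# Hypothetical report contents from two health checks.
previous = "setting_a = 1\nsetting_b = on\n"
current = "setting_a = 2\nsetting_b = on\n"

for line in config_changes(previous, current):
    print(line)
```

An empty result means the two reports are identical; lines starting with "-" or "+" show what was removed or added since the last check.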

GroupWise Startup Files:
Audit these every now and then. You may even want to keep a copy of them elsewhere. Make sure no switches have been turned on or off in a way that conflicts with your needs. Keep in mind that these switches override anything set in ConsoleOne.
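
If you do keep a reference copy, the audit can be reduced to comparing the active lines of the two files. The Python sketch below assumes that a line beginning with a semicolon is a comment (an assumption you should confirm against your own startup files), and the switch names in the usage note are placeholders, not real GroupWise switches.

```python
def active_lines(text, comment_prefix=";"):
    """Return the set of non-blank lines that are not commented out.

    Assumption: a line whose first non-space characters are comment_prefix
    is a comment. Adjust this to match your own startup files.
    """
    lines = set()
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith(comment_prefix):
            lines.add(line)
    return lines


def audit_startup(reference_text, current_text):
    """Report switches enabled or disabled relative to the reference copy."""
    reference = active_lines(reference_text)
    current = active_lines(current_text)
    return {
        "added": sorted(current - reference),
        "removed": sorted(reference - current),
    }
```

For example, audit_startup("/example-a\n; /example-b\n", "/example-a\n/example-b\n") reports "/example-b" as added, because it was commented out in the reference copy and is active in the current file.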

Backup Verification:
Make sure backup jobs are being run against the proper volumes. You should also check to see that the backups are as error free as possible. If you're using the GWTSA you should make sure that the startup files are configured properly (and you may want a copy of those as well).

Agent Log Files:
This one is where it starts to hurt - review all of your agent logs manually (or maybe you can come up with a nice way to automate this process) and look for common occurrences of errors. An easy way to do this is to use the HTTP interface on the POA and go to the log files. From that point you can select all of the log files and perform two separate searches: one on the word "problem" and another on the word "error". Review each of these sets of results - after you're done you'll be very proficient in Novell's error codes (D124, D11B, D019, D023, D715, D020, D126, 820E, D05A, 8F07 and so on).
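
For log files that have been copied somewhere a script can read them, the two searches and a tally of error codes can be sketched in Python. The sample lines in the usage note are invented and do not reflect the real log layout, and the pattern only catches four-character codes containing at least one letter A-F (such as D124 or 820E), so all-numeric codes would be missed by this sketch.

```python
import re
from collections import Counter

# The two search words suggested above, matched without regard to case.
KEYWORDS = re.compile(r"error|problem", re.IGNORECASE)

# Assumed shape of an error code: a stand-alone token of four uppercase
# hex characters, at least one of which is a letter.
CODE = re.compile(r"\b(?=[0-9A-F]{0,3}[A-F])[0-9A-F]{4}\b")


def scan_log(lines):
    """Return (matching lines, Counter of error codes seen on those lines)."""
    hits = []
    codes = Counter()
    for line in lines:
        if KEYWORDS.search(line):
            hits.append(line.rstrip("\n"))
            codes.update(CODE.findall(line))
    return hits, codes
```

Calling scan_log on an open log file (file objects iterate line by line) returns the lines worth reading plus a count per code; codes.most_common() then lists the most frequent codes first, which is a reasonable place to start the manual review.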

Re-index the Entire PO:
Now this is something I learned from one of "The Masters at Novell". It's a really useful way to have the POA clean up your Post Office directories. By nature GroupWise stores index files under the post\ofuser\index directory. These files oftentimes remain out there for an extended period of time - beyond their usefulness. Re-indexing the Post Office removes the older files, which frees up disk space. In order to accomplish this process, you need to create a second POA startup file on your GroupWise server. I've listed the switches that should be used below. In short, this process loads a second POA on the server that can only process the index request. What you need to do is this:

Create a second POA startup file; call it something like index.poa. This file should have all of the standard options removed (commented out) except for the following lines:

Once you've got the file on your GroupWise server, you can load it as you would load any other GroupWise POA (load GWPOA @index.poa)

IMPORTANT: There are two items that you should be aware of when doing this. First and foremost: do not proceed with this next step during production hours - do this just before your last users are going home for the weekend or evening. From the POA screen execute a Control + Q - this will initiate the Quickfinder process, which will run to completion or until the second POA is unloaded. I typically execute this command in the evening and check it the following morning - to find it completed.

The second item that you should be aware of is that in some cases this will cause a backup application to view all of the files in the offiles directory as "new" or "modified" and cause your backup jobs to run longer than a standard incremental.

Once you've identified that the Quickfinder processes have finished, all you have left to do is unload those "index POAs" and you're finished. Hopefully your backups are finished too.

Summary

I'm hoping that this article has at least inspired you to go through your system to some degree. If and when you do, maybe it will help you avoid a problem or two and benefit your end users. Maybe it will even give you a new opportunity to learn more about GroupWise and the system that you support.