Archives of the TeradataForum

Message Posted: Wed, 06 Jun 2001 @ 14:59:42 GMT

What I've done is more along the lines of a Canary job: Its primary function is to help spot system problems and, secondarily, it gives a
consistent query to help keep an eye on system performance. Like a miner's canary, I want it to be sensitive to its environment and die
when things go wrong. Although this note is a bit long, maybe you'll get some ideas.

Whether implemented in MVS or Unix, the canary job really consists of two jobs submitted (with the production scheduler) 10 minutes
apart. This combination can be submitted as often as your environment needs (maybe every 15 or 30 minutes).

The first job submitted is the canary program (which, even in worst case situations, should take no longer than a couple of minutes).
The second job tests to see if the first job is still running. If it is, then it's assumed that the first job has hung and an alert is
issued to the operator, an e-mail is sent to a distribution list and a page is sent to the on-call pager - again, it all depends on your
needs.

At the heart of the canary job is a manually created table populated with known data. The idea is that during the canary job, a target
table is created and then the source table is copied into the target table. The source table consists of two integer columns,
'from_proc' and 'to_proc', with a unique primary index on 'from_proc'. The target table consists of the same two columns, but with a
unique primary index on 'to_proc'. What is being done is to copy a known table to another known table, forcing a re-distribution of the data.
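
The two tables might look like this sketch (the table names are illustrative; the post doesn't give actual DDL):

```sql
-- Hypothetical names; only the column names and index choices come from the post.
CREATE TABLE canary_src (
    from_proc INTEGER NOT NULL,
    to_proc   INTEGER NOT NULL
) UNIQUE PRIMARY INDEX (from_proc);

CREATE TABLE canary_tgt (
    from_proc INTEGER NOT NULL,
    to_proc   INTEGER NOT NULL
) UNIQUE PRIMARY INDEX (to_proc);
```

Because the two tables hash on different columns, the INSERT/SELECT between them cannot be AMP-local and must redistribute every row.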

My own experience is that simply causing a full table scan (FTS) only reflects disk I/O performance and not broader system issues.
Causing data re-distribution makes the canary job more sensitive to degradation problems.

The source table was manually created and populated with test data. At a minimum, there is one row per Vproc (or AMP). Each row, in the
'from_proc' column, is identified with a sequential value. In the 'to_proc' column is the inverse of the 'from_proc' column (using a 4 AMP
machine as an example):

from_proc   to_proc
---------   -------
        1         4
        2         3
        3         2
        4         1

The duration of the copy is controlled by the number of rows-per-AMP that are contained in the source table. So to make the copy run
longer, I'll have a hundred rows per AMP (or a total of 400 rows for my 4 AMP example). The longer the copy takes, the more it will reflect
system performance - of course, you don't want it to run too long.
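
Seeding the source table for the 4-AMP, 100-rows-per-AMP example above could be done with something like this sketch (canary_src is a hypothetical name, and sys_calendar.CALENDAR is used here only as a convenient row generator):

```sql
-- 4 AMPs x 100 rows/AMP = 400 rows; to_proc is the inverse of from_proc.
INSERT INTO canary_src (from_proc, to_proc)
SELECT day_of_calendar,
       401 - day_of_calendar
FROM   sys_calendar.CALENDAR
WHERE  day_of_calendar BETWEEN 1 AND 400;
```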

In that same vein, I also run the copy at $L priority. Remember, the purpose isn't to have the canary job run as fast as possible.
Instead, you want it to run for a reasonable amount of time and reflect as much of the system's performance as possible.

So the actual Canary job stream has the following steps (using BTEQ), each with its own unique Teradata logon:

1. Logon ('USER1') and drop/create the target table. If the system is having problems related to dictionary locking, it generally shows
up in this step.

Just a note: Whether you want to do Step #1 or not depends a lot on your environment. If the system workload includes a lot of
drop/creates, then you might find that Step #1 causes more conflicts than it's worth. If you choose not to drop/create the target table,
you'll still need to delete its contents.
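
In BTEQ, Step #1 might look like this sketch (the tdpid, userid, password, and table name are placeholders):

```sql
.LOGON tdpid/USER1,password
DROP TABLE canary_tgt;
CREATE TABLE canary_tgt (
    from_proc INTEGER NOT NULL,
    to_proc   INTEGER NOT NULL
) UNIQUE PRIMARY INDEX (to_proc);
.LOGOFF
```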

2. Perform a logon ('USER2') for each PEP in a host group and then do a logoff:

I've included this step because of experiences where PEPs have failed without causing a restart. In those cases, most queries would go
through the system because of the distribution of logons across the PEPs within a host group. However, every logon/query that went to the
dead PEP would simply hang. One purpose of this step is to try to find that situation. Another purpose is to make sure that logon/logoff
times are consistent across the various PEPs and across your various host groups (depending on how thorough you want to be, you'll need
one canary job per host group).

Although these aren't frequent problems, they happen often enough to be worth monitoring. More importantly, when one does happen, it's
difficult to analyze if you don't have something in place to help identify the situation.
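
A sketch of Step #2 in BTEQ, relying on the host group's session balancing to spread successive logons across the PEPs (names and passwords are placeholders):

```sql
-- One .LOGON/.LOGOFF pair per PEP in the host group; a pair that hangs
-- points at a dead PEP. Repeat the pair as many times as you have PEPs.
.LOGON tdpid/USER2,password
.LOGOFF
.LOGON tdpid/USER2,password
.LOGOFF
```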

3. Logon ('USER3') and perform the INSERT/SELECT from the source table to the target table. Logoff and then end the job stream.

The duration of 'USER3' can be plotted and should give a pretty good feel for overall system performance.
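
A sketch of Step #3, with the $L priority requested through the logon account string (the account string shown is a placeholder for your site's low-priority account):

```sql
-- The timed copy: a full redistribution of canary_src into canary_tgt.
.LOGON tdpid/USER3,password,'$L_CANARY'
INSERT INTO canary_tgt
SELECT from_proc, to_proc
FROM   canary_src;
.LOGOFF
.QUIT
```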

If you have other things that you would like to have done, then add them as additional steps (again, with their own userids).

Remember that the second job stream will start at some point after the first has started. If the first job stream is still present when
the second job begins, then the first job has most likely hung and you need to do some investigation.

By using a unique userid for each step, you don't need a specific scheme to collect start/stop times - they're already contained in
DBC.EventLog. The contents of DBC.EventLog can be rolled up and kept in a history table.
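
Pulling the canary timings out of DBC.EventLog might look like this sketch (the column names here are assumptions - they vary by Teradata release, so check the DBC documentation for yours):

```sql
-- Logon/logoff events for the canary userids; roll these up into history.
SELECT  UserName,
        Event,
        LogDate,
        LogTime
FROM    DBC.EventLog
WHERE   UserName IN ('USER1', 'USER2', 'USER3')
ORDER BY LogDate, LogTime;
```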

Finally, remember that you'll have to change this Canary job whenever your configuration changes (i.e., new nodes or changes to the
host group).