processes and queue loads via a simple UC4 script I wrote. It only alerts via email if any thresholds are exceeded, like "load over last 15 minutes" or such (SYS_SERVER_ALIVE and friends ...)

I monitor key agents via a shell script and UNIX service manager (and restart them if crashed)

we monitor actual Job execution, by executing a heartbeat job periodically that writes a file with the time, which then gets verified by Nagios (because SYS_HOST_ALIVE only goes so far - we had agents hang but still report they're alive ...)
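For anyone wanting to try the heartbeat idea, here is a minimal sketch of the checking side, written as a Nagios-style check in Python. It assumes the heartbeat job simply rewrites a file on each run, so the file's modification time can stand in for the timestamp it contains; the function name, path and age limit are made up for illustration and are not the setup described above.

```python
import os
import time

# Nagios plugin exit codes (standard convention).
OK, WARNING, CRITICAL = 0, 1, 2

def check_heartbeat(path, max_age_seconds, now=None):
    """Return (exit_code, message) based on how old the heartbeat file is.

    Assumption: the heartbeat job rewrites `path` on every run, so the
    file's mtime tells us when the job last executed successfully.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # File missing or unreadable: the job never ran or the path is wrong.
        return CRITICAL, f"CRITICAL - heartbeat file {path} is missing"
    if age > max_age_seconds:
        return CRITICAL, f"CRITICAL - heartbeat is {age:.0f}s old (limit {max_age_seconds}s)"
    return OK, f"OK - heartbeat is {age:.0f}s old"
```

A wrapper script would print the message and exit with the returned code, which is what Nagios evaluates. Because the agent itself has to run the job to refresh the file, this also catches agents that hang while still reporting themselves alive.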

various additional UC4 scripts by MatthiasSchelp to alert in case of unavailable Java agents (SAP, RA)

I monitor changes to the agent list by reading agents from the DB with a shell script, and automatically comparing them against the list of the previous day (using sdiff on Linux: needed because other departments sometimes install agents without telling us, and Automic sadly does not allow full license control purely by the server, so new agents can eat licenses without the Server Admin even allowing them to - bad design!)
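The day-over-day comparison can be sketched in a few lines of Python instead of sdiff. This assumes the agent names have already been read from the DB into two plain lists; the agent names below are hypothetical.

```python
def diff_agent_lists(previous, current):
    """Return (added, removed) agent names, each as a sorted list."""
    prev, curr = set(previous), set(current)
    return sorted(curr - prev), sorted(prev - curr)

# Hypothetical agent names, standing in for yesterday's and today's DB export.
yesterday = ["UNIX01", "WIN01", "SAP01"]
today = ["UNIX01", "WIN01", "SAP01", "WIN99"]

added, removed = diff_agent_lists(yesterday, today)
if added or removed:
    # In a real script this would go into the alert mail.
    print(f"Agent list changed: added={added} removed={removed}")
```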

I monitor the various MQs with a shell script (via SQL), and alert in case of unusually high levels

another shell script monitors how many jobs each department has active, and alerts me at unusually high levels, so I can tell SAP to cut it out if they spawn 50000 jobs at once
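What counts as "unusually high" is not spelled out here, so the following is only one possible reading: flag a department when its current active job count exceeds a multiple of its own historical median. The factor, the floor and the department names are assumptions for illustration; the same pattern would work for queue levels.

```python
from statistics import median

def unusual_departments(current, history, factor=3.0, floor=100):
    """Flag departments whose active job count exceeds factor x their median.

    current: {department: active job count right now}
    history: {department: [earlier counts]}
    floor:   minimum limit, so tiny baselines do not trigger alerts
    """
    flagged = {}
    for dept, count in current.items():
        past = history.get(dept) or [0]  # no history yet: baseline of 0
        limit = max(floor, factor * median(past))
        if count > limit:
            flagged[dept] = (count, limit)
    return flagged

# Hypothetical numbers: SAP suddenly spawns 50000 jobs.
current = {"SAP": 50000, "HR": 120}
history = {"SAP": [900, 1100, 1000], "HR": [100, 110, 90]}
alerts = unusual_departments(current, history)
```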

I log (and incidentally analyze) the activation lag of jobs, i.e. time between activation and start of jobs (SQL query)

we monitor various DB parameters

(in preparation) monitoring for an unusually high number of DB deadlocks with Automic (after recent events)

probably some more shell scripts that monitor various things

I monitor the automic community via lynx, alerting me of newly found Automic issues by looking for any new posts by FrankMuffke (just kidding, I don't :p )

Hth, Carsten

p.s. monitoring is like money, old camera lenses and Battlefield 3 experience points: Amass any amount you can think of, it's still never enough.

I'm sure you know about this, but to check whether the AE and agents are alive you can use the SYS_HOST_ALIVE and SYS_SERVER_ALIVE features. In my opinion, though, the more useful approach is not to check IF the components are alive but to get a message when they are NOT running, e.g. using the EXECUTE_ON_END feature in the UC_HOSTCHAR_* variable.

Regarding whether the client is available and running, I have no idea, I have to insist

THX for the input - especially "I log (and incidentally analyze) the activation lag of jobs, i.e. time between activation and start of jobs (SQL query)"

That's very good input that I may steal from you :-) How do you differentiate between AE slowness and heavy scripting load? For example, a workflow construct with many objects (and a lot of scripting effort) in it waits especially long, versus a job that really does get generated very slowly?

select AH_OH_IDNR as OHID,
       AH_IDNR as Runid,
       AH_TIMESTAMP1 as Activation,
       AH_STATUS as Status,
       AH_HOSTDST as Destination,
       AH_TIMESTAMP2 as Launch,
       to_char(AH_TIMESTAMP2, 'DAY') as Day,
       round((AH_TIMESTAMP2 - AH_TIMESTAMP1) * 24 * 60 * 60, 0) as seconds_diff
  from AH, dual
 where AH_OH_IDNR in (select OH_IDNR
                        from OH
                       where OH_NAME like '%JOB%DC1%HEARTBEAT%'
                         and OH_NAME not like '%OLD.%'
                         and OH_NAME not like 'JOBP%')
   and AH_STATUS = 1900;
EOF

This runs from cron (use UC4? Who am I? :) ) once every week. Every once in a while, I make a really nice Excel spreadsheet from it (you need to de-duplicate the data first, because you will have overlap due to the weekly collection), and it also (I just left that in as a best practice) tells the Linux admins not to touch my sh*t :)

edit: If you look at the SQL, this only logs the times for a Job called "JOBS.DC1.HEARTBEAT". That's my monitoring job for actual agent operation that runs very often per day. Using just this often-running, but well known job as the basis for statistics also eliminates a lot of uncertainty.
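The de-duplication step before the spreadsheet could be sketched like this in Python. It assumes the weekly query results are saved as CSV with a header row, and that the column names are the upper-cased aliases from the query (RUNID, SECONDS_DIFF); both are assumptions about the export, not something stated in the post.

```python
import csv
import io
from statistics import mean

def load_unique_runs(csv_texts):
    """Merge several weekly exports, keeping exactly one row per run ID."""
    runs = {}
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            # The run ID is unique per execution, so overlapping weeks
            # simply overwrite the same key with identical data.
            runs[row["RUNID"]] = row
    return list(runs.values())

def lag_summary(rows):
    """Basic statistics over the activation lag in seconds."""
    lags = [int(r["SECONDS_DIFF"]) for r in rows]
    return {"runs": len(lags), "mean": mean(lags), "max": max(lags)}

# Two hypothetical weekly exports that overlap on run 1002.
week1 = "RUNID,SECONDS_DIFF\n1001,2\n1002,4\n"
week2 = "RUNID,SECONDS_DIFF\n1002,4\n1003,9\n"
summary = lag_summary(load_unique_runs([week1, week2]))
```

Keying on the run ID rather than on the timestamp avoids dropping two distinct runs that happen to start in the same second.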

To answer your question: Engine slowness is when all jobs on all agents are slow. But usually I see individual agents being slow, which points to network issues or heavily loaded agents.
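That rule of thumb could be turned into a rough classifier over the logged lag values, for example as below. This is only a sketch: it assumes the same heartbeat job runs on several agents, uses the destination host as the agent name, and picks an arbitrary 30 second limit for "slow". With only one agent in the data, the "engine" verdict is meaningless.

```python
from collections import defaultdict
from statistics import median

def classify_slowness(samples, slow_after=30):
    """samples: iterable of (agent, lag_seconds) pairs.

    Returns (verdict, slow_agents), where verdict is
    "ok"     - no agent has a median lag above slow_after,
    "engine" - every agent is slow, which suggests the engine itself,
    "agents" - only some agents are slow (network or local load).
    """
    per_agent = defaultdict(list)
    for agent, lag in samples:
        per_agent[agent].append(lag)
    slow = sorted(a for a, lags in per_agent.items() if median(lags) > slow_after)
    if not slow:
        return "ok", slow
    if len(slow) == len(per_agent):
        return "engine", slow
    return "agents", slow
```

The median is used instead of the mean so that a single outlier run does not mark an otherwise healthy agent as slow.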

1. Silly old SNMP agent. A few of the TRAP codes have been ignored (warm start of an agent?!). - Ticket
   It's too bad that the title for "System error of the UC4 Server" is so generic. LINK
2. At OS level, psmon checks the count of ucsrvwp, ucsrvcp, snmp1 (JWP is not included currently) - Ticket
3. SYS_HOST_ALIVE every 30 minutes against all OS agents - Email
4. Simple SAP job (RSUSR000) against all SAP agents every 30 min. MRT terminator at 3 min (I'm too generous) - Email
5. HEARTBEAT - a Unix job prints a timestamp to a file every 10 min. UXMON checks the file age. Maximum age: 20 min - Ticket
6. Job failure monitoring - post-process include. It parses the job name against a static variable. Depending on the priority and the app team, a ticket or email is generated (OVO_MON.log and Send_Mail)
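The routing idea in item 6 (job name decides team and ticket vs. email) might look roughly like this outside of UC4 script. The naming convention (second dot-separated part identifies the application), the routing table and the team names are all hypothetical; the real include and static variable are not shown in the post.

```python
# Hypothetical routing table: application key -> (team, action).
ROUTING = {
    "SAP": ("sap-team", "ticket"),
    "HR": ("hr-team", "email"),
}
DEFAULT = ("operations", "email")  # fallback for unknown applications

def route_failure(job_name):
    """Pick (team, action) from the second dot-separated part of the job name.

    Assumes names like JOBS.<APP>.<NAME>; anything else goes to DEFAULT.
    """
    parts = job_name.split(".")
    key = parts[1] if len(parts) > 1 else ""
    return ROUTING.get(key, DEFAULT)
```

Keeping the table in one static place means a new application team only needs a new entry, not a change to every job's post-process.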