<p dir="ltr"><strong>email alerts on core dumps via the ODM &amp; errpt system</strong> - AIXpert Blog, 9 July 2013</p>
<p dir="ltr">
Following on from my previous AIXpert blog of 1st July 2013, &quot;Core files filling important filesystems? Want email alerts about each core dump?&quot;, AIX guru Mathew Accapadi pointed out to me that for many years we have been able to use the ODM and the AIX error reporting framework to send an email whenever a core file is generated.&nbsp; See the previous blog for how to force core files into a particular directory so they cannot fill up important system directories.&nbsp; The alternative reporting method is below:</p>
<p dir="ltr">
Check out the following man page for more information: <a href="http://pic.dhe.ibm.com/infocenter/aix/v6r1/index.jsp?topic=%2Fcom.ibm.aix.genprogc%2Fdoc%2Fgenprogc%2Ferror_notice.htm" target="_blank">Error&nbsp;Notification</a>&nbsp;&nbsp;&nbsp; Note: you can also use this method to trap and alert on many other errors reported into errpt.</p>
<p dir="ltr">
1) Create an ODM stanza in a file like the one below:</p>
<pre dir="ltr" style="margin-left: 40px;">
<span style="color:#006400;"># cat /tmp/odmtext
errnotify:
en_name = &quot;corenotify&quot;
en_label = &quot;CORE_DUMP&quot;
en_method = &quot;/usr/lbin/email_core_warning $1 $2 $3 $4 $5 $6 $7 $8 $9&quot;
#</span></pre>
<p dir="ltr">
2) Add this to the ODM and then, out of interest, list it to check it was added correctly:</p>
<pre dir="ltr" style="margin-left: 40px;">
# odmadd /tmp/odmtext
# odmget -q &quot;en_name = corenotify&quot; errnotify
errnotify:
en_pid = 0
en_name = &quot;corenotify&quot;
en_persistenceflg = 0
en_label = &quot;CORE_DUMP&quot;
en_crcid = 0
en_class = &quot;&quot;
en_type = &quot;&quot;
en_alertflg = &quot;&quot;
en_resource = &quot;&quot;
en_rtype = &quot;&quot;
en_rclass = &quot;&quot;
en_symptom = &quot;&quot;
en_err64 = &quot;&quot;
en_dup = &quot;&quot;
en_method = &quot;/usr/lbin/email_core_warning $1 $2 $3 $4 $5 $6 $7 $8 $9&quot;
#</pre>
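While testing you may want to change the method and start over. A short sketch of the clean-up: odmdelete removes the stanza again (using the same query as the odmget check above), after which the edited stanza file can be re-added. These commands are AIX-only and must be run as root:

```shell
# Remove the corenotify entry from the errnotify ODM class,
# using the same query string as the odmget check above.
odmdelete -o errnotify -q "en_name = corenotify"
# Re-add the (edited) stanza file.
odmadd /tmp/odmtext
```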
<p dir="ltr">
3) This runs a script called <strong>/usr/lbin/email_core_warning</strong> whenever an event with the label &quot;<strong>CORE_DUMP</strong>&quot; is logged, and passes it a set of information parameters. Here is the content of my script:</p>
<pre dir="ltr" style="margin-left: 40px;">
<span style="color:#006400;"># cat /usr/lbin/email_core_warning
# This part writes the information to a local log file and was good for testing
# You might leave this bit out
date &gt;&gt;/tmp/core_log
hostname &gt;&gt;/tmp/core_log
echo $* &gt;&gt;/tmp/core_log
echo &quot;----&quot; &gt;&gt;/tmp/core_log
# Send the information in an email
mailx -s &quot;Core dump on `hostname`&quot; nigelgriffiths@blue.ibm.com &lt;&lt;EOF
1 Seqno = $1
2 ErrorId = $2
3 Class = $3
4 Type = $4
5 Flags = $5
6 Resource = $6
7 rType = $7
8 rClass = $8
9 Label = $9
EOF
</span></pre>
<p dir="ltr">
4) Make this script executable: <strong>chmod u+x /usr/lbin/email_core_warning</strong></p>
<p dir="ltr">
5) Now I run my program to force a core dump and in the test log file I get:</p>
<pre dir="ltr" style="margin-left: 40px;">
<span style="color:#0000cd;"># cat /tmp/core_log
....
Tue Jul 9 16:42:36 BST 2013
gold6.uk.ibm.com
91 0xa924a5fc S PERM FALSE SYSPROC NONE NONE CORE_DUMP
----
</span></pre>
<p dir="ltr">
6) And the AIX email output via mailx:</p>
<pre dir="ltr" style="margin-left: 40px;">
<span style="color:#0000cd;">Message 32:
From root Tue Jul 9 16:42:36 2013
Date: Tue, 9 Jul 2013 16:42:36 +0100
From: root
To: nigelgriffiths
Subject: Core dump on gold6.uk.ibm.com
1 Seqno = 91
2 ErrorId = 0xa924a5fc
3 Class = S
4 Type = PERM
5 Flags = FALSE
6 Resource = SYSPROC
7 rType = NONE
8 rClass = NONE
9 Label = CORE_DUMP
</span></pre>
<p dir="ltr">
There are no details of the core filename or a stack trace, but the alert is near instantaneous.</p>
<p dir="ltr">
The <strong>errpt -a</strong> output has this information and a bit more, such as the /core file, the name of the crashed program and a short stack trace. I think these details only appear in newer AIX versions, but I could not tell you when they started to appear.</p>
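If you want those extra details in the email itself, the notification script can fetch them from errpt using the sequence number it is passed as $1. A hedged sketch of extra lines for /usr/lbin/email_core_warning (AIX-only, and the subject line and recipient are just the ones used above):

```shell
# Use the sequence number ($1) to fetch the full errpt detail and
# mail it, so the email carries the core file path, program name
# and stack trace where the AIX level provides them.
SEQNO=$1
mailx -s "Core dump detail on $(hostname)" nigelgriffiths@blue.ibm.com <<EOF
$(errpt -a -l $SEQNO)
EOF
```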
<p dir="ltr">
- - - The End - - -</p>
<p dir="ltr"><strong>Shared Storage Pools 2 - Thin provisioning, monitoring free space + Alerts</strong> - AIXpert Blog, 21 February 2012</p>
<div>Hi, I have just released a fifth hands-on movie this month, on this interesting topic. <br />You can find the movie here: <a href="https://www.ibm.com/developerworks/wikis/display/WikiPtype/Movies#Movies-sectionSSP2 ">Shared Storage Pools 2 - Thin provisioning, monitoring free space + Alerts</a></div><div /><div /><div> <hr style="width: 100%; height: 2px;" /></div><div> </div><div><span style="font-weight: bold; color: rgb(0, 0, 255);">Shared Storage Pool Thin provisioning is pretty cool and saves a lot of disk space</span>. Effectively, the hundreds of GBs of unused disk space in hundreds of Virtual Machines (LPARs) are brought together in one place in the Shared Storage Pool and can then be put to real use. All this without the use of clever disk subsystems. The <b>risk is that you use all the disk space of the pool</b>, and the next VM that tries to write to a new chunk of disk (a chunk is 64 MB) fails to get one and <span style="font-weight: bold;">gets disk errors</span>. So monitoring the free space and getting alerts is very important. But the alerts may not work as you expect. The &quot;alert&quot; command sets the threshold simply enough, but then it gets complicated:<br /><ul><li>First, we found a bug, for which there is now a fix that I tested (see below for more information on the bug).
</li><li>Second, we have the Virtual I/O Servers in a cluster, and the one reporting the Alert could be any of them. There is a very sound reason for this, but to the VIOS administrator it looks as if it is chosen at random.<br /></li><li>Third, the message is in the VIOS error log, which you access via errlog, which flies off the screen unless you use &quot;errlog | more&quot; - how very DOS-like! errlog is very similar to the AIX errpt command. Then you have to use &quot;errlog -ls | more&quot; to find the details.<br /></li><li>Fourth, it is not clear that it is an alert for going over the threshold. Neither the current threshold percentage nor the current free space percentage is in the message! </li></ul></div><div>Here is an example of the <span style="font-weight: bold;">Alert error log entry</span> (it is not pretty): </div><blockquote><div>$ errlog -ls<br />...<br />LABEL: VIO_ALERT_EVENT<br />IDENTIFIER: 0FD4CF1A<br /><br />Date/Time: Wed Feb 15 11:26:32 CST 2012<br />Sequence Number: 86<br />Machine Id: 00F602714C00<br />Node Id: diamondvios2<br />Class: O<br />Type: INFO<br />WPAR: Global<br />Resource Name: VIOD_POOL<br /><br />Description<br />Informational Message<br /><br />Probable Causes<br />Asynchronous Event Occurred<br /><br />Failure Causes<br />PROCESSOR<br /><br /> Recommended Actions<br /> Check Detail Data<br /><br />Detail Data<br />Alert Event Message<br />25b8001<br />A Storage Pool Threshold alert event occurred on pool D_E_F_A_U_L_T_061310 pool id 92d2fd5f2ec45382 in cluster galaxy cluster id 00841e2a422711e194cbf60271715fc2 The alert event received is: Threshold Exceeded.<br /><br />Diagnostic Analysis<br />Diagnostic Log sequence number: 250<br />Resource tested: sysplanar0<br />Menu Number: 25B8001<br />Description:<br />A Storage Pool Threshold alert event occurred on pool D_E_F_A_U_L_T_061310 pool id 92d2fd5f2ec45382 in cluster galaxy cluster id 00841e2a422711e194cbf60271715fc2 The alert event received is: Threshold Exceeded.<br />...
<br /></div></blockquote><div>Note: there are other log entries that are very similar but say &quot;<b>Storage Pool Up Event</b>&quot; and do not say &quot;<b>Threshold Exceeded.</b>&quot; They are NOT low-free-space alerts.</div><div> </div><div>I keep thinking of those jokes about a black cat, at night, in a dark coal bunker! I am sure the developers have not tried to hide the alert messages, but that is what we have got: high-impact warning messages well hidden in the cluster.<br /></div><div> </div><div><span style="color: rgb(0, 0, 255); font-weight: bold;">Making the low free space issue more visible</span></div><div> </div><div>So even if you know the free space has just gone below the limit, it is a lot of work to find the Alert - not what I would call pro-actively letting you know there is a problem that needs fixing urgently. Various ideas are covered in the movie; for example, some people escalate the VIOS system logs to other tools and could find the error condition there. <br /><br />We could have cron-based scripts regularly sending Shared Storage Pool free space statistics by email, or emailing only when the free space is low. <br /><br />One alternative was using Systems Director 6.3 (ISD) - which I tried. I discovered all four VIOS, gained access and ran Inventory - this only took a few minutes. Then I set the free space threshold percentage just below the current use and started the client VMs writing to new large files. Before I could even switch to the Systems Director browser, it had already detected the alert event and reported it on the Problem panel.
See below:</div><div> </div><div><a href="https://www.ibm.com/developerworks/mydeveloperworks/blogs/aixpert/resource/BLOGS_UPLOADED_IMAGES/ISD_SSP2_alert_480.jpg" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/mydeveloperworks/blogs/aixpert/resource/BLOGS_UPLOADED_IMAGES/ISD_SSP2_alert_480.jpg" style="display: block; margin: 0pt auto; text-align: center; position: relative;" /></a> <br /></div><div>Alternatively, we could use &quot;cron&quot; to make regular checks and escalate email messages. </div><div> </div><div>The movie also covers a script or two to reformat the &quot;lssp&quot; and &quot;alert&quot; command output to calculate percentages and the amount of over-commit.<br /><ul><li>Get the scripts here: <a href="https://www.ibm.com/developerworks/wikis/display/WikiPtype/Shared+Storage+Pools+free+space+monitoring+scripts">lspool</a></li></ul></div><div /><div> </div><div style="font-weight: bold; color: rgb(255, 0, 0);">The Alert BUG</div><ul><li>The developers found the bug that caused the alerts not to be captured pretty quickly, and worked up an efix (emergency fix) for us.</li><li>We then installed this on all VIOS nodes and checked the fix. </li><li>It worked first time and every time.</li><li>This will be released at some point - customers need to raise a PMR to get access to this fix.</li><li>One important point to remember is to install VIOS fixes with the updateios command - NOT the AIX emgr command (like I did, oops!).</li><li>I do not know the official number or release mechanism, but the file I used included a number: 823942</li><li>If you or IBM Support have trouble identifying the fix, ask them to contact me: Nigel Griffiths IBM UK.</li><li>When I get more information, I will add it here.<br /></li></ul><div> <br /></div><div>The next movie may be for SSP2 snapshots and perhaps Live Partition Mobility. 
I would like to encourage lots of people to try out Shared Storage Pools - I know the technology was not developed just to help out poor hard-working Power Systems techies like myself, but it is so simple to use, quick compared to mucking around with LUNs and zones, and very flexible too, as it opens up LPM to every LPAR I will create in 2012. It saves me time, saves disk space and makes for a flexible VM environment - it is a win:win:win.<br /></div><div> </div><div><br /></div><div><br /></div>
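For the cron idea above, the percentage arithmetic is simple enough to sketch. In this sketch the pool sizes are hard-coded and the lssp parsing is left as an assumption (its output format varies by VIOS level), and the mail command and its recipient address are commented-out placeholders:

```shell
# Percentage-free check for a Shared Storage Pool - a sketch.
# pool_free_pct TOTAL_MB FREE_MB -> prints the integer percent free.
pool_free_pct() {
    awk -v total="$1" -v free="$2" 'BEGIN { printf "%d\n", (free * 100) / total }'
}

THRESHOLD=25          # alert when free space drops below 25%
total_mb=16384        # in real use, parse these two values from "lssp"
free_mb=3277          # output on the VIOS (field positions vary by level)

pct=$(pool_free_pct "$total_mb" "$free_mb")
echo "Pool free space: ${pct}%"
if [ "$pct" -lt "$THRESHOLD" ]; then
    echo "ALERT: free space ${pct}% is below the ${THRESHOLD}% threshold"
    # mail -s "SSP free space low on $(hostname)" you@example.com </dev/null
fi
```

Run from cron every few minutes, this gives the pro-active warning that the errlog Alert entry does not.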