In-doubt transactions can be identified from the UNCOM attribute of the 'dis qstatus' command, and their details can be displayed with the 'dspmqtrn -m QM' command. So once we identify an in-doubt transaction, how do we handle it? The answer is the rsvmqtrn command, but what action should be taken: commit or backout? A commit will complete whatever operation the transaction was involved in, while a backout will undo it. To understand what happens to the message with either of these actions we first need to identify which operation (put or get) the transaction was involved in. For this we need to capture transaction dumps using the amqldmpa command.
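For example, checking for uncommitted messages and listing the in-doubt transactions might look like this (the queue and queue manager names are illustrative):

# Does the queue have uncommitted messages?
echo "DISPLAY QSTATUS(APP.QUEUE) UNCOM" | runmqsc QM

# List the in-doubt, externally coordinated transactions
dspmqtrn -m QM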

The dumps can be searched for the in-doubt transaction based on its transaction ID or transaction number. The Waiter section of a transaction in the kern.out output can be used to identify the operation involved, either put or get, along with the options used. If the connection no longer exists then the transaction will not appear in amqldmpa's kern.out, in which case we need to look in the atm.out file for the transaction. A long-running in-doubt transaction in the atm.out file will have an SLE structure which identifies the operation the transaction was involved in. Once we have identified the operation in which the in-doubt transaction was involved, the user can take the appropriate action:

- If the in-doubt transaction was involved in a put operation:

A commit of the transaction will put the message on the queue, whereas a backout of the transaction will result in the loss of the message.

- If the in-doubt transaction was involved in a get operation:

A commit of the transaction will result in the loss of the message, whereas a backout of the transaction will put the message back on the queue.
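Once the operation is known, the transaction can be resolved with rsvmqtrn. A sketch (the transaction identifier comes from the dspmqtrn output; check the exact argument syntax for your MQ version):

# Commit the in-doubt transaction reported by dspmqtrn as transaction number 0,1
rsvmqtrn -m QM -c 0 1

# ...or back it out instead
rsvmqtrn -m QM -b 0 1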

IBM MQ has always been able to generate event messages when something "interesting" has occurred in the queue manager. These events could show that a queue is full, or that there has been an authorisation failure; someone needing an audit trail might want to capture the command and configuration events. These events are written as MQ messages to well-known queues, using PCF structures which can be decoded to give the full description of the event.

The amqsevt sample program

IBM MQ V8.0.0.4 included a new sample program, amqsevt, designed to format these event messages. That fixpack contained both the executable program and source code for it. You can see more about the formatter in this video. Here is an event as printed by the program:

**** Message #7 (236 Bytes) on Queue SYSTEM.ADMIN.PERFM.EVENT ****
Event Type       : Perfm Event [45]
Reason           : Queue Full [2053]
Event created    : 2016/10/21 07:11:28.58 GMT
Queue Mgr Name   : V9000_A
Base Object Name : FULLEVT
Time Since Reset : 0
High Queue Depth : 4
Msg Enq Count    : 0
Msg Deq Count    : 0

Displaying events in this more readable, English-like style can be very useful for administrators who want to see what is going on without using a more formal monitoring product. However, there are times when a format more suitable for programmatic processing is needed.

A modified version for JSON

The common MQ management and monitoring tools such as Omegamon are all able to decode the PCF messages, and take appropriate actions or generate alerts from these events. But many customers now want to integrate these events with other tools, that perhaps do not have an MQ-specific component or heritage. And so I've produced a modified version of the amqsevt program that can print the events in JSON, which is a simple format, but one which is capable of being parsed and searched with a variety of tools.

This example shows the same event as above, but formatted as JSON:

{
  "eventSource" : {
    "objectName" : "SYSTEM.ADMIN.PERFM.EVENT",
    "objectType" : "Queue"
  },
  "eventType" : {
    "name" : "Perfm Event",
    "value" : 45
  },
  "eventReason" : {
    "name" : "Queue Full",
    "value" : 2053
  },
  "eventCreation" : "2016/10/21 07:11:28.58 GMT",
  "eventData" : {
    "queueMgrName" : "V9000_A",
    "baseObjectName" : "FULLEVT",
    "timeSinceReset" : 0,
    "highQueueDepth" : 4,
    "msgEnqCount" : 0,
    "msgDeqCount" : 0
  }
}

Using the program

The modified amqsevt has an extra command line option, "-j", to indicate that the output should be in JSON. When selected, some of the status messages from the program (e.g. program starting or ending) are disabled, so that the only information sent to stdout is the JSON events themselves. When needed, error messages are sent to stderr. That means the standard output from the program can be sent directly to any JSON-aware program without needing further filtering. There is a blank line between each event if you want to split the output at a convenient point.

This example uses the jq command, which is designed to filter JSON, to find which queues have hit their full depth. It reads all the events from the queue manager and waits 2 seconds for any further input (the -w parameter):
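A reconstruction of that pipeline might look like this (the queue manager name and the exact jq filter are illustrative):

amqsevt -j -m V9000_A -w 2 |
  jq 'select(.eventReason.name == "Queue Full") | .eventData.baseObjectName'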

Note that the output includes the quotes around the queue name; any further processing may want to remove them. The "-r" option to jq will do that for you.

In this next example, I redirected the output to a file, and then let the Splunk monitoring product read it. I then applied a filter to look for all events from the PERFM event queue.

Where to get the program

Source code for the program can be downloaded from this gist. There is a "download zip" button on the page to make it easy. The program can be compiled like any of the other sample programs. You just have to ensure you link with the threaded versions of the MQI libraries (libmqm_r on platforms where there are separate threaded and non-threaded versions).

Feedback

I'd be very interested in how useful this program is, and any further examples of using it. Please leave comments here or on the github gist page.

This article explains the interaction between channel authentication records (CHLAUTH) and connection authentication (CONNAUTH) in IBM MQ. When an application connects to MQ using client bindings, network connections have to be opened up which can have security implications. In the last few MQ releases these have been addressed with the introduction of the CHLAUTH and CONNAUTH features. This is the first in a series of articles that will look at these features. This article will explain how they fit together when a receiving end of a channel is started.

Different types of bindings

IBM MQ supports two ways that an application can connect:

Local bindings: This is when the application and queue manager are on the same operating system image. CHLAUTH is not relevant to this type of application connection.

Client bindings: This is when the application and queue manager use the network to communicate. The application and queue manager may be running on the same machine or they may be on different machines. In MQ a client connection is handled in the form of a server-connection (SVRCONN) channel. Both CONNAUTH and CHLAUTH are applicable, and it is this type of connection that is discussed here.

The binding steps of a receiving end of a channel

When an application connects to a queue manager there is a lot of checking to perform to ensure that both ends of the channel understand what is supported by the other end. The receiving end does some extra checking to ensure that the client is allowed to connect. This checking involves CHLAUTH and CONNAUTH. This process may also include a security exit as this can affect the result. This channel connecting phase is also referred to as the binding phase.

In MQ version 7, SHARECNV was added to SVRCONN channels so that multiple connections/conversations can share the same channel. This article does not cover what happens in terms of CHLAUTH and CONNAUTH when second and subsequent conversations share the same channel.

Figure 1 shows the steps that a SVRCONN channel goes through when the server end (at the queue manager) starts.

Figure 1: the steps that a SVRCONN channel goes through when starting up.

Step 1: Receive a connection request The channel initiator or listener receives a connection request from somewhere on the network.

Step 2: Is address allowed to connect? Before any data is read from the wire, MQ checks the partner's IP address against the CHLAUTH rules to see if the address is covered by a BLOCKADDR rule. If the address is not found, and so not blocked, the flow proceeds to the next step.

Step 3: Read data from the channel MQ can now read the data from the wire into a buffer and start to process the sent information.

Step 4: Look up channel definition In the first data flow MQ sends amongst other things the name of the channel which the sending end is trying to start. The receiving queue manager can then look up the channel definition, which has all the settings that are specified for the channel.

Step 5: Pre CONNAUTH checks When using CHLAUTH and CONNAUTH together, there is a situation where the user ID of the channel, rather than the asserted user ID, is used in the CHLAUTH mapping. To give an example, if CONNAUTH is using LDAP to map an email address provided by the application to a serial number, you would probably want the result (the serial number) to be used in the CHLAUTH mapping. By default this does not happen. An APAR added an ini parameter called ChlauthEarlyAdopt, which performs an extra check before the CHLAUTH mapping. This APAR went into MQ 8.0.0.5; IT12825 (IBM MQ V8: A CLIENT APPLICATION FAILS TO CONNECT TO A QUEUE MANAGER WITH ERROR AMQ9777: CHANNEL WAS BLOCKED) details how to turn the option on. This was slightly altered at version 9.0, where the option is simply Y, which covers all the options described in the APAR.

Step 6: CHLAUTH mapping The CHLAUTH cache is inspected again to look for mapping rules (SSLPEERMAP, USERMAP, QMGRMAP and ADDRESSMAP). The rule that most specifically matches the incoming channel is used. If the rule has USERSRC(CHANNEL) or USERSRC(MAP) the channel continues binding. A rule with USERSRC(NOACCESS) means the channel is blocked from connecting and the network connection is ended.

Step 7: Call security exit If the channel has a security exit (SCYEXIT) defined, this is called with the exit reason (MQCXP.ExitReason) set to MQXR_SEC_PARMS. If the client application has specified security credentials on an MQCONNX call (via the MQCSP on the MQCNO), these will be passed in the exit parameters (a pointer to the MQCSP will be present in the SecurityParms field of the MQCXP). The MQCSP structure has pointers to the user ID (MQCSP.CSPUserIdPtr) and password (MQCSP.CSPPasswordPtr). It is possible for the security exit to change these. Example 1 shows how a security exit would access the userId and password fields.
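A minimal sketch of such an exit fragment (a reconstruction, not the article's original Example 1; the function and variable names are illustrative, while the MQCXP and MQCSP field names come from the MQI headers):

/* Channel security exit fragment: reading the MQCSP credentials       */
/* passed on MQCONNX. This is a sketch; error handling is omitted.     */
#include <stdio.h>
#include <cmqc.h>             /* MQI definitions                       */
#include <cmqxc.h>            /* Channel exit definitions (MQCXP)      */

void MQENTRY SecurityExit(PMQVOID pChannelExitParms,
                          PMQVOID pChannelDefinition,
                          PMQLONG pDataLength,
                          PMQLONG pAgentBufferLength,
                          PMQVOID pAgentBuffer,
                          PMQLONG pExitBufferLength,
                          PMQPTR  pExitBufferAddr)
{
  PMQCXP pParms = (PMQCXP)pChannelExitParms;

  if (pParms->ExitReason == MQXR_SEC_PARMS &&
      pParms->SecurityParms != NULL)
  {
    PMQCSP pCSP = pParms->SecurityParms;   /* MQCSP from the MQCNO     */

    /* The user ID and password are not null-terminated; always use    */
    /* the accompanying length fields.                                 */
    printf("UserId : %.*s\n",
           (int)pCSP->CSPUserIdLength, (char *)pCSP->CSPUserIdPtr);

    /* The exit may modify these fields, or reject the connection by   */
    /* setting pParms->ExitResponse = MQXCC_CLOSE_CHANNEL;             */
  }
}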

The exit could tell MQ to close the channel by returning MQXCC_CLOSE_CHANNEL or MQXCC_FAILED in MQCXP.ExitResponse field. Otherwise, channel processing continues to the connection authentication phase.

Step 8: Is user authenticated? The authentication phase happens if CONNAUTH is enabled on the queue manager. To check this, issue the MQSC command 'DISPLAY QMGR CONNAUTH'. Figures 2 and 3 show the output of this command on MQ for z/OS and distributed MQ.

The CONNAUTH value is the name of an AUTHINFO MQ object. MQ version 8 supports two methods of authentication: using the operating system (AUTHTYPE(IDPWOS)) or using LDAP, which is not available on z/OS (AUTHTYPE(IDPWLDAP)). This article concentrates on operating system authentication; in a future article I will cover using LDAP for authentication. Figures 4 and 5 show the shipped default object for AUTHINFO TYPE(IDPWOS) in MQ for z/OS and distributed MQ.

The AUTHINFO TYPE(IDPWOS) object has an attribute called CHCKCLNT. If the value is changed to REQUIRED, all client applications have to supply a valid user ID and password. While we are looking at the attributes involved with CONNAUTH, I must mention the adopt context (ADOPTCTX) option. The ADOPTCTX attribute controls whether the channel runs under the MCAUSER or the user ID the application has supplied. If ADOPTCTX is YES then the channel will adopt the supplied user to run under (it becomes the active MCAUSER) and object authorization will be done against this user. This adopted user is also the one checked in step 9.
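As a sketch, enabling mandatory password checking and context adoption with the shipped default object might look like this on distributed MQ:

ALTER AUTHINFO(SYSTEM.DEFAULT.AUTHINFO.IDPWOS) AUTHTYPE(IDPWOS) +
      CHCKCLNT(REQUIRED) ADOPTCTX(YES)
ALTER QMGR CONNAUTH(SYSTEM.DEFAULT.AUTHINFO.IDPWOS)
REFRESH SECURITY TYPE(CONNAUTH)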

For example, say I don't have an MCAUSER set on the SVRCONN channel, my client is running under 'markw1' on my Linux machine, and my application specifies user 'fred' in the MQCSP. The channel will start running with 'markw1' as the active MCAUSER; after the CONNAUTH check the user 'fred' will be adopted and the channel will run with 'fred' as the active MCAUSER.

Step 9: Is user allowed on this channel? If the CONNAUTH checking is successful the CHLAUTH cache is then inspected again to check if the active MCAUSER is blocked by a BLOCKUSER rule. If the user is blocked the channel will be terminated.

Steps 10 and 11: Object authorisation checks The client application has now connected. As with locally bound applications, for any object that the application opens, e.g. a queue, a check is made to ensure that the active MCAUSER has the appropriate authority for that object.

Conclusion

In this article I have described the stages a channel goes through when connecting, in terms of security checking. In the next article I will demonstrate how an application supplies user credentials.

Resources

The following links are provided to give more information on the topics covered:

APARs of the month

Customers using the MQ V7.x classes for Java and classes for JMS who have applications which use security exits should be aware of this APAR, as it reports an issue which can prevent those applications from being migrated to use the MQ V8.0 or MQ V9.0 classes.

APARs of the month

This APAR fixes an issue where managed transfers submitted to a protocol bridge agent can seemingly disappear. It has already affected a few customers, so if you are using protocol bridge agents this may well affect you too.

IBM recently announced support for XL C/C++ 13.1 on AIX. The XL compiler is used to compile the IBM MQ samples such as amqsput, amqsget and amqsbcg, which put, get and browse messages using both client and server libraries. This document helps users who want to build their own 32-bit or 64-bit applications that use IBM MQ to put and get messages.

This document includes information on the XL C compiler, the supported AIX OS levels, how to find the PATH and version of the installed XL compiler and how to set up the PATH, the installation of MQ, the location of the C/C++ MQ samples, the modes of compilation, and the commands to compile the samples in threaded and unthreaded modes.

First, identify the XL compiler installed on the AIX machine. The XL compiler versions supported on AIX are listed below.
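For example (a sketch; fileset names vary by compiler release):

# List the installed XL C/C++ filesets and their levels
lslpp -l | grep -i xlc

# Show the version of the compiler currently on the PATH
xlc -qversion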

Modes of Compiling MQ Samples: There are threaded and unthreaded modes for compiling the MQ samples.

Threaded Mode A thread is considered to be connected to IBM® MQ from MQCONN (or MQCONNX) until MQDISC. UNIX and Linux systems safely allow the setting up of a signal handler for such signals for the whole process. However, IBM MQ sets up its own handler for the following signals, in the application process, while any thread is connected to IBM MQ:

If you are compiling 32-bit applications, use the 32-bit libraries to compile the samples, whereas for 64-bit applications use the 64-bit libraries. The modes used to compile the samples for the server and the client are shown below.

One more important point: while compiling, do not link a threaded and a non-threaded library at the same time; doing so can cause the compilation to fail.
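As a sketch, assuming the default AIX installation paths, the variants look like this (check the build commands documented for your MQ version):

# Server bindings, threaded, 64-bit
xlc_r -q64 /usr/mqm/samp/amqsput0.c -o amqsput_64 \
      -I/usr/mqm/inc -L/usr/mqm/lib64 -lmqm_r

# Client bindings, non-threaded, 32-bit
xlc -q32 /usr/mqm/samp/amqsput0.c -o amqsput_32 \
    -I/usr/mqm/inc -L/usr/mqm/lib -lmqic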

When trying to connect to the queue manager in client mode using the compiled C/C++ samples, the user needs to make some security changes to queue manager objects. Please refer to the link below to make the changes.
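The linked changes are not reproduced here, but for a sandbox system they are typically of this kind (a sketch that deliberately weakens security; never apply it to a production queue manager):

runmqsc QM <<EOF
ALTER QMGR CHLAUTH(DISABLED)
ALTER AUTHINFO(SYSTEM.DEFAULT.AUTHINFO.IDPWOS) AUTHTYPE(IDPWOS) CHCKCLNT(OPTIONAL)
REFRESH SECURITY TYPE(CONNAUTH)
EOF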

After changing the properties, the user is able to put and get messages using the compiled C/C++ samples. Go to the directory where the output of the compiled C/C++ samples is present; using those samples the user can easily put and get messages:

#./amqsput_32 Q QM
Sample AMQSPUT0 Start
target queue is Q
Test message

#./amqsget_32 Q QM
Sample AMQSGET0 Start
message <Test message>

#Likewise you can put and get messages using other compiled samples.

Below is a snapshot of putting and getting messages using the compiled samples.

Please check /var/mqm/errors for system-generated errors and FDCs, and also check the queue manager error logs.

Errors like MQRC 2035 generally appear when a client tries to connect to the queue manager, or when the user tries to put or get messages on the queue. To overcome this problem, modify the queue manager objects as mentioned above in step 7.

The system or queue manager error logs usually show an error something like:

EXPLANATION:
The channel program running under process ID 28901550 for channel 'C' ended
abnormally. The host name is '9.20.29.82'; in some cases the host name cannot
be determined and so is shown as '????'.
ACTION:
Look at previous error messages for the channel program in the error logs to
determine the cause of the failure. Note that this message can be excluded
completely or suppressed by tuning the "ExcludeMessage" or "SuppressMessage"
attributes under the "QMErrorLog" stanza in qm.ini. Further information can be
found in the System Administration Guide.
----- amqrmrsa.c : 930 --------------------------------------------------------

The AMQERR01.LOG shows that an application ended without a clean disconnect. This kind of error can occur when a user terminates a running sample before it has exited, which would be sufficient to cause this error in the log.

If we expect the application to exit cleanly, it should be rerun and given time to do so, in order to check that it does not hang indefinitely.

Note: do not terminate a sample while it is still running; let it complete its work and exit cleanly.

I have pushed to GitHub part 2 of my series of samples showing how to use DRBD with IBM MQ. Part 2 adds a symmetric three-way Pacemaker cluster to provide automatic failover if there is a problem with a queue manager.

A couple of colleagues and I recently received a query that asked for a comparison of the pros and cons of using distributed (XA) transactions with an HA queue manager on a distributed platform, such as the IBM MQ Appliance, versus using a queue-sharing group (QSG) on z/OS with Group Units of Recovery (GROUPUR). Both options are valid. As this was an interesting question I thought a blog post on this subject might be useful.

What is a distributed (XA) transaction?

Transactions are units of work that either complete in their entirety or not at all; they are the basis of many business applications. A distributed transaction is one that updates multiple resources, so coordination is required to ensure it completes successfully or it is universally backed out. A typical example of a distributed transaction in messaging is one where a message is sent that represents an update that must be made to a database, such as a banking transaction or the placement of a sales order. If the database cannot be updated then it is important that the message is restored to the queue so the update request is not lost and can be processed again later. To accomplish this a 2-phase commit protocol is typically used whereby the transaction coordinator requests all parties (known as resource managers) first prepare to commit (phase 1). If all parties give a ‘green light’ then the transaction is committed, else it is backed out (phase 2). The XA standard, which is commonly used, is a specification by the Open Group (http://www.opengroup.org) that defines a protocol to accomplish this. A transaction that has completed phase 1 but not phase 2 is considered to be in-doubt. The transaction is in-doubt because each resource manager cannot unilaterally decide to commit or back out without compromising the integrity of the unit of work. The collective response of all parties during phase 1 is required to determine how to resolve the transaction consistently. The transaction coordinator collates these responses then notifies each resource manager of the required resolution.

If a transaction coordinator is disconnected from a resource manager during commit processing then the coordinator is not always able to predict the state of each resource. To allow the transaction to be resolved correctly the coordinator must first query the resource manager upon reconnecting to determine the current state. In the XA specification the xa_recover request is used for this purpose. If the coordinator unknowingly connects to a different, independent, resource manager it will receive a response stating the transaction is not known. This compromises the integrity of the transaction because the coordinator must infer the resource manager has either committed or backed out its part of the unit of work, when in fact it might still be in-doubt or even resolved in the opposing way.

What are HA queue managers?

Queue managers on distributed platforms (Windows, Linux and UNIX) can operate in a high availability (HA) configuration. An HA configuration allows a queue manager to be kept available for applications during a planned or unplanned outage of a single system, by starting it on a secondary system instead. The IBM MQ Appliance includes support for HA queue managers ‘out of the box’ (see http://www.ibm.com/support/knowledgecenter/en/SS5K6E_1.0.0/com.ibm.mqa.doc/overview/ov00020_.htm). Two appliances can be connected in a high availability group and data is synchronously replicated between them. Queue managers are automatically started on the secondary appliance when the primary appliance is quiesced or is otherwise offline. On platforms other than the appliance a multi-instance queue manager can be used instead, whereby the data is not replicated but it is stored on a network file server. Alternatively, an HA cluster can be established using PowerHA for AIX (formerly HACMP) or the Microsoft Cluster Service (MSCS). An F5 router, or similar, is often required to route applications to the system where the queue manager is currently active, or applications must use a list of IP addresses and attempt to connect to each system in turn.

What are queue-sharing groups and GROUPUR?

Queue managers on z/OS can be configured in a queue-sharing group (QSG). In a QSG queues and channels can be shared by all queue managers in the group using the z/OS Coupling Facility. This capability allows the QSG to present itself as a single highly-available queue manager. The SysPlex Distributor on z/OS can be used to route applications (or queue manager channels), which connect using a single IP address, to any one of the available queue managers in the group. This means they can always access the shared queues and the messages on them, provided that at least one queue manager is active. Queue managers can therefore be quiesced in turn for either general maintenance operations or upgrades without impacting business applications.

As of MQ version 7.0.1 queue-sharing groups support XA transactions, such as those used by WebSphere Application Server (WAS), that are logically owned by the group instead of a single queue manager. This capability is known as Group Units of Recovery (GROUPUR). If connectivity to a queue manager in the QSG is lost when a transaction is in-doubt then the transaction coordinator can reconnect to any member of the group to resolve it. A transaction coordinator that issues xa_recover is returned a list of all in-doubt XA transactions throughout the QSG that have a group unit of recovery disposition. Similarly, an xa_commit or xa_rollback request can be made for any of the returned transactions. For more information see http://www.ibm.com/support/knowledgecenter/en/SSFKSJ_9.0.0/com.ibm.mq.pro.doc/q004240_.htm.

Comparing HA queue managers with GROUPUR

XA transactions can be safely used with either an HA queue manager or a queue-sharing group with GROUPUR enabled. With HA the queue manager moves from one system to another so applications still connect to the same resource. With GROUPUR applications might be routed to a different queue manager each time they connect, but the queue managers cooperate so they appear as a single resource instead of separate entities.

The main differences between HA queue managers and GROUPUR with respect to support for XA transactions relate to scale and the degree of availability.

HA queue managers have a smaller footprint than a queue-sharing group and they are not limited to a single platform (QSGs are only available on z/OS). However, the scalability of an HA queue manager is limited by the capacity of the system it is running on. There is only one queue manager instance so all applications are routed to the same resource. When using a QSG application connections can be spread across the available set of queue managers. The queue managers in a QSG can run on different LPARs and even different physical hardware, which provides the potential for greater scale and improved performance.

The availability of a QSG is also likely to be higher than an HA queue manager. This is because, while an HA queue manager is failing over from one system to another, applications are unable to connect until it has restarted. When using a queue-sharing group no fail-over of an individual queue manager is required; the application only needs to reconnect to be routed to another queue manager in the group. Availability is also greater with a QSG because it is possible for two or more queue managers to always be active to serve applications. With an HA queue manager, if the secondary system is unavailable for maintenance there is nowhere for the queue manager to fail over to should an error occur.

Summary

This blog post has introduced HA queue managers and Group Units of Recovery (GROUPUR) with a queue-sharing group (QSG) and compared them with respect to support for distributed (XA) transactions. Whether an HA queue manager or a queue-sharing group is the preferred solution is likely to depend on the degree of availability and scale that is required by the connecting applications. QSGs provide for greater scale and greater availability during maintenance or in the event of a failure. However, an HA queue manager is likely to be more than sufficient for many use cases.

What are specialty processors?

From the IBM System z9 server, IBM introduced two types of specialty processors, the zIIP and the zAAP.

The IBM z Systems Integrated Information Processor (zIIP) is designed to help free-up general computing capacity and lower the overall total cost of computing for select data and transaction processing workloads for business intelligence (BI), ERP and CRM, and select network encryption workloads on the mainframe.

The IBM z Systems Application Assist Processors (zAAP) are designed to provide an environment for web-based applications and SOA-based technologies such as Java™.

From the IBM z13, zIIP specialty engines will also run workloads that are eligible to run on zAAP specialty engines.

zIIPs and zAAPs are designed to work asynchronously with the general processors to execute Java programming under control of the IBM Java Virtual Machine (JVM), and this can help reduce capacity requirements on general purpose processors, which in turn may be available for reallocation to other mainframe workloads.

The amount of general purpose processor savings will vary based on the amount of Java application code executed by zAAPs (or zIIPs).

How does MQ benefit from zIIP (or zAAP)?

Typically the queue manager, channel initiator or AMS address spaces will not directly benefit from zIIP or zAAP availability - instead it is the applications connecting to MQ that can benefit from being offloaded to the specialty processors.

Of course, this offloading does potentially reduce the load on the general purpose processors which can have a positive effect on those address spaces that are unable to use specialty processors.

Within MQ there are some areas which can exploit specialty processors:

Managed File Transfer. This Java-based product is able to offload some of its work to zIIP. For more information, see performance report FP11.

Using the IBM MQ classes for JMS in a CICS OSGi JVM Server and in IMS, support for which was implemented in MQ version 9.0 and is discussed in performance report MP1K.

Java / JMS applications running in a Unix System Services environment and connecting to MQ.

Anything else to consider?

Not all of the Java/JMS workload will always be offloaded - some of the work may not be eligible, or there may not be a suitable specialty processor available.

For example, if the zIIP(s) are busy, your zIIP-eligible workload may be run on a general purpose processor, although there are three IEAOPTxx parameters which may be of interest:

IFAHONORPRIORITY=NO which can set the zAAP-eligible work to execute on standard CPs but at a lower priority than non-Java work.

IIPHONORPRIORITY=NO which means that standard processors will not execute zIIP-eligible work unless it is necessary to resolve contention for resources with non-zIIP processor eligible work.

PROJECTCPU=YES can be used to give some guidance as to how much work could be offloaded from regular CPs to specialty processors.

Our apologies that it's a bit late coming (given it is now heading towards the end of September!).

As usual, however, in each section you will find a table showing a list of all of the APARs that were closed in July, as well as additional information about some of the key ones the MQ Level 3 Support Teams would like to highlight.

APARs of the month

If you're using the MQ Appliance, this is an APAR that will be of interest to you. It addresses an issue where a correctly running Appliance intermittently reboots, which causes queue managers configured in an HA pair to fail over.

APARs of the month

This is an APAR that quite a few customers have been hitting in their WebSphere Application Server (WSAS) environments recently, particularly with WSAS V8.5.5.9 and V8.5.5.10. It causes message delivery to a message-driven bean (MDB) application from an Activation Specification to stop, due to a deadlock when attempting to clean up idle resources in the ServerSessionPool. (ServerSessions are used to deliver messages to MDB instances.) After a thorough code analysis, we haven't been able to identify any recent changes that might have caused this deadlock. However, timings between the async-consume thread of an Activation Specification and the "poolScavenger" thread that closes idle ServerSessions seem to have changed, which is causing this problem to be quite a common occurrence on WSAS V8.5.5.9 and later systems.

If you're running recent levels of WSAS V8.5 and observe that message delivery from Activation Specifications to MDBs suddenly stops until the WSAS JVM is restarted, this is likely to be the issue. Even if you're not seeing the issue currently, MQ JMS L3 would advise preemptively applying an interim fix for this problem.

APARs of the month

This APAR forward ports the previous WebSphere MQ V7.5 Managed File Transfer APAR IC90498 to MFT V8 and V9. It corrects an issue when using Apache ANT scripts whereby the fteAnt command was not using the connectionQMgrChannel and connectionQMgrPort properties from command.properties when this information was not specified in the ANT script itself. Therefore, if you use Apache ANT scripts with MFT and rely on the connection information for the command queue manager being read from the command.properties file, this APAR will be of interest to you.

Introduction - separating applications

A common reason for deploying the MQ appliance is as part of a consolidation of queue managers currently running across multiple hosts into one place (sometimes referred to as a ‘messaging hub’). The MQ appliance is an attractive target for this for a number of reasons – ease of maintenance, a solid hardware/performance foundation, and the built-in High Availability capabilities, for example.

However, when hosting a large number of applications on the same appliance, various new considerations around ‘multi-tenancy’ come into play. One of these considerations is network separation, which I’ll address in this article.

You can argue that the basic unit of ‘tenancy’ in MQ has always been the queue manager – it makes sense to co-locate applications which work with the same set of queues/topics, are accessed by the same pool of users, etc. on the same queue manager. This minimises both runtime issues (e.g. needless transmit queue/channel hops) and administration headaches (e.g. moving a queue manager from one host to another in future, or managing connections to LDAP).

Using multiple network interfaces

With that level of separation in place, a common requirement is next to control which networks these queue managers interact with. For example, your organisation may have test/acceptance/production networks which traffic must not ‘hop’ between. Or you may have queue managers dedicated to interactions with particular lines of business, by way of specific subnets.

The appliance provides a lot of flexibility in how you achieve this. In the simplest form you can actually rely solely on two traditional MQ controls – the ‘IPADDR’ attribute on a listener definition, and the ‘LOCALADDR’ attribute on a sender channel. The appliance is well provisioned with network interfaces, and you could simply physically connect a 1Gb Ethernet cable for each network which you need to access, and then bind a queue manager to that IP address. If you are using MQ clustering, the ‘MQ_LCLADDR’ environment variable can be configured for each queue manager in place of setting LOCALADDR (for your auto-defined sender channels).
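For example, binding a listener and a sender channel to a specific address might look like this (the names and addresses are illustrative):

DEFINE LISTENER(PROD.LISTENER) TRPTYPE(TCP) CONTROL(QMGR) +
       IPADDR('10.20.1.15') PORT(1414)
START LISTENER(PROD.LISTENER)

DEFINE CHANNEL(TO.QM2) CHLTYPE(SDR) TRPTYPE(TCP) +
       CONNAME('qm2.example.com(1414)') XMITQ(QM2) +
       LOCALADDR('10.20.1.15')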

When configuring multiple interfaces in this manner, it is best practice to only assign a default gateway for one interface (to provide a default route from the system via your preferred subnet – perhaps on your management network, or used by a particular set of queue managers). For other networks, you will need to assign static routes using the ‘ip-route’ command. See this DataPower technote for more discussion of this topic.

Aggregated links and VLANs

However, this obviously only works if you have one physical interface available for each network, and is limiting in terms of dependency on that single connection. Two concepts can help us improve on this model: Link Aggregation Definitions and VLAN Interface definitions.

A Link Aggregation (sometimes known as a ‘bonded’ interface) combines multiple physical interfaces into a single ‘virtual’ interface with one IP address. The physical interfaces must NOT be assigned individual addresses (you must in fact mark the physical interfaces as ‘available for link aggregation’, which suppresses the option to define an individual IP). There are various mechanisms for link aggregation (see KC), but generally LACP is used to allow the switch to spread packets to and from this aggregated interface across all physical connections – so, for example, six 1Gb connections can be treated as a single 6Gb connection.

So, now we have one highly available, high performance link. However, how can we treat this as a connection to multiple, ‘physically separate’ networks, and connect our queue managers only to the appropriate subnets?

The answer is that the appliance also supports native VLAN tagging (sometimes called ‘trunked’ or ‘trunk mode’). A VLAN interface is configured on top of either a single physical interface or an aggregation, and packets sent through the VLAN interface are ‘tagged’ such that the switch knows which physical network they must be routed to – and vice versa for incoming data. The VLAN interface has its own IP address, which can again be used in the IPADDR/LOCALADDR fields to bind a queue manager to a particular physical network.

One further thing to note is that rather than explicitly using IP addresses in your MQ definitions, it is good practice to define ‘host aliases’. This allows you to modify the IP address/interface without changing your definitions (other than the alias), and is crucial in HA deployments where the same MQ definitions will be used on two systems with different interface addresses.

Big Picture

Bringing it all together, at a simplistic level the final picture might look something like this:

There are many options in the configuration of this, and of course many more combinations of these features are possible, but hopefully this has given a flavour of how you can achieve clean separation between networks, applications and queue managers on the MQ appliance.

I have recently received a couple of queries from customers regarding the use of multiple cluster transmission queues with IBM MQ. One customer wanted to know how best to update an existing cluster to use them. The other customer had a cluster-sender channel that wouldn’t start because they’d accidentally deleted its transmission queue and wanted to know how best to recover. It is actually quite easy to resolve that problem and fortunately it was only on a test system so the customer wasn’t unduly impacted. Given these queries I thought a blog post on this subject might be useful. This post describes the multiple cluster transmission queue feature, why you might wish to configure multiple transmission queues and how to do so. I also explain how to resolve some problems you might encounter. There is quite a lot of information in this post so I’ve divided it in to sections so you can jump straight to content that interests you.

Introducing multiple cluster transmission queues...

A transmission queue is used by MQ to store messages until they can be transmitted over the network to their destination. Regular sender channels have a dedicated transmission queue and they send all messages put to their transmission queue to their remote receiver. Cluster-sender channels are more cooperative. The default behaviour is for all cluster-sender channels to share a single transmission queue called SYSTEM.CLUSTER.TRANSMIT.QUEUE. The correlation ID of messages put to this queue identifies the cluster-sender channel over which they should be sent. Cluster-sender channels use MQGET by correlation ID to remove only those messages they should send. I mention this because many users don’t realise this difference compared to other transmission queues. It is actually extremely important that cluster channels work this way. If they didn’t then a single message at the head of the queue would block all cluster communication if its destination queue manager was unavailable.

Although a single transmission queue for all cluster communication is simple for administrators to understand and is sufficient for many users it does have some drawbacks. Therefore, support for multiple cluster transmission queues was introduced on Windows, Linux and UNIX in version 7.5. IBM i and z/OS did not have a version 7.5 offering so the same feature was introduced in version 8 on these platforms. Many people assume that the introduction of this capability was to improve performance. Some users may notice an improvement if they are constrained by queue contention and/or message buffering but the use of multiple transmission queues predominantly provides the following benefits:

Separation of message traffic
When a single transmission queue is used it is possible for messages destined for one channel to interfere with those for another. For example, if messages cannot be sent over one or more channels then a shared transmission queue can eventually become full.

Management of messages
Administrators often like to use queue attributes such as MAXDEPTH to manage available resources. When all cluster channels share a single transmission queue these attributes become less useful, especially when a queue manager is a member of multiple clusters and the transmission queue is used to service multiple applications.

Monitoring
When a single transmission queue is used it is not possible to use queue monitoring to track the number of messages processed by each channel, although channel statistics provide some of the same information. Administrators must also perform investigative work to identify why the depth of a single transmission queue is growing when it is used by multiple applications and channels. If message traffic is separated it is much easier for administrators to determine the cause and what is affected.

Configuring multiple cluster transmission queues

The transmission queue a regular sender channel uses is configured using the XMITQ channel attribute. A similar attribute cannot be used for cluster communication because most channels are automatically defined based on the cluster-receiver definition of each remote endpoint. It would be undesirable and difficult to manage if remote definitions affected the transmission queue local channels use. It might also cause problems for back-level queue managers coexisting in the same cluster. Therefore, an alternative means to configure the transmission queue each cluster-sender channel should use has been implemented.

A new queue manager attribute called DEFCLXQ has been introduced, which stands for ‘default cluster transmission queue’. This attribute has two permissible values, SCTQ and CHANNEL. The value SCTQ, which is the default for backwards compatibility, indicates that by default cluster-sender channels use SYSTEM.CLUSTER.TRANSMIT.QUEUE. The value CHANNEL indicates that by default each cluster-sender channel uses a dynamically created transmission queue called SYSTEM.CLUSTER.TRANSMIT.<channel-name>. Using the value CHANNEL provides administrators with a simple option to use a separate queue for each channel. The queue manager automatically creates and deletes transmission queues as necessary to serve cluster channels.
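For example, switching the default so that each cluster-sender channel gets its own dynamic transmission queue is a single alteration:

ALTER QMGR DEFCLXQ(CHANNEL)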

For many users the use of DEFCLXQ alone will be sufficient. However, it is recognised that in large clusters a separate transmission queue for every channel might be too granular. Administrators might also prefer to use a different naming convention for the transmission queues. On z/OS, administrators might wish to control which storage class (page set) and buffer pool is associated with each queue. Therefore, a new queue attribute called CLCHNAME has also been introduced. Instead of defining the transmission queue on the channel, the CLCHNAME attribute allows an administrator to define, on a transmission queue, which cluster channels should use it. The attribute supports wildcards in any position to allow many channels to use the same manually defined queue. For example, a value of ABC.* matches any channel with a name that starts with ABC followed by a dot. If the common naming convention of <cluster>.<queue-manager> is used for cluster channels this makes it easy for administrators to configure a separate transmission queue for each cluster a queue manager is a member of, or a separate transmission queue for specific remote queue managers.

The transmission queue a channel uses is determined by searching for a matching CLCHNAME value. If multiple matches are found then the most specific match takes precedence. If no match is found then the value of the DEFCLXQ attribute is used to determine which queue to use.

Consider the following example, assuming no other transmission queues have a non-blank CLCHNAME value:
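The queue definitions behind this example are not shown in this extract, but reconstructed from the descriptions below they would look like this:

DEFINE QLOCAL(CLUSTER.XMITQ1) USAGE(XMITQ) CLCHNAME('AAA.*')
DEFINE QLOCAL(CLUSTER.XMITQ2) USAGE(XMITQ) CLCHNAME('AAA.BBB')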

A channel called AAA.BBB uses CLUSTER.XMITQ2 because that transmission queue has a specific CLCHNAME value that matches the name of the channel.

A channel called AAA.CCC uses CLUSTER.XMITQ1 because that transmission queue has a generic CLCHNAME value that matches the name of the channel and there is not a more specific match.

The transmission queue used by a channel called XXX.YYY depends on the value of the DEFCLXQ queue manager attribute because no CLCHNAME value matches its name. It will use either SYSTEM.CLUSTER.TRANSMIT.QUEUE or a permanent-dynamic transmission queue called SYSTEM.CLUSTER.TRANSMIT.XXX.YYY.

Switching transmission queue

The transmission queue associated with a cluster sender channel can be potentially modified by performing any of the following actions:

Altering the value of the DEFCLXQ queue manager attribute

Manually defining a transmission queue with a non-blank value for the CLCHNAME attribute

Altering the value of the CLCHNAME attribute on an existing transmission queue

Deleting a transmission queue that has a non-blank value for the CLCHNAME attribute

To avoid channels switching transmission queue when they are running, or when multiple configuration changes are made in quick succession, no immediate action is taken by the queue manager when a DEFINE, ALTER or DELETE command is processed. Instead, each channel queries the transmission queue it should use when it starts. If a configuration change has been made since it was last active a switch of its transmission queue is initiated. The process used to switch transmission queue is:

The channel opens the new transmission queue for input and starts getting messages from it (using get by correlation ID)

A background process is initiated by the queue manager to move any messages queued for the channel from its old transmission queue to its new transmission queue. While messages are being moved any new messages for the channel are queued to the old transmission queue to preserve sequencing. This process might take a while to complete if there are a large number of messages for the channel on its old transmission queue, or new messages are rapidly arriving.

When no committed or uncommitted messages remain queued for the channel on its old transmission queue then the switch is completed. New messages are now put directly to the new transmission queue.

Further changes to the transmission queue configuration for a cluster-sender channel do not take effect while the channel is switching, even if the channel is restarted. The existing switch must complete first to avoid messages being dispersed over more than two queues. This is important to remember should you wish to back-out the change that resulted in a switch occurring.

Administrators might not wish cluster-sender channels to switch transmission queue when they next start, because this might be at a time when application workload is high. When workload is high there is an inherent race between messages arriving and the queue manager moving them from the old to the new transmission queue in order to complete the switch operation. Although the queue manager will eventually win out, CPU consumption (and potentially I/O) will increase during this time. Administrators might also want to avoid a lot of channels switching simultaneously, so they can avoid the queue manager spawning many processes to accomplish this.

To help avoid this eventuality MQ provides a command to switch the transmission queue of one or more channels that are not running. On distributed platforms the command is called runswchl. On z/OS the CSQUTIL utility can be used to process a SWITCH CHANNEL command instead. Using these commands administrators can explicitly switch one or more channels, either manually or using a script or job. The command processes each channel in turn instead of them all in parallel, and waits for each switch to complete before starting the next. This is particularly useful because it avoids administrators having to monitor the status of background switching operations. It is also a good idea to explicitly set the status of the channels that are to be switched to STOPPED beforehand, to avoid them being started while the command is running; if a channel is running it will be skipped by the command.

Each channel may be started once the switch of its transmission queue has been initiated, even if the moving messages phase has not yet completed. This helps avoid an extended outage for the channel. Messages will be sent by the channel as soon as they have been moved by the queue manager to its new transmission queue.
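A sketch of how this might be driven on a distributed platform (the queue manager and channel names are illustrative):

# Stop the channel so it cannot start while the switch is initiated
echo "STOP CHANNEL(CLUS1.QM2)" | runmqsc QM1

# Switch the stopped channel's transmission queue, moving its messages
runswchl -m QM1 -c CLUS1.QM2

# Query the switching status of the channel without making changes
runswchl -m QM1 -c CLUS1.QM2 -q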

Monitoring the status of switch operations

It is important for administrators to be able to understand the state of systems they manage. To understand the status of switch operations administrators can perform the following actions:

Monitor the queue manager error log (AMQERR01.LOG) where messages are output to indicate the following stages during the operation:

The switch operation has started

The moving of messages has started

Periodic updates on how many messages are left to move (if the switch operation does not complete quickly)

The moving of messages has completed

The switch operation has completed

On z/OS, these messages are output to the queue manager job log, not the channel initiator job log, although a single message is output by a channel to the channel initiator job log if it initiates a switch when starting.

The DISPLAY CLUSQMGR command can be used to query the transmission queue that each cluster-sender channel is currently using

The runswchl command (or CSQUTIL on z/OS) can be run in query mode to ascertain the switching status of one or more channels. The output of this command identifies the following for each channel:

Whether the channel has a switch operation pending

Which transmission queue the channel is switching from and to

How many messages remain on the old transmission queue

This is a really useful command because in one invocation an administrator can determine the status of every channel, the impact a configuration change has had and whether all switch operations have completed.

Potential issues

Here is a list of some issues that might be encountered when switching transmission queue, their cause and most likely resolution.

Insufficient access to transmission queues on z/OS

Symptom: A cluster-sender channel on z/OS might report it is not authorized to open its transmission queue.

Cause: The channel is switching, or has switched, transmission queue and the channel initiator has not been granted authority to access the new queue.

Resolution: Grant the channel initiator the same access to the channel’s transmission queue that is documented for the transmission queue SYSTEM.CLUSTER.TRANSMIT.QUEUE. When using DEFCLXQ a generic profile for SYSTEM.CLUSTER.TRANSMIT.** avoids this problem occurring whenever a new queue manager joins the cluster.

Moving of messages fails

Symptom: Messages stop being sent by a channel and they remain queued on the channel’s old transmission queue

Cause: The queue manager has stopped moving messages from the old transmission queue to the new transmission queue because an unrecoverable error occurred. For example, the new transmission queue might have become full or its backing storage exhausted.

Resolution: Review the error messages written to the queue manager’s error log (job log on z/OS) to determine the problem and resolve its root cause. Once resolved, restart the channel to resume the switching process, or stop the channel then use runswchl instead (CSQUTIL on z/OS).

A switch does not complete

Symptom: The queue manager repeatedly issues messages that indicate it is moving messages. The switch never completes because there are always messages remaining on the old transmission queue.

Cause 1: Messages for the channel are being put to the old transmission queue faster than the queue manager can move them to the new transmission queue. This is likely to be a transient issue during peak workload because, if it were commonplace, it is unlikely the channel would be able to transmit the messages over the network fast enough.

Cause 2: There are uncommitted messages for the channel on the old transmission queue.

Resolution: Resolve the units of work for any uncommitted messages, and/or reduce/suspend the application workload, to allow the moving message phase to complete.

Accidental deletion of a transmission queue

Symptom 1: Channels unexpectedly switch due to the removal of a matching CLCHNAME value.

Symptom 2: A put to a cluster queue fails with MQRC_UNKNOWN_XMIT_Q.

Symptom 3: A channel abnormally ends because its transmission queue does not exist.

Symptom 4: The queue manager is unable to move messages to complete a switch operation because it cannot open either the old or the new transmission queue.

Cause: The transmission queue currently used by a channel, or its previous transmission queue if a switch has not completed, has been deleted.

Resolution: Redefine the transmission queue. If it is the old transmission queue that has been deleted then an administrator may alternatively complete the switch operation using runswchl with the -n parameter (or CSQUTIL with MOVEMSGS(NO) on z/OS). Use the -n parameter with caution because, if it is used inappropriately, messages for the channel can be orphaned on the old transmission queue. In this scenario it is safe because, as the queue does not exist, there cannot be any messages to orphan.

Permissions for accessing MQ functions have traditionally relied on using operating system definitions for users and groups. That could mean having to define those users and groups on each system individually, which is challenging enough in a static topology, but becomes even worse in a dynamic environment such as a cloud where systems may be defined and deleted regularly. And so some central definition of the identities becomes essential.

For Windows systems, the standard way of sharing identities is Active Directory (AD).

For Unix systems, one way of sharing user and group information is through the configuration of services in the PAM and nsswitch interfaces. Perhaps the most common mechanism for sharing on those systems is to configure nsswitch to use the NIS (or NIS+) services; implementations also exist for storing the information in LDAP, among other stores. Using those nsswitch and PAM services means that MQ does not know how the users or groups are stored; it can treat them as if they were locally defined in /etc/passwd or /etc/group. All of the operating system services such as getpwnam() behave transparently, hiding the underlying source of the data.
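For example, an nsswitch configuration that resolves users and groups from NIS as well as the local files might contain lines like these (a sketch; service names vary by platform):

# /etc/nsswitch.conf
passwd: files nis
group:  files nis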

An alternative now exists in MQ, which can directly access LDAP servers instead of relying on OS services. A common store can be used not just within an operating system family, but across all the MQ Distributed platforms. MQ V8 added the ability to authenticate users against an LDAP directory. Fixpack 8.0.0.2 on Unix and System i extended that to allow authorisations to be managed using LDAP-defined users and groups; the equivalent Windows support for LDAP authorisation arrived in V9. Using LDAP explicitly means that users do not actually need to be defined or available on the operating system where MQ is running. The OS userid that is running an application program has no direct relationship to the identity used for authorisation checks.

This article shows how MQ can be configured to have LDAP refer to users and groups defined in a Windows Active Directory system, even though the queue manager is not running on Windows. In this exercise, the whole environment was running on AWS, using an Amazon-provided AD server, removing the need to configure such a server myself.

MQ supports the use of any LDAP server, including AD and IBM Directory Server, but one thing I was particularly interested in testing here was the use of "standard" Windows accounts in AD, instead of simply using it as a generic LDAP server, where users and groups may be defined in a separate part of the directory tree.

The architecture

AWS Directory Service is used to run an Active Directory instance.

A Windows 2012 server image which is used to administer the directory, including definitions of users and groups.

A Linux image where MQ is installed and runs the queue manager. This was also where MQ applications were run for these tests. I used the packer JSON files and procedures shown in Arthur's article as the starting point for this image.

Configuring the directory service

All that was needed was to provide a domain name ('mq.hursley.ibm.com') and a password for the administrator. You may want to set a VPC and some subnet information to control networking, but I left this to default. Once created, two IP addresses are available which are needed for access to the directory.

Configuring Windows

When I created the Windows 2012 instance, I did not initially join the domain. I left that until later. Once the instance is running, and you have logged on as the administrator, then you need to enable the feature that allows you to administer the directory.

Next, the system must be configured to point at the DNS addresses created for the directory service:

Finally, we can join the domain:

Administering AD from Windows

The "Administrative Tools" panel should now contain an item for "Active Directory Users and Computers". That gives a UI for inspecting, defining and modifying users and groups. Much of MQ's LDAP configuration relies on knowing details of the directory schema and where objects are located in the directory. This is most easily done by selecting the "Advanced Features" of the program as it turns on some additional items when you drill into directory entries.

In particular you can now use the Attribute Editor to see all of the attributes of an entry, including the full Distinguished Name and the field names associated with other data. As well as the DN, you need to identify a field in the entry that can be used as a unique short identifier for the user. Looking at the properties of this entry, it seems that employeeID might be a good candidate here. The short name is used in MQ to fill in the 12 character MQMD UserIdentifier field, and will also appear in the output of commands such as DISPLAY CONN, showing who is using a queue manager.

Users and groups

I created several users and groups to which I could then grant different levels of access.

The first new user, mqmldap, is intended to be used by the queue manager for its connection to the directory. This user does no MQ work itself, and is purely there to be able to search the directory for identities. Once it has been created and a password set (you will need to know the password later; I used "MQpassw0rd"), the user needs to be granted read authority on the directory. No further authorities are needed here.

I also created an MQUser group and an MQAdmin group, along with a few test ids that were made members of one or both of these groups.

Configuring MQ

Once the directory is configured, with suitable identities, we can turn to the queue manager configuration. The MQSC command that defines the LDAP connection gives several sets of information (a sketch of such a command appears after this list):

How to connect to the directory (CONNAME, LDAPUSER, LDAPPWD)

How to find users and extract the shortname (BASEDNU, CLASSUSR, USRFIELD, SHORTUSR)

How to find groups (BASEDNG, CLASSGRP, GRPFIELD)

How to discover which groups a user belongs to (AUTHORMD, FINDGRP)
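Since the original command is not reproduced here, the following is a minimal sketch of what such an AUTHINFO definition might look like for this Active Directory setup. The connection addresses and object name are placeholders, and the attribute choices (user, sAMAccountName, employeeID, group, cn, memberOf) are assumptions that should be checked against your own directory schema:

DEFINE AUTHINFO('USE.AD') AUTHTYPE(IDPWLDAP) +
       CONNAME('10.0.1.10(389),10.0.1.11(389)') +
       LDAPUSER('CN=mqmldap,CN=Users,DC=mq,DC=hursley,DC=ibm,DC=com') +
       LDAPPWD('MQpassw0rd') SECCOMM(NO) +
       BASEDNU('CN=Users,DC=mq,DC=hursley,DC=ibm,DC=com') +
       CLASSUSR('user') USRFIELD('sAMAccountName') SHORTUSR('employeeID') +
       BASEDNG('CN=Users,DC=mq,DC=hursley,DC=ibm,DC=com') +
       CLASSGRP('group') GRPFIELD('cn') +
       AUTHORMD(SEARCHUSR) FINDGRP('memberOf')

ALTER QMGR CONNAUTH('USE.AD')
REFRESH SECURITY TYPE(CONNAUTH)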

It is not a mistake that the BASEDNG and BASEDNU in this definition have the same value - in AD, both users and groups can be found in the "Users" part of the tree. You may configure AD with more layers of folders between the root and the actual user or group record; that will work as long as MQ can still find unique values lower in the tree; only the highest level container needs to be supplied.

Using the attribute editor for the entities on Windows made it easy to find the elements that make up this MQ command. The DISPLAY QMSTATUS ALL command should show that the queue manager has successfully connected to the LDAP server.

Setting and checking authority

We can now set authority for users and groups using the setmqaut or SET AUTHREC commands. The full distinguished name for the group or user does not need to be used; instead the queue manager uses the information in the AUTHINFO object to search for and derive the DN. For example, I can use commands like the following (the queue name is illustrative; 'met' is my short user name, and MQUser is one of the groups created earlier)
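setmqaut -m QMLDAP -t qmgr -p met +connect
setmqaut -m QMLDAP -t queue -n APP.QUEUE -g MQUser +put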

to grant myself connect authority to this queue manager, and all members of a group put authority on a queue. If I then run dmpmqaut, I will see records referring to the complete Distinguished Name instead of the shortened 'met'. More examples of the different syntax options for specifying users and groups can be found in the Knowledge Center. If you get an "invalid principal" error, then there will be more information in the queue manager error log.

MQ Administrators

When MQ is using LDAP for its authorisation model, OS users do not get administrative authority on the queue manager simply by virtue of being in the local mqm group. Only the user who starts the queue manager has that authority automatically. The mqm group membership is used only to control starting, stopping and deleting the queue manager. Instead, every operation such as defining a queue or altering a channel is processed by the OAM to see if someone has suitable authority. To simplify the task of having a group of MQ administrators, without needing to share an account, a script is provided in the product that will grant full permissions to a specific group. Running

/opt/mqm/samp/bin/amqauthg.sh QMLDAP MQAdmin

executes all of the setmqaut commands needed to give members of the MQAdmin AD group full administrative control over MQ objects. As it is a script, you can easily change it, for example to create "read-only" administrators. The script does not grant "message" authorities to this group, so they cannot put or get messages. Again, that's something you may choose to add.

Running MQ programs

The userid on the operating system can now be irrelevant. If my applications use the userid/password feature during the MQCONN, it is that provided userid - resolved to a DN - that will be used for all authorisation, and for filling in the MQMD context. (If the application does not provide a userid/password, and it is not enforced by the AUTHINFO rule, then the OS userid is mapped to the DN as if it were the SHORTUSR, so it is not required that applications be updated, but it is recommended.) The common sample programs allow userids to be given to demonstrate the process. For example:
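One way to do this is sketched below; the queue name is illustrative, and setting the MQSAMP_USER_ID environment variable makes the updated samples prompt for that user's password:

export MQSAMP_USER_ID=met
./amqsput APP.QUEUE QMLDAP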

This also of course works with client connections, such as with amqsputc.

Limitations

There are two main limitations with this configuration:

Amazon does not support SSL/TLS communication to the directory server. That means that you need to ensure the network configuration is suitably secure, and cannot be snooped on. From an MQ configuration perspective, this means that the AUTHINFO object's SECCOMM attribute must be set to NO. Of course, if you are running the AD service yourself, then TLS would be recommended.

Active Directory does not support complex authentication configurations when using the LDAP bind APIs. MQ is not calling Windows APIs, just the standard LDAP APIs. So that rules out configurations such as authenticating through cross-forest trust relationships.

And remember that all explicit use of LDAP for MQ authentication and authorisation, regardless of which directory implementation is being used, must have a way to return a unique short name for the user. Depending on your directory schema there may or may not be a convenient field already in existence.

Summary

In this article I've shown how you can use a Windows AD configuration to provide user and group definitions directly to a non-Windows queue manager. This can simplify provisioning of operating systems and queue managers, with no need to create new OS identities.

zEDC (z Enterprise Data Compression) provides hardware-assisted compression for extended-format BSAM and QSAM data sets. In an MQ subsystem, this means that archive data sets are eligible for compression.

What benefits might I see?

There are a number of benefits which using zEDC to compress MQ archive logs may bring:

Reduced storage occupancy of archive volumes, meaning more archives can be stored on the same number of 3390 volumes. The compressibility of the messages logged will be the largest factor in the archive data set size reduction.

Reduced load on the IO subsystem, which in a constrained subsystem could improve response rates on other volumes.

In our tests with dual logs and dual archives, the I/O subsystem's cache was seeing increased disk fast write bypass (DFWBP) on the control unit used by both log copy 2 and the archive volumes. Enabling archive log compression reduced the response times of the I/O to log copy 2, with DFWBP becoming 0, which manifested in a 45% improvement in peak throughput with large messages.

What impact might I see?

The process of compressing, or indeed attempting to compress, the MQ archive logs will result in an increase in queue manager TCB costs. For a queue manager running a dedicated persistent workload with the intent of driving the MQ log process to its limit for a range of message sizes, we observed the following queue manager TCB cost increases for the zEDC-enabled measurements:

Message Size                                    4KB    32KB   1MB    4MB
Increase in QM TCB over non-zEDC measurement    +7%    +26%   +90%   +100%
Increase in peak throughput                     1%     2%     30%    45%

Note that some of the increase in queue manager TCB is associated with the increased log capacity rate, but there is some additional cost on MVS from the allocation / releasing of the target compression buffer plus some costs in setting up the call to the zEDC hardware.

Reading the MQ archive data sets, such as when the ‘RECOVER CFSTRUCT’ command was issued, was impacted when the archives were compressed using zEDC. This impact took the form of a reduced read rate coupled with an increase in queue manager TCB costs for decompressing the data.

The following table summarizes the results of a ‘RECOVER CFSTRUCT’ command resulting in the recovery of 4GB of data.

                           Uncompressed Archives   Archives compressed using zEDC
Recovery Rate (MB/sec)     88                      27
Cost per MB (CPU ms)       860                     1900

The numbers in the table are based upon an MQ V9 queue manager with 4GB of data stored on shared queues, with the data offloaded to SMDS. For the purposes of the backup and recovery tests, the queue manager is configured with single logs and single archives. The data on the queues is highly compressible. The costs are based on measurements running on a z13 (2964-703).

When MQ is writing archive logs at a high rate, the RMF PCIE report indicated that the single configured zEDC processor for the logical partition was running up to 65% utilized when compressing the dual archive logs for a single MQ queue manager. This peak usage occurred when the message was incompressible. With highly compressible messages at the peak rate, the zEDC processor was 50% utilized.

The PCIe I/O drawer in which the zEDC Express feature is installed can support up to 8 features, with each feature able to be shared across 16 logical partitions. Sufficient zEDC features should be available to avoid impacting other users of the feature.

Within this document, sections 4.2 to 4.5 were of particular interest as they discuss DB2 log archive data sets.

For our testing purposes we already had a storage group for the MQ archive data sets, so we defined a data class named ZEDC, specifying ZR (zEDC Required) for compaction, and a data set name type of EXTENDED.

We also defined a storage group MQZEDC that was based on the existing MQMARCH storage group and added a similar number of volumes to the group.

Note, that the zEDC feature was already enabled on the test system.

Gotchas:

The initial set up did not specify a value of EXTENDED for the data set name type – as a result, measurements showed similar sizes for the archive and log data sets, indicating either that the data was incompressible or that no compression was attempted.

Subsequent review of the PCIE XML report produced by program ERBRMFPP indicated the zEDC processor was not being used.

The PCIE report can be generated by specifying ‘REPORTS(PCIE)’ and viewing the contents of the XRPTS DD card, which contains data generated in XML format.

Where else might zEDC benefit MQ?

zEDC offers a number of other compression capabilities that can benefit MQ:

- Channel compression using ZLIBFAST. The performance benefits are discussed in performance report MP1J, found at

We have recently released an MQSC syntax highlighter for the Atom text editor. The highlighter is easily installed through the Atom package manager and adds basic syntax highlighting to help with development of MQSC scripts. An example of a highlighted MQSC script is shown below:
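Since the original screenshot is not reproduced here, the following illustrative MQSC script (the object names are made up) shows the kind of content the package highlights:

DEFINE QLOCAL('APP.QUEUE') +
       DESCR('Application input queue') +
       MAXDEPTH(5000) REPLACE
DEFINE CHANNEL('APP.SVRCONN') CHLTYPE(SVRCONN) TRPTYPE(TCP)
ALTER QMGR DEADQ('SYSTEM.DEAD.LETTER.QUEUE')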

How to install the package

To install the package, go to File->Settings (Windows), Atom->Preferences (Mac) or Edit->Preferences (Linux) on your toolbar and then select Install. In the search bar type in language-ibm-mqsc and install the package released by ibm-messaging. Atom will then download the necessary files and install it.

Once this has finished you should see that any files which end in .mqsc are automatically highlighted. If you want a file that does not end in .mqsc to be highlighted with the MQSC syntax highlighting, then you can specify it by pressing Ctrl+Shift+L and searching for MQSC.

I have pushed to GitHub the first of a series of samples showing how DRBD(R) can be used to replicate data for an IBM(R) MQ queue manager in various Disaster Recovery (DR) and High Availability (HA) scenarios.

This first sample shows how DRBD can be used to replicate data between Amazon Web Services (AWS) availability zones. It sets up a DRBD cluster of instances running in three availability zones within a region and shows how a queue manager can be moved from one availability zone to another without losing persistent messages.

Something that crops up in PMRs every so often is customers reporting that managed transfers fail, with one or more transfer items in the managed transfer reporting the error:

BFGIO0341E: The rename of temporary file <destination filename>.part to <destination filename> failed because the temporary file does not exist.

In this blog post, we'll look at why this error can occur, and how to prevent it.

How a Destination Agent uses temporary files

By default, when a managed file transfer takes place, the Destination Agent will perform the following steps:

Create a temporary file, called <destination filename>.part.

Lock the temporary file.

Write file data into the temporary file, when it is received from the Source Agent.

Unlock the temporary file after all of the file data has been received and written out.

Rename the temporary file from <destination filename>.part to <destination filename>.

If a managed transfer goes into recovery, then it is possible for the Destination Agent to create temporary files called <destination filename>.part<number>. The Destination Agent will then write the file data to this file instead of the one called <destination filename>.part.

Why a BFGIO0341E error can occur

A BFGIO0341E error will be generated if the Destination Agent attempts to rename the temporary file, only to find that the file is no longer there. Here is the typical scenario that can cause this:

A "staging directory" has been set up on the target file system.

An external process is configured to monitor the "staging directory", and move any files that it finds to a new location.

The Destination Agent creates and locks the temporary file <destination filename>.part in the "staging directory".

The Destination Agent writes file data into the temporary file.

After all of the file data has been written to the temporary file, the Destination Agent unlocks the file.

The external process finds the temporary file, and moves it to the new location.

The Destination Agent attempts to rename the temporary file, and finds that it is no longer there. As a result, the transfer item is marked as "Failed" with a BFGIO0341E error.

How to prevent the error from happening

There are two ways to prevent the BFGIO0341E error from occurring:

Temporary files written by a Destination Agent always end with the .part or .part<number> suffix. If the external process can be configured to ignore those files rather than moving them, then the files will still exist in the target directory when the Destination Agent performs the rename operation.

An alternative approach would be to configure the Destination Agent so that it does not use temporary files, and writes directly to the destination file. The destination file will only be unlocked once all of the file data has been written to it, at which point it can be picked up by the external process. To configure the Destination Agent to write directly to the destination file, set the agent property doNotUseTempOutputFile=true. More information about this property can be found in the "agent.properties file" topic in the MQ section of IBM Knowledge Center.
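The property is a one-line addition to the destination agent's agent.properties file (the file's location depends on your configuration data directory):

# In the destination agent's agent.properties file
doNotUseTempOutputFile=true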

I hope this has given you some insight into how a Destination Agent uses temporary files, and why the BFGIO0341E error can occur. If you have any questions, feel free to ask them as comments to this post, and I'll be happy to answer them.

I've been working with a customer recently who had a number of web-based enterprise applications running inside of WebSphere Application Server that connected to MQ. They wanted to implement the reconnection logic described in the technote called Using WebSphere MQ automatic client reconnection with the WebSphere MQ classes for JMS to allow those applications to reconnect to a queue manager in the event of the queue manager becoming unavailable.

The example code in the technote would drive the reconnection logic for all JMSExceptions that were thrown. The customer wanted to know if there was a subset of MQ reason codes which just represented "connection broken" or "queue manager unavailable" errors that their application should check for, which is a very good question!

Level 3 have done some investigation, and here is a list of reason codes that indicate that a queue manager is no longer available or cannot be reached:

Most JMSExceptions that are thrown back to enterprise applications will contain a linked MQException which holds the reason code. To implement the retry logic for the reason codes listed above, enterprise applications should check this linked exception using code similar to the example shown below:
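Since the technote's list of reason codes is not reproduced above, the codes in this sketch are common "connection broken" and "queue manager unavailable" reason codes shown purely as illustrations; substitute the list from the technote in real code:

import javax.jms.JMSException;
import com.ibm.mq.MQException;
import com.ibm.mq.constants.CMQC;

public class ReconnectCheck {
    // Returns true if the JMSException indicates that the queue manager
    // is no longer available, so the application should drive its retry logic.
    public static boolean shouldReconnect(JMSException jmsEx) {
        Throwable linked = jmsEx.getLinkedException();
        if (linked instanceof MQException) {
            switch (((MQException) linked).reasonCode) {
                case CMQC.MQRC_CONNECTION_BROKEN:    // 2009
                case CMQC.MQRC_Q_MGR_NOT_AVAILABLE:  // 2059
                case CMQC.MQRC_Q_MGR_QUIESCING:      // 2161
                case CMQC.MQRC_Q_MGR_STOPPING:       // 2162
                case CMQC.MQRC_CONNECTION_QUIESCING: // 2202
                case CMQC.MQRC_CONNECTION_STOPPING:  // 2203
                    return true;
                default:
                    return false;
            }
        }
        return false;
    }
}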

In Arthur’s previous proof of concept, he set up an auto scaling group using Amazon’s Elastic File System (EFS) to provide a High Availability and Disaster recovery solution for IBM MQ. In this proof of concept, we will be taking the same AWS setup and modifying it to use a Ceph storage cluster as our storage system instead of EFS. Although here we will be using AWS as our cloud provider, in theory the topics covered here could be applied to any cloud system.

Ceph is an open source clustered storage solution; it is not tied to any particular cloud provider and so can be used with AWS, OpenStack, Bluemix etc. Although this proof of concept uses Ceph, we will not be discussing how to set up a Ceph storage cluster; instead we assume you already have a storage cluster set up and you’re now configuring AWS and MQ to use it. I also assume that you have read Arthur’s previous blog post and that you understand the concepts covered there. I will not be explaining auto scaling groups or the configuration here in detail – just the differences and problems I had to solve in order to convert Arthur’s example to work with Ceph.

I’ve included links to the Cloud Formation template and scripts that I used to create this proof of concept. It should be noted that you may not be able to take the scripts and run them without first modifying them to use your Ceph storage cluster. You can find the example template and configuration files on this github gist.

The Ceph Storage Cluster

The diagram below shows the storage cluster that I had set up prior to creating the auto scaling group.

My Ceph Cluster contained six machines across three different zones. Each of the machines contained one Ceph object store daemon (OSD); in addition, one of the machines also ran the Ceph monitor (which controls the Ceph OSDs) and the Ceph Dashboard webserver. The Ceph dashboard was a handy GUI webserver I found, written in python, that could be used to provide simple statistics about my Ceph cluster. It was useful for seeing the status of the 6 nodes, and also what effect “destroying” one of the nodes, to simulate a sudden loss of a machine or zone, had on the cluster.

When choosing whether to use the Ceph file system or Ceph block storage, I opted to use Ceph block storage for this proof of concept. The reason for this was that, for the initial investigation into using Ceph with MQ, I wanted to make sure that the file system underneath MQ was one that I knew would work with MQ. Using block storage, I would have control over the type of file system created on the storage and so could choose a file system that I knew was supported; with CephFS the file system is provided for you, and so additional investigation would be needed to ensure that it worked with MQ. Although we are using block storage in this example, it is likely we will be re-visiting Ceph in the future and trying CephFS with MQ.

Creating the MQ with Ceph Image

Once the storage cluster was up, I created the image that would be used on the EC2 instances. This image would need the following installed on it in order to be usable:

IBM MQ

Ceph Client (to connect into the Ceph storage Cluster)

Two configuration scripts that will be executed by the auto scaling group upon instance creation:

The config.mqsc script is the same as in Arthur’s proof of concept and simply configures the Queue Manager ready for client applications to connect in via the PASSWORD.SVRCONN channel. The aws-configure-mq-ceph.sh script however is a new script and is split into 4 main sections:

First, the script uses the AWS command line to retrieve two Ceph configuration files from S3. These files are needed by the Ceph client to connect to the Ceph storage cluster; they are uploaded to S3 ahead of running the script.

Second, we use the configuration files to connect to the Ceph Cluster and ask it to allocate storage for our MQ data root (if it doesn’t exist already), giving it a supplied label. We then ‘connect’ the storage disk to our instance ready to be used.

Thirdly, we attempt to mount the storage device. If that fails, we assume it is because no file system exists on it, so we create a file system that MQ supports. Once created, we mount the file system in the /var/mqm directory.

Finally, with the file system mounted under /var/mqm we then create the MQ Data root on it before creating and starting a Queue Manager (unless it already exists). We then run the configuration script against this Queue Manager to configure it for Client connections.
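Putting those four steps together, here is a minimal sketch of what aws-configure-mq-ceph.sh might look like. The S3 bucket name, RBD image name, file system type and queue manager name are all assumptions for illustration:

#!/bin/bash
# 1. Fetch the Ceph client configuration files from S3 (bucket name is hypothetical)
aws s3 cp s3://my-mq-ceph-bucket/ceph.conf /etc/ceph/ceph.conf
aws s3 cp s3://my-mq-ceph-bucket/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring

# 2. Allocate block storage for the MQ data root if it does not already exist, then map it
rbd info mqdata || rbd create mqdata --size 10240
DEVICE=$(rbd map mqdata)

# 3. Mount the device; if that fails, assume there is no file system yet and create one
mount "$DEVICE" /var/mqm || { mkfs -t ext4 "$DEVICE" && mount "$DEVICE" /var/mqm; }

# 4. Create and start the queue manager if needed, then configure it for client connections
crtmqm QM1 || true
strmqm QM1
runmqsc QM1 < config.mqsc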

Explaining the Cloud Formation template

Once I had the image I needed to create the EC2 instances, I created the Cloud Formation template. I used Arthur’s version as a baseline but removed all of the sections on creating and configuring the EFS volume, which obviously was not going to be used in this proof of concept. I also altered the UserData to pass in the new parameters, and added those new parameters as requirements to run the Cloud Formation template. Ceph requires additional ports (6789 and also the port range 6800 – 7100), so I opened those in the Security group.

The final change I made revolved around solving the problem of “How do I get the required Ceph files to the Ceph client?”. A Ceph client requires two files in order to connect to the storage cluster: A configuration file to tell it where the storage cluster is and a key file to allow it to connect. The solution I chose was to use AWS S3 storage and IAM Roles to control access.

The S3 bucket (with its two objects) was created before running the Cloud Formation template; the IAM Role and policy (permissions) to access the S3 bucket are created in the Cloud Formation template. By applying an IAM Role with policies to an instance, we are able to connect to the S3 bucket via the AWS CLI without having to supply usernames, passwords or access keys. This allowed the instance, on creation, to retrieve the files it needed for the Ceph client to connect to the storage cluster.

Once all of the necessary changes were made, I ran the template in AWS to create the stack. Once created, the auto scaling group and instance worked the same as in Arthur’s blog; the only difference in operation was that the MQ Data Root was mounted on a file system stored in a Ceph Cluster as opposed to Amazon’s EFS.

Conclusion

In conclusion, this proof of concept was designed to show that MQ can be used with Ceph, which I’m happy to say it can. Because Ceph is accessed over a network, your Ceph Cluster could be anywhere, as long as AWS can access it; in theory Ceph could work on any platform, whether on-prem or cloud. Although there is still some investigation to perform before you select Ceph as your storage cluster technology of choice (for example performance, cost, etc), it is good to know that Ceph can work with MQ, and that it is viable as a storage technology you can use to replicate MQ data across multiple availability zones in your HA infrastructure.

Keeping the demo simple

For the purposes of this demonstration, I stayed away from running any actual code on AWS. That avoided having to deploy programs into images, or configure much security. The queue manager runs on my local Linux workstation, as does the grafana service. The data is pushed to, and retrieved from, CloudWatch running in AWS but that's all. In reality, someone using CloudWatch is likely to want to include the monitoring collection program in an EC2 image to run alongside the queue manager in that instance. And that in turn will require a security role to be configured with that instance, in a similar way to how Arthur has described using CloudWatch to collect MQ error logs.

Security configuration

For my simple setup, I had to have accessible credentials so that I could write to CloudWatch. Since the MQ service runs under the mqm account, I put a copy of a credentials file containing the AWS access key and secret key into ~mqm/.aws/credentials. That file may also contain the user's default AWS region (eg "us-west-2") for connection; if not provided there, the region can be given as a parameter on the mq_aws command line. That same file was also put under the home directory of the grafana userid so it can use CloudWatch as a data source.
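A minimal sketch of such a credentials file (the values are placeholders):

[default]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
region = us-west-2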

The MQ program

All of the MQ programs for monitoring are in the same github repository. Cloning that repository gives copies of all of the MQ source code, configuration scripts, and example Grafana dashboards. You will need to get a copy of an error-logging package using

go get -u github.com/Sirupsen/logrus

To build the CloudWatch collector, mq_aws, you also need the AWS client packages:

go get -u github.com/aws/aws-sdk-go/service

You can then build the collection program with

go build -o mq_aws cmd/mq_aws/*.go

Just as for the other monitoring programs, configure the queue manager through the MQSC and shell scripts in that directory, so that the main program is started as an MQ Service.

Looking at CloudWatch data

Once the monitoring program is running correctly (you can look for errors in /var/mqm/errors/aws.out if you keep the same MQSC script), the metrics are sent to CloudWatch.

When you log into AWS and go to the CloudWatch Management Console you can see them in the Custom Metrics section, under the IBM/MQ namespace:

From there, metrics can be selected. You can see two filters, one for the queue-related metrics (which of course also includes the queue manager name as a field) and one for the queue-manager-wide metrics:

Selecting some of these metrics allows creation of a dashboard in CloudWatch.

Using Grafana

The CloudWatch dashboard is not very sophisticated graphically, but it does at least allow graphs to be created and shown. Better as a visualisation interface is Grafana. And for that, I created a dashboard similar to that used for all the other databases in this series of articles. The biggest difference is that there does not seem to be a way to directly query using wildcards for the queue or queue manager names ("dimensions" in CloudWatch terms); using Grafana templates might be a way to deal with that. The queries used in my sample dashboard explicitly name the queues to be monitored. Also, refresh times are set to be much less frequent because of the rates used in CloudWatch, and the pricing model that charges for queries and updates.

Summary

This series of articles has shown how MQ V9 can feed data into a variety of monitoring solutions, including those commonly-used for cloud deployments of queue managers.

I would welcome feedback on these tools. Please leave any feedback here, or in the GitHub issue tracker, whether bugs, enhancements, or thoughts on the value of the monitoring.

In this final article on MQ logging and metrics, I will explain how you can send MQ usage metrics to the Bluemix Logmet service and visualise them with the Grafana web front end. We'll be using the code from Mark Taylor's article to extract MQ V9 statistics from the queue manager using his MQ Go client, and combining it with collectd, an application for collecting system statistics, which will forward the metrics on to Logmet. collectd has various input and output plugins. We will configure it with Mark's MQ Go client as an input, and Logmet's multi-tenant lumberjack client as an output.

Configuring collectd

The first thing we need is to install collectd and the Logmet lumberjack output plugin. Download and install the latest version of collectd for your platform from https://collectd.org/download.shtml

The Bluemix article describes how to configure the collectd lumberjack plugin collectd-write-mtlumberjack with your Bluemix access token and logging token. This is much the same process that we used when configuring Logstash, issuing a cURL command to retrieve your access_token and logging_token:

The returned tokens need to be added to a collectd configuration file in /etc/collectd/collectd.conf.d. When you installed collectd-write-mtlumberjack it will have installed a sample configuration file in /etc/collectd/collectd.conf.d called mt-metrics-writer.conf.sample. Rename this to mt-metrics-writer.conf and edit it to have your credentials. Remember to uncomment the necessary lines. The configuration file should look similar to this:

collectd also needs to be told, via its exec plugin, to invoke mq_coll.sh with QM1 as the only input parameter. You can set QM1 to the name of any version 9 queue manager you wish to monitor. A minimal sketch of that stanza is shown below.
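(The user and script path here are assumptions; adjust them to wherever you placed mq_coll.sh.)

LoadPlugin exec
<Plugin exec>
  Exec "mqm" "/usr/local/bin/mqgo/mq_coll.sh" "QM1"
</Plugin>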

The final step is to tell collectd about the data types mq_coll will produce. This is done by providing a file listing the different types, in collectd's types.db format. If you cloned the MQ Go Git repository earlier you will find a file called <repo-root>/cmd/mq_coll/mqtypes.db. Put a copy of this file in /usr/local/bin/mqgo/mqtypes.db.
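The file can then be registered with collectd's TypesDB option in a .conf file. Note that specifying TypesDB replaces the default list, so the default types.db should be named as well; the paths here assume a typical install:

TypesDB "/usr/share/collectd/types.db" "/usr/local/bin/mqgo/mqtypes.db"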

Restart collectd (sudo service collectd restart) to pick up the new configuration files. You can check that mq_coll is running using ps. By default collectd does not write logs to a file, but you can configure it to do so by adding the LogFile plugin to a .conf file in /etc/collectd/collectd.conf.d/

Once collectd is running it will automatically start pushing MQ metrics to Logmet via the lumberjack plugin. To see the metrics we will create a Grafana dashboard.

Grafana

Before we do that we need a quick introduction to Graphite and Grafana. Graphite is an open source technology for storing metrics and statistics. It performs a similar job for metrics and stats that Elasticsearch does for log records. Grafana is a web front end for Graphite which allows you to query and visualise the data stored in Graphite.

Navigate to Logmet and choose Grafana rather than Kibana. You will be presented with the default metrics dashboard:

There is a useful overview of Grafana in this Bluemix article. We will recap some of the key features here.

Grafana divides the screen into rows - you can see 3 defined on the default dashboard - and then lets you put one or more panels onto each row. You can add rows, remove rows, and change their heights. There are 3 different types of panel that can be added to a row: graph; singlestat; text.

Graph

As you would expect, a graph panel lets you plot one or more data series on a graph that you can customise and format however you wish. The default dashboard has a sample graph on the bottom row, with a random number generator as the data source.

Singlestat

This is used to display the value of a particular metric. It is useful for displaying meter-like values such as the current depth of a queue, or the current rate at which messages are being put. The statistic displayed can be the current, live value of a metric, or a function of the value over time, such as an average or sum.

Text

You can use the text panel to display textual information, including hyperlinks to related docs and material that a user of the dashboard might need.

Rows are edited by hovering over the slideout green icon to the left of each row:

To check that MQ statistics are being delivered to Graphite correctly, add a new graph to one of the rows. Grafana will automatically resize the other panels on that row to accommodate it. Click on the title of the new graph and a menu will appear - choose Edit to add data to the graph:

Once in the edit view, add a data series to the graph by selecting "Add Query". Metrics are listed in hierarchies. Selecting the first level in the hierarchy causes a new box to appear where you select the next level in the hierarchy, and so on. The first level in the hierarchy is the GraphitePrefix you entered in the collectd configuration file. Clicking on the first "select metric" box in the query should list your Bluemix space ID as an option. Select it, then choose "localhost" as the next level, and "qmgr-<QMNAME>" in the next. The final box should list all of the metrics available for that queue manager:

You can scroll through the list to choose the metric you wish to display. Because the list is repopulated from Graphite every time it may take a while to load. To add multiple data series to the graph, add more queries and select the metric to display for each of them.

singlestat panels are populated in a similar way, choosing the metric you wish to render in the panel.

Once you have edited and configured several panels you can view a snapshot of your queue manager network in a single dashboard:

Note: Make sure you have saved your dashboard using the save icon at the top right of the dashboard.

You can create multiple dashboards and switch between them. Here I've created a dashboard with no graphs on to display some basic system health:

Note: It's possible to show data for more than 1 queue manager by selecting a wildcard when you configure the metric to show, as shown below:

There are several options for centralizing your MQ error logs when running on Amazon Web Services (AWS). For example, you could use the AWS ElasticSearch service, and forward logs just like Matthew Whitehead discussed in his recent blog entry. You could also forward your logs to a third-party log service, such as Splunk, Loggly, Elastic Cloud, or the IBM Bluemix Logmet service. Of course, you could use Amazon's own service, CloudWatch. Although CloudWatch is arguably less powerful and feature-rich than many of the alternatives, it is easy to set up, and working with CloudWatch is far better than not centralizing your logs at all. In this blog entry, we'll take you through the simple steps required to centralize your MQ error logs.

Authorizing your EC2 instance for CloudWatch

The first thing you need to do is to make sure that your EC2 instances are authorized to talk to the CloudWatch service. This is done by creating a policy in the Identity and Access Management (IAM) service, and then assigning that policy to a role. Roles can then be assigned to your EC2 instances. AWS provides certain pre-canned policies, but they offer very coarse-grained access to CloudWatch. The following is a more fine-grained example policy:
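As the policy JSON from the original post isn't reproduced here, the following sketch shows the shape such a policy might take; the actions listed are the standard CloudWatch Logs permissions the agent needs, and the resource pattern is an assumption you should tighten for your own account:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:DescribeLogStreams",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}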

Once you've created a new role with the appropriate policies, you can then assign it to any EC2 instances you want to use with CloudWatch.

Sending error logs to CloudWatch

CloudWatch requires a CloudWatch agent to be installed on your EC2 instance. This is a small Python program which will monitor log files, identify separate log entries, and then send those entries to CloudWatch. After you've installed the CloudWatch agent, you can add the necessary configuration for the MQ error logs:
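A sketch of that configuration, matching the description below, might look like this; the log group names are made up, and the exact datetime_format should be verified against your own error log entries:

[/var/mqm/errors/AMQERR01.LOG]
file = /var/mqm/errors/AMQERR01.LOG
log_group_name = /mq/system-errors
log_stream_name = {instance_id}
datetime_format = %m/%d/%Y %I:%M:%S %p
buffer_duration = 5000

[/var/mqm/qmgrs/QM1/errors/AMQERR01.LOG]
file = /var/mqm/qmgrs/QM1/errors/AMQERR01.LOG
log_group_name = /mq/qmgr/QM1
log_stream_name = {instance_id}
datetime_format = %m/%d/%Y %I:%M:%S %p
buffer_duration = 5000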

This configuration can either be written to the main CloudWatch configuration file in /var/awslogs/etc/awslogs.conf, or to a standalone file in /var/awslogs/etc/config/mq.conf. The configuration identifies two files to watch: the main MQ error log, and the error log for queue manager QM1. It configures the date/time format, tells the agent to send any pending log messages to CloudWatch at least every five seconds, and write them to a "log stream" based on the EC2 instance ID.

After writing the above configuration, you will need to restart the CloudWatch service:

$ sudo service awslogs restart

You can then view the logs in the AWS management console (under CloudWatch -> Logs):

CloudWatch allows you to search individual log streams (per instance), and set up filters and alarms for certain events. The search facilities offered aren't anywhere near as advanced as (say) ElasticSearch, but you can choose to forward log events on to ElasticSearch if you want to. Importantly, you can also trigger AWS Lambda functions based on certain messages being logged, which will allow you to take programmatic action.

Recently I was contacted with a query on how to access the data in Java, specifically for the "CPU/QMgrSummary" topic - some of the "default" assumptions about how to interpret data in PCF response messages don't apply in this context.

As this area was new to me, I did some rapid scanning of the C sample, wrote some simple Java programs to explore the topic tree and put together a basic sample.

I really need to do a more detailed examination of this mechanism and write a comprehensive sample, equivalent to the C sample, but in the meantime here's a summary of what I found and what I produced.

NOTE: This is probably based on incomplete understanding, so I reserve the right to come back and edit this post or produce new and improved posts on the subject :-)

The Crucial Point

The most important thing to understand here is that the resource usage statistics topics are dynamically defined.

The topics are structured into classes (for example "CPU", "DISK") within which there are types (for example "CPU" currently has types called "SystemSummary" and "QMgrSummary").

Within a class and type there will be various usage statistics published, for example the "CPU/QMgrSummary" topic currently has statistics such as "User CPU", "System CPU", "RAM used".

Note my use of "currently" above - this can change over time. So we do not want to hard code much...

The Topics are organised into a self-describing form, under a hierarchy based at "$SYS/MQ/INFO/QMGR/%s/Monitor/METADATA/", where "%s" is the Queue Manager name.

A Quick Look At The Metadata

A good way to get an initial understanding of the hierarchy, apart from by perusing the C sample program, is by running it in debug mode (use the "-d 1" option).

Here's a small part of the output of such a run showing the data relating to "CPU/QMgrSummary" :-

We see from the "CLASSES" topic that the "CPU" class is class 0, DISK is class 1 etc...

From the "CPU/TYPES" topic we see that "CPU/SystemSummary" is class 0, type 0 while "CPU/QMgrSummary" is class 0, type 1.

Moving on to "CPU/QMgrSummary", this is where the actual statistics are defined - for each statistic we have five pieces of information :-

The class identifier - because these are CPU statistics, the class is 0.

The type identifier - because these are "CPU/QMgrSummary" statistics the type is 1.

An identifier for the statistic - User CPU time has identifier 0, System CPU time has identifier 1, and so on.

The datatype of the statistic. All statistics are presented as 32-bit or 64-bit integers (PCF type MQCFIN or MQCFIN64) but may need to be interpreted in various ways.
In the current case User and System CPU time have datatype 10000 (MQIAMO_MONITOR_PERCENT) which means they will be a value from 0 - 10000, which needs dividing by 100 to give a percentage figure to 2 decimal places, while RAM has datatype 1048576 (MQIAMO_MONITOR_MB) which means it represents a figure in megabytes.

A description for the statistic. (Note: There is internationalization available - this example has taken the default hierarchy, but there are locale-specific versions - for now, see the C sample for details).

Messages published to a topic of this type (defining statistics for a class and type) will contain (not necessarily in this order) :-

A parameter (MQCFIN) for the class (MQIAMO_MONITOR_CLASS).

A parameter (MQCFIN) for the type (MQIAMO_MONITOR_TYPE).

A parameter group (MQCFGR) for each statistic (MQGACF_MONITOR_ELEMENT).
Within the group will be

A parameter (MQCFIN) containing the identifier for the parameter (MQIAMO_MONITOR_ELEMENT).

A parameter (MQCFIN) containing the datatype for the parameter (MQIAMO_MONITOR_DATATYPE).

A parameter (MQCFST) containing the description of the parameter (MQCAMO_MONITOR_DESC).

We can save this information away and then use it to interpret messages actually published to the statistic topic.

Here's a possible data structure for storing statistic definitions for a specific class and type :-
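As the original code isn't reproduced here, the following is one hedged possibility in Java: it holds the class and type identifiers, plus a map from each element identifier to its datatype and description, matching the metadata message layout described above:

import java.util.HashMap;
import java.util.Map;

public class MonitorTypeDefinition {
    // Values from the MQIAMO_MONITOR_CLASS and MQIAMO_MONITOR_TYPE parameters
    public final int monitorClass;
    public final int monitorType;

    // One entry per MQGACF_MONITOR_ELEMENT group in the metadata message
    public static class Element {
        public final int datatype;        // MQIAMO_MONITOR_DATATYPE value
        public final String description;  // MQCAMO_MONITOR_DESC value

        public Element(int datatype, String description) {
            this.datatype = datatype;
            this.description = description;
        }
    }

    // Keyed by the MQIAMO_MONITOR_ELEMENT identifier
    public final Map<Integer, Element> elements = new HashMap<>();

    public MonitorTypeDefinition(int monitorClass, int monitorType) {
        this.monitorClass = monitorClass;
        this.monitorType = monitorType;
    }
}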

Introduction

In part 2 we saw how you can use OpenStack’s Heat templates to quickly deploy Queue Managers with your specific configuration needs. However, once you have created this Queue Manager, how do you quickly share its connection information with client applications? When Queue Managers are deployed dynamically depending on need, and you are unable to predict or control the IP address that will be associated with the Queue Manager, or you have existing Queue Managers outside of OpenStack, the client channel definition table (CCDT) file could be a viable solution for you. In this blog we will explore the CCDT solution using the new MQ v9 feature - the CCDT URL.

Prerequisites

Before starting this blog, you will need to have completed part 1 (to understand how to create an MQ image) and part 2 (to become familiar with using heat templates to deploy MQ). You should also ensure that you are familiar with both MQ CCDT files (what they are and how they are used) and also the new CCDT URL feature. You can get a good summary of both from this blog post.

Additionally, because part 1 created an MQ v8 image, you will also need to create an MQ v9 image. The reason for this is that the CCDT URL feature in MQ Clients is only available in version 9 of MQ. You will need to create a new image that contains the client packages and also uses the MQ 9 drivers; you can use the same steps and scripts as before, but you will need to replace the MQ_URL and MQ_PACKAGES variables in the packer script with:

You should also head over to GitHub and download the sample code. The code is split into two folders: Part 1 and Part 2. These folders contain the files you will need for the relevant sections below.

Finally, in this blog I will be using the OpenStack command line tools, so you will need to make sure that both the OpenStack client and also the Heat client are installed on your machine. You can install both by using the pip install commands:

pip install python-openstackclient
pip install python-heatclient

Part 1 - Setting up the CCDT Host

Before we start registering newly created Queue Managers, we need somewhere to register and host the CCDT file. As the CCDT file is a binary file you have 2 choices for creating, deleting or modifying entries within it: “runmqsc -n” or by using a Queue Manager. In this example we will be using a Queue Manager to create and manage the CCDT file that will be used by a simple python HTTP Server. We will use a single heat stack to create and manage the CCDT server and its required resources, so first you should grab a copy of the following three files and familiarize yourself with what they do:

CCDT_Setup.yaml – This file is the heat template that we will be using; like the previous files it requires 4 parameters to run, but will set up a full OpenStack network and the instance that runs the MQ Queue Manager and HTTP Server.

CCDT_Setup.mqsc - This file is passed into runmqsc during the creation of the instance to configure the Queue Manager ready for other Queue Managers to connect to it and register their connection details.

createCCDT.sh – This file is used to create, start and configure the Queue Manager that will manage the CCDT file. It is also used to set up the HTTP Server that will host the file so the clients can obtain connection details.

Once you’re happy with the files you should run them in the same way that you created the stack in part 2. Alternatively, you can use the OpenStack client to create the stack instead of the standalone heat client. Either run:
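For example (the stack name CCDT matches the later "openstack stack output show" command; the four parameter names are placeholders to replace with the ones your template declares):

heat stack-create -f CCDT_Setup.yaml -P "param1=value1;param2=value2;param3=value3;param4=value4" CCDT

or

openstack stack create -t CCDT_Setup.yaml --parameter param1=value1 --parameter param2=value2 --parameter param3=value3 --parameter param4=value4 CCDT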

Once the stack has finished creating, you will have a CCDT hosting server to which Queue Managers can connect and register their connection details. In the next part we will create a Queue Manager and register it with the CCDT server we just created. The HTTP Server on this machine simply hosts the CCDT file, which will be retrieved by the MQ v9 clients via an HTTP GET request. To make sure that you don't have to keep updating the hosted file, and that you don't host the entire MQ data root, the heat template creates a symlink between the CCDT file and the HTTP hosting directory.

Part 2 – Queue Manager creation and registering

With the CCDT Server up and running we can now create our Queue Manager. Although there are a few possible ways to do the “registering”, in this example we will be using a heat template that creates and configures the Queue Manager, and then registers its connection details with the CCDT Server.

Because we are using the client version of runmqsc you must ensure that the MQSeriesClient.rpm package has been installed on the image you use in this section.

To get started you will need to grab the following 4 files and familiarize yourself with them:

Create_QM.yaml – This is the heat template we will be using, it requires the same 4 parameters as CCDT_Setup.yaml but also requires one extra parameter. The new parameter required is the IP Address of the CCDT Server you created in Part 1, you can obtain this easily by executing “openstack stack output show CCDT CCDT_ip_public” and looking at the value returned in “output_value”.

createQM.sh – This shell script is used to create, start and configure the Queue Manager before executing runmqsc -c to register with the CCDT Server.

QM_Setup.mqsc - This file is passed into runmqsc during the creation of the instance to configure the Queue Manager ready for client connections.

AddCCDT.mqsc – This file is used by the createQM.sh script; it is passed into the runmqsc -c call to register the Queue Manager with the CCDT Server. It contains a single MQSC definition, which creates the client connection channel definition.
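As a hedged illustration of what that single definition might look like (the address is a placeholder for the new Queue Manager's IP, and the channel name matches the PASSWORD.SVRCONN channel used earlier):

DEFINE CHANNEL('PASSWORD.SVRCONN') CHLTYPE(CLNTCONN) +
       TRPTYPE(TCP) CONNAME('<QM IP>(1414)') QMNAME('QM1') REPLACE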

Once you’re happy with the files you should run them in the same way as you ran the previous heat templates. Either run:
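For example (again with placeholder parameter names, plus the CCDT Server address retrieved from the first stack):

openstack stack create -t Create_QM.yaml --parameter param1=value1 --parameter param2=value2 --parameter param3=value3 --parameter param4=value4 --parameter ccdt_ip=<CCDT IP> QM1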

Once the stack has completed you should check the logs to make sure that the Queue Manager has been created and you can see the runmqsc -c connection creating the Client connection channel. Provided both of those tasks have completed successfully then MQ v9 client applications will be able to connect to the Queue manager using the new CCDT URL feature. In the next (and final) section we will show an example of how you can use the MQ sample applications to test this.

Part 3 – Using the hosted CCDT in client applications

To test the CCDT hosting server we will be using the sample programs “amqsputc” and “amqsgetc”. These are available with the MQ installation so long as you installed the MQSeriesSamples.rpm package; as these are client applications you also need to ensure you have installed the MQSeriesClient.rpm package.

First create an instance on a network that can access both the CCDT Server and the Queue Manager; you can do this either by hand using the GUI or command line, or create a heat template to automate it for you. Once the instance has been created, log on to it and change to the mqm user before navigating to the directory “/opt/mqm/samp/bin” (this assumes that MQ has been installed in the default location).

If you look in this directory you should see several sample applications ready to be executed including the two we are going to run. Before you run them though, you need to set an environment variable to tell the application where it needs to get the CCDT file from. Execute the following, replacing <CCDT IP> with the IP address of your CCDT Server:

export MQCCDTURL=http://<CCDT IP>:80/AMQCLCHL.TAB

This environment variable is recognised by MQ Clients to alert them to use the CCDT file located in this location. It is important that you do not forget the http:// at the front as this is required to tell MQ how to obtain the file. If you do not specify it then MQ will attempt to obtain the file locally, instead of via HTTP. Once the environment variable is set we can use amqsputc to put messages to the Queue on the Queue Manager we created in section 2. Execute the following and place a few messages onto the queue before hitting return/enter twice to quit out of the program:

./amqsputc LOCAL.QUEUE QM1

Before you retrieve the messages, you can prove to yourself that they really do exist on the Queue Manager by logging on to its box and querying the CURDEPTH of the local Queue, making sure it shows the number of messages you put.
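For example, a one-liner using runmqsc on the Queue Manager's machine (queue and queue manager names as created earlier):

echo "DISPLAY QLOCAL('LOCAL.QUEUE') CURDEPTH" | runmqsc QM1

When you are ready to retrieve the messages, execute the following: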

./amqsgetc LOCAL.QUEUE QM1

Conclusion

This blog post focused on a possible concept of tackling passing Queue Manager connection information to a Client application using existing features. As a concept it should be modified and adapted to meet your requirements but can be used as a starting point.

Following on from my previous article Storing and searching MQ error logs in Elasticsearch, this post will help you achieve everything described in that article but without running your own Elasticsearch or Kibana server. I will give an overview of the logging service built in to IBM Bluemix - Logmet - and demonstrate how with a few changes to the Logstash configuration from the previous article, MQ logs can be stored in the IBM Logmet service and visualised with Kibana.

Logmet

Logmet is Bluemix's logging service. It is closely integrated with the Bluemix Container and Bluemix VM services, making it easy to push logs from containers and VMs running in Bluemix to a central logging service, ready for querying and visualising with the built in Kibana front-end.

Like generic Elasticsearch stacks, Logmet supports applications and services pushing their own logs to it using a Logstash agent. Logmet only supports the Lumberjack protocol as an input, so the Logstash agent must have a Lumberjack output plugin configured instead of the elasticsearch output plugin we used in the previous article. However, Logmet uses a slightly modified version of the Lumberjack protocol which adds a multi-tenant token to the payload. This means that it isn't possible to use the standard Logstash agent. To remedy this, Logmet has provided a custom Logstash agent that supports the multi-tenant version of the protocol. It still uses the same Logstash pipeline of input, filter, and output, so most of the configuration we created in the previous article can be re-used. In this Logmet blog post more details are given about downloading the custom Logmet agent and configuring it with your own Bluemix credentials.

Logstash Configuration

Once Logmet's Logstash agent is installed, copy the configuration files from the previous article (00-input.conf, 20-amqerr.conf, and 99-output.conf) into the Logstash config directory. If you are using the same config directory that Logstash uses by default (/etc/logstash/conf.d/) you can skip this step. Then modify the files ready for use with Logmet:

00-input.conf

No changes are needed to the input configuration. The agent will still read from /var/mqm/qmgrs/QM1/errors/AMQERR01.LOG

20-amqerr.conf

Again no changes are needed because the format of AMQERR01.LOG is the same, and we want to parse it in the same way.

99-output.conf

This is where we need to replace the elasticsearch output with a mtlumberjack output. Remove the elasticsearch output configuration:

Logmet requires the Logstash agent to present a tenant_id and to authenticate itself with a tenant_password. You can retrieve both by using a simple cURL request to the Logmet login URL. The tenant_id is your Bluemix space_id, and the tenant_password is your Bluemix logging_token. To retrieve them issue the following cURL request:
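The exact request is described in the Logmet documentation; it is along these lines, where the user, password, organization and space values are placeholders for your own Bluemix details:

curl -X POST -d "user=<bluemix_user>&passwd=<bluemix_password>&space=<space_name>&organization=<org_name>" https://logmet.ng.bluemix.net/login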

Note: Curl should use the system's certificate store to verify the identity of the logmet.ng.bluemix.net service. If your system's CA store does not contain the relevant DigiCert CA certificate, you can either add the correct DigiCert certificate or extract the Logmet certificate itself and add it to your certificate store.

Note: You can use the same tenant_id and tenant_password on multiple systems, so if you have several queue managers running in different places you can monitor all of their logs from one central Kibana dashboard.

With those changes made you can start the Logstash agent and it should start to read from the MQ error log and push log records to the Logmet service. It will also write the parsed log records to /tmp/debug-filters.json as before. To see your logs in Kibana, log in to Logmet using the same space and organisation that you used to generate your logging token. Make sure you choose Kibana instead of Grafana.

Kibana

Here is what you should see once you've logged in to Logmet:

Note: Logmet currently uses Kibana v3 so these screenshots look a little different to the screenshots in the previous article which were made using Kibana v4.5.

As in the previous article, we can tidy up the fields that are displayed in the list so that the only fields being shown are the AMQ error code, the error description, the process name and the queue manager name:

Finally, by opening the query tab at the top of the page, the results can be filtered so only the log entries we are interested in are displayed:

Graphite and Grafana

Graphite is an open source technology for storing metrics and statistics. It performs a similar job for metrics and stats that Elasticsearch does for log records. Grafana is a web front end for Graphite, allowing you to query and visualise the data stored in Graphite, in the same way that Kibana allows you to query and visualise data in Elasticsearch.

As well as offering Elasticsearch and Kibana, the Bluemix Logmet service also provides Graphite and Grafana for storing and querying metrics. Logmet uses the same multi-tenant token and credentials for Graphite as it does for Elasticsearch.

In the next article I'll show you how you can push MQ usage metrics into Logmet and visualise them with Grafana.

An earlier blog entry showed how to integrate MQ with the Prometheus database, capturing statistics that can then be shown in a Grafana dashboard. In this article, I'll show how that initial work has been extended to work with more databases and collection tools.

The MQ programs

All of the MQ programs discussed here are in the same github repository. Cloning that repository gives copies of all of the MQ source code, configuration scripts, and example Grafana dashboards. You will need to get a copy of an error-logging package using

go get -u github.com/Sirupsen/logrus

The main change from the original Prometheus-only solution has been to separate common code into a Go package (mqmetric), leaving a relatively small amount of monitor-specific code to handle configuration and how to write data to the database. The monitor programs are in individual subdirectories of the cmd tree - mq_coll, mq_influx, mq_opentsdb and mq_prometheus - and can be compiled with the go build command. README files in each subdirectory show any special configuration needed.

InfluxDB and OpenTSDB

Both InfluxDB and OpenTSDB are examples of time-series databases. Entries in the databases consist of a timestamp, metric name, value, and optional tags. The MQ metrics are created with a tag of the queue manager name, and (for the queue-specific elements) the object name. Those tags enable queries to be made that show the current status for sets of resources.

Configuring the monitor program

The main programs are all intended to be started from a shell script. The script itself is also included in the source tree. Put the script in a suitable directory (for example, /usr/local/bin/mqgo) and make sure it is executable by the mqm id. Edit the script to provide parameters for the main program. Required parameters include the address of the database, any userid and password information required to connect to that database, and the list of queue names which you want to monitor. In these scripts, the password is given as a hardcoded string, but you would probably want to change that to extract it from some other hidden file.

The monitor always collects all of the available queue manager-wide metrics. It can also be configured to collect statistics for specific sets of queues. The sets of queues can be given either directly on the command line with the -ibmmq.monitoredQueues flag, or put into a separate file which is also named on the command line, with the -ibmmq.monitoredQueuesFile flag. An example is included in the startup shell script. For example, an invocation like the following (the queue patterns are illustrative, and the other required parameters are omitted):
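mq_influx -ibmmq.monitoredQueues="APPA*,APPB*" ...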

starts the monitor to collect the statistics for all queues whose names begin with APPA or APPB.

For InfluxDB, the name of the database into which the data will be written is also required. You may want to create an MQ-specific store to keep these statistics separate from others. No database name is needed for OpenTSDB; the data is written to the single store.

Configuring MQ

An MQSC script, which defines the shell script as an MQ service, is included in the monitor's source directory. Edit the MQSC file to point at the directory where you have set up the script that will be run, and then apply it through runmqsc.
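As a hedged sketch of the shape of that definition (the service name, script path, and log locations here are examples, not the exact contents of the supplied file):

* sketch only - names and paths are examples
DEFINE SERVICE(MQMONITOR) +
       CONTROL(QMGR) +
       SERVTYPE(SERVER) +
       STARTCMD('/usr/local/bin/mqgo/mq_influx.sh') +
       STDOUT('/var/mqm/errors/mq_influx.out') +
       STDERR('/var/mqm/errors/mq_influx.err')
START SERVICE(MQMONITOR)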

The monitor should start immediately, and on every subsequent queue manager restart. If there are problems, they should be reported in the output of the service program, which is sent to a location defined in the MQSC script.

Grafana queries

One key visible difference between the time-series databases is how queries are constructed in Grafana, and how the content of the legend is created. The principles are similar, but the query editor varies slightly depending on the datasource, and the way in which tags are exposed.

This picture shows the Grafana query editor for InfluxDB, selecting data about queues that match the "APP.*" pattern. The selected field name (mqget) is the metric as reported from MQ, and it is then given an alias of MQGET. Using the alias in the SELECT component allows us to reference $col in the ALIAS BY line. Similarly, using tag(object) in the GROUP BY line allows use of $tag_object in the ALIAS BY line. Combining these gives a legend for all of the lines shown in the graph, MQGET: APP.0 MQGET: APP.1 and so on, without needing to explicitly name each queue.

The Grafana editor for data held in OpenTSDB, extracting exactly the same information, looks like this:

Collectd

The integration with collectd follows a different pattern from the integration with the previously-discussed databases. Collectd acts more like a controller and router, calling plugins to ask for metrics for their components, and calling other plugins that write the metrics to a range of data stores. The data store might be something as trivial as a flat file, or it might be another database. The metric-providing plugins (MQ in this case) are unaware of how the data is forwarded and stored, analogous to a publish/subscribe system.

Configuring collectd

The collectd system needs to be told two things: how to invoke the MQ collector, and which metrics MQ provides. A short configuration file, provided in the github tree as mq.conf, can be dropped into the collectd configuration directory (typically /etc/collectd.d) for this. It causes collectd's exec interface to call the shell script - make sure the configuration file is pointing at the correct directory - and loads the list of metrics generated by MQ. The shell script also has to be modified, similarly to the previous examples, to give the list of queues to be monitored, and again to point at the correct directory. The MQ program simply writes data to stdout, where the metrics are collated and sent to whichever backend has already been configured in collectd. Once the configuration file has been updated and put in place, restart collectd so that it reads the new configuration.
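A minimal sketch of what that configuration might contain (the script path and metric-types file name are assumptions, not the exact contents of mq.conf):

LoadPlugin exec
<Plugin exec>
  # run the MQ collector script as the mqm user
  Exec "mqm" "/usr/local/bin/mqgo/mq_coll.sh"
</Plugin>
# tell collectd about the MQ-specific metric types
TypesDB "/etc/collectd.d/mq.types.db"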

On my system, I had set up collectd to route data to Graphite, another database. Grafana knows how to query information from Graphite and so adding it as a further datasource allows formatting of the MQ statistics.

Here is a query similar to the one shown above, this time pulling data from Graphite:

The default configuration for graphite on my collectd system adds the string "collectd" at both the front and back of the machine hostname; that seems a bit much but it does make it clear where the data is coming from. It is visible in this query string. Another difference from the other examples is that each data point with collectd contains the queue manager and queue name as part of the overall metric name instead of being tags on more generic data points. That means that there does not seem to be quite the same level of field replacement for the legend, at least when showing it in Grafana, but it is still usable. There are also constraints on where wildcards appear in the query, so this is showing information for all queues instead of just the "APP.*" names. There may be better filtering possible, but I could not find it easily in the Grafana query editor. Some of the output plugins for collectd may also support rewriting of the metric name to split out fields that can be used as tags.

There is no MQ configuration needed here; the collection program does not run as an MQ service as it is started and stopped under the control of collectd.

Summary

This picture brings together all the Grafana dashboards for the different collectors. The same metrics are shown in each dashboard.

From this, we can see how a variety of monitoring solutions can be used with MQ. This follows the philosophy that MQ has had for all its lifetime - it will not enforce a particular implementation of an aspect on you, whether that's operating system, programming language or management tool. Instead it tries to work with whatever you are already comfortable using. These open-source solutions are popular for monitoring other components of IT infrastructures, and it makes sense to integrate MQ into those same designs.

Elasticsearch - a search and analytics engine used here to store and index log data

Logstash - an agent that reads log records, processes them, and forwards them to Elasticsearch

Kibana - a browser-based visualisation tool for displaying and searching log data

This article assumes that you have installed your own Elasticsearch and Kibana server locally and are ready to send data to them.

We will focus on Logstash, how you can configure Logstash to store and process MQ logs (i.e. AMQERR*.LOG) in Elasticsearch, and how to use Kibana to view and search through the logs.

Sending logs to Elasticsearch

Logs are written into the Elasticsearch engine by Logstash. Most users will have multiple machines producing log records. Logstash needs to get the log records from those machines, process them to filter out unwanted data, and send them to Elasticsearch. Getting logs from each of your machines into Elasticsearch is done in one of two ways:

Sending the complete log to a central Logstash agent, where it is processed

Running a Logstash agent on each machine and processing the logs locally

In option 1, logs are sent unchanged to a remote Logstash agent. They are sent using something called the Logstash-forwarder, which has now been replaced by a project called Filebeat. Filebeat runs on each node where logs are produced and distributes them to a remote Logstash agent.

With option 2, logs are read and processed by a Logstash agent running on the machine where they are produced. Logstash processes the log records locally, using filters configured to manipulate the log records so they are more easily searchable, and then forwards the edited log records to Elasticsearch.

To send my MQ logs to Elasticsearch I've used option 2 but much of the configuration we'll use is applicable to option 1. A remote Logstash agent is configured in the same way as a local agent, although the inputs and outputs may differ based on how log files are received by it.

Logstash agent configuration

The Logstash processing pipeline has 3 stages: Inputs (the different ways it reads log records); Filters (sets of expressions, some of them similar in style to regex, that manipulate log records); Outputs (the different ways Logstash can output the edited logs). You can put all three into a single configuration file, or separate them out. Here I've created files 00-input.conf (containing all of my configured inputs), 20-amqerr.conf (detailing how MQ log records should be parsed), and 99-output.conf. Each of the files is put in the /etc/logstash/conf.d directory.

The input stage specifies the location of an MQ log file to read and gives it a type that can be used to identify it from other types of log in Elasticsearch. Setting start_position to "beginning" tells Logstash to read the file from the beginning instead of only processing new log entries:
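A minimal sketch of such an input stage in 00-input.conf (the path shown assumes a queue manager named QM1):

input {
  file {
    # read the MQ error log; "beginning" processes existing entries too
    path => "/var/mqm/qmgrs/QM1/errors/AMQERR01.LOG"
    type => "mq_errorlog"
    start_position => "beginning"
  }
}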

Note: The Logstash agent runs under the user 'logstash' so you will need to make sure it has permission to read the MQ error logs (which are -rw-rw----). You could do this by adding the logstash user to the mqm group, but doing so would give the logstash user full administrative rights to MQ. Alternatively there is a queue manager config option (http://www-01.ibm.com/support/docview.wss?uid=swg21228976) that reverts MQ to the previous behaviour of making error logs world-readable.

The filter stage breaks down each MQ log entry read from AMQERR01.LOG into specific fields. The multiline filter is used to ensure Logstash treats MQ's multi-line log entries as a single record: it looks for the line of dashes that delimits every MQ log entry and treats that as the record boundary. It is combined with the mutate filter to remove the dashes and newlines from the log record. The grok filter is Logstash's mechanism for parsing log files with arbitrary formats, and is used twice in the filter. It breaks out the individual fields in the log entry, allowing them to be indexed and searched by Elasticsearch. Firstly it extracts the date and time information from the MQ log, so that the Elasticsearch timestamp is based on the time MQ wrote the log record, not the time Logstash read it. Secondly it is used to extract the remaining key fields from the log record, such as the queue manager name and VRMF. Later on, when you configure the Kibana dashboard, you can search and filter on these fields, for example by choosing to show only queue managers at a certain level.
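The full filter is more involved than there is space for here, but a simplified sketch of its shape in 20-amqerr.conf (the patterns shown are placeholders, not the real expressions) is:

filter {
  if [type] == "mq_errorlog" {
    # join lines into one record, delimited by the row of dashes
    multiline { pattern => "^-+$" negate => true what => "next" }
    # remove the dashes and newlines from the combined record
    mutate { gsub => [ "message", "-{5,}", "", "message", "\n", " " ] }
    # placeholder pattern: the real file extracts timestamp, qmgr, VRMF, etc.
    grok { match => { "message" => "%{DATE:date} %{TIME:time} %{GREEDYDATA:text}" } }
  }
}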

Thanks go to Paul Mandell for providing most of the MQ filter code. Different filters can give very different performance characteristics so when writing filters be wary of how complicated you make them.

Finally, in the output stage I've added 2 outputs. Firstly I've used the Elasticsearch output to write the edited log records into Elasticsearch. Then I've added a second debug output which writes the edited records to a local file, which is useful for quickly testing changes to filters. It uses the JSON codec to write the log record as JSON to the file (see the example later on in the article).
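For illustration, such an output stage in 99-output.conf might look like this (the Elasticsearch host is an example):

output {
  # primary destination: the Elasticsearch index
  elasticsearch { hosts => ["localhost:9200"] }
  # secondary debug output, handy when testing filter changes
  file {
    path => "/tmp/debug-filters.json"
    codec => "json"
  }
}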

Note: If you find that records appear in /tmp/debug-filters.json but not in Elasticsearch or Kibana, check that the timestamps of your log entries are accurate and in the correct time zone. Elasticsearch is sensitive to time differences between systems.

Kibana

Now that we have MQ records being written to Elasticsearch we can use Kibana to view and search through them. By default Kibana will show you a list of records ordered by timestamp. The graph at the top of the page shows how many log entries were written at a particular time.

At the moment it looks quite messy because every log record and every field is included in the output.

We can change which fields Kibana shows by expanding one of the log records and selecting the fields to include. In my case I'm only going to show the AMQ error code, the error description, the MQ process name, and the queue manager name:

With that done the list looks a little cleaner:

Kibana also lets you search for records of a particular type. Let's pick two specific error conditions we might be interested in: MQ listeners being unable to start (AMQ9218) and sender channels missing a transmit queue (AMQ9531). By specifying the search term "errCode: AMQ9218|AMQ9531" in the search box Kibana returns only the records where the errCode matches the query.

Note also that you can select the time period to search on by clicking on the current time range and modifying it.

You can add and remove different components to the dashboard, showing various information about the log records stored in Elasticsearch. You can save dashboards once you're happy with their layout, and create various charts and graphs of your data. This article only touches on the basics but there is plenty more information on what's possible online.

The next article in this series will show you how to achieve everything above but without installing and running your own Elasticsearch/Kibana server. We will demonstrate how the Bluemix logging service Logmet can be used as a place to store and view MQ logs.

Amazon recently declared its Elastic File System (EFS) as ready for production. This enables a shared, networked file system, which (importantly) is replicated between multiple physical data centers (availability zones). On paper, this makes EFS a good candidate for running MQ in a highly available way. In this blog entry, I'll take you through our proof of concept (PoC) of running a single IBM MQ queue manager which can be automatically moved between availability zones in the case of a failure.

An EFS file system is scoped to a particular AWS region. You can create "mount targets" for VPC subnets in different availability zones within that region. Once the mount target has been created, EC2 instances in those subnets can successfully mount the file system using NFS v4. You can read more about EFS in the AWS EFS documentation.
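For example, mounting an EFS file system over NFS v4 from an EC2 instance looks something like this (the file system ID and region are placeholders):

sudo mount -t nfs4 -o nfsvers=4.1 \
    fs-12345678.efs.us-west-2.amazonaws.com:/ /var/mqm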

In our PoC, we used CloudFormation to run a single EC2 instance running MQ, as part of an Auto Scaling Group of one server. This ensures that if the MQ instance is determined to be unhealthy, then AWS will destroy the instance and replace it with a new one, connected back to the same file system. You can span multiple availability zones with an Auto Scaling Group. The Auto Scaling Group has a policy applied to ensure that there are only ever 0 or 1 instances available: during an update to the CloudFormation stack, the existing instance is always terminated before starting a new one.

When the MQ EC2 instance first boots, it mounts the file system as /var/mqm, and adds a rule to /etc/fstab to ensure that it gets mounted again if the instance were re-booted. If there's already data for a queue manager in the file system, then it sets up a systemd service to run the queue manager, and creates a dependency on the mount point being available. This systemd service will also ensure that the queue manager is restarted upon re-boot.
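As a hedged sketch, such a systemd unit has roughly this shape for a queue manager named QM1 (the PoC scripts generate the real one; this is just the idea):

[Unit]
Description=IBM MQ queue manager QM1
# only start once the shared file system is mounted
RequiresMountsFor=/var/mqm

[Service]
# strmqm returns once the queue manager is running in the background
Type=forking
User=mqm
ExecStart=/opt/mqm/bin/strmqm QM1
ExecStop=/opt/mqm/bin/endmqm -w QM1

[Install]
WantedBy=multi-user.target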

We also used an Elastic Load Balancer (ELB) to provide a single TCP/IP endpoint for MQ client applications to connect to. In some ways, an ELB is overkill here - alternatives include using an Elastic IP address which can be re-bound to a different EC2 instance, or using Route 53 to handle it via DNS. With the ELB, we can also add a health check, to ensure that MQ is listening on port 1414, and mark the instance as unhealthy if not. In addition, we added a health check to the instance which periodically runs dspmq to check that the queue manager is running. If it is ever found to be down, then the AWS command line interface is used to mark the instance as unhealthy. Any unhealthy instances will be terminated and replaced by the Auto Scaling Group.
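A sketch of what such a health check can look like (the queue manager name and region are assumptions; the instance ID comes from the EC2 instance metadata service):

#!/bin/bash
# run periodically, e.g. from cron
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
if ! dspmq -m QM1 | grep -q 'STATUS(Running)'; then
  # tell the Auto Scaling Group this instance is unhealthy
  aws autoscaling set-instance-health --region us-west-2 \
      --instance-id "$INSTANCE_ID" --health-status Unhealthy
fi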

Reproducing our PoC

If you'd like to try this out for yourself, then you can find all the code on GitHub. The PoC requires Packer to be installed on your local laptop or workstation.

1. Download the files from the Gist
2. Run packer build packer-mq-aws.json to build an AMI in the us-west-2 (Oregon) region. If you'd like to use a different region, you can edit the JSON file, making sure to also replace the source_ami with the equivalent RHEL 7.2 AMI in your chosen region. Note that, at the time of writing, EFS is not available in all regions.
3. Create a stack using the CloudFormation template cloudformation-mq-efs.template. This can be done through the AWS web console, or via the command line if you have the AWS CLI tools installed. For example, the following command line runs the CloudFormation stack in us-west-2 (Oregon). Be sure to replace or set the variables MY_KEY (to the name of an EC2 key pair for SSH) and MY_AMI as well:
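A hedged sketch of that command follows; the stack name and parameter keys here are assumptions, so check the template for the exact parameter names:

aws cloudformation create-stack --region us-west-2 \
    --stack-name mq-efs-poc \
    --template-body file://cloudformation-mq-efs.template \
    --capabilities CAPABILITY_IAM \
    --parameters ParameterKey=KeyName,ParameterValue=$MY_KEY \
                 ParameterKey=AmiId,ParameterValue=$MY_AMI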

The CloudFormation template includes many resources, including a VPC network, subnets, an Internet Gateway, the Auto Scaling Group and Launch Configuration, and an IAM role to enable the EC2 instances to report their health.

If you inspect the created resources, you will see an Auto Scaling Group with a single instance. (Running aws ec2 describe-images --owners self will also show the AMI that Packer built.) You have several options to test out the fail-over:

SSH into the instance and stop/kill the MQ queue manager (with user ec2-user). This will cause the local health-checking script to invoke the AWS CLI to mark the instance as unhealthy.

Terminate the instance entirely.

Mark the instance as unhealthy, either in the web console or on the command line.

Once the instance is marked as unhealthy, the AWS Auto Scaling Group will create a new one. Note that as the instance is in an otherwise-healthy availability zone, the instance may be re-created in the same zone. If you keep trying, though, AWS should eventually assign the instance to the secondary zone.

Note that if you want to connect to the queue manager using an MQ client, the supplied scripts set up a PASSWORD.SVRCONN channel, with a user of johndoe, and a password of passw0rd. It is, of course, recommended that you (at the very least) change this password, which can be found in the configure-mq-aws.sh script.

Next steps and conclusion

This is just a PoC, but so far, EFS seems to provide the right characteristics for running MQ. There is clearly more to do here, including comprehensive testing of fail-over under load, and performance testing. With this particular setup, the fail-over between zones seems to take between one and three minutes, but that's nothing to do with EFS, and everything to do with the fact that we're creating a brand new EC2 instance when the old one fails - alternative solutions might use multi-instance queue managers, or a pre-created standby EC2 instance. There's also some scope for better tuning of the health check grace periods, to ensure things return to "healthy" status as quickly as possible.

A fail-over time for a single-instance queue manager measured in a small number of minutes may well be enough for many people. Either way, with EFS it's relatively easy to set up high availability across multiple availability zones without having to run your own replicated storage subsystem, which is definitely a positive thing.

I was involved in an MQ in the cloud proof of concept. (I was impressed that we clicked the button and got 10 Linux images deployed with MQ in under 10 minutes.) I asked: how do I, as an MQ administrator, log on to these boxes without needing a password?

Fortunately this is relatively easy to do with SSH. I was running on Red Hat.

You can set up SSH to use public/private key pairs, so you do not need to use passwords.

The basic process is

Create your public and private keys using the ssh-keygen command. This creates a key pair: a public key and a private key.

Securely copy the public key to the remote system, into the user's ~/.ssh/authorized_keys file (append it if the file already exists). In my case this was /home/paice1/.ssh/authorized_keys. (See the sketch after these steps.)

From my Red Hat machine I typed in ssh -i ~/.ssh/id_rsa paice1@winlnxn7.hursley.ibm.com, where

-i gives the location of my private key; in my case it was ~/.ssh/id_rsa

paice1@winlnxn7.hursley.ibm.com is the userid and remote system name

Copy the ~/.ssh/authorized_keys file to all of the systems, and you can log on to them without needing a password.

It was not as trivial as I thought, as you have to make sure the permissions are correct. For example, ~/.ssh/authorized_keys needed permissions -rwx------; with other permissions the logon did not work. Check the permissions on the ~/.ssh directory as well as on the ~/.ssh/authorized_keys file.
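Putting the steps together, a sketch of the whole sequence, using the example user and host from above (the chmod values shown are the commonly recommended ones, which most sshd configurations insist on):

ssh-keygen -t rsa                     # creates ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub
ssh-copy-id -i ~/.ssh/id_rsa.pub paice1@winlnxn7.hursley.ibm.com
# on the remote system, make sure the permissions are tight enough
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
ssh -i ~/.ssh/id_rsa paice1@winlnxn7.hursley.ibm.com   # no password needed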

You can put the public keys of multiple people into the ~/.ssh/authorized_keys file, so they can all log on to the userid.

If you are in a cloud environment, you may want to mount the file systems containing these ~/.ssh directories read-only. If you need to change the people who can access these servers, you change the ~/.ssh/authorized_keys file.

I overheard some discussions about a customer who 'improved' their application and found that performance got worse. I'll explain what happened.

The symptom was that when the queue got deep, the CPU usage went up and the throughput went down. This was for a channel.

It is generally more efficient to do multiple puts or gets of persistent messages before doing the commit. For example, 10 puts followed by one commit needs one log I/O request, whereas 10 repetitions of (put, commit) needs 10 log I/O requests.
This can impact the application getting the messages, because the logic becomes: get the first message - can't, it is still within a unit of work - get the second message, and so on. The getting application has to examine many messages before it finds one it can process, or finds there are none available.

The 'improvement' to the application was to put 1000 messages before a commit. The putting application processed many more messages a second, but it had a major impact on the channel: throughput went down, and CPU cost went up.

Another example was a transaction that would get a message, and check the message content. If it was transferring more than 1 gazillion dollars then ask the end user for confirmation. When the end user confirmed it, the transaction committed the get.

The problem was the lunch break. The end users went to lunch from 1200 to 1300. This meant that by 1259 there were perhaps 30 or more messages in syncpoint, so every get-first had to check and skip over those 30 messages before getting one it could process.

By 1259 the CPU usage was high; at 1301, after the messages had been replied to, the CPU usage dropped back down.

So the lessons are

Having multiple persistent messages in syncpoint before a commit can improve the throughput of the putting thread, but it can impact the getting applications. Around 50 messages of under 10KB per commit is a good number.