SAP NETWEAVER MDM PERFORMANCE ANALYSIS ON RHEL 6.3

Background

One of our customers runs a set of business processes that span the following two SAP technical systems:

SAP NetWeaver BPM Enterprise Portal

SAP NetWeaver MDM server

At the beginning of 2014 a new initiative was created, which eventually became a new project for us, aimed basically at:

Adding and improving some of the features this end-to-end solution offers its user community

Moving the infrastructure to newer hardware that is more efficient in terms of power consumption and space optimization, in line with the global High Density Data Centers strategy

Continuing to improve the solution's stability and support according to the best practices recommended by SAP, in this case as the software vendor

The plan was to move each existing environment (i.e. sandpit, development, consolidation, etc.) supporting this solution according to the following general infrastructure approach:

Non-productive systems will be moved to virtualized servers

Productive systems will be moved to physical blade servers

In both cases, virtual and physical, a new OS layer based on Red Hat Enterprise Linux 6.3 (RHEL 6.3) would also be installed on the new servers as part of the move. The solution was initially running on Red Hat Enterprise Linux 5.5 (RHEL 5.5) platforms.

To accommodate that new OS release and comply with the SAP requirements published in the Product Availability Matrix (PAM), these moves also required upgrading the underlying Oracle databases. Altogether this would put us in the right position to take the plan a step further and technically upgrade the SAP products to SAP NetWeaver 7.31 and SAP NetWeaver MDM 7.1, including the latest support packages.

However, theory is theory and the real world is the real world. If you have been dealing with IT for some years, you know what this means.

Upon completion of the development environment moves and upgrades, the development team identified a noticeable loss of performance, although there was still some uncertainty at that time. It was not until the upgrade of the test environment that the performance issue was clearly confirmed, once the team began testing both current and new developments.

It was at this point that some initial actions were taken at the infrastructure level to try to improve performance. Among others:

Review and improve I/O speed

Increase hardware resources

Review OS parameters

Review Oracle parameters

Increase Oracle buffers

Activate Oracle HugePages according to the recommendation in SAP Note 1672954

Reorganize most accessed Oracle tables and indexes

There was a performance gain after all the above actions. Nonetheless, the overall response time of the upgraded system was still not acceptable, nor as good as the team expected.

So we got SAP support on board to help us on this issue.

In parallel, based on the development team's daily experience, the MDM server side was identified as a potential root cause of the issue. This definitely helped us, as SAP Basis experts, to narrow down the issue a bit.

They then developed a Java performance test that simulated how the SAP BPM portal called different methods of the SAP MDM API to connect to the SAP MDM server and extract information from it. It was designed to run completely independently of the SAP NetWeaver Java AS, so we could take the SAP BPM portal fully out of the picture. It also measured and displayed the response time of each of the most representative MDM API calls.

This performance test was of great help to us, as we will see next. Additionally, after several test executions and comparisons with production, i.e. our reference for what should be considered acceptable, it was confirmed that the issue came from the SAP MDM server side.

Based on this assumption, several other actions were taken to narrow down the real root cause a little more:

On one hand, we ruled out the virtualization layer as the culprit, since the same behavior occurred on non-virtualized (physical) servers

On the other, we also ruled out the Oracle and MDM 7.1 upgrades, since several performance test executions on the old RHEL 5.5 servers all ended with acceptable response times

Therefore, the immediate outcome of the above points clearly told us where the real root cause lay. It was not the virtualization layer, the SAP MDM patch level, nor the Oracle release; the issue had to be in the different RHEL releases.

SYSSTAT and SYSDIG to the rescue

We now knew something happened at OS level. But the billion-dollar question was… what could it be?

There are many other tools available on Linux, but we picked these two basically because:

SYSSTAT offers a very wide range of OS statistics in terms of CPU, I/O, memory, paging and networking

SYSDIG is a tool that combines strace, tcpdump and lsof in one, with very powerful, user-friendly filtering capabilities, and it is amazingly easy to extend with so-called chisels, which are nothing more than Lua scripts on top of the SYSDIG event stream that let you analyze it

Additionally, both tools offer a very easy way to run in the background while the process to be debugged is executing, so the trace data can be analyzed offline later on.
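To give an idea of how such a background capture could look, here is a minimal sketch. The file names, the sampling interval and the RUN_CAPTURE switch are our own illustrative choices, not the exact commands used in this exercise:

```shell
# Hypothetical capture sketch: start sysstat sampling and a sysdig trace in
# the background, then analyze both files offline once the test finishes.
# RUN_CAPTURE stays at 0 so the sketch is inert; set it to 1 on the server
# under test.
RUN_CAPTURE=0
if [ "$RUN_CAPTURE" -eq 1 ]; then
  # 10-second sysstat samples, written in binary for later replay with sar -f
  sar -o /tmp/perftest.sa 10 60 >/dev/null 2>&1 &
  # raw sysdig event capture; stop it with kill when the test ends
  sysdig -w /tmp/perftest.scap >/dev/null 2>&1 &
fi
```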

The idea was then simply to observe how the OS behaved while a performance test executed on the new RHEL 6.3 server. So we ran a test execution between 17:35 and 17:42 CET while gathering tracing information with the tools above.

We will show next how helpful these system tools can be, even for SAP administrators.

We started by looking at the overall performance KPIs gathered by the sadc daemon in terms of CPU usage, disk I/O, memory, paging and network throughput.
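As a sketch, these KPIs can be replayed offline with sar; the data file path below assumes the default /var/log/sa layout and is our own choice:

```shell
# Sketch: replay sadc binary data offline; inert if sysstat or today's
# data file is not present on the machine.
SA_FILE=/var/log/sa/sa$(date +%d)
if command -v sar >/dev/null 2>&1 && [ -r "$SA_FILE" ]; then
  sar -u -f "$SA_FILE"      # overall CPU usage (%user, %system, %idle, ...)
  sar -q -f "$SA_FILE"      # run queue length and load averages
  sar -b -f "$SA_FILE"      # disk I/O transfer rates
  sar -r -f "$SA_FILE"      # memory utilization
  sar -n DEV -f "$SA_FILE"  # network throughput per interface
fi
```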

Of these, the only one that indicated something potentially wrong was the overall CPU usage, as shown in figure 1. At first glance it did not reveal any overall CPU shortage, since the machine was steadily idling at around 90% of its overall compute power.

Figure 1

Just to confirm there was no CPU shortage, we plotted the run queue length in figure 2. Taking into account that the Linux server had 16 Intel Xeon cores, we could take a breath and conclude that the server was properly sized in terms of CPU (we would even dare to say it was oversized).

Figure 2

However, a quite noticeable fact in figure 1 was the relatively high percentage of CPU usage in kernel mode (%system) during the performance test execution. Zooming in makes this even clearer (figure 3). System CPU usage measures the portion of CPU time that OS processes and threads spend in kernel mode, mainly due to interrupt handling or system calls. In figure 3 we could also check whether there were any other CPU usage components not perceptible in figure 1 due to the graph scale. For instance, the portion of CPU time in user mode (%user) was quite low, and there was no CPU load from interrupt handling, either hard or soft (%irq and %soft).

Figure 3

Based on these inputs, we next focused on analyzing that apparently high kernel-mode CPU consumption. So we went a step deeper and plotted the percentage of system CPU usage split by physical core (figure 4). Recall that the server had 16 Intel Xeon cores.
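The per-core breakdown can be obtained from the same sadc data; a sketch with an assumed file path:

```shell
# Sketch: per-core CPU statistics, one block of %user/%system/%idle per
# core; inert if sysstat or the data file is absent.
SA_FILE=/var/log/sa/sa$(date +%d)
if command -v sar >/dev/null 2>&1 && [ -r "$SA_FILE" ]; then
  sar -P ALL -f "$SA_FILE"
fi
```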

Figure 4

Figure 4 showed that some of the cores were kept relatively busy executing in kernel mode during the performance test. As pointed out before, figure 3 showed no remarkable CPU load from interrupt handling, which made us suspect system calls as a potential cause.

Another question we asked ourselves was whether that CPU consumption was really caused by the SAP MDM server and not by some other process or processes. Looking at figures 3 and 4, we might think at first glance that the SAP MDM server (identified at OS level by the process name mds-r) was behind it, since we can clearly see the edges where the performance test started (~17:35) and ended (~17:42). However, to demonstrate that process mds-r really was the one, we used the SYSDIG trace together with the set of chisels provided by the tool.

So we started by running the chisel topprocs_cpu, which displays the top processes by CPU time during the time window the trace spans:

Note that this chisel includes the CPU time consumed by each OS process and all its child threads or tasks.
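Against a saved trace, the chisel is invoked as sketched below (the trace file name is our own hypothetical choice):

```shell
# Sketch: rank processes by CPU time over the whole trace window;
# inert when sysdig or the trace file is not available.
TRACE=/tmp/perftest.scap
if command -v sysdig >/dev/null 2>&1 && [ -r "$TRACE" ]; then
  sysdig -r "$TRACE" -c topprocs_cpu
fi
```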

We could immediately see that our assumption was essentially right: the kernel-mode CPU usage came from the SAP MDM server. But there was also an unexpected visitor, the OS process sh, i.e. the standard OS shell, which was the second most CPU-demanding process. Together they took almost 60% of the CPU scheduling time during the performance test execution.

The question now was… was this somehow related to the very poor response time of the SAP MDM server on RHEL 6.3? To answer that, we queried the SYSDIG trace again.

First of all, since the SAP MDM server (mds-r) is a multithreaded application, we figured out the PID and TIDs (thread IDs) of the mds-r processes during the MDM test execution:
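A query along these lines can extract the PID/TID pairs from the trace (trace file name assumed, as before):

```shell
# Sketch: unique PID/TID pairs of every mds-r thread seen in the trace;
# inert when sysdig or the trace file is not available.
TRACE=/tmp/perftest.scap
if command -v sysdig >/dev/null 2>&1 && [ -r "$TRACE" ]; then
  sysdig -r "$TRACE" -p "pid=%proc.pid tid=%thread.tid" proc.name=mds-r | sort -u
fi
```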

The SAP MDM server threads were spending quite some time cloning themselves to create new child processes (system call clone). But what was the purpose behind it? Let's see what mds-r was doing while calling clone.
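The clone activity can be isolated with a filter like the following sketch (trace file name assumed):

```shell
# Sketch: show only the clone events issued by mds-r threads;
# inert when sysdig or the trace file is not available.
TRACE=/tmp/perftest.scap
if command -v sysdig >/dev/null 2>&1 && [ -r "$TRACE" ]; then
  sysdig -r "$TRACE" "proc.name=mds-r and evt.type=clone"
fi
```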

First, the timing matched the start of the performance test execution (17:35), which meant the SAP MDM server started making these system calls as soon as it got work to do.

The first three highlighted events above, #1611796, #1611917 and #1611930, showed how a SAP MDM thread with TID=9203 (one of the child tasks of the root MDM process with PID=9135) cloned itself, resulting in a new process/task with TID=24208. Moreover, the same clone call pattern repeated over the whole performance test execution, even though we only showed the very first events. If you don't believe us, see figure 5, which shows the rate of new tasks created by the OS.

Figure 5

All these repeating clone system calls certainly had an impact on the high percentage of system CPU usage.

Next we took the newly cloned process with TID=24208 as a sample to see what happened once it was cloned and ready to run. We first got the raw time of event #1611930 in the SYSDIG trace:
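A single event can be pinpointed by its number; a sketch of such a query (trace file name assumed):

```shell
# Sketch: print the raw timestamp and details of one specific event;
# inert when sysdig or the trace file is not available.
TRACE=/tmp/perftest.scap
if command -v sysdig >/dev/null 2>&1 && [ -r "$TRACE" ]; then
  sysdig -r "$TRACE" -p "%evt.num %evt.rawtime %evt.type %evt.info" evt.num=1611930
fi
```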

Note that event #1612036 has been highlighted above, because it explains pretty well what the SAP MDM server was really doing behind the scenes:

The SAP MDM server thread with TID=9203 made a clone system call to create a new child task with TID=24208

The child task with TID=24208 then made an execve system call to execute the following OS command:

sh -c "/usr/sbin/uuidd -t >/dev/null 2>/dev/null"

Once again, this clone & execve call pattern repeated over and over during the performance test execution, which contributed to the number of newly created tasks shown in figure 5 and had a direct impact on the high percentage of system CPU usage.
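To get a feel for why this pattern is expensive, the hypothetical loop below reproduces its shape: every iteration forks a shell (clone), which then execs a binary (execve), the same per-call overhead the trace revealed. On Linux, /proc/sys/kernel/random/uuid serves here as a stand-in for uuidd:

```shell
# Each iteration pays for one clone (the forked sh) plus one execve (cat),
# mimicking the per-UUID "sh -c ..." pattern found in the trace.
n=0
while [ "$n" -lt 50 ]; do
  sh -c 'cat /proc/sys/kernel/random/uuid' >/dev/null 2>&1
  n=$((n + 1))
done
```

The cost grows linearly with the number of API calls, which is consistent with the sustained task-creation rate shown in figure 5.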

If you are still not convinced that processes mds-r and sh were related, we can go back to the SYSDIG trace and check whether there were events generated by a process named sh that had the SAP MDM server root process with PID=9135 among its ancestors. The answer was yes:
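The ancestry check can be expressed as a filter; a sketch (trace file name assumed):

```shell
# Sketch: events from any "sh" process whose ancestors include PID 9135,
# the root mds-r process; inert when sysdig or the trace is absent.
TRACE=/tmp/perftest.scap
if command -v sysdig >/dev/null 2>&1 && [ -r "$TRACE" ]; then
  sysdig -r "$TRACE" "proc.name=sh and proc.apid=9135"
fi
```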

Therefore we could conclude at this point that the SAP MDM server eventually spawned an OS shell to run the uuidd command and obtain a unique ID. The chain of system calls this generated, together with the fact that it repeated all along the performance test execution, had a direct impact on the high percentage of system CPU usage observed in figures 1, 3 and 4 above.

This was in fact a very important finding. The binary /usr/sbin/uuidd is provided by the RPM package uuidd.x86_64. Comparing RHEL 6.3 (where the SAP MDM server did not behave as expected) with RHEL 5.5, this RPM package was installed only on RHEL 6.3 and did not exist on RHEL 5.5. This led us to think that the presence of that RPM package somehow made the SAP MDM server misbehave. Unfortunately, the reason for the misbehavior could not be found in the OS traces; it is something to be asked of SAP as the software vendor. Nonetheless, after uninstalling the uuidd.x86_64 RPM package from the RHEL 6.3 server and running the performance test once again, we could prove our finding was fundamentally right, which in turn made managers and team members very happy, since we could go live with the new infrastructure without jeopardizing the project deadlines.
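For reference, the package check and removal can be sketched as follows; the removal line is commented out since it requires root and is RHEL-specific:

```shell
# Sketch: confirm which RPM package owns the binary; inert on systems
# without rpm or without the uuidd binary installed.
if command -v rpm >/dev/null 2>&1 && [ -e /usr/sbin/uuidd ]; then
  rpm -qf /usr/sbin/uuidd   # which package provides the binary
  rpm -q uuidd              # installed package version
fi
# Actual removal on the RHEL 6.3 server (run as root):
# yum remove uuidd
```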