Pythian - Data Experts Blog » Timur Akhmadeev

High System Time on Windows Running in a VM
Wed, 22 Oct 2014

Recently I’ve seen an issue with CPU usage on a server running Windows 2003 Server on VMware. This is a small virtual machine with just 2 cores allocated (which are possibly mapped to “threads” at the host level, but I don’t know the details). For some reason, very high System CPU time was reported in a Statspack report.
Here is what it looks like in a 1-hour Statspack report:

Note that the System CPU time is more than twice the User CPU time on average (remember that averages can sometimes be misleading). This caught my attention, as usual. Although the average CPU usage is not really high, this server is somewhat sluggish even for a one-hop RDP connection over the VPN.
I tried to find out what was going on. Since I’m not a Windows guy, I did not know what kind of tools could be used to track places in the OS kernel that take too much time. On Linux this is relatively easy, starting with strace/pstack/perf and other command-line tools. Windows is different.

I started to search for the available options, and the first thing you find is of course Perfmon, which lets you track and visualize different OS-related metrics (counters in Perfmon terminology) at the system, CPU, or process level. I used it to capture a few key metrics such as User Time, System Time (which is called Privileged Time on Windows), Queue Length and Context Switches per second. From a graph of the CPU usage the issue is visible:
Here the white line represents Privileged (or System) CPU, and the yellow line is Total CPU. It’s clear that almost all of the CPU used is accounted to the Privileged part.
By the way, it is very easy to see a similar picture in the standard Performance tab of Task Manager: just select View, then Show Kernel Times, and the Privileged part of the used CPU will be displayed in red.
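The same counters can also be captured from the command line with typeperf; a rough example (the counter paths may need adjusting for the OS version and language), sampling every 15 seconds for an hour into a CSV file:

typeperf "\Processor(_Total)\% Privileged Time" "\Processor(_Total)\% User Time" -si 15 -sc 240 -o priv_cpu.csv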

After that I searched for details on how to find out why Privileged CPU time is so high. A good article that I found is here. Although it is relatively old, it fits my case, as the OS is a 32-bit Windows 2003 Server. The article points to a tool called KernRate. This is a command-line tool with a very simple interface: you run it, wait for some time, and stop it with Ctrl-C. After that the tool prints a profile of kernel time by module. Here is what I saw:

The default output contains some basic information about the system, CPU usage, memory and context switching. The kernel module profile is the most interesting part here. It lists modules by their internal names along with the profile data: the number of times each module was found running during sampling, which is the most important piece of information. So in my case intelppm was the top running kernel module.
I searched again, now for intelppm, and found a few posts describing similar symptoms. Apparently intelppm is an Intel CPU power-management driver. Sometimes it causes issues such as BSOD or high CPU usage, especially if the VM is a clone and the CPU architecture changed in between. It was not clear whether this is something that can be disabled, but there were posts suggesting that stopping this service (which is not listed in Services) helped a few people. So I recommended that the client try to disable this driver with the following commands:

sc config intelppm start=disabled
sc stop intelppm

Theoretically this should disable the Intel CPU driver, and Windows should try to use another one if available. When we tried it, the second command (to stop the driver) failed with the following message:

[SC] ControlService FAILED 1052:
The requested control is not valid for this service.

So it is not possible to stop the driver online, and a Windows restart is necessary.
We restarted the VM. After that, the situation was a bit different: the CPU time was somewhat reduced, but the privileged part was still quite high, with hal (Hardware Abstraction Layer) on top instead of intelppm:

It looks like the change helped, although there is no clear sign of it on the OS level. This makes me think that the improvement in performance may be attributable to something else, such as the OS, the hypervisor, or a combination of them and Oracle. In any case, high system time is not good, and it usually indicates that something is wrong.

A Simple Way to Monitor Java in Linux
Thu, 03 Apr 2014

A quick and easy way to find out what inside a Java process is using your CPU, using just Linux command-line tools and JDK-supplied command-line utilities.

Introduction

Here are a few things you need to know before starting. Following the links is not necessary; they are available for reference.

There are different vendors of the Java Virtual Machine. This post is about Oracle’s JVM, which is called HotSpot, with Linux x86-64 as the OS platform. Most of the things about HotSpot apply to other vendors too, with slight changes. OSes other than Linux may add some more complications.

It’s called a Virtual Machine because it virtualizes the runtime environment for a Java application, so to know where to look you need to know a few things about how a specific VM is organized. For a detailed overview of HotSpot, please refer to this article.

On Linux, a thread inside the HotSpot VM maps directly to an OS-level thread. This may not be true on all OSes, but for modern Linux kernels it is. So every thread at the OS level corresponds to a thread inside the Java application.

There are generally two types of threads inside a HotSpot VM: native threads and application threads. Application threads are those that run Java code; this is usually what applications use to run their own code. Native threads run something that is not written in Java, usually C/C++ code, and usually all of these are special utility threads started by the VM itself.

Identifying Threads

Since a Java program may start many threads, each executing some program code, it is necessary to understand which threads are using the CPU. On Linux, top -H will show you CPU usage on a per-thread basis. Here is an example. First, a process which consumes CPU:
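Starting such a test program and watching it might look like this (RandomUser is the test class from the thread dump below, a class that simply generates random numbers in a tight loop; the exact invocation is an assumption):

# run the CPU-consuming test program in the background
java RandomUser &
# show per-thread CPU usage for that process
top -H -p $(pgrep -f RandomUser)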

So there is one execution thread inside the Java process which is constantly at the top and is using around 85% of a single core.

The next thing to know is: what is this thread doing? To answer that, we need two things: thread stacks from the Java process, and a way to map an OS-level thread to a Java thread. As I mentioned previously, there is a one-to-one mapping between OS- and Java-level threads in HotSpot running on Linux.

To map an OS-level thread to a Java thread in a thread dump, we need to convert the native thread ID from Linux to base 16 and search for “nid=$ID” in the stack traces. In our case the thread ID is 7601, which is 0x1db1, and the Java thread had the following stack trace at the time jstack was run:
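The mechanics of the lookup are just two commands; for example, with the thread from this post (locating the process via pgrep is an assumption):

# convert the decimal OS thread ID reported by top -H to hex
printf '0x%x\n' 7601                              # prints 0x1db1
# take a thread dump and show the matching Java thread with some context
jstack $(pgrep -f RandomUser) | grep -A 15 'nid=0x1db1'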

As you can see, the thread is executing the main method of the RandomUser class – at least at the time the thread dump was taken. If you would like to see how this changes over time, a simple watch command may help you get an idea of whether this thread’s stack changes frequently or not:

If a native thread ID has been supplied as $2, the script (sketched after this list) searches for the stack trace of that thread in the thread dump

Concatenates each thread stack trace into a comma-separated string

Aggregates strings and sorts them by the number of occurrences

Prettifies the output: removes tabs and commas, and adds newlines back to the thread stack
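A sketch of a script following these steps (assuming the usual jstack output format, with quoted thread header lines and indented "at" frames) could be:

#!/bin/bash
# stack_profile.sh - a reconstructed sketch of the approach, not the original script
# Usage: ./stack_profile.sh <java_pid> [native_thread_id] [samples]
# Takes several jstack samples, flattens each (matching) thread stack into one
# comma-separated line, and prints the most frequent stacks, one frame per line.

PID=$1
TID=$2                       # optional: decimal OS thread ID, as shown by top -H
SAMPLES=${3:-20}

for i in $(seq "$SAMPLES"); do
  jstack "$PID"
  sleep 1
done |
awk -v tid="$TID" '
  BEGIN { if (tid != "") nid = sprintf("nid=0x%x", tid) }   # decimal -> the nid=0x... form
  /^"/  { keep = (nid == "" || index($0, nid) > 0); stack = ""; next }
  keep && /^[ \t]+at / { gsub(/^[ \t]+/, ""); stack = (stack == "" ? $0 : stack "," $0) }
  keep && /^[ \t]*$/ && stack != "" { print stack; stack = "" }
' |
sort | uniq -c | sort -rn | head -5 |
sed 's/,/\n\t/g'             # prettify: put each frame back on its own line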

Conclusion

With a few small tools it is possible to understand quite a lot in almost any situation related to Java: you can find the most frequent stack trace by sampling thread dumps.
With this knowledge it is then easy to understand why an issue is happening. In my test case, the application constantly generates random numbers without a pause, and one thread is occupying one CPU core.

Statistics gathering and SQL Tuning Advisor
Thu, 02 Jan 2014

Our monitoring software found a long-running job on one of our client’s databases. The job was Oracle’s auto task running statistics gathering for more than 3 hours. I was curious to know why it took so long, and used a query against ASH to find the most common SQL during the job run, based on the module name. The results surprised me: the top SQL came with a SQL Tuning Advisor comment.
Here is the SQL I used:
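A simplified sketch of such a query (the module/action filters and the time window are assumptions and would need adjusting):

-- top SQL_IDs sampled by ASH while the auto stats task was running
select sql_id, count(*) samples
  from v$active_session_history
 where module = 'DBMS_SCHEDULER'            -- module reported for auto tasks (assumption)
   and action like 'ORA$AT_OS_OPT%'         -- auto optimizer stats task action (assumption)
   and sample_time > sysdate - 4/24         -- roughly the window of the 3+ hour job
 group by sql_id
 order by samples desc;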

So most of the queries come with a “SQL Analyze” comment right at the beginning of the SQL text, yet they are running from a DBMS_STATS call, which is confusing. After some bug searching I found MOS Doc ID 1480132.1, which includes a PL/SQL stack trace from a DBMS_STATS procedure call going up to DBMS_SQLTUNE_INTERNAL, which means DBMS_STATS makes a call into the SQL Tuning package; very odd:

It turns out this call has nothing to do with SQL Tuning. It is just a call to a procedure in that package, which happens to produce SQL that looks like it came from the SQL Tuning Advisor. I traced a GATHER_TABLE_STATS call in a test database, first with SQL trace and then with DBMS_HPROF, and here is what the call tree looks like:

This is basically the approximate NDV calculation. So there is nothing to be afraid of; it’s just the way the code is organized: DBMS_STATS uses the API of the SQL Tuning framework when you use DBMS_STATS.AUTO_SAMPLE_SIZE as the ESTIMATE_PERCENT (which is the default and recommended value in 11g+).
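For reference, reproducing the trace in a test schema is as simple as this sketch (T1 and the PLSHPROF_DIR directory object are placeholders):

-- profile a GATHER_TABLE_STATS call with DBMS_HPROF; AUTO_SAMPLE_SIZE (the 11g+ default)
-- is what sends DBMS_STATS down the approximate NDV / DBMS_SQLTUNE_INTERNAL code path
begin
  dbms_hprof.start_profiling(location => 'PLSHPROF_DIR', filename => 'gts.trc');
  dbms_stats.gather_table_stats(
    ownname          => user,
    tabname          => 'T1',
    estimate_percent => dbms_stats.auto_sample_size);
  dbms_hprof.stop_profiling;
end;
/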

Do you have enough Redo?
Mon, 23 Dec 2013

The question of whether a database has enough redo log available is quite common. The documentation suggests using FAST_START_MTTR_TARGET and V$INSTANCE_RECOVERY.OPTIMAL_LOGFILE_SIZE to identify “the optimal” redo log size based on the target recovery time. I’ve never used it and can’t comment on whether it is a correct/reasonable way or not. The easiest way to identify whether you need to increase the redo log file size is to check for ‘log file switch (checkpoint incomplete)’ waits and, depending on the total wait time, decide how much more redo you need. The obvious source of this information is Statspack or AWR. I have written a query which pulls the necessary information and calculates the amount of redo you need to add based on the total wait time.
Here is the query and an example of its output from a 10.2.0.4 database:

This query is large and can print more than just redo-related statistics; you just need to uncomment the necessary columns. The columns relevant to this post are “CKPT waits, s” and “Redo To Add, MB”. The first is the total time wasted on ‘log file switch (checkpoint incomplete)’ waits, and the second is the projected additional amount of redo which might have been enough to reduce the waits to a minimum. When calculating this projected size, the average redo rate per hour is multiplied by the total time waited. Of course this is a crude approximation: if you had 10 sessions waiting for the checkpoint to complete, then the total time waited by sessions is obviously more than the wall clock time, which means that multiplying it by an average redo rate is wrong; but it is almost correct for a database that at any point in time has only one session waiting for a checkpoint to complete. Nevertheless, I think this approximation is a good starting point.
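To make the building blocks concrete, here is a deliberately simplified sketch (per-snapshot deltas from the standard PERFSTAT tables; it does not compute the projection and ignores instance restarts):

-- per-snapshot deltas of 'checkpoint incomplete' wait time and redo generated
select sn.snap_id,
       sn.snap_time,
       round((e.time_waited_micro
              - lag(e.time_waited_micro) over (order by e.snap_id)) / 1e6)   as ckpt_wait_s,
       round((s.value
              - lag(s.value) over (order by s.snap_id)) / 1024 / 1024)       as redo_mb
  from perfstat.stats$snapshot     sn,
       perfstat.stats$system_event e,
       perfstat.stats$sysstat      s
 where e.snap_id         = sn.snap_id
   and e.dbid            = sn.dbid
   and e.instance_number = sn.instance_number
   and e.event           = 'log file switch (checkpoint incomplete)'
   and s.snap_id         = sn.snap_id
   and s.dbid            = sn.dbid
   and s.instance_number = sn.instance_number
   and s.name            = 'redo size'
 order by sn.snap_id;
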
In this particular case, the database had 4 redo log groups of 60MB each, and on December 15th this was changed to 4 groups of 150MB each. As you can see, it helped reduce the ‘checkpoint incomplete’ waits significantly, although not completely. If you look at the redo rate per hour, it is clear that after the redo log size change the rate of database changes also increased, so the remaining waits are partly a consequence of the ability to generate more changes per second now.

And just for the sake of completeness, here is a version of this query for AWR. I didn’t test it much, but it should be fine, as most things are similar to the Statspack tables. Enjoy!

Batched Table Access
Fri, 29 Nov 2013

When I first saw the suffix BATCHED in an Oracle execution plan from 12c, I was curious to see what was hidden behind it and how it works. I had some spare time for testing and wanted to share my findings with you here.

Intro

Here is what I’m talking about:

Line 1 of this very simple execution plan shows how Oracle 12c adds the suffix BATCHED to the table access by B*Tree index rowsource. Before starting my tests, I was thinking about the reasons behind this change and how it could be implemented.

Why Oracle would want to “batch” table access
Usually, large index range/full/skip scans with subsequent table access running serially cause lots of single-block reads of the table. Depending on the clustering of the data, the number of table block reads could be as high as the number of ROWIDs passed from the index leaf blocks up to the next rowsource. With a serial execution plan this means that query performance depends on how fast a single random table read is. Say you need to read 1000 random table blocks located far away from each other, and an average single-block read takes 5ms; then you need about 5 seconds to execute such a query. But if the storage subsystem can handle concurrent IO requests well enough, and you were able to ask it for those 1000 blocks concurrently or in parallel, transparently to the end-user session, then it could take less wall clock time for the user while putting more pressure on the OS, the storage, and the connectivity to the storage.

How Oracle optimizes IO already
As far as I know, Oracle can and does do a few cunning things with IO even in pre-12.1 releases. Here is a (most likely incomplete) list of example optimizations you may see:

Within NESTED LOOPS joins there are a couple of strategies Oracle uses: NL-join batching and moving the TABLE ACCESS out of the join (I have no idea what it is called exactly).

“Prefetching” with ‘db file parallel read’ – as described by Tanel Poder here (it gives you a very nice idea of what ‘db file parallel read’s are)

In the case of a “cold” buffer cache, Oracle may choose to read ahead: instead of reading just a single block when you think that would be enough, Oracle may opt to read multiple physically adjacent on-disk blocks into the cache (a ‘db file scattered read’). Sometimes this can hurt application performance (a lot), sometimes it doesn’t matter, but the point is: it is “normal” to experience multi-block buffered reads on what should probably be single-block reads.

Test Case

Based on my understanding of what Oracle can possibly do, I created a test scenario which could be used to find out more about BATCHED table access. Here is the setup:
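A sketch of an equivalent setup (the table, column and index names follow the description below; the row generation and padding are assumptions):

-- 1M rows; Y follows the physical row order (good clustering), X is random (bad clustering)
create table t1 as
select rownum                            as y,
       trunc(dbms_random.value(1, 1e6))  as x,
       rpad('x', 200, 'x')               as pad   -- make the rows reasonably wide
  from dual
connect by level <= 1e6;

create index t1_x_indx on t1(x);
create index t1_y_indx on t1(y);

exec dbms_stats.gather_table_stats(user, 'T1')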

Very easy. I created a sufficiently wide table holding 1 million rows, with two integer columns following very bad (T1.X) and very good (T1.Y) clustering of the data. Usually it also matters where you create the table. Initially I created it in a standard USERS tablespace (i.e. ASSM, non-uniform extent sizes), but then switched to an MSSM tablespace with uniform 1MB extents. Looking ahead, it did not make a difference to the test results (at least I could not identify one).

The test itself is also really easy, except for some diagnostic and preparation steps. Basically, I trace two statements, which access T1 via the two indexes respectively, at both the OS and Oracle levels, and then parse the Oracle trace file with OraSRP. You may want to use tkprof; I also used it initially, but OraSRP has one feature which helps to see the waits broken down by object, like this:

Testing

I was using VirtualBox with 64-bit OEL 6.4 and Oracle 11.2.0.4 & 12.1.0.1. I also did (partial) tests on 11.2.0.3 running in OVM on faster storage, and the results were similar to what I observed with 11.2.0.4.
Both instances were running with a pfile, with the following parameters specified:

Test Process & Observations

Initially I started testing with the default database config and filesystemio_options set to DIRECTIO. After some random tests I realized that the cache warm-up behaviour was not what I was interested in right now, and turned it off with a hidden parameter. Overall, I think the test results can be summarized as follows:

Test results are inconsistent. This is the most irritating thing. However, after running the test multiple times in a row, I get a pretty stable outcome, so I consider the results after multiple consecutive runs of the same test. Usually it is just 2 runs, but sometimes more, especially after an instance restart. I have no idea why this happens or what is behind these decisions. Maybe it has something to do with CKPT, as Tanel mentions in his post on oracle-l, but I did not check (and honestly don’t want to :))

Both 11g and 12c show that for table access of scattered data (via the T1_X_INDX index) Oracle may batch the table access IO using db file parallel reads; at the OS level it uses io_submit/io_getevents calls to run the IO with the async API (if it is turned on, of course); with just DIRECTIO in place it uses a bunch of single-block reads via pread.

Both 11g and 12c can use multi-block access of clustered data (via the T1_Y_INDX index) for index range scans (and most likely full/skip scans too). This is one of the most amusing things: even though I turned off cache warm-up, Oracle can still identify that the data I am accessing is placed close together, and it decides to read multiple adjacent table blocks at once. 12c, however, behaves differently and by default does not use buffered multi-block table reads.

The size of the multi-block IO (db file parallel read) differs between 11g and 12c: in 11g it is usually 39, sometimes 19. In 12c, by default, the number of requests depends on the client’s fetch size: it is equal to the minimum of the fetch size and 127.

It looks like the parameter _db_file_noncontig_mblock_read_count does not control the actual number of blocks read with db file parallel read; any value greater than 1 turns the feature on, and the size of the read requests stays the same (I have only tested setting it to 1, 2, 3 and 5).

The word BATCHED that appears in 12c execution plans is controlled by a new hidden parameter, _optimizer_batch_table_access_by_rowid. By default the parameter is set to TRUE, so plans tend to include BATCHED in the table access rowsource. At run time this setting behaves very much like 11g: it reads scattered table data with db file parallel reads, except that the number of IO requests is min(fetch_size, 127). If _optimizer_batch_table_access_by_rowid is set to FALSE at the session level, for example, then the plans generated by Oracle do not include the BATCHED suffix in the table access rowsource, but at run time Oracle still uses multi-block IO in the same way 11g does, i.e. 39 or 19 IO requests per call, and scattered reads of clustered table data are there as well!
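Checking this at the session level is straightforward; for example (the test query is just an illustration):

-- underscore parameters have to be quoted
alter session set "_optimizer_batch_table_access_by_rowid" = false;

explain plan for
select /*+ index(t1 t1_x_indx) */ count(pad)
  from t1
 where x between 1 and 1000;

select * from table(dbms_xplan.display);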

Summary

In 12c Oracle changed some internal code paths that deal with batched table access. But the important thing is that batched table access is not new, so even if you disable it, either explicitly with _optimizer_batch_table_access_by_rowid or implicitly with optimizer_features_enable, Oracle will still be able to use a similar approach, just as it did in 11g.

One important change, of course, is that by default the size of the vector IO now depends on the client fetch size, and I can imagine situations in which this could have an impact on application performance after an upgrade.
I have uploaded the test script and trace files from 11.2.0.4 and 12.1.0.1 here, so if you would like to repeat my tests and compare the results, feel free to do so.