IT Technology; Oracle, linux, TCP/IP and other stuff I find interesting

This post is about memory management on the operating system level of an Oracle database. The first question that might pop in your head is: isn’t this a solved problem? The answer is: yes, if you use Oracle’s AMM (Automatic Memory Management) feature, which let’s you set a limit for the Oracle datababase’s two main memory area’s: SGA and PGA. But in my opinion any serious, real life, usage of an Oracle database on Linux will be (severely) constrained in performance because of the lack of huge pages with AMM, and I personally witnessed very strange behaviour and process deaths with the AMM feature and high demand for memory.

This means that I strongly advise customers to use Oracle’s ASMM (Automatic Shared Memory Management) feature. In the newer versions of 11.2 I found this to be working very well. Earlier versions like 10.2 could suffer from an ever growing shared pool (which also means an ever shrinking buffer cache), especially when bind variables weren’t used. This still could happen, but it seems the SGA memory management feature in 11.2 handles this well in most cases. The ASMM feature means a fixed memory area is allocated for the SGA. SGA allocation has always been fixed outside of the AMM feature, as far as I know.

When ASMM doesn’t work, meaning the memory areas are getting sized wrong and performance is influenced by that, the last option is to size the memory area’s yourself. However, since version 11.2.0.2 Oracle will resize when the memory manager thinks it’s feasible. See Kurt van Meerbeek’s article about that.

That leaves the PGA (Process Global Area) as a memory area on itself. Most databases are using the automatic PGA memory management, which is enabled once the PGA_AGGREGATE_TARGET parameter is set to a non zero value. A common misunderstanding is this setting is actually limiting the overall PGA usage of an instance. The truth is automatic PGA memory management will make attempts to adhere to the PGA_AGGREGATE_TARGET value. These are the actual words in the official Oracle documentation: ‘attempts to adhere’!

This means sort memory, hash memory and bitmap memory will be actively limited in size per process by automatic PGA memory management, any attempt to allocate more than automatic PGA memory management allows will result in moving some contents of these memory areas to the assigned temporary tablespace of the database user, to make room for new data.

However, there are more memory area’s allocatable per process, which are never swapped to disk, thus always will stay in memory, and these could not be limited in an officially supported way prior to Oracle version 12. Two structures which are allocated in PGA and never swapped to disk are PL/SQL collections and PL/SQL tables. Creating and filling these requires the usage of PL/SQL (hence their names); the reason for mentioning this is that if your database is not used by PL/SQL but only SQL, you almost certainly will not run into the problem I describe below.

You might be thinking: wait a minute! Does this mean a developer can just create such a structure, and allocate whatever he/she likes, with all the consequences that it can have, like the operating system starting to swap, and can do that for every single process? Yes, this is what this means. This is why Oracle introduced a parameter called PGA_AGGREGATE_LIMIT with Oracle 12, to effectively limit the overall PGA heap size.

In case you wonder what this means, or even doubting my words, I have written a little program to demonstrate this behaviour.

The begin sizes of the UGA (vbu) and PGA (vbp) are 1’103’192 and 3’535’368. The PGA_AGGREGATE_TARGET size is set to 524’288’000 (500MB). I did set the grow_until variable to 700’000’000 (roughly 700MB), which is more than PGA_AGGREGATE_TARGET. After running this, it’s easy to spot the values of vcu (UGA allocation) and vcp (PGA allocation). vcu grew to 4’755’704 during the run, however vcp grew to 700’051’976, a little more than 700MB! This shows that the collection is stored in the PGA, and that the collection grew beyond the value set with PGA_AGGREGATE_TARGET.

This behaviour is consistent in versions 12.1.0.1, 11.2.0.4, 11.2.0.3, 11.2.0.2 and 11.2.0.1.

Let me emphasise once again that the above proof of concept code managed to allocate more memory than was set for the overall PGA usage of the entire instance. This can have an enormous, devastating impact on a consolidated database setup (meaning having multiple instances running on a single machine). Typically, once memory consumption of all the processes exceeds physically available memory, the operating system tries to use the swap device, to which it will swap memory pages in and out depending on memory usage of active (=on CPU) processes. Mild swapping shows as severely slowed-down processing (because a number of memory pages for processing need to be read from the swap device and placed in memory, from which the former contents need to be written to the swap device), heavy swapping shows as the machine coming down to a standstill.

Please mind that a diagnosis on the state of memory usage (alias swapping), just by looking at the amount of used swap (as can be seen in the ‘top’ output, or ‘swapon -s’) could be misleading. It’s also important to look at actual swapping in and out, as can be seen with ‘vmstat 1′ (si/so columns) or swap -W. I’ve found several systems which had been running for some time (approximately longer than a month) that had swap usage, sometimes up to 40%, while no ‘active swapping’, so memory pages being transfered to and from the swap device, was happening.

Luckily, starting with Oracle 12 you can actually limit overall PGA usage using the parameter PGA_AGGREGATE_LIMIT. The default value is the greater of (list from Oracle documentation):
a) 2GB
b) 200% of PGA_AGGREGATE_TARGET parameter (or lower if 200% > (90% of physical memory – total SGA size) but not below 100%)
c) 3MB * PROCESSES parameter
The parameter can not set below it’s default value, except when set in a pfile or spfile.

Let’s set the PGA_AGGREGATE_LIMIT to 600MB and see what happens when we start doing a large allocation again:

Okay, let’s run the pga_filler.sql script again, and try to allocate 900MB. This means the “grow_until” variable must be set to 900000000.
PLEASE MIND this is done as a regular user, the SYS user and background processes other than job queue processes are not subject to the limiting.

Okay, essentially, this tells us nothing interesting, except for the tracefile. Let’s look in/u01/app/oracle/diag/rdbms/v12102/v12102/trace/v12102_ora_4147.trc, being the tracefile as indicated in the above alert.log snippet:

Actually, this is the end of the tracefile. It seems that the pga limit dump (the text in between “Process may have gone over pga_aggregate_limit” to the private memory summary heap dumps) occurs several times before an actual ORA-4036 is triggered. In my private test instance, where I am obviously the only user process doing something, I get a pga limit dump approximately 20 times before the ORA-4036 is actually triggered:

Remember the PGA_AGGREGATE_LIMIT was set to 600M. Now look at the process’ PGA/Private heap summary dump above: it says 781M. Please mind the 781M is the PGA heap of a SINGLE process! When looking at the total PGA allocated for the entire instance, it’s even more:

So…despite PGA_AGGREGATE_LIMIT set to 600M, according to the v$pgastat view, there’s 1041MB allocated for PGA. Please mind I haven’t looked into how accurate v$pgastat is, but I tend to believe this.

Summary.
I’ve seen PGA_AGGREGATE_TARGET being used as a calculation value for actual PGA usage of an instance. This is simply wrong. The actual amount of PGA memory allocated by the instance is highly depended on what is done, and can be less than PGA_AGGREGATE_TARGET, or more. Automatic PGA can control three per process memory area’s: the sort, hash and bitmap memory area’s. These are sized based on the setting of PGA_AGGREGATE_TARGET and the actual PGA memory usage instance wide. If more memory is needed for sort, hash or bitmap memory than is made available by the memory manager, excess memory needed is allocated in the temporary tablespace. Any other PGA memory allocation is always done, regardless of the setting of PGA_AGGREGATE_TARGET.

Starting with Oracle 12, it seems the actual PGA allocation now can actually be limited with the new parameter PGA_AGGREGATE_LIMIT. However, during some simple testing it shows that actually more memory is allocated than set with PGA_AGGREGATE_LIMIT as limit. I haven’t tested it in more situations, this post is meant to grow awareness that the actual limit as set by PGA_AGGREGATE_LIMIT might not be that hard as you would expect.

Please mind, PGA_AGGREGATE_LIMIT seems to truly limit PGA usage instance wide, not limit the PGA heap per process, as event 10251 (PGA usage limiting way for Oracle 11.2) does. However, once again: PGA_AGGREGATE_LIMIT seems to try to be smart and actually does not limit at the exact size set, but beyond that.

The next post will introduce a way to limit PGA usage in Oracle 11.2. Stay tuned!

Share this:

Like this:

I posted a fair amount of stuff on how Oracle is generating IOs, and especially large IOs, meaning more than one Oracle block, so > 8KB. This is typically what is happening when the Oracle database is executing a row source which does a full segment scan. Let’s start off with a quiz: what you think Oracle is the maximum IO size the Oracle engine is capable of requesting of the Operating System (so the IO size as can be seen at the SCI (system call interface) layer? If you made up your answer, remember it, and read on!

The real intention of this blogpost is to describe what is going on in the Oracle database kernel, but also what is being done in the Linux kernel. Being a performance specialised Oracle DBA means you have to understand what the operating system does. I often see that it’s of the utmost importance to understand how an IO ends up as a request at the NAS or SAN head, so you understand what a storage admin is talking about.

Many people (including myself in the past) would state that the maximum IO size on Linux is 1MB. For the Linux 2.6 kernel and higher this statement is incorrect because there is no such thing as a single maximum IO size on Linux. There used to be one in the Linux 2.4 era, which was set with the maxphys parameter, but that time is long gone. In order to find out what it is now, let’s test and see!

First let’s get a Linux system and a big table!
The system I got is a VMWare Fusion VM, running Linux 3.8.13-44.1.5.el6uek.x86_64 (UEK3) on Oracle Linux 6u6. On top of that I am using the Oracle database and grid infrastructure version 12.1.0.2. This system is using udev for providing disk access (as opposed to asmlib)
The redundancy mode of ASM is external, although for reading (what I will be covering) this doesn’t matter.

Now let’s get a normal database session, and use a combination of sql_trace with waits (10046/8) and strace to see how the Oracle database interfaces with the kernel. Please mind I’ve prepared a heap table with no indexes on it, so a count(*) on it always will result in a full table scan. Also, the buffercache is sized small enough (or the table is created large enough, it depends on how you look at it) to have the session make the decision to do a direct path read, instead of a buffered read. If you don’t know what that means: please search this blog on direct path reads, or even better, download my presentation ‘about multiblock reads’.

The direct path read decision is visible via the ‘direct path read’ wait event. If you get a full table scan operation and see ‘db file scattered read’ waits, you are doing a buffered scan.

In the most cases, you will get a maximum value of 1MB if possible, which seems to support the generally assumed 1MB maximum operating system induced IO size. Why? Well, because you probably set the DB_FILE_MULTIBLOCK_READ_COUNT parameter to 128, which means you have explicitly set the Oracle process not to do IO with a size more than 1MB (8192*128). In this blogpost, I explain that Oracle can request IOs bigger than 1MB.

In the blogpost series on extra huge database IOs, I show that Oracle can do huge (1MB+) IOs, but the physical request size (what Oracle actually requests at the SCI layer, visible with the pread/pwrite/io_submit/io_getevents functions) still is 1MB. This limit is imposed by the physical storage structure which the database uses with ASM, called allocation unit (often called ‘AU’). The default size of an allocation unit is 1MB. The allocation unit can be seen in both the database and the ASM instance with the following query:

How about doing an unbuffered read on a filesystem? I’ve created a database on an (XFS, but this doesn’t matter AFAIK) filesystem, and tried to set the maximum value to DB_FILE_MULTIBLOCK_READ_COUNT. I’ve done this by setting DB_FILE_MULTIBLOCK_READ_COUNT to 10000 (ten thousand), and then bounce the database to see what the number has become. In my case, the value became 4096. I think this is the limit for Oracle 12.1.0.2 on Linux x86_64, but love to hear if you have gotten different results:

I set 10000:

SYS@fv12102 AS SYSDBA> select name, value from v$spparameter where name like 'db_file_multiblock%';
NAME VALUE
-------------------------------------------------- ----------------------------------------------------------------------
db_file_multiblock_read_count 10000

But Oracle limits this to 4096:

SYS@fv12102 AS SYSDBA> select name, value from v$parameter where name like 'db_file_multiblock%';
NAME VALUE
-------------------------------------------------- ----------------------------------------------------------------------
db_file_multiblock_read_count 4096

Okay. Let’s start our investigation at that point: a database which is set up with a DB_FILE_MULTIBLOCK_READ_COUNT set to 4096, alias 32MB (with a block size of 8KB), and a table which got extents large enough to accommodate huge (32MB) IOs.

Okay, we are setup and ready to go, but there is one additional thing: the way direct path reads work, they would probably give little waits with fast IO capabilities. One way to get the waits back, is to limit the IO capabilities of the process. Doing so is documented in this article.

Now issue the full table scan on a large table in sqlplus while strace is attached:

SQL> select count(*) from bigtab;

Now take a peek at the strace output!
The output first shows IOs as we expect:

What is visible here, is first the reap of an I/O request (with asynchronous IO on Linux this is typically the io_getevents() call). If you take a close look at the arguments of the io_getevents() call (taken from the manpage of io_getevents):

struct io_event {
__u64 data; /* the data field from the iocb */
__u64 obj; /* what iocb this event came from */
__s64 res; /* result code for this event */
__s64 res2; /* secondary result */
};

The above description is taken from the annotated Linux kernel source, as available here: http://lxr.free-electrons.com/source/include/uapi/linux/aio_abi.h#L58 I use this site for navigating the Linux kernel source. What is important, is that the third field (io_event.res) contains the size of the IO request. Having learned this, now look again in the io_getevents call. The size of the IO reaped above is 33554432, which is 33554432/1024/1024=32 MB. Yes, that’s a single IO of 32MB! Also, this is consistent with the wait line a little lower:

So, I wonder what you thought was possible, the correct answer on my operating system (Linux x86_64) with Oracle 12.1.0.2 is 32MB. It turned out the big IOs in the ASM case were limited by the allocation unit size of 1MB.

The next thing I’ve wondered is how this matches with the maximum IO size of the disk devices as visible by the Operating System. You can request 32MB, but a normal SCSI disk doesn’t do 32MB IOs. Of course in my case the SCSI disk really is a VMWare virtual disk device.

Let’s keep the 32MB IO in mind, now dive from the top layer, the SCI (system call interface) where an IO enters the kernel to the bottom of the kernel from an IO perspective, to the block device. The block device settings are found in /sys/block/DEVICE/queue. The maximum IO size the device is capable of is found in max_hw_sectors_kb. This is in kilobytes, and read only (can’t change hardware, right?). In my case this is:

[root@bigmachine queue]# cat max_hw_sectors_kb
4096

My disk supports a maximum of 4M for an IO size! But this is not what is used, the actual setting is in max_sectors_kb:

[root@bigmachine queue]# cat max_sectors_kb
512

That’s half a megabyte!

So…we got (up to) 32MB sized IO requests coming in, and a device that is set to 512KB IOs. This means that somewhere between the SCI and the device, there is a mechanism to scatter the request size to the device’s maximum IO size, and once the IO requests are done, going back to gather the IO results to the original request.

There are a couple of layers in the Linux kernel through which the call travels (including common functions):

(note: there seems to be consensus the above mentioned layers exist, although there is different wording and different numbers by different sources. Also, there doesn’t seem to be a very clear description of what is done by which layer, and what typically defines a kernel layer. For some functions it is clear they belong to a certain layer (for example aio_run_iocb in Linux/fs/aio.c, bulk_finish_plug in Linux/block/blk-core.c, etc.), for some layers, like the SCI layer, it seems there isn’t a clear layer definition by looking at where the function is defined. Also please mind the SCSI layer is implemented as a driver, just like the actual device driver for the hardware. This is very understandable, but makes it a bit harder to see it in a layered way)

System Call Interface (SCI)
The request enters kernel space via the SCI. The function of the SCI is to elevate a process to system priority to perform a kernel mode task, like (but not limited to) doing I/O. The system call implementation on Linux makes use of a wrapper function in glibc, which executes the system call on behalf of the user systemcall request. The reason for mentioning this, is that sometimes the glibc wrapper “hides” the real system call, for example calling the semtimedop() function:

Above is a gdb (GNU debugger) session which attaches to an Oracle background process, which I know is sleeping in the system call semtimedop() when idle, A breakpoint is set on the semtimedop function, and the execution of the attached process is resumed. It then breaks on the function, showing the source code at which the break happened. Instead of showing the actual semtimedop function, it shows the pseudo function in glibc which wraps this system call. This hides the arguments of calling the semtimedop() function. My current workaround is to read the kernel registers which “carry” the arguments (RDI, RSI, RDX, RCX, R8, R9 for the first 6 arguments in most cases).

Virtual File System (VFS)
The next layer is virtual filesystem. Here we see functions specific to asynchronous IO or synchronous IO, and doing direct IO or not, and also actual filesystem specific functions (in my case xfs, when ext4 is used, you will see specific functions for that. I highly recommend XFS!). This layer also uses a structure called ‘request_queue’, which keeps track of the actual IO requests for a block device, of which each individual request is a struct ‘request’, which contains one or more structs called ‘bio’ which contains a description of the request, which points to structure called ‘bio_vec’, which points to pages for storing the disk request contents. This is all setup and created in kernel memory by the user process in system mode. It’s my assumption that the properties of the disk device (=maximum advertised IO size) are taken into account when the VFS filesystem implementation creates requests and all necessary structs and memory area’s. Please mind it’s important that enough memory is available to setup the necessary structures, and enough CPU to make this happen. Also some of the crucial structures for doing IO (request, bio, bio_vec) seem to be setup in this layer. An IO can’t be done without a memory area for the IO request to hold the data for sending it to the device (alias a write), or a memory area for the IO request to hold the data which is fetched from the device (alias a read).

The funny thing is that when you use ASM (the simple version 11.2 ASM with a local ASM instance and local disk devices), you will still see some functions of the VFS layer, because you use a disk device which is opened using the local filesystem. Examples of these functions are: aio_run_iocb, do_aio_read.

Block Layer
The next layer is the block layer. Here the request queue is handled, and I/O scheduling is done. Oracle advises the deadline scheduler in all cases. The scheduler works by plugging a request queue, much like a plug in your bathtub, letting the requests enter the queue. Having multiple requests in a queue means it can be optimised by reordering the requests, and merging adjacent requests up to the device’s advertised maximum IO size. Once a request’s timeout expires, or the requesting process finishes submitting IO, the queue is unplugged.

SCSI layer
The SCSI layer is responsible for communicating with SCSI devices to do IOs.

Device driver
The device driver layer is the layer that truly physically communicates with a device, and implements the device specific communication. In my case the functions start with mpt, which is the driver for LSI PCI adapters.

To see how the flow of IO going through the block layer, there is a tool called blktrace. Actually this is a mini-suite of tools consisting of blktrace (tracing the IO requests through the block layer), blkparse (parsing the output of blktrace to make it human readable), btrace (script to combine blktrace and blkparse, and btt (a blktrace output post processing tool)), among others.

In order to use blktrace, the debug file system of the Linux kernel needs to be mounted. Here is how that is done:

I use blktrace in this article for looking at the IO requests to understand what is going on. The workflow for this use of blktrace is:
– create a trace file of the block flow using blktrace
– make the trace file human readable via blkparse or analyse via btt (block trace times)

Actually, you can parse the output of blktrace directly via blkparse using ‘blktrace -d DEVICE – | blkparse -i -‘. To make that even simpler, the script ‘btrace’ is created, to do exactly that.

Here’s how that looks like (depending on the number of processes using it, the output can be huge, this is only a snippet):

What is shown here, is the typical flow of an IO in the block layer:
Q – Queue. A request starts off sending a notification on the intent to queue at the given location.
G – Get request. A struct request is allocated.
P – Plug. When the block device queue is empty, the queue is plugged in order to receive further IOs and have the ability to optimise (merge and/or reorder) them before the data is sent to the device.
I – Insert. A request is sent to the IO scheduler for addition to the internal queue and later service by the driver. The request is fully allocated at this time.
U – Unplug. The start of sending requests to the driver.
D – Driver. A request has been sent to the driver and removed from the queue.
C – Complete. A previously issued request to the driver has been completed.

The main point is here, that you can truly see how the IO requests flow through the block layer and are issued to the storage device, in other words, you can see how the block layer receives the IOs, and what is exactly submitted to the driver as request for the physical storage layer.

This is a microscopic view of the disk IOs. In most cases, when you want to gain information on block layer IO processing, another view on it is provided by processing blktrace output with btt. This is an example output of btt:

In this example I captured IOs for 60 seconds. You can exclude ‘-w 60′, and press interrupt (ctrl-c) when you deem IO recording is enough. This produces a binary file ‘sdb.blkparse’, which can be used btt:

This is the first part, the flow through the block layer until IO completion:

Note: time is in milli seconds.
Q2Q – Time between IO requests.
Q2G – Time it takes for a request struct to be allocated.
G2I – Time it takes for the request to be inserted in the device’s queue.
I2D – Time spend in the device queue waiting to be issued to the driver.
D2C – Time spend between issuing to the driver and completion of the request. This includes controller, storage. This is the same figure as the ‘svctm’ column with iostat -x.
Q2C – Total time spend in block layer and physical IO. This is the same figure as the ‘await’ column with iostat -x.

This is partly the same as the IO flow table above. This is expressed as a percentage of where the total time of the IO is spend.
Q2G – Request struct allocation.
G2I – Insertion in the device queue.
Q2M – Total time until merge.
I2D – Time spend in the queue until it was dispatched to the driver.
D2C – Time spend on doing the IO after submitting the request to the driver.

Share this:

Like this:

This is a quick post on using git on a server. I use my Synology NAS as a fileserver, but also as a git repository server. The default git package for Synology enables git usage on the command line, which means via ssh, or via web-DAV. Both require a logon to do anything with the repository. That is not very handy if you want to clone and pull from the repository in an automated way. Of course there are ways around that (basically setting up password-less authentication, probably via certificates), but I wanted simple, read-only access without authentication. If you installed git on a linux or unix server you get the binaries, but no daemon, which means you can only use ssh if you want to use that server for central git repositories.

Running git via inetd
What I did is using inetd daemon to launch the git daemon. On any linux or unix server with the inetd daemon, and on Synology too, because it uses linux under the covers, it’s easy to setup git as a server.

What you should look for in your setup is:
– gituser: this is the user which is used to run the daemon. I created a user ‘gituser’ for this.
– /usr/bin/git: of course your git binary should be at that fully specified path, otherwise inetd can’t find it.
– git daemon:
— –inetd: notify the git executable that it is running under inetd
— –export-all: all git repositories underneath the base directory will be available
— –base-path: this makes the git root directory be set to this directory. In my case, I wanted to have all the repositories in the home directory of the gituser, which is /volume1/homes/gituser in my case.

And make the inetd deamon reload it’s configuration with kill -HUP:

# killall -HUP inetd

Please mind this is a simple and limited setup, if you want to set it up in a way with more granular security, you should look into gitolite for example.

Share this:

Like this:

This post looks like I am jumping on the bandwagon of IT orchestration like a lot of people are doing. Maybe I should say ‘except for (die hard) Oracle DBA’s’. Or maybe not, it up to you to decide.

Most people who are interested in IT in general will have noticed IT orchestration has gotten attention, especially in the form of Puppet and/or Chef. I _think_ IT orchestration has gotten important with the rise of “web scale” (scaling up and down applications by adding virtual machines to horizontal scale resource intensive tasks), in order to provision/configure the newly added machines without manual intervention, and people start picking it up now to use it for more tasks than provisioning of virtual machines for web applications.

I am surprised by that. I am not surprised that people want boring tasks like making settings in configuration files and restarting daemons, installing software with all the correct options, etc. being automated. Instead, I am surprised that people are now picking this up after it has been around for so long.

A little history.
As far as I know, IT orchestration started with cfengine, which was really a configuration engine (hence the name). Despite having a little different purpose (configuration vs. orchestration), this tool is the parent of all the orchestration/configuration tools which exist nowaday. cfengine started off as a study in 1993 by Mark Burgess at the university of Oslo, with the creation of the cfengine software as a result. As far as I can see, it has been available as open source software since the beginning.

Now back to what I am surprised at: with cfengine, there has been a way to configure linux/unix systems in a structured way, and install and configure software on systems since the mid-nineties. Yet, this wasn’t picked up (of course with a few exceptions). Fast forward to today, we see it is being adopted. And that is a good thing.

I created a setup with cfengine for a client a long time ago, which had the ability to install the Oracle software, different PSU’s in different home’s, and remove it by adding or removing machines to groups in a cfengine configuration file. It wasn’t picked up by the client, it’s way more fun running X to install the software, and make the choices by hand, and redo this over and over on every machine, right?

I almost forgotten about my work with cfengine, until I spoke with Alex Gorbatchev at a conference, at which he pointed me to Ansible. At first I didn’t do a lot with it, but lately I’ve given it a go, and I am very happy with it.

Another redo of cfengine?
From what I read, most of the configuration/orchestration engines created after cfengine are created to circumvent all kinds of difficulties with cfengine. I can understand that. It took me a while to learn cfengine, and indeed it forces you to think in a different way.

The Ansible project decided to radically do it different than all the other engines. It is different in the sense that it advertises itself as simple, agentless and powerful.

Simple.
Simple is a terrific goal. For those of you that have worked with configuration/orchestration engines, there is a steep learning curve. It is just hard to get the basic principles in your head. To be honest, also Ansible took me a while too, to grasp the basic principles, and get the picture correctly in my head. Yet, having worked with cfengine comparing it with Ansible’s playbooks, which are the scripts to do things on the targets, it is a breath of fresh air. Playbooks are so clean they (almost) can be read and understood as plain english.

Agentless.
This is where Ansible is truly different than any of the other configuration/orchestration tools. Ansible does not require any agent installation on the targets. The obvious next question then is: how can this work? Well, quite simple: Ansible uses ssh to connect to the host, and executes commands via the shell. Having that said, it requires a little more detail; Ansible uses python on the remote host for it’s normal execution. However, you can use it without python, for example to setup the host up for the Ansible normal usage mode Which requires python and the simple-json module.

This is truly important, and makes it an excellent fit for my daily work as an IT consultant.

Powerful.
Ansible is powerful in the way that you can do the configuration and orchestration tasks in a simple clean way.

Summary on the introduction.
Above was a brief personal history, and some of the “marketed” features of Ansible. I think being agentless is the true “killer feature” here. All the other configuration/orchestration engines require you to setup and configure a fixed client-server connection, and install a deamon and a central server process. In case you wondered, yes, authentication is important, and it’s simply brilliant that the ssh password authentication or public key infrastructure can be used.

Because there’s no daemon to install, you can run your created play books everywhere. So instead of a fixed client configuration, you can create play books to do routine tasks, and repeat it at multiple sites.

Okay, how does this work?

Installation: add EPEL and install ansible.
If you are on one of the clones of RedHat Enterprise Linux (I use Oracle Linux), you simply need to add the EPEL repository to your yum source list, and run:

# yum install ansible

First steps.
One of the first things I do, is create a directory for a typical ansible ‘project’. Project means a set of tasks you want to do to a set of hosts here. Next, I create a file called ‘hosts’ which is the list of hosts you want to use for executing tasks on. By default, Ansible looks in /etc/ansible/hosts. In this case, I put a single machine in it (a test VM), but it can be a list of machines (ip addresses or hostnames).

$ cat hosts
192.168.101.2

In fact, you can create groups in the hosts file in the “ini style”. But I just put one host in for this example.
The next thing is to check if Ansible reads the file correctly. This is done in the following way:

$ ansible all -i hosts --list-hosts
192.168.101.2

Okay, this means Ansible will operate on this one host if invoked. The next logical thing (typically done when you are in a new client environment to check if you can reach the hosts):

Ping might be a bit misleading for some people. What ping does here (-m means module), is trying to connect to the host over ssh, and log in. Because I didn’t specify a user, it used the username of the current user on the machine, which is ‘ansible’. A user ‘ansible’ typically doesn’t exist on a normal server (and is not necessary or should be created), and also not on my test server. So it failed, as the message said, on authentication.

The authentication failed again. And it should! What this is doing, is trying to log on as root, and we haven’t given any password, nor have I put my local user’s public key in the remote authorised_keys file. So there is no way this could work. This is typically also the state when you want to do stuff with a “fresh” client system. Let’s add the ‘-k’ option (ask ssh password), and run again:

To walk you through the output: It now asks for a password, which I’ve filled out, then lists the host and the state: success. During this execution, there was nothing changed on the remote host, and the ping command resulted in a pong (alike the ICMP ping response).

Does this look familiar for you Exadata DBA’s? Yes, this replicates some of the functionality of dcli (although dcli is aimed at executing simple tasks to a group of hosts, whilst Ansible is aimed at enterprise configuration and orchestration).

One step beyond! Playbooks.
Now let’s progress to playbooks. An Ansible playbook is where the true strength lies of Ansible. It allows you to specify tasks to execute on the remote hosts, and create sequences of tasks and make decisions based on the outcome of a tasks for further execution. Let me show you a simple playbook, and guide you through it:

As you can see, this is a playbook with three tasks: upgrade all packages, install libselinux-python and adding my (local) public key to the authorised key file of root (to allow passwordless access).

Line 1 shows three dashes, which means the start of a YAML document.
Line 2 starts with a single dash, which indicates a list. There is one dash at this indention level, so it’s a list of one. The fields of this member are hosts, gather_facts and tasks. Tasks got his own list (mind the indention level, that is important). The fields are key/value pairs, with the separation indicated by the colon (:). The first field is ‘hosts’, with the value ‘all’. This means that all hosts in the hosts file are used for this playbook. I don’t think it’s hard to imagine how useful it can be to specify a group/kind of servers the playbook can run on. The next one is ‘gather_facts’. A normal playbook execution first gathers a lot of information from all the hosts it is going to run on before execution. These can be used during playbook execution. Next ‘remote_user’. This indicates with which user ansible is going to logon, so we don’t have to specify it on the command line. Then we see ‘tasks’ to indicate the list of tasks to be executed on the hosts.

It’s easy to spot we got three tasks. What is extremely important, is the indention of this list (it’s got a dash, so it’s a list!). Name is not mandatory, but it makes it easy to read if you give the tasks useful names and these will be shown when the playbook is executed. The first task has the name ‘upgrade all packages’. The next field shows the key is ‘yum’ indicating it is making use of the yum module. This key got two values: name=*, which means all ‘all packages’, and state=latest, which means we want all packages to be at the latest version. This means this command is the equivalent of ‘yum update’.

The second task is called ‘install python-selinux’. It makes use of the yum module again, and is self explanatory, it installs the libselinux-python package. This packages is necessary to work on a host which has selinux enabled on things that are protected by selinux.

The next task is called ‘add public key to authorised_key file of root’. It is making use of the authorized_key module. This module requires a parameter ‘key’, for which we use the lookup function to look up the local (!) public key, of the user with which I execute ansible, which is ‘ansible’. ‘state=present’ means we want this key to be present; ‘present’ is the default value, so it wasn’t necessary to put this in. Next ‘user=root': we want the public key to be added to the authorized_keys file of the user root.

Of course these tasks could be executed using the ‘ansible’ executable as single tasks. To show the importance of the installation of the libselinux-python module on a host with selinux enabled (which is the state of selinux on a fresh installed Oracle Linux machine), let’s execute the task using the authorized_key module:

Please mind I didn’t specify ‘-k’ on the command line to send a password: in the previous step we added our key, so we can access our host using our public key. Another extremely important thing is ‘changed’. ‘changed’ indicates if the task did actually change something on the destination server.

I have ran single task until now, I changed the state of my test VM back to it’s state before I started changing it with ansible (by removing the libselinux package using ‘ansible all -i hosts -k -u root -m yum -a “name=libselinux-python state=absent”‘

Now at this point you might think: I get it, but these are all pretty simple tasks, it’s not special at all. Well, let me show you an actual thing which totally shows what the importance of using this is, even on a single machine, but even more when you got a large group of servers you have to administer.

The next example is a playbook created to apply PSU3 to an Oracle 11.2.0.4 home. It’s still quite simple, it just applies PSU3 to the Oracle home. But totally automatic. The point I am trying to make is that this is already nice to have automated a lot of work for a single home, but it saves a lot of hours (read: a lot of money), and saves you from human error.

Let me run you through this playbook! It starts off with the indication of a YAML document: ‘—‘. Next hosts: all again. I just put all the hosts in the hosts file, I did not create all kinds of groups of hosts (which would be fitting when you use it at a fixed environment, but I use it for various customers). Then vars, with a list of variables. As you can see, I can use the variables, which are shown in the playbook as {{ variable }}. Then remote_user: oracle and tasks.

The first and second task use variables, and use the argument ‘register’ to save all response into a named variable. I also use ‘failed_when’ to make the playbook stop executing when the argument after ‘failed_when’ is true. Arguments of ‘failed_when’ is the named variable, for which the output of the standard out is used (.stdout). Then a filter is used to cast the output to integer, and is compared with a calculation of the variable.

The third task is using the files module to create a directory. The fourth task is using the copy module. The copy module means a file or files (in this case) are copied from the machine from which the playbook is run, onto the destination host or hosts. Here is also another trick used, to process the task with a list of items. As you can see, the copy line contains a variable {{ items }}, and the task is executed for all the items in the list ‘with_items’. I found this is fine for smaller files (up to a few hundred of megabytes), but too slow for bigger files. I use http (the get_url module) to speed up file transfer.

The fifth and sixth tasks execute a shell command, unzip, to extract the contents of a zip file into a specific place.

The seventh task is executing a small list of shell commands, in order to be able to run the conflict detection option of opatch. The same trick as with the first two tasks is used, register a name for the output of the conflict detection. Here I check if the stdout contains what I would manually check for when I would run it. The eighth task is the main task of the whole playbook: the actual patch. However, it uses the same technique as task seven. The last task simply removes a directory, in order to remove the files we used for this patch.

Summary
I hope this shows what a tremendous help Ansible can be for a consultant. This kind of tool is simply mandatory if you got an environment with more than approximately ten to twenty servers to administer. Ansible can be used even if the organisation does not want to spend time on the implementation of a configuration tool.

Like this:

(the details are investigated and specific to Oracle’s database implementation on Linux x86_64)

Exadata IO: This event is not used with Exadata storage, ‘cell single block physical read’ is used instead.
Parameters:
p1: file#
p2: block#
p3: blocks

Despite p3 listing the number of blocks, I haven’t seen a db file sequential read event that read more than one block ever. Of course this could change in a newer release.

Implementation:
One of the important things to realise here is that regardless of asynchronous IO settings (disk_asynch_io, filesystemio_options), Oracle always uses a pread() systemcall, so synchronous IO for reading blocks which are covered with this event. If you realise what the purpose of fetching the single block is in most cases: fetching a single database block which contents are necessary in order to continue processing, it should become apparent that issuing a synchronous IO call makes sense. This is also the reason the V$IOSTAT* view lists both SMALL_READ_REQS, SMALL_SYNC_READ_REQS and SMALL_READ_SERVICETIME, SMALL_SYNC_READ_LATENCY, to make a distinction between SYNC (pread()) reads and non-sync (thus asynchronous) calls, using the io_submit()-io_getevents() call combination.

IO done under the event ‘db file sequential read’ means a single block is read into the buffer cache in the SGA via the system call pread(). Regardless of physical IO speed, this wait always is recorded, in other words: there is a strict relation between the event and the physical IO. Just to be complete: if a block needed is already in the Oracle database buffer cache, no wait event is triggered and the block is read. This is called a logical IO. When the wait event ‘db file sequential read’ is shown, both a physical and a logical IO are executed.

This event means a block is not found in the database buffer cache. It does not mean the block is really read from a physical disk. If DIO (direct IO) is not used (filesystemio_options is set to ‘none’ or ‘async’ when using a filesystem, ASM (alias “Oracle managed raw devices”) is inherently direct path IO, except when the ASM “disks” are on a filesystem (when ASM is used with NFS (!), then filesystemio_options is obeyed)), the block could very well be coming from the filesystem cache of linux. In fact, without DIO a phenomenon known as ‘double buffering’ takes place, which means the IO doesn’t happen to it’s visible disk devices directly, but it needs to take a mandatory step in between, done at the kernel level, which means the data is put in the filesystem cache of linux too. It should be obvious that this extra work comes at the cost of extra CPU cycles being used, and is in almost any case unnecessary.

If you take a step back you should realise this event should take place for a limited amount of blocks during execution. Because of the inherent single block IO nature of db file sequential read, every physical read (when it needs to read from a physical disk device) takes the IO latency penalty. Even with solid state disk devices, which have an inherently lower latency time because there are no rotating parts and disk heads to be moved, chopping up an operation in tiny parts when a full table scan or fast full index scan could be done means a lot of CPU time is used whilst it could be done more efficient.

The time spend on ‘db file sequential read’ quite accurately times single block IO. This means a direct relationship between ‘db file sequential read’ timings and amount should exist with operating system measured IO statistics (iostat, sar and more).

Share this:

Like this:

Last week I’ve gotten a question on how storage indexes (SI) behave when the table for which the SI is holding data is changed. Based on logical reasoning, it can be two things: the SI is invalidated because the data it’s holding is changed, or the SI is updated to reflect the change. Think about this for yourself, and pick a choice. I would love to hear if you did choose the correct one.

First let’s do a step back and lay some groundwork first. The tests done in this blogpost are done on an actual Exadata (V2 hardware), with Oracle version 11.2.0.4.6 (meaning bundle patch 6). The Exadata “cellos” (Cell O/S) version is 11.2.3.3.1.140529.1 on both the compute nodes and the storage nodes.

A storage index is a memory structure used by the cell daemon, which is the storage server process on the storage layer of Exadata. By default (although I’ve never seen it different yet) a SI contains minimum and maximum values for up to eight columns of the table it is describing. The memory structure is transient. It describes a region of one megabyte of a table. A storage index is populated during the fetching of data for a smart scan, based on the filter predicates of the query causing the smart scan. In essence the whole SI management is done automatically by the cell daemon (“essence” means you can play around with some undocumented settings on the database level, and there are some undocumented settings and events you can set on the cell server level).

Okay, back to the the original question. Let’s test this. First we need a table large enough to be smartscanned. I’ve got a table called ‘BIGTAB_NOHCC’ (as you probably guessed, there is a table ‘BIGTAB’ in the schema I am using, which is hybrid columnar compressed). This table consists of 2671736 blocks of 8KB, which means it got a total size of 21G. This is big enough for my database instance (big enough is relative to the buffer cache size) to get a smart scan.

For this test, I use the session statistics. In v$sesstat there is a statistic called ‘cell physical IO bytes saved by storage index’, which tells us how many bytes we saved from being scanned because of the use of storage indexes. I also show some output of a (single) storage server, although a normal Exadata rack typically will have 3, 7 or 14 storage servers in use.

First of all, in order to get an idea and be pretty sure my query will populate storage indexes, I stop and start the cell daemon on the storage servers. I use ‘service celld restart’ for that.

Next, in order to get storage index information for the table I use, I need some metadata. The metadata I need are:
a) data object id

Next, I run the query on my BIGTAB_NOHCC table. Please mind it’s fundamental to have a filter predicate, and that the execution plan shows we are doing a full table scan. Only with a filter predicate the storage server has a reason to build a storage index:

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old 4: and name like nvl('%&name%',name)
new 4: and name like nvl('%%storage%%',name)
NAME VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index 0
MARTIN:dbm011> select count(*) from bigtab_nohcc where id=906259;
COUNT(*)
-----------
16
MARTIN:dbm011> @mystats
Enter value for name: %storage%
old 4: and name like nvl('%&name%',name)
new 4: and name like nvl('%%storage%%',name)
NAME VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index 0

This is why it’s handy to have the aforementioned data on the table: you can dump the specific storage indexes for a single table. The three numbers after ‘all’ are data_object_id, tablespace number and the ksqdngunid from the x$ksqdn table.

Here you see the typical heading of an Oracle trace file, next the announcement of the dump (“Dumping storage idx summary”,”Dumping RIDX summary”). The real storage index information starts with “RIDX(0x7f27b4ddab64)”. Starting with that and until the “lo” and “hi” values, you are looking at an actual storage index which holds data for a single column. You can see which column by looking at the “Col id” in the square brackets: 1. It’s interesting to note that there is a scn (system change number) included. The storage index shows if there are NULL values in the column (in this case it says HASNONNULLVALUES, so we don’t have any NULL values in the 1MB chunk in the column this storage index describes), and, of course, the low and high values in the Oracle internal data format.

So, despite any indication on the database layer, the query built storage indexes! That should mean that executing the same query again will result in actually using the storage indexes which were just build:

MARTIN:dbm011> select count(*) from bigtab_nohcc where id=906259;
COUNT(*)
-----------
16
MARTIN:dbm011> @mystats
Enter value for name: %storage%
old 4: and name like nvl('%&name%',name)
new 4: and name like nvl('%%storage%%',name)
NAME VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index 5746368512

Yes! This using the storage index we caused the be built up previously!

With this being established, we can try to get an answer to our question: how does the storage index react to DML on the table it is describing?

We build the storage index, and used it. Now let’s update the ‘id’ field for which the storage index was build, and redo our query test:

Okay, now let’s redo the select again, and take stats before and after on storage index usage!

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old 4: and name like nvl('%&name%',name)
new 4: and name like nvl('%%storage%%',name)
NAME VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index 5746368512
MARTIN:dbm011> select count(*) from bigtab_nohcc where id=906260;
COUNT(*)
-----------
16
MARTIN:dbm011> @mystats
Enter value for name: %storage%
old 4: and name like nvl('%&name%',name)
new 4: and name like nvl('%%storage%%',name)
NAME VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index 5746368512

What we see here with the storage index session statistic, is the update statement didn’t use the storage index (which is obvious, we updated the data, which happened on the database layer, so we didn’t use a smartscan) because this number is the same as the last time we looked at it before the update statement.
When we executed the select query on the table with a filter on the id column again, there is no storage index usage, because the storage index session statistic didn’t increase.

Actually, this query built new storage indexes. When this query is executed again, we can use these:

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old 4: and name like nvl('%&name%',name)
new 4: and name like nvl('%%storage%%',name)
NAME VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index 5746368512
MARTIN:dbm011> select count(*) from bigtab_nohcc where id=906260;
COUNT(*)
-----------
16
MARTIN:dbm011> @mystats
Enter value for name: %storage%
old 4: and name like nvl('%&name%',name)
new 4: and name like nvl('%%storage%%',name)
NAME VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index 14534426624

(actually, it sometimes can take more than one execution for the storage indexes to be created again, my guess would be some heuristics are used to try to come up with the best candidates for storage indexes)

Let’s try another case: update a column in the table for which storage indexes are created on another column.
First make sure storage indexes are build:

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old 4: and name like nvl('%&name%',name)
new 4: and name like nvl('%%storage%%',name)
NAME VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index 0
MARTIN@dbm011 > select count(*) from bigtab_nohcc where id=906260;
COUNT(*)
----------
16
MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old 4: and name like nvl('%&name%',name)
new 4: and name like nvl('%%storage%%',name)
NAME VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index 8790155264

For completeness sake, I verified the storage indexes for this table by using the dump shown previously. A simple grep on the dump shows this query created storage indexes for only column 1, the id column:

And query the storage index use, and do our query with filter predicate again:

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old 4: and name like nvl('%&name%',name)
new 4: and name like nvl('%%storage%%',name)
NAME VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index 8790155264
MARTIN@dbm011 > select count(*) from bigtab_nohcc where id=906260;
COUNT(*)
----------
16
MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old 4: and name like nvl('%&name%',name)
new 4: and name like nvl('%%storage%%',name)
NAME VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index 8790155264

So…this strongly suggests the update invalidated our storage index, even when the column was not in the storage index.
Just for completeness sake, let’s issue the select statement again to see if the storage index was build up again:

Yes, there it is again, the previous statement didn’t benefit from the storage indexes, but build them, and now we take advantage of it again.

Conclusion.
I ran these tests on a quarter rack without any other usage. The method I used for measuring how storage indexes behave was to execute on the database layer, and see the result on the storage layer and in database layer session statistics. It’s important to realise that despite this being quite strong evidence, there isn’t definite proof on my conclusion. Also, the behaviour described can change in future storage server versions.

My conclusion is that the storage server uses a system change number to validate if the storage indexes are valid. If not, storage indexes are not used. If blocks contained in a storage index progresses it’s system change number (in other words: are updated), the storage index is invalidated right away (during my tests, when the update starts, no or few storage indexes remain).

So, the ones that picked storage indexes being invalidated were right. As far as I could detect, there is no updating of storage indexes.

Share this:

Like this:

There’s been some debate about how to get the parameters from a spfile. A spfile is a binary version of the parameter file of the Oracle database.

I added to the debate that my experience is that there are is some weirdness with using the strings command on the spfile. The discussion was on twitter, I didn’t add that doing that it most of the time meant it costed more time than I saved from using the “shortcut” of using strings on a spfile.

Let me show you what it means.

I’ve got a database with storage on ASM. Among other options, there are two simple methods to get the spfile from ASM:

You can get the spfile by logging on to the database, and create a pfile from the spfile, and create a spfile again:

This is exactly what I meant: for some reason, the line with ‘control_files’ seems to have been broken into two pieces. I think I don’t need to tell most readers of this blog that tracing this kind of oddities costs a lot of time, especially if you’ve got a big spfile. And: most of the time you are playing around with this, there probably is something wrong, and you simply don’t have the time for this fooling around.

But what why is this happening? Let’s look at the spfile contents using the ‘od’ utility (octal dump):

This is the beginning of the spfile, to get an idea what we are looking at.
The numbers on the left side (00000000, 00000016, etc) are the position numbers in decimal. This shows there are 16 characters per line. The numbers on the line of the position are the ASCII values. The character representation of the ASCII value is BENEATH it. If you now read the line (looking at the character representation, you see ‘v11204.__db_cach (new line) e_size=692060160 (new line)’ and then ASCII value 10, which is represented with ‘nl': newline.

Okay, now we are used to reading this output, now let’s look at the problem section with the control_files line:

If we look closely, you can see ‘*.control_fi’ on the first line, after the ASCII value 10, newline.
If we read on, it’s shows:
*.control_files=’+DATA/v11204/controlfile/current.25 (nl) 6.847475315′ (nl)
In other words, there is an additional newline. But the position (488) seems strange to me.

But when looking at how I generated this, I executed ‘strings’. This means the output is filtered to readable characters. Would there be unreadable characters in a spfile? Let’s look!

I think this is quite self explanatory to a lot of people. If not, let me help you: all is as we expect up to position 1018, at which there are a few non-readable characters, until position 1043. This means there are 25 character positions which contain something else, after which the parameter file contents continue.

But now look at the position: it’s around position 1024. It’s my guess that the spfile uses a block size of 1024 bytes (1KB). In order to check for consistency of the parameter file blocks, Oracle puts some extra (internal) data on the borders of the block so integrity can be checked. This is like an Oracle database block.

So, there you have a reason not to use strings on the spfile, unless you like a game of find the random newlines in your new pfile.

How should you create a new pfile then?

The preferred method is using ‘create pfile from spfile’. This requires logging on to the instance (nomount is enough).
An alternative is to look at the alert.log file. When starting an Oracle instance, the non-default parameters are printed in the alert.log. This is a very simple, yet useful method of reconstructing the parameter file.

Update: I’ve gotten an email from Bjoern Rost saying that my statement on the need of the instance at least needing to be in nomount phase is not true, because the ‘create pfile from spfile’ commands can be used with the instance being down.

I decided to take this for an additional test. First of all, there two methods which can be used pfile/spfile manipulation via sqlplus (as far as I know): logging on with SYSDBA privileges, and starting sqlplus on the local node without logging on to any instance (/nolog).

a) instance is OPEN (the which is the same with in the nomount and mount phase)

Upon given this a little thought, it’s kind of obvious only saying spfile can’t be used with /nolog: there are no settings, we are not connected to any instance. So let’s try specifying a full path for both pfile and spfile:

What is shown here, is that when my instance is down, I can’t use ‘create pfile from spfile’ without a specification for spfile. Apparently, it looks at the default location ($ORACLE_HOME/dbs) for the default spfile name (spfilev11204.ora in my case): ORA-01565: error in identifying file ‘?/dbs/spfile@.ora’. ‘?’ means $ORACLE_HOME and ‘@’ means instance name. Let’s see if this works if we create a spfile on that location:

The first step is already done, but redone for completeness sake. The second step creates a spfile on the default location from the pfile ‘/tmp/tt’. Now we can ask sqlplus to essentially do the same but in reverse: create ‘/tmp/tt’ from spfile:

Yes! Now this works. The reason for this behaviour is I explicitly want my spfile not to be on the default location ($ORACLE_HOME/dbs), because this is on a local filesystem. This is not a problem with single instance databases with no shared storage at all, but this is not practical in the case of RAC, and with single instance databases with a cluster and shared storage (think Exadata here!), because it’s practical to have the spfile on shared diskspace so the instance can very easily be started on another node.

Nope. That doesn’t work. I’ve seen blogposts indicating that /nolog can be used for pfile/spfile manipulation, this didn’t work in my case. Bjoern was right that pfile/spfile can be done while the instance is down.