Applying External Timing Data to Untimed Events

Or how to have an answer other than “I don’t know” when asked “How long does that take?”.

Recently while working on a client site I discovered that it takes 30-90 seconds to make an ssh connection to one of the servers. Connections between servers for this client typically take < 1 second, so the lengthy connection time was definitely out of order.

If you are familiar with debugging ssh connections you are probably familiar with the ‘-v’ option that directs ssh to output verbose comments stating which operation is currently taking place. You can add up to three -v options on the command line, increasing the verbosity with each one. An example follows:
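The original session output is not reproduced here, but the invocation looks like this (the user and host names are hypothetical):

```shell
# Each additional -v increases verbosity; -vvv is the maximum.
# Hypothetical user and host names.
ssh -vvv oracle@dbserver01
```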

So, now I can connect to the remote server, observe the screen and see which operations of the connection are taking a long time.

There are however at least two problems with this:

I don’t know how long each bit is taking

Even though I can see it is slow, there’s no easy way for me to pass this information along.

If only ssh provided some method for showing the timing of each operation. Then I could create a log file of it, bundle it up into a zip file and send it to the system administrator with an explanation of the problem, along with corroborating data showing where ssh was having problems.

As this was a Linux system, I could have just used strace. But strace is too granular for this purpose – it provides far more data than I wanted.

And like it or not, sometimes I am not working on my favorite OS. Sometimes I work on Windows, and it would be nice to have some method of obtaining timing data that would work on many platforms.

About this time I had an idea. They don’t occur all that often, so I was glad to have one to entertain.

The diagnostic output from most programs is usually unbuffered – that is, it is written directly to standard output or standard error as soon as it is generated. If that is the case, then the time between one line of output and the next approximates how long the intervening operation took to run.

Ah hah! Why don’t I just collect the time between each line, and report the difference?

Doing so would allow me to see which parts of ssh are taking so much time.

Being something of a Perl aficionado, I set about creating a utility to accept text on standard input and report the time difference separating the arrival of each line.
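The core idea can be sketched in a few lines of shell. This is a minimal illustration of the technique, not the author's perltrace.pl: it reads lines from standard input and prefixes each with the seconds elapsed since the previous line arrived.

```shell
#!/bin/sh
# Minimal sketch: prefix each line of stdin with the time elapsed
# since the previous line arrived. (Illustration only - this is not
# the perltrace.pl source described in the article.)
line_timer() {
    prev=$(date +%s.%N)
    while IFS= read -r line; do
        now=$(date +%s.%N)
        # Use awk for the floating-point subtraction and formatting.
        elapsed=$(awk -v a="$now" -v b="$prev" 'BEGIN{printf "%10.6f", a-b}')
        printf '%s %s\n' "$elapsed" "$line"
        prev=$now
    done
}

# Example: the second line should show roughly one second of elapsed time.
{ echo "first line"; sleep 1; echo "second line"; } | line_timer
```

A small difference (the overhead of the timing loop itself) is acceptable here, per the goals listed below.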

While it did not turn into a daunting task, if you’re at all familiar with Oracle 10046 trace files you will have some knowledge of the difficulty of timing events. It is not always straightforward. For instance, sometimes the timing for an event appears in the output before the event itself does.

To keep this simple, useful and maintainable, the following goals were set for this new utility:

it should be able to track timing regardless of whether the diagnostic output appears before or after the actual operation

the timing should be reasonably close to reality – small differences (overhead in the trace program) are OK

not getting the timing right in the first line fed to the utility is OK – the startup time of the trace utility will affect it

Thus perltrace.pl was born (known simply as perltrace in the rest of this article).

Perltrace is meant to be used in a pipeline. Rather than trace an ssh operation, I used perltrace to get timing data for an scp operation. The reason is that a successful ssh connection leaves me logged in to the remote server, which makes logging the connection details locally awkward; scp, on the other hand, runs to completion and returns, while still doing most of what ssh does – verifying authentication and making a connection to the remote host.

Here’s the command line I used, along with some initial lines from the output:
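The exact command and output are not reproduced here, but the shape of the pipeline (file, user, host, and path are hypothetical) is:

```shell
# scp -v writes its diagnostics to stderr, so redirect them into the
# pipe for perltrace to time. File, user, host, and path are hypothetical.
scp -v datafile.dmp oracle@dbserver01:/tmp/ 2>&1 | perltrace.pl
```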

So that perltrace can be tested, there is a shell script, perltrace_ut_data.sh. The shell script performs a series of sleeps in a loop, optionally printing the command before or after each sleep. The entire script appears here:
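The original script is not included in this copy of the article; a reconstruction consistent with the description above (a series of sleeps in a loop, printing each command either before or after it runs) might look like:

```shell
#!/bin/sh
# Hedged reconstruction of perltrace_ut_data.sh - not the original.
# Pass Y to print each command before the sleep (matching -cmdfirst),
# or N to print it after the sleep (matching -nocmdfirst).
run_ut_data() {
    CMDFIRST=${1:-Y}
    for SECS in 1 2; do
        if [ "$CMDFIRST" = "Y" ]; then
            echo "sleep $SECS"     # announce the operation, then do it
            sleep "$SECS"
        else
            sleep "$SECS"          # do the operation, then report it
            echo "sleep $SECS"
        fi
    done
}

run_ut_data Y
```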

Line 4

This line probably doesn’t look very important, but in fact it is the key to printing the timing information with the correct line from the input.

There are two different types of input we expect to see. They are selected by the perltrace.pl options -cmdfirst and -nocmdfirst, with -cmdfirst being the default.

-cmdfirst – This means we expect the text describing the operation to appear in the output before the operation is performed. This is the program that is being logged telling us “Here is what I am going to do”, and then doing it.

-nocmdfirst – This means we expect the text describing the operation to appear in the output after the operation has been performed. It is the program telling us “This is what I just did”.

If we want to associate the time with the correct operation, it is necessary to know whether the diagnostic output appears before or after the operation.

So line 4 is used to ensure there are two elements in the @printline array. If the -nocmdfirst option is used, there will initially be no elements in the array. This will make more sense when we get to line 14.

Line 14

Here the topmost element of the array is shifted out. If there are no elements then nothing is shifted out, and the array remains empty. So if the -cmdfirst option was used, the current line of input will be placed in $printline[1]. If -nocmdfirst is in force, then the current line of input will be in $printline[0].

Lines 17-31

These lines deal with handling the first line of input from the pipe, and as such this block is executed only once. If -nocmdfirst is in effect, we pick up the timing information and go to the top of the loop. If however -cmdfirst is in effect, we skip the timing information. Why skip it? Since we expect the command text to appear before the operation is executed, we just go back to the top of the loop and wait for the next line of input. When that line appears, we know the previous operation has completed, and we can print the command along with its timing.

Lines 33-40

These lines are executed for every remaining line of input. Line 33 gets the current time, and line 35 computes the elapsed time for the current operation. Line 37 is quite interesting, as it is the line that prints the output with the timing information. If -wallclock was included as a command line option, the wall clock time appears at the beginning of the line, followed by the elapsed time, and then the input text from the pipe. You may have noticed that the printf() statements always use $printline[0] as the text to print, and you may recall that the @printline array may have one or two elements in it, depending on whether -nocmdfirst or -cmdfirst was used on the command line. Line 4 ensures that the correct timing information is matched up with the input line by setting the array size.

Lines 46-51

And finally these lines complete the job if -cmdfirst was used, as the final timing data is not known until the last line of input has been received and the program has exited the loop.
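Pulling the steps above together, the same control flow can be sketched in shell (the author's Perl source is not reproduced here; this is an illustration of the buffering logic just described):

```shell
#!/bin/sh
# Sketch of the control flow described above. A one-slot buffer pairs
# each elapsed time with the correct line: with cmdfirst (Y), the time
# measured when a line arrives belongs to the PREVIOUS line; with
# nocmdfirst (N), it belongs to the line itself.
timed_pipe() {
    CMDFIRST=${1:-Y}
    have_prev=0
    prev_line=''
    prev_ts=$(date +%s.%N)
    while IFS= read -r line; do
        now=$(date +%s.%N)
        elapsed=$(awk -v a="$now" -v b="$prev_ts" 'BEGIN{printf "%10.6f", a-b}')
        if [ "$CMDFIRST" = "Y" ]; then
            # "Here is what I am going to do": skip the first line's
            # timing, then report each elapsed time against the line
            # that announced the operation.
            [ "$have_prev" -eq 1 ] && printf '%s %s\n' "$elapsed" "$prev_line"
            prev_line=$line
            have_prev=1
        else
            # "This is what I just did": the time belongs to this line.
            printf '%s %s\n' "$elapsed" "$line"
        fi
        prev_ts=$now
    done
    # cmdfirst only: the last line's timing is not known until input
    # ends, so it is flushed after the loop exits.
    if [ "$CMDFIRST" = "Y" ] && [ "$have_prev" -eq 1 ]; then
        now=$(date +%s.%N)
        elapsed=$(awk -v a="$now" -v b="$prev_ts" 'BEGIN{printf "%10.6f", a-b}')
        printf '%s %s\n' "$elapsed" "$prev_line"
    fi
}

# Example: each "sleep 1" is printed before the delay, so cmdfirst mode
# attributes the following delay to the line that announced it.
{ echo "sleep 1"; sleep 1; echo "sleep 1"; sleep 1; } | timed_pipe Y
```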

Other Uses

Since first writing perltrace.pl I have found other uses for it. It has recently proved useful for timing parts of RMAN and rsync operations.

If you are running a lengthy RMAN job, pipe the output through perltrace and then to a log file; you can see in real time how much time each step requires. Here’s the output from an operation to recatalog some backup files that were moved. Even if you don’t need the information right away, it is always nice to have an idea of how long various database operations take. I had no idea it would take so long to recatalog these backup pieces, as this was something I had never done before.
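The RMAN output itself is not reproduced here, but the shape of such a pipeline (the script and log file names are hypothetical) would be:

```shell
# Pipe RMAN's output through perltrace and capture it with tee so it
# can be watched in real time. Script and log names are hypothetical.
rman target / @recatalog_backups.rman 2>&1 | perltrace.pl | tee recatalog.log
```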

Even more recently I was using rsync to relocate a few hundred megabytes of archived logs to a new location. By piping the output through perltrace I could see how much time was required to copy each file, and from that make a reasonable estimate of how much more time was needed to complete the job.
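A pipeline along these lines (the source and destination paths are hypothetical) lets perltrace attach a time to each file name rsync prints:

```shell
# -v makes rsync print each file name as it is copied; perltrace then
# shows how long each file took. Paths are hypothetical.
rsync -av /u01/archivelogs/ /u02/archivelogs/ 2>&1 | perltrace.pl -wallclock
```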

If you have ever used tail -f to watch the Oracle alert log for changes, such as the application of archive logs during a database recovery, you may have wanted to know just how much time was required to apply each log. Oracle does provide date stamps before and after each log is applied, but you are left calculating the time yourself. This can easily be accomplished by piping the output of tail -f through perltrace. I don’t have any results handy for this one, but the command line is simple enough:

$> tail -f alert_orcl.log | perltrace -wallclock

The first few lines will of course have incorrect timing information, as those lines already exist in the file. This method works for any log file that is continuously written to.

Comments

This is great. I can see it being really useful. I’ve just downloaded the code and started to have a play. Maybe I’m not seeing the wood for the trees, but it looks like the version you have available for download does not include the “-wallclock” or “-totals” options? Please would you make the version with these options available?

The outputs I have tested don’t seem to be buffered. I haven’t tested any other than RMAN, scp and alert.log as seen in the article.

I did run some tests piping strace -r output through perltrace.pl for scp, and the times from perltrace were close to those of scp. Not exact, and I didn’t expect exact, but still quite close.

If the output is buffered, then the results will be skewed. Even so, they may still be useful as an aid to narrowing down which areas are of concern. I think the usefulness would depend on the size of the buffer – the larger the buffer, the less useful the timings.
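When a producer does buffer its output, GNU coreutils provides stdbuf, which can often force line buffering for programs that use C stdio, keeping the timings meaningful:

```shell
# In practice: stdbuf -oL some_program 2>&1 | perltrace.pl
# Self-contained demonstration that stdbuf -oL (line-buffered stdout)
# passes output through unchanged; 'cat' stands in for the producer.
printf 'one\ntwo\n' | stdbuf -oL cat
```

Note that stdbuf cannot help with programs that do their own internal buffering rather than relying on stdio.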

I would hope that most programs’ diagnostic output is always unbuffered, but that might be too much to hope for.