NYTProf is effectively two profilers in one: a statement profiler, and a subroutine profiler.

Statement Profiling

The statement profiler measures the time between entering one perl statement and entering the next. Whenever execution reaches a new statement, the time since entering the previous statement is calculated and added to the time associated with the line of the source file that the previous statement starts on.

By default the statement profiler also determines the first line of the current block and the first line of the current statement, and accumulates times associated with those. NYTProf is the only Perl profiler to perform block level profiling.

Another innovation unique to NYTProf is automatic compensation for a problem inherent in simplistic statement-to-statement timing. Consider a statement that calls a subroutine and then performs some other work that doesn't execute new statements, for example:

foo(...) + mkdir(...);

In all other statement profilers the time spent in remainder of the expression (mkdir in the example) will be recorded as having been spent on the last statement executed in foo()! Here's another example:

while (<>) {
...
1;
}

After the first time around the loop, any further time spent evaluating the condition (waiting for input in this example) would be be recorded as having been spent on the last statement executed in the loop!

NYTProf avoids these problems by intercepting the opcodes which indicate that control is returning into some previous statement and adjusting the profile accordingly.

The statement profiler naturally generates a lot of data which is streamed out to a file in a very compact format. NYTProf takes care to not include the measurement and writing overheads in the profile times (some profilers produce 'noisy' data due to periodic stdio flushing).

Subroutine Profiling

The subroutine profiler measures the time between entering a subroutine and leaving it. It then increments a call count and accumulates the duration. For each subroutine called, separate counts and durations are stored for each location that called the subroutine.

Subroutine entry is detected by intercepting the entersub opcode. Subroutine exit is detected via perl's internal save stack. The result is both extremely fast and very robust.

Subroutine Recursion

For subroutines that recurse directly or indirectly, such as Error::try, the inclusive time is only measured for the outer-most call.

The inclusive times of recursive calls are still measured and are accumulated separately. Also the 'maximum recursion depth' per calling location is recorded.

Goto &Subroutine

Perl implements a goto &destination as a return followed by a call to &destination, so that's how it will appear in the report.

The goto will be shown with a very short time because it's effectively just a return. The &destination sub will show a call not from the location of the goto but from the location of the call to the sub that performed the goto.

Application Profiling

NYTProf records extra information in the data file to capture details that may be useful when analyzing the performance. It also records the filename and line ranges of all the subroutines.

NYTProf can profile applications that fork, and does so with no loss of performance. There's (now) no special 'allowfork' mode. It just works. NYTProf detects the fork and starts writing a new profile file with the pid appended to the filename.

Fast Profiling

The NYTProf profiler is written almost entirely in C and great care has been taken to ensure it's very efficient.

Apache Profiling

Just add one line near the start of your httpd.conf file:

PerlModule Devel::NYTProf::Apache

By default you'll get a /tmp/nytprof.$$.out file for the parent process and a /tmp/nytprof.$parent.out.$$ file for each worker process.

NYTProf takes care to detect when control is returning back from perl to mod_perl so time spent in mod_perl (such as waiting for the next request) does not get allocated to the last statement executed.

PROFILING

Usually you'd load Devel::NYTProf on the command line using the perl -d option:

perl -d:NYTProf some_perl.pl

To save typing the ':NYTProf' you could set the PERL5DB env var

PERL5DB='use Devel::NYTProf'

and then just perl -d would work:

perl -d some_perl.pl

Or you can avoid the need to add the -d option at all by using the PERL5OPT env var:

PERL5OPT=-d:NYTProf

That's also very handy when you can't alter the perl command line being used to run the script you want to profile.

NYTPROF ENVIRONMENT VARIABLE

The behavior of Devel::NYTProf may be modified by setting the environment variable NYTPROF. It is possible to use this environment variable to effect multiple setting by separating the values with a :. For example:

export NYTPROF=trace=2:start=init:file=/tmp/nytprof.out

Any colon or equal characters in a value can be escaped by preceding them with a backslash.

addpid=1

Append the current process id to the end of the filename.

This avoids concurrent, or consecutive, processes from overwriting the same file.

trace=N

Set trace level to N. 0 is off (the default). Higher values cause more detailed trace output. Trace output is written to STDERR or wherever the "log=F" option has specified.

log=F

Specify the name of the file that "trace=N" output should be written to.

start=...

Specify at which phase of program execution the profiler should be enabled:

The start=no option is handy if you want to explicitly control profiling by calling DB::enable_profile() and DB::disable_profile() yourself.

optimize=0

Disable the perl optimizer.

By default NYTProf leaves perl's optimizer enabled. That gives you more accurate profile timing overall, but can lead to odd statement counts for individual sets of lines. That's because the perl's peephole optimizer has effectively rewritten the statements but you can't see what the rewritten version looks like.

For example:

1 if (...) {
2 return;
3 }

may be rewritten as

1 return if (...)

so the profile won't show a statement count for line 2 in your source code because the return was merged into the if statement on the preceding line.

Using the optimize=0 option disables the optimizer so you'll get lower overall performance but more accurately assigned statement counts.

If you find any other examples of the effect of optimizer on NYTProf output (other than performance, obviously) please let us know.

subs=0

Set to 0 to disable the collection of subroutine caller and timing details.

blocks=0

Set to 0 to disable the determination of block and subroutine location per statement. This makes the profiler about 50% faster (as of July 2008) and produces smaller output files, but you lose some valuable information. The extra cost is likely to be reduced in later versions anyway, as little optimization has been done on that part of the code.

stmts=0

Set to 0 to disable the statement profiler. (Implies blocks=0.) The reports won't contain any statement timing detail.

This significantly reduces the overhead of the profiler and can also be useful for profiling large applications that would normally generate a very large profile data file.

leave=0

Set to 0 to disable the extra work done by the statement profiler to allocate times accurately when returning into the middle of statement. For example leaving a subroutine and returning into the middle of statement, or re-evaluating a loop condition.

This feature also ensures that in embedded environments, such as mod_perl, the last statement executed doesn't accumulate the time spent 'outside perl'.

NYTProf is the only line-level profiler to measure these times correctly. The profiler is fast enough that you shouldn't need to disable this feature.

findcaller=1

Force NYTProf to recalculate the name of the caller of the each sub instead of 'inheriting' the name calculated when the caller was entered. (Rarely needed, but might be useful in some odd cases.)

use_db_sub=1

Set to 1 to enable use of the traditional DB::DB() subroutine to perform profiling, instead of the faster 'opcode redirection' technique that's used by default. Also effectively sets leave=0 (see above).

The default 'opcode redirection' technique can't profile subroutines that were compiled before NYTProf was loaded. So using use_db_sub=1 can be useful in cases where you can't load the profiler early in the life of the application.

savesrc=1

Save a copy of all source code into the profile data file. This makes the file self-contained, so the reporting tools no longer depend on having the original source code files available. So it also insulates you from later changes to those files that would normally make the reports out of sync with the data.

By default NYTProf saved some source code: the arguments to the perl -e option, the script fed to perl via STDIN when using perl -, and the source code of string evals. (Currently string eval source code isn't available in the reports. Patches welcome.)

If you're using perl 5.10.0 or 5.8.8 (or earlier) then you need to also enable the use_db_sub=1 option otherwise perl doesn't make the source code available to NYTProf. Perl 5.8.9 and 5.10.1+ don't require that.

slowops=N

Profile perl opcodes that can be slow. These include opcodes that make system calls, such as print, read, sysread, socket etc., plus regular expression opcodes like subst and match.

If N is 0 then slowops profiling is disabled.

If N is 1 then all the builtins are treated as being defined in the CORE package. So times for print calls from anywhere in your code are merged and accounted for as calls to an xsub called CORE::print.

If N is 2 (the default) then builtins are treated as being defined in the package that calls them. So calls to print from package Foo are treated as calls to an xsub called Foo::CORE:print. Note the single colon after CORE.

The opcodes are currently profiled using their internal names, so printf is prtf and the -x file test is fteexec. This may change in future.

Opcodes that call subroutines, perhaps by triggering a FETCH from a tied variable, currently appear in the call tree as the call of the sub. This is likely to change in future.

usecputime=1

Measure user CPU + system CPU time instead of the real elapsed 'wall clock' time (which is the default).

Measuring CPU time has the advantage of making the measurements independent of time spent blocked waiting for the cpu or network i/o etc. But it also has the severe disadvantage of having typically far less accurate timings.

Most systems use a 0.01 second granularity. With modern processors having multi- gigahertz clocks, 0.01 seconds is like a lifetime. The cpu time clock 'ticks' happen so rarely relative to the activity of a most applications that you'd have to run the code for many hours to have any hope of reasonably useful results.

A better alternative would be to use the clock=N option to select a high-resolution cpu time clock, if available on your system.

file=...

compress=...

Specify the compression level to use, if NYTProf is compiled with compression support. Valid values are 0 to 9, with 0 disabling compression. The default is 6 as higher values yield little extra compression but the cpu cost starts to rise significantly. Using level 1 still gives you a significant reduction in file size.

If NYTProf was not compiled with compression support, this option is silently ignored.

clock=N

Systems which support the clock_gettime() system call typically support several clocks. By default NYTProf uses CLOCK_MONOTONIC.

This option enables you to select a different clock by specifying the integer id of the clock (which may vary between operating system types). If the clock you select isn't available then CLOCK_REALTIME is used.

sigexit=1

When perl exits normally it runs any code defined in END blocks. NYTProf defines an END block that finishes profiling and writes out the final profile data.

If the process ends due to a signal then END blocks are not executed. The sigexit option tells NYTProf to catch some signals (e.g. INT, HUP, PIPE, SEGV, BUS) and ensure a usable by executing:

DB::finish_profile();
exit 1;

You can also specify which signals to catch in this way by listing them, seperated by commas, as the value of the option (case is not significant):

sigexit=int,hup

forkdepth=N

When a perl process that is being profiled executes a fork() the child process is also profiled. The forkdepth option can be used to control this. If forkdepth is zero then profiling will be disabled in the child process.

If forkdepth is greater than zero then profiling will be enabled in the child process and the forkdepth value in that process is decremented by one.

If forkdepth is -1 (the default) then there's no limit on the number of generations of children that are profiled.

RUN-TIME CONTROL OF PROFILING

You can profile only parts of an application by calling DB::disable_profile() to stop collecting profile data, and calling DB::enable_profile() to start collecting profile data.

Using the start=no option lets you leave the profiler disabled initially until you call DB::enable_profile() at the right moment.

The profile output file can't be used until it's been properly completed and closed. Calling DB::disable_profile() doesn't do that. To make a profile file usable before the profiled application has completed you can call DB::finish_profile(). Alternatively you could call DB::enable_profile($newfile).

DB::disable_profile()

Stops collection of profile data.

Subroutine calls which were made while profiling was enabled and are still on the call stack (have not yet exited) will still have their profile data collected when they exit.

DB::enable_profile($newfile)

Enables collection of profile data. If $newfile is true the profile data will be written to $newfile (after completing and closing the previous file, if any). If $newfile already exists it will be deleted first.

DB::finish_profile()

Calls DB::disable_profile(), then completes the profile data file by writing subroutine profile data, and then closes the file. The in memory subroutine profile data is then discarded.

REPORTS

The Devel::NYTProf::Reader module provides an interface for generating arbitrary reports. This means that you can implement your own output format in perl. (Though the module is in a state of flux and may be deprecated soon.)

There is currently no tool to merge multiple data files into one.

Included in the bin directory of this distribution are some scripts which turn the raw profile data into more useful formats:

nytprofhtml

LIMITATIONS

Threads and Multiplicity

Devel::NYTProf is not currently thread safe or multiplicity safe. If you'd be interested in helping to fix that then please get in touch with us.

For perl < 5.8.8 it may change what caller() returns

For example, the Readonly module croaks with "Invalid tie" when profiled with perl versions before 5.8.8. That's because Readonly explicitly checking for certain values from caller(). The NEXT module is also affected.

For perl < 5.10.1 it can't see some implicit calls and callbacks

For perl versions prior to 5.8.9 and 5.10.1, some implicit subroutine calls can't be seen by the subroutine profiler. Technically this affects calls made via the various perl call_*() internal APIs.

For example, the TIE<whatever> subroutine called by tie(), all calls made via operator overloading, and callbacks from XS code, are not seen.

The effect is that time in the subroutines for those calls is accumulated by the subs that trigger them. So time spent in calls invoked by perl to handle overloading are accumulated by the subroutines that trigger overloading (so it is measured, but the cost is dispersed across possibly many calling locations).

Although the calls aren't seen by the subroutine profiler, the individual statements executed by the code in the called subs are profiled by the statement profiler.

#line directives

The reporting code currently doesn't handle #line directives, but at least it warns about them. Patches welcome.

Scope::Upper unwind()

NYTProf is currently incompatible with the deep magic performed by Scope::Upper's unwind() function. As a partial workaround you can set the subs=0:leave=0 options, but you won't get any subroutine timings. See http://rt.cpan.org/Public/Bug/Display.html?id=50634

Slowops doesn't include all kinds of s/// substitutions

Currently the substcont opcode isn't profiled. This means that substitutions using variables in the replacement string, or that use the /e modifier, are not fully profiled.

CLOCKS

Here we discuss the way NYTProf gets high-resolution timing information from your system and related issues.

POSIX Clocks

These are the clocks that your system may support if it supports the POSIX clock_gettime() function. Other clock sources are listed in the "Other Clocks" section below.

The clock_gettime() interface allows clocks to return times to nanosecond precision. Of course few offer nanosecond accuracy but the extra precision helps reduce the cumulative error that naturally occurs when adding together many timings. When using these clocks NYTProf outputs timings as a count of 100 nanosecond ticks.

CLOCK_REALTIME

CLOCK_REALTIME is typically the system's main high resolution 'wall clock time' source. The same source as used for the gettimeofday() call used by most kinds of perl benchmarking and profiling tools.

If your system doesn't support clock_gettime() then NYTProf will use gettimeofday(), or the nearest equivalent,

The problem with real time is that it's far from simple. It tends to drift and then be reset to match 'reality', either sharply or by small adjustments (via the adjtime() system call).

Surprisingly, it can also go backwards, for reasons explained in http://preview.tinyurl.com/5wawnn

CLOCK_MONOTONIC

CLOCK_MONOTONIC represents the amount of time since an unspecified point in the past (typically system start-up time). It increments uniformly independent of adjustments to 'wallclock time'.

CLOCK_VIRTUAL

CLOCK_VIRTUAL increments only when the CPU is running in user mode on behalf of the calling process.

CLOCK_PROF

CLOCK_PROF increments when the CPU is running in user or kernel mode.

CLOCK_PROCESS_CPUTIME_ID

CLOCK_PROCESS_CPUTIME_ID represents the amount of execution time of the process associated with the clock.

CLOCK_THREAD_CPUTIME_ID

CLOCK_THREAD_CPUTIME_ID represents the amount of execution time of the thread associated with the clock.

Finding Available POSIX Clocks

On unix-like systems you can find the CLOCK_* clocks available on you system using a command like:

grep -r 'define *CLOCK_' /usr/include

Look for a group that includes CLOCK_REALTIME. The integer values listed are the clock ids that you can use with the clock=N option.

A future version of NYTProf should be able to list the supported clocks.

Other Clocks

This section lists other clock sources that NYTProf may use.

gettimeofday

This is the traditional high resolution time of day interface for most unix-like systems. It's used on platforms like Mac OS X which don't (yet) support clock_gettime().

With this clock NYTProf outputs timings as a count of 1 microsecond ticks.

Time::HiRes

On systems which don't support clock_gettime() or gettimeofday() NYTProf falls back to using the Time::HiRes module. With this clock NYTProf outputs timings as a count of 1 microsecond ticks.

CAVEATS

SMP Systems

On systems with multiple processors, which includes most modern machines, (from Linux docs though applicable to most SMP systems):

The CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks are realized on
many platforms using timers from the CPUs (TSC on i386, AR.ITC on Itanium).
These registers may differ between CPUs and as a consequence these clocks may
return bogus results if a process is migrated to another CPU.
If the CPUs in an SMP system have different clock sources then there is no way
to maintain a correlation between the timer registers since each CPU will run
at a slightly different frequency. If that is the case then
clock_getcpuclockid(0) will return ENOENT to signify this condition. The two
clocks will then only be useful if it can be ensured that a process stays on a
certain CPU.
The processors in an SMP system do not start all at exactly the same time and
therefore the timer registers are typically running at an offset. Some
architectures include code that attempts to limit these offsets on bootup.
However, the code cannot guarantee to accurately tune the offsets. Glibc
contains no provisions to deal with these offsets (unlike the Linux Kernel).
Typically these offsets are small and therefore the effects may be negligible
in most cases.

In summary, SMP systems are likely to give 'noisy' profiles. Setting a "Processor Affinity" may help.

Processor Affinity

Processor affinity is an aspect of task scheduling on SMP systems. "Processor affinity takes advantage of the fact that some remnants of a process may remain in one processor's state (in particular, in its cache) from the last time the process ran, and so scheduling it to run on the same processor the next time could result in the process running more efficiently than if it were to run on another processor." (From http://en.wikipedia.org/wiki/Processor_affinity)

Setting an explicit processor affinity can avoid the problems described in "SMP Systems".

Processor affinity can be set using the taskset command on Linux.

Note that processor affinity is inherited by child processes, so if the process you're profiling spawns cpu intensive sub processes then your process will be impacted by those more than it otherwise would.

Virtual Machines

I recommend you don't do performance profiling while running in a virtual machine. If you do you're likely to find inexplicable spikes of real-time appearing at unreasonable places in your code. You should pay less attention to the statement timings and rely more on the subroutine timings. They will still be noisy but less so than the statement times.

You could also try using the clock=N option to select a high-resolution cpu-time clock instead of a real-time one. That should be much less noisy, though you will lose visibility of wait-times due to network and disk I/O, for example.

If your system doesn't support the clock=N option then you could try using the usecputime=1 option. That will give you cpu-time measurements but only at a very low 1/100th of a second resolution.

nytprofhtml is a script included that produces html reports. nytprofcsv is another script included that produces plain text CSV reports.

Devel::NYTProf::Reader is the module that powers the report scripts. You might want to check this out if you plan to implement a custom report (though it's very likely to be deprecated in a future release).

AUTHOR

COPYRIGHT AND LICENSE

Copyright (C) 2008 by Adam Kaplan and The New York Times Company.
Copyright (C) 2008, 2009 by Tim Bunce, Ireland.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.

HISTORY

A bit of history and a shameless plug...

NYTProf stands for 'New York Times Profiler'. Indeed, this module was initially developed from Devel::FastProf by The New York Times Co. to help our developers quickly identify bottlenecks in large Perl applications. The NY Times loves Perl and we hope the community will benefit from our work as much as we have from theirs.

Devel::NYTProf is a (now distant) fork of Devel::FastProf, which was itself an evolution of Devel::SmallProf.

Adam Kaplan took Devel::FastProf and added html report generation (based on Devel::Cover) and a test suite - a tricky thing to do for a profiler. Meanwhile Tim Bunce had been extending Devel::FastProf to add novel per-sub and per-block timing, plus subroutine caller tracking.

When Devel::NYTProf was released Tim switched to working on Devel::NYTProf because the html report would be a good way to show the extra profile data, and the test suite made development much easier and safer.

Then he went a little crazy and added a slew of new features, in addition to per-sub and per-block timing and subroutine caller tracking. These included the 'opcode interception' method of profiling, ultra-fast and robust inclusive subroutine timing, doubling performance, plus major changes to html reporting to display all the extra profile call and timing data in richly annotated and cross-linked reports.

Steve Peters came on board along the way with patches for portability and to keep NYTProf working with the latest development perl versions. Nicholas Clark added zip compression. Jan Dubois contributed Windows support.