Determining the Name of a Process from Python

Finding the name of the program from which a Python module is
running can be trickier than it would seem at first, and
investigating the reasons led to some interesting experiments.

A couple of weeks ago at the OpenStack Folsom Summit, Mark McClain
pointed out an interesting code snippet he had discovered in the Nova
sources:

nova/utils.py: 339

script_dir = os.path.dirname(inspect.stack()[-1][1])

The code is part of the logic to find a configuration file that lives
in a directory relative to where the application startup script is
located. It looks at the call stack to find the main program, and
picks the filename out of the stack details.

The code seems to be taken from a response to a StackOverflow
question, and when I saw it I thought it looked like a case of
someone going to more trouble than was needed to get the
information. Mark had a similar reaction, and asked if I knew of a
simpler way to determine the program name.

Similar examples with inspect.stack() appear in four places in
the Nova source code (at least as of today). All of them are either
building filenames relative to the location of the original “main”
program, or are returning the name of that program to be used to build
a path to another file (such as a log file or other program). Those
are all good reasons to be careful about the location and name of the
main program, but none explain why the obvious solution isn’t good
enough. I assumed that if the OpenStack developers were looking at
stack frames there must have been a reason. I decided to examine the
original code and spend a little time deciphering what it is doing,
and especially to see if there were cases where it did not work as
desired (so I could justify a patch).

The Stack

The call to inspect.stack() retrieves the Python interpreter
stack frames for the current thread. The return value is a list with
information about the calling function in position 0 and the “top”
of the stack at the end of the list. Each item in the list is a tuple
containing:

the stack frame data structure

the filename for the code being run in that frame

the line number within the file

the co_name member of the code object from the frame,
giving the function or method name being executed

the source lines for that bit of code, when available

an index into the list of source lines showing the actual source
line for the frame

The information is intended to be used for generating tracebacks or by
tools like pdb when debugging an application (although pdb has its
own implementation). To answer the question “Which program am I
running in?” the filename is the most interesting piece of data.
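
To make the record layout concrete, here is a small sketch (not taken
from the Nova sources; the function names describe_stack, inner, and
outer are made up for illustration) that walks the list and collects
the filename, line number, and function name from each record. It
indexes the records by position, so it should behave the same whether
the interpreter returns plain tuples or named tuples.

```python
import inspect


def describe_stack():
    """Return (filename, lineno, function) for each record on the stack.

    Position 0 is this function's own frame; the last item is the
    outermost frame, which is usually the main program.
    """
    summary = []
    for record in inspect.stack():
        # record[0] is the frame object itself; the next three fields
        # are the filename, the line number, and the function name.
        summary.append((record[1], record[2], record[3]))
    return summary


def inner():
    return describe_stack()


def outer():
    return inner()


if __name__ == '__main__':
    for filename, lineno, function in outer():
        print('%s:%s in %s' % (filename, lineno, function))
```

The Nova one-liner amounts to taking the filename from the last of
these records and passing it to os.path.dirname().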

One obvious issue with this approach is that the filename in the stack
frame may be relative to the startup directory of the application. It
could lead to an incorrect path if the process has changed its working
directory between startup and checking the stack. But there is
another mode where looking at the top of the stack produces completely
invalid results.

The -m option to the interpreter triggers the runpy module,
which takes the module name specified and executes it like a main
program. When a module is run that way, runpy is at
the top of the stack, so the “main” part of our local module is
several levels down from the top. That means the simple one-liner is
not always going to produce the right results.
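
The difference can be observed with a short experiment. The sketch
below is my own illustration, not part of Nova: it writes a
hypothetical module named stack_demo.py into a temporary directory,
runs it once as a script and once with -m, and returns what the
one-liner's stack lookup reports in each case.

```python
import os
import shutil
import subprocess
import sys
import tempfile

# Hypothetical module body: report the filename recorded in the
# outermost stack frame, the same value the one-liner reads.
MODULE_SOURCE = (
    "import inspect\n"
    "print(inspect.stack()[-1][1])\n"
)


def outermost_filenames():
    """Run the same file as a script and with -m; return both results."""
    workdir = tempfile.mkdtemp()
    try:
        with open(os.path.join(workdir, 'stack_demo.py'), 'w') as f:
            f.write(MODULE_SOURCE)

        def run(args):
            # Use the same interpreter that is running this sketch.
            output = subprocess.check_output(
                [sys.executable] + args, cwd=workdir)
            return output.decode().strip()

        as_script = run(['stack_demo.py'])
        as_module = run(['-m', 'stack_demo'])
    finally:
        shutil.rmtree(workdir)
    return as_script, as_module


if __name__ == '__main__':
    script_result, module_result = outermost_filenames()
    print('script: %s' % script_result)
    print('-m    : %s' % module_result)
```

In the script case the value names stack_demo.py, while in the -m case
it names runpy rather than the module that was asked for.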

Why the Obvious Solution Fails

Now that I knew there were ways to get the wrong results by looking at
the stack, the next question was whether there was another way to find
the program name that was simpler, more efficient, and especially more
correct. The simplest solution is to look at the command line
arguments passed through sys.argv.

argv.py

import sys
print(sys.argv[0])

Normally, the first element in sys.argv is the script that was run
as the main program. The value always points to the same file,
although the method of invoking it may cause the value to fluctuate
between a relative and full path.

When a script is run directly or passed as an argument to the
interpreter, sys.argv[0] contains the path to the script file as it was
given, which is often relative. Using -m we see the full path, so
looking at the command line arguments is more robust for that
case. However, we cannot depend on -m being used so we aren’t
guaranteed to get the extra details.
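
One way to smooth over the difference, offered here as an assumption
rather than anything Nova does, is to resolve sys.argv[0] to an
absolute path once, early in startup, before anything changes the
working directory. The helper name program_path is hypothetical.

```python
import os
import sys


def program_path(argv=None):
    """Return an absolute path for the program named in argv[0].

    A relative argv[0] is resolved against the current working
    directory, so call this before the process calls os.chdir().
    """
    if argv is None:
        argv = sys.argv
    return os.path.abspath(argv[0])


if __name__ == '__main__':
    print(program_path())
```

This only normalizes the path; it does nothing to protect against the
value having been changed before the helper is called.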

Using import

The next alternative I considered was probing the main program module
myself. Every module has a special property, __file__, which holds
the path to the file from which the module was loaded. To access the
main program module from within Python, you import a specially named
module __main__. To test this method, I created a main program
that loads another module:

import_main_app.py

import import_main_module
import_main_module.main()

And the second module imports __main__ and prints the file it was
loaded from.

import_main.py

import __main__

def main():
    print(__main__.__file__)

Looking at the __main__ module always pointed to the actual main
program module, but it did not always produce a full path. This makes
sense, because the filename for a module that goes into the stack
frame comes from the module itself.
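
Because the value is not always a full path, code that needs one has
to normalize it. The following is a minimal sketch of such a helper,
assuming it is called before the working directory changes; the name
main_program_path is hypothetical.

```python
import os
import __main__


def main_program_path():
    """Return an absolute path to the main program file, or None.

    __main__ has no __file__ attribute in an interactive session or
    under "python -c", so the attribute is looked up defensively.
    """
    filename = getattr(__main__, '__file__', None)
    if filename is None:
        return None
    return os.path.abspath(filename)


if __name__ == '__main__':
    print(main_program_path())
```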

Wandering Down the Garden Path

After I found such a simple way to reliably retrieve the program name,
I spent a while thinking about the motivation of the person who
decided that looking at stack frames was the best solution. I came up
with two hypotheses. First, it is entirely possible that they did not
know about importing __main__. It isn’t the sort of thing one
needs to do very often, and I don’t even remember where I learned
about doing it (or why, because I’m pretty sure I’ve never used the
feature in production code for any reason). That seems like the most
plausible reason, but the other idea I had was that for some reason it
was very important to have a relatively tamper-proof value –
something that could not be overwritten accidentally. This new idea
merited further investigation, so I worked back through the methods of
accessing the program name to determine which, if any, met the new
criteria.

I did not need to experiment with sys.argv to know it was
mutable. The arguments are saved in a normal list object, and
can be modified quite easily.

All normal list operations are supported, so replacing the program
name is a simple assignment statement. Because sys.argv is a list,
it is also susceptible to having values removed by pop(), remove(),
or a slice assignment gone awry.
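
A short sketch of that kind of modification follows. It is a
reconstruction under my own assumptions, not the original listing; the
function name clobber_program_name is hypothetical, and the original
list contents are restored so the demonstration has no lasting effect.

```python
import sys


def clobber_program_name(new_name):
    """Show that sys.argv[0] can be replaced by plain assignment.

    Returns the (before, after) values seen during the swap.
    """
    saved = list(sys.argv)
    if not sys.argv:
        # Embedded interpreters may leave argv empty; give it a slot.
        sys.argv.append('')
    before = sys.argv[0]
    try:
        sys.argv[0] = new_name      # a simple assignment is enough
        after = sys.argv[0]
    finally:
        sys.argv[:] = saved         # slice assignment restores in place
    return before, after


if __name__ == '__main__':
    before, after = clobber_program_name('wrong')
    print('Before: %s' % before)
    print('After : %s' % after)
```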

Overwriting __main__.__file__ is less likely to happen by accident, so
it seems somewhat safer. Nonetheless, changing it is easy.

$ python import_modify.py
Before: import_modify.py
After : wrong

That leaves the stack frame.

Down the Rabbit Hole

As described above, the return value of inspect.stack() is a
list of tuples. The list is computed each time the function is called,
so it was unlikely that one part of a program would accidentally
modify it. The key word there is accidentally, but even a malicious
program would have to go to a bit of effort to return fake stack data.

The filename actually appears in two places in the data returned by
inspect.stack(). The first location is in the tuple that is part
of the list returned as the stack itself. The second is in the code
object of the stack frame within that same tuple
(frame.f_code.co_filename).
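
The two locations can be read side by side with a few lines of code.
This is only an illustrative sketch, and the function name
filename_locations is hypothetical. The two values normally agree, but
they are stored separately, which is why a forgery has to deal with
both.

```python
import inspect


def filename_locations():
    """Return the outermost frame's filename from both places it lives."""
    record = inspect.stack()[-1]
    frame = record[0]
    from_record = record[1]                 # field in the stack record
    from_code = frame.f_code.co_filename    # attribute of the code object
    return from_record, from_code


if __name__ == '__main__':
    print('From stack: %s' % filename_locations()[0])
    print('From frame: %s' % filename_locations()[1])
```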

Replacing the filename in the tuple was relatively easy, and would be
sufficient for code that trusted the stack contents returned by
inspect.stack(). It turned out to be more challenging to change
the code object. For CPython, the code class is implemented
in C as part of the set of objects used internally by the interpreter,
and its attributes are exposed as read-only.
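
The obstacle is easy to see by attempting the assignment directly. In
this sketch (the names try_to_rename_code and sample are hypothetical)
the interpreter refuses to change co_filename on an existing code
object. The exception type has differed between major Python versions,
so both candidates are caught.

```python
def try_to_rename_code(func, new_filename):
    """Attempt to assign co_filename on a function's code object.

    Returns the exception raised by the interpreter, or None if the
    assignment unexpectedly succeeds.
    """
    try:
        func.__code__.co_filename = new_filename
    except (AttributeError, TypeError) as err:
        # CPython rejects the assignment as a read-only attribute.
        return err
    return None


def sample():
    return None


if __name__ == '__main__':
    print(repr(try_to_rename_code(sample, 'wrong')))
```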

Instead of changing the code object itself, I would have to replace it
with another object. The reference to the code object is accessed
through the frame object, so in order to insert my code object into
the stack frame I would need to modify the frame. Frame objects are
also immutable, however, so that meant creating a fake frame to
replace the original value. Unfortunately, it is not possible to
instantiate frame objects from within Python, so I ended up having
to create classes to mimic the originals.

I stole the idea of using namedtuple as a convenient way to
have a class with named attributes but no real methods from
inspect, which uses it to define a Traceback class.
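
A rough sketch of such stand-ins is shown below. The class names
FakeCode and FakeFrame, the choice of fields, and the helper
fake_outermost_record are all my own assumptions; they carry only the
attributes this experiment reads, not everything a real frame or code
object provides.

```python
import collections
import inspect

# Stand-ins carrying only the attributes this experiment reads.
FakeCode = collections.namedtuple('FakeCode', 'co_filename co_name')
FakeFrame = collections.namedtuple('FakeFrame', 'f_code f_lineno f_back')


def fake_outermost_record(new_filename):
    """Return a copy of the outermost stack record with a forged filename.

    Both copies of the filename are replaced: the one stored in the
    record itself and the one reachable through frame.f_code.
    """
    record = inspect.stack()[-1]
    real_frame = record[0]
    code = FakeCode(new_filename, real_frame.f_code.co_name)
    frame = FakeFrame(code, real_frame.f_lineno, real_frame.f_back)
    return (frame, new_filename) + tuple(record[2:])


if __name__ == '__main__':
    forged = fake_outermost_record('wrong')
    print('From stack: %s' % forged[1])
    print('From frame: %s' % forged[0].f_code.co_filename)
```

Note that inspect.isframe() returns False for the stand-in, so any
code that type-checks its frames will reject it.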

$ python stack_modify2.py
From stack: wrong
From frame: wrong

Replacing the frame and code objects worked well for accessing the
“code” object directly, but failed when I tried to use
inspect.getframeinfo() because there is an explicit type check
with a TypeError near the beginning of getframeinfo()
(the isframe() test in the listing below).

def getframeinfo(frame, context=1):
    """Get information about a frame or traceback object.

    A tuple of five things is returned: the filename, the line number of
    the current line, the function name, a list of lines of context from
    the source code, and the index of the current line within that list.
    The optional second argument specifies the number of lines of context
    to return, which are centered around the current line."""
    if istraceback(frame):
        lineno = frame.tb_lineno
        frame = frame.tb_frame
    else:
        lineno = frame.f_lineno
    if not isframe(frame):
        raise TypeError('{!r} is not a frame or traceback object'.format(frame))

    filename = getsourcefile(frame) or getfile(frame)
    if context > 0:
        start = lineno - 1 - context//2
        try:
            lines, lnum = findsource(frame)
        except IOError:
            lines = index = None
        else:
            start = max(start, 1)
            start = max(0, min(start, len(lines) - context))
            lines = lines[start:start+context]
            index = lineno - 1 - start
    else:
        lines = index = None

    return Traceback(filename, lineno, frame.f_code.co_name, lines, index)

The solution was to replace getframeinfo() with a version that
skips the check. Unfortunately, getframeinfo() uses
getfile(), which performs a similar check, so that function
needed to be replaced, too.

After reviewing inspect.py one more time to see if I needed to
replace any other functions, I realized that a better solution was
possible. The implementation of inspect.stack() is very small,
since it calls inspect.getouterframes() to actually build the
list of frames. The seed frame passed to getouterframes() comes
from sys._getframe().

def getouterframes(frame, context=1):
    """Get a list of records for a frame and all higher (calling) frames.

    Each record contains a frame object, filename, line number, function
    name, a list of lines of context, and index within the context."""
    framelist = []
    while frame:
        framelist.append((frame,) + getframeinfo(frame, context))
        frame = frame.f_back
    return framelist

If I modified getouterframes() instead of inspect.stack(),
then I could ensure that my fake frame information was inserted at the
beginning of the stack, and all of the rest of the inspect
functions would honor it.

Enough of That

At this point I have proven to myself that while it is unlikely that
anyone would bother to do it in a real program (and they would
certainly not do it by accident) it is possible to intercept the
introspection calls and insert bogus information to mislead a program
trying to discover information about itself. This implementation does
not work to subvert pdb, because it does not use inspect. Probably
because it predates inspect, pdb has its own implementation of a
stack-building function, which could be replaced using the same
technique as what was done above.

This investigation led me to several conclusions. First, I still don’t
know why the original code is looking at the stack to discover the
program name. I should ask on the OpenStack mailing list, but in the
meantime I had fun experimenting while researching the question.
Second, given that looking at __main__.__file__ produces a value
at least as correct as looking at the stack in all cases, and more
correct when a program is launched using the -m flag, it seems
like the solution with the best combination of reliability and
simplicity. A patch may be in order. And finally, monkey-patching can
drive you to excesses, madness, or both.