IncPy: Automatic memoization for Python

What is IncPy?

IncPy (Incremental Python) is an enhanced Python
interpreter that speeds up script execution times by automatically
memoizing (caching) the results of long-running function calls and then
re-using those results, rather than re-computing them, when it is safe to do so.

When you first run your script with the IncPy interpreter, it might be ~20% slower
since IncPy needs to determine which functions are safe and worthwhile
to memoize, what their dependencies are, and then memoize their results
to a persistent cache. After you make some edits to your script,
subsequent runs can be much faster since IncPy can skip calls to
functions whose dependencies are still satisfied and load their results
directly from the cache. That way, IncPy can greatly speed up your
iteration and debugging cycles.

IncPy is designed to be a drop-in replacement for the Python 2.6
interpreter, so it should work seamlessly with all of your existing
scripts and 3rd-party libraries. You don't need to learn any new
language features or programming idioms to get its benefits.

How can IncPy be useful for me?

If you've written Python scripts that run for at least a few minutes,
then you've probably encountered the following dilemma:

Simple code, but slow runs: If you keep your code relatively
simple, then your script takes unnecessarily long to re-execute after
you make minor edits, since the Python interpreter re-runs your
entire script, even those parts that were not affected by your
edits.

Complicated code + temp data files, but faster runs: If you
write extra caching code to save intermediate results to disk (and later
load them from disk), then subsequent runs of your script can be much
faster. However, now your code is more complicated, and you need to
manage those temporary data files that your script has generated.

As your project progresses, you might end up writing a collection of
ad-hoc scripts, each reading some input files, munging the data, and
writing intermediate results out to temporary files that other scripts
then read and munge.

For instance, the above diagram shows the Makefile that I created during
a summer internship in which I wrote dozens of Python scripts to munge
software bug database and employee personnel data. Each rectangle
represents a Python script, each ellipse represents a data file (shaded
ellipses represent the final results of my analyses), and each arrow
shows a script either reading from or writing to a data file. To speed
up execution times, I re-factored my scripts to load and save several
layers of intermediate datasets (white ellipses), so that I could tweak
portions of my analyses and not have to wait for the entire workflow to
re-execute. As a consequence, my code got more bloated, and I also had
to keep track of over a dozen intermediate data files. I realized from
this experience that an enhanced Python interpreter could automatically
do all of this caching and dependency management, so that's when I set
out to create IncPy.

By running your scripts with IncPy rather than the regular Python
interpreter, you can keep your code simple while still getting the
benefits of faster execution times. In particular:

You can potentially edit and debug your code, re-execute it, and see
new results in a few seconds rather than waiting for minutes (or even
hours)

You no longer need to manually track dependencies between
intermediate datasets and the code and data that created them. No more
asking yourself, "Oh wait, is that dataset outdated now that I've
updated this other one? Which script should I run to re-create this
dataset from that one?"

How does IncPy differ from other approaches to memoization?

IncPy is fully automatic, so you don't need to figure out
which functions are safe and worthwhile to memoize.

IncPy provides an on-disk persistent cache, which can be
shared across executions of the same script or related scripts that call
shared functions.

IncPy guarantees correctness by tracking dependencies between
cached results and the code and data that produced them and clearing
cache entries when dependencies are broken, so you don't need to manage
the intermediate datasets.

IncPy is designed for a general-purpose imperative programming
language, with an emphasis on having a low run-time overhead,
in contrast to related work on automatic memoization for functional and
domain-specific languages.

Can you show me a quick demo?

Sure, this 6-minute screencast demonstrates some of IncPy's basic capabilities:

Can you show me a small code example?

Here's an example data analysis script and a graphical representation
of data flow through its functions:

The inputs to this script are 10 text files containing database
queries (named queries.0.txt through queries.9.txt),
and its outputs are corresponding files named output.0.txt
through output.9.txt. The initial run of the script takes 30
minutes (1 minute for each function call x 3 calls per input file x 10
files).
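The script itself isn't shown on this page; the following is a hypothetical reconstruction (the function bodies, the parsing logic, and the meaning of MULTIPLIER are all my guesses) that matches the data flow described above:

```python
MULTIPLIER = 2.0  # hypothetical global constant read by the stats functions

def processQueries(input_filename):
    # parse one queries.N.txt file into a list of records
    # (each of the three stages takes ~1 minute in the real script)
    with open(input_filename) as f:
        return [line.strip() for line in f if line.strip()]

def calculateStats(records):
    # compute a summary statistic that depends on the global MULTIPLIER
    return MULTIPLIER * sum(len(r) for r in records)

def transformAndOutputStats(stat, output_filename):
    # transform the statistic and write the corresponding output.N.txt
    with open(output_filename, 'w') as f:
        f.write('result: %.2f\n' % stat)

if __name__ == '__main__':
    for i in range(10):
        records = processQueries('queries.%d.txt' % i)
        stat = calculateStats(records)
        transformAndOutputStats(stat, 'output.%d.txt' % i)
```

Because each stage is a separate function, IncPy can memoize and skip each one independently, which produces exactly the partial re-execution behaviors listed below.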

During the initial run, IncPy automatically memoizes the arguments,
return values, and dependencies for all invocations of the 3 functions.
With the cache now populated, subsequent runs can be much faster than
the original 30 minutes. For instance:

If the script remains the same and no input files change, then a
subsequent run will terminate instantly.

If you manually edit any individual input file, then the only code
that needs to re-run is the code that generates its corresponding output
file (this is similar to how Makefiles work). For example, if you just edit
queries.5.txt, then the script will only run for 3 minutes
processing it to create a new output.5.txt, but no other input
files get re-processed.

If you edit the code of transformAndOutputStats to adjust
the transformations, then only it needs to re-run, but
processQueries and calculateStats can be skipped. A
subsequent run will take 10 minutes (1 minute per file x 10 files).

If you change the value of MULTIPLIER, then only
calculateStats and transformAndOutputStats need to be
re-run, but processQueries can be skipped. A subsequent run
will take 20 minutes.

If you add new input (query) files, then the only code that needs
to run is the code that processes those new files; none of the existing
input files get re-processed. A subsequent run will take 3 minutes for
each additional input file.

For my research, I'm actively looking for new users to evaluate the
effectiveness of IncPy, so I'd be happy to create a custom installation
for your machine and to provide technical support. Feel free to email
me, Philip Guo, at:

How can I download and install IncPy?

IncPy is a modified version of the Python 2.6.3 interpreter. I've
successfully installed IncPy on Mac OS X (10.4 and 10.6) and Linux
(Ubuntu 8.04 LTS). Unfortunately, it might not work on Windows since
it makes some POSIX system calls (but I haven't tried yet, so it might
actually work). I want to make it
easy for people to start using IncPy, but I haven't yet had time to
create reliable one-click installers for all supported operating
systems.

Compiling IncPy from source code

To get the most recent version of IncPy, you must download and
compile its source code. If you don't want to go through this hassle,
please send me an email at:

and I will try my best to compile a custom version for your
computer and to guide you through the setup process.

The configure step creates a new incpy.config configuration
file in your home directory (if one doesn't already exist). You can use
that file to customize IncPy's functionality.

Dependencies

Mac OS X: If you install the 'Xcode developer tools' and 'X11'
packages from your installation DVD, then you should have all of the
software required to compile IncPy. It's also a good idea to install the GNU
readline library before compiling IncPy, so that your Python
interactive prompt behaves more pleasantly.

Linux: The software needed to compile IncPy might already be
installed, but in case it's not, here are some useful packages to
install (these names are for Debian-based distros, but it should be easy
to look up the corresponding names in other package management
systems):

sudo apt-get install libc6-dev g++ gcc libreadline-dev

It's normal for warning messages like this one to appear when you're
compiling Python:

Failed to find the necessary bits to build these modules:
_bsddb bsddb185 dbm
dl gdbm imageop
sunaudiodev
To find the necessary bits, look in setup.py in detect_modules() for the module's name.

It just means that certain Python modules cannot be compiled for your
machine, but as long as you see an executable named python (or
python.exe on Mac OS X) in the IncPy directory, the build was
successful.

Running IncPy for the first time

After a successful compile, there should be an executable named
python (or python.exe on Mac OS X) in the IncPy
directory. When you execute that program, you should see an interactive
Python prompt like the following:

Working with 3rd-party libraries

IncPy is designed to work seamlessly with all 3rd-party libraries,
extensions, and tools (e.g., NumPy, SciPy, matplotlib, IPython), as long as they are
compatible with Python 2.6. You shouldn't need to re-compile any
libraries or extension code.

All you need to do is to set the PYTHONPATH
environment variable so that IncPy knows where your libraries and
extensions are installed (alternatively, you can prepend the path onto
the sys.path
variable from within your Python script).

You can install 3rd-party libraries in a variety of ways, but if
you're affiliated with a university, I highly recommend downloading a free
academic version of the Enthought
Python Distribution. It's a fantastic one-click installer
containing Python 2.6 and over 75 useful libraries.

After installing the Enthought Python Distribution on my Mac OS X
10.6 computer, I can give IncPy access to all of its installed libraries
by setting PYTHONPATH to the appropriate location and then
starting up IncPy:
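The exact commands were not shown here; a sketch of what this looks like (the Enthought install path and the IncPy checkout location are assumptions about your particular setup):

```shell
# hypothetical paths -- adjust to wherever EPD installed its site-packages
# and wherever you compiled IncPy
export PYTHONPATH=/Library/Frameworks/Python.framework/Versions/Current/lib/python2.6/site-packages
~/IncPy/python.exe myscript.py
```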

If you're on a 64-bit machine and want to compile a 32-bit x86 IncPy
binary (e.g., to interoperate with already-installed 32-bit 3rd-party
libraries), you can run this modified configure command before
compiling:
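The modified command isn't shown here; one common way to force a 32-bit build (an assumption on my part, using standard gcc flags; consult your toolchain's documentation for the equivalent on your system) is:

```shell
# force 32-bit compilation and linking before building IncPy
CC="gcc -m32" LDFLAGS="-m32" ./configure
make
```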

Please let me know if you have troubles getting 3rd-party libraries
working with IncPy.

How do I get started using IncPy?

There's no new user interface to learn! Just run the Python
executable that you compiled in the IncPy/ sub-directory.
IncPy should behave like a regular Python interpreter, except that it
will automatically memoize the results of long-running functions to disk
and re-use those results in subsequent runs rather than re-computing
them.

Toy example

For example, suppose you wrote the following toy script called
analysis.py:
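The toy script isn't shown on this page; here is a hypothetical reconstruction of analysis.py (the input file format and parsing logic are my guesses):

```python
def mungeFile(filename):
    # parse each "name score" line into a dictionary entry;
    # in the real toy example this takes about 1 minute to run
    result = {}
    for line in open(filename):
        fields = line.split()
        if len(fields) == 2:
            result[fields[0]] = int(fields[1])
    return result

if __name__ == '__main__':
    data = mungeFile('students.txt')
    print(len(data))
```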

When you run this script for the first time, you must wait for 1
minute for the mungeFile function to finish processing
students.txt and return a dictionary to its caller. At that
time, IncPy memoizes the argument and return value of mungeFile
to disk, storing them in a sub-directory called
incpy-cache/. IncPy also creates a log file called
incpy.log in your current directory, with the following
contents:

The only event in the log is that the mungeFile function ran for 1 minute
(60,000 milliseconds) and had its results memoized. (I'll explain what
the TIME_LIMIT and IGNORE fields are in the next
section.)

When you run this same script again, it will terminate almost
instantly since IncPy can skip the call to mungeFile
and load its results directly from incpy-cache/. The
incpy.log for this run looks like:

This log shows that mungeFile was skipped and that it took 0
milliseconds to look up and retrieve its results from the cache. If
this function had returned a larger data structure, then the look-up
time would likely be higher (but you still get a performance improvement
as long as the look-up time doesn't exceed the original running
time).

At this point, you can add additional code after the call to
mungeFile (e.g., for plotting the data or doing further
analysis), and that code will get to run immediately rather than having
to wait for 1 minute for mungeFile to re-execute every time you
run the script.

Ok, let's say that you modify the file students.txt and then
re-run the script. Unfortunately, IncPy must now delete the memoized
results for mungeFile, since they depend on the original
contents of students.txt and are probably now incorrect since
students.txt has changed. Thus, this run will again take a
full minute, and its incpy.log will look like this:

If you want to manually clear the entire cache, you can simply delete
the incpy-cache/ sub-directory. (I have yet to add
support for clearing individual cache entries.)

The incpy.log in your current working directory only
contains information about the most recent run. In addition, IncPy
combines the log files from all runs into a single
incpy.aggregate.log file in your home directory.

The incpy.config configuration file

IncPy looks for a configuration file in your home directory named
incpy.config (and won't run if it doesn't find one). Here are
the options you can specify in that file:

time_limit = <time in seconds>

Specifies the minimum amount of time that a function must take to
execute before it is eligible for memoization. The default time limit
is 1 second (as seen in the incpy.log files for the toy
example above). The intuition here is that it's only worthwhile to
memoize functions that take a (relatively) long time to execute;
subsequent script runs would probably slow down if all
functions were memoized.

ignore = <absolute path>

Specifies the absolute path to a directory or Python
file
containing code that should be ignored for the purposes of memoization
and dependency tracking. One effective way to lower IncPy's run-time
overhead is to ignore code from libraries. It's safe to do so if you
trust that the library code will not change and is pure with respect
to your scripts (usually reasonable assumptions).

Note that if you specify a directory to ignore, then the code in all
files in that directory and in all of its sub-directories is ignored.

Annotating functions

For finer-grained control over memoization, IncPy also allows you to
annotate individual functions by specifying options in their docstrings (a string
literal that appears at the beginning of the function body). Here are
the annotations it currently supports:

incpy.memoize

To force IncPy to always memoize calls to a particular
function, put the string incpy.memoize in its docstring. For
example, we can implement the cliched Fibonacci sequence example using
this annotation:
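The code example isn't shown on this page; it presumably looked something like this sketch:

```python
def fib(n):
    '''incpy.memoize'''
    # each call returns in microseconds, so IncPy would normally skip
    # memoizing it; the docstring annotation forces memoization anyway
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)
```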

Normally, the fib function will not be memoized, since each
call takes far less than 1 second, and IncPy only memoizes calls that
take a macroscopic amount of time to complete (at least
time_limit seconds, as specified in incpy.config).
However, the incpy.memoize annotation forces it to be memoized.
Note that even impure functions bearing this annotation will be
memoized.

incpy.ignore

In addition to ignoring entire files or directories (by specifying
ignore lines in incpy.config), you can ignore
individual functions by including the string incpy.ignore in
their docstrings. For example:
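The example isn't shown on this page; here is a sketch of the kind of function you might ignore (the function itself is hypothetical):

```python
def load_huge_matrix(filename):
    '''incpy.ignore'''
    # returns a huge data structure that would take too long to pickle
    # and too much disk space to cache, so we tell IncPy to ignore it
    return [[float(x) for x in line.split()] for line in open(filename)]
```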

IncPy will not track dependencies or purity in ignored functions and
will never memoize their return values.

One situation where you might want to ignore a function is if its return
value is some huge data structure that's not worth memoizing (since it
takes too long to memoize and also takes up too much disk space).

incpy.no_output

Sometimes your long-running functions will print out a bunch of
debugging or 'progress bar'-style output to stdout or stderr, but you
really don't want to capture all of that output in your cache; all you
care about is their return values. If you annotate your functions with
incpy.no_output, then IncPy won't track stdout/stderr
contents:

def process_file():
    '''incpy.no_output'''
    total = 0
    for line in open('data.txt'):
        print line # debugging output that you don't care to memoize
        total += parse_and_analyze(line)
    return total

The first time you run this function, all lines in the
data.txt input file will be printed to stdout (presumably for
debugging or to track progress through the file). But when you re-run
this function, the stdout output will not be 'replayed' (in fact, it
was never saved to the cache); only the return value will be retrieved
from the cache.

How does IncPy work?

This section provides a brief (and somewhat-simplified) overview of
how IncPy works. The input to IncPy is a Python script (and optional
customizations in incpy.config), and its outputs are the
results of running that script, memoized data in the
incpy-cache/ sub-directory, and log files (incpy.log
and incpy.aggregate.log).

What functions are worthwhile to memoize?

While your script is executing, IncPy records how much time each
function invocation takes. Whenever a function takes longer than
time_limit to run (1 second by default), IncPy will attempt to
memoize its results. Thus, the vast majority of function invocations
will not be memoized, since it's probably faster just to
re-execute them rather than to save and load their results from the
cache.

In rare circumstances, it might take longer to memoize a function
invocation than to simply re-run it (e.g., if the function returns a
large object that takes a long time to pickle and save to disk). If
that occurs, then the function invocation will not be memoized; IncPy
will instead log a warning message to incpy.log, so that you
can choose to have IncPy ignore that function in the future.

What functions are safe to memoize?

IncPy will only attempt to memoize function invocations that are
side effect free and deterministic.

A function invocation is side effect free (a.k.a. pure) if it
and all functions it calls never mutate a value that existed prior to
its invocation (e.g., global variables and parameter contents). IncPy can
automatically detect when a function violates this condition and mark it
as impure and thus ineligible for memoization.
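The code example isn't shown on this page; the impure function discussed next presumably resembled this sketch:

```python
def munge_and_mutate(lst):
    result = sum(lst)
    lst.append(result)  # mutates an argument that existed before the call
    return result
```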

In addition to returning a result, this function also mutates its
input parameter lst. If IncPy were to skip this function and
simply load its return value from the cache, then the mutation wouldn't
be properly replayed. Thus, as soon as IncPy executes the
append call, it marks munge_and_mutate as impure and
ineligible for memoization because it mutated a value (lst)
that existed prior to its invocation.

An example of a non-deterministic function is one that queries a
random number generator or the system clock. If such a function were
skipped and its return value loaded from the cache, it would probably be
incorrect. It's difficult to automatically detect non-determinism, so
IncPy must be given a list of functions known to be non-deterministic.
Currently I hard-code a list containing a few standard library
functions; in the future, I plan to make that list customizable via
incpy.config.

The one kind of non-deterministic function that IncPy can
automatically detect is one that opens stdin; IncPy marks all
functions on the stack as impure when stdin is opened, since
user input is definitely non-deterministic.

What data is stored in the cache?

Each function gets its own cache (currently implemented as a
sub-directory of pickle
files within the incpy-cache/ directory). While a
function is executing, IncPy records what it (and all functions it
calls) prints to stdout and stderr. When it finishes executing, if it
is still eligible for memoization, IncPy will save the following data in
a new cache entry, stored as a pickle file:

Function name and enclosing module filename (unique identifier)

Argument values (input)

Names and values of global variables read by this invocation (input)

Return value (output)

stdout contents (output)

stderr contents (output)

Final seek offsets of all files read (output)

Last modification times of all files read (dependency)

Last modification times of all files written (dependency)

Bytecode of this function and all of its callees (dependency)
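The exact on-disk layout is an IncPy implementation detail, but conceptually each cache entry is a pickled record bundling the inputs, outputs, and dependencies listed above. This hypothetical sketch (all field names and values are invented for illustration) shows the idea:

```python
import pickle

# invented field names mirroring the list above; not IncPy's real format
entry = {
    'func_name': 'mungeFile',                 # unique identifier ...
    'module_file': '/home/me/analysis.py',    # ... with enclosing module
    'args': ('students.txt',),                # input
    'globals_read': {},                       # input
    'retval': {'alice': 90, 'bob': 85},       # output
    'stdout': '',                             # output
    'stderr': '',                             # output
    'read_file_seek_offsets': {'students.txt': 18},       # output
    'read_file_mtimes': {'students.txt': 1271638400.0},   # dependency
    'written_file_mtimes': {},                            # dependency
    'code_dependencies': {'mungeFile': '...bytecode...'}, # dependency
}

serialized = pickle.dumps(entry)       # saved as a pickle file on disk
restored = pickle.loads(serialized)    # loaded back on a future run
```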

How is the cache used to speed up future executions?

In a future call to that same function (either in the same script
execution or during a future execution), if IncPy finds an entry in the
cache that matches the values of arguments and global variables, then it will skip the call,
print out the saved stdout and stderr contents, advance the seek offsets of
all read files to their final locations, and return the saved return
value to its caller. This perfectly emulates the original function
invocation, except that it runs much faster. However, if IncPy cannot
find a cache entry that matches the argument and global variable values, then it will simply
execute the function normally (and create a new cache entry upon
completion).

How are cache entries automatically deleted?

IncPy automatically deletes a cache entry when one of its
dependencies gets broken, because the stored data is likely incorrect.
If a file that the function has read or written has changed (indicated
by modification time), then the cache entry will be deleted. If the
bytecode for that function or any of its callees has changed, then
all cache entries for that function are deleted.

If you want to manually clear the entire cache (like a 'make clean'),
you can simply delete the incpy-cache/ sub-directory.
(I have yet to add support for clearing individual cache entries,
though.)

Trusting previously-cached results

If you invoke IncPy with the -T option, then it will
never delete a function's cache, even when its dependencies have
been broken (it will instead issue a warning to stderr and to
incpy.log).

This "trust previously-cached results" mode is useful when you know
that the code changes you just made should not affect the
previously-cached results. For example, say you're writing a script to
sequentially process N records in a dataset. Your script runs fine
until it crashes on a record i somewhere in the middle of your
dataset, since that record contains data that your script doesn't
properly handle. With IncPy, the results from processing records 1
through (i - 1) have been memoized to disk, so if you re-run your
script, it can just re-use those results. But since your script
actually crashed, you will definitely modify it before re-running (to
fix the bug). However, once your code has changed, IncPy must
invalidate the cache entries for processing records 1 through (i
- 1), since those results might no longer be valid. Thus, your script
must start running again from record 1, which gives you no time
savings. Using the -T option, though, IncPy simply trusts the
previously-cached results, which lets your script skip the first
(i - 1) records and resume processing at record i.

When you're first writing a new ad-hoc data processing script, it
will likely crash at least a few times on records somewhere in the
middle of your dataset due to quirks in the data format (sometimes after
running fine for minutes or even hours). With this option, you can fix
bugs and resume processing at the first failed record rather than always
back at the beginning, which can eliminate lots of waiting time.

What are some of IncPy's limitations?

User-defined classes need to override == with something
other than the default pointer equality test

IncPy makes extensive use of the Python == operator for
comparing memoized argument and global variable values. If you want
user-defined classes to work well with IncPy, then make sure each
contains a valid __eq__ or __cmp__ method based on
something other than pointer equality.
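For example, a user-defined class might implement value-based equality like this (a generic sketch, not IncPy-specific code):

```python
class Student(object):
    def __init__(self, name, score):
        self.name = name
        self.score = score

    def __eq__(self, other):
        # value-based equality, so IncPy can match this object
        # against previously-memoized argument values
        return (isinstance(other, Student) and
                self.name == other.name and
                self.score == other.score)

    def __ne__(self, other):  # Python 2.x doesn't derive != from ==
        return not self.__eq__(other)
```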

This isn't much of a limitation in practice, though. Overriding
== is good programming style anyway, and even if you don't do
it, then IncPy will still work fine but simply miss some opportunities
for re-using memoized results.

Unpicklable data cannot be memoized

In reality, though, pretty much all Python data types that you might
care to memoize can be pickled. One way to get around this limitation
if it arises is by creating picklable proxy objects and then writing
code to convert between the proxy and real objects. For convenience,
IncPy automatically creates proxies for file handles, function objects, and sqlite cursor
objects.

Cannot track dependencies or purity within non-Python extension code

IncPy cannot track dependencies in functions implemented in other
languages (e.g., C or Fortran). Also, it cannot determine whether these
functions are pure. These limitations are shared by any analysis that
works purely on Python code. Fortunately, lots of non-Python extension
functions (e.g., those in math libraries) are pure and have no external
dependencies (library code is often pure and self-contained, or else
it would be awkward to use).

The only practical way around this limitation is to annotate
functions to indicate which arguments and global variables they mutate.
I've started annotating some standard library functions, and in the
future I plan to allow the user to make annotations in
incpy.config.

IncPy only works on Python code, so if you launch sub-programs
written in other languages, then IncPy cannot track what happens within
those programs. However, if those sub-programs are written in Python
(e.g., using the multiprocessing module)
and run under IncPy, then of course it's possible to track what happens
within them. (On certain operating systems, one could imagine
augmenting IncPy with a utility like strace or DTrace to determine which
files are read/written by spawned sub-programs originating from any
programming language.)

How can I help you with your research?

Thanks for being so considerate; I thought you'd never ask!

IncPy is an active research project, so I'm currently looking for
users to try it out and to give me feedback, complaints, and feature
requests via email at:

If you end up using IncPy regularly in your work, the only piece of
data that I'd like from you is the incpy.aggregate.log
file in your home directory. This is a plain-text file that indicates
how much time you saved by using IncPy rather than re-running your
entire script after every edit. IncPy does not collect or transmit any
information about you, your scripts, or your datasets.

Appendix: Grubby technical FAQ

Why didn't IncPy do what I expected it to do?

The first file you should investigate is the incpy.log file
that IncPy creates in the current working directory after every
invocation. The exact format of that log file is still in flux, but it
should provide some insights into what IncPy did during its most recent
invocation. Please feel free to email me at:

Oftentimes library code performs actions that are technically impure
(e.g., mutating a global variable) but are actually pure from the
perspective of your script. For example, the built-in Python regular
expression library keeps an internal cache of already-compiled regexps
and mutates that cache whenever a new regexp is compiled; however, the
act of compiling a regexp is a conceptually-pure operation.

To ignore all impure actions in library code so that your functions
can be memoized, add the absolute path of the library's file(s) or
directories to your incpy.config file as an 'ignore' option,
like so:
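For instance (the path below is hypothetical; substitute the library's location on your machine):

ignore = /usr/lib/python2.6/re.py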

What should I do if IncPy can't memoize one of my function's arguments?

If an object doesn't properly override == or is
unpicklable, then IncPy cannot memoize it (see the limitations sub-section). If the object belongs
to a class you defined, then you can simply augment the class to fix
this problem. However, if the object is from an extension library
(e.g., CvMat objects in OpenCV), then you can't
easily override == or make it picklable. One
easy (but kludgy) workaround is to convert your object into one that can
be memoized, pass that 'proxy object' as the function's argument, and
then inside the function, convert it back into the original type (e.g.,
one user had to convert an OpenCV CvMat into a Python list,
pass it into his function, and then convert the list back into a
CvMat). Although this process is inefficient, it's worthwhile
if the memoization benefits outweigh the conversion times.
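Here is a sketch of that workaround in plain Python, with a made-up Matrix class standing in for an unpicklable extension type like CvMat:

```python
class Matrix(object):
    # stand-in for an extension type that can't be pickled or
    # compared with == (e.g., OpenCV's CvMat)
    def __init__(self, rows):
        self.rows = rows

def matrix_to_list(m):
    # convert to a picklable 'proxy object' that IncPy can memoize
    return [list(row) for row in m.rows]

def analyze(rows):
    # memoizable: its argument is a plain (picklable, ==-comparable) list
    m = Matrix([list(row) for row in rows])  # rebuild the real object inside
    return sum(sum(row) for row in m.rows)

result = analyze(matrix_to_list(Matrix([[1, 2], [3, 4]])))
```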

IncPy seems to be taking a long time to memoize certain functions or to load their results from disk

IncPy uses the Python pickle mechanism
to serialize/deserialize objects so that they can be stored on-disk.
When IncPy memoizes a function, it must pickle its arguments, return
value, and values of global variables that it has read. In general,
the larger and more complex these objects are, the longer they will
take to pickle (and unpickle), not to mention the more disk space
they will use. Also, large and complex objects are more likely to
not even be picklable, which forces IncPy to give up on trying to
memoize the enclosing functions!

Thus, for optimal performance, I recommend refactoring your code so
that the minimum amount of data needs to be pickled. For example, in
this sub-optimal code snippet ...
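The snippet isn't shown on this page; here is a hypothetical reconstruction of the sub-optimal pattern being described:

```python
def getImageHistogram(imgBytes):
    # runs for ~10 seconds on a real image; its huge imgBytes argument
    # must be pickled into the cache entry when IncPy memoizes the call
    hist = [0] * 256
    for b in bytearray(imgBytes):
        hist[b] += 1
    return hist

if __name__ == '__main__':
    imgBytes = open('photo.jpg', 'rb').read()  # possibly tens of MB
    hist = getImageHistogram(imgBytes)
```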

The call to getImageHistogram should be memoized, since it
ran for 10 seconds. However, its argument (imgBytes) could be
quite large since it represents the binary data of an entire image
(perhaps it could be 10MB or even 100MB in size). Thus, the memo table
entry would be at least the size of imgBytes, which can really
slow down IncPy.

Instead, we can refactor the above code to wrap the desired
functionality in an additional function that takes a small argument
(such as a filename) rather than the huge data itself ...
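Continuing the hypothetical reconstruction, the wrapper opens the file inside the memoized function, so only the short filename string gets pickled as the argument, while the image file itself is tracked as a modification-time dependency:

```python
def getImageHistogramFromFile(filename):
    # only the short filename string is pickled as the argument;
    # the image file becomes a modification-time dependency instead
    imgBytes = open(filename, 'rb').read()
    hist = [0] * 256
    for b in bytearray(imgBytes):
        hist[b] += 1
    return hist
```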

What if IncPy misbehaves when I use it with NumPy?

Then it's likely to be a Unicode
problem (NumPy using a different byte size for Unicode
characters than IncPy); I've been told that configuring IncPy with the
following option and then re-compiling solves the problem: