Thursday, July 7, 2011

Realtime image processing in Python

Image processing is notoriously a CPU intensive task. To do it in realtime,
you need to implement your algorithm in a fast language, hence trying to do it
in Python is foolish: Python is clearly not fast enough for this task. Is it?
:-)

Actually, it turns out that the PyPy JIT compiler produces code which is fast
enough to do realtime video processing using two simple algorithms implemented
by Håkan Ardö.

sobel.py implements a classical way of locating edges in images, the
Sobel operator. It is an approximation of the magnitude of the image
gradient. The processing time is spent on two convolutions between the
image and 3x3 kernels.

magnify.py implements a pixel coordinate transformation that rearranges
the pixels in the image to form a magnifying effect in the center.
It consists of a single loop over the pixels in the output image, copying
pixels from the input image.
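
To make the description of the Sobel operator concrete, here is a minimal pure-Python sketch of the idea. It is not the code from sobel.py: the function name `sobel_magnitude` and the choice to represent the image as a list of rows of grey values are assumptions made for illustration, and border pixels are simply left at zero.

```python
import math

# The two 3x3 Sobel kernels: horizontal and vertical derivative.
KX = [[-1, 0, 1],
      [-2, 0, 2],
      [-1, 0, 1]]
KY = [[-1, -2, -1],
      [0, 0, 0],
      [1, 2, 1]]


def sobel_magnitude(img):
    """Approximate the gradient magnitude of a 2D list of grey values.

    Hypothetical helper for illustration; border pixels stay at 0.0.
    """
    height, width = len(img), len(img[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0.0
            # Apply both kernels to the 3x3 neighbourhood in one pass.
            # The kernels are not flipped, which only changes the sign
            # of gx and gy and therefore not the magnitude.
            for j in range(3):
                for i in range(3):
                    p = img[y + j - 1][x + i - 1]
                    gx += KX[j][i] * p
                    gy += KY[j][i] * p
            out[y][x] = math.sqrt(gx * gx + gy * gy)
    return out
```

Nested loops over individual pixels like these are exactly the kind of code that is very slow on an interpreter without a JIT.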

You can try it yourself by downloading the appropriate demo:

There is only a single implementation of the algorithm in
magnify.py. The two different interpolation methods are implemented by
subclassing the class used to represent images and embedding the
interpolation within the pixel access method. PyPy is able to achieve good
performance with this kind of abstraction because it can inline
the pixel access method and specialize the implementation of the algorithm.
In C++ that kind of pixel access method would be virtual and you would need to use
templates to get the same effect without incurring runtime overhead.
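
The design described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual magnify.py: the class names `Image` and `BilinearImage`, the `_pixel` helper, the `zoom` parameter and the circular lens shape are all hypothetical. The point is that `magnify` is written once and the interpolation is chosen by the class of the source image.

```python
import math


class Image(object):
    """Greyscale image backed by a flat list; reads use the nearest pixel."""

    def __init__(self, width, height, data=None):
        self.width = width
        self.height = height
        self.data = list(data) if data is not None else [0.0] * (width * height)

    def _pixel(self, x, y):
        # Integer access, with coordinates clamped to the image borders.
        x = min(max(x, 0), self.width - 1)
        y = min(max(y, 0), self.height - 1)
        return self.data[y * self.width + x]

    def __getitem__(self, pos):
        x, y = pos
        # Nearest-neighbour lookup: round to the closest pixel.
        return self._pixel(int(math.floor(x + 0.5)), int(math.floor(y + 0.5)))

    def __setitem__(self, pos, value):
        x, y = pos
        self.data[y * self.width + x] = value


class BilinearImage(Image):
    """Same storage, but reads blend the four surrounding pixels."""

    def __getitem__(self, pos):
        x, y = pos
        x0, y0 = int(math.floor(x)), int(math.floor(y))
        fx, fy = x - x0, y - y0
        top = self._pixel(x0, y0) * (1 - fx) + self._pixel(x0 + 1, y0) * fx
        bottom = (self._pixel(x0, y0 + 1) * (1 - fx)
                  + self._pixel(x0 + 1, y0 + 1) * fx)
        return top * (1 - fy) + bottom * fy


def magnify(src, dst, zoom=2.0):
    """Single loop over the output pixels, copying from the input image."""
    cx, cy = src.width / 2.0, src.height / 2.0
    radius = min(cx, cy)
    for y in range(dst.height):
        for x in range(dst.width):
            dx, dy = x - cx, y - cy
            if dx * dx + dy * dy < radius * radius:
                # Inside the lens: sample closer to the centre.
                dst[x, y] = src[cx + dx / zoom, cy + dy / zoom]
            else:
                dst[x, y] = src[x, y]
    return dst
```

Calling `magnify(Image(...), out)` or `magnify(BilinearImage(...), out)` runs the same loop with a different `__getitem__`, which is the call a tracing JIT can inline and specialize.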

The video above shows PyPy and CPython running sobel.py side by
side (PyPy taking input from the webcam, CPython from the test
file). Alternatively, to get a feeling for how much faster PyPy is than
CPython, try to run the demo with the latter. These are the average fps
(frames per second) that I get on my machine (Ubuntu 64 bit, Intel i7 920, 4GB
RAM) when processing the default test.avi video and using the prebuilt
PyPy binary found in the full tarball linked above. For sobel.py:

PyPy: ~47.23 fps

CPython: ~0.08 fps

For magnify.py:

PyPy: ~26.92 fps

CPython: ~1.78 fps

This means that on sobel.py, PyPy is 590 times faster. On
magnify.py the difference is much less evident and the speedup is "only"
15x.
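
The speedup factors are simply the ratio of the two frame rates. A small sketch that recomputes them from the figures quoted above (the `fps` dictionary and the `speedup` helper are names introduced here for illustration only):

```python
# Average frames per second as reported in the post.
fps = {
    "sobel.py": {"pypy": 47.23, "cpython": 0.08},
    "magnify.py": {"pypy": 26.92, "cpython": 1.78},
}


def speedup(name):
    """Ratio of PyPy fps to CPython fps for one demo."""
    return fps[name]["pypy"] / fps[name]["cpython"]


for name in sorted(fps):
    print("%s: %.1fx" % (name, speedup(name)))
```

Note that the CPython figure for sobel.py is quoted to only two decimal places, so the 590x ratio is quite sensitive to rounding of that number.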
It must be noted that this is an extreme example of what PyPy can do. In
particular, you cannot expect (yet :-)) PyPy to be fast enough to run an
arbitrary video processing algorithm in real time, but the demo still proves
that PyPy has the potential to get there.

I saw this demo recently when Dan Roberts presented at Baypiggies. We broke into spontaneous applause when the pypy runtime ran at a watchable speed after cpython ran at less than 1 frame/second. Very impressive!

The only change I'd like to see in this project is its name... Trying to gather news from twitter, for example, makes me search amongst thousands of comments in Japanese (pypy means "boobies" in Japanese), other incomprehensible comments in Malay and hundreds of music fans of Look-Ka PYPY (WTF??)

Other Anonymous: Yes, I can read. I should have given a bit more context, but I was offtopic anyway. My goal was not running the demo, my goal was running pypy. I used the OS X binary from pypy.org. For those who are really good at reading, this was probably clear from the fact that my binary only crashed at library loading time.

@Anonymous: most probably, the prebuilt PyPy for Mac OS X was built on a different (older?) system than yours.

For a quick workaround, you can try to do "ln -s /usr/lib/libssl-XXX.dylib /usr/lib/libssl.0.9.8.dylib". This should at least make it work, but of course it might break in case you actually use libssl.

To avoid the potential problem of infinite tracing, the JIT bails out if it traces "too much", depending on the trace_limit. In this case, the default trace_limit is not enough to fully optimize the whole algorithm, hence we need to help the JIT by telling it to trace a bit more than usual.

I agree that having to mess with the internal parameters of the JIT is suboptimal. I plan to address this issue in the coming weeks.
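
For readers who want to try this, here is a hedged sketch of raising the trace limit from inside a script. It assumes PyPy's built-in `pypyjit` module and its `set_param` function; the helper name `raise_trace_limit` and the value 200000 are placeholders chosen for illustration, not necessarily what the demo uses.

```python
def raise_trace_limit(limit):
    """Ask the PyPy JIT to trace longer before giving up.

    Returns True if the parameter was set, False when not running on PyPy.
    """
    try:
        import pypyjit  # built-in module, only present on PyPy
    except ImportError:
        return False
    pypyjit.set_param(trace_limit=limit)
    return True


# 200000 is an arbitrary illustrative value, not a recommendation.
applied = raise_trace_limit(200000)
print("trace_limit changed: %s" % applied)
```

The same parameter can also be passed on the command line, in the form `pypy --jit trace_limit=200000 script.py`.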

it's lovely that pypy can do this. This result is amazing, wonderful, and is very kittens. pypy is fast at running python code (*happy dance*).

But.

It also makes kittens cry when you compare to CPython in such a way.

The reality is that CPython users would do this using a library like numpy, opencv, pygame, scipy, pyopengl, freej (the list of real time video processing python libraries is very large, so I won't list them all here).

Of course python can do this task well, and has for more than 10 years.

This code does not take advantage of vectorization through efficient SIMD, multiple cores or graphics hardware, and isn't careful with reusing memory - so is not within an order of magnitude of the speed of CPython code with libraries doing real time video processing.

Anyone within the field would ask about using these features.

Another question they would ask is about pauses. How does the JIT affect pauses in animation? What are the rules for when the JIT warms up, and how can you tell when the code will start running fast? How does the GC affect pauses? Is there a way to turn off the GC, or reuse memory in some way such that the GC won't cause the program to fail? (Remember that in realtime a pause is a program failure.) Does the GC pool memory of similar size objects automatically? Does the GC work well with 256MB-1GB-16GB sized objects? In a 16GB system, can you use 15GB of objects, and then delete those objects to then use another 15GB of different objects? Or will the program swap, or fragment memory causing pauses?

Please don't make kittens cry. Be realistic with CPython comparisons.

At the moment the python implementation is not as elegant as a vector style implementation. A numpy/matlab/CUDA/OpenCL approach looks really nice for this type of code. One speed up might be to reuse memory, or act in place where possible. For example, not copying the image... unless the GC magically takes care of that for you.

@illume: More or less everyone knows that you can speed up your code by writing or using an extension library. Unfortunately this introduces a dependency on the library (for instance libssl mentioned in the comment thread) and it usually increases the complexity of your code.

Using PyPy you can solve computationally intensive problems in plain Python. Writing in Python saves development time. This is what the comparison is all about.

If pypy is 5x slower than C, and SIMD is 5x faster than C... and using multiple cores is 8x faster than a single core you can see this python code is (5 * 5 * 8) 200x faster than the pypy code. This is just comparing CPU based code. Obviously GPU code for real time image processing is very fast compared to CPU based code.

Things like numpy, pyopengl etc come packaged with various OSes - but choosing those dependencies compared to depending on pypy is a separate issue I guess (but many cpython packaged libraries are packaged for more platforms than pypy).

Of course using tested, and debugged existing code written in python will save you development time: for example using sobel written with the scipy library: http://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.filters.sobel.html

The fact is CPython is fast enough, more elegant, and will save you time for realtime image processing - unless you ignore the reality that people use CPython libraries for these tasks.

Finally the given code does not prove that the frames are all processed in realtime. They give an average time over all of the frames. Realtime video requires that you meet your target speed for every frame. It would need to be extended to measure each frame to make sure that each frame is within the required time budget.
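
A rough sketch of what such a per-frame measurement could look like is shown below. The names `frame_times` and `budget_report` are hypothetical and are not part of the demo; `process` stands for whatever function handles one frame.

```python
import time


def frame_times(process, frames, clock=time.time):
    """Run process(frame) on every frame and record each duration."""
    durations = []
    for frame in frames:
        start = clock()
        process(frame)
        durations.append(clock() - start)
    return durations


def budget_report(durations, target_fps):
    """Compare every frame against the time budget of 1 / target_fps.

    Assumes at least one frame with a non-zero total duration.
    """
    budget = 1.0 / target_fps
    late = [i for i, d in enumerate(durations) if d > budget]
    return {
        "average_fps": len(durations) / sum(durations),
        "worst_frame_seconds": max(durations),
        "late_frames": late,
    }
```

For example, durations of 0.01, 0.01 and 0.1 seconds average out to 25 fps, yet the third frame alone takes more than twice the 0.04 second budget of a 25 fps target, which is the situation an average hides.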

@illume: This example shows pure python code and compares its execution time in cpython and pypy. Nothing else. Writing graphics code in pure python that does not run dreadfully slowly was to my knowledge never before shown. If enough people understand the potential of this technique and put their time into it, we will hopefully come closer to your (5 * 5 * 8) acceleration in pypy, too. I will for sure work on this.

I think you are still missing the point of the post. It was not "use pure Python to write your video processing algos". That's of course nonsense, given the amount and quality of existing C extension modules to do that.

The point is that when you want to experiment with writing a new algorithm of any kind, it is now possible to do it in pure Python instead of, say, C code. If later your project needs to move past the experimentation phase, you will have to decide if you want to keep that Python code, rewrite it in C, or (if applicable) use SIMD instructions from Python or from C, or whatever.

The real point of this demo is to show that PyPy makes Python fast enough as an early experimentation platform for almost any kind of algorithm. If you can write in Python instead of in C, you'll save 50% of your time (random estimate); and then for the 5% of projects that go past the experimentation phase and where Python is not enough (other random estimate), spend more time learning other techniques and using them. The result is still in your favor, and it's only going to be more so as PyPy continues to improve.