vmprof compression

John Camara <john.m.camara <at> gmail.com>
2015-03-26 16:29:18 GMT

Hi Fijal,

To recap and continue the discussion from irc.

We already discussed that the stack ids are based on a counter, which is good, but I also want to confirm that the ids have locality with respect to the code. That is, similar areas of the code should have similar ids. I just want to make sure they are not random with respect to the code, as otherwise compression will not be helpful. If the ids are random, that would need to be corrected first.

Right now the stack traces are written to the file repeating the following sequence

MARKER_STACKTRACE

count

depth

stack

stack

...

stack

In order to get a high compression ratio it would be better to combine multiple stacktraces and rearrange the data as follows

MARKER_COMPRESSED_STACKTRACES

counts_compressed_length

counts_compressed

depths_compressed_length

depths_compressed

stacks_compressed_length

stacks_compressed

In order to build the compressed data you will want 3 pairs of buffers: a pair for counts, a pair for depths, and a pair for stacks. Your profiler would be writing to one set of buffers while another thread is responsible for compressing buffers that are full and writing them to the file. Once a set of buffers is full, the profiler would start filling up the other set.

For each set of buffers you need a variable to hold the previous count, depth, and stack id, each initialized to 0 before any data is written to an empty buffer. Instead of writing the actual count value into the counts buffer, you write the difference between the current count and the previous count. The reason for doing this is that the delta values will mostly be around 0, which significantly improves the compression ratio without adding much overhead. You would of course do the same for depths and stack ids.
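
The delta scheme is simple enough to sketch directly; here the previous value starts at 0, as described above:

```python
def encode_deltas(values):
    # Store each value as the difference from its predecessor;
    # the "previous" value starts at 0 for an empty buffer.
    prev = 0
    out = []
    for v in values:
        out.append(v - prev)
        prev = v
    return out

def decode_deltas(deltas):
    # Inverse transform: running sum of the deltas.
    prev = 0
    out = []
    for d in deltas:
        prev += d
        out.append(prev)
    return out
```

With local stack ids like [1021, 1022, 1022, 1025] this yields [1021, 1, 0, 3], i.e. mostly near-zero values that compress well.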

When you compress the data, compress each buffer individually to make sure like data is being compressed together. Like data compresses better than unlike data, and by saving deltas very few bits are required to represent the data, so you are likely to get long runs of 0s and 1s.

I'm sure you can now see why I don't want the stack ids to be random: if they are, the deltas will be all over the place, you won't end up with long runs of 0s and 1s, and random data itself does not compress.

To test this out I wouldn't bother modifying the C code. Instead, try it out in Python first to make sure the compression is providing huge gains and to figure out how to tune the algorithm, without having to mess with the signal handlers, write the code for the separate thread, or deal with issues such as making sure you don't start writing to a buffer before the thread has finished writing its data to the file. I would just read an existing profile file and rewrite it to a different file, rearranging the data and compressing the deltas as described above. You can get away with one set of buffers since you wouldn't be profiling at the same time.

To tune this process you will need to determine a number of stack traces per block that is small enough to keep memory down but large enough that the overhead of compression stays small. Maybe start off with about 8000 stack traces. I would try gzip, bz2, and lzma and look at their compression ratios and times. gzip is generally faster than bz2, and lzma is the slowest; on the other hand, lzma provides the best compression and gzip the worst. Since you will be compressing deltas, you can most likely get away with using the fastest compression options of each compressor without affecting the compression ratio, but I would test this, since whether it holds depends on the data being compressed. Also, one option available in lzma is the ability to set the width of the data to look at when searching for patterns. Since you are saving 32- or 64-bit ints depending on the platform, you can set this option to either 4 or 8 bytes accordingly. I don't believe gzip or bz2 have this option. By setting it in lzma you will likely improve the compression ratio.
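
All three compressors are in the Python standard library, so the comparison is easy to prototype. Here is a sketch on synthetic mostly-zero deltas; the closest thing lzma offers to the "data width" option mentioned above is its delta filter, whose dist parameter is set to the int width:

```python
import bz2
import gzip
import lzma
import struct

def pack_ints(values):
    # Pack 32-bit signed ints little-endian (use "<q" for 64-bit values).
    return b"".join(struct.pack("<i", v) for v in values)

# Mostly-zero deltas, like the scheme above would produce.
deltas = [0, 1, 0, 0, 2, 0, -1, 0] * 1000
raw = pack_ints(deltas)

compressed = {
    "gzip": gzip.compress(raw, compresslevel=1),
    "bz2": bz2.compress(raw, compresslevel=1),
    "lzma": lzma.compress(raw, preset=1),
    # Tell lzma the data is 4-byte aligned via its delta filter.
    "lzma+delta": lzma.compress(raw, filters=[
        {"id": lzma.FILTER_DELTA, "dist": 4},
        {"id": lzma.FILTER_LZMA2, "preset": 1},
    ]),
}
for name, blob in compressed.items():
    print(name, len(raw), "->", len(blob))
```

Timing each call with time.perf_counter() on a real profile dump would settle the speed-versus-ratio question for this particular data.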

You may find that counts and depths compress similarly under all 3 compressors, in which case just use the fastest, which will likely be gzip. On the other hand, maybe the stack ids will be better off with lzma. This is another reason to separate out like data: it gives you the option of using the fastest compressor for some data types while using others where they provide better compression.

I would not be surprised if this approach achieves a compression ratio better than 100x, but that will depend heavily on how local the stack ids are. Also don't forget simple things like not using 64-bit ints when you can get away with smaller ones.

As a slight variation on the above: if you find most of your deltas are < 127, you could write them out as 1 byte, and write the larger ones out as a 4-byte int with the high bit set. If you do this, don't set the lzma option to 4- or 8-byte boundaries, since your data is now a mixture of 1- and 4-byte values. This can sometimes provide huge reductions in compression time without much effect on the overall compression ratio.
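
One way to realize that variable-width scheme is sketched below. Since deltas can be negative, this sketch first zigzag-maps them to unsigned values (an assumption on my part, not something spelled out above), then uses the high bit of the first byte to distinguish the 1-byte and 4-byte forms:

```python
def encode_varints(deltas):
    out = bytearray()
    for v in deltas:
        # Zigzag-map signed deltas so small magnitudes stay small:
        # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
        u = (v << 1) if v >= 0 else (-v << 1) - 1
        if u < 0x80:
            out.append(u)                      # common case: 1 byte
        else:
            assert u < (1 << 31)               # sketch limit: 31 bits
            out.append(0x80 | (u >> 24))       # high bit marks 4-byte form
            out.append((u >> 16) & 0xFF)
            out.append((u >> 8) & 0xFF)
            out.append(u & 0xFF)
    return bytes(out)

def decode_varints(data):
    values, i = [], 0
    while i < len(data):
        b = data[i]
        if b & 0x80:
            u = ((b & 0x7F) << 24) | (data[i + 1] << 16) \
                | (data[i + 2] << 8) | data[i + 3]
            i += 4
        else:
            u = b
            i += 1
        # Undo the zigzag mapping.
        values.append(u >> 1 if u % 2 == 0 else -((u + 1) >> 1))
    return values
```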

Custom types for annotating a flow object space

Henry Gomersall <heng <at> cantab.net>
2015-03-22 20:02:08 GMT

I'm looking at using PyPy's flow object space for an experimental
converter for MyHDL (http://www.myhdl.org/), a Python library for
representing HDL (i.e. hardware) models. By conversion, I mean
converting the MyHDL model that represents the hardware into either
Verilog or VHDL that downstream tools will support. Currently, there is
a converter that works very well in many situations, but there are
substantial limitations on the code that can be converted (much greater
restrictions than RPython imposes), as well as somewhat frustrating
corner cases.
It strikes me that much of the heavy lifting of the conversion problem
can be handled by the PyPy stack.
My question then regards the following. MyHDL represents certain low
level structures as python objects. For example, there is a notion of a
signal, represented by a Signal object, that has a one to one mapping to
the target HDL language. All the attributes of the Signal object
describe how it should be converted. So, during annotation, the Signal
object should be maintained as a base type, rather than digging deeper
into the object to try to infer more about its type (which invariably
breaks things due to RPython non-conformity). There are probably a few
other types (though not many) that should be handled similarly.
How does one instruct the translator to do this? Is it a case of writing
a custom TranslationDriver to handle the custom types?
Thanks for any help,
Henry

GSoC 2015

w0mTea <w0mTea <at> 163.com>
2015-03-22 14:31:31 GMT

Dear developers,

I'm a student interested in the idea of copy-on-write list slicing.

I noticed that on the PSF's GSoC wiki, students are advised to fix
a beginner-friendly bug, but after some searching I failed to find
an appropriate one. Could you help me with this?

Another question is about building PyPy. I'm a novice in PyPy's
implementation, so I decided to modify its source code to help me
understand it clearly. But building PyPy is really slow. According
to the PyPy docs, I use this command to build it:

pypy rpython/bin/rpython --opt=jit pypy/goal/targetpypystandalone.py

It takes about an hour to complete on my computer. Then I modified a
file, adding only one line, but rebuilding through this command also
takes an hour. Is there any faster way to rebuild PyPy after a small
modification?

The server starts, but when I launch an HTTP request, I get a strange stack trace:

Error handling request

Traceback (most recent call last):
  File "/home/lg/Documents/IDEA/pypy/site-packages/aiohttp/server.py", line 240, in start
    yield from handler
  File "/home/lg/tmp/asyncio_examples/7_aiohttp_server.py", line 25, in handle_request
    message.method, message.path, message.version))
AttributeError: 'str' object has no attribute 'method'

After debugging a little bit with pdb, I see that aiohttp.HttpRequestParser parses the request correctly, but "message = yield from httpstream.read()" in aiohttp.server at line 226 returns the string "GET" instead of a RawRequestMessage object.

I'm a total newbie with PyPy internals; implementing a monotonic timer seems over-complicated to me.

But I could maybe help with tests and fix small issues like the one in aiohttp.server, if someone gives me a clue about how to fix it.

Would you be interested if I created issues on the PyPy tracker for each problem, or would I only be adding noise?

BTW, to help the PyPy project with more than my brain time, I've set up a small recurring donation for PyPy 3.

Specialization for app level types

I'd like to add some optimizations for app-level types in Pixie. What I'm thinking of is something like this (in app-level PyPy-style code):

class Foo(object):
    def __init__(self, some_val):
        self._some_val = some_val

    def set_value(self, val):
        self._some_val = val

In a perfect world the JIT should be able to recognize that ._some_val is only ever an int, and therefore store it unboxed in the instance of the type; hopefully this would decrease pressure on the GC if ._some_val is modified often. Also in a perfect world, the value of _some_val would be automatically promoted back to an object if someone ever sets it to something besides an int.

How would I go about coding this up in RPython? I can't seem to figure out a way to do this without bloating each instance of the type with an array of objects, an array of ints, and an array of floats.

Currently app-level objects in Pixie are just a wrapper around an object array. The type then holds the name -> slot_idx lookups.
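
For concreteness, the current layout described above can be sketched roughly like this (class and method names here are made up for illustration; Pixie's actual classes differ):

```python
class Type(object):
    def __init__(self, name):
        self.name = name
        self.slot_indexes = {}  # attribute name -> slot_idx

    def slot_for(self, name):
        # Allocate a new slot the first time a name is seen.
        if name not in self.slot_indexes:
            self.slot_indexes[name] = len(self.slot_indexes)
        return self.slot_indexes[name]

class Instance(object):
    def __init__(self, type_):
        self.type = type_
        self.slots = []  # one boxed object per slot

    def set_attr(self, name, value):
        idx = self.type.slot_for(name)
        while len(self.slots) <= idx:
            self.slots.append(None)
        self.slots[idx] = value

    def get_attr(self, name):
        return self.slots[self.type.slot_for(name)]
```

Every value in slots is a boxed object, which is exactly the GC pressure the question is about: storing an int unboxed would mean giving some slots a different, type-specialized representation.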

Some summary and questions about the 'small function' problem

黄若尘 <hrc706 <at> gmail.com>
2015-03-18 15:49:33 GMT

Hi Fijal,

This is Ruochen Huang. I want to begin writing my proposal, and there is actually not much time left. I have tried to summarize what I have understood so far and the questions I still have. Please feel free to point out anything incorrect in my summary; as for the questions, if you think a question is meaningless you can just skip it, or provide a document link or source code path if it would take too much time to explain.

As far as I understood,

The ‘small function’ problem occurs when one trace tries to call another trace. At the source code level, this is the situation where, inside one loop, there is a call to a function which contains another loop.

Let me take the small example we discussed before: function g() calls function f(a,b,c,d) in a big loop, and there is another loop inside f(a,b,c,d). In the current version of PyPy, two traces are generated:

the trace for the loop in g(), let me call it T1. g() tries to inline f(a,b,c,d), but since there is a loop in f, T1 inlines only the first iteration of that loop. Say f is taken apart into f1 (the first iteration) and f’ (the remaining iterations); what T1 does is: start the loop in g -> do f1 -> do some allocations for a PyFrame (in order to call f’) -> call_assembler for f’.

the trace for the loop in f’, let me call it T2. T2 first unpacks the PyFrame prepared by T1, then does the preamble work, which means f’ is once again taken apart into f2 (the 1st iteration of f’, which is also the 2nd iteration of the original f) and f’’ (the 3rd through last iterations), with a label at the head of each. So finally we can say T2 consists of 3 parts: T2a (PyFrame unpacking), T2b (with label1, doing f2), T2c (with label2, doing f’’).

As mentioned above, we have T1 -> T2a -> T2b -> T2c. From the viewpoint of the loop in f, f is distributed as T1(f1) -> T2a -> T2b(f2) -> T2c(f’’), which means the loop in f was peeled twice, so T2b might not be needed. Furthermore, the PyFrame work before call_assembler in T1 and the unpacking work in T2a are a waste. I can’t understand very well why it’s a waste, but I guess it’s because T2c(f’’) actually does a similar thing to f1 in T1 (or: T2c is already *inside* the loop). Anyway, since T2b is also not needed, we want to have T1 -> T2c, and since the PyFrame work in T2a is eliminated, the allocation of the PyFrame in T1 can also be eliminated. So ideally we want T1’ (without PyFrame allocation) -> T2c.

Some questions until now:

What’s the bridge you mentioned? To be honest I have only a very slight understanding of bridges. I know a bridge is executed when some guard fails, but as far as I knew, in a normal tracing JIT compiler only one path of a loop is traced, and any guard failure makes execution escape from the native code back to the VM. I guess the bridge is a special kind of trace (native code); is that right?

Could you please explain more about why T2b is not needed? I guess the answer may be related to the “virtualizable” optimization for PyFrame, so what if PyFrame were not virtualizable? I mean, in that situation, does the problem disappear, or become easier to solve?

What are the difficulties in solving this problem? I’m sorry I’m not so familiar with the details of the RPython JIT, but in my opinion we just need to make the JIT know that:

when it tries to inline a function and encounters a loop, so the inlining has to stop, it’s time to do an optimization O.

what O does is delete the allocation instructions for the PyFrame before call_assembler, and then tell call_assembler to jump to the 2nd label of the target trace (in our example, T2c).