>>>>> "Rich" == Rich Somerfield <rich_somerfield at tertio.com> writes:
Rich> That code would be gratefully received.
Rich> rich_somerfield at nO-sPaM.tertio.com
Oops - I somehow missed your response.
I wrote the code for someone else quite a while ago to demonstrate a
point. They wanted to be able to load massive electronic schematic
files and make them available to Python. The problem was similar to
yours: due to the amount of data they were loading, Python was just
too slow and too resource hungry. They had a yacc grammar which they
were using and wanted to make the parse tree available to their Python
code. I did not have any schematic files I could load, so I picked
the largest text file I could find on my computer: the Debian Packages
file.
Since I was only trying to demonstrate a technique, I took a few
shortcuts. I did not bother trying to write a yacc grammar; I just
wrote a simple parser in C. Instead of building my own collections, I
used those that come with GLib.
There is too much code to post here, so I have placed it on our web
server.
http://www.object-craft.com.au/projects/lazy/
There is a benchmark program which compares the speed of the raw
parser and the parser accessed through a Python module. On my
machine, the Packages file is 121,111 lines and 4,471,834 bytes. The
benchmark program prints the following:
Raw parser loads in 0.70 seconds
aalib1 1.2-18
aalib1 1.2
libc6 2.2.1
libgpmg1 1.14-16
slang1 1.3.0-0
xlibs 4.0.1-11
Wrapped parser loaded and used in 0.75 seconds
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#!/usr/bin/env python
import os
import time
file = '/var/state/apt/lists/ftp.us.debian.org_debian_dists_unstable_main_binary-i386_Packages'
# Prime the buffer cache
os.system('./pkg_test %s' % file)
t0 = time.time()
os.system('./pkg_test %s' % file)
t1 = time.time()
print 'Raw parser loads in %.2f seconds' % (t1 - t0)
import dpkg
main = dpkg.load_file(file)
deps = main.packages['aalib-bin'].Depends
for dep in deps:
    print dep.Package, dep.Version
t2 = time.time()
print 'Wrapped parser loaded and used in %.2f seconds' % (t2 - t1)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
The files are:
Makefile.pre.in -- used to build module
Setup.in -- used to build module
benchmark.py -- the above benchmark program
dpkg.c -- the Python wrapper of the parser
pkg.c -- simple (and incomplete) Packages file parser
pkg.h -- parser interface
pkg_test.c -- drives the parser from C
The key to the technique is that Python proxies for the parse tree
nodes are created on demand, and the Python attributes of each proxy
are likewise created on demand.
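In pure Python the same idea looks roughly like the sketch below. The
names (LazyNode and so on) are illustrative only, not the real dpkg.c
API; in the C module the dictionary stands in for a pointer into the
C parse tree, and the expensive work happens in C.

```python
# A minimal pure-Python sketch of lazy attribute creation: proxy
# objects that materialise a field only when it is first accessed.
# LazyNode and raw_record are hypothetical names for illustration.

class LazyNode(object):
    def __init__(self, raw_record):
        # raw_record stands in for a pointer into the C parse tree.
        self._raw = raw_record
        self._cache = {}

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so each
        # field is built at most once and then cached.
        try:
            return self._cache[name]
        except KeyError:
            if name not in self._raw:
                raise AttributeError(name)
            value = self._raw[name]     # the expensive parse goes here
            self._cache[name] = value
            return value

record = LazyNode({'Package': 'aalib1', 'Version': '1.2-18'})
print(record.Package)   # field materialised on first access
```

Until you touch an attribute, nothing is built, which is why loading
the whole Packages file costs so little extra over the raw parser.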
dpkg.c is the interesting file as it demonstrates the technique.
The dpkg.c file should probably have been two files, as it contains
two things; but what the hell, I wrote this for someone to demonstrate
a point. It contains the building blocks for constructing the lazy
evaluation proxy objects, and it also contains the proxies for the
pkg.c parser.
If enough people are interested, I will do some more work on splitting
the code up to make it more general.
Oh, to build and test the code do this:
% make -f Makefile.pre.in boot
% make
% cc -g -Wall -o pkg_test pkg_test.c pkg.c `glib-config --cflags --libs`
% ./benchmark.py
Traps for young players:
1- The data is read-only.
2- If you drop the reference to the top level parse tree object,
things will go bad very quickly (main in the benchmark program).
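Trap 2 exists because the proxies only borrow storage owned by the top
level object. The usual fix, which this demonstration code does not
do, is to have every proxy hold a reference to the root so the tree
cannot be freed while a proxy is alive. A pure-Python sketch of that
pattern (all names hypothetical):

```python
# Sketch of the safe pattern for trap 2: each proxy keeps a reference
# to the top-level tree, so dropping the caller's own reference to it
# does not free the underlying storage.  Names are illustrative.

class Tree(object):
    def __init__(self, records):
        self.records = records   # stands in for the C parse tree

class NodeProxy(object):
    def __init__(self, tree, key):
        self._tree = tree        # keeps the whole tree alive
        self._key = key

    @property
    def value(self):
        return self._tree.records[self._key]

tree = Tree({'aalib1': '1.2-18'})
node = NodeProxy(tree, 'aalib1')
del tree                         # only the proxy's reference remains
print(node.value)                # still safe: the proxy owns the tree
```

In the C module the equivalent is having each proxy Py_INCREF the top
level object; without that, you must keep main alive yourself.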
- Dave
--
http://www.object-craft.com.au