ARGH! I vastly underestimated the relative size of the strings in the LDAP database and the offsets to the strings. The ratio is about 2:1. I.e. for two megs of raw data, I need one more meg for the offsets. And that is not counting the indices (which are much smaller).

parse still can't handle the phone CD. The reason is that Linux will not let me mmap more than 2 gigs. This probably has to do with some stupid compile time constants like the relative position of code, stack, reserved kernel space and the mmap space. Or so. I'm just pulling this out of thin air, sorry. So if there ever was a compelling reason for 64-bit computing, this would be it. Time to get myself an Athlon 64 box. Anyone have a spare one for me? *sigh*

On the other hand, I could have foreseen it. I haven't counted them, but I expect the phone CD to carry at least 20 million records, which means at least 80 megs just for the pointers to the records. This is a lot of data. Maybe I can think of an easy way to be more space efficient. Let's sleep over it. Comments or ideas welcome.

File systems keep pointer sizes down by segmenting the data into smaller independent chunks. Maybe I should do the same. The data could be encoded with a static Huffman tree. Bad tradeoff here. Maybe I will just have to settle for the data for Berlin and Brandenburg or so. Too bad. I really looked forward to this.

The event notification stuff has progressed quite nicely. It is the basis of some benchmarks I did on Linux and the BSDs, and it made quite a wave when I posted it on Slashdot.

The goal was to create a good abstraction for the various event notification APIs, but also around the various sendfile variants, so it would be easy to write a highly web server. The abstraction part is in libowfat, and the web server is called gatling (it's unreleased so far, but you can get the source code via cvs from :pserver:cvs@cvs.fefe.de:/cvs, check out "gatling".

gatling now uses sendfile on Linux, FreeBSD, Solaris, HP-UX, AIX, and uses mmap+write on the other systems (this provides zero-copy TCP at least on NetBSD and IRIX). gatling uses better-than-poll event notification APIs on Linux, the BSDs, IRIX, Solaris and MacOS X. So it should be the ideal basis to write highly scalable network servers and clients.

gatling is not just an HTTP server, it also is an FTP server. And the beauty of it is that it can be used in ad-hoc mode, i.e. you just run it and the current directory is exported read-only to the world. Uploads are possible via FTP, but only if the receiving directory is world writable. This is incredibly useful, much more so than I would have initially thought. I'm planning to also add SMB and NFS read-only export. Then gatling would be the ideal LAN party network publication tool.

Also, I started hacking on tinyldap again. A friend of mine reverse engineered the database format on a "telephone book CD" for Germany. I thought to myself: this would be a formidable data set to put into tinyldap to prove that it is not just a toy. So he wrote a converter that spit out tab separated data and I converted it to LDIF and it's over 2 Gigs worth of ASCII LDIF. Whew. tinyldap assumes that all your data fits into memory. Not so much tinyldap itself but the programs that create the data file tinyldap serves from. So I started hacking on parse (which creates the binary data file but no indices on it). I ended up completely reworking it and over days my local source tree would not create correct data files. It was a strange experience for me because normally I only do small changes on my local cvs checkout and check every change in immediately. On the other hand, I always wait until it at least compiles and survives very basic tests. But now I checked everything in.

The next step is to convert the index generation to work like sort(1), i.e. quicksort on smaller blocks and then mergesort on these blocks.

That's harder than it sounds because I am sorting offsets. The comparison function has to seek to the offset and compare the data there. So the merge sort would drown the disk in wild seeking around unless I attach the first n bytes behind the offset to the offset before the merge sort and throw it away afterwards. I'm curious how much seeking I will end up doing.

I am also seeking sponsors for my work on tinyldap. If you run company working with LDAP, please go to your LDAP admin when he is doing LDAP administration and see for yourself. Chances are, he will be cursing all the time. Why not spend some money on the development of an LDAP server that does not suck? Hey, you can even say what features you want implemented! If you are interested, please contact me.

I have been working on event notification again. This time, I decided to throw away the old callback based API. It was not as easy to program as I had hoped.

The impetus was that djb has defined an API for event notification in the mean time; it's called io. At first, it didn't look so good to me, but the second look convinced me to use this API for my new implementation.

So far, the code supports epoll (Linux 2.6), sigio (Linux 2.4), kqueue (FreeBSD, OpenBSD and NetBSD-current) and /dev/poll (Solaris). I also wrote a small web server using this API, to conduct some kernel networking scalability tests. Yesterday, I added read-only FTP support. Maybe I'll also add read-only SMB and NFS support, then it would be the ideal file publishing platform for LAN parties or so.

I completed the vorbis diff. It can now also speed up encoding by 25% if you have SSE.

The CCC is doing another camp in a few days. I find the entrance fee of 100 Euros a little steep, but I nonetheless hope many of you will consider coming. I will probably do a workshop on either pattern matching and text retrieval or SIMD hacking. If you are coming and have a preference in the matter, please drop me a mail.

Other than that there's not much to say. It is too hot in Germany right now to do much meaningful work, and I don't have an air conditioner, so I'm basically waiting for the night to get things done (like now, it's 23:00 local time).
The EPIA M decided to come back once more. I'm not sure if it is the mainboard or the power supply. But I use it now to play back Winnie Pooh DivX recordings to my son, and it's quite good enough for that.

Oh, in the mean time AMD has released good documentation on SSE and SSE2 (in their Opteron documents). So now AMD documentation is in every respect much better than the Intel stuff. For some reason, the Intel compiler does not produce better code than gcc for me either, even on my Pentium 3 notebook. Bad karma, maybe ;)

My 1 month old VIA EPIA M has died on me :-( When I press the power button, the CPU fan starts very briefly and then stops again. No beeps, nothing else. The power supply fan has the same behaviour. Too bad.

Anyway, c't had an article about EPIA M and they said theirs had a VIA Nehemiah instead of the Ezra that mine had. The difference is that Nehemiah has SSE and no 3dnow. So I hope that I will get a replacement EPIA board with Nehemiah, it is slightly faster.

I took the opportunity to port my libvorbis 3dnow patch to SSE, but boy was I mistaken about the work that would entail! I thought I'd just do it on the train using the new builtins the Intel C compiler defines and gcc also supports. It turns out that my gcc version (3.1.1) creates bad code for some of them, and it generates bad code for all of them unless you turn the optimizer on. gcc wants to save the xmm registers to the stack otherwise, and forgets to align the stack storage to 16 bytes, which is a requirement for SSE. The result is a seg fault in innocuous looking code.

The Intel documentation really sucks. They find it completely unnecessary to document their SSE stuff, you only get documentation about SSE+SSE2. So in the middle of your hacking you find that the instruction you used does not exist in SSE. Duh. Unfortunately, I haven't found any AMD documentation about SSE; their CPU documentation is excellent, especially in comparison to Intel's crap.

Well, back to my vorbis hacking. The patch only speeds up decoding, and it does 1/4 to 1/3 speedup on my Athlon, although I had to work around unaligned data and data layout that is not so vector friendly (one array per channel, so you have to interleave the data manually for the sound card). I put it on my home page, let's see how many people will find it useful. If you download it, please send me an email and tell me! I want to know! ;)

I was reading the documentation for the different x86 CPUs regarding my SIMD work, and I have started comparing the latencies. I was shocked just how bad the Pentium 4 is.

The Pentium 4 has really horrible latencies, in particular where it really matters: at the SIMD instructions. For example, movq has a latency of 6 on the Pentium 4 (VIA C3: 1, Pentium 3: 1, Athlon: 2). This means moving one MMX register to another, no memory stalls or AGIs included!

Now movq is pretty important because the x86 architecture uses source+destination addressing in the instructions, not source1+source2+destionation as most RISC CPUs, so you need to copy data around all the time to get something done. This means that you have to find five other instructions to schedule between the movq and the next instruction to keep the pipeline filled. That is next to impossible in typical applications. There are not enough registers to unroll a typical loop four times.

No wonder the Pentium 4 performs so badly per clock cycle in comparison to, well, everything else on the market. I'll stick with my Athlon, thank you. Until I can buy a Hammer, that is ;-)

BTW: Since I mentioned it here, I put the 3dnow vorbis decoder diff on my home page, and 46 people downloaded it so far. Whee! Now I need to find the time to do it again in SSE, that should speed things up even more on my Athlon XP.

@nymia: you describe the code slave, not the coder. You describe someone who does it because he needs the money, not because he likes to do it. Don't just copy other people's code; in most cases it turns out those others didn't know what they were doing, too. Read their code and understand why they did it like that. Then you can copy from yourself.
Only copy from others after you have completely understood what and why they were doing.

SIMD hacking is starting to be a lot of fun. That and trying to find clever ways to avoid branches. I spent the last few days hacking vorbis to make it faster on my slow C3. Vorbis is completely float based, and the C3 has 3dnow, so this was a good opportunity to learn 3dnow. So far I converted the q&p loop in vorbis_lsp_to_curve, the overlap/add (only large/large) and copy sections of vorbis_synthesis_blockin, vorbis_apply_windowmdct_butterfly_generic and the 2-channel same-endianness segment of ov_read and netted a 10% speed-up on my Athlon. The C3 has about the same speed-up, which is interesting since the C3's FPU is running and only half the CPU clock speed, so 3dnow should be a bigger gain. Maybe vorbis is limited by the RAM bandwidth and cache misses and not FPU?

I have to say that the AMD documentation is much better than the Intel documentation, especially about SIMD. I'll look for some SSE docs on their web page, because the Intel SSE docs are even worse than their other docs. The Intel web site is less forthcoming and their documentation sucks in comparison. At least all the necessary documents can be downloaded for free and without registration (and as PDF and not winword) ;-)

Anyway, during my SIMD hacking I found that I need a good assembly level debugger, a good profiler (I'm using gcov right now, but it won't tell me where the time is spent, only what code is executed often -- close but no cigar; gcov is too unprecise, hrprof seems to get the timings wrong) and a good stall simulator. If someone knows a free tool where I can specify a given target CPU and run my code and it will then tell me which assembly instructions caused which stalls and for what reason, it would be most helpful. Something like this should be relatively easy to do for the bochs people. Or for valgrind, once someone adds MMX, SSE and 3dnow support to their JIT engine.

By the way: a big hooray for valgrind! What a great piece of software! If you develop software, use valgrind!

I wonder how much crypto bulk cipher performance could be gained using MMX and SSE. SSE is next on my list. Hacking code is great, every day you have the opportunity to learn new exciting things! ;-)

If you are new to SIMD and want to see how really skillful people do it, have a look at the ffmpeg sources. I find the altivec stuff (in libavcodec/ppc) particularly impressive.

I started looking at MMX, SSE and 3dnow, and I started hacking at ffmpeg. I actually found a few routines that were reasonable often called, small enough and candidates for SIMD, and wrote a translation. I learned a lot in the process, and now the patches have been integrated into ffmpeg. I will try to submit enough patches to be mentioned in the ChangeLog or in some comment, so I have something to show.

In the process, I found a great profiling package called hrprof.
It basically uses a new gcc instrument-this-code option and the Pentium cycle counter to write profiling data usable by gprof but much more accurate. Great tool!

Using the profiler, I found that ffmpeg is spending much time in quant_psnr8x8_c, dct_sad8x8_c and dct_sad16x16_c (and a whole lot of already mmx/mmx2 optimized functions). Those will be my next targets, let's see how far I get. It's quite an experience using those SIMD instructions, so very different from normal SISD hacking. I find it quite rewarding so far.

VIA EPIA-M

The shipment with the Infineon DIMM arrived yesterday, and I put together my new EPIA-M box. I put together some PXE net-boot configuration and it now boots diskless. It's on mobile mode (several free-hanging parts connected with strange cables) and is virtually noise-less, although the CPU does have a fan.

Most of the hardware is supported. The network card, the USB, USB2 and Firewire controllers, the sound card... but not the graphics card. I can use mplayer and XFree86 in VESA mode, but that sucks very much, in particular since I want to use it as web clicking PC for my wife, and scrolling a large Mozilla window without hardware blitting is not what I had in mind. The graphics chipset is not even in the lspci database, and I found no data sheet or description of it. It is apparently called CastleRock and is some sort of S3 chip, but what do I know? This is very disappointing, in particular since this hardware was advertised as having "full Linux support". Well, it's not so full after all. If anyone here can help me, please contact me! I already looked at google and tried the current XFree86 snapshot.

Other than that, this is some great hardware! It is fast enough to play full-screen hq-divx movies with AC3 soundtrack at (according to vmstat) 25-40% CPU load. I got the faster one with the 933 MHz CPU, but I guess the slower one would have been sufficient.

I found out about gmane today. What a great site! I have been waiting for something like this for years! They basically offer mailing lists accessible via nntp, i.e. it's a Usenet news server, just without the usual Usenet newsgroups, instead it has mailing lists.

Why is that so great? Because I hate web interfaces to mailing lists. In particular I loathe bad ones like the one for ffmpeg on Source Forge. They don't have all mailing lists, and they don't have all the articles for every mailing list, but it is possible to submit mailing list archives. To my great surprise, I found that they already include the mailing list for one of my projects, the diet libc! Wow, what an ego trip ;-)

Anyway, I mentioned this yesterday in my diary, but overwrote it when I tried to post a second diary entry, which overwrote the first one. I started working on an event notification framework. I wrote it to learn about new platform specific APIs like sigio and epoll on Linux. I am planning to add kqueue support as well, but haven't gotten around to it yet. I measured some 8000 HTTP connections per second on my desktop box with it. With HTTP keep-alive I even got over 30000 transactions per second in one benchmark.

The Economist has run a very interesting story. It's about some polls about how people see other countries. The USA approval in Europe has dropped dramatically, even in UK only 50% of the population approve! Even more dramatic: Europe's approval of Israel is only 38%, that's way down there, almost as bad as Iraq. The strangest result is that 60% of the Europeans and 80% of the Americans want the EU to be a strong leader. Is that a call to rescue the Americans from their own government? Food for thought.

I wasted the better half of the day trying to get glibc to compile. It just wouldn't work for me. Nobody else appears to have problems with this.

The problem was that glibc's new ld.so checks whether the ELF run path is set. This is done for shared libraries using -Wl,-rpath, which you might have seen somewhere already. Or you can override it at run time with $LD_LIBRARY_PATH. Or you can specify a default at link time with $LD_RUN_PATH.

This I happen to do to make all those stupid GNOME applications find their libraries on my system, which I do not want clobbering my /usr/local/lib, so I put them somewhere under /opt.

How does ld.so check? With assert(). Apparently nobody ever tested this code, because assert calls __assert_fail, which calls some internal printf clone, which calls some conversion routine for numbers, and that one segfaults before anything is actually printed.

So ld.so segfaults before calling any syscall that could be observed with strace. D'oh!