Sunday, 5 June 2011

dtrace -- some updates

After spending a lot of effort on the xcall issue, I hit an issue where, occasionally, system calls would fail. The regression test shows this up by running a Perl script which continuously opens an existing and a non-existing file, plus a variety of other things.

Very occasionally, Perl would emit a warning about a file handle belonging to a file which couldn't be opened (/etc/hosts, which always exists).
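A minimal sketch of this kind of check, written in Python rather than Perl. The actual regression script is not shown here, so the function name, the counters, and the iteration count are made up for illustration. It repeatedly opens a file that must exist and one that must not, and counts any result that contradicts that.

```python
import errno
import os


def stress_open(existing_path, missing_path, iterations=100_000):
    """Return (failed_existing, missing_anomalies) after `iterations` rounds.

    failed_existing   - opens of the existing file that raised an error
    missing_anomalies - opens of the missing file that either succeeded
                        or failed with something other than ENOENT
    """
    failed_existing = 0
    missing_anomalies = 0
    for _ in range(iterations):
        # This open should always succeed.
        try:
            os.close(os.open(existing_path, os.O_RDONLY))
        except OSError:
            failed_existing += 1

        # This open should always fail with ENOENT.
        try:
            os.close(os.open(missing_path, os.O_RDONLY))
            missing_anomalies += 1
        except OSError as exc:
            if exc.errno != errno.ENOENT:
                missing_anomalies += 1
    return failed_existing, missing_anomalies
```

On a healthy kernel, something like stress_open("/etc/hosts", "/tmp/no-such-file") (the second path is only an example) should come back with both counters at zero. Any non-zero count is the sort of failure described above.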

Similarly, other apps would occasionally fail to start with rtld linker errors.

This proved very hard to track down: I was pretty certain it was related to the xcall work I was doing. The errors were rare - less than 1 in a million - and almost impossible to reproduce on demand.

I moved away from xcall debugging and found that running two simple Perl scripts at once (on a dual-core machine), each continuously opening files and nothing else, increased the error rate for as long as both scripts ran.
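Roughly, the two-script setup looks like the sketch below. Again this is Python rather than the Perl I used, and the worker source, function name, and defaults are invented for illustration. It starts one child process per core, each doing nothing but open and close an existing file, and collects the number of failed opens from each.

```python
import subprocess
import sys

# Source for each child process: open/close one file in a tight loop
# and print how many opens failed.
WORKER_SOURCE = r"""
import os, sys
path, iterations = sys.argv[1], int(sys.argv[2])
failures = 0
for _ in range(iterations):
    try:
        os.close(os.open(path, os.O_RDONLY))
    except OSError:
        failures += 1
print(failures)
"""


def run_concurrent(existing_path, workers=2, iterations=100_000):
    """Run `workers` open loops at the same time; return failures per worker."""
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE, existing_path, str(iterations)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(workers)
    ]
    # All children are already running; now wait for each and read its count.
    return [int(proc.communicate()[0]) for proc in procs]
```

Separate processes are used, rather than threads, so that each loop can be scheduled on its own core, which is what the two-script test relied on.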

To try and get a better handle on this, I moved from 64-bit kernel debugging to a 32-bit kernel, where the error rate was significantly higher.

After a lot of experimentation, it transpired that the error wasn't to do with xcall, but with the syscall provider. Specifically, a piece of assembler glue turned out to be rubbish. I am not sure why it appeared to work, but it didn't. (I had made some changes earlier on which may have broken the syscall tracing on 32-bit kernels.)

After recoding the assembler glue, things looked much better. The errors in syscall processing appeared to be gone. But a new problem surfaced - one I wasn't too surprised to see. There are a handful of 32-bit syscalls which use a different calling convention from the others. (The 64-bit code handles this, but the 32-bit code does not.)

I have nearly finished redoing the 32-bit syscall tracing, and, once done, will need to validate the 64-bit syscall tracing.

If I am lucky, the resiliency issues will disappear in the next few days or weeks and I can put out a new release.

The syscall tracing code is horribly ugly, because we have to support different calling conventions across the two CPU architectures. I may split the code up into separate x86 and x86_64 source files.