Wolfgang Hoschek wrote:
> This is even though the conversion routines are highly optimized,
> taking full advantage of pure or partial ASCII valued data, similar
> in spirit to the technique your blog mentions (except that it's in Java).
Oh, if it is Java it is not really similar in spirit. The point of the
blog is not scanning ahead for non-ASCII codes but taking advantage
of the parallelism/pipelining features of current CPUs, as exposed by
C++ intrinsic functions, to speed things up. I apologize for not being
clear.
> I do have some hope that future VMs with better dynamic optimization
> logic for memory prefetching, bulk operations, etc. could make more
> of a difference here, though. Care to explain why a dynamic optimizer
> couldn't get close to what those handcoded assembler routines do, in
> particular considering modern memory latencies?
It is highly unlikely that a programmer would write code that is readily
parallelizable* into optimal SSE2 instructions unless they knew SSE2's
constraints in the first place. They have to process data in 128-bit
chunks. The data has to be aligned on certain memory boundaries. Only
some kinds of data are allowed. Only some kinds of operations are
available. Almost any call to a function or method will break the
pipeline. Expressions have to be written with certain variable-writing
constraints to prevent pipeline stalling. Expressions have to be written
to interleave use of different execution units in the CPU.
(*I say parallelizable, because the intrinsics make pipelined
instructions look to the programmer like parallel instructions.)
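To make a few of those constraints concrete, here is a minimal sketch
(not the code from the blog). It counts occurrences of one byte value,
16 bytes at a time. The names count_byte and SKETCH_HAVE_SSE2 are
hypothetical, made up for this illustration; the intrinsics themselves
are the standard ones from <emmintrin.h>. Note the scalar loop needed
just to reach a 16-byte boundary, and that the only operations used
are a byte-wise compare and a mask extraction.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>          // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1      // hypothetical macro, for this sketch only
#endif

// Hypothetical helper: count the bytes in data[0..len) equal to target.
std::size_t count_byte(const unsigned char* data, std::size_t len,
                       unsigned char target) {
    std::size_t i = 0;
    std::size_t count = 0;
#ifdef SKETCH_HAVE_SSE2
    // Alignment constraint: step one byte at a time until data + i
    // sits on a 16-byte boundary, so the aligned load below is legal.
    while (i < len && (reinterpret_cast<std::uintptr_t>(data + i) & 15u) != 0) {
        count += (data[i] == target);
        ++i;
    }
    // Broadcast the target byte into all 16 lanes of a 128-bit value.
    const __m128i needle = _mm_set1_epi8(static_cast<char>(target));
    // Chunk constraint: the main loop only handles whole 16-byte blocks.
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_load_si128(
            reinterpret_cast<const __m128i*>(data + i));   // aligned load
        __m128i eq = _mm_cmpeq_epi8(chunk, needle);        // 0xFF where equal
        unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(eq));
        while (mask != 0) {     // count the set bits, one per matching byte
            mask &= mask - 1;
            ++count;
        }
    }
#endif
    // Scalar tail (or the whole buffer when SSE2 is not available).
    for (; i < len; ++i) {
        count += (data[i] == target);
    }
    return count;
}
```

Even in something this small, the prologue, the block loop and the tail
are all shaped by the instruction set rather than by the problem.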
The reason that current C++ compilers don't attempt to do anything
sophisticated with parallelization is that it is too hard and too
easily defeated. Providing built-in intrinsic functions that act on
special built-in 128-bit data types has turned out to be workable instead.
I think Java's best hopes are:
* add little optimizations like mine to the x86 version of the Java
libraries, and call them as native code;
* add more functions to System that can use SSE2, but hide it. For
example, a function to scan a byte array and detect the location of the
first non-ASCII code value like my example. But the Java designers could
only do this *after* it becomes clear what the useful functions are, and
this will only happen *after* programmers have explored using the SSE2
instructions for non-mathematical uses like parsing;
* add some kinds of annotations and datatypes to support small-grain
parallelized/pipelined code, generalizing SSE2 or perhaps even just
having direct equivalents to SSE intrinsics: @parallel(128) ?
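As a rough idea of what the native side of the second option might
look like, here is a sketch in C++ of a function that finds the first
non-ASCII byte. It is an illustration under my own assumptions, not the
routine from the blog and not any existing Java API: the names
first_non_ascii and SKETCH_ASCII_SSE2 are hypothetical. It uses an
unaligned load so the caller need not care about alignment, and falls
back to a plain loop where SSE2 is not available, which is the kind of
hiding a library function could do.

```cpp
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>          // SSE2 intrinsics
#define SKETCH_ASCII_SSE2 1     // hypothetical macro, for this sketch only
#endif

// Hypothetical helper: index of the first byte >= 0x80 in data[0..len),
// or len if every byte is ASCII.
std::size_t first_non_ascii(const unsigned char* data, std::size_t len) {
    std::size_t i = 0;
#ifdef SKETCH_ASCII_SSE2
    for (; i + 16 <= len; i += 16) {
        // Unaligned 128-bit load: works for any address.
        __m128i chunk = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(data + i));
        // Gather the top bit of each of the 16 bytes into a 16-bit mask;
        // a set bit means that byte is outside the ASCII range.
        unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(chunk));
        if (mask != 0) {
            std::size_t bit = 0;
            while (((mask >> bit) & 1u) == 0) {
                ++bit;          // lowest set bit = earliest offending byte
            }
            return i + bit;
        }
    }
#endif
    // Scalar tail (or the whole buffer when SSE2 is not available).
    for (; i < len; ++i) {
        if (data[i] >= 0x80) {
            return i;
        }
    }
    return len;
}
```

A parser could call this once per buffer and treat everything before
the returned index as plain ASCII, with the caller never seeing which
path was taken.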
> On the standard textual XML front: As has been noted, Xerces and
> woodstox can be made to run quite fast, but in practise, few people
> know how to configure them accordingly, and to do so reliably, and
> without conformance compromises.
A red herring. Xerces' defaults are an issue unrelated to the merits of
stimulating software developers to use modern C++ features instead of
sticking to slow 90's features.
(In any case, these optimisations are potentially applicable to
binary XML parsing as well as to real XML processing.)
> Most users can't afford to study the complex reliability vs.
> performance interactions of myriads of more or less static tuning knobs.
Same fish.
Cheers
Rick Jelliffe