On Fri, Feb 10, 2012 at 10:30 AM, Peter Geoghegan <peter(at)2ndquadrant(dot)com> wrote:
> [ new patch ]
I spent quite a bit of time looking at this today - the patch
specifically, and the issue of making quicksort fast more generally.
It seemed to me that if we're going to have separate copies of the
quicksort code for tuple sorting, we might as well go whole hog and
specialize those copies to the particular needs of tuplesort.c as much
as possible. Accordingly, I whacked the code around so that it knows
that it is sorting SortTuple objects and need not conditionalize at
runtime on the size of the objects being swapped. You suggested
upthread that this might be worthwhile, and it seems that it is, so I
think we should do it.
Your patch removes the CHECK_FOR_INTERRUPTS() call from
comparetup_heap, which is no good. However, since I'd already decided
to specialize the copies of quicksort intended for sorting as much as
possible, it made sense to me to refactor things so that the qsort
routine itself, rather than the comparator, is responsible for calling
CHECK_FOR_INTERRUPTS(). This slightly reduces the number of times we
CHECK_FOR_INTERRUPTS(), but never allows more than a few comparisons
before doing it.
I find that your pg_always_inline macro is equivalent to just plain
inline on my system (MacOS X v10.6.8, gcc 4.2.1). It seems to need
something like this:
+#elif __GNUC__
+#define pg_always_inline inline __attribute__((always_inline))
...but I'm not very happy about relying on that, because I don't know
that it will work on every gcc version (never mind non-gcc compilers),
and I'm not convinced it's actually improving performance even on this
one. The documentation seems to indicate that this is intended to
force inlining even when not optimizing, which may have something to
do with the lack of effect: that's not really the point here anyway.
What I did instead is to replace template_qsort_arg.h with a script
called gen_qsort_tuple.pl, which generates a file called qsort_tuple.c
that tuplesort.c then #includes. This seems more flexible to me than
the macro-based approach. In particular, it allows me to generate
versions of qsort with different call signatures. The attached patch
generates two:
static void qsort_tuple(SortTuple *a, size_t n, SortTupleComparator
cmp_tuple, Tuplesortstate *state);
static void qsort_ssup(SortTuple *a, size_t n, SortSupport ssup);
The first of these is a drop-in replacement for qsort_arg() - any
tuplesort can use it, not just heap sorts. But it is faster than
qsort_arg() because of the specializations for the SortTuple data
type. The second handles the special case where we are sorting by a
single key that has an associated SortSupport object. In this case we
don't need to carry the overhead of passing around the Tuplesortstate
and dereferencing it, nor do we need the SortTupleComparator: we can
just pass the SortSupport itself. Maybe there's a way to get this
effect using macros, but I couldn't figure it out. At any rate, at
least for the single-key case, this approach effectively forces the
comparator to be inlined without requiring pg_always_inline.
With this patch, I get the following results, as compared with your
2012-02-10 version and master, using the same test cases I tested
before.
select * from nodups order by g offset 10001;
tps on master: 289.471274, 289.967984, 289.595958
tps on 2012-02-10 version: 359.150280, 356.284723, 356.888900
tps on attached version: 388.212793, 386.085083, 386.867478
select * from twocol order by a, b offset 10000;
tps on master: 261.676611, 260.440886, 259.529362
tps on 2012-02-10 version: 283.941312, 279.981723, 283.140208
tps on attached version: 283.146463, 278.344827, 280.727798
select * from f8 order by g offset 10000;
tps on master: 228.299924, 222.650355, 227.408506
tps on 2012-02-10 version: 260.289273, 257.181035, 256.377456
tps on attached version: 276.985299, 275.341575, 274.428095
There's some variation (which I can't account for) between the results
on master now and the results on master before - possibly just code
shifting around between cache lines due to unrelated changes, or maybe
some inadvertent change in my test setup. But it looks to me like
your 2012-02-10 version, without any type-specific optimizations, does
pretty much just as well on multi-key sorting as your previous
version, which had them - or if there is a difference, it's pretty
small.
Overall, I think the numbers for the version I'm attaching here look
pretty good: the single-key performance is clearly better than your
last version, and the multi-key performance is very slightly worse. I
think that slight worsening is a good trade-off, though, because this
version can use qsort_tuple() for all kinds of tuplesorts, not just
heap tuplesorts. Still, it seems like we ought to be able to do even
better: the multi-key specialization that you had in your patch can be
coded in this framework, too, and in theory those are ndependent of
the swapcode improvements. I tried coding up a multi-key
specialization which I believe to be quite similar to what you did,
but it didn't seem to do much. I'm attaching it here; maybe there's
some way to improve it (or a different test case where it pays off).
It strikes me that if we wanted to take this further, we could look at
squeezing out ApplySortComparator. For example, suppose that, upon
discovering that we can do an in-memory quicksort on a single sort
key, we make an initial pass over the data where we check whether it's
sorted and, as we go, swap all the entries with isnull1 = true to the
end of the memtuples array. We then sort the isnull1 = true entries
with the standard comparator, and the isnull1 = false entries with an
optimized comparator that elides most of ApplySortComparator and
instead just calls the comparison function directly. We then decide
on whether to emit the isnull1 = true entries first or last based on
NULLS FIRST/LAST, and decide whether to emit the remaining entries in
forward or reverse order based on ASC/DESC. Or maybe not exactly that
thing, but something like that, so that we sort the null and non-null
entries separately. The additional complexity in the read-out logic
would probably be more than offset by being able to use a simpler
comparator.
What do you think of this version?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company