Here's a multikey sorting problem that's bothering me; I think it's well
researched but I can't find the right keywords.
Say we have a collection of people with height and hair length
information. We want to get people who are "tall and long-haired". More
precisely, we want to order people by rank(p.height) +
rank(p.hairlength), where rank(p.k) is the function that returns the
rank of person p if ordered by key k.
The brute force approach would essentially compute the two ranks and
then sort by their sum. That's three sorts and at least one temporary
array. But I think things can be done considerably better. Any ideas?
Thanks,
Andrei

Here's a multikey sorting problem that's bothering me; I think it's well
researched but I can't find the right keywords.
Say we have a collection of people with height and hair length
information. We want to get people who are "tall and long-haired". More
precisely, we want to order people by rank(p.height) +
rank(p.hairlength), where rank(p.k) is the function that returns the
rank of person p if ordered by key k.
The brute force approach would essentially compute the two ranks and
then sort by their sum. That's three sorts and at least one temporary
array. But I think things can be done considerably better. Any ideas?
Thanks,
Andrei

Here's a multikey sorting problem that's bothering me; I think it's well
researched but I can't find the right keywords.
Say we have a collection of people with height and hair length
information. We want to get people who are "tall and long-haired". More
precisely, we want to order people by rank(p.height) +
rank(p.hairlength), where rank(p.k) is the function that returns the
rank of person p if ordered by key k.
The brute force approach would essentially compute the two ranks and
then sort by their sum. That's three sorts and at least one temporary
array. But I think things can be done considerably better. Any ideas?
Thanks,
Andrei

The brute force approach would essentially compute the two ranks and
then sort by their sum. That's three sorts and at least one temporary

I think, you are mixing two problem levels:
Level one (theory):
How are the two properties weighted against each other
and can be combined to a total weight (a weight-function).
Level two (programming):
how can this be expressed in an optimal algorithmical way,
given D2.
Your question suggests, the the (naive) sum-function is
already the solution to level one.
But I suspect, that the solution for the problem requires
rather to find a proper weight function than a
combinatation-algorithm.
Christof

The brute force approach would essentially compute the two ranks and
then sort by their sum. That's three sorts and at least one temporary

I think, you are mixing two problem levels:
Level one (theory):
How are the two properties weighted against each other
and can be combined to a total weight (a weight-function).

There's no weighing. The sorting order is rank(k1) + rank(k2).

Weighting doesn't imply a simple linear constant. The rank function is an
example of a non-linear weighting scheme. Non-linear weighting functions are
common in complex sorting and classification problems.

A k-d tree isn't directly applicable here - the problem is that
the input data isn't normalized, so you can't just drop people
into a 2d grid consisting of one axis being height, and the other
axis being hair length. You need to 'normalize' the heights and
lengths into ranks first, *then* insert them, and normalizing the
lengths -> ranks is an NlogN process to begin with. Constructing
a k-d tree isn't free either - you either have to pre-sort your
inputs, or sort-as-you-go, which works out to NlogN as well.
On Friday, 27 January 2012 at 20:07:44 UTC, sclytrack wrote:

On 01/27/2012 08:36 PM, Andrei Alexandrescu wrote:

Here's a multikey sorting problem that's bothering me; I think
it's well
researched but I can't find the right keywords.
Say we have a collection of people with height and hair length
information. We want to get people who are "tall and
long-haired". More
precisely, we want to order people by rank(p.height) +
rank(p.hairlength), where rank(p.k) is the function that
returns the
rank of person p if ordered by key k.
The brute force approach would essentially compute the two
ranks and
then sort by their sum. That's three sorts and at least one
temporary
array. But I think things can be done considerably better. Any
ideas?
Thanks,
Andrei

Here's a multikey sorting problem that's bothering me; I think it's well
researched but I can't find the right keywords.
Say we have a collection of people with height and hair length
information. We want to get people who are "tall and long-haired". More
precisely, we want to order people by rank(p.height) +
rank(p.hairlength), where rank(p.k) is the function that returns the
rank of person p if ordered by key k.
The brute force approach would essentially compute the two ranks and
then sort by their sum. That's three sorts and at least one temporary
array. But I think things can be done considerably better. Any ideas?

Just a further clarification ...
If a person is tall and has short hair, would they be included in the
sort? In other words, are you ONLY looking for tall AND long-haired people
or are you looking for relative ranking of tallness and hairiness?
--
Derek Parnell
Melbourne, Australia

Here's a multikey sorting problem that's bothering me; I think it's
well researched but I can't find the right keywords.
Say we have a collection of people with height and hair length
information. We want to get people who are "tall and long-haired".
More precisely, we want to order people by rank(p.height) +
rank(p.hairlength), where rank(p.k) is the function that returns the
rank of person p if ordered by key k.
The brute force approach would essentially compute the two ranks and
then sort by their sum. That's three sorts and at least one temporary
array. But I think things can be done considerably better. Any ideas?

Just a further clarification ...
If a person is tall and has short hair, would they be included in the
sort? In other words, are you ONLY looking for tall AND long-haired
people or are you looking for relative ranking of tallness and hairiness?

Everybody is included in the sort: the size of the array before and
after the sort is the same. Then, of course, the top K problem is very
interesting too.
Andrei

Here's a multikey sorting problem that's bothering me; I think
it's well researched but I can't find the right keywords.
Say we have a collection of people with height and hair length
information. We want to get people who are "tall and
long-haired". More precisely, we want to order people by
rank(p.height) + rank(p.hairlength), where rank(p.k) is the
function that returns the rank of person p if ordered by key k.
The brute force approach would essentially compute the two
ranks and then sort by their sum. That's three sorts and at
least one temporary array. But I think things can be done
considerably better. Any ideas?

I don't think you can get much better than this.
Knowing the rank (for one key) for one element requires looking
at all the other elements. Thus, knowing the rank of all elements
isn't possible in linear time or faster.
Knowing the ranks of only one key doesn't get you much
information regarding the final result. Even the item ranked last
on one key can tie for first place in combination with the other.
These observations seem to impose a strict dependency on the
order the data is calculated. I don't see any shortcuts, but I'd
love to be proven wrong.
You may want to ask on http://cstheory.stackexchange.com/ .

If a temporary array is allowed, the ranks and the sum of the ranks
might be computed by a diogonal sweep over the area defined by the two
dimensions and populated by the elements.
Population and sweep would each be linear in time.
-manfred

If a temporary array is allowed, the ranks and the sum of the ranks
might be computed by a diogonal sweep over the area defined by the two
dimensions and populated by the elements.
Population and sweep would each be linear in time.
-manfred

Interesting, I have some thoughts along the same lines. Could you please
give more detail?
Andrei

The sweep is linear in time only, if the populatable points of the area
are bount by O(n), which is not neccessary the case.
Therefore, I don't see to find in the generell case a solution faster
than
O( n + k*log( k)),
where `n' is the number of elements of the input and `k' is the
predefined number of "best of the breed" to be sorted.
-manfred

The brute force approach would essentially compute the two
ranks and then sort by their sum. That's three sorts and at
least one temporary array. But I think things can be done
considerably better. Any ideas?

Can you define "better"?
Obviously you can't beat the O(n lg n) for comparison based
ordering. The O(n) space may be avoidable, but it doesn't seem
like the end of the world.
Does this problem occur frequently enough in practice to justify
having an optimised version in the standard library? Three sorts
+ O(n) space sounds fine to me.

The brute force approach would essentially compute the two ranks and
then sort by their sum. That's three sorts and at least one temporary
array. But I think things can be done considerably better. Any ideas?

Can you define "better"?
Obviously you can't beat the O(n lg n) for comparison based ordering.
The O(n) space may be avoidable, but it doesn't seem like the end of the
world.
Does this problem occur frequently enough in practice to justify having
an optimised version in the standard library? Three sorts + O(n) space
sounds fine to me.

I wasn't thinking necessarily of the standard library, but it is
something that is often needed even if not perceived as such. For
example, the stock screeners offered by financial sites are crappy -
they allow you to specify a range of e.g. P/E, P/book value, dividends
etc. etc. and sort by one of those. This is crappy - generally what one
wants is a selection of "best of breed" stocks that are among the top
ones in a variety of categories. (Relative importance could be assigned
to each category.)
A colleague at Facebook mentioned that you can use O(n) bucket sort for
the last-leg sorting.
The best approach I'm thinking of is to allocate an extra array of
size_t. First sort by k1 and then fill the array with 0, 1, ..., n-1.
Then sort simultaneously the two arrays by k2, and then add 0 to the
first slot, 1 to the second slot, ..., n-1 to the last slot. Continue
this by any number of keys, and finally sort using the secondary array
as a key. This works for any number of keys and only requires one extra
array of size_t.
The more interesting problem is yielding the answer lazily. I'm unclear
what the best algorithm for such would be.
Andrei

I wasn't thinking necessarily of the standard library, but it is
something that is often needed even if not perceived as such. For
example, the stock screeners offered by financial sites are crappy -
they allow you to specify a range of e.g. P/E, P/book value, dividends
etc. etc. and sort by one of those. This is crappy - generally what
one wants is a selection of "best of breed" stocks that are among the
top ones in a variety of categories. (Relative importance could be
assigned to each category.)

[...]
If I understand the problem correctly, isn't this a kind of optimization
problem? I.e., given a set A of k objects with n attributes, find the
object(s) that maximizes a given measure function f over its attributes,
where f is a weighted sum of n attributes.
If you treat the attributes as components of an n-dimensional vector,
then you could recast the problem as the convex hull of k hypercuboids,
where each hypercuboid lies between (0,0,0,...) and (x1,x2,x3,...). The
points on the convex hull then are the high-ranking points (points
inside the hull are obviously suboptimal). Furthermore, the closest
points to the line spanned by the vector (1,1,1,...) are the points that
maximize the most number of attributes.
If you "peel away" the points on the convex hull, then the convex hull
generated by the remaining points gives you the "second class" points.
Peel away those, and the convex hull gives you the "third class" points,
etc..
Granted, this isn't exactly the original problem as stated, but it seems
to be closer to what you intend to achieve with your stocks example.
T
--
If the comments and the code disagree, it's likely that *both* are
wrong. -- Christopher

If you treat the attributes as components of an n-dimensional vector,
then you could recast the problem as the convex hull of k hypercuboids,
where each hypercuboid lies between (0,0,0,...) and (x1,x2,x3,...). The
points on the convex hull then are the high-ranking points (points
inside the hull are obviously suboptimal). Furthermore, the closest
points to the line spanned by the vector (1,1,1,...) are the points that
maximize the most number of attributes.
If you "peel away" the points on the convex hull, then the convex hull
generated by the remaining points gives you the "second class" points.
Peel away those, and the convex hull gives you the "third class" points,
etc..
Granted, this isn't exactly the original problem as stated, but it seems
to be closer to what you intend to achieve with your stocks example.

That's close. Closer is to use simplexes instead of hypercubes. And the
space is defined by rank, not the original features.
Andrei

If you treat the attributes as components of an n-dimensional vector,
then you could recast the problem as the convex hull of k hypercuboids,
where each hypercuboid lies between (0,0,0,...) and (x1,x2,x3,...). The
points on the convex hull then are the high-ranking points (points
inside the hull are obviously suboptimal). Furthermore, the closest
points to the line spanned by the vector (1,1,1,...) are the points that
maximize the most number of attributes.
If you "peel away" the points on the convex hull, then the convex hull
generated by the remaining points gives you the "second class" points.
Peel away those, and the convex hull gives you the "third class" points,
etc..
Granted, this isn't exactly the original problem as stated, but it seems
to be closer to what you intend to achieve with your stocks example.

That's close. Closer is to use simplexes instead of hypercubes. And
the space is defined by rank, not the original features.

[...]
Actually, the reason I chose hypercubes was because I was hoping that
the convex hull with hypercubes in the original features' space would be
equivalent (or easily massaged to be equivalent) to using simplexes in
rank space. (Although the part about the (1,1,1,...) vector may not
work out quite that nicely in this case.) Otherwise it doesn't really
give you a good algorithm since you'll have to precompute the ranks.
T
--
War doesn't prove who's right, just who's left. -- BSD Games' Fortune

Actually, the reason I chose hypercubes was because I was hoping that
the convex hull with hypercubes in the original features' space would be
equivalent (or easily massaged to be equivalent) to using simplexes in
rank space. (Although the part about the (1,1,1,...) vector may not
work out quite that nicely in this case.) Otherwise it doesn't really
give you a good algorithm since you'll have to precompute the ranks.

I understand. You're up to something; if you get to prove anything in
that direction, I'd be interested. Thanks!
Andrei

generally what one wants is a selection of "best of breed" stocks
that are among the top ones in a variety of categories. (Relative
importance could be assigned to each category.)

This is quite different and easier than the problem initially
stated, because ranks must not be computed.
1) Compute the individual importance `ii' of each element `e', which
seems to be the sum over all categories `c' of all
`e.c.value * c.importance'
2) Use the linear median algorithm to find the element `ek' whose
individual importance `ek.ii' has rank `k' in the set of all elements
`e' and `k' is the number of elements `e' acceptable as "best of the
breed"
3) return all elements `e' whith `e.ii >= ek.ii'. This can be done
by a sweep over the array.
Total running time aproximately: #elements * #categories
Please note, that this relies on having choosen the vector of relativ
importances for the categories correctly. How does one know about this
correctness?
For example in marketing it is desired to occupy unique selling
points, because the gain of profit might be huge. But this means, that
also the less for the buyer is huge.
-manfred

It's very simple and clear in formulation actually.
Given a range r of elements, define rank(a, p) as the position that
object a would have if range r would be sorted by predicate p. So the
rank is a number between 0 and n-1 (n being the number of elements in
the range.) For example, if we sort [ 40, 10, 30, 20 ] using "<" as a
predicate, the rank of 20 is 1. (For simplicity we can consider for now
that all elements are distinct.)
Now say we have multiple predicates, p1, p2, ..., pk. We want to sort
elements in increasing order of rank(a, p1) + ... + rank(a, pk).