Is there an equivalent to T-SQL's "top n" in Gigaspaces SQLQuery?

Our product deals with large numbers of numeric-valued, dated observations. They are conceptually ordered by date, but they can come into the system out-of-order. Each observation has a "trend" value, which is a measure of how different its numerical value is from the numerical value of the previous observation. Obviously, if a new observation comes in, we need to calculate a trend for this new observation AND for the one that immediately follows it in time.

Consider a new observation Ox. We need an efficient way of calculating these trends. One way to do this would be to use two "top n" queries. One would get the observation Ow just before Ox (so I can calculate the trend for Ox itself). The other would get the observation Oy just after Ox (so I can REcalculate its trend).

Of course, either or both of Ow and Oy may not exist.

We could do this by reading ALL observations, sorting them, and finding Ow and Oy by walking the list -- but there could be hundreds of thousands of observations, I'd much prefer that Gigaspaces found Ow and Oy for me.

5 Answers

We am not sorting data across partitions. The data being sorted is all in a single local partition. I don't think we need a "reduce' step. Is there still an advantage to writing our own Aggregator? Can you explain what that advantage is when dealing with a single local partition?

Also: we are still using Gigaspaces 9.1. Is Aggregator functionality even available in 9.1? It appears that gigaspace.execute() exists in the 9.1 API. Is that really all we need?

At first blush I am not seeing how counters would help. Are you thinking that they would give me a way to sort my observations? An incoming observations already has an observationDate member. This member already imposes an ordering on the observations.

As I see it, the real problem is that an incoming observation may have a date that's earlier than a lot of existing observations. I need to find the previous and next observations for this incoming observation in order to update trending. Furthermore, I'd like to do this without increasing my process' memory footprint.

@Venkat: I don't think a polling container buys us anything. We see each observation as it comes in, so we already have a point in the code where we could impose our own logic on some in-memory structure we'd build. This, however would incur a large client-side memory hit. We'd prefer to leave the Observations in the space, and to pull into client space just the ones needed to do the trending calculations. The actual total number of Observations in the space can be in the millions, even though they can be categorized in such a way that any given trending operation pertains only to a subset of them.

@shay: The "Aggregator" approach (your first link) would have worked if there were "CeilingTask" and "FloorTask" aggregators to get (respectively) the earliest Observation whose date is greater than a given date and the latest Observation whose date is less than a given date. But these do not exist.

@shay: I tried repeatedly to follow your second link. I kept getting "page not found".

OK. But if the JDBC driver will not provide you what you need or will not be fast enough leveraging the above patterns will be the way to go.

If you must sort data across multiple partitions you will need to use a Distributed Task that will get the "top" objects from each partition and "reduce" at the client each intermediate result into one consolidated result. This is what the JDBC driver essentially doing when called from a remote client. Still , a custom code using projections will make it go way faster as we demonstrate with the Aggregators.