I am optimizing a part of a Delphi application where lists of objects are frequently filtered using different criteria. The objects are kept in TObjectList structures, and it is common to select a very small percentage (e.g. 1%) of the entire set with each filter. The total number of objects can be in the 100k range, and during computations the main set doesn't change. Although the filters are applied to only a few properties, the lists cannot be sorted in a way that would optimize all the possible criteria.

I am looking for suggestions on how to organize the objects (data structures) or an algorithm that can be used to approach this problem. Thank you!

Filter example:

((Object.A between 5 and 15) AND
(Object.B < 20) AND
(Object.C(AParam) > 0)) OR
(Object.IsRoot(...))

You can use a list sorted on the field that constrains your search the most (the most selective criterion), and then perform a binary search to get the range of indexes matching that criterion.

Referring to your example above: generate an A-sorted list, binary-search for the indexes of 5 and 15, and you get a (much) smaller list (all the items between the two indexes) in which you only have to check the other fields (B, C and IsRoot).
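A minimal sketch of this approach (in Python rather than Delphi, with made-up integer A/B fields standing in for the object's properties and a predicate standing in for the remaining criteria):

```python
import bisect

class Item:
    def __init__(self, a, b):
        self.a, self.b = a, b

# Build once, outside the hot path: objects sorted by the most
# selective field (A in the example), plus a parallel key list
# so bisect can search it.
items = sorted((Item(i % 50, i % 30) for i in range(1000)),
               key=lambda it: it.a)
keys = [it.a for it in items]

def filter_a_range(lo, hi, extra):
    """Binary-search the A range, then check the remaining criteria
    linearly on the (much) smaller slice."""
    start = bisect.bisect_left(keys, lo)
    stop = bisect.bisect_right(keys, hi)
    return [it for it in items[start:stop] if extra(it)]

matches = filter_a_range(5, 15, lambda it: it.b < 20)
```

The two bisect calls cost O(log n); only the narrowed slice pays for the full predicate evaluation.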

Thanks for your reply. Sorting and binary search are already used where possible. The problem is that multiple filter criteria are applied to the same list, and they would require different sort orders for the binary search to work. Re-sorting the list each time a filter is applied doesn't seem wise from a performance standpoint.
– Tihauan Feb 27 '10 at 16:01

So you already use binary search for filtering on the first field? When exactly do you sort the list: when you set it up, or every time you search/filter it? Can I ask how many entries (compared to the original 100k) you get after this step?
– ulrichb Feb 27 '10 at 16:20

+1 for this: sort the list by the most selective property and apply a binary search to get a smaller set to apply further filters to.
– arthurprs Feb 27 '10 at 16:28

@ulrichb In the current implementation the list is split into a few smaller lists using the "most common" filter characteristic, and the smaller lists are sorted by a "second most common" filter criterion. The smaller lists have up to 10k items. When filtering you stop after a certain (specified) number of items; sometimes you only need the first match. Another problem is that you may need the first match in a different sort order than the one the list is sorted on.
– Tihauan Feb 27 '10 at 18:08

I didn't really get it. Maybe you could add a more detailed description to your question above (with portions of your code or pseudo code).
– ulrichb Feb 27 '10 at 18:26

This solution probably uses a lot of RTTI: just create a list of filters (property names and values) and loop over the objects. It gives you maximum flexibility, but at the cost of some speed. If you need speed, I think the solution provided by ulrichb is better.
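A sketch of the idea (Python's `getattr` stands in here for reading published properties through Delphi RTTI; the `Node` class and its fields are made up for illustration):

```python
class Node:
    def __init__(self, a, b):
        self.a, self.b = a, b

def apply_filters(objects, filters):
    """filters is a list of (property_name, predicate) pairs; property
    names are resolved by name at runtime, loosely analogous to reading
    published properties via RTTI in Delphi. Every object is visited,
    so this is O(n) per query -- flexible, but not fast."""
    return [obj for obj in objects
            if all(pred(getattr(obj, name)) for name, pred in filters)]

nodes = [Node(i, i * 2) for i in range(100)]
selected = apply_filters(nodes, [("a", lambda v: 5 <= v <= 15),
                                 ("b", lambda v: v < 20)])
```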

Filter iterators are nice, but in this case I have the feeling they would introduce some overhead. I'm thinking about filter iterators with some kind of state that would make them reusable... that may help...
– Tihauan Feb 27 '10 at 15:56

If the quantity of objects is small, it probably doesn't matter too much how efficient the search is, though caching a result may help if done often.

If the quantity of objects is large, I'd consider using an in-memory database and SQL to do the query. The database can then use indexes to find things as fast as possible, and you pass the burden to a proven tool. Personally I use DBISAM or ElevateDB, but other tools provide in-memory databases too. By using a real database tool, you can easily move the data to disk if it gets really large.
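To illustrate the approach, here is a sketch using Python's built-in sqlite3 in-memory mode as a stand-in for DBISAM/ElevateDB (the table and column names are invented; the point is that the engine's indexes and query planner do the filtering for you):

```python
import sqlite3

# An in-memory database: the engine builds and maintains the indexes,
# and its optimizer picks how to use them for each query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obj (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
conn.execute("CREATE INDEX idx_obj_a ON obj (a)")  # index the selective field
conn.executemany("INSERT INTO obj (a, b) VALUES (?, ?)",
                 [(i % 50, i % 30) for i in range(1000)])

# The filter from the question, expressed declaratively as SQL.
rows = conn.execute(
    "SELECT a, b FROM obj WHERE a BETWEEN 5 AND 15 AND b < 20").fetchall()
```

The drawback, as noted in the comments below, is that the object data must first be copied into tables.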

If the number of attributes to filter on is small and they can be sorted, why not keep multiple lists, each of which is sorted by a different attribute? Each list costs you 4 bytes per object for the reference, plus a small overhead for the list itself. Of course it all depends on whether the memory requirements can be fulfilled; 2 GB is not that much if you are dealing with a lot of objects...
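A rough sketch of that data structure (in Python for illustration; the `MultiIndex` name, the `Obj` class and its fields are all made up): one sorted list of references per filterable attribute, each supporting a binary-searched range query.

```python
import bisect

class Obj:
    def __init__(self, a, b):
        self.a, self.b = a, b

class MultiIndex:
    """One sorted list of references per filterable attribute; each
    extra list only costs a reference per object plus the list's own
    overhead, since the objects themselves are not duplicated."""
    def __init__(self, objects, attrs):
        self._by_attr = {}
        for attr in attrs:
            ordered = sorted(objects, key=lambda o: getattr(o, attr))
            keys = [getattr(o, attr) for o in ordered]
            self._by_attr[attr] = (keys, ordered)

    def between(self, attr, lo, hi):
        """All objects with lo <= attr <= hi, via binary search."""
        keys, ordered = self._by_attr[attr]
        return ordered[bisect.bisect_left(keys, lo):
                       bisect.bisect_right(keys, hi)]

objs = [Obj(i % 50, i % 30) for i in range(1000)]
index = MultiIndex(objs, ("a", "b"))
narrow = index.between("a", 5, 15)  # candidates by A; check B etc. on these
```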

This is a very, very difficult problem to solve in a generic way. There's one kind of software that does this all day long, and it's called an SQL query optimizer: that bit of code, present in every modern SQL engine, looks at what you want (the query), looks at the indexes available for your data and their selectivity, and figures out the optimal way to use all that to produce your result set. Just to prove how difficult the problem is: the SQL query optimizer sometimes fails and produces visibly inefficient plans.

I'm pretty sure you don't want to implement a full-fledged query optimizer, so here are some tips on how to make your queries fast enough:

(1) Select some fields that are frequently used in your queries and set up indexes on them. Those fields need to provide good selectivity: don't index on a Boolean value; you'd just lose time traversing complex binary search structures when scanning the whole list might be just as fast (or faster)!

(2) For every given query, select ONE single index to pre-filter the data, and then apply all the other filters one by one (without optimization).

In your example: create indexes on the "A" and "B" fields. "C" seems to be a function, so that's impossible to index. "IsRoot" seems to return a Boolean, which is not worth indexing.

The data structures to use depend entirely on your data. If performance is critical, implement several and test them. If it's not critical, just use your favorite sorted list and be done with it!
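Steps (1) and (2) can be sketched as follows (Python for illustration; the field names and the count-based selectivity estimate are my own assumptions, not part of the answer):

```python
import bisect

class Obj:
    def __init__(self, a, b):
        self.a, self.b = a, b

objects = [Obj(i % 50, i % 30) for i in range(1000)]

# Step (1): indexes on the selective fields "a" and "b" -- one sorted
# (keys, objects) pair per field, built once up front.
indexes = {}
for attr in ("a", "b"):
    ordered = sorted(objects, key=lambda o: getattr(o, attr))
    indexes[attr] = ([getattr(o, attr) for o in ordered], ordered)

def range_query(attr, lo, hi):
    keys, ordered = indexes[attr]
    return ordered[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)]

def query(range_filters, extra_predicates=()):
    """Step (2): pick the ONE index whose range yields the fewest
    candidates (a crude selectivity estimate), then apply every other
    filter one by one, without further optimization."""
    best = min(range_filters,
               key=lambda a: len(range_query(a, *range_filters[a])))
    candidates = range_query(best, *range_filters[best])
    rest = {a: r for a, r in range_filters.items() if a != best}
    return [o for o in candidates
            if all(lo <= getattr(o, a) <= hi for a, (lo, hi) in rest.items())
            and all(p(o) for p in extra_predicates)]

# "A between 5 and 15 AND B < 20" from the question's example.
result = query({"a": (5, 15), "b": (0, 19)})
```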

The problem with this approach is that he apparently already has actual objects he wants to filter, so in order to use SQL he would first need to create and populate in-memory tables, which would take some time and memory.
– Daniel Maurić Feb 28 '10 at 13:03

I'm not asking anyone to actually use an SQL engine, so no tables need to be populated. I'm simply pointing to a real-world application that has to solve the same problem: one can look at how SQL engines solve it and take hints for a better-suited implementation. If the OP gave some real examples of what the searches look like, we might suggest a better solution. But if the problem is defined as "searches are so complex one can't index", then there's no easy solution.
– Cosmin Prund Feb 28 '10 at 14:06

Yes, I read your answer a second time, actually paying attention... and I agree: what he needs is a custom query optimizer. As a matter of fact, I need one as well.
– Daniel Maurić Mar 1 '10 at 18:42