Welcome to the Lounge

The Lounge is rated PG. If you're about to post something you wouldn't want your
kid sister to read then don't post it. No flame wars, no abusive conduct, no programming
questions and please don't post ads.

After a day of tinkering, I am giving up on a project that involves a real-time auto-complete textbox with filtering. It runs too slow and I could create a 10+GB table in SQL and that might speed things up, but that's just getting silly. Instead, I think I'm just going to remove the filtering and call it a day. That feature is, after all, not strictly necessary (I can filter after the user clicks "search" rather than do it through auto-complete).

It seems to me that if you want real-time auto-complete, you'd have to load all the data into memory. However, if you want auto-complete in "near-time", then a worker thread might do the trick, letting the user continue typing and you can filter the return further when it comes in. I would never rely on SQL to give you real-time performance anyways, it seems unreasonable to blame a technology where other factors, such as network performance, are influencing the overall system performance.

OK, at risk of turning this into something I'll have to move to the database forum, I need to know what the problem is. It could be you're missing something obvious. Or possibly something horrendously complex involving clustering and linear algebra.

Obvious...complex linear algebra...I always get those two mixed up. It'll be the death of me.

I don't want to give you anything solid (wouldn't want to risk you solving my problem, I might get reported!). However, I'll give you the gist (also, I'll be speaking in generalities to make it more confusing).

The user can type in some characters, and they get sent to a stored procedure to see if those characters occur in any of the data (the data will then be returned to the page so the textbox can show an auto-complete list of 10 or so items). The code is smart enough that it can massage the input in various ways to match it up with the data (e.g., remove special characters, ignore casing, ignore whitespace, manipulate "ab" so that it looks for "cd" instead).

The data resides in various tables. About 4. One of the tables actually just provides aliases for the values in another table (e.g., "ducks" are also known as "mallards"). And the data returned for auto-completion is only from 2 of the four tables (the alias table and another table are linked back to the other 2 tables to find the return data). There is further data in other tables that indicate attributes of the main 4 tables, but they are only connected via the 2 auto-complete tables (so some tables require multiple joins to get to the attributes). There are roughly 200 attributes per row in the main 4 tables. The user can select among those attributes to filter by (e.g., they can select 40 of the 200 attributes so only results with those 40 attributes will be returned).

Also, I kind of lied when I said the data resided in 4 tables. There are tables which indicate the history of the changes to the data in the other tables. So those history tables are also included. And that history is actually of the changes to the attributes, so it magnifies the size of the data considerably. This on top of the main 4 tables, each of which contain tens of thousands to hundreds of thousands of records.

To give a concrete, but fictitious example, imagine a store that sells belts. They have thousands of types of belts, and each type of belt has other factors that can be varied (size, buckle type, etc.), which create a few hundred thousand unique belt combinations. Now, there is one attribute in particular that customers like to choose from... say, the color. So, there are 200 colors to choose from. The belts are stored in database tables according to various aspects. For one, belt name. Also, they are stored by an internal belt identifier as well as a SKU shown on the belt. They are also stored by the ID of the documents used to describe the various uses for the belt. Whenever one belt color goes out of style, a record is tossed in another table. Whenever a new belt color goes into style, that is tossed in that same table. Sometimes colors go out of style, sometimes colors come into style, and sometimes color styles are replaced with other color styles. Sometimes an out of style color comes back into style.

Now, I'd like to provide an auto-complete box for belts. The user can type in the belt name, the internal belt ID, the public belt SKU, or the belt document ID. The auto-complete will only return belt name and public belt SKU, but the user can type in any of the 4 primary aspects (though, if they type in internal belt ID or belt document ID, the corresponding public belt SKU will be shown). Colors in the database happen to be related to public belt SKU, and the user may enter any of the colors they are interested in (say, all shades of red, 3 shades of purple, and a few greens). The auto-complete should only show belts that are available or were ever available in the belt colors specified by the user. Results must be returned in less than a second, and you have a day to complete this. Oh, and you are on SQL Server 2005, so no table-valued parameters.

Good luck.

P.S. If my manager is really interested in this feature, I may give it another go. We'll see.

The closest I know of to that is a search we have here that we (yes, I'll take some of the blame) implemented by making a table to hold the ID and a big long string of essentially every field (including flattening out several many-to-many relationships) in the record; the search then does a LIKE -- and viola. But our situation sounds easier than yours.

I did similar, though a LIKE '%Whatevs%' was too slow, so we split the strings and tossed them into a table like this (this is a simplified version):

Section

Complete

Type

y

Ugly

Name

ly

Ugly

Name

gly

Ugly

Name

Ugly

Ugly

Name

3

123

ID

23

123

ID

123

123

ID

C

ABC

SKU

BC

ABC

SKU

ABC

ABC

SKU

c

UgDoc

Document

oc

UgDoc

Document

Doc

UgDoc

Document

gDoc

UgDoc

Document

UgDoc

UgDoc

Document

Then we could do SELECT DISTINCT Complete, Type FROM Sections WHERE Section LIKE @Input + '%'. That can use the index rather than a full table scan (and still allows for the middle of a string to be searched). Then we use the compete string to do an index seek in the other tables. But that was before the addition of "Color" as a filter. That's where things get complicated.

Doing what you are doing was certainly one of my suggestions. The other was looking for patterns in searches and precalculating and caching common searches. You'd essentially lookup the search string in a "precache" table first, then move on to a full search if no luck. That doesn't solve the fundamental problem though.

For raw speed is it practical to simply load everything in a memory cache? It may not be faster than a database, but it may - you could be surprised. RAM is stupid cheap, so should never be an issue (compared to the cost of your time).

Another option is massively denormalising the data, splitting it and storing it among multiple servers, with a single central server (give or take redundancy) farming out requests to the various servers.

A third option is a little old school, but still seems popular: wait half a second so it looks like you're doing something, then return a random set of records. I've seen this very algorithm implemented just a couple of weeks ago.

For raw speed is it practical to simply load everything in a memory cache?

That is an option I may explore. Certainly using some dictionaries rather than relying on SQL Server's funky query tuning would be more deterministic.

Chris Maunder wrote:

Another option is massively denormalising the data

That's the 10+GB option I mentioned not wanting to do in my OP. It'd probably work, but then I'd have a beast of a table to deal with. If we move to Azure, that would have a significant monthly cost associated with it.

Chris Maunder wrote:

wait half a second so it looks like you're doing something, then return a random set of records

I suspect they'll notice when they search for "Ugly Bel" and get "Pretty Belt" instead.

Panic, Chaos, Destruction. My work here is done.Drink. Get drunk. Fall over - P O'HOK, I will win to day or my name isn't Ethel Crudacre! - DDEthel CrudacreI cannot live by bread alone. Bacon and ketchup are needed as well. - TrollslayerHave a bit more patience with newbies. Of course some of them act dumb - they're often *students*, for heaven's sake - Terry Pratchett

I once built an XML autocomplete using CONTAINS which was way to slow when it included all the additional data. The data was loaded into an IDictionary where the address of the item to look up was returned by the autocompleted word. Then I broke the files down by alphabet and put in the keydown event to get the file matching the first character and load that file newly into the IDictionary after dumping what was there. The result was instantaneous results including the STARTS WITH and CONTAINS matches. Possibly something like this would help you out. I hate to see SQL win

Yep - but less that they calm down, more that the short term gain investors grab their 12.5% so the drop can be greater than it 'should' be - which means more short term investors jump in.,... Bouncy bouncy!