Imagine you had a database of films. A lot of the content would be very closely related material - HD and SD versions of the same film, and sequels in the same franchise.

I work on a recommendations system. If we wanted to use that system to return "similar" films to a seed film it'd make sense to remove items in the results that are either closely related to the seed film or to each other.

Finding "closely related" doesn't have to be exact. I've been grabbing the first ten characters of the title and assuming anything else with the same first ten characters is "related" and excluding it. Doing that to make sure nothing "closely related" to the seed appears in the results is easy.

However, pulling the same trick with the other results is much more difficult. Let's say that the seed film is Terminator. We can remove Terminator HD and the Terminator sequels easily. However, a likely result might be The Expendables. Once we've got one of those items in the result set I need to try and make sure that no more "closely related" items to the Expendables appear in that result set.

I need to do this efficiently - response times need to be sub-second and the data tables involved are very large. No way I can think of to do this is remotely efficient. Is there one?

I would consider keeping a physical representation of "closely related" relationships in the form of an intersection table with OriginalFilmID + RelatedFilmID. You can then break out the algorithm to find these related films so it isn't a factor in your response times. Since you only need to update this table when the Film list changes, this gives the added benefit in that you can use multiple methods and the methods can be more exact than first 10 characters, which I would guess is very rudamentary.

Once you have this intersection table, you can use it to feed the algorithm that returns results. I'd consider doing the heavy lifting in the middle tier because you could scale it horizontally and relieve some pressure on your db server. I'll have to think about how I'd structure the query in the DB layer to be most efficient, but it should be doable. I'll get back to you on that.

The hardest part of this problem was setting up an example. I had to simplify the model and the "Recommendation" algorithm and take some liberties to ensure relevant results, but I think I have something that illustrates the technique.

Overall, the goal is to rank the recommendation list. Once you have that, you can associate it with the closely related films and then remove items from the list that relate to a film with a higher rank. This operation should be fast if you have the right indexes in place, but it also depends on how complex your recommendation algorithm is.

I agree with lazerath's premise, you need to make the relationship explicit, rather than matching titles. Another thing is to separate the presentation (HD, SD, streaming, etc.) from the title, that has nothing to do with your recommendation engine.

You also have several ways to consider Terminator and The Expendables related: they're both action films (genre), they both have Arnold in them (actor). But they don't have the same subject (sci-fi/time travel/robots vs. present-day mercenaries) or other factors (SFX, box office, release year, director) that could also guide a recommendation. I know Netflix's recommendation engine is heavily biased towards user's viewing statistics...what they watched, how they rated it, etc.

The hardest part of this problem was setting up an example. I had to simplify the model and the "Recommendation" algorithm and take some liberties to ensure relevant results, but I think I have something that illustrates the technique.

Thanks for this, it's very useful. I just need to work on populated the relatedness table now.