Full text search using Linqdb

Building a full-text search engine without any search-specific libraries is tough. I built stackse as a hobby project: the site searches Stack Overflow and allows special characters in queries. The first version had a crawler (which is tough by itself) and a complicated search library. The entire project was rather tangled and hard to maintain. The lessons learned were:

Complexity is the enemy: make things simple first and complicate them only if necessary. A complex project attempted without prior experience won't be done right the first time anyway.

Complex projects must be made of small, autonomous and testable components.

Crawling is difficult to implement and slow, parsing content is slow, and writing to disk is very slow.

Real-time search over an entire large data set is slow too: most of the time is spent reading from disk and deserializing data. With limited hardware, partial search is therefore the solution.

Surprisingly, the cheap title search is most often the most useful one.

In an unrelated project, dissatisfied with SQL Server on the one hand and amazed by LevelDB on the other, I attempted to build a small relational database on top of LevelDB. The goal was something efficient yet relational, two aims that are somewhat mutually exclusive. The result is Linqdb, which sits somewhere between the two worlds.

At some point I realized that Linqdb can be used to build a full-text search engine, and that doing so is quite simple.

First of all, I downloaded the data available here. Linqdb supports simple full-text search on string properties. The search is very simple: all words in a query must be present in the text. Linqdb also supports partial search, i.e. we can search only the first 1000 documents, or only the second 1000 documents, and so on. Besides the main post search, we are also going to do a title search and then combine the results.
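To make the two ideas above concrete, here is a minimal sketch of "all query words must be present" matching combined with searching only one slice of the documents at a time. It is a plain linear scan in Python meant to illustrate the semantics only. It is not Linqdb's API and says nothing about how Linqdb indexes text; the names `matches` and `partial_search` and the whitespace tokenization are assumptions made for the example.

```python
def matches(query: str, text: str) -> bool:
    """True when every word of the query occurs as a word in the text."""
    words = set(text.lower().split())  # naive whitespace tokenization (assumption)
    return all(w in words for w in query.lower().split())


def partial_search(docs: list[str], query: str, start: int, count: int) -> list[int]:
    """Search only docs[start:start + count] and return the matching indices."""
    window = range(start, min(start + count, len(docs)))
    return [i for i in window if matches(query, docs[i])]


docs = [
    "how to parse json in c#",
    "parse xml with linq",
    "json serialization in c# is slow",
]

# Search the first slice of two documents, then the next slice.
first = partial_search(docs, "json c#", start=0, count=2)
second = partial_search(docs, "json c#", start=2, count=2)
```

Searching slice by slice keeps the cost of each request bounded, which is what makes it attractive on limited hardware.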

We could construct a single text property holding the entire post, but such a search would return results in which the query words are far away from each other. Therefore smaller text fragments need to be constructed. The final data model looks like this:

This is what we eventually need to get. A PostFragment is a small text fragment that we actually search, so that the matched words are close to each other. We make PostFragments by scanning the post text in a sliding-window fashion: for a post of 300 words we get, say, 3 PostFragments, the first with words 0-150, the second with words 75-225 and the third with words 150-300. An AnswerFragment is what we return as a result of a title search. The entire import code is here. Importing 2.4 million posts results in 3 databases (totaling 70 GB) and takes around 10 hours. Only FINAL_DATA (22 GB) is used for search. We have 5 machines, so in total 12 million posts are indexed. So much for preparation.
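The sliding-window fragmentation described above can be sketched as follows. This is not the actual import code. The window size of 150 words and the step of 75 words are inferred from the 300-word example, while the function name `fragments` and the handling of short posts and of the final remainder are assumptions.

```python
def fragments(words: list[str], size: int = 150, step: int = 75) -> list[tuple[int, int]]:
    """Return (start, end) word offsets of overlapping windows covering a post."""
    if not words:
        return []
    if len(words) <= size:
        return [(0, len(words))]  # a short post becomes a single fragment
    spans = []
    start = 0
    while start + size < len(words):
        spans.append((start, start + size))
        start += step
    spans.append((start, len(words)))  # the last window absorbs the remainder
    return spans


post = ["word"] * 300
spans = fragments(post)
# Each fragment's text is the slice of words it covers.
texts = [" ".join(post[a:b]) for a, b in spans]
```

Because neighbouring windows overlap by half, two query words that sit close together in the post always land in at least one common fragment, even when they straddle a window boundary.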

And that's it! There are 5 machines, each searching 2.4 million posts in this way. The master site gets the results, aggregates them and presents them to the user. A lot could be done to improve search quality, but the basics are laid out in a simple, efficient manner.
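As a rough illustration of the aggregation step on the master, the sketch below merges per-machine result lists and drops duplicates. Everything in it is hypothetical: the `Hit` type, the `"title"`/`"post"` labels, and the choice to list title hits before post hits are assumptions for the example, not a description of how the site actually ranks results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    post_id: int
    source: str  # "title" or "post" (hypothetical labels)


def aggregate(per_machine: list[list[Hit]], limit: int = 10) -> list[Hit]:
    """Merge hits from all machines, title hits first, one entry per post."""
    seen: set[int] = set()
    merged: list[Hit] = []
    for wanted in ("title", "post"):  # assumed ordering: title hits rank first
        for hits in per_machine:
            for hit in hits:
                if hit.source == wanted and hit.post_id not in seen:
                    seen.add(hit.post_id)
                    merged.append(hit)
    return merged[:limit]


machine_a = [Hit(1, "post"), Hit(2, "title"), Hit(2, "post")]
machine_b = [Hit(3, "title")]
results = aggregate([machine_a, machine_b])
ids = [h.post_id for h in results]
```

Post 2 matches both by title and by body, so it appears once, as a title hit.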

See it in action here. The code is available here. Linqdb project: here.