
I want to build a large, searchable database of documents (news articles), so that when I add a new article I can quickly find the X most similar articles to it.
What is the right technology, algorithm, or Python framework for this?


I have found that Elasticsearch can achieve what I need, but it seems like overkill. I thought about a nice SQL structure with tags in Python.
– rubmz, Nov 7 '17 at 8:55


Elasticsearch does wonders in this case, and it is among the friendliest databases to work with. I can't recommend it enough, especially considering how much time it can save you. It even has a query type named more_like_this that returns the documents most similar to the one passed to it.
– Carlo Mazzaferro, Nov 7 '17 at 19:53
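As a rough illustration of the more_like_this query mentioned in the comment above, here is a minimal sketch that only builds the JSON body of such a search request. The field names `title` and `body`, the helper name `build_more_like_this_body`, and the tuning values are assumptions made for the example, not something given in this thread; the `fields`, `like`, `min_term_freq`, and `max_query_terms` keys are parameters of the more_like_this query.

```python
import json


def build_more_like_this_body(article_text, fields=("title", "body"), size=5):
    """Build the body of a search request using the more_like_this query.

    `fields` names the text fields to compare against; "title" and "body"
    are hypothetical and must match the mapping of your own index.
    `size` is the X in "the X most similar articles".
    """
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": article_text,      # free text of the new article
                "min_term_freq": 1,        # assumed tuning value
                "max_query_terms": 25,     # assumed tuning value
            }
        },
    }


if __name__ == "__main__":
    body = build_more_like_this_body("Central bank raises interest rates", size=3)
    # The resulting JSON would be sent to the index's _search endpoint.
    print(json.dumps(body, indent=2))
```

The body is kept separate from any client call so it can be inspected or sent with whichever HTTP or Elasticsearch client you already use.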

1 Answer

Elasticsearch is the right tool if you don't want to code this yourself. You need an indexing algorithm that can efficiently retrieve pieces of text from a large database, and SQL isn't particularly good at that. Moreover, Elasticsearch is quite user-friendly, so installing and using it won't be overkill. You might discover along the way that finding the most similar articles isn't that easy, and that Elasticsearch is of great help.
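To give a feel for what "coding this yourself" involves, here is a toy sketch of an inverted index with TF-IDF weighting and cosine similarity, using only the Python standard library. It is not how Elasticsearch works internally and is not meant for real workloads; the class name `TinySimilarityIndex`, the tokenizer, and the smoothed IDF formula are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinySimilarityIndex:
    """Hypothetical in-memory index: term -> ids of documents containing it."""

    def __init__(self):
        self.doc_terms = {}               # doc_id -> Counter of term counts
        self.postings = defaultdict(set)  # term -> set of doc_ids

    def add(self, doc_id, text):
        counts = Counter(tokenize(text))
        self.doc_terms[doc_id] = counts
        for term in counts:
            self.postings[term].add(doc_id)

    def _weights(self, counts):
        """TF-IDF weights with a smoothed IDF (an assumed, common variant)."""
        n_docs = len(self.doc_terms)
        weights = {}
        for term, tf in counts.items():
            df = len(self.postings.get(term, ()))
            weights[term] = tf * (math.log((1 + n_docs) / (1 + df)) + 1.0)
        return weights

    def most_similar(self, text, top_n=3):
        """Return up to top_n (doc_id, cosine similarity) pairs, best first."""
        query = self._weights(Counter(tokenize(text)))
        # The inverted index limits scoring to documents sharing a term.
        candidates = set()
        for term in query:
            candidates |= self.postings.get(term, set())
        q_norm = math.sqrt(sum(w * w for w in query.values()))
        scored = []
        for doc_id in candidates:
            doc = self._weights(self.doc_terms[doc_id])
            dot = sum(w * doc.get(t, 0.0) for t, w in query.items())
            d_norm = math.sqrt(sum(w * w for w in doc.values()))
            if q_norm and d_norm:
                scored.append((dot / (q_norm * d_norm), doc_id))
        scored.sort(key=lambda pair: (-pair[0], str(pair[1])))
        return [(doc_id, score) for score, doc_id in scored[:top_n]]


if __name__ == "__main__":
    index = TinySimilarityIndex()
    index.add("a", "The central bank raised interest rates again")
    index.add("b", "Local team wins the football championship")
    index.add("c", "Interest rates rise as the central bank fights inflation")
    print(index.most_similar("central bank interest rates decision", top_n=2))
```

Even this toy version leaves out stemming, stop words, persistence, updates, and scaling, which is the part of the work a search engine takes off your hands.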

I agree. It did look like a similarity problem at first, but it's an indexing nightmare when you get down to it. And Elasticsearch seems to be the right solution in a number of respects.
– rubmz, Nov 8 '17 at 9:59

Yes, computing the similarity is the simplest part of information retrieval in general; all the processing that comes before it is the real burden.
– debzsud, Nov 10 '17 at 15:42