Description

Wikidata has a lot of items with the same label. We should explore ways to rank them according to their relevance. For example, Berlin (the capital of Germany) should be ranked higher than Berlin (music album) in suggestions.

The purpose is to let users find the items they are looking for more easily. Currently, we take into account the number of sitelinks and labels. We should check whether this is still sufficient.
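To make the current approach concrete, here is a minimal sketch of ranking same-label candidates by sitelink and label counts. The ticket does not say how the two counts are combined, so the plain sum below is an assumption, and the item IDs and counts are made-up placeholders rather than real Wikidata values.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    item_id: str       # placeholder ID, not a real Wikidata Q-id
    description: str
    sitelinks: int     # number of sitelinks on the item
    labels: int        # number of languages that have a label


def relevance(candidate: Candidate) -> int:
    # Assumption: both counts are simply added; the real weighting is not stated.
    return candidate.sitelinks + candidate.labels


def rank(candidates: list[Candidate]) -> list[Candidate]:
    # Highest relevance first.
    return sorted(candidates, key=relevance, reverse=True)


# Illustrative numbers only.
candidates = [
    Candidate("Q-album", "music album", sitelinks=3, labels=5),
    Candidate("Q-city", "capital of Germany", sitelinks=250, labels=300),
]
ranked = rank(candidates)
print([c.description for c in ranked])
```

With counts like these the city comes out ahead of the album, which is the ordering the description asks for. The open question is whether that still holds for less clear-cut cases.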

I find this task's description oddly confusing. At the moment this ticket is non-actionable. It does not even say whether there is an actual issue to solve. Is the current method not sufficient? Do you have more specific examples? How do you suggest improving the situation? What would be the goal of such an improvement? How do we measure whether a change is a success?

Note that there are already multiple tickets about switching to CirrusSearch. The current ranking will be obsolete then.

We were recently discussing a Wikipedia PageRank solution (or a combination of that ranking with other features). I could contribute these scores and would also be prepared to implement some integration (with some help).

@Smalyshev, I think we should first check whether this type of output is of any use to you. You can get most of the information (e.g. input/output format) at http://people.aifb.kit.edu/ath/#Wikidata_PageRank. It does not run on Hadoop and needs fairly few resources (it can actually be optimized to run on a laptop with 16 GB of RAM). Currently, there are no optimizations in place and we use about 200 GB of RAM (processing power doesn't matter). If good use cases exist and it has been verified that the current output is useful, I would consider the following next steps:

transform the actual link datasets of Wikipedia to a processable format (similar to the output of DBpedia pagelinks)

develop a processing pipeline as a Dockerfile and make all source code available under a free license
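For reference, the kind of computation such a pipeline would run on a pagelinks-style edge list can be sketched as follows. This is a textbook power-iteration PageRank, not the code behind the scores linked above. The damping factor, the iteration count and the page names are assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source, target) link pairs."""
    nodes = sorted({page for edge in edges for page in edge})
    if not nodes:
        return {}

    out_links = {page: [] for page in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {page: 1.0 / n for page in nodes}

    for _ in range(iterations):
        # Pages without outgoing links spread their score evenly over all pages.
        dangling = sum(rank[page] for page in nodes if not out_links[page])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {page: base for page in nodes}

        for source in nodes:
            targets = out_links[source]
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share

        rank = new_rank

    return rank


# Hypothetical three-page link graph: A and C link to B, B links back to A.
edges = [("A", "B"), ("C", "B"), ("B", "A")]
scores = pagerank(edges)
for page, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(page, round(score, 4))
```

The dictionaries here hold everything in memory, which is fine for a toy graph but not for a full Wikipedia link dataset. That is where the transformation step and the memory optimizations mentioned above would matter.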

I have developed a full Bash+Python3 framework that makes it possible to compute PageRank on any Wikipedia language edition (even on low-cost hardware). By default, the input is the latest version of the Wikipedia dump and the output consists of each page's Q-id and a corresponding ranking score. The software is licensed under GPL v3 and can be accessed at the following URL: