18 February 2012

At any rate, did you know that "Chicago Stags coaches" are a subcategory of "Natural sciences"? If you don't believe me, go to the Wikipedia entry for the Natural sciences category, and expand the following list of subcategories:

Biology

Zoology

Subfields of zoology

Ethology

Behavior

Human behavior

Recreation

Games

Ball games

Basketball

Basketball teams

Defunct basketball teams

Defunct National Basketball Association teams

Chicago Stags

Chicago Stags coaches

I guess it kind of makes sense. There are some other fun ones, like "Rhaeto-Romance languages", "American World War I flying aces" and "1911 films".
Of course, these are all quite deep in the "hierarchy" (all of those are at depth 15 or higher).

So if you're trying to actually find pages about Natural sciences, maybe it's enough to limit the depth of your breadth first search down the graph.

This is sort of reasonable, and things up to and including depth four are quite reasonable, including topics like "Neurochemistry", "Planktology" and "Chemical elements". There are a few outliers, like "Earth observation satellites of Israel" which you could certainly make a case might not be natural science.

At depth five, things become much more mixed. On the one hand, you get categories you might like to include, like "Statins", "Hematology", "Lagoons" and "Satellites" (interesting that Satellites is actually deeper than the Isreal thing). But you also get a roughly equal amount of weird things, like "Animals in popular culture" and "Human body positions". It's still not 50/50, but it's getting murky.

At depth six, based on my quick perusal, it's about 50/50.

And although I haven't tried it, I suspect that if you use a starting point other than Natural sciences, the depth at which things get weird is going to be very different.

So I guess the question is how do deal with this.

One thought is to "hope" that editors of Wikipedia pages will list the categories of pages roughly in order of importance, so that you can assume that the first category listed for a page is "the" category for that page. This would render the structure to be a tree. For the above example, this would cut the list at "Subfields of zoology" because the first listed category for the Ethology category is "Behavioral sciences", not "Subfields of zoology."

Doing this seems to make life somewhat better; you cut out the stags coaches, but you still get the "Chicago Stags draft picks" (at depth 17). The path, if you care, is (Natural sciences -> Physical sciences -> Physics -> Fundamental physics concepts -> Matter -> Structure -> Difference -> Competition -> Competitions -> Sports competitions -> Sports leagues -> Sports leagues by country -> Sports leagues in the United States -> Basketball leagues in the United States -> National Basketball Association -> National Basketball Association draft picks). Still doesn't feel like Natural sciences to me. In fairness, at depth 6, life is much better. You still get "Heating, ventilating, and air conditioning" but many of the weird entries have gone away.

Another idea is the following. Despite not being a tree or DAG, there is a root to the Wikipedia hierarchy (called Category:Contents). For each page/category you can compute it's minimum depth from that Contents page. Now, when you consider subpages of Natural sciences, you can limit yourself to pages whose shortest path goes through Natural sciences. Basically trying to encode the idea that if the shallowest way to reach Biology is through Natural sciences, it's probably a natural science.

This also fails. For instance, the depth of "Natural sciences" (=5) is the same as the depth of "Natural sciences good articles", so if you start from Natural sciences, you'll actually exclude all the good articles! Moreover, even if you insist that a shortest path go through
Natural sciences, you'll notice that many editors have depth 5, so any
page they've edited will be allowed. Maybe this is a fluke, but "Biology lists" has depth of only 4, which means that anything that can be reached through "Biology lists" would be excluded, something we certainly wouldn't want to do. There's also the issue that the hierarchy might be much bushier for some high-level topics than others, which makes comparing depths very difficult.

So, that leaves me not really knowing what to do. Yes, I could compute unigram distributions over the pages in topics and cut when those distributions get too dissimilar, but (a) that's annoying and very computationally expensive, (b) requires you to look at the text of the pages which seems silly, (c) you now just have more hyperparameters to tune. You could annotate it by hand ("is this a natural science") but that doesn't scale. You could compute the graph Laplacian and look at flow and use "average path length" rather than shortest paths, but this is a pretty big graph that we're talking about.

Has anyone else tried and succeed at using the Wikipedia category structure?

12 comments:

out of curiosity, have you looked at other dbs that map (at least partially) to wikipedia and have their own taxonomies? http://freebase.com/ , for example, which i've been working with a bit recently.

Your analysis is right: the category structure ist just not very reliable. So a way to deal with it is regarding categories more as tags than as really structured information. I used this idea for NE Classification:

http://www.aclweb.org/anthology/W11-3607 (PDF)

I think the global structure is way too messy for meaningful analysis. Maybe for many problems it is sufficient to look at local structure in the category network.

I am not very sure what your use-case is. But I've used the Wikipedia category graph to measure relatedness between two articles. I guess the distance between the category "Biology" and "Chicago Stags coaches" is a good estimate of how (un)related they are.

But before creating the graph that, I excluded the root node (Category:Contents) from the graph so as to get rid of "too much of generalization".

I have refined the Wikipedia category graph for use in the INEX 2010 XML Mining track. This was done by finding the shortest paths between a page and any of the 'Main Topic Classifications'. It results in a multi-label category structure. I only used the last 2 vertices of the shortest path sequences and threw away small categories.

You can find the details in the paper at http://eprints.qut.edu.au/41223/.

My method is stupid but effective: manually pruning. First I generate the hierarchical taxonomy, and then skim it by dragging the scrollbar. If I find something of no concern, I'll locate where the hierarchy goes astray and then add the topmost undesired category to a "blacklist". Next time when the taxonomy is generated, the children of this category won't be added to the hierarchy. Iterate this process and the taxonomy becomes purer and purer.

Depending on your use-case, you might find YAGO (http://www.mpi-inf.mpg.de/yago-naga/yago/) useful. We had started with Wikipedia, but switched to YAGO for our paper (http://www2011india.com/proceeding/companion/p21.pdf) on answering transitive type-entity queries.The YAGO category hierarchy is much more cleaner and manageable.

I did a bachelor's thesis on using Wikipedia categories for NE recognition, based on this paper: http://www.mt-archive.info/ACL-2008-Richman.pdf . But that uses categories from the bottom-up, so to speak, where as you are talking about top-down.

I wasn't able to reproduce the same level of results as in that paper but my software was surely much more crude and my level of knowledge much less than the authors'.

I used the Wikipedia categories to define a vector space (mostly for disambiguation purposes), which gave OK results. The code I used for doing that is available on our Github, in case that's useful. Using them literally though, given the weird things you point out, will probably give very odd results :)