I've followed the paths by hand before, but this is the first time that I've actually tried to do it in a programmatic manner. It's harder than you might think, but really only because of the crazy number of edge cases involved in correctly parsing either the MediaWiki format or the resulting HTML. I eventually got it working though, and here's the result.

First, we want to be able to fetch the pages from Wikipedia. We really want to cache these as we go, so that we don't accidentally get rate-limited by Wikipedia's servers (or at the very least to be nicer to them). Here's what I wrote for that part (using the Wikipedia API) ((If you'd like to follow along, here's the code on GitHub: path to philosophy source)):
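The real code is in the linked repo; here's a minimal standalone sketch of the idea, with my own (hypothetical) names for the helpers:

#lang racket

(require net/url net/uri-codec json)

;; A sketch of the cached fetch. The cache directory and the #:force?
;; flag mirror the behavior described below, but the details are my
;; reconstruction, not the repo's exact code.
(define cache-dir "cache")

(define (page-url title)
  (string->url
   (string-append
    "https://en.wikipedia.org/w/api.php"
    "?action=query&prop=revisions&rvprop=content&format=json&titles="
    (uri-encode title))))

(define (get-page title #:force? [force? #f])
  (make-directory* cache-dir)
  (define cache-file (build-path cache-dir (string-append title ".json")))
  (cond
    ;; cache hit: just read the stored response back
    [(and (not force?) (file-exists? cache-file))
     (with-input-from-file cache-file read-json)]
    ;; cache miss (or forced refresh): hit the API and store the result
    [else
     (define data (read-json (get-pure-port (page-url title))))
     (with-output-to-file cache-file #:exists 'replace
       (lambda () (write-json data)))
     data]))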

Basically, wherever your code is running from, it will create / use a subdirectory called cache into which it will dump one file per page, each with the json extension, duplicating whatever content Wikipedia returned. Technically, it has the ability to force a refresh, but in practice I just deleted the entire cache folder when I wanted to reset it ((Yes, I know. That kind of defeats the purpose of having a cache in the first place...)).

Unfortunately, once you have that, the content is still buried in a deeply nested structure, so we need to be able to pull out the actual page text. If you use Racket's JSON library, you'll have a nested series of hashes and lists to navigate through. It's not quite as nice as Python's model ((Which I've used extensively before. How haven't I written a blog post about that yet?)) but it's not bad. Basically, I figured out the structure by hand and wrote a function to just ignore anything I didn't care about:
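As a hedged sketch (the key names here follow the standard query / revisions response format, and the function name matches how it gets used later):

;; Dig the wikitext out of the nested API response. read-json turns
;; JSON objects into hashes with symbol keys; the page id key isn't
;; known ahead of time, so just take the first page in the table.
(define (get-first-page-content data)
  (define pages (hash-ref (hash-ref data 'query) 'pages))
  (define page (first (hash-values pages)))
  (hash-ref (first (hash-ref page 'revisions)) '*))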

Now we have the actual content of the page in the original MediaWiki syntax. Links look like this: [[target|text]]. So we should just be able to write a simple regular expression and we'll be golden, right? Well, not so much. There are a number of additional rules that we're supposed to follow:

Take the first link in the main body of the page, except:

Ignore links in parentheses

Ignore links in italics

Ignore links to images / files

Ignore links in infoboxes, etc.

If a path doesn't work, backtrack and try the second link

So that gets a little more interesting. Originally, I tried to do everything with regular expressions. That took a few hours, until I realized that no matter how close I was getting, the code was an unholy mess. It turns out that regular expressions are not the tool you want to use for non-regular languages (anything with nestable tags) ((Yes, I should have known this. I've taken probably a dozen classes in theory and/or programming languages. But you can actually do a pretty decent hack job of it if you make certain assumptions about your input...)). In any case, I eventually regained my senses and decided to go for a simple parser instead. All I needed was some way to match (potentially nested) tags with arbitrary delimiters ((MediaWiki uses at least single and double square brackets and curly braces. It's a little wild...)). Here's what I have for that:
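Again, the repo has the real version; a standalone sketch looks something like this (matches-at? is a hypothetical helper, and page-content is passed explicitly here rather than closed over):

;; Does str contain pat starting at index i?
(define (matches-at? str i pat)
  (define end (+ i (string-length pat)))
  (and (<= end (string-length str))
       (string=? (substring str i end) pat)))

;; Given the index of an opening delimiter, return the index just past
;; its matching close (or #f if it never closes), counting nesting.
(define (find-matching page-content start open close)
  (let loop ([i (+ start (string-length open))] [nesting 1])
    (cond
      [(>= i (string-length page-content)) #f]
      [(matches-at? page-content i open)
       (loop (+ i (string-length open)) (+ nesting 1))]
      [(matches-at? page-content i close)
       (if (= nesting 1)
           (+ i (string-length close))
           (loop (+ i (string-length close)) (- nesting 1)))]
      [else (loop (+ i 1) nesting)])))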

Basically, given a position in the page-content ((In my actual code it's a free variable, since these functions are nested inside the main parsing function; the sketch above passes it explicitly.)) and the starting / ending delimiters, find the correctly matching end. nesting is used to keep track of how many nested delimiters we have (mostly useful for {{patterns like this}}). So what do we do with all of that?

Basically, scan through the file. Anytime we find one of the patterns to remove, find its close and skip ahead. Anytime we find a link, remember it in the recursion.
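In sketch form (the skip list here only covers templates and parentheses; italics and file links need their own handling, per the rules above):

;; Delimiters whose contents we skip entirely. Italics ('') share an
;; open and close marker, so they need separate handling, as do
;; [[File:...]] links; this sketch leaves those out.
(define skip-pairs '(("{{" . "}}") ("(" . ")")))

;; Scan the wikitext, skipping ignored regions and collecting link
;; bodies in the order we see them (for backtracking later).
(define (scan-links page-content)
  (let loop ([i 0] [links '()])
    (cond
      [(>= i (string-length page-content)) (reverse links)]
      ;; skip anything wrapped in a delimiter we don't care about
      [(for/first ([pair (in-list skip-pairs)]
                   #:when (matches-at? page-content i (car pair)))
         pair)
       => (lambda (pair)
            (define end (find-matching page-content i (car pair) (cdr pair)))
            (loop (or end (+ i 1)) links))]
      ;; found a link: remember its body and jump past it
      [(matches-at? page-content i "[[")
       (define end (find-matching page-content i "[[" "]]"))
       (if end
           (loop end (cons (substring page-content (+ i 2) (- end 2)) links))
           (loop (+ i 1) links))]
      [else (loop (+ i 1) links)])))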

At the end, we have a lovely little set of maps and filters to clean up the links and remove any potential external links. It's things like this that I miss most when I'm working in non-functional languages...
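Sketched out, with the same caveats as before, and assuming filter-links is the pass that both scans and cleans (matching how it gets used below):

;; Clean the raw link bodies: keep the target before any |, drop
;; file / image / category and external links, normalize case.
(define (filter-links page-content)
  (filter-map
   (lambda (link)
     (define parts (string-split link "|"))
     (define target (if (null? parts) "" (string-trim (first parts))))
     (and (non-empty-string? target)
          (not (regexp-match? #rx"^(File|Image|Category):" target))
          (not (regexp-match? #rx"^https?://" target))
          (string-downcase target)))
   (scan-links page-content)))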

Anyways, now we have all of the pieces. We can put them all together to get the neighbors for any given page:

; Get the neighbors to a given article
(define (get-neighbors title)
  (filter-links (get-first-page-content (get-page title))))

And then it's easy enough to solve the problem. I'm actually going to do the general solution first, since we'll have to write it eventually:
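The repo again has the full version; here's a sketch capturing the shape of it:

;; Depth-first search from start toward goal, taking links in page
;; order and backtracking when a branch dead-ends or loops.
(define (find-path start goal)
  (let search ([title (string-downcase start)] [visited '()])
    (cond
      [(string=? title (string-downcase goal)) (list title)]
      [(member title visited) #f] ; been here before: force a backtrack
      [else
       (let neighbor-loop ([neighbors (get-neighbors title)])
         (cond
           [(null? neighbors) #f]
           [(search (first neighbors) (cons title visited))
            => (lambda (next) (cons title next))]
           [else (neighbor-loop (rest neighbors))]))])))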

The smartest part of the code here is in the neighbor-loop. That's where we do the backtracking, primarily via that strange-looking => in the second cond clause. Basically, => lets you pass the result of the test expression to a function, so you can use it without recalculating it. Since anything that's not #f in Scheme (and by extension Racket) counts as true, the recursive cases that don't work and return #f fall through to the else case, while the ones that do work call the function with the recursive result as next. I hope that explanation made some sort of sense; it's a really helpful tool.
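If you haven't run into => before, here it is in isolation:

;; => hands the (non-#f) test value to the receiving function,
;; so assoc's result gets used without being recomputed
(define (lookup key table)
  (cond
    [(assoc key table) => cdr]
    [else 'not-found]))

(lookup 'b '((a . 1) (b . 2))) ; => 2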

And just for the sake of completeness, here's how we get to Philosophy:

; Starting at a given page and taking the first link, how long to Philosophy?
(define (->philosophy start) (find-path start "philosophy"))

My path has one extra link (rest mass) compared to the source example, but that's mostly because it's a redirect page. My code prints those out; the sample apparently does not. Overall though, it works out pretty well. Let's try something perhaps a bit more amusing:

The path is a bit different (and longer) than the one given, diverging after pervasive developmental disorder: they went to mental disorder while I went to specific developmental disorder. That's just the nature of Wikipedia. From time to time, people discover the 'paths to philosophy' challenge and deliberately rearrange links so that the path either resolves faster or breaks entirely, but such changes generally get reverted fairly quickly. It's really kind of amazing how Wikipedia has grown to deal with things like that.

In case you were curious, here are the timing differences when using the cache and when not:

It's really not much faster. Most of the time is spent processing the text and looking for links; perhaps only a quarter to a third of it is network latency. But still, if you're doing a lot of testing, the saved bandwidth to Wikipedia is probably worth a few fractions of a penny at the very least.