We don't implement a sitemap.xml and I'm not sure it's practical to: we have several hundred thousand nodes being updated hourly in an irregular way, so it would be constantly stale and take a lot of CPU to generate. We do implement an Atom feed of updates to the index, which is discoverable in markup, and I *think* Google and other robots use it to detect timely changes to nodes and re-index them without resorting to a full re-scrape. Hard to know their internals.
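
For anyone wanting to consume that feed, here's a minimal sketch of discovering it from page markup and pulling out the updated entries. Whether our `<link>` tag looks exactly like this is an assumption; treat it as illustration only.

```python
# Hypothetical sketch: discover an Atom feed advertised in page markup and
# list its recently updated entries. URLs and markup details are assumptions.
import requests
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin

class FeedLinkFinder(HTMLParser):
    """Collects hrefs from <link rel="alternate" type="application/atom+xml"> tags."""
    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "alternate" \
                and a.get("type") == "application/atom+xml":
            self.feeds.append(a.get("href"))

ATOM = "{http://www.w3.org/2005/Atom}"

def updated_entries(page_url):
    """Yield (title, link, updated) for each entry in the page's Atom feed."""
    finder = FeedLinkFinder()
    finder.feed(requests.get(page_url).text)
    for feed_href in finder.feeds[:1]:
        feed_url = urljoin(page_url, feed_href)  # handle relative hrefs
        root = ET.fromstring(requests.get(feed_url).content)
        for entry in root.iter(ATOM + "entry"):
            link = entry.find(ATOM + "link")
            yield (entry.findtext(ATOM + "title"),
                   link.get("href") if link is not None else None,
                   entry.findtext(ATOM + "updated"))
```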

But for our purposes I'd assume there is already lots of DDG scraping and indexing going on and this should just be an extra step after that process and not a separate scrape?

If this needs another scrape, or just for dev purposes, you could start at the world node, or any region or crag node, and then walk down through the index following the links in the left nav.
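
A rough sketch of that walk, where the start URL and the link pattern are guesses for illustration (the real markup will differ):

```python
# Hypothetical breadth-first walk of the index from the world node down.
# START and CHILD_LINK are guesses, not the site's documented structure.
import re
import time
from collections import deque
import requests

START = "https://www.thecrag.com/climbing/world"        # world node (assumed URL)
CHILD_LINK = re.compile(r'href="(/climbing/[^"#?]+)"')  # left-nav links (assumed pattern)

def walk_index(start=START, max_pages=100):
    seen, queue = {start}, deque([start])
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        yield url
        html = requests.get(url).text
        time.sleep(1)  # be polite; this is someone else's server
        for path in CHILD_LINK.findall(html):
            child = "https://www.thecrag.com" + path
            if child not in seen:
                seen.add(child)
                queue.append(child)
```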

Just to get started I'd probably only walk down as far as the highest crag node, which we internally call the TLC or Top Level Crag. We use this concept of a TLC for lots of reasons, e.g. to answer which 'crag' a route belongs to. Crags can be nested, i.e. the Grampians or Yosemite are considered crags but are hundreds of km wide and contain smaller, well-known, named crags, e.g. Yosemite > El Capitan and Grampians > Hollow Mountain. Stopping at the TLC also avoids the large number of crags whose child nodes have generic names like 'Left side / Right side', 'North / South' or 'Sunny side / Shady side'. Later we can go down to the route level and figure out how to filter out all the duplicates.
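
As a starting point for spotting those generic children, something like this could work (the word list is illustrative, definitely not exhaustive):

```python
# Rough heuristic for the generic child-node names mentioned above, so a walk
# can stop at the TLC instead of descending into them. List is illustrative.
GENERIC_NAMES = {
    "left side", "right side", "north", "south",
    "sunny side", "shady side",
}

def looks_generic(node_name):
    """True if a child node's name is a generic subdivision, not a real crag."""
    return node_name.strip().lower() in GENERIC_NAMES
```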

This would be a completely separate scraper. All the fatheads and longtails use a source-specific scraper. Usually we just figure out a reasonable update period and run the scraper manually on that schedule. I like the Atom feed though; I'll try and think of a way we could use that.

Given that this would be a separate scraping process, I'm starting to think a custom endpoint on our side, hit once a week or so, would be better than a live API.
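
To be concrete about what I mean, roughly this, with the endpoint path made up (run periodically from cron rather than called live):

```python
# Hypothetical weekly pull of a custom export endpoint. The path is a
# placeholder; schedule e.g. via cron: 0 3 * * 1 python pull_crags.py
import requests

EXPORT_URL = "https://www.thecrag.com/some-export-endpoint"  # placeholder

def pull_export(dest="crags.json"):
    resp = requests.get(EXPORT_URL, timeout=60)
    resp.raise_for_status()
    with open(dest, "wb") as f:
        f.write(resp.content)
```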

In the meantime you guys can manually piece together a static text file (CSV, JSON, or directly in the longtail format) with exactly the bits of data that you need, adding fields and test data as you go. When you are close to getting it all working and the format has stabilized a bit, we (i.e. thecrag) will knock up a single API endpoint which replicates that format.
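
Something like this could be a starting point; the field names are placeholders to iterate on, not a spec:

```python
# Illustrative interim record shape and how it might be dumped to JSON or CSV.
# All field names here are placeholders to be changed as the format evolves.
import csv
import json

SAMPLE = [
    {
        "name": "Hollow Mountain",
        "parent": "Grampians",
        "country": "Australia",
        "url": "https://www.thecrag.com/",  # node URL placeholder
        "summary": "One or two sentences of description.",
    },
]

with open("crags.json", "w") as f:
    json.dump(SAMPLE, f, indent=2)

with open("crags.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=SAMPLE[0].keys())
    writer.writeheader()
    writer.writerows(SAMPLE)
```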

That way you are focusing on the IA logic and not on a bunch of scraping and parsing code, and we can focus on just pumping out the data in the right shape, without having to think too much about what that shape is and which field goes where.

Yeah, that's what I had in mind; this seems the easiest. Even monthly could be fine, as this data is very slow-moving at this high level. This would work well for all crags, which is a fairly limited data set, currently ~5,000 records worldwide (and we'd probably return only a high-quality subset of these). This probably won't work so well for route-level stuff (~300,000 records), and I'm a little dubious about the value of IAs for individual routes anyway. Perhaps we could only create IAs for routes which are iconic, i.e. popular 2- or 3-star routes.
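
For the route case, the kind of filter I mean, sketched against the hypothetical record shape above (the stars field is a guess):

```python
# Sketch: keep only iconic routes. The record shape and the "stars" field
# are assumptions for illustration.
def iconic_routes(routes, min_stars=2):
    """Keep popular 2- or 3-star routes from a list of route dicts."""
    return [r for r in routes if r.get("stars", 0) >= min_stars]
```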

That is quite odd. I haven't been running anything since I posted, and my traffic was minimal and confined to that single URL. I don't live in Quebec or have Safari installed either. I do hope you figure out what the problem is, though.

Great, thanks! All of the data that you want to show will have to go into a single "paragraph" field. This field will have to be formatted the way you want it to show on the site; for now that can only be plain text and newlines.
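
To make that concrete, a quick sketch of squashing a record into that single plain-text field, reusing the hypothetical field names from earlier:

```python
# Sketch: flatten a crag record into the single plain-text "paragraph" field,
# using newlines for layout. Field names are the hypothetical ones above.
def to_paragraph(crag):
    return "\n".join([
        crag["name"],
        f"Located in: {crag['parent']}, {crag['country']}",
        crag["summary"],
    ])
```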

Feel free to make a pull request at any time. It doesn't have to be completely finished, and it might be easier to go over the output file format on GitHub.