Yes, Patagonia gear costs more. But the way they treat me as a customer makes me happy to pay a premium…as my latest experience shows.

I had a pair of Patagonia Gore-Tex pants from way back when. They worked fine, though my duct-tape patch job ruined the clean lines – I’d accidentally stuck my ice axe through the pants and into my left leg, instead of the glacier, during a glissade off Rainier.

And then this past snowboarding season some of the seam-sealing tape started coming off, so things began to get a bit wet at times. I sent the pants to Patagonia, with a note explaining that I’d also be happy to pay for a real repair of my ice axe mishap.

Yesterday I got a Patagonia gift card in the mail, for $238.44. No idea how they calculated that amount, but I’m looking forward to buying a replacement pair of pants. And they’ve reaffirmed my belief that paying for quality gear winds up being cheaper in the end.

Over at the Nutch mailing list, there are regular posts complaining about the performance of the new queue-based fetcher (aka Fetcher2) that became the default fetcher when Nutch 1.0 was released. For example:

Not sure if that problem is solved, I have it and reported it in a previous thread. Extremely fast fetch at the beginning and damn slow fetches after a while.

But in my experience using Nutch for vertical/focused crawls, very slow fetch performance at the end of a crawl is a fundamental problem caused by having too few unique domains. Once the number of unique domains drops significantly (because you’ve fetched all of the URLs for most of the domains), fetch performance always drops rapidly – at least if your crawler properly obeys robots.txt and the default rules for polite crawling.
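As a back-of-the-envelope check, the politeness ceiling is easy to compute: with at most one in-flight request per domain and a fixed delay between requests to the same domain, throughput can never exceed (active domains ÷ crawl delay), no matter how many fetcher threads you run. A minimal sketch – the 30-second delay here is an assumed illustrative default, not a number from this crawl:

```java
// Upper bound on polite fetch throughput: once thread count exceeds the
// number of active domains, extra threads buy you nothing.
public class PoliteRate {
    static double maxFetchesPerSecond(int activeDomains, double crawlDelaySeconds) {
        return activeDomains / crawlDelaySeconds;
    }

    public static void main(String[] args) {
        double delay = 30.0; // assumed per-domain crawl delay (illustrative)
        System.out.println(maxFetchesPerSecond(1700, delay)); // ~57/sec after one hour
        System.out.println(maxFetchesPerSecond(64, delay));   // ~2/sec after three hours
    }
}
```

This is why the tail of a crawl is throughput-bound by domain count, not by hardware.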

Just for grins, I tracked a number of metrics at the tail end of a vertical crawl I was doing with Bixo – the vertical crawler toolkit I’ve been working on for the past two months. The system configuration (in Amazon’s EC2) is an 11-server cluster (1 master, 10 slaves) using the small EC2 instance type. I run 2 reducers per server, with a maximum of 200 fetcher threads per reducer. So the theoretical maximum is 4,000 active fetch threads – way more than I needed, but I was also testing the memory usage (primarily kernel memory) of threads, so I’d cranked this way up.

I started out with 1,264,539 URLs from 41,978 unique domains, where I classify domains by their “pay-level domain” (PLD), as described in the IRLbot paper. So http://www.ibm.com, blogs.us.ibm.com, and ibm.com all count as the same domain.
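For illustration, here’s a naive sketch of pay-level-domain extraction. A real implementation needs the public suffix list (suffixes like co.uk span two labels), so treat this as a toy that only handles single-label TLDs:

```java
// Naive pay-level-domain extraction (sketch only; not Bixo's actual code).
// Keeps the last two labels of the hostname, so it breaks on multi-label
// suffixes such as "co.uk" – a real version consults the public suffix list.
public class PaidLevelDomain {
    static String extract(String hostname) {
        String[] labels = hostname.split("\\.");
        int n = labels.length;
        if (n <= 2) {
            return hostname; // already at the pay level, e.g. "ibm.com"
        }
        return labels[n - 2] + "." + labels[n - 1];
    }

    public static void main(String[] args) {
        System.out.println(extract("www.ibm.com"));      // ibm.com
        System.out.println(extract("blogs.us.ibm.com")); // ibm.com
        System.out.println(extract("ibm.com"));          // ibm.com
    }
}
```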

Here’s the performance graph after one hour, which is when the crawl seemed to enter the “long tail” fetch phase…

The key things to note from this graph are:

The 41K unique domains were down to 1,700 after an hour, and continued to drop slowly after that. This directly limits the number of fetches that can politely execute in parallel. In fact there were only 240 parallel fetches (== 240 domains) after one hour, and 64 after three hours.

Conversely, the average number of URLs per domain climbs steadily, which means the future fetch rate will continue to drop.

And so it does, going from almost 9K/second (scaled to 10ths of a second in the graph) after one hour down to 7K/second after four hours.

I think this represents a typical vertical/focused crawl, where a graph of URLs per domain would show very strong exponential decay. Once you’ve worked through the many domains that contribute only a single URL, you’re left with lots of URLs concentrated in a much smaller number of domains – and your performance will begin to stink.

The solution I’m using in Bixo is to specify a target fetch duration. From this, I can estimate the number of URLs per domain I’m likely to be able to get, and pre-prune the URLs put into each domain’s fetch queue. This works well for the kind of data processing workflow the current commercial users of Bixo need, where Bixo is one piece of a pipeline and has to play well with the rest (i.e. not stall the entire process).
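A minimal sketch of that pre-pruning idea (names are illustrative, not Bixo’s actual API): given a target duration and a per-domain crawl delay, a polite fetcher can pull at most duration ÷ delay URLs from any one domain, so anything beyond that count gets dropped from the queue up front:

```java
import java.util.List;

// Pre-prune a domain's fetch queue to fit a target crawl duration.
// Hypothetical helper, not Bixo's real implementation.
public class QueuePruner {
    static <T> List<T> prune(List<T> urls, long targetDurationSeconds, long crawlDelaySeconds) {
        // A polite fetcher gets at most one URL per crawl delay from a domain,
        // so URLs beyond this count can't be fetched within the target time.
        long maxUrls = targetDurationSeconds / crawlDelaySeconds;
        return urls.size() <= maxUrls ? urls : urls.subList(0, (int) maxUrls);
    }
}
```

With a one-hour target and a 30-second delay, for example, each domain’s queue would be capped at 120 URLs.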

Anyway, I keep thinking that perhaps some of the reported problems with Nutch’s Fetcher2 are actually a sign that the fetcher is being appropriately polite, and the comparison with the old fetcher is flawed because that version had bugs where it would act impolitely.

While incredibly detailed and accurate, his data set wasn’t all that useful to me when thinking about climbing trips. I’d still wind up hunched over my old “Guide to the John Muir Wilderness and Sequoia-Kings Canyon Wilderness” maps, with R. J. Secor’s “The High Sierra” book in hand, trying to figure out possible routes to interesting areas.

So I wrote a program to convert his data into a Google Earth-compatible KML file, which I could then use to visualize the peak list in glorious 3D. The resulting file has proven very useful, so I thought I’d share it via this blog post – and provide a bit of commentary regarding the program/process at the same time.

I use different colored pushpins to denote the difficulty of reaching the summit: green for class 1 or 2, yellow for class 3, and red for class 4. I didn’t factor in the higher difficulty of the summit block, since many of the peaks are class 2 or 3 to the base of the summit block even though the block itself is class 4.

In the peak description, I tried to generate links to trip reports on Climber.org, but not all of these will be valid. Usually this is because the peak in question has no Climber.org trip report, but a few are due to issues with reverse-engineering the “shortened name” algorithm used at that site when grouping trip reports.

The same thing is true for links to Secor’s “The High Sierra” book at Google Books. I have page numbers, but not all pages are available (as one would expect), and sometimes the peak name used to highlight entries on the page won’t match the name that Secor used.

Next, some notes on the KML format:

The on-line documentation is really good, especially the KML Reference provided by Google.

I ran into a few minor problems where Google Earth would report no error when loading my file, but problems in the data meant I wouldn’t see the expected result. For example, I’d accidentally specified the <color> value as hex-ified RGB (e.g. “ffffff” for white) instead of ABGR (alpha/blue/green/red), which needs eight hex digits. I’d also added an <IconStyle> element with an <href> child, but the href needs to go inside an <Icon> element. Minor things, but a bit frustrating to debug without any useful error being reported by Google Earth.
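Since the <color> ordering trips people up, here’s a small helper (hypothetical – not part of the program above) that converts a CSS-style “rrggbb” value into the eight-digit aabbggrr form that KML expects:

```java
// Convert an "rrggbb" hex color plus an alpha value (0-255) into
// KML's "aabbggrr" ordering.
public class KmlColor {
    static String toKml(String rgbHex, int alpha) {
        String r = rgbHex.substring(0, 2);
        String g = rgbHex.substring(2, 4);
        String b = rgbHex.substring(4, 6);
        // KML wants alpha first, then blue/green/red (reversed from CSS).
        return String.format("%02x%s%s%s", alpha, b, g, r);
    }

    public static void main(String[] args) {
        System.out.println(toKml("ffffff", 255)); // ffffffff (opaque white)
        System.out.println(toKml("ff0000", 255)); // ff0000ff (opaque red)
    }
}
```

Note that opaque red comes out as “ff0000ff”, not “ffff0000” – exactly the kind of silent mismatch Google Earth won’t warn you about.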

I wanted to use different built-in icons, but didn’t see a document listing all of these on Google. Eventually I found the list I needed in a Google forum post titled “Setting KML icon colors”.

I’ve posted source for the Java program used to generate the KML file. It’s located in my GitHub account, at the peaks2kml repository.

This Java program should have been trivial to write – basically convert from a text file dump of a database into the KML format. But I ran into one painful issue, which was converting from the NAD27 UTM locations into longitude/latitude. Seems like this bites everybody, and the lack of a universal, high quality Java package is frustrating.

I’m using the GeoTransform package, but I didn’t see a clean way to specify NAD27 as the source UTM datum. I did figure out that the Clarke 1866 ellipsoid was the right one to use for the conversion, and dumped out some results. I compared these with manual results from an excellent on-line UTM conversion page, and then used the delta (which appeared to be relatively constant) to adjust my results. Ugly, but close enough for a first cut.
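That delta hack can be sketched as follows. The conversion itself is assumed to come from GeoTransform (or any converter); you measure the offset once at a reference point whose true NAD27 lat/lon you trust, then apply it everywhere. The numbers in main are placeholders, not real survey data:

```java
// Constant-offset datum correction (sketch of the hack described above).
public class DatumDelta {
    // Offset (degrees) between a trusted reference conversion and our
    // converter's output at the same point; measured once, applied everywhere.
    static double[] delta(double refLat, double refLon, double convLat, double convLon) {
        return new double[] { refLat - convLat, refLon - convLon };
    }

    // Shift a converted point by the measured offset.
    static double[] adjust(double lat, double lon, double[] delta) {
        return new double[] { lat + delta[0], lon + delta[1] };
    }

    public static void main(String[] args) {
        // Placeholder values only – not real coordinates.
        double[] d = delta(10.0, 20.0, 9.5, 19.5);
        double[] fixed = adjust(36.0, -118.0, d);
        System.out.println(fixed[0] + ", " + fixed[1]);
    }
}
```

This only works because the offset was observed to be roughly constant over the area of interest; a proper datum transformation would handle the shift correctly everywhere.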

And if I had to do it again, I’d probably use something like the KML beans (e.g. StyleType.java) from the Luzan project, and an XML package to convert the resulting object graph to a textual KML representation.