Sometimes, a solution is purchased. Many libraries, for example, use 360 Search.

Here at UCSF, we are among the libraries that have built our own federated search. Twice.

There are (at least) three ways to pull data out of other resources in real time.

Cool, they have an API for that!This almost never happens.

I will screen-scrape the #*%!?@ out of your website!This is by far the most common scenario.

Web New-dot-Oh: It’s full of JavaScript that injects the content.An edge case, but one that is becoming more important all the time. Make friends with PhantomJS to scrape these sites.

When trying to implement these solutions, a common scenario is to build your screen-scraping federated search tool with traditional server-side languages like Java or PHP.

These strategies and technologies bring with them pitfalls to be avoided. Recompiling your WAR every time one of your target systems modifies their HTML layout, anyone?

Here’s another pitfall: Our group built a solution years ago (our first one) that is implemented in Drupal with no external facing API. So, if I want to experiment with a different results interface, I need to write it in Drupal. This tight coupling prevents experimentation with other technologies or things that don’t fit neatly into the Drupal paradigm.

A lot of the pitfalls can be avoided by following sound software architecture principles. But one thing should be uncontroversial:

No programming language has a more robust and widely-understood set of conventions and tools for processing blobs of HTML than JavaScript.

So how about building your federated search server using Node.js? Or maybe even take it a step further and just let your user’s browser execute the federated search entirely by itself, no need to talk to your server! If it’s all just JavaScript, why not?

(Or tell me about your project that already does this better and I need to fold up shop or at least steal all your ideas.)

While I’m at it with the small-text thing, here’s a caveat on the Browserify-ed version: The one thing the browser couldn’t do was launch the PhantomJS headless browser for scraping sites that depend on JavaScript execution to display results. Fortunately for us, that was needed in the LibGuides plugin only. And LibGuides offers an API, so if we really wanted LibGuides results, we could use the API. We initially implemented it that way, actually, but found that the API results differed from the LibGuides search page results. We thought that might be confusing to users, so we went with PhantomJS-assisted scraping.