According to Stephen Arnold, FirstGov, which has been renamed USASearch.gov, is by far the most effective US government-specific search engine. But there’s something odd about it; whatever the query, it’s determined to give no more than a little over 100 results. Queries for which I’ve noted results in this quantity range include Bush (and this covers all family members), Cheney (ditto), Kennedy (ditto), Condoleezza, Scalia, Coolidge, Red Sox, Big Dig, Burlingame, Redmond, Pluto, ethanol, spotted owl, and topology. The only ones I’ve found so far coming out above that range are, perhaps inevitably, death (137) and taxes (177).

OK. I have a vision of one way search could evolve, which I think deserves consideration on at least a “concept-car” basis. This is all speculative; I haven’t discussed it at length with the vendors who’d need to make it happen, nor checked the technical assumptions carefully myself. So I could well be wrong. Indeed, I’ve at least half-changed my mind multiple times this weekend, just in the drafting of this post. Oh yeah, I’m also mixing several subjects together here. All in all, this is not my crispest post …

Anyhow, the core idea is that large enterprises spider and index a subset of the Web, and use that for most of their employees’ web search needs. Key benefits would include:

Filtering out spam hits. This is obviously important for search, and in some cases could help with public-web text mining as well. It should be OK to be more aggressive on spam-site filtering in an enterprise-specific index than it is in general web search.

Filtering out malicious/undesirable downloads of various sorts. I’m thinking mainly of malware/spyware here, but of course the same filtering can be used for netnannying, porn-prevention, and the like. Again, this is more easily done for the enterprise market than for the search world at large. (In any case, I think Google could blow Websense out of the water any time they wanted to, except, of course, for the not-so-small matter of not wanting to be seen as participating in the censorship business. But that’s a separate discussion.)

Capturing employees’ search strings. This could be useful for various purposes, including discerning their interests, and building the corporate ontology for internal web search.

Freshness control. If there’s a site you really care about, you can make sure it’s re-indexed frequently.
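
To make the four benefits above a bit more concrete, here is a minimal Python sketch of the policy layer such an enterprise-specific index might sit behind. Everything in it is my own assumption for illustration: the CrawlPolicy class, its field names, the weekly default interval, and the example hosts are all hypothetical, and no vendor’s product is being described. It covers host-level blocking (spam and malware), per-site re-indexing intervals (freshness), and logging of search strings.

```python
from collections import Counter
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class CrawlPolicy:
    """Hypothetical policy for an enterprise-specific web index (illustrative only)."""
    blocked_hosts: set = field(default_factory=set)      # spam/malware hosts to exclude
    recrawl_hours: dict = field(default_factory=dict)    # host -> re-index interval in hours
    default_recrawl_hours: float = 168.0                 # assumed default: weekly
    query_log: Counter = field(default_factory=Counter)  # captured employee search strings

    @staticmethod
    def host(url: str) -> str:
        # Lower-cased hostname, or "" if the URL has none (e.g. no scheme).
        return (urlparse(url).hostname or "").lower()

    def allowed(self, url: str) -> bool:
        # Reject URLs with no host, blocked hosts, and subdomains of blocked hosts.
        h = self.host(url)
        if not h:
            return False
        return not any(h == b or h.endswith("." + b) for b in self.blocked_hosts)

    def due(self, url: str, hours_since_last_crawl: float) -> bool:
        # A page is due when it is allowed and its host's interval has elapsed.
        interval = self.recrawl_hours.get(self.host(url), self.default_recrawl_hours)
        return self.allowed(url) and hours_since_last_crawl >= interval

    def record_query(self, query: str) -> None:
        # Normalize case and whitespace so near-identical queries count together.
        normalized = " ".join(query.lower().split())
        if normalized:
            self.query_log[normalized] += 1


# Example hosts below are made up.
policy = CrawlPolicy(
    blocked_hosts={"spam.example"},
    recrawl_hours={"news.example": 1},  # a site you really care about
)
policy.record_query("  Spotted  Owl ")
policy.record_query("spotted owl")
```

The aggressive part is the subdomain match in allowed(): blocking an entire host family is the sort of blunt filtering that is easier to justify in an enterprise index than in general web search. A real system would also need content-level checks, which this sketch leaves out.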

Gartner and Forrester have high opinions of FAST. Not coincidentally, you can download both those firms’ recent search industry survey reports from almost any page of www.fastsearch.com. Of the two, Forrester’s is both better and more recent.

Summarizing brutally, the big firms’ consensus seems to be:

FAST and Autonomy are the clear leaders.

Endeca has great technology and is coming on strong.

Everybody else is a niche player, at least for now.

Convera is in deep yogurt.

Forrester is particularly harsh on Convera. Presumably this has much to do with the fact that Convera did not cooperate well with the survey process. I shall not speculate as to which way the causality runs there – but I should note that Convera was quite cooperative with my research last week.

Web search and enterprise search are in many ways fundamentally different problems. The biggest problem in web search is screening out pages that deliberately pretend to be relevant to a search. The second biggest problem is picking out the crème de la crème from a long list of essentially good hits. In enterprise search, on the other hand, the biggest problem is finding a single document, or single fact, that is lonely at best, and if you’re unlucky doesn’t exist in the corpus at all. Document structures are also completely different, as are linking structures and almost every other input to the ranking algorithms except the raw words themselves.

Even so, the businesses and technologies of web and enterprise search are beginning to combine.

Once upon a time, more than a decade before the founding of Autonomy, a New Mexico inventor had the idea for a generic pattern recognition tool. He implemented it on a PC add-in board that, if I recall correctly, plugged into the Apple II. This was the genesis of the company Excalibur Technologies.