Large relational database vendors have long argued that stuffing as many of your documents as possible into a database is the way to go. Hence, the ongoing war of words between Oracle Corp. and IBM over whose database software provides faster storage and retrieval of XML data.

But enterprise search software from vendors such as Fast, Autonomy Corp. or Endeca Technologies Inc. lets you go the other way and search for information inside a database, whether it is stored as an unstructured binary large object (“Blob”) or, in the case of numbers, in individual cells.

Search software is actually faster than running a SQL query to find data in a database, though it can’t manipulate or numerically analyze the data, according to Yves Schabes, co-founder and president of Teragram Corp., a Cambridge, Mass.-based enterprise search vendor.

If I can use Google, can I easily learn to use enterprise search software?

Probably. Most software today displays a single initial box into which a user can enter keywords joined by Boolean operators such as AND and OR. After getting a set of results, users then look to the side for drop-down menus where they can narrow the search by what Schabes calls “facets,” such as information source, country or date.
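
For readers who want to see the idea in miniature, here is a rough Python sketch of a Boolean keyword search followed by facet narrowing. It is an illustration only, not how any vendor’s product works; the `Doc` fields, the sample documents and the function names are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    # Hypothetical document record; the last three fields act as facets.
    title: str
    text: str
    source: str
    country: str
    year: int


DOCS = [
    Doc("Travel policy", "travel budget rules for hotel stays", "intranet", "US", 2007),
    Doc("Q4 plan", "marketing budget and travel forecast", "file share", "Canada", 2007),
    Doc("Cafeteria menu", "lunch specials this week", "intranet", "US", 2008),
]


def keyword_search(docs, all_of=(), any_of=()):
    """Return docs containing every word in all_of (AND) and,
    if any_of is given, at least one word from it (OR)."""
    hits = []
    for doc in docs:
        words = set(doc.text.lower().split())
        has_all = all(term.lower() in words for term in all_of)
        has_any = not any_of or any(term.lower() in words for term in any_of)
        if has_all and has_any:
            hits.append(doc)
    return hits


def narrow(hits, **facets):
    """Keep only hits whose facet fields equal the chosen values."""
    return [d for d in hits if all(getattr(d, k) == v for k, v in facets.items())]


hits = keyword_search(DOCS, all_of=["budget"], any_of=["hotel", "forecast"])
us_hits = narrow(hits, country="US")
```

The first call plays the role of the search box; the second plays the role of picking “US” from a country drop-down.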

What kinds of information does search software have difficulty finding?

Enterprise search software tends to be bad at searching information that has already been offloaded to tape archives, according to Schabes. For that, companies still tend to rely on specialized e-discovery and storage management tools.

Enterprise search also has problems handling multimedia such as podcasts, pictures and video files. Metadata is usually scarce or not useful. Those files still need to be transcribed or processed by speech-to-text software to be indexable by enterprise search software.

In addition, enterprise search software isn’t good at filtering out multiple versions of the same document, Schabes says. This data cleansing, data de-duplication or master data management is already an established field in the structured relational database realm. But tools are slow to emerge in the unstructured enterprise search arena, he says.
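
To give a sense of what filtering out document versions involves, the following Python sketch drops exact copies by hashing normalized text and drops near-copies by comparing overlapping three-word sequences. It is a toy under stated assumptions: the 0.8 similarity threshold and all function names are invented for illustration, and commercial tools are likely to be far more elaborate.

```python
import hashlib
import re


def normalize(text):
    # Lowercase and collapse whitespace so trivial edits do not matter.
    return re.sub(r"\s+", " ", text.lower()).strip()


def fingerprint(text):
    # Identical normalized text always yields the same SHA-256 digest.
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text, k=3):
    # Set of overlapping k-word sequences; short texts become one shingle.
    words = normalize(text).split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    # Share of shingles the two texts have in common (0.0 to 1.0).
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def dedupe(docs, threshold=0.8):
    """Keep the first copy of each document; skip exact and near duplicates."""
    kept, seen = [], set()
    for doc in docs:
        fp = fingerprint(doc)
        if fp in seen:
            continue
        if any(jaccard(doc, other) >= threshold for other in kept):
            continue
        seen.add(fp)
        kept.append(doc)
    return kept
```

Note that the pairwise comparison in `dedupe` grows quadratically with the number of documents, one reason the problem is harder at corporate scale than this sketch suggests.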

Nothing you’ve said makes enterprise search sound more difficult than cataloguing and searching the entire World Wide Web.

But consider the challenges of looking in every nook and cranny of a corporate network, reading all the various file formats, and handling who has permission to see what. For instance, an enterprise search product might index everyone’s private e-mail. But only certain employees should be allowed to search those e-mails. In fact, those e-mails should not even appear in the search results of unauthorized employees, Schabes says. To enforce that, enterprise search software needs to be tied into group-policy software such as Microsoft’s Active Directory, which is no easy task.
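
The permission check itself can be pictured with a short Python sketch. Everything here is hypothetical: the `Hit` record, the group names and the `USER_GROUPS` table, which is a plain dictionary standing in for a lookup against a directory service. No real Active Directory call is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    # Hypothetical search result carrying the groups allowed to see it.
    title: str
    allowed_groups: frozenset


# Stand-in for a directory lookup: user name -> group memberships.
USER_GROUPS = {
    "alice": {"legal", "staff"},
    "bob": {"staff"},
}

HITS = [
    Hit("Holiday schedule", frozenset({"staff"})),
    Hit("E-mail: merger talks", frozenset({"legal"})),
]


def trim_results(hits, user, user_groups=USER_GROUPS):
    """Return only hits the user may see; unknown users see nothing."""
    groups = user_groups.get(user, set())
    return [h for h in hits if h.allowed_groups & groups]
```

Here `trim_results(HITS, "bob")` leaves out the merger e-mail entirely, so an unauthorized employee never learns that it exists. The hard part in practice is keeping the group data in step with the directory, which this sketch ignores.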

Moreover, corporate documents lack useful metadata to help give context, claims Schabes. “People rarely search for the author of a document, or what business unit it came from, or the date it was created,” he says.

HTML pages on the Web have far more useful metadata, which makes it easier for systems to index them and determine their relevancy. One of the most important signals is the Web page’s popularity, which algorithms such as Google PageRank infer by looking at the number of times it has been clicked and which other pages link to it.
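
For the curious, here is a simplified Python sketch of a link-based popularity score in the spirit of the published PageRank algorithm. It uses hyperlinks only and does not model clicks; the damping factor of 0.85, the iteration count and the tiny sample link graph are assumptions for illustration, and it is in no way Google’s production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Pages with no outgoing links spread their score over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new[target] += share
        rank = new
    return rank


# Hypothetical mini-site: three pages link to "home"; nothing links to "orphan".
LINKS = {
    "home": ["hr", "sales"],
    "hr": ["home"],
    "sales": ["home"],
    "orphan": ["home"],
}
scores = pagerank(LINKS)
```

A page that nothing links to, like “orphan” here, ends up with the lowest score. Most corporate documents are in that position, which helps explain why link analysis carries less weight inside the firewall.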

Another problem, according to Ali Riaz, former president and chief operating officer of Fast, is that expectations “are higher since the assumption is ‘It’s our data, and we know what should be there.’”

For example, 62 per cent of U.S. scientists and engineers surveyed late last year were dissatisfied with their existing enterprise search systems, according to independent search analyst Stephen Arnold.

To make up for the lack of metadata, corporations rely on their enterprise search software and related tools to automatically apply taxonomies to their documents, which they later refine. That’s a bit like tagging a photo you upload to Flickr, except in this case, Flickr would do all of the tagging for you. Teragram Corp. offers software for this.
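
In its crudest form, automatic tagging can be pictured as matching a document’s words against a list of terms for each category. The Python sketch below does just that. The categories, the term lists and the `auto_tag` function are invented for illustration; commercial products, Teragram’s included, presumably rely on much richer linguistic analysis.

```python
import re

# Hypothetical taxonomy: category -> words that suggest it.
TAXONOMY = {
    "Human Resources": {"benefits", "vacation", "payroll"},
    "Finance": {"invoice", "budget", "forecast"},
    "Legal": {"contract", "liability", "nda"},
}


def auto_tag(text, taxonomy=TAXONOMY, min_hits=1):
    """Return the sorted categories whose terms appear at least min_hits times
    (counting distinct matching terms) in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        category
        for category, terms in taxonomy.items()
        if len(words & terms) >= min_hits
    )


tags = auto_tag("Please review the budget forecast and the attached contract.")
```

Raising `min_hits` makes the tagger stricter, a simple example of the kind of refinement that administrators apply after the first automatic pass.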

Are there enterprise search software packages that try to encourage employees who view documents to categorize them or to leave comments and ratings?

Sure, though Schabes is a skeptic. “The ratio of documents to people is so much higher in an enterprise than on the Web,” he claims. As a result, few documents — popular human resources documents or accounting spreadsheets notwithstanding — are going to get viewed enough to generate many comments or ratings, especially from busy employees. Moreover, asking employees to rate documents isn’t useful “because a manager will look at things differently than a salesperson,” he said.

Is Google’s search technology inferior to Fast for enterprises?

That’s what Microsoft’s Raikes asserted when he claimed that PageRank, because it relies so heavily on page views and hyperlinks, is a flawed method for enterprise search. Google pointed out, however, that the Search Appliances it sells to enterprises use a tweaked version of its famed algorithm. It also said that PageRank is “one among more than a hundred factors that determine the relevancy of universal search.”

Former Fast executive Riaz, who now runs information software maker Attivio Inc., argued that “slapping in a search appliance doesn’t do it.”

But many companies appear to find Google’s appliances up to snuff. According to AMR Research Inc. analyst Jim Murphy, Google’s inexpensive appliances are a “huge success.”

“Enterprises have an enormous pent-up need for simple, easy-to-deploy-and-administer search for general audiences,” he wrote, adding that Google is winning a “significant foothold in the enterprise.”

So will Microsoft’s buy kick off a wave of consolidation?

Charles King, an analyst at Pund-IT Inc., said, “It is easy to see how enterprise-focused vendors, such as Oracle, with active acquisition histories, would see the wisdom of purchasing specialized search technologies. In addition, enterprise search solutions might prove attractive to companies like EMC and IBM, which are pursuing sophisticated content management strategies. Fi