I believe there are two ways search will improve significantly in the future. First, since talking is easier than typing, speech recognition will allow longer and more accurate input strings. Second, search will be informed by much more persistent user information, with search companies having very detailed understanding of searchers. Based on that, I expect:

A small oligopoly dominating the conjoined businesses of mobile device software and search. The companies most obviously positioned for membership are Google and Apple.

The continued and growing combination of search, advertisement/recommendation, and alerting. The same user-specific data will be needed for all three.

Enterprise search is greatly disappointing. My main reason for saying that is anecdotal evidence — I don’t notice users being much happier with search than they were 15 years ago. But business results are suggestive too:

Web search, while superior to the enterprise kind, is disappointing people as well. Are Google’s results any better than they were 8 years ago? Google’s ongoing hard work notwithstanding, are they even as good?

Consumer computer usage is swinging toward mobile devices. I hope I don’t have to convince you about that one.

In principle, there are two main ways to make search better:

Understand more about the documents being searched over. But Google’s travails, combined with the rather dismal history of enterprise search, suggest we’re well into the diminishing-returns part of that project.

Understand more about what the searcher wants.

The latter, I think, is where significant future improvement will be found.

CMS (Content Management System)/search expert Alan Pelz-Sharpe recently decried “Shadow IT”, by which he seems to mean departmental proliferation of data stores outside the control of the IT department. In other words, he’s talking about data marts, only for documents rather than tabular data.

Notwithstanding the manifest virtues of centralization, there are numerous reasons you might want data marts, in the tabular and document worlds alike. For example:

Price/performance. Your main/central data manager might be too expensive to support additional large specialized databases. Or different databases and applications might have sufficiently different profiles so as to get great price/performance from different kinds of data managers. This is particularly prevalent in the relational world, where each of column stores, sequentially-oriented row stores, and random I/O-oriented row stores has compelling use cases.

Different SLAs (Service-Level Agreements). Similarly, different applications may have very different requirements for uptime, response time, and the like. (In the relational world, think of operational data stores.)

Different security requirements. Different subsets of the data may need different levels of security. This is particularly prevalent in the document world, where security problems are not as well-solved as in the tabular arena, and where it’s common for a search engine to index across different corpuses with radically different levels of sensitivity.

Integrated application and user interfaces. In the relational world, there’s a pretty clean separation between data management and interface logic; most serious business intelligence tools can talk to most DBMS. The document world is quite different. Some search engines bundle, for example, various kinds of faceted or parameterized search interfaces. What’s more, in public-facing search, a major differentiator is the facilities that the product offers for skewing search results.

The newsletter/column excerpted below was originally published in 1998. Some of the specific references are obviously very dated. But the general points about the requirements for successful natural language computer interfaces still hold true. Less progress has been made in the intervening decade-plus than I would have hoped, but some recent efforts — especially in the area of search-over-business-intelligence — are at least mildly encouraging. Emphasis added.

For example, Artificial Intelligence Corporation’s Intellect was a natural language DBMS query/reporting/charting tool. It was actually a pretty good product. But it’s infamous among industry insiders as the product for which IBM, in one of its first software licensing deals, got about 1700 trial installations — and less than a 1% sales close rate. Even its successor, Linguistic Technologies’ English Wizard*, doesn’t seem to be attracting many customers, despite consistently good product reviews.

*These days (i.e., in 2009) it’s owned by Progress and called EasyAsk. It still doesn’t seem to be selling well.

Another example was HAL, the natural language command interface to 1-2-3. HAL is the product that first made Bill Gross (subsequently the founder of Knowledge Adventure and idealab!) and his brother Larry famous. However, it achieved no success*, and was quickly dropped from Lotus’ product line.

*I loved the product personally. But I was sadly alone.

In retrospect, it’s obvious why natural language interfaces failed. First of all, they offered little advantage over the forms-and-menus paradigm that dominated enterprise computing in both the online-character-based and client-server-GUI eras. If you couldn’t meet an application need with forms and menus, you couldn’t meet it with natural language either.

A features and syntax page reveals that the basic Google search box now gives you flight times, weather, stock quotes, sports scores, currency conversion, calculator results, and a lot more. Wow. I did not know.

Since the early 1980s, I’ve thought that natural language interfaces — spoken or otherwise — would someday win. While this versatility isn’t natural language per se, it still in my opinion is evidence in favor of that belief.

1. Clearly, in Google’s mission to “organize all the world’s information,” there are several web areas it isn’t yet doing well in, and one of those is microblogs. What’s more, much as in the case of YouTube, it’s hard to see how Google would do that organizing any time soon unless it owned or otherwise was in bed with the leading platform for that kind of content — i.e., Twitter.

2. The YouTube example is apt in another way as well — it’s not clear where the monetization would come from. Google famously doesn’t make much advertising revenue from YouTube. And Twitter is even worse as an advertising platform; sticking ads into the tweetstream would quickly drive users elsewhere, and any other advertising scheme would likely fail because of the broad variety of interfaces — such as various mobile phones — Twitterers use to get at the service.

3. I’ve been suggesting all along that Twitter needs radical user experience enhancements. But when has Google ever made user experience enhancements to a service? Its core search engine always looks pretty much the same. Ditto GMail. Ditto Blogger. Ditto YouTube.

TechCrunch pointed out a Twitter jobs page. The specific job TechCrunch mentioned* isn’t up there any more, but at the moment I write this, 18 others are (copied below). That’s considerable growth, given that the same page says Twitter has fewer than 30 current employees. Note the emphasis on search and the mention of Japan.

*Care and feeding of celebrity tweeters. Celebrity tweeting is actually a subject I’ve written and even been interviewed about several times.

Andy Beal has a blog post up to the effect that NoFollow is a bad thing. (Edit: Andy points out in the comment thread that his opposition to NoFollow isn’t as absolute as I was suggesting.) Other SEO types are promoting this as if it were some kind of important cause. I think that’s nuts; NoFollow is a huge spam-reducer.

The weakness of Andy’s argument is illustrated by the one and only scenario he posits in support of his crusade:

The result is that a blog post added to a brand new site may well have just broken the story about the capture of Bin Laden (we wish!)–and a link to said post may have been Tweeted and re-tweeted–but Google won’t discover or index that post until it finds a “followed” link. Likely from a trusted site in Google’s index and likely hours, if not days, after it was first shared on Twitter.

Helloooo — if I post something here, it is indexed at least in Google blog search immediately. (As in, within a minute or so.) Ping, crawl, pop — there it is. The only remotely valid version of Andy’s complaint is that it might take some hours for Google’s main index to update — but even there, there’s a News listing at the top. This simply is not a problem.
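The “ping” step above is just a standard XML-RPC call that blog software fires off on publish. A minimal sketch of the payload, using Python’s standard library (the blog name and URL are made-up placeholders, and the endpoint I mention in the comment is the Google blog-search ping service of that era):

```python
import xmlrpc.client

# Standard weblogUpdates.ping payload, as sent by blog software to
# ping services (e.g. blogsearch.google.com/ping/RPC2 back then).
# Parameters are the blog's display name and its URL — both are
# hypothetical placeholders here.
payload = xmlrpc.client.dumps(
    ("Example Blog", "http://blog.example.com/"),
    methodname="weblogUpdates.ping",
)
print(payload)
```

The ping service then schedules a crawl of the blog’s feed, which is why a new post can show up in blog search within a minute or so.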

Now, I think it would be great for me personally if all the links to my sites from Wikipedia and Twitter and the comment threads of major blogs pointed back with “link juice.” On the other hand, even with NoFollow out there, my sites come up high in Google’s rankings for all sorts of keywords, driving a lot of their readership. I imagine the same is true for most other sites containing fairly unique content that people find interesting enough to link to.

So other than making it harder to engage in deceptive SEO, I fail to see what problems NoFollow is causing.
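For readers unfamiliar with the mechanism: NoFollow is just a `rel="nofollow"` attribute on a link, which tells search engines not to pass ranking credit through it. Blog platforms apply it automatically to user-submitted comment links, which is exactly why it deters link spam. A minimal sketch of that rewriting step (the function name and regex approach are my own illustration, not any particular platform’s code):

```python
import re

def add_nofollow(html: str) -> str:
    """Add rel="nofollow" to every <a> tag that lacks a rel attribute.

    A toy version of what blog software does to links in
    user-submitted comments, so they pass no "link juice".
    """
    def fix(match: re.Match) -> str:
        tag = match.group(0)
        if "rel=" in tag:
            return tag  # leave any existing rel attribute alone
        # drop the closing '>' and append the nofollow attribute
        return tag[:-1] + ' rel="nofollow">'

    return re.sub(r"<a\b[^>]*>", fix, html)

comment = '<p>Nice post! <a href="http://example.com/">my site</a></p>'
print(add_nofollow(comment))
# → <p>Nice post! <a href="http://example.com/" rel="nofollow">my site</a></p>
```

Since the attribute is added server-side before the comment is published, a spammer gains nothing from flooding comment threads with links — which is the spam-reduction I’m defending.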

At Lynda Moulton’s behest, I spoke a couple of times recently on the subject of where “semantic” technology is or isn’t likely to be important. One was at the Gilbane conference in early December. The slides were based on my previously posted deck for a June talk I gave on a text analytics market overview. The actual Gilbane slides may be found here.

My opinions about the applicability of semantic technology include:

The big bucks in web search are for “transactional” web search, and semantics isn’t the issue there. (Slides 3-4)

When UIs finally go beyond the simple search box — e.g. to clusters/facets or to voice — semantics should have a role to play. (Slide 5)

Public-facing site search depends — more than any other area of text analytics — on hand-tagging. (Slide 7)