Steven Arnold | The search continues

Steven Arnold got an early start on search engines. In 1971, his employer, Halliburton Co., assigned him to digitize the company's technical reports in order to make them searchable. He has worked in the field ever since. In the past decade, he has moved over to consultancy, starting his own practice, Arnold IT. In 2000, he helped generate the technical plan for the first iteration of the General Services Administration's FirstGov government search engine. (His son, Erik Arnold, currently works on FirstGov.) More recently, he launched the Google Government Report (www.ggreport.com), a newsletter and electronic information service offering tips on how to be better recognized by Google. We caught up with Arnold to get his views on what is happening with both enterprise and Web search.

GCN: What is the state of search these days?

Steven Arnold: This has been a time when people are realizing that enterprise search doesn't work.

Folks with enterprise search systems are really on the lookout for technologies that make search more useful for the users.

So what will carry us into 2007 is a collection of technologies we think of as text mining, where software algorithms look at documents and find the names of people, places and things and attempt to relate them to one another.
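The idea of finding names and relating them to one another can be sketched in a few lines. This is only a toy illustration, not how any vendor mentioned in this interview works: the `GAZETTEER` lookup table, the function names and the sample text are all assumptions made for the example, and "related" here simply means "mentioned in the same sentence."

```python
import re
from itertools import combinations

# Hypothetical lookup table of known names and their types.
# Commercial systems use far larger vocabularies plus statistical models.
GAZETTEER = {
    "Steven Arnold": "person",
    "Halliburton": "organization",
    "Harrods": "organization",
    "England": "place",
}


def find_entities(sentence):
    """Return the known names that appear in a sentence, sorted alphabetically."""
    return sorted(name for name in GAZETTEER if name in sentence)


def related_pairs(text):
    """Pair up names that occur together in the same sentence."""
    pairs = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        found = find_entities(sentence)
        pairs.extend(combinations(found, 2))
    return pairs


if __name__ == "__main__":
    sample = ("Steven Arnold worked at Halliburton. "
              "She saw a jumper at Harrods in England.")
    for left, right in related_pairs(sample):
        print(f"{left} ({GAZETTEER[left]}) <-> {right} ({GAZETTEER[right]})")
```

The pairs produced this way could be stored as extra index terms alongside a document, which is the kind of output a search system might then consume.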

GCN: What is wrong with search technologies today?

Arnold: Take a real-life example: You and your significant other go to England and she says on the way home, 'I loved that jumper that I saw at Harrods.'

So, if you're like me you don't have any clue about what she's talking about. So now I go to a search system, like at Neiman Marcus Online or Overstock.com, and type in the word 'jumper.' What I get back is not that sweater that she saw in Harrods. I get stuff back unrelated to that idea. And that is a very common problem.
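The 'jumper' problem is a vocabulary mismatch: the shopper's word and the catalog's word differ. One common remedy is to expand the query with synonyms before matching. The sketch below assumes a hand-built thesaurus (`REGIONAL_SYNONYMS`) and an invented three-item catalog; neither comes from any retailer named above.

```python
# Hypothetical thesaurus mapping a British term to U.S. equivalents.
REGIONAL_SYNONYMS = {
    "jumper": {"sweater", "pullover"},
}


def expand_query(query):
    """Return the query words plus any known synonyms, lowercased."""
    terms = set()
    for word in query.lower().split():
        terms.add(word)
        terms |= REGIONAL_SYNONYMS.get(word, set())
    return terms


def search(query, catalog):
    """Return catalog items that share at least one word with the expanded query."""
    terms = expand_query(query)
    return [item for item in catalog if terms & set(item.lower().split())]


if __name__ == "__main__":
    catalog = ["Cashmere sweater", "Denim jumper dress", "Leather boots"]
    print(search("jumper", catalog))
    # ['Cashmere sweater', 'Denim jumper dress']
```

Without the expansion step, only the denim dress would match, which is the unrelated result Arnold describes.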

GCN: What are the text mining companies doing that the search companies can't?

Arnold: From a technical point of view, companies like Attensity Corp. and nStein Technologies are not focused on search. They are focused on figuring out the nuances, relationships and the important concepts in a document. Their systems generate index terms that an enterprise search system can suck in.

GCN: So how does this technology work?

Arnold: Every one of the new companies that I have looked at (and I have a text mining report where I have been tracking this) is approaching the problem both mathematically and by doing vocabulary and knowledge-based analysis. Their software decomposes sentences into subjects, verbs and adjectives and analyzes the results with the predictive algorithms.

What is unusual is that the computer chips are so darn powerful now. These new companies are basically saying that they are going to use these chips and throw everything we learned in computer science at the search problem and it will work out just fine.

So what you have now are hybrid [search/text mining] systems and by golly they are very interesting. When you use one of those [services] against scientific and technical information, your accuracy can be 85 or 90 percent. If you run it against general business news, you can hit 80 to 85 percent accuracy. That's almost as good as a human search.

The system can run automatically, and when something [unusual] comes up, you have a person who has been through school look at the error report and fix the mistakes. When someone spells Al Qaida differently, and an exception comes up, an analyst can look at that and say, 'This is an OK spelling.'
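The exception workflow Arnold describes, where software flags an unfamiliar spelling and a person approves it, might look roughly like this. The `SpellingReview` class, its method names and the 0.7 similarity cutoff are assumptions made for this sketch; the fuzzy matching uses `difflib.get_close_matches` from the Python standard library rather than any vendor's algorithm.

```python
import difflib


class SpellingReview:
    """Toy review queue for name spellings the system has not seen before."""

    def __init__(self, known):
        self.known = set(known)   # spellings already approved
        self.pending = []         # spellings waiting for an analyst

    def check(self, name):
        """Classify a name as accepted, needing review, or unknown."""
        if name in self.known:
            return "accepted"
        # Close to a known spelling: raise an exception for a human to review.
        if difflib.get_close_matches(name, sorted(self.known), n=1, cutoff=0.7):
            self.pending.append(name)
            return "needs review"
        return "unknown"

    def approve(self, name):
        """Analyst decision: 'This is an OK spelling.'"""
        self.pending.remove(name)
        self.known.add(name)


if __name__ == "__main__":
    review = SpellingReview(["Al Qaida"])
    print(review.check("Al Qaeda"))   # needs review
    review.approve("Al Qaeda")
    print(review.check("Al Qaeda"))   # accepted
```

Once a variant is approved it joins the known set, so the same spelling does not raise an exception again.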

GCN: So we had no idea this sea change was happening in the industry.

Arnold: Yeah, it is going on under everybody's nose because so much focus is given to Google.

Google actually has some nifty technology like this, but it is so anchored in the consumer space. These smaller companies are like cigarette boats racing around the destroyer.

The age of innovation for this is not over. I've been at conferences this year where people are saying, 'Yeah, well it's over. Microsoft is giving it away, or Oracle is there, Google is there. There is no room for innovation.' I just don't agree.

GCN: Why should agencies care about Google?

Arnold: Let me give you an anecdote. I got invited to meet with the people from a large insurance company in Denmark. I asked how much of their traffic came from Google [searches]. And they said they thought about 35 to 40 percent. I asked if there is a way to check, and they said they could do it right there from a laptop. [When they checked], they looked at me and said 'You know what? Last month Google was 80 percent of our traffic.'

GCN: So you are seeing an increase in Google-derived traffic within the last few months?

Arnold: Literally, within the last 12 to 16 weeks. The anecdote underscores what we've seen in other work, that Google is basically the search engine of choice for virtually everyone in the world.

As people realize how much traffic comes to them from Google, it becomes more important to understand what other people are doing to make sure their Web site is indexed by Google and their sites come up in the context of the proper keywords. You really now have to pay attention to Google not because Google is the greatest company on the planet, but because Yahoo and Microsoft just haven't done that good of a job competing.

GCN: So what can agencies do to better present their pages to Google?

Arnold: The first step is to create a sitemap that conforms to Google's guidelines, because Google has already convinced Microsoft and Yahoo to follow its formula. So that is job one.
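For readers unfamiliar with sitemaps, the shared format is a small XML file listing a site's URLs, using the namespace published at sitemaps.org. The sketch below builds one with Python's standard `xml.etree.ElementTree` module. The `build_sitemap` helper and the `www.example.gov` addresses are hypothetical, and only the `loc` and optional `lastmod` elements are shown.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, last_modified) pairs.

    last_modified is a date string such as '2006-12-01', or None to omit it.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Placeholder addresses, not real agency pages.
    print(build_sitemap([
        ("https://www.example.gov/", None),
        ("https://www.example.gov/reports/crop-production", "2006-12-01"),
    ]))
```

The resulting file is typically saved as sitemap.xml at the site root and submitted to the search engines.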

Job two is to take a very hard look at the page names and the URLs on your site. Most government Web pages I look at have very long and complicated URLs, and Google's robots can process those, but they prefer to process human-understandable URLs.

The third thing is the government needs to do a better job with content. The government has great information. The Department of Agriculture has outstanding information, but it is presented in such a way that makes it really hard to index and search effectively. If you want a good report, you have to download a huge PDF file.

So I think the government has to make more of its content easy to comprehend, and not put out these 5-megabyte globs of data.
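One way to get human-understandable URLs is to derive the path from the page title instead of from a database key or session string. The helper below is a minimal sketch of that idea; the function name `readable_slug` and the sample title are invented for illustration, and how much weight any search engine gives to readable URLs is Arnold's claim, not something this code demonstrates.

```python
import re


def readable_slug(title):
    """Turn a page title into a lowercase, hyphen-separated URL segment."""
    slug = title.lower()
    # Collapse every run of characters other than a-z and 0-9 into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", slug)
    return slug.strip("-")


if __name__ == "__main__":
    print(readable_slug("Crop Production: 2006 Annual Summary"))
    # crop-production-2006-annual-summary
```

A path such as /reports/crop-production-2006-annual-summary tells both a person and a crawler what the page is about before it is opened.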

GCN: Any thoughts on the battle between the two premier U.S.-government-focused search engines, FirstGov and Google U.S. Government search?

Arnold: There is no battle at all. I really believe FirstGov does much more focused indexing.

When Google sends its robots to an agency Web site, it looks at the links, indexes the first 100,000 characters per page and follows the links two levels down. But FirstGov looks hard at these sites and goes very deep into the site. Remember the FirstGov [result] set will be much smaller and more focused than the Google set, which will be very broad.

If you work for an information service, you certainly can start a search with Google, but if you want to be thorough, you will have to look at FirstGov. If you're a government worker, you might want to start with FirstGov, but you definitely want to take a look at what Google has indexed.

So think of FirstGov as drilling down into a topic and Google going very broad across many topics. So the two services are complementary.