This is the third in a series on cost hot spots in behind-the-firewall search. This essay does not duplicate the information in Beyond Search, my new study for the Gilbane Group. This document is designed to highlight several functions or operations in an indexing subsystem that can cause system slowdowns or bottlenecks. No specific vendors’ systems are referenced in this essay. I see no value in finger pointing because no indexing subsystem is without potential for performance degradation in a real-world installation. – Stephen Arnold, February 29, 2008

Indexing: Often a Mysterious Series of Multiple Operations

One of the most misunderstood parts of a behind-the-firewall search system is indexing. The term indexing itself is the problem. For most people, an index is the keyword listing that appears at the back of a book. For those hip to the ways of online, indexing means metatagging, usually in the form of a series of words or phrases assigned to a Web page or an element in a document; for example, an image and its associated text. The actual index in your search system may not be one data table. The index may be multiple tables or numeric values that “float” mysteriously within the larger search system. The “index” may not even be in one system. Parts of the index may be in different places, updated in a series of processes that cannot be easily recreated after a crash, software glitch, or other corruption. This CartoonStock.com image makes clear the impact of a search system crash.

Centuries ago, people lucky enough to have a “book” learned that some sort of system was needed to find a scroll, stored in a leather or clay tube, sometimes chained to the wall to keep the source document from wandering off. In the so-called Dark Ages, information was not free, nor did it flow freely. Information was something special and of high value. Today, we talk about information as a flood, a tidal wave, a problem. It is ubiquitous, without provenance, and digital. Information wants to be free, fluid, moving around, unstable, and dynamic. For indexing to work, you need a specific object at a point in time to process; otherwise, the index is useless.

Also, the index must be “fresh”. Fresh means that the most recent information is in the system and therefore available to users. With lots of new and changed information, you have to determine how fresh is fresh enough. Real-time data also poses a challenge. If your system can index 100 megabytes a minute and you have to keep up with larger volumes of new and changed data, something’s got to give. You may have to prioritize what you index: you handle high-priority documents first, then shift to lower-priority documents until new higher-priority documents arrive. This triage affects the freshness of the index. Alternatively, you can throw more hardware at your system, thus increasing capital investment and operational cost.

Index freshness is important. A person in a professional setting cannot do “work” unless the digital information can be located. Once located, the information must be the “right” information. Freshness matters, but there are also issues with versions of documents. These are indexing challenges, and they can require considerable intellectual effort to resolve. You have to get freshness right for a search system to be useful to your colleagues.
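The triage idea can be reduced to a simple priority queue: high-priority documents are indexed first, and lower-priority material waits. The sketch below is a minimal, hypothetical illustration, not any vendor’s implementation; the document names and priority levels are invented.

```python
import heapq
import itertools

class IndexingQueue:
    """Minimal triage queue: lower priority numbers are processed first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # preserves FIFO order within a priority

    def submit(self, doc, priority):
        # priority=1 is "index immediately"; larger numbers wait their turn
        heapq.heappush(self._heap, (priority, next(self._counter), doc))

    def next_document(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

queue = IndexingQueue()
queue.submit("press_release.html", priority=2)
queue.submit("ceo_memo.doc", priority=1)
queue.submit("archived_report.pdf", priority=3)

order = []
while (doc := queue.next_document()) is not None:
    order.append(doc)
print(order)  # the high-priority memo jumps the queue; the archive waits
```

The cost trade-off described above shows up directly in this model: anything stuck at the bottom of the heap is, by definition, stale until capacity frees up.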
In general, the more involved your indexing, the more important the architecture and engineering of the “moving parts” in your search system’s indexing subsystem become.

Why is indexing a cost hot spot? Let’s look at some hot spots I have encountered in the last nine months.

Remediating Indiscriminate Indexing

When you deploy your behind-the-firewall search or content processing system, you have to tell your system how to process the content. You can operate an advanced system in default mode, but you may want to select certain features and levels of stringency and make sure that you are familiar with the various controls available to you. Investing time in testing prior to deployment pays off when troubleshooting. The first cost hot spot is disk thrashing or long indexing times. You come in one morning, check the logs, and learn that no content was processed. In Beyond Search I describe some steps you can take to troubleshoot this condition. If you can’t remediate the situation by rebooting the indexing subsystem, then you will have to work through the vendor’s technical support group, restore the system to a known good state, or – in some cases – reinstall the system. When you reinstall, some systems cannot use the backup index files. If you find that your backups won’t work or deliver erratic results on test queries, then you may have to rebuild the index. In a small two-person business, the time and cost are trivial. In an organization with hundreds of servers, the process can consume significant resources.

Updating the Index or Indexes

Your search or content processing system allows you to specify how frequently the index updates. When your system has robust resources, you can specify indexing to occur as soon as content becomes available. Some vendors talk about their systems as “real time” indexing engines. If you find that your indexing engine starts to slow down, you may have encountered a “big document” problem. Indexing systems make short work of HTML pages, short PDFs, and emails. But as document size grows, the indexing subsystem needs more time to process each item. I have encountered situations in which a Word document includes embedded objects that are very large, and the indexing subsystem must grind away on the monster file. If you hit a patch characterized by a large number of big documents, the indexing subsystem will appear to be busy, but its output falls sharply.

Let’s assume you build your roll-out index based on a thorough document analysis. You have verified security and access controls so the “right” people see the information to which they have appropriate access. You know that the majority of the documents your system processes are in the 600-kilobyte range over the first three months of indexing subsystem operation. Suddenly the typical document size leaps to six megabytes, and big documents become more than 20 percent of the document throughput. You may learn that the setup of your indexing subsystem or the resources available to it are hot spots.

Another situation concerns different versions of documents. Some search and content processing systems identify duplicates using date and time stamps. Other systems include algorithms to identify duplicate content and remove it or tag it so the duplicates may or may not be displayed under certain conditions. A surge in duplicates may occur when an organization is preparing for a trade show. Emails with different versions of a PowerPoint may proliferate rapidly.
Obviously, indexing every six-megabyte PowerPoint makes sense if each PowerPoint is different. How your indexing subsystem handles duplicates is important. A hot spot occurs when a surge of files with the same name but different date and time stamps is fed into the indexing system. The hot spot may be remediated by identifying the problem files and deleting them manually or via your system’s administrative controls. Versions of documents can become an issue under certain circumstances, such as a legal matter. Unexpected indexing subsystem behavior may be related to a duplicate file situation.

Depending on your system, you will have some fiddling to do in order to handle different versions of documents in a way that makes sense to your users. You also have to set up a de-duplication process in order to make it easy for your users to find the specific version of the document needed to perform a work task. These administrative interventions are not difficult when you know where to look for the problem. If you cannot pinpoint a specific problem, the hunt for the hot spot can become time consuming.
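The difference between timestamp-based and content-based duplicate detection can be sketched in a few lines. This is a hypothetical illustration of the general technique, not any vendor’s method; the file names, timestamps, and contents are invented. Files with the same name and different date stamps count as duplicates only when their bytes actually match.

```python
import hashlib

def content_fingerprint(data: bytes) -> str:
    """Hash the bytes, so two copies with different timestamps still collide."""
    return hashlib.sha256(data).hexdigest()

def deduplicate(documents):
    """documents: iterable of (name, timestamp, raw_bytes) tuples.
    Keeps the first copy of each distinct content fingerprint."""
    seen = set()
    unique = []
    for name, timestamp, data in documents:
        key = content_fingerprint(data)
        if key not in seen:
            seen.add(key)
            unique.append((name, timestamp))
    return unique

docs = [
    ("tradeshow.ppt", "2008-02-01T09:00", b"version 1 slides"),
    ("tradeshow.ppt", "2008-02-02T14:30", b"version 1 slides"),  # re-mailed copy, new stamp
    ("tradeshow.ppt", "2008-02-03T11:15", b"version 2 slides"),  # genuinely new version
]
print(deduplicate(docs))  # the re-mailed copy is dropped; both real versions survive
```

A system keying only on name plus timestamp would index all three files; a system keying only on name would keep one and silently lose the second version. Hashing the content threads the needle, which is why many de-duplication schemes work this way.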

Common Operations Become a Problem

Once an index has been constructed – a process often called indexation – incremental updates are generally trouble free. Notice that I said generally. Let’s look at some situations that can arise, albeit infrequently.

Index Rebuild

You have a crash. The restore operation fails. You have to reindex the content. Why is this expensive? You have to plan the reindexing and then babysit the update. For reindexing you will need the resources required when you performed the first indexation of your content. In addition, you have to work through the normal verifications for access, duplicates, and content processing each time you update. Whatever caused the index restore operation to fail must be remediated, a backup created when reindexing is completed, and then a test run to make sure the new backup restores correctly.

Indexing New or Changed Content

Let’s assume that you have a system, and you have been performing incremental indexes for six months with no observable problems and no red flags from users. Then users with no prior history of complaining about the search system report that certain new documents are not in it. Depending on your search system’s configuration, you may have a hot spot in the incremental indexing update process. The cause may be related to volume, configuration, or an unexpected software glitch. You need to identify the problem and figure out a fix. Some systems maintain separate indexes based on a maximum index size. When the index grows beyond a certain size, the system creates or allows the system administrator to create a second index. Parallelization makes it possible to query index components with no appreciable increase in system response time. A hot spot can result when a configuration error causes an index to exceed its maximum size, halting the system or corrupting the index itself, although other symptoms may be observable.
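The parallel-query idea mentioned above – several index components searched at once, results merged – can be sketched as follows. The in-memory “shards” and their relevance scores are hypothetical stand-ins for separate index files; real systems would dispatch the query over a network or to separate processes.

```python
from concurrent.futures import ThreadPoolExecutor

# Three hypothetical index components, each mapping a document title to a score.
shards = [
    {"budget memo": 0.9, "travel policy": 0.4},
    {"budget forecast": 0.8},
    {"holiday schedule": 0.7, "budget memo": 0.2},
]

def search_shard(shard, term):
    """Scan one index component for titles containing the query term."""
    return [(doc, score) for doc, score in shard.items() if term in doc]

def federated_search(term):
    # Query all components concurrently, then merge the partial result lists.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(search_shard, shards, [term] * len(shards))
    merged = {}
    for partial in partials:
        for doc, score in partial:
            merged[doc] = max(merged.get(doc, 0.0), score)  # keep best score per doc
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

print(federated_search("budget"))
```

Because the components are searched in parallel, response time is governed by the slowest shard rather than the sum of all shards – which is also why a single oversized or corrupted component drags the whole system down.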
Again – the key to resolving this hot spot is often configuration and infrastructure.

Value-Added Content Processing

New search and content processing systems incorporate more sophisticated procedures, systems, and methods than systems did a few years ago. Fortunately, faster processors, 64-bit chips, and plummeting prices for memory and storage devices allow indexing systems to pile on the operations and still maintain good indexing throughput – easily several megabytes a minute to five gigabytes of content per hour or more. If you experience slowdowns in index updating, you face some stark choices when you saturate your machine capacity or storage. In my experience, these are:

Reduce the number of documents processed

Expand the indexing infrastructure; that is, throw hardware at the problem

Turn off certain resource-intensive indexing operations; in effect, eliminating some of the processes that use statistical, linguistic, or syntactic functions.

One of the questions that comes up frequently is, “Why are value-added processing systems more prone to slowdowns?” The answer is that when the number of documents processed goes up or the size of documents rises, the infrastructure cannot handle the load. Indexing subsystems require constant monitoring and routine hardware upgrades. Iterative systems cycle through processes two or more times, and some iterative functions depend on other processes; for example, until the linguistic processes complete, another component – say, entity extraction – cannot finish. Many current indexing systems are parallelized. But situations can arise in which indexing slows to a crawl because a software glitch fails to keep the internal pipelines flowing smoothly. If process A slows down, the lack of available data means process B waits. Log analysis can be useful in resolving this hot spot.

Crashes: Still Occur

Many modern indexing systems can hiccup and corrupt an index. One way to survive a corrupt index is to run two systems: when one fails, the other continues to function. But many organizations can’t afford tandem operation and hot failovers. When an index corruption occurs, some organizations restore the index to a prior state. A gap may exist between the point captured in the backup and the index state at the time of the failure. Most systems can determine which content must be processed to “catch up”. Checking the rebuilt indexes is a useful step after a crash, once the index has been restored and rebuilt. Keep in mind that backups are not foolproof. Test your system’s backup and restore procedures to make sure you can survive a crash and get the system operational again.
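The dependency between pipeline stages – process B starving when process A slows – can be sketched with two worker threads joined by a queue. This is a minimal, hypothetical model of the general pattern, not any vendor’s architecture; the stage names and documents are invented.

```python
import queue
import threading

# Two-stage pipeline: entity extraction (stage B) can only consume what
# linguistic processing (stage A) has produced. If A stalls, B's inbox
# runs dry and B simply blocks on get().

def linguistic_stage(inbox, outbox):
    while (doc := inbox.get()) is not None:
        outbox.put(doc + " [tokenized]")
    outbox.put(None)  # tell the downstream stage the stream is finished

def entity_stage(inbox, results):
    while (doc := inbox.get()) is not None:
        results.append(doc + " [entities tagged]")

raw = queue.Queue()        # feed for stage A
tokenized = queue.Queue()  # pipe between A and B
results = []

a = threading.Thread(target=linguistic_stage, args=(raw, tokenized))
b = threading.Thread(target=entity_stage, args=(tokenized, results))
a.start()
b.start()

for doc in ["doc1", "doc2"]:
    raw.put(doc)
raw.put(None)  # end-of-stream marker
a.join()
b.join()
print(results)
```

In a real system the queues would carry parsed document structures and the stages would be separate processes or machines, but the failure mode is the same: throughput of the whole pipeline is capped by its slowest stage, which is why stage-by-stage log analysis pinpoints this hot spot.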

Wrap Up

Let’s step back. The hot spots for indexing fall into three categories. First, you have to have adequate infrastructure. Ideally your infrastructure will be engineered to permit pipelined functions to operate rapidly and without latency. Second, you will want to have specific throughput targets so you can handle new and changed content, whether your vendor requires one index or multiple indexes. Third, you will want to understand how to recover from a failure and have procedures in place to restore an index or “roll back” to a known good state and then process content to ensure nothing is lost.

In general, the more value-added content processing you use, the greater your potential for hot spots. Search used to be simpler from an operational point of view. Keyword indexing is very straightforward compared to some of the advanced content processing systems in use today. The performance of any system fluctuates to some extent. As sophisticated as today’s systems are, there is room for innovation in the design, architecture, and administration of indexing subsystems. Keep in mind that more specific information appears in Beyond Search, due out in April 2008.

Last year about this time (February 2007), I wrote a 20-page white paper about Google’s publishing inventions for a consulting firm providing advisory services to the “traditional” publishing industry. You can read my full analysis of Google’s publishing inventions in my Google Version 2.0 study.

I’m inclined — perhaps incorrectly — to think of traditional publishing as a business sector that emphasizes school ties, connections, and a business model that would be recognizable to Gutenberg.

After writing the white paper, part of the deal was that I would give a short talk at an invitation-only, super-exclusive publishing industry enclave at an exotic resort. Before I gave my talk, the sleek, smooth-talking facilitators guided about two dozen publishing moguls through an agenda shot through with management buzz words I hadn’t heard since my days at Booz, Allen & Hamilton. I thought, “People still pay money to hear this baloney, I guess.”

My Remarks: The Three-Minute Version

In my brief talk, I reviewed five points about Google and publishing; to wit:

Technology. Google has actively invested in systems, methods, and companies that allow a Google user to create, edit, format, share, and distribute content for more than eight years. Based on my analyses of Google’s patent applications and engineering documents, Google’s content creation components are one of a half dozen subsystems designed to allow the Google application platform to function as an integrated content system. Think of Google as building a system that performs search and online ads plus the “value adding” functions of a traditional publisher. One example is JotSpot, acquired in 2006.

“Fabric” tactics. Instead of creating a single publication to compete, Google is building out a fabric of functions. In traditional publishing, competition was gentlemanly and involved identifying a niche, running some tests, and launching a new magazine, publishing a book, or creating an information service. The approach has been essentially unchanged since broadsheets first competed for readers hundreds of years ago. Publishers were polite, observed certain rules of engagement, and attacked aggressively while following a “code of conduct”. The approach is similar to that used by Alexander the Great or Caesar: competitive battles are one-to-one fights between armies using well-known, obvious tactics. Google operates in a different way, poking and probing niches. Instead of bulldozing forward, Google lets beta tests pull the company where there’s an opportunity.

Real-time data and real-time adaptation. Traditional publishers are not real-time operations. Even the wonderful Wall Street Journal has characteristics of a peer-reviewed journal. Stories, particularly the feature-like analyses, can be in process for as much as six months. A math journal review process can take a year or more. Daily newspapers chatter about real-time, but the publications close at a specific time, and if something important happens after that time, editors cover the news in another edition, maybe a day or more later. Google, on the other hand, pays attention to traffic and user behavior, and it can adjust quickly. To see this in action, navigate to Google News and hit the refresh button every few minutes. You may see the changes take place as the Google system adapts to users, information, and system functions.

Polymorphism. A traditional publisher often keeps a low profile; for example, a Thomson, Reed Elsevier, or the New York Times Co. The idea is that a particular “property” will manifest the image of the organization. Thomson is better known by “information products” such as West Publishing. The New York Times has many interests, but unless there’s some upheaval like the executive ouster at About.com, most people perceive this outfit as the “gray lady”, the New York Times newspaper. Google, on the other hand, is perceived in terms of search and advertising. With Type A Wall Street wizards counting ad revenues, there’s no reason to worry about any Google activity that doesn’t generate billions of dollars every 90 days. A traditional publisher trying to figure out Google has to cut through a lot of static to get to the Google base station. When a person does get closer to Google’s non-search and ad interests, the flashy Google Maps, Google Books, or Google Docs seem suggestive to me. Once again, perception can be off kilter. Information, Napoleon is alleged to have said, “is nine-tenths of any battle.” I think publishers have pegged Google incorrectly. As I said in my February 2007 discussion, “Google poses a different problem… asymmetric threats in multiple sectors simultaneously.”

Cost advantages. Traditional publishers face cost challenges. Whether the challenge is rebellious writers, inflexible union contracts, or raw material scarcity — publishers struggle to generate a profit. Organic growth is an ever tougher problem even for blue-chips like Dow Jones & Co. Few outside closed book-publishing circles know that a blockbuster keeps some publishing companies in business. A company with a hot college textbook can plunge into red ink with the loss of a textbook adoption in college psychology or economics courses. A newspaper can take a painful financial karate chop when one local auto dealer cancels her full-page, full-color ads in next Sunday’s newspaper. The Google infrastructure operates on a different cost basis, and Google has different business model options to exercise.

Not surprisingly, my remarks met with a less-than-enthusiastic reception. There was push back that I heard as ineffectual. This particular group of publishing giants argued without knowing much about Google. The themes of Google’s naiveté, its failure to respect copyright, its track record of failure outside of search, its dependence on online advertising, and the other arguments were ones I had heard before. Instead of arguing, I let these superstars convince themselves that Google was an anomaly. I talked with a couple of people and left. I was surprised when I was paid by the meeting organizer; I had concluded that I would be stiffed since I upset the blue-bloods, and these folks did not want their world view challenged by someone who lived in Kentucky, where literacy ranked in the lower quartile in the United States.

Here’s a screen shot from JotSpot before Google acquired it. Take a look, and scan my observations about Joe Kraus’s company.

Copyright Google 2006

JotSpot is a content creation tool, a content management system, and a dissemination system that supports collaboration. It is a next-generation, social publishing system that allows a user to select a template, enter content directly or via a script, and take content far beyond the confines of ink on paper. JotSpot is a component that complements other publishing-related functions in the Google system; for example, the little-known invention at Google that assembles custom content pages with ads automatically in response to Google actions. See, for example, patent application US20050096979. Screen shots for the Google Sites’ version of JotSpot are here, but you may need to scroll down to see the thumbnails. I fancy the one that looks like a magazine.

Any content entered in this system is structured; that is, tagged. The information is, therefore, indexable and contains metatags about the meaning and context of the information. As a result, the content can be sliced and diced – what traditional publishers call “repurposed”. The difference is that traditional publishers store content in XML data structures and rely mostly on human editors to “add value”, with some automatic processes. At Google, software systems and methods perform most of the repurposing, and humans can be involved if deemed necessary. The processed content can be manipulated by Google’s library of processes, procedures, and functions, guided by Google’s smart software. (I’m working on a report about Google’s use of computational intelligence at this time, February 2008.)

The Google Sites service gives Google a wiki capability. However, that’s just one use of the system. JotSpot embedded in Google Apps allows organizations to take a baby step away from the expensive, overly complex, and poorly engineered content management systems that plague users. In addition, the JotSpot function makes it easy for Google to approach a well-known expert, ask her to input information on a specialty, and make that knowledge available as part of a beta service such as Google Health.

I know it’s difficult to conceptualize Google as a digital version of Henry Ford’s River Rouge facility. Raw iron ore goes in one end, and a Ford automobile comes out the other end, gets put on a Ford truck, and is delivered to a Ford dealership. This type of integration is out of favor in our era of outsourcing. Google is breaking with the received wisdom and using its application platform to marry its systems and methods with certain integrated manufacturing business practices. Google has taken the extreme integration of Henry Ford’s vision and implemented it in digital form. This marriage of an old idea with Google’s platform is remarkable for its scope, efficiency, and scalability. Publishers don’t “think” like Google. It’s hard, then, for publishers to get their arms around Google.

What makes Google interesting to me is its application platform. Publishing is just one of the business sectors that the company can probe. Keep in mind that these initiatives are tests, conducted in plain sight, and available for anyone to analyze.

Navigate to Techmeme.com or Megite.com. Scan the postings. You will find brilliant commentaries, insightful analyses, and great writing about Google Sites and its features.

What’s missing is the connection between the functionality of Google Sites and the broader implications for content acquisition, processing, distribution, repackaging, and vending. The particular use of JotSpot is indeed interesting, but the more important way to think about Google Sites is in that broader context.

Traditional publishers will point out that Google Sites is basic, lacks features, and can’t deliver the “value adds” that define the high-value products produced by the Financial Times, McGraw-Hill, Elsevier Science, Wolters-Kluwer, and others in “real” publishing, not the fuzzy world of Google “publishing”.

I’m thinking that Google may have a significant impact on publishing with comparatively little investment, effort, saber rattling, or ramp up. Cost, distribution, real-time response, personalization, search, online ads, and interesting systems and methods could trigger an earthquake in the insular world of print-on-paper publishing. What do you think?

In both The Google Legacy and Google Version 2.0, I include a recital of the flaws that could cripple or kill Google. Most of these are now routine furniture in the warehouse of articles, books, and reports about the company.

In light of the data from comScore, summarized at MarketWatch, Google’s magnetism for Wall Street accolades has been reduced. The larger question is, “Will Google rebound, or will it be forced to limp forward?” Wall Street wants its champions to be like a youthful Alexander — a vanquisher, not just a winner.

I want to revisit the weaknesses I began cataloging in 2002 when I started paying attention to Google.

Technology. My position has been and remains that Google has engineered a technical competitive advantage. As the company becomes larger and places more demands on its plumbing, however, the risk of failure grows. In some ways, Google is a more innovative technical platform than Microsoft or Yahoo. Amazon, which has been working overtime to out-Google Google in cloud-based services, watched as its S3, EC2, and SimpleDB platform choked, then survived on life support. Google has not faced this problem, but the company has, as recently as February 26, 2008, experienced slowdowns in Google image search, some glitches with Gmail, and similar hiccups.

Management. My view is that the troika of Eric Schmidt, Sergey Brin, and Larry Page has been one important ingredient in Google’s success. Now that some Googlers can cash in their Google options to become Xooglers (ex-Googlers), there may be some gaps for the company to fill. There’s a serious, global shortage of Google-grade mathematician–computer scientists who can manage. There’s also a shallower pool of 20-something whiz kids, but that’s a secondary issue. Management may well be the more critical challenge at Google. As recently as Monday, February 25, 2008, I heard from a person interacting with Google, “These guys don’t follow through.” Maybe this is one person’s opinion, but it’s a datum.

Competitors. Most organizations face competition. The Sears–Kmart operation has to deal with Target. Google has no comparable direct nemesis. In fact, since the morphing of Backrub into Google in 1997, Google has had a free run. Competitors either fell short (Yahoo) or retired from the field of battle (Lycos). For the foreseeable future, Google is operating in an arena where the opponents are still in their locker rooms.

Lawyers. The legal process is always a great way to hobble an organization. Look at Microsoft’s squabbles with the European Union. Microsoft’s kinder, gentler ways have done little to get the EU’s regulators to look at European monopolies such as professional publishing, weapons, and pharmaceuticals. Google has parked most of its attorneys about a mile from the senior managers’ offices. Despite the reams of legal papers dropped at Google’s feet, the company has not been seriously encumbered. Regulators, as I have asserted in my talks and writings for various firms, don’t know what to make of Google. If you don’t understand a company’s business, it’s tough for a specific regulatory group to craft an action. Google is search and ads to most politicos.

Revenue. Google remains a one-trick pony when it comes to cash. Logically, a report suggesting that Google’s core business is slowing becomes the indicator of Google’s future. As I noted in The Google Legacy and Google Version 2.0, Google has been working hard to diversify its revenue. The problem the company faces is that new revenue streams are small when compared to revenue from online ads. The Google Enterprise unit is doing well, according to my sources. The Google Search Appliance, I have heard from those close to the company, has more than 8,500 licensees. The Google geospatial products are hot, hot, hot, opening doors for the new Google Apps Software as a Service business. At the end of calendar 2007, the Enterprise unit was generating revenue in the $350 million range. In 36 months, the unit has grown larger than other vendors in the behind-the-firewall search sector, but in comparison with ad revenue, Google’s 36-month-old enterprise unit is loose change.

The media reaction to Google’s sub-$500 share price, the downturn in Google growth, and the comScore report leaves me with the impression that the honeymoon with Google is drawing to a close. For the first time since the company emerged from Stanford University, you can see the Wall Street leopard’s spots. Criticism of Google’s “Don’t be evil” tag line is ratcheting upwards. Privacy wonks are salivating over Google’s alleged warehouse of user data. For a positive view of Google — before the Google stock price rebounded — navigate to A VC: The Musings of a VC in New York. Is it the beginning of the end for Google? I don’t think so. Let me foreshadow my speech at AIIM in Boston, Massachusetts, next week:

Google has an application platform. The company has not leveraged that platform as effectively or as rapidly as it could have. Going forward, I believe Google will become increasingly aggressive in multiple business sectors. Mobile is one sector, and Google has probes into health, publishing, back office services, and others.

Google has been playing coy with integrators and resellers. Google accepts or rejects partner companies on sometimes fuzzy logic. Google’s impact in the large enterprise market could be increased by tweaking its partner–reseller–integrator strategy.

Enterprises have data management problems. Google is a reasonably competent data management company. With some rifle shot marketing, Google is in a position to approach certain large firms and land business because many organizations are unable to get their IBM DB2, Oracle, and SQLServer database systems to handle “big data”.

Monetization options. Google’s patent applications reveal a wide range of monetization options. But I want to ask a question: “Would you pay to access the Google search system?” I would, and I would pay for premium access. Let my father, who is in his mid-80s, surf for weather in Brazil and news at no charge. I would pay for the types of Google features I find most useful; for example, personalization, redundant data storage, and Google Trends, to name three.

I am not ready to pooh-pooh the comScore numbers. I am not ready to ignore the substantial body of research I have amassed about Google’s technology. I think Google has some staying power. It will take a lot more than some Web traffic and click analysis before I see Googzilla as a five-inch chameleon basking in the sun.

In May 2008, I will be doing the end note talk at the Enterprise Search Summit 2008. This is a conference owned and managed by Information Today, Inc. This may be the third or fourth year that I have anchored the program. Last year, Sue Feldman, IDC’s well-known search wizard, and Robert Peck, Managing Director of BearStearns’ Internet unit, “debated” me. The idea is that I am known to be controversial, so representatives of the received wisdom about “enterprise search”, a term I don’t like, are lined up against me. For May 2008, I’m not certain what Information Today has planned to counterbalance my contrarian views of behind-the-firewall search.

I worked yesterday to locate my remarks from 2007 and come up with observations based on my research since May 2007. I have two studies under my belt in the last 10 months – Google Version 2.0 and Beyond Search: What to Do When Your Search System Doesn’t Work. Google is an interesting company, and I will be talking about its impact on enterprise software at the AIIM Show in Boston on March 4, 2008. My research for Beyond Search unearthed a number of interesting facts and insights. I am inclined to lean heavily on that information for the Enterprise Search Summit 2008 “controversial” end note.

I want to outline my preliminary thinking for my May 2008 remarks and invite comments on my views. Accordingly, here’s the table I created yesterday:

Vendors: Buy outs, staff reductions, and repositioning are making it tough for potential buyers to know what search vendors have on offer. Examples: Autonomy and Zantaz; Inxight becoming part of Business Objects, then BO getting acquired by SAP, then SAP investing in Endeca.

Customers: More confident in their ability to select the right system than in 2007.

As I reflect on these points, I see three characteristics of the 2008 search market that are not addressed. Let me summarize each:

A naive dismissal of the Google Search Appliance, OneBox API, and Google Apps as not important to the major players in behind-the-firewall search. My data suggest that Google has about 8,500 licensees of the maligned GSA. Interest in Google Apps is climbing, often following the skyrocketing interest in Google Maps. Google is going to reshape the behind-the-firewall market for search and other applications.

Growing importance of international vendors. I am continually surprised that many of the organizations with whom I speak about behind-the-firewall search are essentially ignorant of important North American vendors such as Attensity, Cognition Technologies, Siderean Software, or Thetus. But I am thunderstruck when these informed and bright people look baffled when I mention Bitext, Copper Eye, Polyspot, and Lingway. I haven’t mentioned the innovators in behind-the-firewall search in the Pacific Rim. Big changes are afoot, and few in the U.S. seem to care very much. There’s more curiosity about new Apple iPods than enterprise information systems, I surmise.

Overconfidence in search expertise and knowledge. I have been amazed on several occasions in the last six months at the lack of knowledge about the “gotchas” in search and the incredible hubris of certain procurement teams. In addition to refusing to consider a hosted or managed solution, these folks have zero knowledge of viable solutions developed in mysterious places like far-off France. Amazing. I meet many 25-year-olds who have “mastered” the intricacies of behind-the-firewall search. I conclude that it must be wonderful to be so smart so young. I’m still learning by plodding along. I’ve been at this more than 30 years and know I don’t know very much at all.

Let me close with an anecdote. One of my long-time friends and colleagues told me that her firm’s behind-the-firewall search system didn’t work. I think the word she used was sucked. Young people are quite colloquial.

I said, “Didn’t I try to flag you off that vendor?”

She replied, “Yes, but our VP of Information Technology made the decision. He knew what he wanted and made the deal happen.” (I think she made a sound like an annoyed ocelot, a grrrr sound.)

What’s interesting about this exchange is that the company with the search system that “sucked” conducts analyses of text mining, knowledge management, and “enterprise” search systems — for a fee.

I am struggling with how to communicate the need for those who want to procure a behind-the-firewall search system to make a decision based on understanding, facts, and specific, pragmatic requirements. I thought it was my generation who watched Star Trek and believed that technology would make it possible to issue voice commands to computers or say “Beam me up” to move from place to place. I learned in 2007 that recent graduates of prestigious computer science programs have absorbed Star Trek’s teachings.

Just one problem. Behind-the-firewall search remains a complex challenge. I document in Beyond Search 13 “disasters” and provide guidance on how to extricate oneself from the clutches of these problems. There’s no “beam me up” solution to the rats’ nest of issues that plague some behind-the-firewall search solutions — yet.

François Bourdoncle, the engaging founder of Exalead, reveals an important new service now available from Exalead: BAAGZ, a social and semantic system. Mr. Bourdoncle said, “BAAGZ is, for the record, the first social network to allow people to connect because they have shared interests.”

Mr. Bourdoncle added, “We are listening to our users, and we believe that the time for simplistic and ‘naked’ search engines is over. Now is the time for full-fledged ‘search products’, not simplistic ‘search engines’. Think of the difference between a car’s engine and the car itself.” BAAGZ, which some of the alpha testers have seen, was described as “a new form of search-inspired social networking”. Another alpha tester called it “a new form of social networking-inspired search”.

BAAGZ will be released in a public beta this week (February 25, 2008). You can try this new service at www.baagz.com.

Mr. Bourdoncle continued, “At Exalead, we focused on multi-threaded, 64-bit architectures from day one…. Today, Exalead has, I know, the most mature, robust and scalable search software. We make full use of today’s multi-core processors. Our products are also able to adapt automatically to various memory / processor / disk configurations.”

Exalead, based in Paris, is one of the four vendors whose system has been identified in Beyond Search: What to Do When Your Search System Doesn’t Work as a “company to watch.” Exalead has a growing presence in the United States and a technical capability that parallels Google’s.

If you thought Paris was only for lovers, you need to expand your horizons. Paris is a place for new approaches to Web and behind-the-firewall search technology. My website contains the entire interview with Mr. Bourdoncle.

In 2004, I began work on The Google Legacy: How Google’s Internet Search Is Transforming Application Software. The study grew from a series of research projects I did starting in 2002. My long-time colleague, friend, and publisher — Harry Collier, Infonortics Ltd. in Tetbury, Glos. — suggested I gather together my various bits and pieces of information. We were not sure if a study going against the widely-held belief that Google was an online ad agency would find an audience.

The Google Legacy focused on Google’s digital “moving parts” — the sprockets and gears that operate out of sight for most. The study’s major finding was that Google set out to solve the common problems of performance and relevance in Web search. By design or happenstance, the “solution” was a next-generation application platform.

The emergence of this platform — what I called the Googleplex, a term borrowed from Google’s own jargon for its Mountain View headquarters — took years. Its main outlines were discernible by 2000. At the time of the initial public offering in 2004, today’s Googleplex was a reality. Work was not finished, of course, and probably never will be. The Googleplex is a digital organism, growing, learning, and morphing.

The hoo hah over Google’s unorthodox IPO, the swelling ad revenue, the secrecy of the company, and the alleged arrogance of Googlers (Google jargon for those good enough to become full-time employees) generated a smoke screen. Most analysts, pundits, and Google watchers saw the swirling fog, but when The Google Legacy appeared, few had tried to get a clearer view.

Google provided some tantalizing clues about what its plumbing was doing. Today, you can download Hadoop and experiment with an open source framework similar to Google’s combo of MapReduce and the Google File System. You can also experiment with Google’s “version” of MySQL. Of course, your tests only provide a partial glimpse of Google’s innards. You need the entire complement of Google software and hardware to replicate Google.
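If the MapReduce idea is new to you, the conceptual shape is simple enough to sketch in a few lines of Python. This is a toy word count, not Hadoop’s API, but it shows the map phase emitting key-value pairs and the reduce phase grouping and summing them:

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reducer: group pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["google google file system", "map reduce at google"]
print(reduce_phase(map_phase(docs)))  # 'google' appears 3 times
```

In a real Hadoop cluster, the map and reduce phases run in parallel across many commodity machines, with the framework handling the shuffle between them; that distribution, not the word counting, is the hard part Google solved.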

Google also makes available a number of technical papers, instructional videos, lectures, and code samples. A great place to start your learning about Google technical innovations is here. If you have an interest in the prose of folks more comfortable with flashy math, you can read Google technical papers here. And, if you want to dig even more deeply into Google’s mysteries, you can navigate to the US Patent & Trademark Office and read the more than 250 documents available here. The chipper green and yellow interface is a metaphor for the nausea you may experience when trying to get this clunky, Bronze Age search system to work. But when it does, you will be rewarded with a window into what makes the Google machine work its magic.

The Google Legacy remains for some an unnerving look at Google. Even today, almost three years after The Google Legacy appeared, many people still perceive Google as an undisciplined start up, little more than an ersatz college campus. The 20-somethings make money by selling online advertising. I remember reading somewhere that a Microsoft executive called the Google “a one-trick pony”.

You have to admit that for a company that will be 10 years old in a few months, a canny thinker like Steve Ballmer has perceived the company correctly. But why not ask this question, “Has Microsoft really understood Google?” A larger and more interesting question, “Have such companies as IBM, Oracle, Reed Elsevier, and Goldman Sachs grasped Google in its entirety?”

In this essay, I want to explore this question. My method will be to touch upon some of the information my research uncovered in writing the aforementioned The Google Legacy and my September 2007 study, Google Version 2.0. Then I want to paraphrase a letter shared with me by a colleague. This letter was a very nice “Dear John” epistle. In colloquial terms, a very large technology company “blew off” my colleague because the large technology company understood Google and didn’t need my colleague’s strategic counsel about Google’s enterprise software and service initiatives.

I want to close by considering one question, “If Microsoft is smart enough to generate more than $60 billion in revenue in 2007, why hasn’t Microsoft been clever enough to derail Google?” By extension, if Microsoft didn’t understand Google, can we be confident that other large companies have “figured out Google”?

Microsoft Should Stalk Other Prey, Says New York Times

Today is February 25, 2008, and there’s still a “cloud of unknowing” around Google. A Sunday headline, “Maybe Microsoft Should Stalk Different Prey”, caught my eye. The article here, penned by Randall Stross, includes this sentence:

Having exhausted its best ideas on how to deal with Google, Microsoft is now working its way down the list to dubious ones — like pursuing a hostile bid for Yahoo.

Now Microsoft has been scrutinizing Google for years. Microsoft has power, customers, and money. Microsoft has thousands of really smart people. Google — in strictly financial measures — is a dwarf to Microsoft’s steroid stallion. Yet here was the outstanding, smart reporter Randall Stross revealing that the mouse (Google) has frightened the elephant (Microsoft). Furthermore, the elephant can’t step on the mouse. The elephant cannot move around the mouse. The elephant has to do the equivalent of betting the house and the children’s college fund to have a chance to escape the mouse.

Microsoft seems to be faced with some stark choices.

For me, the amusing part of this Sunday morning “revelation” is that by the time The Google Legacy appeared in 2005, Microsoft was between a rock and a hard place with regard to Google. One example from my 2005 study will help you understand my assertion.

Going Fast Cheaply

In the research for The Google Legacy, I read several dry technical papers about Google’s read speed on Google’s commodity storage devices. A “read speed” is a measure of how much data can be moved from a storage device to memory in one second. Your desktop computer can move megabytes a second pretty comfortably. To go faster, you need the type of engineering used to make a race car outperform your family sedan.
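If you want to see your own machine’s number, a rough measurement is easy to sketch. The Python below sequentially reads a file in one-megabyte blocks and reports megabytes per second. Caveat: the operating system’s page cache can inflate the figure dramatically, so serious tests use cold caches or raw devices; treat this as a ballpark sketch only:

```python
import time

def read_speed_mb_per_sec(path, block_size=1024 * 1024):
    # Sequentially read the file in 1 MB blocks and time the whole pass.
    # Beware: a warm OS page cache will make the number look much better
    # than the physical disk can actually deliver.
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    return (total_bytes / (1024 * 1024)) / elapsed
```

Point it at a file several times larger than your RAM to reduce the cache effect, and run it on the machine you actually intend to use.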

These papers, still available at “Papers Written by Googlers”, included equations, data tables, and graphs. These showed how much data Google could “read” in a second. When I read these papers, I had just completed a test of read speed on what were in 2004 reasonably zippy servers. These were IBM NetFinity 5500s. Each server had four gigabytes of random access memory, six internal high-speed SCSI drives, IBM Serveraid controllers with on-board caching, and an EXP3 storage unit holding 10 SCSI III drives. At $20,000, these puppies were fast, stable, and reliable. My testing revealed that a single NetFinity 5500 server could read 65 megabytes per second. I thought that this was good, not as fast as the Sun Microsystems fiber server we were testing but very good.

The Google papers reported that, using IDE drives identical to the ones available at the local Best Buy or Circuit City, Google engineers achieved read speeds of about 600 megabytes per second. Google was using off-the-shelf components, not exotic stuff like IBM Serveraid controllers, IBM-proprietary motherboards, IBM-certified drives, and even IBM FRU (field replaceable unit) cables. Google was using the low cost stuff in my father’s PC.

A Google server comparable to my NetFinity 5500 cost about $600 in 2004. The data left me uncertain of my analysis. So, I had two of my engineers retrace my tests and calculations. No change. I was using a server that cost 33 times as much as Google’s test configuration server. I was running at one-tenth the read speed of Google’s server. One-tenth the speed and spending $19,400 more per server.
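The arithmetic behind those ratios, using the figures from the paragraphs above, works out like this:

```python
# Figures from the text: my IBM NetFinity 5500 versus Google's
# commodity test server, circa 2004.
ibm_cost, ibm_mb_s = 20_000, 65
google_cost, google_mb_s = 600, 600

cost_ratio = ibm_cost / google_cost    # IBM box cost ~33x more
speed_ratio = google_mb_s / ibm_mb_s   # Google box read ~9x faster
price_perf = cost_ratio * speed_ratio  # ~300x price-performance gap

print(round(cost_ratio), round(speed_ratio, 1), round(price_perf))
# prints: 33 9.2 308
```

Roughly one-tenth the speed at 33 times the cost: a price-performance gap of about two and a half orders of magnitude in Google’s favor.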

You don’t have to know too much about price-performance ratios to grasp the implications of these data. A Google competitor trying to match Google’s “speed” has to spend more money than Google. If Google builds a data center and spends $200 million, a competitor in 2004 using IBM or other branded server-grade hardware would have to spend orders of magnitude more to match Google’s performance.

The gap in 2004 and 2005 when my study The Google Legacy appeared was so significant as to be unbelievable by those who did not trouble to look at the Google data.

This is just one example of what Google has done to have a competitive advantage. I document others in my 2007 study Google Version 2.0.

In the months after The Google Legacy appeared, it struck me that only Amazon of Google’s Web competitors seemed to have followed a Google-like technical path. Today, even though Amazon is using some pretty interesting engineering short cuts, Amazon is at least in the Google game. I’m still watching the Amazon S3 meltdown to see if Amazon’s engineers have what it takes to keep pace. Amazon’s technology and research budget is a pittance compared to Google’s. Is Amazon able to out-Google Google? It’s too early to call this horse race.

Do Other Companies See Google More Clearly than Microsoft Did?

Now, let me shift to the connection I made between Mr. Stross’s article and the letter I mentioned.

Some disclaimers. This confidential letter was not addressed to me. A colleague allowed me to read the letter. I cannot reveal the name of the letter’s author or the name of my colleague. The letter’s author is a senior executive at a major computer company. (No, the company is not Microsoft.)

My colleague proposed a strategy analysis of Google to this big company. The company sent my colleague a “go away” letter. What I remember are these points: [a] (I am paraphrasing based on my recollection) our company has a great relationship with Google, and we know what Google is doing because our Google contacts are up front with us. [b] Our engineers have analyzed Google technology and found that Google’s engineering poses no challenge to us. [c] Google and our engineers are working together on some projects, so we are in daily contact with Google. The letter concluded by saying (again I paraphrase), “Thanks, but we know Google. Google is not our competitor. Google is our friend. Get lost.”

I had heard similar statements from Microsoft. But when wrapping up Google Version 2.0, I spoke with representatives of Oracle and other companies. What did I hear? Same thing: Google is our partner. Even more interesting to me was that each of these insightful managers told me their companies had figured out Google.

How Did Microsoft Get It Wrong?

This raises the question, “How did Microsoft and its advisors not figure out Google?” Microsoft has known about Google for a decade. Microsoft has responded on occasion to Google’s hiring of certain Microsoft wizards like Kai Fu Lee. Microsoft made significant commitments to search and retrieval well before the Yahoo deal took shape. Microsoft has built an advanced research capability in information retrieval. Microsoft has invested in an advertising platform. Microsoft has redesigned Microsoft Network (MSN) a couple of times and created its own cloud-computing system for Live CRM, among other applications.

I don’t think Microsoft got it wrong. I think Microsoft looked at Google through the Microsoft “agenda”. Buttressed by the received wisdom about Google, Microsoft did not appreciate that Google’s competitive advantage in ads was deeply rooted in engineering, cost control, and its application platform. Perhaps executives in other sectors may want to step back and ask themselves, “Have we really figured out Google?”

Let’s consider Verizon’s perception of Google.

I want to close by reminding you of the flap over the FCC spectrum bid. The key development in that process was Verizon’s statement that it would become more open. Since the late 1970s, I have worked for a number of telcos, including the pre-breakup AT&T, Bell Labs (as a vendor), USWest before it became Qwest, and Bell Communications Research. When Verizon used the word open, I knew that Google had wrested an important concession from Verizon. Here’s Business Week’s take on this subject. At that moment, I lost interest in the outcome of the spectrum auction. Google got what it wanted: openness. Google’s nose was in Verizon’s tent. Oh, and Verizon executives told me that Google was not an issue for them as recently as June 2007.

What’s happening is far larger than Microsoft “with wobbly legs, scared witless”, to quote Mr. Stross. Microsoft, like Verizon, is another of the established, commercial industrial giants to feel Google’s pressure wave. Here’s a figure from The Google Legacy and the BearStearns report The Google Ecosystem to illustrate what Google’s approach was between 1998 and 2004. More current information appears in Google Version 2.0.

You can figure out some suspects yourself. Let me give you a hint. Google is exerting pressure in its own Googley way on the enterprise market. Google is implementing thrusts using these pressure tactics in publishing, retail, banking, entertainment, and service infrastructure. Who are the top two or three leaders in each of these sectors? These are the organizations most likely to be caught in the Google pressure wave. Some will partner with Google. Others will flee. A few will fight. I do hope these firms know what capabilities Google can bring to bear on them.

The key to understanding Google is setting aside the Web search and ad razzle dazzle. The reality of Google lies in its engineering. Its key strength is its application of very clever math to make tough problems easy to resolve. Remember: Google is not a start up. The company has been laboring for a decade to build today’s Google. It’s also instructive to reflect on what Google learned from the former AltaVista.com wizards who contributed much in the 1999 to 2004 period; many continue to fuel Google’s engineering prowess today. Even Google’s relationship with Xooglers (former employees who quit) extends Google’s pressure wave.

I agree that it is easy, obvious, and convenient to pigeonhole Google. Its PR focuses on gourmet lunches, Foosball, and the wacky antics of 20-year-old geniuses. Too few step back and realize that Google is a supra-national enterprise the likes of which has not been experienced for quite a while.

My mantra remains, “Surf on Google.” The alternative is putting the fate of your organization in front of Googzilla and waiting to see what happens. Surfing is a heck of a lot more fun than watching the tsunami rush toward you.

The key findings from this two-and-a-half-year effort are two:

Google has morphed into a new type of global computing platform and services firm. The implications of this finding mystify wizards at some very large companies. The perception of Google as a Web search and online ad company is so strongly held that no other view of Google makes sense to these people.

The application platform is more actively leveraged than most observers realize. Part of the problem is that Google is content to be “viral” and low key. Pundits see the FCC spectrum bid as a huge Google initiative. In reality, it’s just one of a number of equally significant Google thrusts. But pundits “see” the phone activities and make mobile the “next big thing” from Google.

It’s Saturday, February 23, 2008. It’s cold. I’m on my way to the gym to ensure that my youthful figure defies time’s corrosive forces. I look at the headlines in my newsreader, and I am now breaking my vow of “No News. No, Really!”

Thomas Claburn, Information Week journalist, penned a story with the headline, “Google-Powered Hacking Makes Search a Threat.” Read the story for yourself. Do you agree with the premise that information is bad when discoverable via a search engine?

With inputs from Cult of the Dead Cow and a nod to the Department of Homeland Security, the story flings buzzwords about security threats and offers some observations about “defending against search”. The article has a pyramid form, a super headline, quotes (lots of quotes), and some super tech references such as “the Goolag Scan”, among others. This is an outstanding example of technical journalism. I say, “Well done, sir.”

My thoughts are:

The fix for this problem of “bad” information is darn easy. Get one or two people to control information. The wrong sort of information can be blocked or the authors arrested. Plus, if a bad “data apple” slips through the homogenization process, we know with whom to discuss the gaffe.

The payoff of stopping “bad information” is huge. Without information, folks won’t know anything “bad”, so the truth of “If ignorance is bliss, hello, happy” is realized. Happy folks are more productive. Eliminating bad information boosts the economy.

The organizations and individuals responsible for “threats” can be stopped. Bad guys can’t harm the good guys. Good information, therefore, doesn’t get corroded by the bad information. No bad “digital apples” can spoil the barrel of data.

I’m no Jonathan Swift. I couldn’t edit a single Cervantes sentence. I am a lousy cynic. I do, however, have one nano-scale worry about a digital “iron maiden”. As you may know, the iron maiden was a way to punish bad guys. When tricked out with some inward-facing spikes, the bad guy was impaled. If the bad guy was unlucky, death was slow, agonizing I assume. The iron maiden, I think, was a torture gizmo. Some historical details are murky, but I am not too keen on finding out via a demo in “iron” or in “digital” mode.

I think that trying to figure out what information is “good” and what information is “bad” is reasonably hard to do. Right now, I prefer systems that don’t try to tackle these particular types of predictive tasks for me. I will take my chances figuring out what’s “good” and what’s “bad”. I’m 64, and so far, so good.

In behind-the-firewall systems, determining what to make available and to whom is an essential exercise. An error can mean a perp walk in an orange suit for the CEO and a pack of vice presidents.

Duplicating this process on the Web is — shall we say — a big job. I’m going to the gym. This news stuff is depressing me.

Let’s pick up the thread of sluggish behind-the-firewall search systems. I want to look at one hot spot in the subsystem responsible for document processing. Recall that the crawler subsystem finds or accepts information. The notion of “find” relates to a crawler or spider able to identify new or changed information. The spider copies the information back to the content processing subsystem. For the purposes of our discussion, we will simplify spidering to the find-and-send-back approach. The other way to get content to the search system is to push it. The idea is that a small program wakes up when new or changed content is placed in a specific location on a server. The script “pushes” the content — that is, copies the information — to a specific storage area on the content processing subsystem. So, we’re dealing with pushing or pulling content. The diagram to which these comments refer is here.
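The “push” script described above can be sketched in a few lines. This is a minimal illustration, not any vendor’s actual connector; the directory paths and the 60-second polling interval are assumptions for the example:

```python
import shutil
import time
from pathlib import Path

def push_once(drop_dir, intake_dir, seen):
    # Copy any file we have not pushed yet, or whose modification time
    # changed since the last push. `seen` maps file name -> last mtime.
    pushed = []
    for path in sorted(Path(drop_dir).iterdir()):
        if not path.is_file():
            continue
        stamp = path.stat().st_mtime
        if seen.get(path.name) != stamp:
            shutil.copy2(path, Path(intake_dir) / path.name)
            seen[path.name] = stamp
            pushed.append(path.name)
    return pushed

def run(drop_dir, intake_dir, interval=60):
    # The script "wakes up" on a timer and pushes whatever is new or changed.
    seen = {}
    while True:
        push_once(drop_dir, intake_dir, seen)
        time.sleep(interval)
```

A production version would need error handling, logging of failed copies, and some way to avoid grabbing files still being written, which is exactly where the real engineering effort goes.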

Now what happens?

There are many possible functions a vendor can place in the document processing subsystem. I want to focus on one key function — content transformation. Content transformation takes a file — let’s say a PowerPoint — and creates a version of this document in an XML structure “known” to the search system. The idea is that a number of different file types are found in an organization. These can range from garden variety Word 2003 files to the more exotic XyWrite files still in use at certain US government agencies. (Yes, I know that’s hard to believe because you may not know what XyWrite is.)

Most search system vendors say, “We support more than 200 different file types.” That’s true. Over the years, scripts that convert a source file of one type into an output file of another type have been written. Years ago, there were independent firms doing business as Data Junction and Outside In. These two companies, along with dozens of others, have been acquired. A vendor can license these tools from their owners. Also, there are a number of open source conversion and transformation tools available from SourceForge, shareware repositories, and freeware distributors. However, a number of search system vendors will assert, “We wrote our own filters.” This is usually a way to differentiate their transformation tools from a competitor’s. The reality is that most vendors use a combination of licensed tools, open source tools, and home-grown tools. The key point is the answer to two questions:

How well do these filters or transformation routines work on the specific content you want to have the search system make searchable?

How fast do these systems operate on the specific production machines you will use for content transformation?

The only way to answer these two questions with accuracy is to test the transformation throughput on your content and on the exact machines you will use in production. Any other approach will create a general throughput rate value that your production system may or may not be able to deliver. Isn’t it better to know what you can transform before you start processing content for real?
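A test like the one just described can be a very small harness: feed your own sample documents through the vendor’s filter on the production machine, time the run, and record which files fail. In this sketch, `transform` is a stand-in for whatever conversion call your vendor supplies — the harness itself is the point:

```python
import time
from pathlib import Path

def benchmark_transform(sample_dir, transform):
    # `transform(path)` stands in for the vendor's filter call; it should
    # raise an exception for files it cannot convert.
    total_bytes, failures = 0, []
    start = time.perf_counter()
    for path in sorted(Path(sample_dir).iterdir()):
        if not path.is_file():
            continue
        try:
            transform(path)
            total_bytes += path.stat().st_size
        except Exception:
            failures.append(path.name)  # candidates for the rejection log
    elapsed = time.perf_counter() - start
    mb_per_sec = (total_bytes / (1024 * 1024)) / elapsed if elapsed else 0.0
    return mb_per_sec, failures
```

Run it against a representative sample of your actual content, on the exact hardware you will deploy; a rate measured on the vendor’s demo machine with the vendor’s demo files tells you nothing about your installation.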

I’ve just identified the two reasons for unexpected bottlenecks and, hence, poor document processing performance. First, you have content that the vendor’s filters cannot handle. When a document processing subsystem can’t figure out how to transform a file, it writes the file name, date, time, size, and maybe an error code in the document processing log. If you have too many rejected files, you have to intervene, figure out the problem with the files, and then take remedial action. Remedial action may mean rekeying the file or going through some manual process of converting the file from its native format to a neutral format like ASCII, doing manual touch-up like adding subheads or tags, and then putting the touched-up file into the document processing queue. Talk about a bottleneck. In most organizations, there is neither money nor people to do this work. Fixing the content transformation problems can take days or weeks, or never be done at all. Not surprisingly, a system that can’t process the content cannot make that content available to the system users. This glitch is a trivial problem when you are first deploying a system because you don’t have much knowledge of what will be transformed and what won’t. Imagine the magnitude of the problem when a transformation problem is discovered after the system is up and running. You may find log files overwriting themselves. You may find “out of space” messages in the folder used by the system to write files that can’t be transformed. You may find intermittent errors cascading back through the content acquisition system due to transformation issues. Have you looked at your document processing log files today?
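Checking those rejection logs can be automated. Every vendor’s log format differs, so this sketch assumes a made-up tab-separated layout (file name, date, time, size, error code); adapt the parsing to whatever your system actually writes:

```python
from collections import Counter

def rejected_files(log_lines):
    # Assumed log format, one rejection per line:
    # filename<TAB>date<TAB>time<TAB>size<TAB>error_code
    files = []
    by_error = Counter()
    for line in log_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 5:
            continue  # skip malformed or unrelated lines
        name, _date, _time, _size, code = parts
        files.append(name)
        by_error[code] += 1
    return files, by_error
```

Counting rejections by error code tells you whether you face one systematic filter problem (one code dominating) or a long tail of one-off bad files, which need very different remedies.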

The second problem has to do with document processing hardware. In my experience, exactly zero of the organizations with which I am familiar have run pre-deal tests on the exact hardware that will be used in production document processing. The exceptions are the organizations licensing appliances. The appliance vendors deliver hardware with a known capacity. Appliances, however, comprise less than 15 percent of the installed base of behind-the-firewall search systems. Most organizations’ information technology departments think that vendor estimates are good enough. Furthermore, most information technology groups believe that existing hardware and infrastructure are adequate for a search application. What happens? The system goes into operation and runs along until the volume of content to be processed exceeds available resources. When that happens, the document processing subsystem slows to a crawl or hangs.

Performance Erosion

Document processing is not a set-it-and-forget-it subsystem. Let’s look at why you need to invest time in engineering, testing, monitoring, and upgrading the document processing subsystem. I know before I summarize the information from my files that few, if any, readers of this Web log will take these actions. I must admit that indifference to the document processing subsystem generates significant revenue for consultants, but so many hassles can be avoided by taking some simple preventive actions. Sigh.

Let’s look at the causes of performance erosion:

The volume of content is increasing. Most organizations whose digital content production volume I have analyzed double their digital content every 12 months. This means that if one employee has five megabytes of new content when you turn on the system, then 12 months after you start the search system you will have the original five megabytes in the index plus five new megabytes, for a total of 10 megabytes of content. No big deal, right? Storage is cheap. It is a big deal when you are working in an organization with constraints on storage, an inability to remove duplicate content from the index, and an indiscriminate content acquisition process. Some organizations can’t “plug in” new storage the way you can on a PC or Mac. Storage must be ordered, installed, and certified. In the meantime, what happens? The document processing system falls behind. Can it catch up? Maybe. Maybe not.

The content is not new. Employees recycle, save different drafts of documents, and merge pieces of boilerplate text to create new documents. Again, if an employee works on a single PowerPoint, indexing that one deck is straightforward. But when you have many PowerPoints, each with minor changes, and the email messages like “Take a look at this and send me your changes”, you can index the same content again and again. A results list is not just filled with irrelevant hits; the basic function of search and retrieval is broken. Does your search system return a results list of what look like the same document with different date, time, and size values? How do you determine which version of the document is the “best and final” one? What are the risks of using the incorrect version of a document? How much does your organization spend on figuring out which version of a document is the “one” the CEO really needs?
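Exact duplicates, at least, are cheap to catch before they reach the index. A minimal sketch using a content hash follows; note that the harder case described above — near-duplicates with minor edits — needs techniques like shingling or fuzzy hashing, which this does not attempt:

```python
import hashlib

def deduplicate(documents):
    # documents: iterable of (name, content_bytes) pairs.
    # Keep the first copy of each distinct content hash; report the
    # names of exact duplicates and which file they duplicate.
    seen = {}
    duplicates = []
    for name, content in documents:
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen:
            duplicates.append((name, seen[digest]))
        else:
            seen[digest] = name
    return list(seen.values()), duplicates
```

Running a pass like this in the content acquisition stage keeps identical copies out of the document processing queue entirely, which is far cheaper than indexing them and trying to collapse them at query time.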

As you wrestle with these questions, recall that you are shoving more content through a system which, unless constantly upgraded, will slow to a crawl. You have set the stage for thrashing. The available resources are consumed processing the same information again and again, not processing the meaningful documents once and again only when a significant change is made. Ah, you don’t know which documents are meaningful? You are now like the snake eating its tail. Because you don’t have an editorial policy or content acquisition procedures in place, the slow down in document processing is nothing more than a consequence of an earlier misstep. So, no matter what you do to “fix” document processing, you won’t be able to get your search system working the way users want. Pretty depressing? Furthermore, senior management doesn’t understand why throwing money at a document processing problem doesn’t produce a payoff significant enough to justify the expense.
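To make the duplicate problem concrete, here is a minimal sketch of pre-index de-duplication. It fingerprints each document by hashing its normalized text, so trivially different copies collapse to one entry before they reach the indexing queue. The file names and the `content_fingerprint` helper are my own illustrations, not any vendor’s API.

```python
import hashlib

def content_fingerprint(text: str) -> str:
    """Hash the normalized text so copies differing only in
    whitespace or case produce the same fingerprint."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def filter_new_documents(documents, seen_fingerprints):
    """Yield only documents whose fingerprint has not been seen yet."""
    for doc_id, text in documents:
        fp = content_fingerprint(text)
        if fp not in seen_fingerprints:
            seen_fingerprints.add(fp)
            yield doc_id, text

# Three drafts: the second differs from the first only trivially.
docs = [
    ("draft_v1.doc", "Quarterly results are strong."),
    ("draft_v2.doc", "Quarterly  results are STRONG."),  # trivial edit
    ("draft_v3.doc", "Quarterly results are weak."),     # real change
]
seen = set()
unique = list(filter_new_documents(docs, seen))
```

A real pipeline would use shingling or similarity hashing to catch near-duplicates with substantive edits; exact-match hashing only removes the most blatant recycling.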

XML and Transformation

I’m not sure I can name a search vendor who does not support XML. XML is an incantation. Say it enough times, and I guess it becomes the magic fix for whatever ailments a content processing system has.

Let me give you my view of this XML baloney. First, XML, or extensible markup language, is not a panacea. XML is, at its core, a programmatic approach to content. How many of you reading this column program in any language? Darn few. So the painful truth is, you don’t know how to “fix” or “create” a valid XML instance, but you sure sound great when you chatter about XML.

Second, XML is a simplified version of SGML, which is in turn a descendant of CALS (Computer-aided Acquisition and Logistics Support), spawned by our insightful colleagues in the US government to deal with procurement. Lurking behind a nice Word document in the “new” DOCX format is a formal markup definition descended from the SGML DTD, or document type definition. But out of sight, out of mind, correct? Unfortunately, no.

Third, XML is like an ancient Roman wall in 25 BCE. The smooth surface conceals a heck of a lot of rubble between some rigid structures made of irregular brick or stone. This means that taking a “flavor” of XML and converting it to the XML your search system understands is a programmatic process. Converting a source file like a WordPerfect document into an XML version the search system can use is pretty darn complicated. When it goes wacky, it’s just like debugging any other computer program. Who knows how easy or hard it will be to find and fix the error? Who knows how long it will take? Who knows how much it will cost? I sure don’t.

If we take these three comments and think about them, it’s evident that this document transformation can chew up some computer processing cycles. If a document can’t be transformed, the exception log can grow. Dealing with these exceptions is not something one does in a few spare minutes between meetings.

Nope.

XML is work which, when done properly, greatly increases the functionality of indexing sub systems. When done poorly, XML is just another search system nightmare.
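To show why transformation is a programmatic process, here is a minimal sketch of converting one hypothetical source XML flavor into the flat shape an indexer might expect, with failures routed to an exception log rather than silently dropped. The element names and target schema are invented for illustration; a real transformation pipeline is far more involved.

```python
import xml.etree.ElementTree as ET

def transform_to_index_schema(source_xml: str) -> str:
    """Map a hypothetical <report> source flavor onto a flat
    <document><title/><body/></document> target. Raises ValueError
    when required elements are missing so the caller can log it."""
    root = ET.fromstring(source_xml)
    title = root.findtext("head/title")
    body = root.findtext("content/text")
    if title is None or body is None:
        raise ValueError("required elements missing; route to exception log")
    doc = ET.Element("document")
    ET.SubElement(doc, "title").text = title
    ET.SubElement(doc, "body").text = body
    return ET.tostring(doc, encoding="unicode")

def transform_batch(files):
    """Transform many files; collect failures in an exception log."""
    transformed, exception_log = [], []
    for name, xml_text in files:
        try:
            transformed.append((name, transform_to_index_schema(xml_text)))
        except (ET.ParseError, ValueError) as err:
            exception_log.append((name, str(err)))
    return transformed, exception_log

good = "<report><head><title>Q3</title></head><content><text>Results</text></content></report>"
bad = "<report><head><title>Q4</title></head></report>"  # no body element
ok, errors = transform_batch([("a.xml", good), ("b.xml", bad)])
```

Notice that the second file lands in the exception log. Someone, or some process, has to work that log down, which is exactly the hidden labor the essay describes.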

Stepping Back

How can you head off these document processing / transformation challenges?

The first step is knowing about them. If your vendor has educated you, great. If you have learned from the school of hard knocks, that’s probably better. If you have researched search and ingested as much other information as you can, you go to the head of the class.

An increasing number of organizations are solving this throughput problem by: [a] ripping and replacing the incumbent search system. At best, this is a temporary fix; [b] shifting to an appliance model. This works pretty well, but you have to keep adding appliances to keep up with content growth and the procedure and policy issues will surface again unless addressed before the appliance is deployed; [c] shifting to a hosted solution. This is an up-and-coming fix because it outsources the problem and slithers away from the capital investment on-premises installations require.

Notice that I’m not suggesting slapping an adhesive bandage on your incumbent search system. A quick fix is not going to do much more than buy time. In Beyond Search, I go into some depth about vendors who can “wrap” your ailing search system with a life-support system. This approach is much better than a quick fix, but you will have to address the larger policy and procedural issues to make this hybrid solution work over the long term.

You are probably wondering how transforming a bunch of content can become such a headache. You have just learned something about the “hidden secrets” of behind-the-firewall search. You have to dig into a number of murky, complex areas before you make your search system “live.”

I think the following checklist has not been made available without charge before. You may find it useful, and if I have left something out, please, let me know via the comments function on this Web log.

How much information in what format must the search system acquire and transform on a monthly and annual basis?

What percent of the transformation is for new content? How much for changed content?

What percent of content that must be processed exists in what specific file types? Does our vendor’s transformation system handle this source material? What percent of documents cannot be handled?

What filters must be modified, tested, and integrated into the search systems?

What is the administrative procedure for dealing with [a] exceptions and [b] new file types such as an email with an unrecognized attachment?

What is the mechanism for determining what content is a valid version and which content is a duplication? What pre-indexing process must be created to minimize system cycles needed to identify duplicate content; that is, how can I get my colleagues to flag only content that should be indexed before the content is acquired by the document processing system?

What is the upgrade plan for the document processing sub system?

What content will not be processed if the document processing sub system slows? What is the procedure for processing excluded content when the document processing subsystem again has capacity?

What is the financial switch over point from on-premises search to an appliance or a hosted / managed service model?

What is the triage procedure when a document processing sub system degrades to an unacceptable level?

What’s the XML strategy for this search system? What does the vendor do to fix issues? What are my contingency plans and options when a problem becomes evident?

In another post, I want to look at hot spots in indexing. What’s intriguing is that so far we have brought content, or had content pushed, to the search system storage devices. We have normalized that content and written it to the storage sub system in a form the indexing system can understand. Is anyone keeping track of how many instances of a document we have in the search system at any one time? We need that number. If we run out of storage, we’re dead in the water.

This behind-the-firewall search is a no-brainer. Believe it or not, a senior technologist at a 10,000-person organization told me in late 2007, “Search is not that complicated.” That’s a guy who really knows his information retrieval limits!

I want to take a closer look at behind-the-firewall search system bottlenecks. This essay talks about the content acquisition hot spot. I want to provide some information, but I will not go into the detail that appears in Beyond Search.

Content acquisition is a core function of a search system. “Classic” search systems are designed to pull content from a server where the document resides to the storage device the spider uses to hold the new or changed content. Please, keep in mind that you will make a copy of a source document, move it over the Intranet to the spider, and store that content object on the storage device for new or changed content. The terms crawling or spidering have been used since 1993 to describe the processes for:

Finding new or changed information on a server or in a folder

Copying that information back to the search system or the crawler sub system

Writing information about the crawler’s operation to the crawler log file.

On the surface, crawling seems simple. It’s not. Crawlers or spiders require configuration. Most vendors provide a browser-based administrative “tool” that makes it relatively easy to configure the most common settings. For example, you will want to specify how often the content acquisition sub system checks for new or changed content. You also have to “tell” the crawling sub system what servers, computers, directories, and files to acquire. In fact, the crawling sub system has a wide range of settings. Many systems allow you to create “rules” or special scripts to handle certain types of content; for example, you can set a specific spidering schedule for certain servers or folders.
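A sketch of what those per-source “rules” amount to, with invented source names and intervals; a real system stores similar settings behind its administrative interface:

```python
from datetime import datetime, timedelta

# Hypothetical per-source crawl rules: which servers/folders to visit
# and how often. News changes hourly; HR policies change rarely.
CRAWL_RULES = [
    {"source": "//fileserver/press_releases", "interval_hours": 1},
    {"source": "//fileserver/hr_policies", "interval_hours": 168},  # weekly
    {"source": "http://intranet/news", "interval_hours": 4},
]

def sources_due(rules, last_crawled, now):
    """Return the sources whose crawl interval has elapsed,
    plus any source that has never been crawled at all."""
    due = []
    for rule in rules:
        last = last_crawled.get(rule["source"])
        if last is None or now - last >= timedelta(hours=rule["interval_hours"]):
            due.append(rule["source"])
    return due

now = datetime(2008, 2, 29, 12, 0)
last_crawled = {
    "//fileserver/press_releases": now - timedelta(hours=2),  # overdue
    "//fileserver/hr_policies": now - timedelta(hours=24),    # not yet due
}
due = sources_due(CRAWL_RULES, last_crawled, now)
```

The point of the sketch: crawl frequency is a per-source decision, not a single global knob, and getting those intervals wrong is one way the hot spots below appear.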

In the last three or four years, more search systems have made it easier for the content acquisition system to receive “pushed” content. The way “push” works is that you write a script or use the snippet of code provided by the search vendor to take certain content and copy it to a specific location on the storage device where the spider’s content resides. I can’t cover the specifics of each vendor’s “push” options, but you will find the details in the Help files, API documentation, or the FAQs for your search system.

Pull

Pull works pretty well when you have a modest amount of new or changed content every time slice. You determine the time interval between spider runs. You can make the spider aggressive and launch the sub system every 60 seconds. You can relax the schedule and check for changed content every seven days. In most organizations, running crawlers every minute can suck up available network bandwidth and exceed the capacity of the server or servers running the crawler sub system.

You now have an important insight into the reason the content acquisition sub system can become a hot spot. You can run out of machine resources, so you will have to make the crawler less aggressive. Alternatively, you can saturate the network and the crawler sub system by bringing back more content than your infrastructure can handle. Some search systems bring back content that exceeds available storage space. Your choices are stark: make the crawler less aggressive, or limit the number of servers and folders the crawling sub system indexes.

When you operate a behind-the-firewall search system, you don’t have the luxury a public Web indexing engine has. Those systems can easily skip a server that times out or not revisit a server until the next spidering cycle. In an organization, you have to know what must be indexed immediately, or as close to immediately as you can get. You have to acquire content from servers that may time out.

The easy fixes for crawler sub system problems are likely to create some problems for users. Users don’t understand why a document may not be findable in the search system. The crawler sub system may have failed to get the document back to the search system for any of many reasons. Believe me, users don’t care.

The key to avoiding problems with traditional spidering boils down to knowing how much new and changed content your crawler sub system must handle at peak loads. You also must know the rate of growth for new and changed content. You need the first piece of information to specify the hardware, bandwidth, storage, and RAM you need for the server or servers handling content acquisition. The second data point gives you the information you need to upgrade your content acquisition system. You have to keep the content acquisition system sufficiently robust to handle the ever-larger amount of information generated in organizations today.
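A back-of-the-envelope sketch of that sizing exercise, with illustrative numbers only; substitute peak-load and growth measurements from your own crawler logs:

```python
def content_capacity_plan(peak_mb_per_hour, annual_growth_rate, years, headroom=1.5):
    """Project the throughput (MB/hour) the acquisition servers must
    sustain at peak load, year by year, with a safety headroom factor."""
    plan = []
    load = peak_mb_per_hour
    for year in range(years + 1):
        plan.append((year, round(load * headroom, 1)))
        load *= (1 + annual_growth_rate)
    return plan

# A shop measuring 500 MB/hour at peak today, with content doubling
# yearly (the growth rate cited earlier in this series):
plan = content_capacity_plan(peak_mb_per_hour=500, annual_growth_rate=1.0, years=2)
# Year 0 needs 750 MB/hour of capacity; year 2 needs 3,000 MB/hour.
```

The compounding is the whole story: a system specified for today’s peak is, by this arithmetic, undersized within a year.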

Hot spots in content acquisition are caused by:

Insufficient resources

Failure to balance crawler aggressiveness with machine resources

Improper handling of high-latency response from certain systems whose content must be brought back to the search storage sub system for indexing.

The best fix is to do the up-front work accurately and thoroughly. To prevent problems, a proactive upgrade path must be designed and implemented. Maintenance and tuning must be routine operations, not “we’ll do it later” procedures.

Push

Push is another way to reduce the need for the content acquisition sub system to “hit” the network at inopportune times. The idea is simple, and it is designed to operate in a way directly opposite from the indiscriminate service that gave content “push” a bad reputation. PointCast “pushed” content indiscriminately, causing network congestion.

The type of “push” I am discussing falls out of the document inventory conducted before you deploy the first spider. You want to identify those content objects that can be copied from their host location to the content acquisition storage sub system using a crontab entry or a script that triggers the transfer [a] when new or changed data are available and [b] at off-peak times.

The idea is to keep the spiders from identifying certain content objects and then moving those files from their host location to the crawler storage device at inopportune moments.

In order to make “push” work, you need to know which content is a candidate for routine movement. You have to set up the content acquisition system to receive “pushed” content, which is usually handled via the graphical administrative interface. You need to create the script or customize the vendor-provided function to “wake up” when new or changed content arrives in a specific folder on the machine hosting the content. Then the script consults the rules for starting the “push”. The transfer occurs, and the script should verify in some way that the “pushed” file was received without errors.
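Here is a minimal sketch of such a “push” transfer with a verification step, using temporary directories to stand in for the host folder and the spider’s drop folder. The function name and the checksum choice are my own, not any vendor’s.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def push_with_verification(source: Path, drop_folder: Path) -> bool:
    """Copy a new/changed file into the folder the content acquisition
    sub system watches, then verify the copy by comparing checksums.
    A real deployment would trigger this from a crontab entry or a
    folder watcher at off-peak times."""
    destination = drop_folder / source.name
    shutil.copy2(source, destination)
    src_hash = hashlib.md5(source.read_bytes()).hexdigest()
    dst_hash = hashlib.md5(destination.read_bytes()).hexdigest()
    if src_hash != dst_hash:
        destination.unlink()  # discard the bad copy; retry later
        return False
    return True

# Demonstration with temporary directories standing in for the real
# host folder and the spider's drop folder.
with tempfile.TemporaryDirectory() as tmp:
    host = Path(tmp) / "host"; drop = Path(tmp) / "drop"
    host.mkdir(); drop.mkdir()
    report = host / "report.txt"
    report.write_text("new quarterly figures")
    pushed = push_with_verification(report, drop)
```

The verification step is the part most ad hoc scripts skip, and it is exactly the “received without errors” check described above.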

Many vendors of behind-the-firewall search systems support “push”. If your system does not, you can use the API to create this feature. While not trivial, a custom “push” function is a better solution than trying to get a crashed content acquisition sub system back online. You run the risk of having to reacquire the content, which can trigger another crash or saturate the network bandwidth despite your best efforts to prevent another failure.

Why You Want to Use Both Push and Pull

The optimal content acquisition sub system will use both “push” and “pull” techniques. Push can be very effective for high-priority content that must be indexed without waiting for the crawler to run a CRC, time stamp, or file size check on content.
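For contrast, here is a rough sketch of the pull-side change check the crawler would otherwise run: a cheap file-size comparison first, then a CRC of the bytes to catch edits that leave the size unchanged. The helper names are illustrative only.

```python
import os
import tempfile
import zlib

def file_signature(path):
    """Record cheap change indicators: file size and a CRC of the
    bytes. (Timestamps are another common check, but they can change
    without the content changing.)"""
    with open(path, "rb") as fh:
        crc = zlib.crc32(fh.read())
    return {"size": os.path.getsize(path), "crc": crc}

def has_changed(path, previous):
    """Decide whether a file needs re-acquisition: the size check is
    decisive when lengths differ; the CRC catches same-size edits."""
    if previous is None:
        return True  # never seen before, so acquire it
    current = file_signature(path)
    return current["size"] != previous["size"] or current["crc"] != previous["crc"]

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write("original text")
    path = fh.name
baseline = file_signature(path)
unchanged = has_changed(path, baseline)   # nothing edited yet
with open(path, "w") as fh:
    fh.write("edited   text")             # same length, different bytes
changed = has_changed(path, baseline)     # the CRC catches the edit
os.remove(path)
```

Running checks like these across thousands of files every cycle is exactly the overhead “push” lets high-priority content skip.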

The only way to make the most efficient use of your available resources is to designate certain content as “pull” and other content as “push”. You cannot guess. You must have accurate baseline data and update those data by consulting the crawler logs.

You will want to develop schedules for obtaining new and changed content via “push” and “pull”. You may want to take a look at the essay on this Web log about “hit boosting”, a variation on “push” content with some added zip to ensure that certain information appears in the context you want it to show up.

Where Are the Hot Spots?

If you have a single server and your content acquisition function chokes, you know the main hot spot — available hardware. You should place the crawler sub system on a separate server or servers.

The second hot spot may be network bandwidth, or the lack of it, when you are running the crawlers and pushing data to the content acquisition sub system. If you run out of bandwidth, you face some specific choices. No choice is completely good or bad. The choices are shades of gray; that is, you must make trade offs. I will highlight three, and you can work through the others yourself.

First, you can acquire less content less frequently. This reduces network saturation, but it increases the likelihood that users will not find the needed information. How can they? The information has not yet been brought to the search system for document processing.

Second, you can shift to “push”, de-emphasizing “pull” or traditional crawling. The upside is that you can control how much content you move and when. The downside is that you may inadvertently saturate the network when you are “pushing”. Also, you will have to do the research to know what to “push”, and then you have to code, configure, test, debug, and deploy the system. If people have to move the content to the folder the “push” script uses, you will need to do some “human engineering”. It’s better to automate the “push” function insofar as possible.

Third, you have to set up a re-crawl schedule. Skipping servers may not be an option in your organization. Of course, if no one notices missing content, you can take your chances. I suggest knuckling down and doing the job correctly the first time. Unfortunately, short cuts and outright mistakes are very common in the content acquisition piece of the puzzle.
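One simple way to sketch a re-crawl schedule for a server that times out is exponential backoff: retry at growing intervals rather than skipping the server until the next full cycle, and alert an operator when the schedule is exhausted. The numbers below are illustrative.

```python
def recrawl_schedule(base_delay_minutes=5, max_attempts=5):
    """Build exponential backoff delays for revisiting a server that
    timed out, instead of silently dropping it from the cycle."""
    return [base_delay_minutes * (2 ** n) for n in range(max_attempts)]

def next_attempt(failed_attempts, schedule):
    """Minutes to wait before the next retry, given how many attempts
    have failed. Returns None once the schedule is exhausted, so an
    operator can be alerted rather than hammering a sick server."""
    if failed_attempts >= len(schedule):
        return None
    return schedule[failed_attempts]

schedule = recrawl_schedule()            # 5, 10, 20, 40, 80 minutes
first_wait = next_attempt(0, schedule)   # retry quickly at first
give_up = next_attempt(5, schedule)      # escalate to a human
```

The design choice worth noting: the content is never marked “skipped” without a trail, so missing documents can be explained rather than discovered by users.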

In short, hot spots can crop up in the crawler sub system. The causes may be human, configuration, infrastructure, or a combination of causes.

Is This Really a Big Deal?

Vendors will definitely tell you the content acquisition sub system is no big deal. You may be told, “Look we have optimized our crawler to avoid these problems” or “Look, we have made push a point-and-click option. Even my mom can set this up.”

Feel free to believe these assurances. Let me close with an anecdote. Judge for yourself about the importance of staying on top of the content acquisition sub system.

The setting is a large US government agency. The users of the system were sending search requests to an Intranet Web server. The server would ingest each request and output a list of results. No one noticed that the results were incomplete. An audit revealed that the content acquisition sub system was not correctly identifying changed content. The error caused more than four million reports to be incorrect. Remediation cost more than $10 million. Analysis revealed that the crawler had been incorrectly configured when the system was first installed, almost 18 months before the audit. In addition to the money lost, certain staff were laterally arabesqued. Few in Federal employ get fired.

Pretty exciting for a high-profile vendor, a major US agency, and the “professionals” who created this massive problem.

Now, how important is your search system’s content acquisition sub system to you?

I posted a page here that provides more information about Beyond Search: What to Do When Your Search System Won’t Work. The publisher, The Gilbane Group, has a form here which you can use to receive more specifics about the 250-page study. You can also write beyondsearch at gilbane.com.

If you find the information on this Web log useful, you may want to think about getting a copy of the study. The information I am publishing on this Web log is useful, but it wasn’t directly on target for the Beyond Search study.


Stephen E. Arnold monitors search, content processing, text mining
and related topics from his high-tech nerve center in rural Kentucky.
He tries to winnow the goose feathers from the giblets. He works with colleagues
worldwide to make this Web log useful to those who want to go
"beyond search". Contact him at sa [at] arnoldit.com. His Web site
with additional information about search is arnoldit.com.