Recommended search architectures

If you plan to deploy Microsoft Office SharePoint Server 2007 on more than one server farm across geographic locations, several search architectures are practical for wide area network (WAN) environments. This article discusses these architectures. The following poster-size model provides an overview of the supported global solutions and recommended search architectures: Deploying Microsoft Office SharePoint Server Geographically (http://go.microsoft.com/fwlink/?LinkId=110982&clcid=0x409). This model was created in Microsoft Office Visio. If you do not have Visio installed, you can download a free viewer (http://go.microsoft.com/fwlink/?LinkId=73526&clcid=0x409). A plotter works best for printing this file.

Note:

This poster is not yet updated with information about the federated search architecture.

Planning for a search architecture balances the following types of requirements based on the priorities of an organization:

In some cases, understanding search architecture options in a WAN environment will help determine which of the supported global solutions are most appropriate for your organization. For more information about these solutions, see Supported global solutions for Office SharePoint Server.

This article does not discuss the performance characteristics of issuing search queries over the WAN or crawling content over the WAN. However, understanding how well your WAN environment supports these types of operations is crucial for planning a global environment. For more information about how Office SharePoint Server 2007 performs over the WAN, see Plan for bandwidth requirements.

Centralized search

With the centralized search architecture, the search service at the central farm crawls content at all regional farms. Search queries from regional users are sent to the central farm.

The following figure shows a centralized search architecture.

If WAN links support crawling content at regional sites, this is the recommended architecture because it provides a unified search experience for users that includes the following aspects:

Users always access the central farm for searching.

Search relevancy is retained in search results.

Users can search on all content across the organization that they have permissions to view.

One drawback to this architecture, however, is that there is no way to prioritize or distinguish local content in search results unless a search scope is created based on the farm location of content. That is, if a user at a regional site is searching for a document stored at the regional site, there is no easy way to distinguish where documents reside when they are listed in search results.

If WAN links do not perform well, this architecture can introduce several risks. Crawling content can overload a WAN link, which diminishes the performance of serving user requests. If there is a high volume of data with high rates of change, indexing jobs might not be able to keep up with the changes. However, you can tune Office SharePoint Server 2007 to optimize content crawling over a WAN. These optimizations can reduce the time and network traffic used during the indexing process. For more information, see "Optimizing for content crawling" in Optimizing Office SharePoint Server for WAN environments.

Finally, although WAN links affect whether you can crawl content remotely — and consequently whether it is feasible to use the centralized search architecture — slower WAN links might also play a role in how usable search is for regional users. Slow WAN links can discourage users from issuing queries. You can optimize performance of the WAN during business hours by scheduling content crawling and other operations that can diminish performance to take place during off-peak hours. Even with optimization, though, you should determine how well the centralized search architecture serves the needs of regional users over the existing WAN links.
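As a rough way to reason about whether crawls can keep up with changes, the following sketch estimates how long it takes to move one day of changed content over a WAN link during an off-peak window. It is back-of-the-envelope arithmetic only, not a model of how the search service actually crawls. The function names, the assumption that changed content must be transferred in full, and the share of the link reserved for crawling are all hypothetical.

```python
def crawl_hours(changed_gb: float, link_mbps: float, crawl_share: float) -> float:
    """Estimate the hours needed to move `changed_gb` of changed content over the WAN.

    Assumptions (not from the article): 1 GB = 8,000 megabits, the crawl may
    use only `crawl_share` (between 0 and 1) of the link, and protocol
    overhead is ignored.
    """
    if link_mbps <= 0 or not 0 < crawl_share <= 1:
        raise ValueError("link_mbps must be positive and crawl_share must be in (0, 1]")
    megabits = changed_gb * 8000
    seconds = megabits / (link_mbps * crawl_share)
    return seconds / 3600


def keeps_up(changed_gb: float, link_mbps: float,
             crawl_share: float, window_hours: float) -> bool:
    """Return True if one day's changes fit into the off-peak crawl window."""
    return crawl_hours(changed_gb, link_mbps, crawl_share) <= window_hours


if __name__ == "__main__":
    hours = crawl_hours(changed_gb=20, link_mbps=10, crawl_share=0.5)
    print(f"Estimated crawl time: {hours:.1f} h")
    print("Fits an 8-hour window:", keeps_up(20, 10, 0.5, 8))
```

With these illustrative numbers, 20 GB of daily changes over half of a 10 Mbps link needs about 8.9 hours, so an 8-hour off-peak window would fall behind. Measured figures from your own environment should replace every input.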

The following table summarizes the tradeoffs of the centralized search architecture.

Advantages:

Search relevancy is retained.

Shared Services Provider (SSP) management is centralized.

Disadvantages:

Crawling content over the WAN uses bandwidth.

Keeping indexes current can be difficult in environments with high volumes of data and high rates of change.

Query performance is subject to the performance of WAN links.

Regional SSPs with synchronized content

If WAN links do not support the centralized search architecture and you want to provide search as a service to regional sites, you can host an SSP at each regional site.

There are several different search architectures that encompass hosting SSPs at regional sites. The first of these architectures relies on synchronizing content throughout your organization so each regional site has a copy of all content that is necessary for workers at that regional site. This approach to managing content throughout a global organization is described in Design global information architecture and governance. Because content is synchronized, there is no need to crawl content remotely over the WAN.

The following figure illustrates this architecture.

In the figure:

Projects that are ready to share across the organization are published to the central site, regardless of where the content is created.

After the content is published to the central site, read-only versions of the projects are synchronized to all sites.

The search service at each farm crawls only the content within the farm.

Similarly, company information is synchronized throughout the organization in the same way, as illustrated in the following figure.

Although this architecture eliminates the need to crawl content over the WAN, it does require the use of WAN links to synchronize content across the environment. To minimize the effect on WAN performance, you can schedule these operations for off-peak hours. The primary benefit is that regional users have local access to content by using the local search service. With this architecture, the use of WAN links is scheduled and managed, and users are not hindered by the performance of WAN links while performing their job responsibilities.

The following table summarizes the tradeoffs of this search architecture.

Advantages:

Content is crawled locally.

Search query performance is not subject to the performance of WAN links.

Search relevancy is retained within each farm.

Disadvantages:

Multiple SSPs increase administrative costs.

Synchronizing content across an organization increases the complexity of the solution.

Centralized search plus distributed search

You can design a search architecture that combines centralized search and distributed search. With this architecture, the search service at each region crawls all content at that region and the central farm crawls content across all farms in the organization.

With this architecture, regional users can search local content without using WAN links. Regional users can search across the global organization by issuing queries at the central farm.

Each farm hosts an SSP. The search service provided by the local SSP crawls local content at each regional farm.

The search service provided by the SSP at the central farm also crawls content at regional farms.

The primary benefit of this architecture is that query performance is optimized for local content while global search is provided as an option. This architecture works well under the following circumstances:

Regional workers use search primarily to access local content.

WAN links support crawling content at regional sites.

Like the centralized search architecture, this architecture relies on heavy use of WAN links for crawling content. However, with local search as an option, global search does not play so critical a role in the overall search architecture, and you can factor that into crawl schedules and service level agreements.

The following table summarizes the tradeoffs of this search architecture.

Advantages:

Query performance is optimized for local content.

This option greatly reduces the number of queries over the WAN compared to the centralized search model.

Search relevance is optimized based on the scope of the search (local or global).

Disadvantages:

Multiple SSPs increase administrative costs.

Crawling content over the WAN uses bandwidth.

For regional users who perform global queries, query performance is affected by the performance of WAN links.

Distributed search

If WAN links cannot support synchronizing content across a global environment or crawling remote content at regional farms, you can provide search at only the regional farm level. With the distributed search architecture, each regional farm hosts its own SSP, and the search service that is provided by each regional SSP crawls only local content.

The following figure illustrates the distributed search architecture.

Consider implementing the distributed search architecture under the following circumstances:

Regional sites are not well connected with WAN links.

Regional sites are autonomous from other regional sites.

Regional sites do not rely heavily on a connection to the central site — for example, an organization with branch offices that operate autonomously.

There are a large number of regional sites and the business model and WAN links do not support a centralized model — for example, an organization with a large number of branch offices that are not well connected by WAN links.

The following table summarizes the tradeoffs of the distributed search architecture.

Advantages:

Search relevancy is retained.

Content is not crawled over WAN links.

Disadvantages:

Search is not enterprise-wide.

Users at regional farms must connect to the central farm to search content at that farm.

Federated search

Federated search is a feature that was added in the Infrastructure Update for Microsoft Office Servers. This feature is also included in Microsoft Search Server 2008. Federated search enables end users to issue a query that searches multiple sources and displays results in separate Web Parts on a single search results page. These sources can be enterprise content repositories, other search engines, or portions of your Search Server index. Using federation enables you to provide more extensive query results for your users without devoting your server resources to crawling and indexing content.

In a distributed environment with server farms in different regions, you can configure federated search so that each region is represented as a separate federated location. The user sees search results from each region in a different federated results Web Part. The results can be displayed as soon as they are received. For example, search results from the local server farm will most likely be returned before search results that are received over WAN connections.

The following diagram illustrates the use of federated search in a geographically distributed environment in which Microsoft Office SharePoint Server is deployed to each region.

In this diagram:

A user at Regional Farm 2 issues a query.

The query traffic is sent to a Web server at the local farm. The Web server forwards the query to the federated search locations.

Queries A and B are sent to federated locations at the geographically distributed farms.

Query C is a local search that is served by the local farm.

Search results are displayed on one Web page in separate Web Parts.

Configuring federated search in distributed environments

Using federated search, each server farm crawls its own content. For server farms running Office SharePoint Server, this requires an SSP at each regional farm. You create a federated connection to a remote server farm running Office SharePoint Server by creating (on the local server farm) an OpenSearch federated location. The OpenSearch federated location must point to the RSS feed of a search results page within a search center on the remote farm. You include the local farm in federated search by creating a “local search index” type of federated location. To implement federated search in a distributed environment, configure each farm with federated locations to the other farms.
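The farm-to-farm configuration described above can be sketched as a simple planning helper: for every farm, list one local search index location plus one OpenSearch location per remote farm. This is only an illustration for planning purposes. It does not call any SharePoint API, and the farm names, feed URLs, and dictionary layout are hypothetical.

```python
def plan_federated_locations(farm_feeds: dict) -> dict:
    """Map each farm to the federated locations it would need.

    `farm_feeds` maps a farm name to the RSS feed URL of that farm's
    search results page (hypothetical values). Each farm gets one
    "local search index" location for itself and one "OpenSearch"
    location for every other farm.
    """
    plan = {}
    for farm in farm_feeds:
        locations = [{"type": "local search index", "target": farm}]
        for other, feed_url in farm_feeds.items():
            if other != farm:
                locations.append(
                    {"type": "OpenSearch", "target": other, "feed": feed_url}
                )
        plan[farm] = locations
    return plan


if __name__ == "__main__":
    # Hypothetical farms and feed URLs, used only for illustration.
    feeds = {
        "Central": "http://central/searchcenter/_layouts/srchrss.aspx?k={searchTerms}",
        "Regional1": "http://regional1/searchcenter/_layouts/srchrss.aspx?k={searchTerms}",
        "Regional2": "http://regional2/searchcenter/_layouts/srchrss.aspx?k={searchTerms}",
    }
    for farm, locations in plan_federated_locations(feeds).items():
        print(farm, [(loc["type"], loc["target"]) for loc in locations])
```

With three farms, each farm ends up with three locations: its own index and two remote feeds. The number of OpenSearch locations to maintain grows with every farm you add.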

The following diagram illustrates in greater detail a federated search connection to a remote farm.

In this diagram:

On the Central Farm, a Search Center is added to the Company Info site collection. This Search Center is configured with the scope that allows users to search across the farm. This Search Center includes a Search Results page. An RSS feed is enabled for this page.

On the Regional Farm, a federated search connection (callout A) is configured to connect to the Search Results page of the Central Farm. This allows local users at the Regional Farm to search across content at the Central Farm.
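To make the role of the RSS feed more concrete, the following sketch parses a small RSS 2.0 document into a list of results, which is conceptually what a consumer of the remote Search Results page feed has to do. The sample feed is made up for illustration. The feed that a real Search Center returns may contain additional elements, and this is not how the Federated Results Web Part itself is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real search results feed may differ.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Search results (sample)</title>
    <item>
      <title>Travel policy</title>
      <link>http://central/policies/travel.docx</link>
      <description>Company travel policy</description>
    </item>
    <item>
      <title>Expense guidelines</title>
      <link>http://central/policies/expenses.docx</link>
      <description>How to file expenses</description>
    </item>
  </channel>
</rss>"""


def parse_results(feed_xml: str) -> list:
    """Return one dict per RSS <item>, with its title, link, and description."""
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        results.append(
            {
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
                "description": item.findtext("description", default=""),
            }
        )
    return results


if __name__ == "__main__":
    for result in parse_results(SAMPLE_FEED):
        print(result["title"], "->", result["link"])
```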

In many environments with multiple server farms, not all of the content on a server farm is relevant to users located near other farms. For example, company policies of a specific region may only apply to that region. If you know there is a subset of content that is relevant for users at other regions to search, create a scope on the farm that scopes search to the relevant subset of content. When you create a federated connection to the remote farm, connect to the same Search Results page RSS feed but add the scope as a URL parameter. For example: http://server/searchcenter/_layouts/srchrss.aspx?k={searchTerms}&s=<yourcustomscope>
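The scoped URL above can be assembled as in the following sketch. The helper names are hypothetical. The only parts taken from this article are the srchrss.aspx path and the k and s parameters. The second helper merely illustrates what the {searchTerms} token stands for; the actual substitution is performed by the product when a query is issued.

```python
from urllib.parse import quote


def scoped_feed_template(search_center_url: str, scope: str) -> str:
    """Build a query template for the search results RSS feed, limited to one scope.

    The {searchTerms} token is kept literally so that it can be replaced
    with the user's query later.
    """
    base = search_center_url.rstrip("/")
    return f"{base}/_layouts/srchrss.aspx?k={{searchTerms}}&s={quote(scope, safe='')}"


def expand_template(template: str, terms: str) -> str:
    """Illustrate token substitution by inserting URL-encoded search terms."""
    return template.replace("{searchTerms}", quote(terms, safe=""))


if __name__ == "__main__":
    # "Shared Content" is a hypothetical scope name.
    template = scoped_feed_template("http://server/searchcenter/", "Shared Content")
    print(template)
    print(expand_template(template, "travel policy"))
```

URL-encoding the scope name matters when the name contains spaces or other reserved characters.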

For more information on implementing a Search Center, see the following articles:

After you’ve finished creating and configuring federated search locations, connect each of them to a Federated Results Web Part so that users can see results from the location displayed in a Search Center. Configure a different Federated Results Web Part for each farm. When you configure the Federated Results Web Part properties, be sure to select the option to render results asynchronously (this is the default setting). With this setting, results are displayed as they are received, and users do not have to wait for slower connections before they start viewing results. If asynchronous rendering is not selected, the results do not render until each of the federated locations has either returned results or timed out. The timeout period is set to 90 seconds and cannot be changed.
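The difference between the two rendering modes can be illustrated with a small timing model. This is a simplification under stated assumptions, not a description of how the Web Part is implemented: each location is reduced to a single response time in seconds, and the names and sample latencies are hypothetical. Only the 90-second timeout comes from this article.

```python
TIMEOUT_SECONDS = 90.0  # timeout period stated in this article


def time_to_first_results(latencies: dict, asynchronous: bool = True,
                          timeout: float = TIMEOUT_SECONDS) -> float:
    """Seconds until the user sees any results.

    Asynchronous rendering shows results as soon as the fastest location
    answers. Synchronous rendering waits until every location has either
    answered or timed out.
    """
    if not latencies:
        raise ValueError("at least one federated location is required")
    capped = [min(seconds, timeout) for seconds in latencies.values()]
    return min(capped) if asynchronous else max(capped)


def arrival_order(latencies: dict, timeout: float = TIMEOUT_SECONDS) -> list:
    """Locations that answer before the timeout, fastest first."""
    ranked = sorted(latencies.items(), key=lambda pair: pair[1])
    return [name for name, seconds in ranked if seconds < timeout]


if __name__ == "__main__":
    # Hypothetical response times: a local farm, a remote farm, and an unreachable farm.
    sample = {"local": 0.2, "central": 1.5, "remote": 120.0}
    print("async:", time_to_first_results(sample, asynchronous=True))
    print("sync: ", time_to_first_results(sample, asynchronous=False))
    print("order:", arrival_order(sample))
```

In this model a single unreachable location delays a synchronously rendered page by the full timeout, while asynchronous rendering still shows local results almost immediately.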

An important consideration to evaluate when using federated search is security-trimming of search results. By default, security-trimming of search results persists for results returned by the following:

Local search index locations (the local farm).

OpenSearch locations that use Common Credentials (a single set of credentials for all users).

OpenSearch locations that use Per User Kerberos authentication.

However, user credentials are not passed automatically for authentication protocols other than Kerberos. To ensure that results are security-trimmed for the current user for these scenarios, extend the Federated Results Web Part to collect user credentials. For more information, see Creating a Custom Federated Search Web Part with a Credentials UI (http://go.microsoft.com/fwlink/?LinkId=121779&clcid=0x409).
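The rules above can be condensed into a small lookup, which may help when you review a list of planned federated locations. The string labels are hypothetical shorthand for the options named in this article. The function only restates the default behavior described here and does not query any SharePoint setting.

```python
def is_security_trimmed_by_default(location_type: str, credentials: str = "") -> bool:
    """Return True if results are security-trimmed by default, per this article.

    location_type: "local search index" or "OpenSearch".
    credentials:   for OpenSearch locations, "common" (Common Credentials),
                   "per-user kerberos", or any other label for a different
                   per-user authentication protocol.
    """
    if location_type == "local search index":
        return True
    if location_type == "OpenSearch":
        # Other protocols do not pass user credentials automatically, so the
        # Federated Results Web Part must be extended to collect them.
        return credentials in {"common", "per-user kerberos"}
    raise ValueError(f"unknown location type: {location_type!r}")


if __name__ == "__main__":
    checks = [
        ("local search index", ""),
        ("OpenSearch", "common"),
        ("OpenSearch", "per-user kerberos"),
        ("OpenSearch", "per-user ntlm"),  # hypothetical label for a non-Kerberos protocol
    ]
    for location_type, credentials in checks:
        print(location_type, credentials or "-",
              is_security_trimmed_by_default(location_type, credentials))
```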

Also consider using the Top Federated Results Web Part, which displays the top results from multiple federated locations. However, this Web Part is configured to display results synchronously and this setting cannot be changed. Consequently, the page loads only as fast as the slowest location configured in your Top Federated Results Web Part. If Kerberos authentication is not used, you’ll also need to extend this Web Part to collect user credentials if you want to ensure that search results for OpenSearch locations (all locations other than the local farm) are security-trimmed for each user.

Finally, while federated search gives users a view into multiple search sources, users are limited to the standard search options. Advanced search options cannot be used with federated search.

Using federated search with farms running Windows SharePoint Services

To use federated search with a farm running Windows SharePoint Services, upgrade the farm running Windows SharePoint Services to Search Server 2008 Express or Search Server 2008. Upgrading provides the advantage of offering a farm-wide search of the farm running Windows SharePoint Services, instead of a search scoped only to each content database. In addition, Search Server is required to provide RSS feeds on results. RSS is required to create an OpenSearch federated location to a remote farm’s results so that the results can be shown on the aggregated page.

The following diagram illustrates a geographically distributed environment with farms running Windows SharePoint Services upgraded to Search Server 2008 at the regional locations.

Summary of federated search

There are many advantages to using federated search in a geographic deployment. Federated search eliminates the need to crawl content over WAN connections or to synchronize content over WAN connections. Displaying the results in separate Web Parts helps users distinguish where content is located, making it easy to identify local content. Understanding where content is located can also help a user determine which results are most likely to be relevant.

There are a few drawbacks to this architecture, though. First, enterprise-wide relevance in search results cannot be achieved. Instead, relevance is scoped to each federated location. Next, query performance for remote locations is subject to WAN links. However, users typically receive search results for the local farm rather quickly.

The following table summarizes the tradeoffs of the federated search architecture.

Advantages:

Provides enterprise-wide search.

No limit on the number of documents or items that can be searched.

Content is not crawled or synchronized over WAN links.

Query performance is optimized for local content while at the same time providing results for remote content.

Users can search different locations without connecting to each location separately.

Each content store can be managed separately.

Windows SharePoint Services with Search Server 2008 can be used at regional farms, instead of Office SharePoint Server.

Security-trimming is preserved for the local farm and for remote farms if Kerberos authentication is used.

Disadvantages:

Search relevancy is not enterprise-wide. Relevancy is scoped to each content source.