A question from one of my students earlier today brought to my attention that even though I sometimes try to cover some deeper aspects of the FAST Search for SharePoint platform, there are still some architectural concepts and guidance that need to be fully absorbed before we can take this deep dive into FS4SP. So in this post I will cover the Crawling –> Processing –> Indexing architecture of FS4SP, as well as its ties to SharePoint 2010, to show which points you should pay attention to when configuring and administering your search architecture.

To put it simply, this is the “diagram” of the components involved from the moment a document is crawled until it is indexed:

FAST Content SSA (or FAST Search Content SSA, or FAST Search Connector, etc.): the sole purpose of this SSA is to crawl content (from within the SharePoint farm, since it is configured on a SharePoint server) and send it to be processed in the FAST farm. Apart from the fact that the content is being sent to FS4SP for processing, this SSA has the same crawling behavior as a regular SharePoint Search SSA, so the same rules of scaling and fault tolerance apply. It is important to understand how to scale this component and how to monitor its performance (to find out whether you need to scale).
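As a quick sketch of how you might keep an eye on the crawling side, the standard SharePoint 2010 search cmdlets work against the Content SSA too (the SSA name below is an example – use whatever name you gave yours):

```powershell
# Run in the SharePoint 2010 Management Shell.
# "FAST Content SSA" is a placeholder for the name of your Content SSA.
$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "FAST Content SSA"

# List each content source with its current crawl state,
# a quick way to see whether crawls are running, idle, or stuck.
Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa |
    Select-Object Name, Type, CrawlState, CrawlCompleted
```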

Content Distributor: the first component on the FAST farm; its main role is to receive batches of documents from the FAST Content SSA and route them to the Document Processors to be processed. This is a very lightweight component that should not impact your system’s performance, so the main consideration here is scaling for fault tolerance.

How (and why) to scale: deploy this component on at least two servers in your FS4SP farm so that in case the primary Content Distributor fails, the second Content Distributor can pick up the work while you troubleshoot the failed server. You can do this either during the initial deployment of your FAST farm or later by modifying your farm deployment. The important thing is to remember to add the address/port of both Content Distributors when configuring the FAST Content SSA (either through Central Administration or through the Set-SPEnterpriseSearchExtendedConnectorProperty cmdlet, separating multiple hostname:port values with semicolons).
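In deployment.xml terms, that means a content-distributor entry on two hosts. A hypothetical fragment (hostnames are examples, not from any real farm):

```xml
<!-- Hypothetical deployment.xml fragment: one Content Distributor
     on each of two hosts, for fault tolerance. -->
<host name="fs4sp01.contoso.com">
  <content-distributor />
  <!-- other components assigned to this host... -->
</host>
<host name="fs4sp02.contoso.com">
  <content-distributor />
</host>
```

With the default base port, the Content SSA would then typically be pointed at something like fs4sp01.contoso.com:13391;fs4sp02.contoso.com:13391 (13391 being the usual Content Distributor port, assuming you kept the default base port at install time).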

Document Processor(s): this guy is always busy in an FS4SP deployment with ongoing crawling/feeding, as it has the tough role of processing the content before it is sent to be indexed. Among its tasks are language and encoding detection, tokenization/word processing, stemming/lemmatization, property extraction, document conversion (extracting content from PDF, Office documents, etc.), and mapping crawled properties to managed properties, among many others. As I said, a very busy guy. And all of this is done in memory, which means the primary resources consumed by this process will be memory and CPU.

How (and why) to scale: you will definitely want to add multiple instances of this component to your FAST farm, either during the initial deployment or later on by simply opening a command prompt and issuing the command “nctrl add procserver” on any FS4SP server. This can be very helpful during the initial load of your system, when you can temporarily add multiple instances of this component to the search nodes (which are not yet being used by your users), and just as easily remove them later by executing “nctrl remove procserver”. A good rule of thumb for a dedicated processing server is 1-2 Document Processors per CPU core (after some tests done by a friend, I would recommend being conservative, with just 1 per core – the recommendation to observe your CPU/memory usage during peak processing times still stands, so you can properly assess your resource consumption/availability). On a non-dedicated server, make sure there are enough CPU resources left for both document processing and the other processes.
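The nctrl workflow from the paragraph above looks like this in practice (run on the FS4SP server you want to scale):

```powershell
# Run from a command prompt (or the FS4SP shell) on the target server.
nctrl status                # list the components running on this node
nctrl add procserver        # add one Document Processor instance
nctrl add procserver        # repeat once per extra instance you want
nctrl remove procserver     # remove an instance again after the initial load
```

Checking “nctrl status” before and after is an easy way to confirm how many procserver instances the node is actually running.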

How (and what) to monitor: the main indicator that this component is your bottleneck is CPU utilization constantly at 90-100% on the servers hosting Document Processors. Each Document Processor component runs as a separate instance of procserver.exe. You can also monitor the Document Processor performance counters.
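A minimal monitoring sketch, using only built-in PowerShell counters (no FS4SP-specific counter names assumed):

```powershell
# Each Document Processor is a separate procserver.exe process,
# so per-process figures give a quick view of who is working hard.
Get-Process procserver | Select-Object Id, CPU, WorkingSet64

# Sample total CPU load on the host every 5 seconds for a minute;
# sustained 90-100% here suggests you are processing-bound.
Get-Counter '\Processor(_Total)\% Processor Time' -SampleInterval 5 -MaxSamples 12
```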

Indexing Dispatcher: remember what I said about the Content Distributor? The Indexing Dispatcher has a similar role: it receives batches of processed content and forwards them to the Indexer component to be effectively indexed. This component is also very lightweight and should not impact your performance.

How (and why) to scale: add at least two instances of this component across different servers for fault tolerance reasons, either during the initial deployment of your FAST farm or later on by reconfiguring it.
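As with the Content Distributor, this is a deployment.xml change. A hypothetical fragment (example hostnames):

```xml
<!-- Hypothetical deployment.xml fragment: one Indexing Dispatcher
     on each of two hosts, for fault tolerance. -->
<host name="fs4sp01.contoso.com">
  <indexing-dispatcher />
</host>
<host name="fs4sp02.contoso.com">
  <indexing-dispatcher />
</host>
```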

Indexer: I could spend a whole other post just talking about the Indexer component, but if you made it this far, just understand that this component receives the processed documents that were in memory, saves them to disk (by default under C:\FASTSearch\data\data_fixml) and then builds the actual optimized binary index (along with sorting tables, summary tables, and so on), also on disk (by default under C:\FASTSearch\data\data_index).

How (and what) to monitor: to check the activity of this component you can use the Indexer performance counters. The main concept to understand is that all documents arrive at an Indexer in Partition 0 (a different concept and purpose than Index Partitions in SharePoint Search), where they get indexed before moving to the upper partitions (Partitions 1-4). You will therefore want to monitor how fast your Partition 0 can reindex itself, making new content searchable as it arrives. If this seems confusing (and it most likely will), check the FS4SP capacity planning guide mentioned above, as it explains the index partitions in more detail.
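Since counter set names vary between installations, a safe starting point is to discover what FS4SP exposes on your servers rather than guessing exact counter paths:

```powershell
# Discover the FS4SP-related performance counter sets on this server.
Get-Counter -ListSet *FAST* | Select-Object CounterSetName

# Once you spot the indexer-related set, sample everything in it, e.g.:
# Get-Counter -Counter (Get-Counter -ListSet "<set name here>").Paths
```

The second command is left commented out because the set name is something you find in the first step, not something I can state for every deployment.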