SharePoint 2013: Crawl Scaling recommendations

In SharePoint 2013 Search, crawling, filtering, and indexing are no longer tied to a single component (as they were to the Crawler in SharePoint 2010). The Crawl Component in 2013 is only responsible for downloading documents ("gathering") and feeding them to the Content Processing Component(s).

By offloading the filtering ("content processing") and indexing tasks, the crawler is no longer I/O or CPU intensive, and does not need to be scaled past a couple of components (for fault tolerance and network throughput to content sources). With 2x1Gbit/s connections, content farms are likely to be the bottleneck, rather than the crawler itself. Host distribution rules are also gone (see http://blogs.msdn.com/b/sharepoint_strategery/archive/2013/06/30/why-host-distribution-rules-dont-apply-to-sharepoint-2013.aspx), but with the new search architecture they are no longer needed either. Likewise, with the architectural changes, Crawl DBs are now added only for content volume, not for crawl performance.

This will sound strange to those coming from a SharePoint 2010 Search background (e.g. http://blogs.msdn.com/b/russmax/archive/2010/04/16/search-2010-architecture-and-scale-part-1-crawl.aspx), where the bottlenecks are the number of crawl components, the number of crawl databases, and possibly the property database IOPS. Like FAST Search for SharePoint 2010, Search in SharePoint 2013 does not have a property store (managed property data is instead written to disk by the Index Component), and does not do document filtering in the crawl component. As a consequence, crawl performance is scaled up primarily by increasing the number of Content Processing Components (analogous to procservers in FS4SP, with contentdistributor and indexingdispatcher functionally rolled in).

Each Content Processing Component (CPC) also scales on its own, based on CPU availability, up to a limit (the default is good for up to 12 cores; see http://technet.microsoft.com/en-us/library/cc262787.aspx#Search). For most SharePoint content, a CPC will process 5-10 items per second per core. For example, on an 8-core server with an Admin Component, a Crawl Component, and a Content Processing Component (with ~6 cores to itself), you might see a crawl rate of ~45 items per second (6 cores at an average of 7.5 items per second), assuming no content source or index bottleneck.
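
The arithmetic in that example can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate, not a Microsoft sizing tool: the function name `estimate_crawl_rate` is hypothetical, and the only inputs are the figures quoted above (5-10 items per second per core, with 7.5 as the midpoint). It assumes no content source or index bottleneck.

```python
def estimate_crawl_rate(cpc_cores: float, items_per_sec_per_core: float = 7.5) -> float:
    """Rough crawl rate (items/second) for one Content Processing Component.

    cpc_cores: cores effectively available to the CPC (assumption: ~6 on an
               8-core server that also hosts Admin and Crawl components).
    items_per_sec_per_core: 5-10 for most SharePoint content; 7.5 is the midpoint.
    """
    if cpc_cores < 0 or items_per_sec_per_core < 0:
        raise ValueError("inputs must be non-negative")
    return cpc_cores * items_per_sec_per_core


# Low, typical, and high estimates for the 8-core example above.
low = estimate_crawl_rate(6, 5.0)
typical = estimate_crawl_rate(6)
high = estimate_crawl_rate(6, 10.0)
print(f"Estimated crawl rate: {low:.0f}-{high:.0f} items/s (typical ~{typical:.0f})")
```

Treat the result as a starting point for planning; the actual rate depends on the content mix and on how quickly the content sources respond.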

In SharePoint 2010, index creation and searching of that index were separated across Crawl and Query components. In 2013, all indexing (shadow and merge), propagation, and searching of an index happens within the Index Component, on its local disks rather than through databases; for this reason, it needs separate storage in any production environment. During small crawls/shadow index builds, the Index Component uses small writes (~256 KB) sustained at a rate of 100 IOPS. For handling queries, the component uses small reads (~64 KB), at about 30 IOPS per query. To support 10 QPS at low latency, for example, the storage subsystem would need to be capable of 300 IOPS for 64 KB reads. If a crawl changes more than 10% of the currently indexed items, a master merge is triggered, leading to large reads and writes by the indexer (~100 MB per operation), which can cause the performance of both small writes (shadow index) and small reads (queries) to drop. Benchmarking index storage is critical to ensuring that query latency and crawl rate stay at acceptable levels throughout a master merge.
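
As a rough illustration of the two rules of thumb above (about 30 read IOPS per query, and a master merge once more than 10% of indexed items change), here is a minimal Python sketch. The constant and function names are hypothetical and are not part of any SharePoint API; the numbers come only from the paragraph above.

```python
IOPS_PER_QUERY = 30                   # ~64 KB reads per query, from the text above
MASTER_MERGE_THRESHOLD_PERCENT = 10   # a merge is triggered above this share of changed items


def required_read_iops(queries_per_second: float) -> float:
    """64 KB read IOPS the index storage should sustain for a target QPS."""
    return queries_per_second * IOPS_PER_QUERY


def triggers_master_merge(changed_items: int, indexed_items: int) -> bool:
    """True if a crawl changing `changed_items` exceeds the 10% threshold."""
    if indexed_items <= 0:
        raise ValueError("indexed_items must be positive")
    # Integer comparison avoids floating-point rounding at the boundary.
    return changed_items * 100 > indexed_items * MASTER_MERGE_THRESHOLD_PERCENT


print(required_read_iops(10))                     # target of 10 QPS
print(triggers_master_merge(150_000, 1_000_000))  # 15% of items changed
print(triggers_master_merge(50_000, 1_000_000))   # 5% of items changed
```

A check like this can help anticipate when a large incremental crawl is likely to push the storage into the heavier master merge I/O pattern.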

So, to summarize the 2013 crawl performance scaling story, crawl performance depends on:

Response time from content sources and network bandwidth to content sources

CPU resources for the Content Processing Component

I/O resources for the Index Component

The I/O requirements described above are lower than in FAST Search for SharePoint 2010. The new minimums are as follows:

64 KB read – 300 IOPS [10 queries per second]

100 MB read – 200 MB/s [master merge]

100 MB write – 200 MB/s [master merge]

256 KB write – 100 IOPS [shadow index]
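
When benchmarking index storage, the minimums above can be encoded as a simple checklist. The sketch below is only an illustration: the `MINIMUMS` keys, the `find_shortfalls` helper, and the measured values are all hypothetical, and the measurements themselves would have to come from whatever disk benchmarking tool you use.

```python
# Minimums from the list above: IOPS for small I/O, MB/s for master merge I/O.
MINIMUMS = {
    "64 KB read (IOPS)": 300,     # 10 queries per second
    "100 MB read (MB/s)": 200,    # master merge
    "100 MB write (MB/s)": 200,   # master merge
    "256 KB write (IOPS)": 100,   # shadow index
}


def find_shortfalls(measured: dict) -> dict:
    """Return {workload: (measured, minimum)} for every workload below its minimum.

    A workload missing from `measured` is treated as 0 so that it is flagged.
    """
    return {
        name: (measured.get(name, 0), minimum)
        for name, minimum in MINIMUMS.items()
        if measured.get(name, 0) < minimum
    }


# Hypothetical benchmark results, for illustration only.
measured_example = {
    "64 KB read (IOPS)": 450,
    "100 MB read (MB/s)": 180,
    "100 MB write (MB/s)": 210,
    "256 KB write (IOPS)": 120,
}

for name, (got, need) in find_shortfalls(measured_example).items():
    print(f"{name}: measured {got}, minimum {need}")
```

In this made-up example only the 100 MB read throughput falls short, which would suggest query latency and crawl rate could suffer during a master merge.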

The TechNet documentation related to the index storage requirements and benchmarking can be found here: