Indexing a massive catalog can take hours because three expensive steps are involved: running the queries that populate the temporary tables, constructing a large catalog hierarchy, and combining the resulting data set to populate the index.

The regular process preprocesses each temporary table (TI_XXXXXX) sequentially, waiting for the current table to finish before moving on to the next. This can be inefficient if your environment can handle parallel processing (e.g. a multi-core CPU), because most of the tables are independent of each other: you don't always need the data in TI_XXXXXX to populate TI_YYYYYY.

To take advantage of parallel preprocessing and distributed indexing for your index, you will need to split the index into shards. Sharding the index is the process of splitting up the index into individual shards for processing. There are two types of shards: horizontal and vertical shards.

Horizontal shards are shards that split up the index by catentry ID. For example, Horizontal Shard A will process catentry ID 1 - 500,000, Horizontal Shard B will process catentry ID 500,001 - 1,000,000, etc. Each horizontal shard will perform preprocessing and indexing only for its set of catentries. Once all the horizontal shards have completed indexing, these indexes will be merged to make up a single index.
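The range split described above can be sketched in a few lines. This is only an illustration of the arithmetic, not the product's shard configuration: the function name horizontal_shard_ranges and its parameters are hypothetical, and it assumes catentry IDs are divided into contiguous, roughly equal ranges.

```python
def horizontal_shard_ranges(first_id, last_id, shard_count):
    """Split the inclusive ID range [first_id, last_id] into contiguous ranges.

    Hypothetical helper: it only illustrates how catentry IDs could be
    divided between horizontal shards.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    if last_id < first_id:
        raise ValueError("last_id must not be smaller than first_id")

    total = last_id - first_id + 1
    base, extra = divmod(total, shard_count)

    ranges = []
    start = first_id
    for i in range(shard_count):
        # The first `extra` shards take one additional ID each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more shards requested than there are IDs
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


print(horizontal_shard_ranges(1, 1_000_000, 2))
# [(1, 500000), (500001, 1000000)]
```

With two shards over IDs 1 to 1,000,000 this reproduces the Shard A / Shard B split from the example above.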

Vertical shards are shards that perform preprocessing that requires all catentries to be processed at the same time. For example, catalog hierarchy processing requires all catentries to be processed together to make sure the hierarchy is consistent. As a result, catalog hierarchy preprocessing is performed in a vertical shard. Vertical shards can also be configured for preprocessing customizations that require all catentries to be processed at the same time. Once the vertical shards have completed preprocessing, their data is combined with each individual horizontal shard for indexing.

Below is an example configuration of the horizontal and vertical shards:

With sharding, we can take advantage of parallel processing by running an intensive process such as catalog hierarchy preprocessing while other index preprocessing runs at the same time, significantly reducing the total indexing time.
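The overall flow (vertical and horizontal work running concurrently, then vertical output combined with each horizontal shard) can be sketched as follows. This is a conceptual model only: preprocess_vertical, preprocess_horizontal and run_sharded_preprocess are hypothetical placeholders, not product scripts or APIs, and a thread pool stands in for whatever processes or machines actually host the shards.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess_vertical(name):
    # Placeholder for work that needs every catentry at once
    # (for example, catalog hierarchy preprocessing).
    return {"shard": name, "kind": "vertical"}


def preprocess_horizontal(id_range):
    # Placeholder for preprocessing restricted to one catentry ID range.
    return {"shard": id_range, "kind": "horizontal"}


def run_sharded_preprocess(vertical_names, horizontal_ranges, max_workers=4):
    """Run all shards concurrently, then attach vertical output to each range."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        vertical_futures = [pool.submit(preprocess_vertical, n) for n in vertical_names]
        horizontal_futures = [pool.submit(preprocess_horizontal, r) for r in horizontal_ranges]
        # result() blocks until the corresponding shard has finished.
        vertical = [f.result() for f in vertical_futures]
        horizontal = [f.result() for f in horizontal_futures]

    # Each horizontal shard is indexed together with the vertical data.
    return [
        {"range": h["shard"], "vertical_inputs": [v["shard"] for v in vertical]}
        for h in horizontal
    ]


plan = run_sharded_preprocess(["hierarchy"], [(1, 500000), (500001, 1000000)])
for entry in plan:
    print(entry)
```

The point of the sketch is the ordering: nothing waits on the hierarchy shard until the combine step, so the slow vertical work overlaps with the horizontal preprocessing.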

To perform delta indexing, you need to run delta index preprocessing and delta buildindex. To run delta index preprocessing, set the fullbuild parameter to false when running the di-preprocess script, like so:

You may notice that this is the first parameter we pass into the preprocess script. The script uses this path to locate the preprocessing XML files it needs to build and populate the temporary tables.

Note that this will result in much more logging done during preprocessing, so if you need to see what is happening in the middle of preprocessing (rather than at the end when it fails, for example), then you may need to increase the FileHandler limit (file size) or count (number of historical files), for example:

java.util.logging.FileHandler.limit=80000000

java.util.logging.FileHandler.count=10
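Before raising these values it helps to estimate how much disk space the rotated logs can occupy, which is roughly limit multiplied by count. A minimal sketch, assuming the two properties are read from a properties-style text; parse_properties and max_log_bytes are hypothetical helper names.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def max_log_bytes(props):
    # limit is the approximate maximum size of one log file in bytes,
    # count is the number of files kept in the rotation.
    limit = int(props["java.util.logging.FileHandler.limit"])
    count = int(props["java.util.logging.FileHandler.count"])
    return limit * count


settings = parse_properties(
    "java.util.logging.FileHandler.limit=80000000\n"
    "java.util.logging.FileHandler.count=10\n"
)
print(max_log_bytes(settings))
# 800000000
```

With the values shown above, the log files can grow to about 800 MB in total.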

5. What data do I collect for a di-preprocess issue and how do I begin troubleshooting it?

Once you have collected this data, you should first look into the type of issue you are having by reviewing wc-dataimport-preprocess.log. If there is an issue with a particular temporary table (TI_XXXXXX), you can verify if the data is consistent with what you are expecting to be inside. If the data in the table is consistent, then you should review the corresponding xml file from the pre-processConfig directory used to build this table, to verify that the query used will grab the expected data.

6. What data can I capture to determine why di-preprocess is taking so long to complete?

There are multiple parts to preprocessing that you can check in wc-dataimport-preprocess.log to see what part(s) are taking the longest to complete. Pre-FEP7, you can search for "completed in" to see how long it took to process each table:
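For a large log, a small filter script can pull out just those lines. The sketch below assumes only that the relevant lines contain the literal text "completed in"; the function name lines_containing is hypothetical, and the sample lines are invented for illustration and do not reflect the real log format.

```python
def lines_containing(lines, marker="completed in"):
    """Return the lines that contain `marker`, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if marker in line]


# Invented sample input; real wc-dataimport-preprocess.log lines look different.
sample_lines = [
    "step A started\n",
    "step A completed in 12 s\n",
    "step B started\n",
    "step B completed in 340 s\n",
]

for match in lines_containing(sample_lines):
    print(match)

# Against a real file, pass the open file object instead of the sample list:
# with open("wc-dataimport-preprocess.log", encoding="utf-8", errors="replace") as f:
#     for match in lines_containing(f):
#         print(match)
```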

7. Why is index preprocessing failing due to "another indexing process is currently running"?

To make indexing run as an atomic action, we put lock records into TI_DELTA_CATENTRY/TI_DELTA_CATGROUP. While preprocess is running, there will be one lock record in these tables:

P (action) = indexing is currently in progress

When preprocess ends, it will add one more lock record to indicate that preprocessing has completed and buildindex can be started:

B (action) = index preprocessing completed, buildindex can be run

When a new index preprocess is started, it first checks TI_DELTA_CATENTRY/TI_DELTA_CATGROUP to see whether the table contains a record with the P or B action. If it does, the new index preprocess fails so that it does not interrupt the preprocess/buildindex process that is currently running. If you are sure that no other preprocess/buildindex process is running, you can either delete these records from the table or run the preprocess script with the -force true parameter.
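The check described above amounts to a simple rule, modelled here as a sketch. It is not the product's implementation: can_start_preprocess is a hypothetical function, the list of actions stands in for the lock records read from TI_DELTA_CATENTRY/TI_DELTA_CATGROUP, and the force flag is assumed to bypass the check in the way the -force true parameter is described.

```python
# Actions that indicate a preprocess/buildindex cycle is still open.
BLOCKING_ACTIONS = {"P", "B"}


def can_start_preprocess(existing_actions, force=False):
    """Decide whether a new preprocess may start.

    existing_actions: action values of the lock records currently in the table.
    force: models running the preprocess script with -force true.
    """
    if force:
        return True
    return not (BLOCKING_ACTIONS & set(existing_actions))


print(can_start_preprocess([]))                 # True
print(can_start_preprocess(["P"]))              # False
print(can_start_preprocess(["P", "B"]))         # False
print(can_start_preprocess(["P"], force=True))  # True
```

Forcing the run is only safe under the condition stated above, namely that no other preprocess/buildindex process is actually running.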

8. What types of changes can we cover using delta indexing? What types of changes require full indexing?

9. Do I need to run preprocess/buildindex if I use the UpdateSearchIndex scheduled job?

No, that is not necessary, since the UpdateSearchIndex scheduled job is used to automate the indexing process by scheduling indexing to run at specific times. However, you can run preprocess/buildindex manually after making changes so that you don't need to wait until UpdateSearchIndex runs again to have those changes added to the index. Behind the scenes, UpdateSearchIndex essentially performs preprocess/buildindex in a single process. An index update from UpdateSearchIndex is equivalent to one from preprocess/buildindex, so you can use either approach for updating the index, or a mix of both. For example, you can schedule hourly UpdateSearchIndex runs while running preprocess/buildindex manually to trigger immediate updates after making a change. For more information about configuring UpdateSearchIndex, you can review the following Knowledge Center page: http://www-01.ibm.com/support/knowledgecenter/SSZLC2_7.0.0/com.ibm.commerce.admin.doc/tasks/tsdschedsearchupdateindex.htm?lang=en

10. How can I change the behaviour of the preprocessing script?

You can add extra parameters to the script to change the behaviour of preprocessing. For example:

(FEP7+) -skipDeltaNoEntry <true/false>: When performing delta preprocessing with this parameter set to true, the script will check whether there are any delta updates to perform. If there are none, delta preprocessing will end. This differs from the default behaviour, which reconstructs all of the temporary tables even though they will be empty because there were no delta updates.

(FEP8) -nonLangTables <true/false>: When performing preprocessing with this parameter set to true, only the non-language specific tables will be processed (ex. TI_CATENTRY_0). You can quickly identify these tables as only having one number appended to the name, which is used to identify the index they are associated with (ex. MC_10001 = _0, MC_10051 = _1, etc...)

(FEP8) -langTables <true/false>: When performing preprocessing with this parameter set to true, only the language specific tables will be processed (ex. TI_ATTR_0_1). You can quickly identify these tables as having two numbers appended to the name. The first number is used to identify the index they are associated with (ex. MC_10001 = _0, MC_10051 = _1, etc...). The second number is used to identify the language ID (en_US = -1 which turns into _1).

(FEP8) -deepSequence <true/false>: When performing preprocessing with this parameter set to true, products will be sequenced in the index based on the deep search sequencing functionality. By default, when using category navigation, only the category's products are displayed. However, with deep search, all of the subcategories' products will also be displayed. If these products have sequence values, then we will need to process sequencing differently to account for deep search, which is what deep search sequencing is for. If you are using deep search, as well as sequencing for your products, then you can use this parameter to enable deep search sequencing. However, if you don't use deep search or sequencing, then this functionality isn't applicable to you.

(FEP8) -deepUnpublish <true/false>: When performing preprocessing with this parameter set to true, preprocessing will be performed based on the deep category unpublish functionality. The deep category unpublish feature allows immediate child categories and all of their underlying subcategories and products to be hidden from shoppers in the storefront. If you are using this functionality, then this parameter will prevent these products/categories from being indexed as published.

(FEP8) -publishedOnly <true/false>: When performing preprocessing with this parameter set to true, preprocessing will be performed to allow only products from published categories to be indexed when deep category unpublish is enabled.
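The naming rule for language specific and non-language specific tables (one trailing number versus two) can be expressed as a small classifier. This is a sketch built only from the examples given above: classify_temp_table is a hypothetical helper, and mapping the second suffix back to a negative language ID is an assumption based on the single en_US = -1 example.

```python
def classify_temp_table(name):
    """Classify a temporary table name by its trailing numeric suffixes."""
    parts = name.split("_")
    trailing = []
    # Peel numeric tokens off the end of the name, keeping their order.
    while len(parts) > 1 and parts[-1].isdigit():
        trailing.insert(0, int(parts.pop()))

    if len(trailing) == 1:
        return {"kind": "non-language", "index": trailing[0], "language_id": None}
    if len(trailing) == 2:
        # Assumption: the suffix is the language ID without its minus sign.
        return {"kind": "language", "index": trailing[0], "language_id": -trailing[1]}
    return {"kind": "unknown", "index": None, "language_id": None}


print(classify_temp_table("TI_CATENTRY_0"))
# {'kind': 'non-language', 'index': 0, 'language_id': None}
print(classify_temp_table("TI_ATTR_0_1"))
# {'kind': 'language', 'index': 0, 'language_id': -1}
```

A table whose base name itself ends in a numeric token would be misclassified by this rule, so treat the result as a quick aid rather than a definitive answer.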

You may have the following scenario where you want to have products published but some of their associated items unpublished. This can cause an issue with the facets showing the item that you expect to be unpublished. You can remove your unpublished catentries from the search index to prevent this issue by doing the following:

Open wc-dataimport-preprocess-fullbuild.xml located in CommerceServer70/instances/<INSTANCE_NAME>/search/pre-processConfig/MC_<MASTERCATALOG_ID>/<DB_TYPE>.

If you use the 'display to customers' flag in Management Center to unpublish select SKUs, you may have noticed that the facet count on the storefront still tallies these unpublished items. For example:

1. Create a product: 'Shirt'

2. Assign it an attribute 'Colour' (and make this attribute facetable)

3. Create a couple of items for the product, with attribute values 'Red', 'Blue', 'Green'.

4. For the shirt SKU of colour 'Blue', uncheck 'display to customers'

5. Run preprocess/buildindex

6. On the storefront, search for 'shirt'. On the left nav, you'll see all colour values appear, including 'Blue', which only belongs to the unpublished SKU

7. Click on the 'Blue' facet. 0 products/SKUs are returned.

This can be frustrating for shoppers, who think that a blue shirt is available, but then can't actually display the product.
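The mismatch in the steps above can be reproduced with a toy model. This is not how the search server computes facets; it is a sketch that assumes facet values are tallied over every SKU while the result list only returns published ones. The data and the names facet_counts and results_for are hypothetical.

```python
from collections import Counter

# Toy data mirroring the 'Shirt' example: the Blue SKU is unpublished.
skus = [
    {"product": "Shirt", "colour": "Red", "published": True},
    {"product": "Shirt", "colour": "Blue", "published": False},
    {"product": "Shirt", "colour": "Green", "published": True},
]


def facet_counts(items, published_only):
    """Tally colour values, optionally ignoring unpublished SKUs."""
    return Counter(
        item["colour"] for item in items if item["published"] or not published_only
    )


def results_for(items, colour):
    """Return the SKUs a shopper can actually see for one facet value."""
    return [item for item in items if item["colour"] == colour and item["published"]]


print(facet_counts(skus, published_only=False))  # 'Blue' is tallied
print(results_for(skus, "Blue"))                 # [] -> 0 results for the shopper
print(facet_counts(skus, published_only=True))   # 'Blue' no longer appears
```

Keeping unpublished SKUs out of the data that feeds the facet is what makes the 'Blue' value disappear in the last call.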

You can change this behaviour by updating a di-preprocess XML. Locate the file wc-dataimport-preprocess-attribute.xml for the CatalogEntry index you want to change, for example: