Content crawling in Office SharePoint fails

Problem

Office SharePoint fails to fully crawl a content source containing Excel
files with think-cell links, and the following message appears in the crawl log:

Error in the Site Data Web Service. (Invalid high surrogate
character (0xXXXX). A high surrogate character must have a value from range
(0xD800 - 0xDBFF).)

Cause

The failure is caused by a bug in Excel 2000 and Excel XP that produces Excel
files with corrupt metadata. The problem occurs when a custom document property
of type string with a linked source is added to an Excel document and the source
cannot be resolved. In later versions of Excel, the document property value is
then set to something valid (e.g. an empty string). In Excel 2000 and Excel XP,
however, the value contains garbage and may cause the Office SharePoint crawler
to fail. The Excel documentation explicitly states that the document property
value is set to a default value before being updated when the source is
resolved, so the garbage value is a bug in Excel 2000 and Excel XP.

The problem can be reproduced in Excel 2000 or Excel XP as follows:

Press Alt+F11 to open the macro window and run the AddDocumentProperty routine.

Go to File → Properties and select the Custom tab.

The value associated with the newly added TestProperty is garbage.
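
To illustrate why a garbage property value can lead to the surrogate error quoted in the Problem section, the following Python sketch scans a byte sequence for unpaired UTF-16 surrogates. It is not the check performed by Office SharePoint. The function name find_unpaired_surrogates is hypothetical, and the sketch assumes, purely for illustration, that the value is read as little-endian UTF-16 code units.

```python
import struct
from typing import List, Tuple

# Surrogate ranges as defined for UTF-16; the high range matches the
# range quoted in the crawl log message (0xD800 - 0xDBFF).
HIGH_MIN, HIGH_MAX = 0xD800, 0xDBFF
LOW_MIN, LOW_MAX = 0xDC00, 0xDFFF


def find_unpaired_surrogates(raw: bytes) -> List[Tuple[int, int]]:
    """Return (index, code_unit) for every unpaired surrogate.

    Assumption: raw holds little-endian UTF-16 code units. A trailing
    odd byte, if any, is ignored.
    """
    count = len(raw) // 2
    units = struct.unpack("<%dH" % count, raw[: 2 * count])
    bad = []
    i = 0
    while i < count:
        unit = units[i]
        if HIGH_MIN <= unit <= HIGH_MAX:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < count and LOW_MIN <= units[i + 1] <= LOW_MAX:
                i += 2
                continue
            bad.append((i, unit))
        elif LOW_MIN <= unit <= LOW_MAX:
            # A low surrogate with no preceding high surrogate.
            bad.append((i, unit))
        i += 1
    return bad


# A well-formed string contains no unpaired surrogates.
print(find_unpaired_surrogates("TestProperty".encode("utf-16-le")))  # []

# Made-up "garbage" bytes: a lone low surrogate followed by "A".
print(find_unpaired_surrogates(b"\x00\xdc\x41\x00"))  # [(0, 56320)]
```

Arbitrary bytes are not guaranteed to form valid surrogate pairs, which is consistent with a consumer that validates strings rejecting a garbage value.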

Solution

think-cell uses custom document properties and, after noticing this behavior,
we altered our code to add our document properties with type boolean rather
than string. Both Excel 2000 and Excel XP set the document property
to a valid boolean value and this value remains valid if the link source
cannot be resolved.

Files created using think-cell 5.0 and higher use this workaround and
should be crawled successfully by Office SharePoint.