Indexing content in complex Umbraco data types

Techniques for advanced content indexing

Robert Foster·25 September 2018

The Umbraco CMS is a very flexible and open platform for building an expressive and intuitive interface for content editors, but that flexibility sometimes comes at a price when it comes to indexing the produced content for Lucene/Examine-based searching.

Plugins like Stacked Content, along with Umbraco's built-in Grid Editor and Nested Content data types, use JSON as a storage format, which doesn't lend itself to indexing and searching without some help. So we're going to look at some code that extracts the relevant information by hooking into Examine's indexing events.

In the first example, we're going to focus on the Nested Content and InnerContent (aka Our.Umbraco.InnerContent) based data types.

If you haven't come across these data types before, Nested Content is built into the Umbraco core, and InnerContent is an API supporting the derived data types Stacked Content and Content List. These two can be installed using NuGet or via the Umbraco Package Manager. Essentially, these data types are based on the concept of "unattached" content nodes, which can be rendered as lists of IPublishedContent from a single property. They are unattached because they aren't actually part of the Content Tree and hence don't have a parent node at all. Instead, they are stored in the published content cache, serialised as JSON-formatted objects.

So our first example effectively attempts to map the raw JSON value of these unattached content nodes and extract out the fields that are of interest:

The above code recursively walks through the JSON token structure: if the token is a JArray, it calls itself on each element; if it's a JObject, it loops through each property and extracts the value of any that match one of the targetedFields passed in, combining them into a single string for return.
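The original listing is not reproduced here, so the following is only a minimal, language-neutral sketch of the same idea, written in Python rather than the C# you would use on an Umbraco site. Python lists and dicts stand in for JArray and JObject. The function name, the sample JSON and the targeted keys are all assumptions made for illustration. Unlike the description above, this sketch also descends into nested objects and arrays found inside a property value.

```python
import json

def extract_fields(token, targeted_fields):
    """Collect the values of matching keys anywhere in a parsed JSON structure."""
    parts = []
    if isinstance(token, list):
        # Array (JArray in Json.NET terms): recurse into each element.
        for item in token:
            parts.append(extract_fields(item, targeted_fields))
    elif isinstance(token, dict):
        # Object (JObject): inspect each property.
        for key, value in token.items():
            if isinstance(value, (list, dict)):
                # Nested structure: keep walking.
                parts.append(extract_fields(value, targeted_fields))
            elif key in targeted_fields and value is not None:
                parts.append(str(value))
    # Drop empty results so the output has no stray spaces.
    return " ".join(p for p in parts if p)

# Hypothetical stored value, loosely shaped like a list of nested items.
raw = json.dumps([
    {"name": "Item 1", "ncContentTypeAlias": "feature",
     "title": "Fast", "body": "Indexes quickly"},
    {"name": "Item 2", "ncContentTypeAlias": "feature",
     "title": "Simple", "children": [{"title": "Nested", "body": None}]},
])

combined = extract_fields(json.loads(raw), {"title", "body"})
print(combined)
```

Only the values stored under `title` and `body` end up in the combined string; structural keys such as the content type alias are ignored.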

Pretty much the properties you'd expect to be useful for indexing in any Content item.

Now let's look at how we can do the same for the Grid Editor. Because the Grid Editor also uses JSON to store its property value, we can use the same method that we've used for Nested Content and InnerContent data. However, the Grid Editor has only a few properties that are desirable for indexing, depending on how you have set up your Grid. Out of the box, the JSON keys you may want to target using the targetedFields parameter might be:

value (for RichText, Heading 1, etc.)

caption (for an Image property)

The easiest way to work out what you want to index and what you want to discard is to inspect the JSON value itself in the Umbraco.config cache file.
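If the stored value is large, a quick tally of the keys it contains can make that inspection easier. The sketch below is an illustration in Python, not part of the original code. The `count_keys` helper is hypothetical and the Grid value is a hand-written, simplified stand-in, so check the key names against your own cache file.

```python
import json
from collections import Counter

def count_keys(token, counts=None):
    """Tally how often each key appears anywhere in a parsed JSON structure."""
    counts = Counter() if counts is None else counts
    if isinstance(token, list):
        for item in token:
            count_keys(item, counts)
    elif isinstance(token, dict):
        for key, value in token.items():
            counts[key] += 1
            count_keys(value, counts)
    return counts

# Simplified, hand-written stand-in for a stored Grid value (assumption).
grid_json = """
{"name": "1 column layout",
 "sections": [{"grid": 12, "rows": [{"name": "Headline", "areas": [{"grid": 12, "controls": [
   {"value": "Welcome", "editor": {"alias": "headline"}},
   {"value": "<p>Intro text</p>", "editor": {"alias": "rte"}},
   {"value": {"image": "/media/1001/a.jpg", "caption": "A caption"},
    "editor": {"alias": "media"}}
 ]}]}]}]}
"""

key_counts = count_keys(json.loads(grid_json))
for key, n in sorted(key_counts.items()):
    print(f"{key}: {n}")
```

Keys that carry editor-entered text (here `value` and `caption`) are candidates for targeting; layout keys such as `grid` or `alias` can be discarded.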

Now to glue it all together. Now that we know how to extract the content for indexing from a JSON object, we need to be able to pull the JSON string out of the properties we're indexing in the first place. The following method does that by using the PropertyEditorResolver to retrieve the appropriate editor, which in turn gives us the content we need:

Note the default case (lines 99-101) for the switch statement is to simply assume we can index the property value without any special processing.

Line 104 takes all of the extracted content and puts it in the index with the given key - we're effectively combining the values from a whole lot of properties into one index field, which makes querying a lot simpler.
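The shape of that aggregation step can be sketched as follows. This is a Python illustration under stated assumptions, not the original AggregateFields method: the PropertyEditorResolver lookup is replaced by a plain dictionary of (editor alias, raw value) pairs, and the editor aliases, property names and helper names are assumed for the example.

```python
import json

# Editor aliases treated as JSON-backed (assumed values; check your own install).
JSON_EDITOR_ALIASES = {"Umbraco.NestedContent", "Umbraco.Grid", "Our.Umbraco.StackedContent"}

def json_values(token, targeted_fields):
    """Yield string values of targeted keys found anywhere in parsed JSON."""
    if isinstance(token, list):
        for item in token:
            yield from json_values(item, targeted_fields)
    elif isinstance(token, dict):
        for key, value in token.items():
            if isinstance(value, (list, dict)):
                yield from json_values(value, targeted_fields)
            elif key in targeted_fields and value is not None:
                yield str(value)

def aggregate_fields(index_fields, properties, index_key, source_aliases, targeted_fields):
    """Combine several properties into one index field.

    `properties` maps a property alias to an (editor_alias, raw_value) pair.
    """
    parts = []
    for alias in source_aliases:
        if alias not in properties:
            continue  # The property does not exist on this node.
        editor_alias, raw = properties[alias]
        if not raw:
            continue  # Nothing stored for this property.
        if editor_alias in JSON_EDITOR_ALIASES:
            parts.extend(json_values(json.loads(raw), targeted_fields))
        else:
            # Default case: index the raw value without special processing.
            parts.append(str(raw))
    index_fields[index_key] = " ".join(parts)

properties = {
    "pageTitle": ("Umbraco.Textbox", "About us"),
    "features": ("Umbraco.NestedContent", '[{"title": "Fast", "icon": "bolt"}]'),
}
index_fields = {}
aggregate_fields(index_fields, properties, "_content",
                 ["pageTitle", "features", "missing"], {"title"})
print(index_fields)
```

Writing everything into one key is the design choice the article describes: a search only has to query a single field instead of one per property.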

All we need to do now is hook into the Indexing event and call the AggregateFields method to populate our fields:

In this example, we're creating two new fields - _title and _content - and splitting up the properties we want to index amongst them.

Hook it up to the GatheringNodeData event, and we're in business. We're also only interested in the External index, not the Internal ones, so we filter out those indexes.
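As a rough illustration of that wiring, the sketch below models the event handler as a plain Python function that receives the index name and the node's field dictionary. It does not use the Examine API. The index name `ExternalIndexer`, the property aliases and the simple string join (with no JSON extraction) are assumptions made to keep the example short.

```python
TARGET_INDEX = "ExternalIndexer"  # assumed name of the external index

# Which source properties feed each combined field (hypothetical aliases).
FIELD_MAP = {
    "_title": ["pageTitle", "nodeName"],
    "_content": ["bodyText", "summary"],
}

def on_gathering_node_data(index_name, fields):
    """Add combined fields in place; ignore any index other than TARGET_INDEX."""
    if index_name != TARGET_INDEX:
        return
    for index_key, source_aliases in FIELD_MAP.items():
        # Skip properties that are missing or empty on this node.
        values = [fields[a] for a in source_aliases if fields.get(a)]
        fields[index_key] = " ".join(values)

external = {"nodeName": "Home", "pageTitle": "Welcome", "bodyText": "Hello world"}
internal = dict(external)

on_gathering_node_data("ExternalIndexer", external)
on_gathering_node_data("InternalIndexer", internal)

print(external["_title"])
print(external["_content"])
print("_title" in internal)
```

The same handler can be attached to every index, because the name check makes it a no-op for the ones you don't want to touch.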

TL;DR

By using a complex property's raw JSON value, we can target specific fields/keys/data within that value and aggregate them into a single index field to simplify search querying. We've covered Nested Content, Grid Editor, and InnerContent-derived data types, but the same technique can be used for other complex data types as well.
