Getting Started With XML Indexes

XML Indexes make a huge difference to the speed of XML queries, as Seth Delconte explains; and demonstrates by running queries against half a million XML employee records. The execution time of a query is reduced from two seconds to being too quick to measure, purely by creating the right type of secondary index for the query.

Introduction

Today, many organizations store and transmit data in an XML data format. Because of the ubiquity and readability of the XML format, it has become convenient and popular to both consume and provide data using XML. Because XML data is used so widely, it is increasingly important to be able to query XML values efficiently. XML indexes were designed to complement the XML data type and have been available since SQL Server 2005. When these indexes are used properly, they can dramatically reduce execution time in queries made against XML columns.

How Do XML Indexes Work?

When no indexes are present, an XML field must be ‘shredded’ when the query is executed. This means that the XML data is peeled apart from the XML tags, and organized into a relational format. An XML index does this work ahead of time, representing the XML data in an already-shredded version, thereby allowing easy filtering.

Some XML Index Ground Rules

XML indexes can be applied only to columns of the XML data type. In order to use XML indexes on a specific table, the table must have a clustered index on the primary key column (a primary key constraint includes a clustered index upon creation by default).

There are 2 types of XML indexes – primary and secondary.

A primary XML index essentially contains one row of information for each node in the XML column. This information is made up of XML node values, types, and paths. A primary index is a ‘pre-shredded’ representation of the XML blob – an easy-to-filter copy of the XML data.

Secondary XML indexes are dependent on primary XML indexes – you cannot create any secondary indexes without first having created a primary index.

PATH secondary indexes often help to optimize queries that make use of a path. An exist() method in the query usually indicates that a PATH index may improve the query execution.

PROPERTY secondary indexes are designed to be helpful with queries where the primary key field is used, and where multiple values are retrieved from a single XML instance.

VALUE secondary indexes are good for queries that use wildcards in a path to find known values – if the full path is not known, but the value being filtered IS known, a VALUE index will probably optimize the query execution.

Using a Primary XML Index

We have an existing Employee table with 500000 employee records. A sampling of the table is shown below:

--get sample from employee data tableSELECT TOP 1 * FROM EmployeeGO

We will need to generate XML data from the Employee table for the purposes of testing the benefits of XML indexes. To do this, we’ll create a script that builds out an XML representation of the Employee data fields – but first, we’ll need to create a table to hold the XML data:

We’ve created the table with a primary key, which by default creates a clustered index on the primary key column – a necessity if we plan on creating XML indexes on this new table. We can now run a script that populates the new EMP_XML table. Our script will do this via the FOR XML clause, using the PATH mode to create the ‘Employee’ root node:

WHILE @counter <= 500000BEGIN SET @row = ( SELECT SSN AS '@ssn',DOB AS '@dob',FirstName AS 'FirstName',MiddleName AS 'MiddleName',LastName AS 'LastName',[State] AS 'Address/@state',ZipCode AS 'Address/@zip',Street AS 'Address/Street',City AS 'Address/City',ID AS 'Internal/@ID',Salary AS 'Internal/Salary',HireDate AS 'Internal/Hiredate',Department AS 'Internal/Department'FROM EmployeeWHERE ID = @counterFOR XML PATH ('Employee') )

INSERT EMP_XML VALUES(@counter,@row)SET @counter += 1END

Notice that we’ve designated some of the source table’s fields as nodes or nested nodes, and others as node attributes in the XML table. We now have a table containing all 500000 records of employee data, in XML format:

--get sampling of new xml recordsSELECT TOP 5 * FROM EMP_XMLGO

The first record’s EMP_DETAILS field, expanded:

We can see that all fields from the source table are now represented as XML nodes and attributes. We’ll use this table to test the performance benefits of the XML index.

Comparing Query Execution Times

Let’s now look at some basic filtering operations that we may want to perform on our new employee XML data.

Before running any queries, let’s turn on STATISTICS TIME to measure the query execution time:

--turn on statistics timeSET STATISTICS TIME ONGO

We’ll now use a simple example to demonstrate the benefits of using a primary XML index. To locate the street address for an employee having a Social Security Number of 574582264, we can use the following XQuery statement:

The elapsed time, under SQL Server Execution Timesin the Messages tab, represents the actual query execution time in milliseconds. We can see that the query time was 2463 ms. Now let’s add a primary XML index to the table:

We’ve created a primary index on the EMP_DETAILS column of the table – the only XML field. Creating the primary XML index on a table with this many records may take a few minutes. XML data is being collected and stored during the primary XML index creation, including node values, node types, and the paths from each node to the root.

Now that a primary XML index is on the EMP_DETAILS field, we’ll run the same query again:

We see that the elapsed time and CPU time were both reduced by about 60% as a result of creating the primary XML index! We can tell for sure that the XML index was used by examining the graphical execution plan. To see this in SSMS, go to the Query menu before running the query and then select Include Actual Execution Plan. After running the query, look at the results in the Execution Plan tab:

A node that is displayed for a PrimaryXML index seek or scan in the graphical execution plan indicates that the XML index was indeed used for the query.

Disabling XML Indexes

Let’s assume we have received a request to find all address information for employees who:

Live in Michigan

Have a first name of ‘Desiree’

Before we create a query for this request, let’s disable the primary XML index, so we can compare the before and after time statistics. By disabling an index, we can retain the index definition whilst clearing out the stored data about the table. A disabled index is not used by the query execution plan. The current status of XML indexes can be checked by using:

This query gives us the results we are looking for; every element and attribute value under the Address node, and it also includes FirstName and LastName values for clarity. The FLWOR expression iterated the Address node and returned the State, Zip, Street, and City values. After running this query a few times, the elapsed execution time consistently shows a time of around 9539ms. Now that we’ve tested this query without using any XML indexes, let’s re-enable our primary XML index.

Rebuilding XML Indexes

An XML index, like any index, can be re-enabled by simply using the REBUILD option in an ALTER INDEX statement. To rebuild our primary XML index, we’ll run the following:

The primary XML index is now re-enabled. Be aware that rebuilding an index can have serious performance implications, limiting access to the table from other processes. Other types of indexes accept the REBUILD argument with the WITH (ONLINE = ON) hint to reduce this performance hit, but not XML indexes. If we were to attempt a REBUILD WITH (ONLINE = ON) on our primary XML index, we’d get the following error:

Our previous execution time was 9539 milliseconds. We’ve reduced the time by 90% by using just a primary XML index! Although this is a great optimization gain, let’s see if we can increase efficiency for this query even more, using a secondary XML index.

Using Secondary XML Indexes

PATH secondary indexes are designed to help improve queries that contain a good amount of path expressions. Queries that use the exist() method are usually good candidates for a PATH index. Our last query uses 2 instances of the exist() method in its WHERE clause, and also includes path expressions in the select list. The PATH index seems promising for our query.

PROPERTY secondary indexes are intended for queries where the primary key is a known value, and multiple return values are required from the same XML instance. We are returning multiple return values from the same XML instance. A PROPERTY index also seems like it would be helpful with our query.

VALUE secondary indexes are useful for searching for known element or attribute values, without necessarily knowing the element or attribute names, or the full path to the value. Queries that use wildcards for portions of a path would probably benefit from a VALUE index. In our query, we do know the paths to the elements and attributes we are filtering, as well as the values. A VALUE index could help, but probably not as much as a PATH index would.

Using PATH Secondary XML Indexes

The PATH secondary XML index is known to help with queries that have an exist() expression in the WHERE clause. Since our query uses 2 instances of the exist() method, let’s try aPATH secondary index, and see if it improves performance. We’ll create thePATH index as follows:

Note that we need to reference the primary XML index (IX_EMP_DETAILSx) in the index create statement with the USING clause, to specify that it will be a secondary XML index. A primary XML index is required in order to implement any secondary index. The ‘FOR PATH’ option is what designates the index as a PATH index. Now that we have a PATH secondary index, let’s re-run our query and view the execution time results:

We see another huge performance gain by using the PATH secondary XML index – another reduction in execution time by more than 60%.

Using PROPERTY Secondary XML Indexes

With our ‘addresses’ query, a PROPERTY secondary XML index could be slightly more advantageous than a PATH index, because in addition to using paths, our query also returns multiple values. Let’s experiment and see if adding a PROPERTY index will increase optimization.

After running the ‘addresses’ query again, this time with a PROPERTY index in place, we see yet another reduction in execution time – from 332 ms. to 190 ms. It seems pretty obvious that the PROPERTY index was the reason for the performance gain here, but how can we tell for sure? We now have 2 different secondary indexes on our table, PATH and PROPERTY, and their nodes both look similar in the graphical execution plan:

To find out, all we have to do is hover over the SecondaryXML node and look in the ‘Object’ section of the popup – OR right-click the node, select Properties, and examine the ‘Object’ section of the pane:

We can clearly see the name of the index (IX_EMP_DETAILS_propertyx) that was used in the query. The PROPERTY index was utilized for our query. This is a good example of how experimenting with secondary XML indexes can result in highly optimized queries.

Using VALUE Secondary XML Indexes

We have received another request: write a query that returns all details for all employees living in zip code 98679. Let’s imagine that we aren’t familiar with the XML structure, but still need to query it without bothering to look up schema details. We are not sure of the node attribute name – is it ‘zip’, ‘zipcode’, or ‘zip code’? We do know that it is listed as a node attribute, not a node element. We’ll choose a query that reflects what we know about the XML structure:

We’ve used the ‘//’ path wildcard to indicate that we don’t know the full path to the attribute, and the ‘*’ attribute wildcard to allow for the value to be found in any element attribute. The ‘@’ shows that we know that the zip code value is an element attribute value.

The results are processed in 1809 ms. If we look at the graphical execution plan, we see that the PATH secondary index was used almost exclusively:

Since our ‘zipcode’ query uses an incomplete path AND a wildcard, it’s a good candidate for a VALUE secondary index. Let’s create a VALUE index, and test the execution time from a second query run:

The cumulative index space used for the table was increased by more than 2GB after adding the primary XML index and the 3 secondary indexes. That is a substantial disk space cost overhead for the XML indexes , considering that the data size of the 500000-row table is only about 300MB.

Conclusion

We’ve looked at some of the reasons that using XML indexes can increase a query’s performance, and have gone over some of the basic rules for using them.

We’ve delved into some specific test cases in order to prove the performance benefits of XML indexes on the XML data type. We found by comparing time statistics that using a primary XML index can reduce query execution times dramatically. We looked at the different types of secondary XML indexes, and saw that by implementing the correct secondary index, a query’s performance can be optimized even further.

We also reviewed some disadvantages to using XML indexes, such as increased table modification times due to index maintenance, and increased disk space consumption. However, when XML indexes are used correctly to support specific query demands, any increased disk and index maintenance costs may be seen as a small price for such large query performance gains.

Tried this around two years ago. I found that the primary index does not make any difference when the XML data size is important , in my case, 100kb.Also the index affects the insert performance as i insert an average of 500k XML per day.

This seem fine for low data volumes and small xml size.

Subject:

Xml indeces slowed down my query!

Posted by:

George (not signed in)

Posted on:

Monday, October 1, 2012 at 8:56 AM

Message:

When I applied all 4 indeces to the query below, it took almost 10 minutes for the query to complete. Then by trial an error I removed all indeces and the query is taking less that 1 minute.

George, try a primary XML index only, without any secondary indexes. Make sure to run the query at least twice to allow the query optimizer to build the best execution plan - sometimes the very first execution IS slower... I also remember reading that there was a bug that caused primary XML indexes to be slower... I am not sure which version of SQL Server that was on, so make sure you have the latest SQL Server updates from MS.

As a side bar, though, if anyone tried to store such XML in one of my databases except for storing it in a temporary staging table to convert it to a "normal" table, we'd have to have a real serious meeting about it out in the wood shed. ;-)

I have a situation where i have a 110MB file that needs to be dumped into a single row on an XML column type. Then I want to query this large XML test using XPath in SQL queries. Is there any efficient way to perform this query or if any indexes can be created on XML data contained in this XML column.