Handling Arrays as Inputs

Executive Summary: Learn how to use a T-SQL split function to handle arrays as inputs for Microsoft SQL Server stored procedures.

Although SQL Server stored procedures don’t support using arrays as inputs, DBAs often need to run procedures that can accept an unknown number of input parameters. For example, suppose you need to develop a stored procedure in the Northwind database that accepts a list of order IDs as input and returns all orders from the Orders table whose keys appear in the input list. A common solution is to use dynamic SQL; in “Handling Arrays” (InstantDoc ID 100453), I explain why this solution isn’t ideal.

Another option is to create a table function that accepts a string with a separated list of elements as input (@arr), and possibly a second argument with the separator character (@sep), and returns a table result with a row for each element. You can then use such a split function in your stored procedure and join its output table with the Orders table to identify the qualifying orders.

In this article I discuss how to use a T-SQL implementation of the split function solution. Next month I’ll discuss a Common Language Runtime (CLR) implementation of the split function approach.

Getting Started

To use the T-SQL split function approach, first run the code in Listing 1 to create and populate an auxiliary table of numbers. Next, run the code in Listing 2 to create the T-SQL implementation of the fn_split function. The split logic implemented by the function’s query can be divided into three parts: generating copies, extracting elements, and calculating position.
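A minimal sketch of such an auxiliary table of numbers, in the spirit of Listing 1 (the Nums table and n column names match the references later in this article, but the population technique and row count are assumptions):

CREATE TABLE dbo.Nums (n INT NOT NULL PRIMARY KEY);

-- Seed with 1, then repeatedly double the row count
-- (ends with 1,048,576 rows; the target size is an assumption)
INSERT INTO dbo.Nums (n) VALUES (1);

DECLARE @rc INT;
SET @rc = 1;
WHILE @rc < 1000000
BEGIN
  INSERT INTO dbo.Nums (n)
    SELECT n + @rc FROM dbo.Nums;
  SET @rc = @rc * 2;
END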

Generating copies. The first part of the query produces a copy of the @arr value for each element in the array. This action requires the Nums table that the script in Listing 1 created; as you can see, the query in Listing 2 is against the Nums table. The predicate in the WHERE clause is in charge of producing a match for each element in the array. In simple terms, a match is found for each number n from the Nums table that represents the character position of @sep (the list’s separator) in @arr. For x number of elements, there are x - 1 separators. In order to obtain x matches for x elements, you must add a separator in front of the first element. That is, you need to look for @sep in @sep + @arr instead of in @arr. In addition, because the Nums table can have many more numbers than the number of characters in @arr, you filter only the numbers that are smaller than or equal to the length of @arr plus one for the separator you added in front of the first element. Here’s the complete predicate in the query’s WHERE clause:

WHERE n <= LEN(@arr) + 1 AND SUBSTRING(@sep + @arr, n, 1) = @sep

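Run in isolation against hypothetical sample values, this first part of the query produces the matches, as the following standalone sketch shows:

DECLARE @arr NVARCHAR(4000), @sep NCHAR(1);
SET @arr = N'10248,10250,10249'; -- hypothetical sample array
SET @sep = N',';

SELECT n
FROM dbo.Nums
WHERE n <= LEN(@arr) + 1
  AND SUBSTRING(@sep + @arr, n, 1) = @sep;
-- returns n = 1, 7, 13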
This filter gives you a match for each element in the array; within that match, n represents the position of @sep in @sep + @arr. Interestingly, with respect to @arr, n simply represents the position of the character where the element starts. Next, you need to extract a different element from each copy of @arr.

Extracting elements. The second part of the script in Listing 2 extracts elements from each copy of @arr. The column n in the Nums table represents the character position where the element of interest starts. Thus, a call to the SUBSTRING function extracts the element:

SUBSTRING(@arr, n, CHARINDEX(@sep, @arr + @sep, n) - n)

The SUBSTRING function operates on @arr as the input string and extracts a substring starting at character position n. The length to extract is the position of the next occurrence of @sep in @arr (calculated by the CHARINDEX function) minus n. Two things to note about the use of CHARINDEX to calculate the position of the next occurrence of @sep in @arr are that (1) in order to ensure that the separator is also found after the last element, the function looks for @sep in @arr + @sep rather than just in @arr and (2) the last argument to the function tells the function where to start looking for @sep. In my example, you would tell CHARINDEX to look for the next occurrence of @sep starting at the beginning of the current element, which is represented by n; hence n is provided as the last argument to the function.
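Continuing the same hypothetical example, adding the SUBSTRING call to the query extracts one element per qualifying n:

DECLARE @arr NVARCHAR(4000), @sep NCHAR(1);
SET @arr = N'10248,10250,10249'; -- hypothetical sample array
SET @sep = N',';

SELECT n,
  SUBSTRING(@arr, n, CHARINDEX(@sep, @arr + @sep, n) - n) AS element
FROM dbo.Nums
WHERE n <= LEN(@arr) + 1
  AND SUBSTRING(@sep + @arr, n, 1) = @sep;
-- n = 1 yields 10248, n = 7 yields 10250, n = 13 yields 10249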

Calculating position. The final portion of the script in Listing 2 calculates the position of the element within the array. In SQL Server 2000 the solution is quite tricky. The position of the element starting at character n can be calculated by counting the number of occurrences of @sep prior to the nth character (in the leftmost n - 1 characters) and adding 1. To count the occurrences of a character c in a string s, use the REPLACE function to substitute an empty string for all occurrences of c in s (call the result s2), then subtract the length of s2 from the length of s; the difference is the number of occurrences of c. The length of the string of interest here is by definition n - 1, so the complete expression that calculates the number of occurrences of @sep in the leftmost n - 1 characters of @arr is:

(n - 1) - LEN(REPLACE(LEFT(@arr, n - 1), @sep, N''))
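For example, with the hypothetical array N'10248,10250,10249' and n = 13 (the start of the third element), LEFT(@arr, 12) is N'10248,10250,'; removing the two separators leaves 10 characters, so the expression yields 12 - 10 = 2.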

By adding 1 to the expression above, you get the position of the element within the array. Note that if you’re working with SQL Server 2005 or later, you can calculate the position of the element with a simple expression based on the ROW_NUMBER function:

ROW_NUMBER() OVER(ORDER BY n)

Also, remember that as of SQL Server 2005 you can work with VARCHAR(MAX) or NVARCHAR(MAX) as the input parameters to the function and stored procedure, and thus support very long input strings (up to 2GB).
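Assembled from the three parts, here's a sketch of what the fn_split function in Listing 2 might look like; the pos and element column names and the parameter types are assumptions based on the text (on SQL Server 2005 and later, you could swap the position expression for ROW_NUMBER() OVER(ORDER BY n)):

CREATE FUNCTION dbo.fn_split
  (@arr AS NVARCHAR(4000), @sep AS NCHAR(1))
RETURNS TABLE
AS
RETURN
  SELECT
    -- calculating position (SQL Server 2000-compatible expression)
    (n - 1) - LEN(REPLACE(LEFT(@arr, n - 1), @sep, N'')) + 1 AS pos,
    -- extracting elements
    SUBSTRING(@arr, n, CHARINDEX(@sep, @arr + @sep, n) - n) AS element
  FROM dbo.Nums
  -- generating copies: one qualifying row per element in the array
  WHERE n <= LEN(@arr) + 1
    AND SUBSTRING(@sep + @arr, n, 1) = @sep;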

Now that you understand the logic behind this solution, try running the following code to test the function:

SELECT * FROM dbo.fn_split(N'10248,10250,10249', N',');

You’ll get the output in Table 1, which includes a row for each element and indicates its position in the array.
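Based on that input, the output should look something like the following (the column names follow the sketch above):

pos  element
---  -------
1    10248
2    10250
3    10249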

Run the code in Listing 3 to create the usp_getorders procedure, which is based on a call to the fn_split function and a join between the function’s result and the Orders table, returning only the qualifying orders. Note the use of a new GUID embedded in the query’s text; you’ll use it shortly to locate the plan in cache and inspect the plan caching and reuse behavior of this solution.
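A sketch of what Listing 3 might look like follows; the @orderids parameter name, the CAST, and the exact query shape are assumptions, and the GUID comment is there only so the plan can be located in cache:

CREATE PROC dbo.usp_getorders
  @orderids AS NVARCHAR(4000)
AS
SELECT O.*
FROM dbo.Orders AS O
  JOIN dbo.fn_split(@orderids, N',') AS S
    ON O.OrderID = CAST(S.element AS INT);
-- 89CE9E3A-943A-42F2-AEB6-0C924CB18387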

Next, run code along the following lines (the input values here are illustrative) to test the stored procedure with two different inputs:

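-- illustrative inputs; the article's original values may differ
EXEC dbo.usp_getorders @orderids = N'10248,10250,10249';
EXEC dbo.usp_getorders @orderids = N'10260,10270,10280';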
Because the query strings for both invocations are the same, the plan produced for the first invocation of the query can be reused for subsequent invocations. You can confirm this by inspecting the plans associated with the query in cache:

SELECT objtype, usecounts, sql
FROM sys.syscacheobjects -- in SQL Server 2000, query master.dbo.syscacheobjects
WHERE sql LIKE '%89CE9E3A-943A-42F2-AEB6-0C924CB18387%'
  AND sql NOT LIKE '%sys%';

Note that there’s only one plan in cache, and the usecounts attribute indicates that it was used twice. This implementation has great performance benefits over the dynamic SQL approach that I discussed in “Handling Arrays” (InstantDoc ID 100453): the query is optimized only once, and the cached plan is then reused. Also, because there’s only one plan in cache, the procedure doesn’t bloat the plan cache with a separate plan for each distinct input. In addition, because this implementation of the procedure uses a static query, there’s no exposure to SQL injection, and the solution is therefore secure. Finally, because the fn_split function returns the position of each element in the array (the pos attribute), you can sort the rows in the output by that position by adding ORDER BY pos to the query.

Choices, Choices

Using a T-SQL split function to handle arrays as inputs is a good solution: the approach is secure and doesn’t negatively affect performance. However, the implementation is unnecessarily complex. Because you should try to develop solutions that are as simple and straightforward as possible, you should consider additional alternatives. Next month I’ll present an approach that uses a CLR version of the split function, as well as an implementation based on SQL Server 2008’s table-valued parameters.

Discuss this Article

Thank you for this article - it is very useful for something I am working on right now.
I don't like having dbo.nums as a separate table. It seems that I would also need separate tables for strings and dates. Furthermore, I think I would need to make sure that the table is empty before processing.
I think I would like to create a temporary table or a table variable within the stored procedure that is calling the split function. It seems that the functionality of dbo.nums is temporary and ephemeral, exactly the sort of thing that temporary tables and table variables were designed for. However, I don't understand the performance considerations for such an implementation. I would think that if I put it in a stored procedure, then the query plan would be cached, and the performance wouldn't be too bad.
[Reply: There are many practical uses for a numbers table, including generating a sequence of date and time values. I would strongly recommend having a permanent one in the database.
Cheers,
Itzik]
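As an illustration of that point, a numbers table makes generating a date sequence a simple one-pass query; a minimal sketch (the start date and range are arbitrary sample values):

DECLARE @start DATETIME;
SET @start = '20090101'; -- arbitrary sample start date

SELECT DATEADD(day, n - 1, @start) AS dt
FROM dbo.Nums
WHERE n <= 31; -- one row per day in the range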

Itzik,
The article is great but I have one question.
I get a wrong estimation of rows when I use the fn_split function, and because of that I have a suboptimal query plan. The fn_split function in my case returns only two rows, but the estimated number of rows is 1,100,000, and because of that I get a hash join instead of a nested loop with two other tables. Do you have a tip on how to get a good estimation of rows?
Thanks,
Ognjen
[Reply: An inaccurate estimate could be a result of reuse of a previously cached plan; the estimate is based on compile time. It could also be a matter of referring to a variable whose value cannot be sniffed. In both cases, adding OPTION(RECOMPILE) to the query could help. If that's not the case and it's just a matter of inaccurate estimates, there's the option to first insert the result of the split into a temp table and then join the temp table with the other tables. This will add some cost to the process due to the use of the temp table, but the estimates should be better, so overall the process could improve.
Cheers,
Itzik]
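For reference, a minimal sketch of the two workarounds mentioned in the reply, reusing the names from this article (the input value is illustrative):

DECLARE @orderids NVARCHAR(4000);
SET @orderids = N'10248,10250'; -- illustrative input

-- Workaround 1: recompile so estimates reflect the actual input
SELECT O.*
FROM dbo.Orders AS O
  JOIN dbo.fn_split(@orderids, N',') AS S
    ON O.OrderID = CAST(S.element AS INT)
OPTION (RECOMPILE);

-- Workaround 2: materialize the split result first; the temp table
-- gives the optimizer better cardinality information
SELECT pos, element
INTO #elements
FROM dbo.fn_split(@orderids, N',');

SELECT O.*
FROM dbo.Orders AS O
  JOIN #elements AS S
    ON O.OrderID = CAST(S.element AS INT);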

Thank you for the article, enjoyable as always.
This is a road I, like so many others, have been down many times in the past. Using the Nums table is a clever technique for achieving a set-based solution that is compatible with SQL8.
I can understand why reader danoland may be hesitant to create a permanent Nums table, but I also believe that he doesn’t fully comprehend how it is used. The Nums table is used as something of an anchor, something to select against when discovering the starting positions of the array elements; it’s the key to the set-based solution. So, it doesn’t matter what the data type of the array is, as long as the Nums table is sequential and the string length of the ‘array’ passed is not greater than MAX(Nums.n).
I’ve found that if backwards compatibility is not an issue, a recursive CTE solution runs significantly faster. One notable caveat is that due to MAXRECURSION the array length is limited to 32,767 items.
-----------------------------------------------
-- sample values added so the snippet runs standalone
DECLARE @arr NVARCHAR(4000), @sep NCHAR(1);
SET @arr = N'10248,10250,10249';
SET @sep = N',';

WITH splitterCte (pos, idx, element) AS (
  SELECT CAST(1 AS INT) AS pos,
         CHARINDEX(@sep, @arr + @sep, 1) AS idx,
         SUBSTRING(@arr, 0, CHARINDEX(@sep, @arr + @sep, 1)) AS element
  UNION ALL
  SELECT pos + 1,
         CHARINDEX(@sep, @arr + @sep, idx + 1),
         SUBSTRING(@arr, idx + 1, CHARINDEX(@sep, @arr + @sep, idx + 1) - (idx + 1))
  FROM splitterCte
  WHERE CHARINDEX(@sep, @arr + @sep, idx + 1) > 0
)
SELECT pos, element
FROM splitterCte
OPTION (MAXRECURSION 32767);
-----------------------------------------------
--Steve--
[Reply: Thanks for the input, Steve. BTW, my friend and fellow MVP Erland Sommarskog did very thorough benchmarks of the various methods and you can find his results here: http://www.sommarskog.se/arrays-in-sql.html. You can see the numbers for the CTE option vs. the numbers table among others.
Cheers,
Itzik]

When I insert the result of the fn_split function into a table variable and then join that table variable with my other two tables, I get a nested loop join, which is optimal for my query.
Ognjen
[Reply: Makes sense. :)
Cheers,
Itzik]
