In the public forums for SQL Server, you often see people asking "How do I
use arrays in SQL Server?" or "Why does SELECT * FROM tbl WHERE
col IN (@list) not work?" This text describes several methods to handle
this, both good ones and bad ones. I also present data from performance tests
of the various methods.

This is quite a long article, and if all you want to know is how to handle
a comma-separated list, then just click one of these three quick links depending
on which version of SQL Server you are using: SQL2000,
SQL7,
SQL6.5. If that answer does not meet your needs, come back
and read from the beginning. :-)

If you feel deterred by the sheer length of this article, you should be
relieved to know that this is the kind of article where you may come and go
as you please. If you are a plain SQL programmer who wants to know "how do
I?", you may find the section on performance data too much nitty-gritty
detail. On the other hand, true SQL buffs who are curious about the
performance numbers might find the explanations of the methods a little
tedious and may prefer to skim those parts.

You have a number of key values, identifying a couple of rows in a table, and
you want to retrieve these rows. If you are composing the SELECT statement in
the client code, you might have some code that looks like this:
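
A sketch of the situation (the exact queries are an assumption; Northwind
serves only as an example database). First the statement as a client might
compose and send it, then the attempt to parameterize it in a stored
procedure:

-- Composed by string concatenation in the client and sent as a query batch:
SELECT ProductName FROM Northwind..Products WHERE ProductID IN (9, 12, 27, 37)

-- The attempt to turn this into a stored procedure:
CREATE PROCEDURE get_product_names @ids varchar(50) AS
   SELECT ProductName FROM Northwind..Products WHERE ProductID IN (@ids)
go
EXEC get_product_names '9, 12, 27, 37'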

This fails, because we are no longer composing an SQL statement dynamically,
and @ids is just one value in the IN clause. An IN clause could also read:

... WHERE col IN (@a, @b, @c)

Others that run into the problem say to themselves: if T-SQL had arrays
like any other normal language, then this would not be a problem. Well, T-SQL
does have arrays. They are called tables, and in most regards they are much
more general in nature than arrays. If all you want is a plain integer-indexed
array with a single value in each cell, this is easily emulated in a table.
However, you cannot pass a table as a parameter to a stored procedure. When
one stored procedure calls another stored procedure this is not much of a
problem, because they can agree on a common table to use for their
interchange. This is not possible when you call a stored procedure from a
client. But: a client can pass a string, and in SQL you can unpack that
string into a table, and most of the methods I describe in this article use
some technique to do this.

There are several possible solutions to this problem that can be implemented
more or less generically. I present the methods I know of. Here is a summary
of the methods that we shall look at, divided into two groups:

The good ones:

The iterative method. Looping through a
comma-separated list, and returning the elements in a table, either with a
user-defined function (UDF) or a stored procedure. Decent performance, and
easy to understand and easily extensible.

Using a table of numbers to unpack a comma-separated
list, either in a UDF or an SP. The fastest of all methods for a list with
delimiters.

Fixed-length array. Rather than using a
comma-separated list, use a string where all elements have the same length.
You unpack the string with a table of numbers in a UDF or SP. This is the
fastest method I know of.

XML. Overkill for a comma-separated list, but
marvellous for an array of structured data. Only possible on SQL 2000.
Decent performance, but slower than the iterative method.

The ones to stay away from:

Dynamic SQL. For a list of numbers, it may
appear simpler than any other method, but there are permissions issues to
consider, and it is messier to use for a list of strings. Performance is
unacceptable for long input. It is probably the only viable method on SQL 6.5, though.

Making the List Into a SELECT. Transform a delimited
list into an INSERT statement. Or many INSERT statements. Cute, but cannot
handle input longer than ~3-5000 chars, and has no other particular
benefit.

Really slow methods. Methods that use
charindex, patindex or LIKE. These solutions are just
unbelievably slow even for short input.

If you want a shorter review of the various methods, SQL Server MVP Anith Sen
has a good summary on his web site at
http://www.projectdmx.com/tsql/sqlarrays.aspx. His assessment of the
methods does not always agree with mine, though.

Before I proceed to the various methods, I would like to cover two general
issues that are both related to performance.

Note: the
corresponding section of the article for SQL 2005 is longer and has more
information, most of which is applicable to SQL 2000 as well, so you may be
interested in reading that part, even if you are on SQL 2000.

Most of these methods can be implemented as table-valued functions or
stored procedures. This is very nice, because it permits you to pick a
solution and write one or two functions to put in your database, and then you
can use it all over the place. Here is an example of how you would use such a
function:
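
A sketch of what such a call could look like, assuming a charlist_to_table
function along the lines discussed below (the procedure and the Northwind
tables are only for illustration):

CREATE PROCEDURE get_company_names @custids nvarchar(2000) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   charlist_to_table(@custids) s ON C.CustomerID = s.nstr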

If you are on SQL7 you would have to use a stored procedure instead. Please see
the
SQL7 section for details on the technique.

Now, what should the interface of a function like charlist_to_table look
like? The input parameter should be a long text type, preferably text
or ntext, so that the function can handle input of any size. But some
methods use T-SQL functions that do not support text/ntext, in
which case a varchar(8000) or nvarchar(4000)
will have to do. You may also prefer to add a parameter that permits you to
specify a delimiter other than comma, or similar usability parameters.
You will occasionally see such parameters in the examples in this text.

What should the table returned from the function look like? Obviously, it
should include a column which holds the value of the list elements, and we
will return to this in a minute. Sometimes you may also want a column which
indicates the position in the list. Such a column is easily achieved with
some of the methods I present, but is very difficult to achieve with others.

OK, list-element value, but which data type? You could write a function for
every possible data type in SQL Server. In this document I only discuss lists
of integers and strings, because it is my assumption that this covers 99% of
the cases.

If you have a list of integers, you will have to convert from string to
integer somewhere, because the input is text. It may be a matter of taste
whether you want to write a function that returns a table of integers, so you
can join directly with the table-function column, or whether you prefer to
write:
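
A sketch, reusing the hypothetical charlist_to_table function from above and
converting its string column in the join:

SELECT P.ProductName, P.ProductID
FROM   Northwind..Products P
JOIN   charlist_to_table(@ids) s ON P.ProductID = convert(int, s.str)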

If you return a list of strings, you may think there is no conversion issue,
but watch out! SQL Server has two families of string types: The 8-bit
char/varchar/text data types and the 16-bit nchar/nvarchar/ntext
data types for Unicode. Which of these should your function return? You may
think that returning an nvarchar(4000) column is the best, because
then you will get the right result with both 8-bit and 16-bit data, but not
so fast! The answer is that if you are to join with a char/varchar
column, your function must return varchar and if you join with an
nchar/nvarchar
column you must return nvarchar. The reason for this is that if you
say:
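
A sketch of such a join; the names tbl and varcharcol echo the next sentence,
and the nvarchar column of the function is an assumption:

SELECT ...
FROM   tbl
JOIN   charlist_to_table(@list) s ON tbl.varcharcol = s.nvarcharcol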

tbl.varcharcol will be implicitly converted to nvarchar, and this prevents SQL Server from using any index on tbl.varcharcol.
This can lead to a significant degradation in performance.

There are two ways to
address this. One is to write two functions: one that returns varchar(8000)
and one that returns nvarchar(4000). The alternative is to write one
function that accepts ntext as input and returns two columns, one
varchar and one nvarchar
column. For some methods, there is some minor performance cost (5-10%) for
the latter arrangement, whereas other methods do not seem to take a toll at
all. You will see examples of both strategies in this text.

Even if you have the data types correct, there is another potential problem
with joining directly to a table-valued function (or, in the case of
XML, to OPENXML). The optimizer has no information about
what is in that table beforehand, and will have to build the query plan from
general assumptions. In many situations this will work all right, but sometimes
this will work better:
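
A sketch, again with hypothetical names, where the list is first unpacked into
a temp table:

CREATE TABLE #list (id int NOT NULL PRIMARY KEY)

INSERT #list (id)
   SELECT convert(int, str) FROM charlist_to_table(@ids)

SELECT P.ProductName, P.ProductID
FROM   Northwind..Products P
JOIN   #list l ON P.ProductID = l.id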

The reason for this is that a temp table has statistics, so SQL Server may
recompile the procedure prior to executing the SELECT statement and use the
statistics about the temp table to build a better query plan for the SELECT
statement. On the other hand, there is a cost for the recompilation, so you
may as well lose performance by using a temp table. As a general rule, only
consider a temp table if you find that joining directly to the function
gives you bad performance. (One reader suggested using a table variable
instead. Since a table variable does not have statistics, it should not make
much difference compared to joining directly to a table-valued function.)

If all you want is a solution so that you can continue with your coding, this
is my suggested pick. There are faster methods than this one, but the input
has to be very long before the performance of this method becomes
intolerable. Where the iterative method wins is simplicity. You just load the
functions and then you are on the air. The faster methods require a helper
table with correct data to work. While not a major issue, it's one more
component to keep track of. Also, the code for the iterative method is very
easy to understand, not least if you have a background in C or Visual
Basic. This is not unimportant if you have an input list with a
special format that you need to adapt the code for.
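
Here is a usage sketch for the function that is presented next,
iter_intlist_to_table (the procedure name and the Northwind tables are only
for illustration):

CREATE PROCEDURE get_product_names_iter @ids varchar(50) AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   iter_intlist_to_table(@ids) i ON P.ProductID = i.number
go
EXEC get_product_names_iter '9 12 27 37'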

Yeah, that's right, that's not a comma-separated list. I figured that for a
list of integers, the comma does not really have any function, so I opted to
use space as separator instead when I wrote this function. Here is the code
for iter_intlist_to_table:
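
What follows is a sketch that matches the description in the next few
paragraphs; the original function may differ in details:

CREATE FUNCTION iter_intlist_to_table (@list ntext)
       RETURNS @tbl TABLE (listpos int IDENTITY(1, 1) NOT NULL,
                           number  int NOT NULL) AS
BEGIN
   DECLARE @pos      int,            -- Position of the next delimiter.
           @textpos  int,            -- Where in @list the next chunk starts.
           @chunklen smallint,       -- How many characters to read this time.
           @str      nvarchar(4000),
           @tmpstr   nvarchar(4000),
           @leftover nvarchar(4000)

   SET @textpos = 1
   SET @leftover = ''

   -- Outer loop: read @list in chunks of up to 4000 characters.
   WHILE @textpos <= datalength(@list) / 2
   BEGIN
      SET @chunklen = 4000 - datalength(@leftover) / 2
      SET @tmpstr = ltrim(@leftover + substring(@list, @textpos, @chunklen))
      SET @textpos = @textpos + @chunklen

      -- Inner loop: extract one number at a time, space being the delimiter.
      SET @pos = charindex(N' ', @tmpstr)
      WHILE @pos > 0
      BEGIN
         SET @str = substring(@tmpstr, 1, @pos - 1)
         INSERT @tbl (number) VALUES (convert(int, @str))
         SET @tmpstr = ltrim(substring(@tmpstr, @pos + 1, len(@tmpstr)))
         SET @pos = charindex(N' ', @tmpstr)
      END

      -- What remains may be the start of a number that continues in the
      -- next chunk.
      SET @leftover = @tmpstr
   END

   -- Pick up the last element of the list.
   IF ltrim(rtrim(@leftover)) <> ''
      INSERT @tbl (number) VALUES (convert(int, @leftover))

   RETURN
END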

The table returned from this function has two columns: number is the list
element, and
listpos is the position of the number in the list.

Since this function
only deals with numbers, it may surprise you that I am using the Unicode
string types, ntext and
nvarchar. But I found that I got some 10% better performance with them
than with the 8-bit data types. Exactly why, I don't know, but my guess is that
charindex is internally implemented in Unicode, so the 8-bit data types
require conversion.

The iteration is a two-level loop. This is because the
input parameter is of the type
ntext to permit unlimited input. Not all string functions in
T-SQL accept text/ntext as parameters (and some that do, do not
operate beyond the limit of varchar/nvarchar). Therefore, I
chop up the input string in chunks of 4000 characters, before I move on to
the main loop where I use charindex to search for space which is my
delimiter. If you prefer comma, you can easily change that. When the inner
loop is completed, there may be some leftover piece that should be joined to
the start of the next chunk. And finally, when the last chunk is done, I pick
up the last element in the list.

You might notice that I use both
datalength and len. These are two functions that are deceptively
similar, but you need to apply them with care. Both return the length of a
string, but datalength includes trailing spaces whereas len
does not. More importantly, datalength is the only one of the two to support the
text/ntext data types. datalength counts bytes, and len
counts characters, which is why I divide the return value from
datalength by 2, but not the result from len.

The function is naïve enough to assume that @list only includes
numbers. If this is not the case, you will get a conversion error, and the
batch will be aborted. If you create the list from internal identifiers in
client code, you may feel confident that the list is valid, and this is not
much of an issue for you. But what if you actually want to apply
error-checking? For instance, what if you are a DBA, and you don't trust your
client programmers? You can add code to verify that @str is a legal
number, but if the test fails, you cannot use RAISERROR since you are in a
function. The best you can do is to skip the illegal list item, or insert
NULL (in which case you must change the return table to permit NULL). If you
would like to raise an error, then you need to implement the method in a stored
procedure instead. You find an example in the
SQL7 section.
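
The string counterpart is iter_charlist_to_table. A usage sketch, again with
Northwind only as an illustration:

CREATE PROCEDURE get_company_names_iter @customers nvarchar(2000) AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   iter_charlist_to_table(@customers, DEFAULT) s ON C.CustomerID = s.nstr
go
EXEC get_company_names_iter 'ALFKI, BONAP, CACTU, FRANK'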

This function is very similar to
iter_intlist_to_table. There is one difference in the interface: you can
specify which delimiter to use. There is a default value for the delimiter, but
even if you are satisfied with it, you must specify DEFAULT when you call the
function, because T-SQL does not permit you to leave a function argument out
completely. No matter which delimiter you use, leading and trailing blanks are
stripped from the resulting strings.

The return table has two columns beside listpos, namely str and
nstr, which both contain the list elements; one is varchar and the
other nvarchar, and you should use the column that matches the column
you are joining with. Since Northwind..Customers.CustomerID
is nchar(10), nstr is the choice in the example. Do not use
nstr when you join with a varchar column, because this can give very
bad performance! (I discuss this in the section
General Interface Considerations above.) You
may think that this extra column comes with a performance cost, but I was not
able to detect this when I ran my performance tests.

One thing which is nice with this solution is that it is fairly extensible. Say
that your input string looks something like this:

"Yes, I want to", "Charlie", "Give it to me"

That is, the list items are enclosed in quotes, and the list delimiter can
appear in the strings. The function above does not handle this format, but
writing a function that handles this sort of input is no big deal.

The fastest way to unpack a comma-separated list is to use a table of numbers.
A table of numbers is simply a table with one single integer column, with the
values 1, 2, 3, ... Here is an easy way to fill in such a table:
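
A sketch of how the table could be filled, here by cross-joining a reasonably
large table with itself (Northwind..Orders is just an example source; any
tables with enough rows will do):

SELECT TOP 8000 IDENTITY(int, 1, 1) AS Number
INTO   Numbers
FROM   Northwind..Orders a
CROSS  JOIN Northwind..Orders b

CREATE UNIQUE CLUSTERED INDEX number_ix ON Numbers (Number)

And here is a sketch of the inline function, inline_split_me, that the
following paragraphs dissect (the exact layout is an assumption):

CREATE FUNCTION inline_split_me (@param varchar(8000))
       RETURNS TABLE AS
RETURN (SELECT substring(',' + @param + ',', Number + 1,
               charindex(',', ',' + @param + ',', Number + 1) - Number - 1)
               AS Value
        FROM   Numbers
        WHERE  Number <= len(',' + @param + ',') - 1
          AND  substring(',' + @param + ',', Number, 1) = ',')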

While the iterative solution was somewhat long-winded but straightforward,
this approach is compact, but not all readers may grasp this SQL at first
glance. (I had a hard time myself.) The first thing to observe is the expression
',' + @param + ',' which recurs no fewer than four times.
By adding the delimiter at the beginning and at the end of the string, the first
and last items in the string appear in the same context as all other items.

Next, let's turn to the WHERE clause. The expression:

substring(',' + @param + ',', Number, 1) = ','

evaluates to TRUE for all positions in the string where the delimiter
appears. The expression:

Number <= len(',' + @param + ',') - 1

simply sets an upper limit of which numbers we are using.

Let's now look at the SELECTed column Value. We extract the strings
with substring, starting on the position after the delimiter. We find
the length of the substring by searching for the next delimiter with
charindex, making use of its third parameter which tells charindex
where to start searching. Once the next delimiter is found, we subtract the
position of the current delimiter and then one more, as the delimiter itself
should not be included in the extracted string, to get the length of the list
item.

Note here that, since this function does not strip leading and trailing
blanks, there must be no spaces in the list.

This function is an inline function, in contrast to the iterative
functions which are multi-step functions. An inline function is basically a
macro, so that when the optimizer builds the query plan, it inserts the text
of the function into the query, making further optimizations possible. I find
it difficult to believe, though, that for this particular function you
will get much benefit from the function being inline. The function is too
convoluted for that, and I have not been able to detect any considerable
advantage in my performance tests.

The function inline_split_me limits the input to 7998 characters (or
3998 for nvarchar). It is not possible to use text or ntext
for the input parameter, because you cannot use text/ntext parameters
in expressions such as ',' + @param + ','.
Even more importantly, charindex does not handle text/ntext
data beyond the limit of varchar/nvarchar.

However, this is no fatal obstacle. As with the
iterative functions we can break up the text in chunks. Here is a
function that does this:
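
Here is a sketch of such a function, following the description in the next
few paragraphs; column sizes and other details are assumptions:

CREATE FUNCTION duo_text_split_me (@list ntext)
       RETURNS @tbl TABLE (str  varchar(4000)  NOT NULL,
                           nstr nvarchar(2000) NOT NULL) AS
BEGIN
   DECLARE @slices TABLE (slice nvarchar(4000) NOT NULL)
   DECLARE @slice   nvarchar(4000),
           @textpos int,
           @maxlen  int,
           @stoppos int

   SELECT @textpos = 1, @maxlen = 4000 - 2

   -- Chop up the input in slices, making sure that no list element is split
   -- between two slices and that every slice starts and ends with a comma.
   WHILE datalength(@list) / 2 - (@textpos - 1) >= @maxlen
   BEGIN
      SELECT @slice = substring(@list, @textpos, @maxlen)
      -- Find the last delimiter in the slice by reversing the string.
      SELECT @stoppos = @maxlen - charindex(N',', reverse(@slice))
      INSERT @slices (slice) VALUES (N',' + left(@slice, @stoppos) + N',')
      SELECT @textpos = @textpos + @stoppos + 1    -- Skip past the delimiter.
   END

   -- The last slice. If the input fits in a regular nvarchar, this is the
   -- only INSERT that runs.
   INSERT @slices (slice)
      VALUES (N',' + substring(@list, @textpos, @maxlen) + N',')

   -- Apply the core expression to all slices at once.
   INSERT @tbl (str, nstr)
      SELECT str, str
      FROM   (SELECT str = ltrim(rtrim(substring(s.slice, N.Number + 1,
                     charindex(N',', s.slice, N.Number + 1) - N.Number - 1)))
              FROM   Numbers N
              JOIN   @slices s ON N.Number <= len(s.slice) - 1
                              AND substring(s.slice, N.Number, 1) = N',') AS x

   RETURN
END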

We first split up the text in slices that we put in the table variable
@slices. When we do this, we need to apply some care, so that the last
character in a slice is a delimiter. To find the last delimiter in a chunk
that we chopped off from the input parameter, we feed charindex the
result of the reverse function, a neat little trick. While we're at it,
we make sure that all rows in @slices start and end with a delimiter.
You may note that if the input text is within the limits of a regular
nvarchar, we will never enter the loop, but just insert the text directly
into the @slices table.

Once @slices is populated, we apply the core expression. Note here
that we do not need to iterate over @slices; we can join directly with
Numbers. We use a derived table to save us from repeating the complex
expression with substring and charindex, which we have now
augmented with trim functions to remove leading and trailing spaces. (A
derived table is a table constructed by a SELECT expression in the middle of
a query; this is a very useful tool when building complex queries. The
alias AS x may not seem to serve a purpose, but the SQL syntax
mandates that a derived table must have an alias.)

As with iter_charlist_to_table,
this function returns a table with both a varchar and an nvarchar
column. In this case, though, my tests indicate that there is a cost of 3-5%
over returning a table with only an nvarchar column. Since we are
talking execution times in milliseconds, this cost is not likely to have any
importance. The performance data that I present is from a function that only
returns an nvarchar column, though.

There is no listpos column here. This is somewhat more difficult to
achieve with this method. One way is to add a sliceno column to
@slices and an IDENTITY column to the result table, and then order by
sliceno and Number when selecting into the result table. However,
it is not fully clear whether you can rely on the IDENTITY values actually
being assigned according to the ORDER BY clause. A more reliable way is to
join
Numbers with itself, as shown by an example on Anith Sen's
web site.

You might now be waiting for a function that unpacks a list of numbers, but I
am not presenting such a creature. You could write a new function which uses
convert in the right places, but you can also do it like this:
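
For instance, a sketch of a procedure that uses duo_text_split_me for a list
of integers (again, Northwind is only for illustration):

CREATE PROCEDURE get_product_names_tblnum @ids ntext AS
   SELECT P.ProductName, P.ProductID
   FROM   Northwind..Products P
   JOIN   duo_text_split_me(@ids) s ON P.ProductID = convert(int, s.str)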

In this text I focus on the case where the list you want to process is a
single list that comes from a client. Occasionally you might stumble on a
table that contains one or more columns with comma-separated lists. For instance:
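
A sketch of such a table (the column sizes are arbitrary):

CREATE TABLE jackets
       (jacketid int          NOT NULL PRIMARY KEY,
        name     varchar(40)  NOT NULL,
        sizes    varchar(200) NOT NULL,    -- e.g. 'S,M,L,XL'
        colours  varchar(200) NOT NULL)    -- e.g. 'black,navy,olive'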

Here sizes and colours contain comma-separated lists of the
sizes and colours the jackets are available in. This database design violates
first normal form, and let me stress that, in the vast majority of cases, this
is a very poor design. In any case, say that you need to unpack these lists
into tables. You can of course use any of the functions that we have looked
at here, but you would need to iterate over the table, and unpack row by row,
because you cannot pass a table column as a parameter to a table-valued
function. As you may know, iterating over a table can be orders of magnitude
slower than applying a set-based operation on all rows in one statement, so
it may be a better idea to join directly with the Numbers
table. I'm not giving an example here, but you saw the way to do it in the
code for duo_text_split_me, when we joined with the
@slices table.

This is a method that was proposed by SQL Server MVP Steve Kass, inspired by
an idea in Ken Henderson's book The Guru's Guide to Transact-SQL.

The basic idea is that rather than defining the array as a delimited string,
the list elements have fixed length. An advantage from a usability point of
view is that you don't have to worry about a delimiter appearing in the data.
But the main reason for this approach is performance. This is the fastest of
all methods presented here (but see the caveats below). Here is a
demonstration of the technique, applied directly on a table:
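
A sketch of such a demonstration, using a hypothetical table #arrays where
the column elements holds a fixed-length array, here with four-character
elements:

CREATE TABLE #arrays (id int NOT NULL, elements varchar(8000) NOT NULL)
INSERT #arrays (id, elements) VALUES (1, '9   12  27  37')
INSERT #arrays (id, elements) VALUES (2, '123 456 7890')

DECLARE @itemlen tinyint
SELECT @itemlen = 4

SELECT a.id, listpos = N.Number,
       element = substring(a.elements, (N.Number - 1) * @itemlen + 1, @itemlen)
FROM   #arrays a
JOIN   Numbers N ON N.Number <= (datalength(a.elements) + @itemlen - 1) / @itemlen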

Each element in the "array" has the same length, as specified by the
parameter
@itemlen. We use the substring function to extract each individual
element. The table Numbers that appears here is the same table that
we created in the beginning of the section Using a table
of Numbers.

The purpose of the expression on the last line is to permit the last element
in the array to be shorter than the others, in case trailing blanks have been
stripped. Here is an example using fixstring_single with a list of
strings:
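
A sketch of how fixstring_single could look, together with a string-list
example; the procedure and the Northwind tables are only for illustration:

CREATE FUNCTION fixstring_single (@str text, @itemlen tinyint)
       RETURNS TABLE AS
RETURN (SELECT listpos = Number,
               str = substring(@str, (Number - 1) * @itemlen + 1, @itemlen)
        FROM   Numbers
        -- The last line permits the last element to be shorter than @itemlen.
        WHERE  Number <= (datalength(@str) + @itemlen - 1) / @itemlen)
go
CREATE PROCEDURE get_company_names_fix @custids text AS
   SELECT C.CustomerID, C.CompanyName
   FROM   Northwind..Customers C
   JOIN   fixstring_single(@custids, 10) f ON C.CustomerID = f.str
go
EXEC get_company_names_fix 'ALFKI     BONAP     CACTU     FRANK'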

This function uses text for the input parameter, and returns a
varchar column, so for Unicode data you would need a twin function that
accepts ntext for input and nvarchar for return data. For other
methods in this article, I've opted to return varchar and nvarchar
from the same function, but this has a bigger overhead for this method
(because it's so fast). Beware that you don't use a Unicode function to join
with a char/varchar
column; this can wreck performance completely, see
General Interface Considerations.

The result set also includes the listpos column, which gives the
position of the item in the array. Very conveniently, this is the same as the
number from the
Numbers table.

fixstring_single handles input with as many elements as there are
numbers in Numbers. The number I have used in this article, 8000, is
likely to be sufficient for most applications. If you need to support more
elements, you can just add more numbers to Numbers. If you really feel
that you want to support unlimited input, you could write a multi-step
function that chops up the input string in slices. Here's another twist that
Steve Kass came up with. This is a function that uses a self-join, so that
with 8000 rows in
Numbers, fixstring_multi handles up to 64 million
list items:

This is a more complex function than fixstring_single. I leave it as
an exercise to the reader to understand what's going on, and I only make a
note about the line with CROSS JOIN: this saves me from hard-coding the
number of rows in Numbers.

Since what is special with this method is the input format, not the algorithm
itself, you could also use this input format with the
iterative method. This might appeal to you, if the table of numbers makes
you nervous. (If the table does not have the numbers as advertised, methods
relying on it will yield incorrect results.) However, the performance benefit
of using the iterative method with fixed-length input over delimited input is
not extraordinary, and it is slower than using a table of numbers and
delimited input. I'm not including an example of an iterative function using
fixed input format, but in the full test data,
you can see how well it fares.

If you look in the performance section
below, you will see that this method outperforms everything else. But there are
a few caveats you should keep in mind, of which the last one is the most
important.

First caveat: The fixed format is more sensitive to disruptions. One
extra character somewhere, and the rest of the string is completely
wrecked. When you compose the string in an application, this is not so much of
an issue, once you have gotten it right. This leads, however, to the

Second caveat: Composing the fixed-length string may take more
programming effort in the client than composing a comma-separated list. I have
not considered this for my performance tests, as it would have made those
tests far more complex. However, consider these two points: 1) Client
languages are generally better in string handling than SQL. 2) Moving
execution from the server to the client means that the overall system scales
better. So even if there is a certain toll on the client side, it is not
likely to overshadow the gain on the server side.

Third caveat: The strings you pass from client to server when using
fixed length are longer, so there is more network overhead. This could be an
issue over slow links or on a crowded network. But it is not very likely that
this is something to worry about. However, the observation on size leads to
the

Fourth and most important caveat: It matters how you invoke a
procedure using this method. My test script for the performance tests uses
RPC (remote procedure call), and this is the recommended way to call a stored
procedure from a client program. (For instance, if you use
adCmdStoredProcedure for the CommandType on the Command object in ADO, you
use RPC.) The other way to call a stored procedure is to send a command-text
batch with an EXEC statement, which is what you do in Query Analyzer. I found
that for a certain length of the input string, this method is actually
somewhat slower than using a comma-separated list with the
table-of-numbers function
duo_text_split_me. In fact, I had an alternate version of
fixstring_multi
which was somewhat faster than the one above as long as I invoked it through
RPC. But when I invoked it in a command batch, the response time was several
seconds, far beyond what is acceptable. Why this is so, I don't know. But the
string size surely has something to do with it. I have not examined at which
size the effect sets in, but it might be around 8000, which is the limit for
a regular varchar.

XML has emerged as a standard for data interchange in recent years. Just like
HTML, XML is derived from the general SGML standard, so the visual appearance
of XML and HTML is similar. But there are important differences. Unlike HTML,
XML is case-sensitive and a single error in an XML document invalidates the
document. The purpose of XML is not presentation, but transport of structured
data. Microsoft added extensive support for XML in SQL Server 2000, and what
interests us here is the OPENXML function which extracts data from an XML
string and returns the data as a table.

This opens up the possibility to encode a list of values as an XML string,
and then extract the values with OPENXML in T-SQL. But to be blunt: this is
not the right way to do it. Not because of performance: while slower than any
other method we have looked at so far, performance is still perfectly
acceptable for most situations. No, what disqualifies XML is simply that it
is too complex for the task. Unless, of course, you already have an XML
document with the values you want to use. But to build an XML string only to
mimic an array is overkill in my opinion.

But: if you need to send SQL Server an array of structured data – which even
might be nested – to insert data into one or more tables, XML is truly a
great asset.

Let's nevertheless first look at the case of a simple list. This is what the
XML version of get_product_names looks like:
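
A sketch of what it could look like; the shape of the XML document (a Num
element with a num attribute for each value) is an assumption:

CREATE PROCEDURE get_product_names_xml @ids ntext AS
DECLARE @idoc int

-- Build an internal representation of the XML document.
EXEC sp_xml_preparedocument @idoc OUTPUT, @ids

SELECT P.ProductName, P.ProductID
FROM   Northwind..Products P
JOIN   OPENXML(@idoc, '/Root/Num', 1) WITH (num int) AS x ON P.ProductID = x.num

-- Release the memory held for the document.
EXEC sp_xml_removedocument @idoc
go
EXEC get_product_names_xml N'<Root><Num num="9"/><Num num="12"/><Num num="27"/></Root>'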

The important thing to notice here is that we first call
sp_xml_preparedocument, which causes SQL Server to build an internal
representation of the XML string. @idoc is a reference to that
representation, which we use when we invoke OPENXML, which is kind of a table
function, although the exact term used by Books Online is rowset provider.
Before we exit the routine, we call sp_xml_removedocument to free up
the memory allocated for the XML document. Failing to do this can lead to a
memory leak which eventually can prevent sp_xml_preparedocument from
completing successfully. (By the way, while it is a good habit to check return
values from stored procedures, don't do this for sp_xml_removedocument.
It always returns 1, which is a known bug. Check @@error instead.)

When you
compose an XML string, you need to be careful, because several
characters such as <, & and " serve as meta characters in XML, so you need
to encode them in the same manner as in HTML. The same applies to non-ASCII
data if you are not using Unicode. Also, you need to keep in mind that XML is
very strict about all tags being closed. Thus, you are probably better
off using some library routine to compose the XML string. This is not really
my area, but MS XML seems to offer some methods for this, as does the
XmlTextWriter class in the .Net Framework.

Where the power of XML comes into full play is when you have a bigger set of
data that you want to insert into SQL Server, for instance orders and their
order details. Before the advent of XML in SQL Server, you did not have a much
better choice than to call a stored procedure for each row to insert. (There
has always been bulk-copy, but this is far from always a convenient solution.)
With XML, you can compose an XML document of all that data, and you have a
stored procedure which calls sp_xml_preparedocument once, and then you
invoke OPENXML twice: first for the orders, and next for the order details.
Replacing thousands of calls to stored procedures over the network with a
single one can give a huge performance benefit.

I'm not including an
example of this, as it would take up some space, and this article is already
far too long. Rather, I leave this as a teaser. If you are completely
unacquainted with XML and with using XML with SQL Server, you might consider
picking up a book or two on the subject. You might also find some useful
information on SQL Server MVP Bryant Likes's site
http://www.sqlxml.org.

Note: those who know XML know that the above is attribute-centred
XML. An alternative way to compose the XML string is to use element-centred
XML (or a mix of the two). I'm not including any example of element-centred
XML, as there is little practical difference for unpacking a list, nor is
there any in terms of performance.

Occasionally I have seen the suggestion that you should have a stored
procedure that accepts a delimited string as parameter, and which uses the
replace function in T-SQL to convert this string to XML before passing it
to OPENXML. In my opinion, this is a bad idea for several reasons:

Since the output of replace must be varchar, the resulting
XML string cannot be longer than 8000 characters and thus the input string
is even more restricted in length.

You need to run several passes with replace to handle all
characters that must be escaped (see above). The example I saw completely
neglected to do this.

And since XML is slower than other methods that are easier to use, it is
completely pointless anyway.
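
Turning to dynamic SQL, here is a sketch of the typical procedure for a list
of integers (the procedure name and the Northwind tables are only for
illustration):

CREATE PROCEDURE get_product_names_exec @ids varchar(8000) AS
   EXEC ('SELECT ProductName, ProductID
          FROM   Northwind..Products
          WHERE  ProductID IN (' + @ids + ')')
go
EXEC get_product_names_exec '9, 12, 27, 37'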

This example looks very similar to the client-code example in the
beginning of this article. And in fact, this method is
just a variant of sending down SQL statements from the client and it has the
same issues, which we shall look at in a moment. First, though, let's look at
using this method with a list of strings, to see that in this case the method
is not equally attractive:
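
A sketch, again with Northwind for illustration:

CREATE PROCEDURE get_company_names_exec @customers nvarchar(2000) AS
   EXEC ('SELECT CustomerID, CompanyName
          FROM   Northwind..Customers
          WHERE  CustomerID IN (' + @customers + ')')
go
EXEC get_company_names_exec '''ALFKI'', ''BONAP'', ''CACTU'', ''FRANK'''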

The nested quotes make procedures using this method difficult to call.

So what are the issues with this method?

Performance. This method is slower than any of the other methods we have
looked at so far, and execution times grow very fast when the input
string grows in size. With a string of 1000 items, the other methods still
operate in sub-second time, but dynamic SQL may take over ten seconds. It is
not the execution itself that takes up all that time; it's building the
query plan from that long IN expression that occupies SQL Server. The plan
is saved in cache, so if you use exactly the same list a second time,
execution time can be shorter than for most other
methods. But it has to be exactly the same list: one single character
that is different and SQL Server will spend another ten seconds to build a
plan.

Permissions. This method requires that the user has
direct SELECT permission on the tables involved. With a regular stored
procedure, the user only needs EXEC
permission on the procedure, and the
procedure owner's rights on the tables apply when the procedure accesses them.

SQL Injection. This is a method by which an intruder can enter data
into an input field, and if that input field is passed without further
control to build an SQL statement dynamically, the intruder can cause the
SQL batch to include something else than intended.

Often when people ask how to handle comma-separated lists in the SQL
newsgroups, dynamic SQL is presented as a solution. Therefore, I would like to
stress that this is a poor solution, not least when there are nice and
simple generic methods for converting a list to a table. So if you see this
advice on the newsgroups, ignore it. And if you have the habit of giving it,
please stop doing it.

There is one exception: on SQL Server 6.5 the other
methods can only support lists that fit into varchar(255), so dynamic
SQL is probably the only viable solution there.

This method builds on an idea suggested by Jim Ebbers. Since his code is
somewhat obscure with its double use of dynamic SQL, before I explain it, I
first present a method suggested by SQL Server MVP Steve Kass, which is
easier to understand, albeit not as good:
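
A sketch of how unpack_with_union could look; the details, such as the column
name in the target table, are assumptions:

CREATE PROCEDURE unpack_with_union @list varchar(7000),
                                   @tbl  sysname AS
DECLARE @sql varchar(8000)
-- Double up any single quotes in the input, then turn every comma into
-- ' UNION ALL SELECT ', so that the whole list becomes one SELECT statement
-- whose result is inserted into the caller's temp table.
SELECT @sql = 'INSERT INTO ' + @tbl + ' (str) SELECT ''' +
              replace(replace(@list, '''', ''''''), ',',
                      ''' UNION ALL SELECT ''') + ''''
EXEC (@sql)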

The idea is that you transform the list to a SELECT statement with help of
the UNION ALL operator. (Only UNION removes duplicates, UNION ALL includes
them.) Then you use dynamic SQL to execute that statement and insert the data
into a temp table, of which the name is passed to the procedure. Since the
dynamic SQL only refers to a temp table, there is no permissions issue
here.

In the first method, I have made use of what Jim made me aware of: if you say
INSERT tbl EXEC('SELECT 1 SELECT 2 SELECT 3'), INSERT will
handle all those result sets from EXEC() as if they were a single one. Notice
that I also cater for the fact that there may be single quotes in the input.
This is in fact necessary with unpack_with_union as well, but if you
do it once in unpack_with_union, you should double that up in
unpack_without_union. Why, I leave as an exercise for the reader to
work out.

These methods can possibly get an award for being cute, but
that's about it. Both methods, as presented here, have a fairly low limit on
the maximum number of elements, since all the text you splice in takes up
space. Jim's method is better than Steve's since he splices in less text. But
for a list with 500 elements, that's still 3500 extra characters caused by
those SELECTs. You could write a version that takes an ntext parameter
and breaks it up in slices. I have not pursued this, since my performance
tests indicate that these methods are slower than OPENXML,
which in its turn is slower than the iterative method,
and using a
table of numbers. Neither is there any other particular
advantage with these methods.

When I originally conducted the
performance tests for this article, I was
not aware of Jim's method, and since then I have changed hardware, which is why
it is not possible to paste in the numbers for Jim's method (code-name NOUNION)
with the old data. However, re-testing only OPENXML, UNION and NOUNION, I
find that NOUNION is twice the speed of UNION, but still lags behind OPENXML.

SQL Server MVP Anith Sen proposed a similar method, where he simply converts
the list to a bunch of INSERT statements:
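
A minimal sketch of the idea; the procedure name and the column name in the
target table are assumptions:

CREATE PROCEDURE unpack_with_insert @list varchar(7000),
                                    @tbl  sysname AS
DECLARE @sql varchar(8000)
-- Turn 'ALFKI,BONAP' into: INSERT #tmp (str) VALUES ('ALFKI') INSERT ...
SELECT @sql = 'INSERT INTO ' + @tbl + ' (str) VALUES (''' +
              replace(replace(@list, '''', ''''''), ',',
                      ''') INSERT INTO ' + @tbl + ' (str) VALUES (''') + ''')'
EXEC (@sql)

As for the really slow methods listed in the overview, a typical suggestion
looks like this sketch, here against Northwind..Customers:

SELECT C.CustomerID, C.CompanyName
FROM   Northwind..Customers C
WHERE  charindex(',' + rtrim(C.CustomerID) + ',', ',' + @custids + ',') > 0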

You may recognize the theme from when we used a table
of numbers. By adding commas on both sides of the input string, we can
use charindex to find ",ALFKI," etc. (Note that you cannot have
embedded blanks here.)

The author noted in his column that this method
would not have good performance, since embedding the table column in an
expression precludes use of any index on that column, leading to a table
scan. But that's only a small part of the story. A plain table scan on my
test table takes less than 100 ms, once the table is entirely in cache.
This method needs 42 seconds, even for my shortest test list of 200
characters and 15 items!

The solution with patindex also needs another 42 seconds. The solution
with LIKE was actually four times faster on one of my test machines (but
exactly as slow as charindex and patindex on the other
machines.) Unfortunately, it is still 100 times slower than dynamic SQL, and
200 times slower than the best methods.

If you are using SQL Server 7, you don't have access to user-defined
functions, nor to XML. Still, you can implement the
iterative method or use a table of numbers, with
comma-separated or fixed-length input, in stored
procedures instead. To demonstrate this, I'm including examples for the
iterative method here. You could easily apply the same technique for the
other two methods.

The function for unpacking a list of strings had a dual return table, with
both a varchar and an nvarchar column. There is not really any
reason for this when you use a procedure. The caller specifies the data type
that fits with the kind of data he is working with. In the same manner, the
function does not populate any listpos column; if you want to keep
track of list position, the caller can include an IDENTITY column in the
table.

In an earlier
incarnation of this article, I suggested that you could have the table name
as a parameter, and then use dynamic SQL to insert into that table.
However, this was poor advice. When I ran my performance tests, I found that
the performance penalty for using dynamic SQL for each INSERT statement
is too big to be acceptable. I discuss this further in the section
special observations in the performance tests.

There is, however, a potential performance problem with the solution above as
well. It depends on how you use it, but normally it will be a new temp table each
time. In this case, charlist_to_table_sp will be recompiled on each
invocation. In many situations, this would not be an issue. In fact, I found
in my
performance tests that this procedure was somewhat
faster than the corresponding function for unpacking a string into a table, despite the
recompilation. However, on a busy system with an intensive call frequency you
could experience blocking from compile locks, see
KB article 263889.

One way to avoid this would be to use a permanent table, looking something
like this:
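
A sketch; the exact layout is an assumption:

CREATE TABLE list_table
       (spid    int            NOT NULL,
        listpos int            NOT NULL,
        str     nvarchar(2000) NOT NULL,
        PRIMARY KEY (spid, listpos))

The idea is that each process only works with the rows for its own @@spid,
and deletes them when it is done, so that concurrent callers do not collide.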

Yet another alternative would be to have the temp table local to the
procedure, and then produce a result set, which the caller can catch with
INSERT EXEC. For more details about this technique, and its weaknesses, see
my article How to share data between
stored procedures.

The technique in the previous section can of course be applied to a list of
integers as well, so what comes here is not a true port of the
iter_intlist_to_table
function, but a version that goes to great lengths to validate that the list
items are valid numbers, to avoid a conversion error. And to be extra
ambitious, the procedure allows signed numbers such as +98 or -83. If a
list item is not a legal number, the procedure produces a warning. The
procedure fills in a temp table that has a listpos column; this column
will show a gap if there is an illegal item in the input.

The validation of the list item is in the sub-procedure insert_str_to_number.
For many purposes it would be sufficient to have the test:

@str NOT LIKE '%[^0-9]%' AND len(@str) BETWEEN 1 AND 9

which checks that @str only contains digits and is at most nine digits
long (that is, you reject ten-digit numbers as well as signed numbers).

You might guess that there is a performance cost for this extravaganza, and
indeed the procedure needs about 50% more time
than the corresponding function. Still, for many situations, the execution
time is acceptable.

One note about the warning produced with RAISERROR: with ADO, this warning
may be difficult or impossible to detect on client level. If you change the
severity from 10 to 11, it will be an error, and raise an error in your
client code.

If you are on SQL 6.5, you are far more limited in what you can do. The
varchar
data type is limited to 255 characters, and the substring function
does not work on text, so you cannot split the input up in chunks or extract
fixed-length elements. You can still implement stored procedures as for SQL 7, but since the input string would be limited to 255 characters, the
risk that you every now and then would pass a list exceeding this limit is
too great to be neglected.

Therefore, the only method that is robust enough on 6.5 is
dynamic SQL. You can still use text to pass a long list as in this
example:
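
A sketch (the table and procedure names are only for illustration; the
essential part is the text parameter spliced into EXEC()):

CREATE PROCEDURE get_product_names_65 @ids text AS
   EXEC ('SELECT ProductName, ProductID
          FROM   Products
          WHERE  ProductID IN (' + @ids + ')')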

Testing performance in a DBMS is not a simple task. There are so many factors
that affect performance. In my tests I have tested unpacking the list alone
into a table (for the methods where this is possible) and using the list to
get data from a table. Merely unpacking tells us something about the method
as such, but the great impact comes when you try to use the method to get
data from one or more tables, in which case the nature of the table(s)
involved (indexes, statistics, size etc) can have great importance. You may
find that some method that I claim to be fast results in a slow query plan for
your query. Another issue is how you submit your SQL statements to the server.
For the fixed-string functions, I found that there was a considerable
difference between RPC calls and command-text batches. (All data I present
are from calls submitted through RPC.) Yet another issue is what happens when
the system comes under load. For the sake of simplicity, I have run the tests
on idle machines.

So take these numbers for what they are. They certainly serve to point out
some methods as hopelessly bad. But you will find that several of them have
response times that are perfectly reasonable when run on their own. If you
are worried about performance in a certain context, you should probably run
your own benchmarks.

As I have assembled data for this article, I have tested too many methods to
include data for all of them in the main body of this text. Some of the
methods have just been small variations of each other like returning
varchar
instead of nvarchar, and I have not even bothered to save data for
every little variation. Eventually I have data for 16 test cases. Of these, I
have selected nine to include in the main section
of the test results in the article. Yet a few more appear in the section for
special observations where I highlight some details. The full data for
all 16 methods on the three test machines I have used are available as text
files, for which there are links
further down in this text.

I have given each method that I've tested a name, and I will use these names
as a convenient short-hand notation. All methods appear in the text above;
most of them in code examples, but some are only mentioned in passing.
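
The tests run against a word-list table, usrdictwords. A sketch of how it
might be declared, consistent with the description that follows (the exact
column types are assumptions):

CREATE TABLE usrdictwords
       (wordno int              NOT NULL,
        word   varchar(50)      NOT NULL,
        guid   uniqueidentifier NOT NULL DEFAULT newid())
CREATE CLUSTERED INDEX wordno_ix ON usrdictwords (wordno)
CREATE INDEX word_ix ON usrdictwords (word)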

The table contains 202985 rows, and the average length of the column word
is 9.7 characters. The longest word is 31 characters. The values in the
column wordno are in fact unique; the reason that the index is not
unique is simply a mistake that I didn't notice until I had run all tests.
The order of wordno is not correlated with the order of word.
The column guid serves to make the table a little wider, and to be a
token non-indexed column.

For each test, my test script randomly selected rows from usrdictwords,
to construct one comma-separated list each of strings and integers from the
columns word and wordno
respectively. After each list item the script randomly added 0-3 spaces
between the list elements. (The reason for this is that the test script also
served to validate that the implementations of the methods yielded the
correct result.) The script kept adding to the lists until the list of
strings exceeded a pre-determined size. Thus, the integer lists were shorter
in length, but the number of list elements was the same.

I used five different list sizes:

Size        Length    Avg. no of list elements
Small          237        19
Medium        2456       201
Large         7950       653
X-large      23997      1978
XX-large    120000      9892

I ran all methods for the sizes Small and Medium. For Large and upwards, I
excluded UNION, INSERT,
REALSLOW and SLOW$LIKE, the
former two because they cannot handle input of this size, the latter two
because they are too slow for my patience. From X-Large and on, I excluded TBLNUM$IL (size constraint). For XX-Large I did not include FIX$SINGLE (size constraint) and
ITER$EXEC, EXEC$A
and EXEC$B (too slow).

Note that these list lengths apply to the lists of strings originally
constructed by the test script (see above). For many methods, the input was
adapted to the format required by the method, which could make the string
both shorter (embedded blanks stripped) and longer (XML or fixed-length
element). The actual list elements were still the same for all methods.

UNPACK

Unpacking the input list itself into a result set,
without involving another table. This test cannot easily be implemented for
all methods. You rarely have the need for this in real life, but the test
isolates the performance of the method as such.

JOIN

Using the input list to extract data from a table. With
most methods this means a join operation, and I denote this operation as JOIN
also for a method like dynamic SQL even if there
is no actual join taking place.

Thus, the procedure starts a timer, issues the query, inserting the result into
a table in the test database, stops the timer, and returns the data to the test
script which then checks that the method returned the expected data. I chose to
catch the data with SELECT INTO, as SELECT INTO is less logged than INSERT.
Returning the data while the timer is going would mean that network latency
could have had a gross effect on the result. A side effect of the arrangement with
SELECT INTO is that the procedure is always recompiled when it hits the last
SELECT statement, but that is outside the timer. I also give an example of a
JOIN procedure to highlight one important detail with the string join:
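
The actual test procedures are in the accompanying zip file; this is only a
sketch of what a JOIN-Str procedure for the iterative method might look like:

CREATE PROCEDURE JOIN_ITER_Str @list text AS
DECLARE @start datetime
SELECT @start = getdate()

-- The timed part: unpack the list and join with the test table.
SELECT d.wordno, d.word, d.guid
INTO   join_iter_str_result
FROM   usrdictwords d
JOIN   iter_charlist_to_table(@list, DEFAULT) t ON d.word = t.str

SELECT tookms = datediff(ms, @start, getdate())

-- Return the data to the test script, outside the timer.
SELECT wordno, word, guid FROM join_iter_str_result
DROP TABLE join_iter_str_result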

Note here that when joining over the string column in the test table, I am
also retrieving the column guid which is not in any index. This forces
SQL Server to access the data pages of the table. (Otherwise the query would
be covered by the index on word.) This gives the optimizer two
choices: using the non-clustered index on word with bookmark lookups
or scanning the table.

For most methods, there have been four procedures: UNPACK-Str, UNPACK-Int,
JOIN-Str and JOIN-Int.

The test script is written in Perl, connecting to SQL Server using DB-Library
(because this is the client library I know well). Using DB-Library has
imposed some limitations; test procedures cannot have ntext
parameters, nor
varchar longer than 255 characters. Therefore all procedures use text
for the input parameter. The test script issues the same set of SET commands
that OLE DB and ODBC do when they connect. It also issues SET NOCOUNT ON. The
test script is available in the accompanying zip file.

When running the tests, the test script creates lists of strings and integers
as described above, and then passes the assembled
lists to all test procedures currently in the test database. The test script
uses a fixed seed for the random generator, so different runs of the script
generate the same test data. This permitted me to test a method
separately from the rest, yet using exactly the same data as for the others.
(This was handy when I found an improvement for one method, and didn't want
to rerun all the others.) I ran the tests 100 times for each string size,
except for REALSLOW and SLOW$LIKE for which I only ran two test runs each, since these methods
are excessively slow.

ABA0163

This is my office workstation with a single Pentium III
996 MHz, 512 MB of RAM (though SQL Server is constrained to use around 120 MB)
and SCSI disks. SQL 2000 SP2.

KESÄMETSÄ

This is my home machine with a single Pentium 4, 2.266
GHz, 512 MB and IDE disks. SQL2000 SP3.

For all tests, I ran the test script from KESÄMETSÄ. To ABA0163 and JAMIE4K I
connected to our office over VPN. The machines were generally idle when I ran
the scripts.

In the main presentation of the results in this article, I only present the
results from JAMIE4K. I find JAMIE4K more representative for two reasons: 1)
It is the only test machine which is a real server, and therefore bears more
resemblance to production machines in general. 2) As JAMIE4K has slower
CPUs, the timings are somewhat more accurate. (Time in SQL Server has a
resolution of 3.33 milliseconds, so timings below 50 ms are not very
reliable.)

Since the test table is small enough to be in cache throughout the tests,
it's mainly an issue of CPU power, and for most methods, ABA0163 is twice as
fast as JAMIE4K and KESÄMETSÄ is four times as fast. But there are some
exceptions to this rule, which may be due to JAMIE4K having been able to use
all four processors in those cases.

Here I present average execution times in milliseconds for the nine
main contenders on JAMIE4K. To see the full results for
all sixteen methods on each machine with minimum, average and maximum
execution times, standard deviation and variation coefficient, here is one
link each for JAMIE4K,
ABA0163 and
KESÄMETSÄ.

There is one safe conclusion you can draw from this data: the methods
REALSLOW and SLOW$LIKE are not usable. (The reason SLOW$LIKE is four times
faster than REALSLOW for JOIN-Str is obviously that for some reason
SQL Server has not used all four processors on JAMIE4K for REALSLOW in this
case. Compare with the results on ABA0163
and
KESÄMETSÄ.)

For the other seven methods, results are very close to each other, although
EXEC$A sticks out a little. Thus, for this small input size, all these
methods are OK. I would again like to stress that the resolution of the datetime
data type in SQL Server does not lend itself to accuracy at this level. On
the faster machines ABA0163 and KESÄMETSÄ all methods, save REALSLOW and
SLOW$LIKE, have at least one run which is timed at 0 ms.

The results for REALSLOW and SLOW$LIKE are now even more ridiculous. You see
now, why I didn't want to run them with string size Large.

The other seven methods are still showing results that most people would
consider perfectly acceptable. Nevertheless, we can see that EXEC$A and UNION
are starting to lag behind the other five. UNION has a size constraint, and
cannot handle input of the size Large, unless we chop the input in slices.
This is something that I have not pursued, and the reason is precisely the
numbers above. It does not seem likely that UNION would be able to give the
other methods serious competition.

One could note that on the other two machines, several of the methods still
have individual runs that are timed at 0 ms, and that the data size is still
too small for accurate conclusions.

The field is starting to spread out, and by now you can clearly rank the
methods in performance order. When it comes to just unpacking the string,
FIX$SINGLE is almost as fast as with string size Small. But the method to pay
most attention to here is dynamic SQL, EXEC$A. 2½ seconds to get the string
elements, when all other methods need less than 350 ms, is simply not
acceptable.

At this size dynamic SQL really gets out of hand. And just how much it gets
out of hand, we will return to in a second when we look at
trend analysis.

Since this is the last size at which we see FIX$SINGLE (with 8000 rows in
Numbers, it does not handle size XX-Large), let's make a summary of the
methods in terms of speed. You can very clearly see that using fixed-length
elements and unpacking them with a table of numbers is the fastest method.
You can also see that the simple
fixstring_single is distinctly faster than the more complicated
fixstring_multi. True, the latter permits virtually unlimited input.
Then again, you can add more numbers to Numbers
to push the limit of what fixstring_single can handle, and 8000 elements
in a string array is already a whole lot.

Now, there are two issues that slightly reduce the superiority of the
fixed-length methods. The first has to do with how the method was invoked. In
my tests, I called all test procedures through RPC, which is the normal way to
call them from an application. I found that when I instead passed a command batch with
an EXEC statement, then over a certain length of the input parameter, the
fixed-length methods, and only they, took a toll. FIX$SINGLE was still the
fastest, but TBLNUM overtook FIX$MULTI for the #2 position.

The second issue arises when we look at the deltas between times for the
corresponding JOIN and UNPACK operations:

Exactly what these numbers say, I don't really know. They could say that with
FIX$SINGLE, the time to get the data itself is longer, and thus with some
other table, some other method would be faster. (The reason for this would be
a different query plan.) Then again, since XML, the slowest method, has the
smallest difference, it may be that the times for unpacking the string and
getting the table data overlap, and this is an issue we should not pay any
attention to at all. If top-most performance is a key issue for you, you need
to investigate this further for your own case, as I pointed out in my
general disclaimer.

Finally, it is worth remembering that performance is not everything. If you
prefer to use a delimited list, you can still unpack with a table of numbers
for good performance. And if you don't want the extra table, use the
iterative method. Half a second to unpack a 24000-character
list of 1980 elements is nothing to be alarmed about.

A reasonable assumption is that the execution time for a certain string
length follows the linear equation:

t(s) = y + k·s

Here s is the string size, y represents
some initial overhead, and k is some constant fairly close
to 1. The values of y
and k are of course different depending on method.

We can observe two things from this table: 1) It appears that the faster the
method, the smaller the slope (k). Thus, if input size
grows even more, the faster methods will keep their position. 2) Dynamic SQL
does not live up to the assumption that k is close to 1, but
appears to have a slope somewhere near 2 (if the equation is linear at all).
Which means that if the input size doubles, the execution time for dynamic SQL
quadruples. Did I say that this is a bad method?

Note: such a linear equation exists only as long as we don't run into some
resource limit. As I've noted, the fixed-length methods see a slowdown over a
certain input size when they are invoked in a command-text batch.

A version of iter_charlist_to_table_sp that accepts
the name of a temp table as a parameter and that uses dynamic SQL to insert
the data into the table.

Let's first look at the data for an array of strings:

String size   ITER                    ITER$PROC               ITER$EXEC
              UNPACK   JOIN   Delta   UNPACK   JOIN   Delta   UNPACK   JOIN   Delta
Small             51     51       0       58     64       6       88     89       1
Medium            88    116      28       89    123      34      414    443      29
Large            191    269      78      182    284     102     1223   1321      98
X-Large          477    722     245      431    731     300     3570   3868     298
XX-Large        2211   3329    1118     1943   3019    1076        -      -       -

We see here that the procedure version actually is faster than the function
if we are only unpacking the string. This is somewhat surprising, not
least since the procedure is recompiled each time. The two possible reasons I
can think of are that a temp table in general is faster than a table variable, or
that a procedure in general is faster than a multi-step function. But when we
use the result to get data from the test table, the procedure is no longer
the faster one for sizes Large and X-Large. It seems that joining with the temp
table is more expensive than joining with the table variable. This could be
due to different query plans (recall that there is a
choice of table scan or index seek + bookmark lookup in this case). In any
case, for XX-Large, the temp table is faster also for joining, and also has a
smaller delta.

Despite these numbers, personally I find the function interface much nicer to
work with, as I can slap it right into the query, which is why I have focused
on the functions in this text.

And ITER$EXEC... The idea was that you could pass a table name as a
parameter, and then use dynamic SQL to insert the list items into the table
like this:
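
That is, something like this sketch for each list element extracted in the
loop (@tblname and @str are hypothetical variable names):

EXEC ('INSERT INTO ' + @tblname + ' (str) VALUES (''' + @str + ''')')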

The feature may look neat, but as you see in the table above, it is
expensive. Too expensive to be defensible, in my opinion. The reason it is
expensive is that for each item in the list, SQL Server has to parse and
build a query plan for the INSERT statement. This is yet another example of how
dynamic SQL in the wrong place gives you bad performance. (There are also
places where dynamic SQL is the right thing for best performance.) But at
least ITER$EXEC has a slope close to 1 in its equation, unlike
EXEC$A.

Not surprisingly, we can conclude that the extravaganza has a price in
performance. Still, it is not as fatal as that little piece of dynamic SQL in
ITER$EXEC. And if you rewind to the main results, you
see that this procedure is still faster than XML.

What is more remarkable is the difference in delta for sizes Large and
X-Large. This cannot be explained by the extravaganza. Again, a possible
cause for the difference is the query plans. Since the
test table has a clustered index on its integer
column, one may think that there are no alternative strategies, but SQL Server
still has three types of joins to choose from. I have not investigated what
exactly is going on here, but rather I again refer to my
general disclaimer: you have to test with your case to get your exact
data.

I would like to stress that the differences between procedure and function that we
have seen here relate to the iterative method, which performs more INSERT
statements than any other method. I would be cautious about extrapolating these
observations to the other methods.

There were two functions for unpacking a delimited string with a
table of numbers in the test: the multi-step TBLNUM, and the inline
TBLNUM$IL, limited to 7998 characters of input. Here are the results for
these two methods for string size Large:

You might get the impression that the multi-step function is actually somewhat
faster than the inline function, but partially this is a trick of the eye. The
multi-step function uses nvarchar and the inline function varchar (else it
would not be able to handle this size). When both use varchar, the inline
function is maybe 5% faster. But since execution times at this size are very
moderate anyway, this difference is mainly of academic interest.

Again, I would like to stress that the difference (or lack of one) between inline and
multi-step relates to delimited input. I have not performed similar studies for
fixed-length input, for which the findings might be different.

EXEC$B is not really a method. EXEC$B is when you execute exactly the same
dynamic SQL statement a second time. Its
behaviour in the tests also defies everything else. For once, I include the
results from the other two machines:

String size   JAMIE4K         ABA0163         KESÄMETSÄ
              Str     Int     Str     Int     Str     Int
Small          51      68       4       6       4       6
Medium         73      54      17      96     167     109
Large         141      83     358      18     178      15
X-Large       330     147     386      48     193      30

If we first look at the numbers for JAMIE4K and compare with the other
methods, we see that only FIX$SINGLE is able
to match EXEC$B, and not always; for size Large, EXEC$B
wins for Str and is tied with FIX$SINGLE for Int. But since there is a very
large cost for EXEC$B, to wit EXEC$A, the initial query to get the plan into
the cache, EXEC$B is not a real contender. You would have to submit that same
SQL statement 20 times to outperform XML for size
X-Large. Even more to outperform ITER or TBLNUM.

Now
look at the results for Int on ABA0163 and KESÄMETSÄ, and the row for Medium.
What on Earth is going on here? Recall that this is the average of 100 runs,
so it is not an occasional spurious result. Since there is a clustered index
on the Int column, it is difficult to think of another query plan; or is
there a non-clustered index that the optimizer could get the idea to scan?

If you feel like running these tests yourself, maybe testing your own method,
you can download the file arraylist.zip
that includes the test script, the test procedures, the functions for the
various methods and the table of numbers. The test data is in a separate
file,
usrdictwords.zip, for the simple reason that this file is 6 MB in size. (The
arraylist.zip is a mere 30 KB.) Please
refer to the README.HTML in the
arraylist.zip for further instructions.

I have not invented all these methods myself. This article summarizes
the work of a lot of people in the SQL community. There are those who came up
with an original idea, those who have refined interfaces etc. I cannot
mention them all, because I simply don't know about all of them.

If you have suggestions for other methods, or twists of those listed, that
you think ought to be mentioned in this article, or if you have other
comments or questions, please drop me a line at
esquel@sommarskog.se.