Consider the situation where you have a list of row IDs and you need to retrieve the data for each of the rows. The simplest way is to make one query per row, i.e.

(A) SELECT * from data_table WHERE id=?

For a large number of rows, that results in a lot of queries. This could be condensed into one query, such as:

(B) SELECT * from data_table WHERE id=1 OR id=2 OR id=3 …

or

(C) SELECT * from data_table WHERE id IN (1,2,3,…)

When constructing potentially large SQL statements such as these (imagine if you wanted to retrieve 1,000,000 rows), it’s important to take into account the max_allowed_packet size which restricts the length of the query. It might be necessary to divide the data up into several blocks and make a query for each block to ensure max_allowed_packet is not exceeded.

Another approach is to create a temporary table, insert the keys of the required rows, then do a JOIN query to retrieve the data, i.e.

(D) CREATE TEMPORARY TABLE tmp ( id INT(11) );

INSERT INTO tmp (id) VALUES (1), (2), (3), …

SELECT d.* FROM data_table d JOIN tmp USING (id)

This approach is somewhat cleaner, particularly when multiple keys are involved. With multiple keys the WHERE syntax of the prior options becomes:

WHERE (key1=x1 AND key2=y1) OR (key1=x2 AND key2=y2) …

or

WHERE (key1, key2) IN ((x1, y1), (x2, y2), …)

Under the temporary table approach, the question then arises as to how to most efficiently insert the data. A ‘LOAD DATA INFILE’ approach is the most efficient way to load a table, but here we assume this is not an option as it is not readily portable (due to security settings that differ between local and remote MySQL daemons). The example (D) above assumes a long INSERT statement, which again may be affected by max_allowed_packet. Other options include:

(G) Multiple single INSERTs as an array, using the DBI execute_array() function

(H) As for (G), in a transaction block.

These options were benchmarked using MySQL 5.0.45 and the results are shown in the figure below. As would be expected, the use of single select statements scales linearly. For small query set sizes, the setup times for the different query approaches have significant impact on the performance; as the query set size increases, three classes emerge – one group that performs similarly to single selects, another that performs much much better, and one that lives on a completely different planet (one you wouldn’t want to visit). In summary:

That SELECT + IN(…) (case C) offers best performance when the query set size is above 30 or so. It is also interesting to note that the performance of SELECT + IN(…) is very similar to using a temporary table with a single, long INSERT statement for large query set sizes, presumably because internally the IN(…) operation is essentially implemented as a temporary table.

That SELECT + OR (case B) is a good choice for query set size < 30

That SELECT + OR hits a point where performance becomes exponentially worse (not shown on the graph, for the largest data set the performance reaches 1300s per query set! Curiously, this is elapsed time, but CPU time does not significantly increase. This suggest there are some inefficient data moves/swapping occurring).

In short, as a rule of thumb, use SELECT + OR for query sets < 30 in size, and SELECT + IN(…) otherwise.

The SELECT + OR performance is a significant result; the Perl SQL::Abstract library turns a WHERE specification such as { A => [ 1, 2, 3] } into WHERE ( ( ( A = ? ) OR ( A = ? ) OR ( A = ? ) ) ). It will do the same if there are 1000 options (try it – perl -MSQL::Abstract -e ‘$sql = SQL::Abstract->new; $w = $sql->where({ A => [ 1 .. 1000]}); print $w’). Thus libraries that use SQL::Abstract, such as DBIx::Class, are similarly affected. A perfectly reasonable approach from the library’s perspective, but potentially a significant performance hit if used in this manner.