Learning MySQL (2007)

Part II. Using MySQL

Chapter 7. Advanced Querying

Over the previous two chapters, you’ve completed an introduction to the basic features of querying and modifying databases with SQL. You should now be able to create, modify, and remove database structures, as well as work with data as you read, insert, delete, and update entries. Over the next three chapters, we’ll look at more advanced concepts. You can skim these chapters and return to read them thoroughly when you’re comfortable with using MySQL.

This chapter teaches you more about querying, giving you skills to answer complex information needs. You’ll learn how to:

§ Use nicknames, or aliases, in queries to save typing and allow a table to be used more than once in a query

§ Aggregate data into groups so you can discover sums, averages, and counts

§ Join tables in different ways

§ Use nested queries

§ Save query results in variables so they can be reused in other queries

§ Understand why MySQL supports several table types

Aliases

Aliases are nicknames. They give you a shorthand way of expressing a column, table, or function name, allowing you to:

§ Use special types of nested queries; these are the subject of Nested Queries,” discussed later in this chapter

Column Aliases

Column aliases are useful for improving the expression of your queries, reducing the number of characters you need to type, and making it easier to work with languages such as PHP. Consider a simple, not-very-useful example:

mysql> SELECT artist_name AS artists FROM artist;

+---------------------------+

| artists |

+---------------------------+

| New Order |

| Nick Cave & The Bad Seeds |

| Miles Davis |

| The Rolling Stones |

| The Stone Roses |

| Kylie Minogue |

+---------------------------+

6 rows in set (0.00 sec)

The column artist_name is aliased as artists. You can see that in the output, the usual column heading, artist_name, is replaced by the alias artists. The advantage is that the alias artists might be more meaningful to users. Other than that, it’s not very useful, but it does illustrate the idea: for a column, you add the keyword AS and then a string that represents what you’d like the column to be known as.

Now let’s see column aliases doing something useful. Here’s an example that uses a MySQL function and an ORDER BY clause:

The MySQL function CONCAT() concatenates together the strings that are parameters—in this case, the artist_name, a constant string recorded, and the album_name to give output such as New Order recorded Brotherhood. We’ve added an alias to the function, AS recording, so that we can refer to it easily as recording throughout the query. You can see that we do this in the ORDER BY clause, where we ask MySQL to sort the output by ascending recording value. This is much better than the unaliased alternative, which requires you to write out theCONCAT() function again:

mysql> SELECT CONCAT(artist_name, " recorded ", album_name)

-> FROM artist INNER JOIN album USING (artist_id)

-> ORDER BY CONCAT(artist_name, " recorded ", album_name);

+-------------------------------------------------------------+

| recording |

+-------------------------------------------------------------+

| Kylie Minogue recorded Light Years |

| Miles Davis recorded In A Silent Way |

| Miles Davis recorded Live Around The World |

| New Order recorded Brotherhood |

| New Order recorded Power, Corruption & Lies |

| New Order recorded Retro - John McCready FAN |

| New Order recorded Retro - Miranda Sawyer POP |

| New Order recorded Retro - New Order / Bobby Gillespie LIVE |

| New Order recorded Substance (Disc 2) |

| New Order recorded Substance 1987 (Disc 1) |

| Nick Cave & The Bad Seeds recorded Let Love In |

| The Rolling Stones recorded Exile On Main Street |

| The Stone Roses recorded Second Coming |

+-------------------------------------------------------------+

13 rows in set (0.21 sec)

The alternative is unwieldy, and worse, you risk mistyping some part of the ORDER BY clause and getting a result different from what you expect. (Note that we’ve used as recording on the first line so that the displayed column has the label recording.)

There are restrictions on where you can use column aliases. You can’t use them in a WHERE clause, or in the USING and ON clauses that we discuss later in this chapter. This means you can’t write a query such as:

mysql> SELECT artist_name AS a FROM artist WHERE a = "New Order";

ERROR 1054 (42S22): Unknown column 'a' in 'where clause'

You can’t do this because MySQL doesn’t always know the column values before it executes the WHERE clause. However, you can use column aliases in the ORDER BY clause, and in the GROUP BY and HAVING clauses discussed later in this chapter.

The AS keyword is optional. Because of this, the following two queries are equivalent:

We recommend using the AS keyword, since it helps to clearly distinguish an aliased column, especially where you’re selecting multiple columns from a list of columns separated by commas.

Alias names have few restrictions. They can be at most 255 characters in length and can contain any character. If you plan to use characters that might confuse the MySQL parser—such as periods, commas, or semicolons—make sure you enclose the alias name in backticks. We recommend using lowercase alphanumeric strings for alias names and using a consistent character choice—such as an underscore—to separate words. Aliases are case-insensitive on all platforms.

Table Aliases

Table aliases are useful for the same reasons as column aliases, but they are also sometimes the only way to express a query. This section shows you how to use table aliases, and Nested Queries,” later in this chapter, shows you other sample queries where table aliases are essential.

Here’s a basic table-alias example that shows you how to save some typing:

mysql> SELECT ar.artist_id, al.album_name, ar.artist_name FROM

-> album AS al INNER JOIN artist AS ar

-> USING (artist_id) WHERE al.album_name = "Brotherhood";

+-----------+-------------+-------------+

| artist_id | album_name | artist_name |

+-----------+-------------+-------------+

| 1 | Brotherhood | New Order |

+-----------+-------------+-------------+

1 row in set (0.00 sec)

You can see that the album and artist tables are aliased as al and ar, respectively, using the AS keyword. This allows you to express column names more compactly, such as ar.artist_id. Notice also that you can use table aliases in the WHERE clause; unlike column aliases, there are no restrictions on where table aliases can be used in queries. From our example, you can see that we’re referring to the table aliases before they have been defined.

As with column aliases, the AS keyword is optional. This means that:

album AS al INNER JOIN artist AS ar

is the same as:

album al INNER JOIN artist ar

Again, we prefer the AS style, as it’s clearer to anyone looking at your queries than the alternative. The restrictions on table-alias-name characters and lengths are the same as column aliases, and our recommendations on choosing them are the same, too.

As discussed in the introduction to this section, table aliases allow you to write queries that you can’t otherwise easily express. Consider an example: suppose you want to know whether two or more artists have released an album of the same name and, if so, what the identifiers for those artists are. Let’s think about the basic requirement: you want to know if two albums have the same name. To do this, you might try a query like this:

mysql> SELECT * FROM album WHERE album_name = album_name;

But that doesn’t make sense: an album has the same name as itself, and so it just produces all albums as output:

+-----------+----------+------------------------------------------+

| artist_id | album_id | album_name |

+-----------+----------+------------------------------------------+

| 2 | 1 | Let Love In |

| 1 | 1 | Retro - John McCready FAN |

| 1 | 2 | Substance (Disc 2) |

| 1 | 3 | Retro - Miranda Sawyer POP |

| 1 | 4 | Retro - New Order / Bobby Gillespie LIVE |

| 3 | 1 | Live Around The World |

| 3 | 2 | In A Silent Way |

| 1 | 5 | Power, Corruption & Lies |

| 4 | 1 | Exile On Main Street |

| 1 | 6 | Substance 1987 (Disc 1) |

| 5 | 1 | Second Coming |

| 6 | 1 | Light Years |

| 1 | 7 | Brotherhood |

+-----------+----------+------------------------------------------+

13 rows in set (0.01 sec)

What you really want is to know if two different albums from the album table have the same name. But how can you do that in a single query? The answer is to give the table two different aliases; you then check if one row in the first aliased table matches a row in the second:

mysql> SELECT a1.artist_id, a2.album_id

-> FROM album AS a1, album AS a2 WHERE

-> a1.album_name = a2.album_name;

+-----------+----------+

| artist_id | album_id |

+-----------+----------+

| 2 | 1 |

| 1 | 1 |

| 1 | 2 |

| 1 | 3 |

| 1 | 4 |

| 3 | 1 |

| 3 | 2 |

| 1 | 5 |

| 4 | 1 |

| 1 | 6 |

| 5 | 1 |

| 6 | 1 |

| 1 | 7 |

+-----------+----------+

13 rows in set (0.01 sec)

But it still doesn’t work! We get all 13 albums as answers. The reason is that an album still matches itself because it occurs in both aliased tables.

To get the query to work, we need to make sure an album from one aliased table doesn’t match itself in the other aliased table. The way to do so is to specify that the albums in each table shouldn’t have the same artist:

mysql> SELECT a1.artist_id, a2.album_id

-> FROM album AS a1, album AS a2

-> WHERE a1.album_name = a2.album_name

-> AND a1.artist_id != a2.artist_id;

Empty set (0.00 sec)

You can now see that there aren’t two albums in the database with the same name but by different artists. The additional AND a1.artist_id != a2.artist_id stops answers from being reported where the artist is the same in both tables.

Table aliases are also useful in nested queries that use the EXISTS and ON clauses. We show you examples later in this chapter when we introduce nested techniques.

Aggregating Data

Aggregate functions allow you to discover the properties of a group of rows. You use them for purposes such as discovering how many rows there are in a table, how many rows in a table share a property (such as having the same name or date of birth), finding averages (such as the average temperature in November), or finding the maximum or minimum values of rows that meet some condition (such as finding the coldest day in August).

This section explains the GROUP BY and HAVING clauses, the two most commonly used SQL statements for aggregation. But first, it explains the DISTINCT clause, which is used to report unique results for the output of a query. When neither the DISTINCT nor the GROUP BY clause is specified, the returned raw data can still be processed using the aggregate functions that we describe in this section.

The DISTINCT Clause

To begin our discussion on aggregate functions, we’ll focus on the DISTINCT clause. This isn’t really an aggregate function, but more of a post-processing filter that allows you to remove duplicates. We’ve added it into this section because, like aggregate functions, it’s concerned with picking examples from the output of a query, rather than processing individual rows.

An example is the best way to understand DISTINCT. Consider this query:

mysql> SELECT DISTINCT artist_name FROM

-> artist INNER JOIN album USING (artist_id);

+---------------------------+

| artist_name |

+---------------------------+

| New Order |

| Nick Cave & The Bad Seeds |

| Miles Davis |

| The Rolling Stones |

| The Stone Roses |

| Kylie Minogue |

+---------------------------+

6 rows in set (0.03 sec)

The query finds artists who have made albums—by joining together artist and album with an INNER JOIN clause—and reports one example of each artist. You can see that we have six artists in our database for whom we own albums. If you remove the DISTINCT clause, you get one row of output for each album we own:

mysql> SELECT artist_name FROM

-> artist INNER JOIN album USING (artist_id);

+---------------------------+

| artist_name |

+---------------------------+

| New Order |

| New Order |

| New Order |

| New Order |

| New Order |

| New Order |

| New Order |

| Nick Cave & The Bad Seeds |

| Miles Davis |

| Miles Davis |

| The Rolling Stones |

| The Stone Roses |

| Kylie Minogue |

+---------------------------+

13 rows in set (0.00 sec)

So, the DISTINCT clause helps get a summary.

The DISTINCT clause applies to the query output and removes rows that have identical values in the columns selected for output in the query. If you rephrase the previous query to output both artist_name and album_name (but otherwise don’t change the JOIN clause and still useDISTINCT), you’ll get all 13 rows in the output:

Because none of the rows are identical, no duplicates are removed using DISTINCT. You can check this by rephrasing the query to omit the DISTINCT clause; you’ll get the same output.

To remove duplicates, MySQL needs to sort the output. If indexes are available that are in the same order as required for the sort—or the data itself is in an order that’s useful—this process has very little overhead. However, for large tables and without an easy way of accessing the data in the right order, sorting can be very slow. You should use DISTINCT (and other aggregate functions) with caution on large data sets. If you do use it, you can check its behavior using the EXPLAIN statement discussed in Chapter 8.

The GROUP BY Clause

The GROUP BY clause sorts data into groups for the purpose of aggregation. It’s similar to ORDER BY, but it occurs much earlier in the query process: GROUP BY is used to organize the data before other clauses—such as WHERE, ORDER BY, and functions—are applied. In contrast, ORDER BYis applied last—after the query has been resolved—to reorganize the query output for display.

An example will help you understand what GROUP BY is used for. Suppose you want to know how many albums we own by each artist. Using the techniques you’ve learned so far, you could perform an INNER JOIN between artist and album, and use an ORDER BY artist_name clause to organize the artists into an order to make it easy for you to count. Here’s the query that you’d use:

mysql> SELECT artist_name FROM

-> artist INNER JOIN album USING (artist_id)

-> ORDER BY artist_name;

+---------------------------+

| artist_name |

+---------------------------+

| Kylie Minogue |

| Miles Davis |

| Miles Davis |

| New Order |

| New Order |

| New Order |

| New Order |

| New Order |

| New Order |

| New Order |

| Nick Cave & The Bad Seeds |

| The Rolling Stones |

| The Stone Roses |

+---------------------------+

13 rows in set (0.00 sec)

By running down the list, it’s easy to count off how many albums we’ve got by each artist: one by Kylie Minogue, two by Miles Davis, seven by New Order, and so on.

The GROUP BY clause can help automate this process by grouping the albums by artist; we can then use the COUNT() function to count off the number of albums in each group. Here’s the query that does what we want:

mysql> SELECT artist_name, COUNT(artist_name) FROM

-> artist INNER JOIN album USING (artist_id)

-> GROUP BY artist_name;

+---------------------------+--------------------+

| artist_name | COUNT(artist_name) |

+---------------------------+--------------------+

| Kylie Minogue | 1 |

| Miles Davis | 2 |

| New Order | 7 |

| Nick Cave & The Bad Seeds | 1 |

| The Rolling Stones | 1 |

| The Stone Roses | 1 |

+---------------------------+--------------------+

6 rows in set (0.01 sec)

You can see that the output we’ve asked for is artist_name, COUNT(artist_name), and this tells us exactly what we wanted to know. Notice also that we’ve used GROUP BY artist_name to sort early for aggregation, rather than using ORDER BY artist_name later for presentation.

Let’s consider the query further. We’ll start with the GROUP BY clause. This tells us how to put rows together into groups: in this example, we’re telling MySQL that the way to group rows is by artist_name. The result is that rows for artists with the same name form a cluster—that is, each distinct name becomes one group. Once the rows are grouped, they’re treated in the rest of the query as if they’re one row. So, for example, when we write SELECT artist_name, we get just one row for each group. This is exactly the same as DISTINCT, which performs the same function as grouping by a column name and then selecting that column for display. The COUNT() function tells us about the properties of the group. More specifically, it tells us the number of rows that form each group; you can count any column in a group, and you’ll get the same answer, soCOUNT(artist_name) is the same as COUNT(*) or COUNT(artist_id). Of course, you can use a column alias for the COUNT() column.

Let’s try another example. Suppose you want to know how many tracks are on each album, along with the artist and album name. Here’s the query:

Before we discuss what’s new, think about the general function of the query: it’s an INNER JOIN between artist, album, and track using the primary-key (identifier) columns. Forgetting the aggregation for a moment, the output of this query is one row per track.

The GROUP BY clause puts the rows together into clusters. In this query, we want the tracks grouped together for each album by an artist. So, the GROUP BY clause uses artist_id and album_id to do that. You can use the artist_id from any of the three tables; artist.artist_id,album.artist_id, or track.artist_id are the same for this purpose. It doesn’t matter since the INNER JOIN makes sure they match anyway. The same applies to album_id.

As in the previous example query, we’re using the COUNT() function to tell us how many rows are in each group. For example, you can see that COUNT(*) tells us that there are 15 tracks on New Order’s Retro - John McReady FAN album. Again, it doesn’t matter what column or columns you count in the query: for example, COUNT(*) has the same effect as COUNT(artist.artist_id) or COUNT(artist_name).

Let’s try another example. Say we want to know how many times we’ve listened to tracks on each album. This query is a little trickier than the previous ones: we need to think carefully about how to group the rows. We want rows for each album grouped together—that is, we want to count the total number of times that any of the tracks on the album have been played. So, we want to group together by artist and by album; we don’t want to group by track, since that’d split the tracks from each album into different groups and tell us how many times we’d listened to each track. We also need a four-way join between all four tables in the database, but that isn’t hard to do using the skills we’ve developed so far. Here’s the query:

mysql> SELECT artist_name, album_name, COUNT(*) FROM

-> artist INNER JOIN album USING (artist_id)

-> INNER JOIN track USING (artist_id, album_id)

-> INNER JOIN played USING (artist_id, album_id, track_id)

-> GROUP BY album.artist_id, album.album_id;

+-------------+----------------------------+----------+

| artist_name | album_name | COUNT(*) |

+-------------+----------------------------+----------+

| New Order | Retro - Miranda Sawyer POP | 8 |

| Miles Davis | Live Around The World | 3 |

+-------------+----------------------------+----------+

2 rows in set (0.11 sec)

You can see we’ve only listened to two albums: we’ve listened to one or more tracks on New Order’s Retro - Miranda Sawyer POP eight times, and one or more tracks on the Miles Davis album Live Around The World three times. We don’t know whether it’s the same track we’ve listened to multiple times, different tracks a few times, or many tracks once: the GROUP BY clause hides the details. Again, we use COUNT(*) to do the counting of rows in the groups, and you can see the INNER JOIN spread over lines 2 to 4 in the query.

Before we end this section, let’s consider how results are displayed for a grouping operation. The output rows are grouped together according to the GROUP BY clause, with one row displayed for each group. You will typically not ask for fields that are collected together in the grouping process, since the result will be meaningless. For example, grouping the tracks by artist will produce:

mysql> SELECT * FROM track GROUP BY artist_id;

+----------+----------------------+-----------+----------+----------+

| track_id | track_name | artist_id | album_id | time |

+----------+----------------------+-----------+----------+----------+

| 0 | Elegia | 1 | 1 | 00:04:93 |

| 0 | Do You Love Me? | 2 | 1 | 00:05:95 |

| 0 | In A Silent Way | 3 | 1 | 00:01:81 |

| 0 | Rocks Off | 4 | 1 | 00:04:54 |

| 0 | Breaking Into Heaven | 5 | 1 | 00:11:37 |

| 0 | Spinning Around | 6 | 1 | 00:03:46 |

+----------+----------------------+-----------+----------+----------+

6 rows in set (0.01 sec)

Only the artist_id here is meaningful; the rest of the columns just contain the first-listed entry from each group. To illustrate this point, “Elegia” is the first track that would be listed for artist_id 1 if we hadn’t performed any grouping:

mysql> SELECT * FROM track WHERE artist_id=1;

+----------+----------------------+-----------+----------+----------+

| track_id | track_name | artist_id | album_id | time |

+----------+----------------------+-----------+----------+----------+

| 0 | Elegia | 1 | 1 | 00:04:93 |

| 1 | In A Lonely Place | 1 | 1 | 00:06:26 |

| 2 | Procession | 1 | 1 | 00:04:28 |

...

+----------+----------------------+-----------+----------+----------+

86 rows in set (0.00 sec)

Other aggregate functions

We’ve seen examples of how the COUNT() function can be used to tell how many rows are in a group. Here are other functions commonly used to explore the properties of aggregated rows:

AVG()

Returns the average (mean) of the values in the specified column for all rows in a group. For example, you could use it to find the average cost of a house in a city, when the houses are grouped by city:

SELECT AVG(cost) FROM house_prices GROUP BY city;

MAX()

Returns the maximum value from rows in a group. For example, you could use it to find the warmest day in a month, when the rows are grouped by month.

MIN()

Returns the minimum value from rows in a group. For example, you could use it to find the youngest student in a class, when the rows are grouped by class.

STD() or STDDEV()

Returns the standard deviation of values from rows in a group. For example, you could use it to understand the spread of test scores, when rows are grouped by university course.

SUM()

Returns the sum of values from rows in a group. For example, you could use it to compute the dollar amount of sales in a given month, when rows are grouped by month.

There are other functions available for use with GROUP BY; they’re less frequently used than the ones we’ve introduced. You can find more details on them in the MySQL manual under the heading “GROUP BY (Aggregate) Functions.”

The HAVING Clause

You’re now familiar with the GROUP BY clause, which allows you to sort and cluster data. You should now be able to use it find out about counts, averages, minimums, and maximums. This section shows how you can use the HAVING clause to add additional control to the aggregation of rows in a GROUP BY operation.

Suppose you want to know how many times you’ve listened to tracks on popular albums. You’ve decided to define an album as popular if you’ve listened to one or more of its tracks at least five times. In the previous section, we tried an almost identical query but without the popularity limitation. Here’s the new query, with an additional HAVING clause that adds the constraint:

mysql> SELECT artist_name, album_name, COUNT(*) FROM

-> artist INNER JOIN album USING (artist_id)

-> INNER JOIN track USING (artist_id, album_id)

-> INNER JOIN played USING (artist_id, album_id, track_id)

-> GROUP BY album.artist_id, album.album_id

-> HAVING COUNT(*) >= 5;

+-------------+----------------------------+----------+

| artist_name | album_name | COUNT(*) |

+-------------+----------------------------+----------+

| New Order | Retro - Miranda Sawyer POP | 8 |

+-------------+----------------------------+----------+

1 row in set (0.01 sec)

You can see there’s only one album that meets the new criteria.

The HAVING clause must contain an expression or column that’s listed in the SELECT clause. In this example, we’ve used HAVING COUNT(*) >= 5, and you can see that COUNT(*) is part of the SELECT clause. Typically, the expression in the HAVING clause uses an aggregate function such as COUNT(), SUM(), MIN(), or MAX(). If you find yourself wanting to write a HAVING clause that uses a column or expression that isn’t in the SELECT clause, chances are you should be using a WHERE clause instead. The HAVING clause is only for deciding how to form each group or cluster, not for choosing rows in the output. We’ll show you an example later that illustrates when not to use HAVING.

Let’s try another example. Suppose you want a list of albums that have more than 10 tracks, together with the number of tracks they contain. Here’s the query you’d use:

You can again see that the expression COUNT(*) is used in both the SELECT and HAVING clauses.

Now let’s consider an example where you shouldn’t use HAVING. You want to know how many tracks are on albums by New Order. Here’s the query you shouldn’t use:

mysql> SELECT artist_name, album_name, COUNT(*) FROM

-> artist INNER JOIN album USING (artist_id)

-> INNER JOIN track USING (artist_id, album_id)

-> GROUP BY artist.artist_id, album.album_id

-> HAVING artist_name = "New Order";

+-------------+------------------------------------------+----------+

| artist_name | album_name | COUNT(*) |

+-------------+------------------------------------------+----------+

| New Order | Retro - John McCready FAN | 15 |

| New Order | Substance (Disc 2) | 12 |

| New Order | Retro - Miranda Sawyer POP | 14 |

| New Order | Retro - New Order / Bobby Gillespie LIVE | 15 |

| New Order | Power, Corruption & Lies | 8 |

| New Order | Substance 1987 (Disc 1) | 12 |

| New Order | Brotherhood | 10 |

+-------------+------------------------------------------+----------+

7 rows in set (0.00 sec)

It gets the right answer, but in the wrong—and, for large amounts of data, much slower—way. It’s not the correct way to write the query because the HAVING clause isn’t being used to decide what rows should form each group, but is instead being incorrectly used to filter the answers to display. For this query, we should really use a WHERE clause as follows:

mysql> SELECT artist_name, album_name, COUNT(*) FROM

-> artist INNER JOIN album USING (artist_id)

-> INNER JOIN track USING (artist_id, album_id)

-> WHERE artist_name = "New Order"

-> GROUP BY artist.artist_id, album.album_id;

+-------------+------------------------------------------+----------+

| artist_name | album_name | COUNT(*) |

+-------------+------------------------------------------+----------+

| New Order | Retro - John McCready FAN | 15 |

| New Order | Substance (Disc 2) | 12 |

| New Order | Retro - Miranda Sawyer POP | 14 |

| New Order | Retro - New Order / Bobby Gillespie LIVE | 15 |

| New Order | Power, Corruption & Lies | 8 |

| New Order | Substance 1987 (Disc 1) | 12 |

| New Order | Brotherhood | 10 |

+-------------+------------------------------------------+----------+

7 rows in set (0.00 sec)

This correct query forms the groups, and then picks which groups to display based on the WHERE clause.

Advanced Joins

So far in the book, we’ve used the INNER JOIN clause to bring together rows from two or more tables. We’ll explain the inner join in more detail in this section, contrasting it with the other join types we explain: the union, left and right joins, and natural joins. At the conclusion of this section, you’ll be able to answer difficult information needs and be familiar with the correct choice of join for the task.

The Inner Join

The INNER JOIN clause matches rows between two tables based on the criteria you provide in the USING clause. For example, you’re very familiar now with an inner join of the artist and album tables:

You can see that we’ve spelled out the inner join: we’re selecting from the artist and album tables the rows where the identifiers match between the tables.

You can modify the INNER JOIN syntax to express the join criteria in a way that’s similar to using a WHERE clause. This is useful if the names of the identifiers don’t match between the tables. Here’s the previous query, rewritten in this style:

You can see that the ON clause replaces the USING clause, and that the columns that follow are fully specified to include the table and column names. There’s no real advantage or disadvantage in using ON or a WHERE clause; it’s just a matter of taste. Typically, you’ll find most SQL professionals use the WHERE clause in preference to INNER JOIN, most likely because it’s the technique they learned first.

Before we move on, let’s consider what purpose the WHERE, ON, and USING clauses serve. If you omit the WHERE clause from the query we showed you, you get a very different result. Here’s the query, and the first few lines of output:

The output is nonsensical: what’s happened is that each row from the artist table has been output alongside each row from the album table, for all possible combinations. Since there are 6 artists and 13 albums, there are 6 × 13 = 78 rows of output, and we know that only 13 of those combinations actually make sense (there are only 13 albums). This type of query, without a clause that matches rows, is known as a Cartesian product. Incidentally, you also get the Cartesian product if you perform an inner join without specifying a column with a USING or ON clause, as in the query:

SELECT artist_name, album_name FROM artist INNER JOIN album;

Later in The Natural Join,” we’ll introduce the natural join, which is an inner join on identically named columns. While the natural join doesn’t use explicitly specified columns, it still produces an inner join, rather than a Cartesian product.

The keyphrase INNER JOIN can be replaced with JOIN or STRAIGHT JOIN; they all do the same thing. However, STRAIGHT JOIN forces MySQL to always read the table on the left before it reads the table on the right. We’ll have a look at how MySQL processes queries behind the scenes in Chapter 8. The keyphrase INNER JOIN is the one you’ll see most commonly used: it’s used by many other database systems besides MySQL, and we use it in all our inner-join examples.

The Union

The UNION statement isn’t really a join operator. Rather, it allows you to combine the output of more than one SELECT statement to give a consolidated result set. It’s useful in cases where you want to produce a single list from more than one source, or you want to create lists from a single source that are difficult to express in a single query.

Let’s look at an example. If you wanted to output all of the text in the music database, you could do this with a UNION statement. It’s a contrived example, but you might want to do this just to list all of the text fragments, rather than to meaningfully present the relationships between the data. There’s text in the artist_name, album_name, and track_name columns in the artist, album, and track tables, respectively. Here’s how to display it:

mysql> SELECT artist_name FROM artist

-> UNION

-> SELECT album_name FROM album

-> UNION

-> SELECT track_name FROM track;

+------------------------------------------+

| artist_name |

+------------------------------------------+

| New Order |

| Nick Cave & The Bad Seeds |

| Miles Davis |

| The Rolling Stones |

| The Stone Roses |

| Kylie Minogue |

| Let Love In |

| Retro - John McCready FAN |

| Substance (Disc 2) |

| Retro - Miranda Sawyer POP |

| Retro - New Order / Bobby Gillespie LIVE |

| Live Around The World |

| In A Silent Way |

| Power, Corruption & Lies |

| Exile On Main Street |

| Substance 1987 (Disc 1) |

| Second Coming |

| Light Years |

| Brotherhood |

| Do You Love Me? |

...

We’ve only shown the first 20 of 153 rows. The UNION statement outputs all results from all queries together, under a heading appropriate to the first query.

A slightly less contrived example is to create a list of the first five and last five tracks you’ve played. You can do this easily with the UNION operator:

mysql> (SELECT track_name FROM

-> track INNER JOIN played USING (artist_id, album_id, track_id)

-> ORDER BY played ASC LIMIT 5)

-> UNION

-> (SELECT track_name FROM

-> track INNER JOIN played USING (artist_id, album_id, track_id)

-> ORDER BY played DESC LIMIT 5);

+-----------------------+

| track_name |

+-----------------------+

| Fine Time |

| Temptation |

| True Faith |

| The Perfect Kiss |

| Ceremony |

| New Blues |

| Intruder |

| In A Silent Way |

| Bizarre Love Triangle |

| Crystal |

+-----------------------+

10 rows in set (0.09 sec)

The first query uses ORDER BY with the ASC (ascending) modifier and a LIMIT 5 clause to find the first five tracks played. The second query uses ORDER BY with the DESC (descending) modifier and a LIMIT 5 clause to find the last five tracks played. The UNION combines the result sets.

The UNION operator has several limitations:

§ The output is labeled with the names of the columns or expressions from the first query. Use column aliases to change this behavior.

§ The queries should output the same number of columns. If you try using different numbers of columns, MySQL will report an error.

§ All matching columns should have the same type. So, for example, if the first column output from the first query is a date, the first column output from any other query must be a date.

§ The results returned are unique, as if you’d applied a DISTINCT to the overall result set. To see this in action, let’s add a new row for the track “Fine Time” to the played table. This has artist_id 1, album_id 3, and track_id 0:

§ mysql> INSERT INTO played SET

§ -> artist_id = 1,

§ -> album_id = 3,

§ -> track_id = 0,

§ -> played='2006-08-14 10:27:03';

Query OK, 1 row affected (0.02 sec)

We’ve used the more verbose INSERT format to clarify what we’re inserting.

Now, if you run the previous SELECT query again, you’ll see 9 rows instead of 10, since “Fine Time” appears twice in the first 5 tracks placed, but the implicit DISTINCT operation means it’s shown only once:

mysql> (SELECT track_name FROM

-> track INNER JOIN played USING (artist_id, album_id, track_id)

-> ORDER BY played ASC LIMIT 5)

-> UNION

-> (SELECT track_name FROM

-> track INNER JOIN played USING (artist_id, album_id, track_id)

-> ORDER BY played DESC LIMIT 5);

+-----------------------+

| track_name |

+-----------------------+

| Fine Time |

| Temptation |

| True Faith |

| The Perfect Kiss |

| New Blues |

| Intruder |

| In A Silent Way |

| Bizarre Love Triangle |

| Crystal |

+-----------------------+

9 rows in set (0.01 sec)

If you want to show any duplicates, replace UNION with UNION ALL:

mysql> (SELECT track_name FROM

-> track INNER JOIN played USING (artist_id, album_id, track_id)

-> ORDER BY played ASC LIMIT 5)

-> UNION ALL

-> (SELECT track_name FROM

-> track INNER JOIN played USING (artist_id, album_id, track_id)

-> ORDER BY played DESC LIMIT 5);

+-----------------------+

| track_name |

+-----------------------+

| Fine Time |

| Temptation |

| Fine Time |

| True Faith |

| The Perfect Kiss |

| New Blues |

| Intruder |

| In A Silent Way |

| Bizarre Love Triangle |

| Crystal |

+-----------------------+

10 rows in set (0.00 sec)

Here, “Fine Time” appears twice.

§ If you want to apply LIMIT or ORDER BY to an individual query that is part of a UNION statement, enclose that query in parentheses (as shown in the previous example). It’s useful to use parentheses anyway to keep the query easy to understand.

The UNION operation simply concatenates the results of the component queries with no attention to order, so there’s not much point in using ORDER BY within one of the subqueries. The only time that it makes sense to order a subquery in a UNION operation is when you want to select a subset of results. In our example, we’ve ordered the tracks by the time they were played, and then selected only the first five (in the first subquery) and the last five (in the second subquery).

For efficiency, MySQL will actually ignore an ORDER BY clause within a subquery if it’s used without LIMIT. Let’s look at some examples to see exactly how this works.

First, let’s run a simple query to list the tracks that have been played, along with the time each track was played. We’ve enclosed the query in parentheses for consistency with our other examples—the parentheses don’t actually have any effect here—and haven’t used an ORDER BY orLIMIT clause:

mysql> (SELECT track_name, played

-> FROM track INNER JOIN played USING (artist_id, album_id, track_id)

-> );

+-----------------------+---------------------+

| track_name | played |

+-----------------------+---------------------+

| Fine Time | 2006-08-14 10:21:03 |

| Fine Time | 2006-08-14 10:27:03 |

| Temptation | 2006-08-14 10:25:22 |

| True Faith | 2006-08-14 10:30:25 |

| The Perfect Kiss | 2006-08-14 10:36:54 |

| Ceremony | 2006-08-14 10:41:43 |

| Regret | 2006-08-14 10:43:37 |

| Crystal | 2006-08-14 10:47:21 |

| Bizarre Love Triangle | 2006-08-14 10:54:02 |

| In A Silent Way | 2006-08-15 14:00:03 |

| Intruder | 2006-08-15 14:26:12 |

| New Blues | 2006-08-15 14:33:57 |

+-----------------------+---------------------+

12 rows in set (0.00 sec)

The query returns all the played tracks, in no particular order (see the second and third entries).

Now, let’s add an ORDER BY clause to this query:

mysql> (SELECT track_name, played

-> FROM track INNER JOIN played USING (artist_id, album_id, track_id)

-> ORDER BY played ASC);

+-----------------------+---------------------+

| track_name | played |

+-----------------------+---------------------+

| Fine Time | 2006-08-14 10:21:03 |

| Temptation | 2006-08-14 10:25:22 |

| Fine Time | 2006-08-14 10:27:03 |

| True Faith | 2006-08-14 10:30:25 |

| The Perfect Kiss | 2006-08-14 10:36:54 |

| Ceremony | 2006-08-14 10:41:43 |

| Regret | 2006-08-14 10:43:37 |

| Crystal | 2006-08-14 10:47:21 |

| Bizarre Love Triangle | 2006-08-14 10:54:02 |

| In A Silent Way | 2006-08-15 14:00:03 |

| Intruder | 2006-08-15 14:26:12 |

| New Blues | 2006-08-15 14:33:57 |

+-----------------------+---------------------+

12 rows in set (0.03 sec)

As expected, we get all the played tracks, in the order that they’ve been played.

Adding a LIMIT clause to the previous query selects the first five tracks played, in chronological order—no surprises here:

mysql> (SELECT track_name, played

-> FROM track INNER JOIN played USING (artist_id, album_id, track_id)

-> ORDER BY played ASC LIMIT 5);

+------------------+---------------------+

| track_name | played |

+------------------+---------------------+

| Fine Time | 2006-08-14 10:21:03 |

| Temptation | 2006-08-14 10:25:22 |

| Fine Time | 2006-08-14 10:27:03 |

| True Faith | 2006-08-14 10:30:25 |

| The Perfect Kiss | 2006-08-14 10:36:54 |

+------------------+---------------------+

5 rows in set (0.00 sec)

Now, let’s see what happens when we perform a UNION operation. In this example, we’re using two subqueries, each with an ORDER BY clause. We’ve used a LIMIT clause for the second subquery, but not for the first:

mysql> (SELECT track_name, played

-> FROM track INNER JOIN played USING (artist_id, album_id, track_id)

-> ORDER BY played ASC)

-> UNION ALL

-> (SELECT track_name,played

-> FROM track INNER JOIN played USING (artist_id, album_id, track_id)

-> ORDER BY played DESC LIMIT 5);

+-----------------------+---------------------+

| track_name | played |

+-----------------------+---------------------+

| Fine Time | 2006-08-14 10:21:03 |

| Fine Time | 2006-08-14 10:27:03 |

| Temptation | 2006-08-14 10:25:22 |

| True Faith | 2006-08-14 10:30:25 |

| The Perfect Kiss | 2006-08-14 10:36:54 |

| Ceremony | 2006-08-14 10:41:43 |

| Regret | 2006-08-14 10:43:37 |

| Crystal | 2006-08-14 10:47:21 |

| Bizarre Love Triangle | 2006-08-14 10:54:02 |

| In A Silent Way | 2006-08-15 14:00:03 |

| Intruder | 2006-08-15 14:26:12 |

| New Blues | 2006-08-15 14:33:57 |

| New Blues | 2006-08-15 14:33:57 |

| Intruder | 2006-08-15 14:26:12 |

| In A Silent Way | 2006-08-15 14:00:03 |

| Bizarre Love Triangle | 2006-08-14 10:54:02 |

| Crystal | 2006-08-14 10:47:21 |

+-----------------------+---------------------+

17 rows in set (0.00 sec)

As expected, the first subquery returns all the played tracks (the first 12 rows of this output), and the second subquery returns the last 5 tracks (the last 5 rows of this output). Notice how the first 12 rows are not in order (see the second and third rows), even though the first subquery does have a ORDER BY clause. Since we’re performing a UNION operation, the MySQL server has decided that there’s no point sorting the result of the subquery. The second subquery includes a LIMIT operation, so the results of that subquery are sorted.

The output of a UNION operation isn’t guaranteed to be ordered, even if the subqueries are ordered, so if you want the final output to be ordered, you should add an ORDER BY clause at the end of the whole query:

mysql> (SELECT track_name, played

-> FROM track INNER JOIN played USING (artist_id, album_id, track_id)

-> ORDER BY played ASC)

-> UNION ALL

-> (SELECT track_name, played

-> FROM track INNER JOIN played USING (artist_id, album_id, track_id)

-> ORDER BY played DESC LIMIT 5)

-> ORDER BY played;

+-----------------------+---------------------+

| track_name | played |

+-----------------------+---------------------+

| Fine Time | 2006-08-14 10:21:03 |

| Temptation | 2006-08-14 10:25:22 |

| Fine Time | 2006-08-14 10:27:03 |

| True Faith | 2006-08-14 10:30:25 |

| The Perfect Kiss | 2006-08-14 10:36:54 |

| Ceremony | 2006-08-14 10:41:43 |

| Regret | 2006-08-14 10:43:37 |

| Crystal | 2006-08-14 10:47:21 |

| Crystal | 2006-08-14 10:47:21 |

| Bizarre Love Triangle | 2006-08-14 10:54:02 |

| Bizarre Love Triangle | 2006-08-14 10:54:02 |

| In A Silent Way | 2006-08-15 14:00:03 |

| In A Silent Way | 2006-08-15 14:00:03 |

| Intruder | 2006-08-15 14:26:12 |

| Intruder | 2006-08-15 14:26:12 |

| New Blues | 2006-08-15 14:33:57 |

| New Blues | 2006-08-15 14:33:57 |

+-----------------------+---------------------+

17 rows in set (0.00 sec)

Here’s another example of sorting the final results, including a limit on the number of returned results:

mysql> (SELECT artist_name FROM artist WHERE artist_id < 5)

-> UNION

-> (SELECT artist_name FROM artist WHERE artist_id > 7)

-> ORDER BY artist_name LIMIT 4;

+---------------------------+

| artist_name |

+---------------------------+

| Miles Davis |

| New Order |

| Nick Cave & The Bad Seeds |

| The Rolling Stones |

+---------------------------+

4 rows in set (0.01 sec)

The UNION operation is somewhat unwieldy, and there are generally alternative ways of getting the same result. For example, the previous query could have been written more simply as:

mysql> SELECT artist_name FROM artist WHERE

-> artist_id < 3 OR artist_id > 5

-> ORDER BY artist_name LIMIT 4;

+---------------------------+

| artist_name |

+---------------------------+

| Kylie Minogue |

| New Order |

| Nick Cave & The Bad Seeds |

+---------------------------+

3 rows in set (0.00 sec)

The Left and Right Joins

The joins we’ve discussed so far output only rows that match between tables. For example, when you join the track and played tables, you see only the tracks that have been played. Therefore, rows for tracks that haven’t been played are ignored and—if they existed—would play data for tracks that don’t exist. This makes sense in many cases, but it isn’t the only way to join data. This section explains other options you have.

Suppose you did want a comprehensive list of all albums and the number of times you’ve played tracks from them. Unlike the example earlier in this chapter, included in the list you want to see a zero next to albums that haven’t been played. You can do this with a left join, a different type of join that’s driven by one of the two tables participating in the join. A left join works like this: each row in the left table—the one that’s doing the driving—is processed and output, with the matching data from the second table if it exists and NULL values if there is no matching data in the second table. We’ll show you how to write this type of query later in this section, but we’ll start with a simpler example.

Here’s a simple LEFT JOIN example. You want to list all tracks, and next to each track you want to show when it was played. If a track has been never been played, you want to see that. If it’s been played many times, you want to see that too. Here’s the query:

mysql> SELECT track_name, played FROM

-> track LEFT JOIN played USING (artist_id, album_id, track_id)

-> ORDER BY played DESC;

+--------------------------+---------------------+

| track_name | played |

+--------------------------+---------------------+

| New Blues | 2006-08-15 14:33:57 |

| Intruder | 2006-08-15 14:26:12 |

| In A Silent Way | 2006-08-15 14:00:03 |

| Bizarre Love Triangle | 2006-08-14 10:54:02 |

| Crystal | 2006-08-14 10:47:21 |

| Regret | 2006-08-14 10:43:37 |

| Ceremony | 2006-08-14 10:41:43 |

| The Perfect Kiss | 2006-08-14 10:36:54 |

| True Faith | 2006-08-14 10:30:25 |

| Temptation | 2006-08-14 10:25:22 |

| Fine Time | 2006-08-14 10:21:03 |

| Do You Love Me? | NULL |

| Nobody's Baby Now | NULL |

| Loverman | NULL |

| Jangling Jack | NULL |

| Red Right Hand | NULL |

| I Let Love In | NULL |

...

You can see what happens: tracks that have been played have dates and times, and those that haven’t don’t (the played value is NULL). We’ve added an ORDER BY played DESC to display the output from most to least recently played, where “never played” (NULL) is the smallest possible value.

The order of the tables in the LEFT JOIN is important. If you reverse the order in the previous query, you get very different output:

mysql> SELECT track_name, played FROM

-> played LEFT JOIN track USING (artist_id, album_id, track_id)

-> ORDER BY played DESC;

+-----------------------+---------------------+

| track_name | played |

+-----------------------+---------------------+

| New Blues | 2006-08-15 14:33:57 |

| Intruder | 2006-08-15 14:26:12 |

| In A Silent Way | 2006-08-15 14:00:03 |

| Bizarre Love Triangle | 2006-08-14 10:54:02 |

| Crystal | 2006-08-14 10:47:21 |

| Regret | 2006-08-14 10:43:37 |

| Ceremony | 2006-08-14 10:41:43 |

| The Perfect Kiss | 2006-08-14 10:36:54 |

| True Faith | 2006-08-14 10:30:25 |

| Temptation | 2006-08-14 10:25:22 |

| Fine Time | 2006-08-14 10:21:03 |

+-----------------------+---------------------+

11 rows in set (0.18 sec)

In this, the query is driven by the played table, and so all rows from it are output matched against track values if they exist and NULL otherwise. Since all rows in the played table have matching tracks, no NULL values are shown. Importantly, because the played table drives the process, you don’t see all the rows from the track table (because not all tracks have been played).

In the introduction to this section, we motivated left joins with the example of listing all albums and the number of times they’ve been played, regardless of whether that value is zero. You’ll recall from The GROUP BY Clause” the following query that shows you that information, but only for albums you’ve played:

mysql> SELECT artist_name, album_name, COUNT(*) FROM

-> artist INNER JOIN album USING (artist_id)

-> INNER JOIN track USING (artist_id, album_id)

-> INNER JOIN played USING (artist_id, album_id, track_id)

-> GROUP BY album.artist_id, album.album_id;

+-------------+----------------------------+----------+

| artist_name | album_name | COUNT(*) |

+-------------+----------------------------+----------+

| New Order | Retro - Miranda Sawyer POP | 8 |

| Miles Davis | Live Around The World | 3 |

+-------------+----------------------------+----------+

2 rows in set (0.11 sec)

Here’s how you modify that query to use a left join to list all albums, even those that have never been played:

The only difference is that the final INNER JOIN is replaced by a LEFT JOIN, which means that the data from the first two inner joins—of artist and album—drives the process. The result is that all albums and their artists are displayed, along with the count of the number of matching rows in the played table. You can see we haven’t listened to the majority of the albums.

We’ve shown you that it matters what comes before and after the LEFT JOIN statement. Whatever is on the left drives the process, hence the name “left join.” If you really don’t want to reorganize your query so it matches that template, you can use rollRIGHT JOIN. It’s exactly the same, except whatever is on the right drives the process. Here’s our earlier played and track example written as a right join:

mysql> SELECT track_name, played FROM

-> played RIGHT JOIN track USING (artist_id, album_id, track_id)

-> ORDER BY played DESC;

+-----------------------+---------------------+

| track_name | played |

+-----------------------+---------------------+

| New Blues | 2006-08-15 14:33:57 |

| Intruder | 2006-08-15 14:26:12 |

| In A Silent Way | 2006-08-15 14:00:03 |

| Bizarre Love Triangle | 2006-08-14 10:54:02 |

| Crystal | 2006-08-14 10:47:21 |

| Regret | 2006-08-14 10:43:37 |

| Ceremony | 2006-08-14 10:41:43 |

| The Perfect Kiss | 2006-08-14 10:36:54 |

| True Faith | 2006-08-14 10:30:25 |

| Temptation | 2006-08-14 10:25:22 |

| Fine Time | 2006-08-14 10:21:03 |

| Do You Love Me? | NULL |

| Nobody's Baby Now | NULL |

| Loverman | NULL |

| Jangling Jack | NULL |

The right join is useful sometimes because it allows you to write a query more naturally, expressing it in a way that’s more intuitive. However, you won’t often see it used, and we’d recommend avoiding it where possible.

Both the LEFT JOIN and RIGHT JOIN can use either the USING or ON clauses discussed for the INNER JOIN earlier in this chapter in The Inner Join.” You should use one or the other: without them, you’ll get the Cartesian product discussed in The Inner Join.”

There’s an extra OUTER keyword that you can optionally use in left and right joins, to make them read as LEFT OUTER JOIN and RIGHT OUTER JOIN. It’s just an alternative syntax that doesn’t do anything different, and you won’t often see it used. We stick to the basic versions in this book.

The Natural Join

We’re not big fans of the natural join that we’re about to describe in this section. It’s in here only for completeness and because you’ll see it used sometimes in SQL statements you’ll encounter. Our advice is to avoid using it where possible.

A natural join is, well, supposed to be magically natural. This means that you tell MySQL what tables you want to join, and it figures out how to do it and gives you an INNER JOIN result set. Here’s an example for the artist and album tables:

In reality, it’s not quite magical: all MySQL does is look for columns with the same names and, behind the scenes, adds these silently into an inner join with a USING clause. So, the above query is actually translated into:

mysql> SELECT artist_name, album_name FROM

-> artist INNER JOIN album USING (artist_id);

If identifier columns don’t share the same name, natural joins won’t work. Also, more dangerously, if columns that do share the same names aren’t identifiers, they’ll get thrown into the behind-the-scenes USING clause anyway. For example, if you had name columns in the artist andalbum tables (instead of artist_name and album_name), you’d get USING (artist_id, name) and some unpredictable results. The magic and mystery makes natural joins worth avoiding; spell out queries using an inner join or a WHERE clause instead.

You’ll sometimes see the natural join mixed with left and right joins. The following are valid join syntaxes: NATURAL LEFT JOIN, NATURAL LEFT OUTER JOIN, NATURAL RIGHT JOIN, and NATURAL RIGHT OUTER JOIN. The former two are left joins without ON or USING clauses, and the latter two are right joins. Again, avoid writing them when you can, but you should understand what they mean if you see them used.

Nested Queries

Nested queries—supported by MySQL since version 4.1—are the most difficult to learn. However, they provide a powerful, useful, and concise way of expressing difficult information needs in short SQL statements. This section explains them, beginning with simple examples and leading to the more complex features of the EXISTS and IN statements. At the conclusion of this section, you’ll have completed everything this book contains about querying data, and you should be comfortable understanding almost any SQL query you encounter.

Nested Query Basics

You know how to find the name of an artist who made a particular album using an INNER JOIN:

mysql> SELECT artist_name FROM

-> artist INNER JOIN album USING (artist_id)

-> WHERE album_name = "In A Silent Way";

+-------------+

| artist_name |

+-------------+

| Miles Davis |

+-------------+

1 row in set (0.14 sec)

But there’s another way, using a nested query:

mysql> SELECT artist_name FROM artist WHERE artist_id =

-> (SELECT artist_id FROM album WHERE album_name = "In A Silent Way");

+-------------+

| artist_name |

+-------------+

| Miles Davis |

+-------------+

1 row in set (0.28 sec)

It’s called a nested query because one query is inside another. The inner query, or subquery—the one that is nested—is written in parentheses, and you can see that it determines the artist_id for the album with the name In A Silent Way. The parentheses are required for inner queries. The outer query is the one that’s listed first and isn’t parenthesized here: you can see that it finds the artist_name of the the artist with an artist_id that matches the result of the subquery. So, overall, the inner query finds the artist_id, and the outer query uses it to find the artist’s name.

So, which approach is preferable: nested or not nested? The answer isn’t easy. In terms of performance, the answer is usually not: nested queries are hard to optimize, and so they’re almost always slower to run than the unnested alternative. Does this mean you should avoid nesting? The answer is no: sometimes it’s your only choice if you want to write a single query, and sometimes nested queries can answer information needs that can’t be easily solved otherwise. What’s more, nested queries are expressive. Once you’re comfortable with the idea, they’re a very readable way to show how a query is evaluated. In fact, many SQL designers advocate teaching nested queries before the join-based alternatives we’ve shown you in the past few chapters. We’ll show you examples of where nesting is readable and powerful throughout this section.

Before we begin to cover the keywords that can be used in nested queries, let’s visit an example that can’t be done easily in a single query—at least, not without MySQL’s proprietary LIMIT clause! Suppose you want to know which track you listened to most recently. To do this, following the methods we’ve learned previously, you could find the date and time of the most recently stored row in the played table:

mysql> SELECT MAX(played) FROM played;

+---------------------+

| MAX(played) |

+---------------------+

| 2006-08-15 14:33:57 |

+---------------------+

1 row in set (0.00 sec)

You can then use the output as input to another query to find the track name:

mysql> SELECT track_name FROM track INNER JOIN played

-> USING (artist_id, album_id, track_id)

-> WHERE played = "2006-08-15 14:33:57";

+------------+

| track_name |

+------------+

| New Blues |

+------------+

1 row in set (0.31 sec)

In User Variables,” later in this chapter, we’ll show how you can use variables to avoid having to type in the value in the second query.

With a nested query, you can do both steps in one shot:

mysql> SELECT track_name FROM track INNER JOIN played

-> USING (artist_id, album_id, track_id)

-> WHERE played = (SELECT MAX(played) FROM played);

+------------+

| track_name |

+------------+

| New Blues |

+------------+

1 row in set (0.28 sec)

You can see the nested query combines the two previous queries. Rather than using the constant date and time value discovered from a previous query, it executes the query directly as a subquery. This is the simplest type of nested query, one that returns a scalar operand—that is, a single value.

The previous example used the equality operator, the equals sign, =. You can use all types of comparison operators: < (less than), <= (less than or equal to), > (greater than), >= (greater than or equal to), and != (not equals) or <> (not equals).

The ANY, SOME, ALL, IN, and NOT IN Clauses

Before we start to show some more advanced features of nested queries, we need to create two new tables to use in our examples. Unfortunately, our music database is a little too simple to effectively demonstrate the full power of nested querying. So, let’s extend the database to give us something to play with.

We’ll create two new tables that share common data, but store different types of facts. The first table we’ll create contains information about producers—that is, the people who oversee the music recording process. Here’s the structure and some data:

mysql> CREATE TABLE producer (

-> producer_id SMALLINT(4) NOT NULL DEFAULT 0,

-> producer_name CHAR(128) DEFAULT NULL,

-> years SMALLINT(3) DEFAULT 0,

-> PRIMARY KEY (producer_id));

Query OK, 0 rows affected (1.03 sec)

mysql> INSERT INTO producer VALUES

-> (1, "Phil Spector", 36),

-> (2, "George Martin", 40),

-> (3, "Tina Weymouth", 20),

-> (4, "Chris Frantz", 20),

-> (5, "Ed Kuepper", 15);

Query OK, 5 rows affected (0.50 sec)

Records: 5 Duplicates: 0 Warnings: 0

You can download these instructions from the book’s web site in the file producer.sql, and run them in the same way you ran the music.sql file.

You can see it’s a fairly simple table: an identifier column, a textual name, and an integer value of the number of years they’ve been producing. The second table is almost identical, but stores information about engineers—that is, the people who work the mixing desks and other equipment that’s used in the music recording process. Here’s the table and its data:

mysql> CREATE TABLE engineer (

-> engineer_id SMALLINT(4) NOT NULL DEFAULT 0,

-> engineer_name CHAR(128) DEFAULT NULL,

-> years SMALLINT(3) DEFAULT 0,

-> PRIMARY KEY (engineer_id));

Query OK, 0 rows affected (0.04 sec)

mysql> INSERT INTO engineer VALUES

-> (1, "George Martin", 40),

-> (2, "Eddie Kramer", 38),

-> (3, "Jeff Jarratt", 40),

-> (4, "Ed Stasium", 25);

Query OK, 4 rows affected (0.14 sec)

Records: 4 Duplicates: 0 Warnings: 0

You can download these instructions from the book’s web site in the file engineer.sql.

Using ANY and IN

Now that you’ve created the sample tables, you can try an example using ANY. Suppose you’re looking to find engineers who’ve been working longer than the least experienced producer. You can express this information need as follows:

mysql> SELECT engineer_name, years

-> FROM engineer WHERE years > ANY

-> (SELECT years FROM producer);

+---------------+-------+

| engineer_name | years |

+---------------+-------+

| George Martin | 40 |

| Eddie Kramer | 38 |

| Jeff Jarratt | 40 |

| Ed Stasium | 25 |

+---------------+-------+

4 rows in set (0.08 sec)

The subquery finds the years that the producers have worked:

mysql> SELECT years FROM producer;

+-------+

| years |

+-------+

| 36 |

| 40 |

| 20 |

| 20 |

| 15 |

+-------+

5 rows in set (0.00 sec)

The outer query goes through each engineer, returning the engineer if their number of years is greater than any of the values in the set returned by the subquery. So, for example, Eddie Kramer is output because 38 is greater than at least one value in the set (36, 40, 20, 15). The ANYkeyword means just that: it’s true if the column or expression preceding it is true for any of the values in the set returned by the subquery. Used in this way, ANY has the alias SOME, which was included so that some queries can be read more clearly as English expressions; it doesn’t do anything different and you’ll rarely see it used.

The ANY keyword gives you more power in expressing nested queries. Indeed, the previous query is the first nested query in this section with a column subquery—that is, the results returned by the subquery are one or more values from a column, instead of a single scalar value as in the previous section. With this, you can now compare a column value from an outer query to a set of values returned from a subquery.

Consider another example using ANY. Suppose you want to know the producers who are also engineers. You can do this with the following nested query:

mysql> SELECT producer_name FROM producer WHERE

-> producer_name = ANY

-> (SELECT engineer_name FROM engineer);

+---------------+

| producer_name |

+---------------+

| George Martin |

+---------------+

1 row in set (0.04 sec)

The = ANY causes the outer query to return a producer when the producer_name is equal to any of the engineer names returned by the subquery. The = ANY keyphrase has the alias IN, which you’ll see commonly used in nested queries. Using IN, the previous example can be rewritten:

mysql> SELECT producer_name FROM producer WHERE producer_name

-> IN (SELECT engineer_name FROM engineer);

+---------------+

| producer_name |

+---------------+

| George Martin |

+---------------+

1 row in set (0.06 sec)

Of course, for this particular example, you could also have used a join query:

mysql> SELECT producer_name FROM producer INNER JOIN engineer

-> ON (producer_name = engineer_name);

+---------------+

| producer_name |

+---------------+

| George Martin |

+---------------+

1 row in set (0.17 sec)

Again, nested queries are expressive but typically slow in MySQL, so use a join where you can.

Using ALL

Suppose you want to find engineers who are more experienced than all of the producers—that is, more experienced than the most experienced producer. You can do this with the ALL keyword in place of ANY:

mysql> SELECT engineer_name, years FROM engineer

-> WHERE years > ALL (SELECT years FROM producer);

Empty set (0.00 sec)

You can see that there are no answers: looking at the data, we see that George Martin has been a producer for 40 years, equal to or longer than the time any engineer has been engineering. While the ANY keyword returns values that satisfy at least one condition (Boolean OR), the ALL keyword returns values when all the conditions are satisfied (Boolean AND).

We can use the alias NOT IN in place of <> ANY or != ANY. Let's find all the engineers who aren't producers:

mysql> SELECT engineer_name FROM engineer WHERE

-> engineer_name NOT IN

-> (SELECT producer_name FROM producer);

+---------------+

| engineer_name |

+---------------+

| Eddie Kramer |

| Jeff Jarratt |

| Ed Stasium |

+---------------+

3 rows in set (0.25 sec)

As an exercise, try writing the above query using the ANY syntax and in at least two ways as a join query.

The ALL keyword has a few tricks and traps:

§ If it’s false for any value, it’s false. Suppose that table a contains a row with the value 14. Suppose table b contains the values 16, 1, and NULL. If you check whether the value in a is greater than ALL values in b, you’ll get false, since 14 isn’t greater than 16. It doesn’t matter that the other values are 1 and NULL.

§ If it isn’t false for any value, it isn’t true unless it’s true for all values. Suppose that table a again contains 14, and suppose b contains 1 and NULL. If you check whether the value in a is greater than ALL values in b, you’ll get UNKNOWN (neither true or false) because it can’t be determined whether NULL is greater than or less than 14.

§ If the table in the subquery is empty, the result is always true. Hence, if a contains 14 and b is empty, you’ll get true when you check if the value in a is greater than ALL values in b.

When using the ALL keyword, be very careful with tables that can have NULL values in columns; consider disallowing NULL values in such cases. Also, be careful with empty tables.

Writing row subqueries

In the previous examples, the subquery returned a single, scalar value (such as an artist_id) or a set of values from one column (such as all of the engineer_name values). This section describes another type of subquery, the row subquery that works with multiple columns from multiple rows.

Suppose you’re interested in whether an engineer has been a producer for the same length of time. To answer this need, you must match both names and years. You can easily write this as a join query:

mysql> SELECT producer_name, producer.years FROM

-> producer, engineer WHERE producer_name = engineer_name AND

-> producer.years = engineer.years;

+---------------+-------+

| producer_name | years |

+---------------+-------+

| George Martin | 40 |

+---------------+-------+

1 row in set (0.30 sec)

But you can also write it as a nested query:

mysql> SELECT producer_name, years FROM producer WHERE

-> (producer_name, years) IN

-> (SELECT engineer_name, years FROM engineer);

+---------------+-------+

| producer_name | years |

+---------------+-------+

| George Martin | 40 |

+---------------+-------+

1 row in set (0.17 sec)

You can see there’s a different syntax being used in this nested query: a list of two column names in parentheses follows the WHERE statement, and the inner query returns two columns. We’ll explain this syntax next.

The row subquery syntax allows you to compare multiple values per row. The expression (producer_name, years) means two values per row are compared to the output of the subquery. You can see following the IN keyword that the subquery returns two values, engineer_name andyears. So, the fragment:

(producer_name, years) IN (SELECT engineer_name, years FROM engineer)

matches producer names and years to engineer names and years, and returns a true value when a match is found. The result is that if a matching pair is found, the overall query outputs a result. This is a typical row subquery: it finds rows that exist in two tables.

To explain the syntax further, let’s consider another example. Suppose you want to see if you own the Brotherhood album by New Order. You can do this with the following query:

mysql> SELECT artist_name, album_name FROM artist, album WHERE

-> (artist.artist_id, artist_name, album_name) =

-> (album.artist_id, "New Order", "Brotherhood");

+-------------+-------------+

| artist_name | album_name |

+-------------+-------------+

| New Order | Brotherhood |

+-------------+-------------+

1 row in set (0.16 sec)

It’s not a nested query, but it shows you how the new row subquery syntax works. You can see that the query matches the list of columns before the equals sign, (artist.artist_id, artist_name, album_name), to the list of columns and values after the equals sign,(album.artist_id, "New Order", "Brotherhood"). So, when the artist_id values match, the artist is New Order, and the album is Brotherhood, we get output from the query. We don’t recommend writing queries like this—use a WHERE clause instead—but it does illustrate exactly what’s going on. For an exercise, try writing this query using a join.

Row subqueries require that the number, order, and type of values in the columns match. So, for example, our previous example matches a SMALLINT to a SMALLINT, and two character strings to two character strings.

The EXISTS and NOT EXISTS Clauses

You’ve now seen three types of subquery: scalar subqueries, column subqueries, and row subqueries. In this section, you’ll learn about a fourth type, the correlated subquery, where a table used in the outer query is referenced in the subquery. Correlated subqueries are often used with the INstatement we’ve already discussed, and almost always used with the EXISTS and NOT EXISTS clauses that are the focus of this section.

EXISTS and NOT EXISTS basics

Before we start on our discussion of correlated subqueries, let’s investigate what the EXISTS clause does. We’ll need a simple but strange example to introduce the clause, since we’re not discussing correlated subqueries just yet. So, here goes: suppose you want to find a list of all artists in the database, but only if the database is active (which you’ve defined to mean only if at least one track from any album by any artist has been played). Here’s the query that does it:

mysql> SELECT * FROM artist WHERE EXISTS

-> (SELECT * FROM played);

+-----------+---------------------------+

| artist_id | artist_name |

+-----------+---------------------------+

| 1 | New Order |

| 2 | Nick Cave & The Bad Seeds |

| 3 | Miles Davis |

| 4 | The Rolling Stones |

| 5 | The Stone Roses |

| 6 | Kylie Minogue |

+-----------+---------------------------+

6 rows in set (0.18 sec)

The subquery returns all rows from the played table. However, what’s important is that it returns at least one row; it doesn’t matter what’s in the row, how many rows there are, or whether the row contains only NULL values. So, you can think of the subquery as being true or false, and in this case it’s true because it produces some output. When the subquery is true, the outer query that uses the EXISTS clause returns a row. The overall result is that all rows in the artist table are displayed because, for each one, the subquery is true.

Let’s try a query where the subquery isn’t true. Again, let’s contrive a query: this time, we’ll output the names of all albums in the database, but only if we own at least one album by John Coltrane. Here’s the query:

mysql> SELECT album_name FROM album WHERE EXISTS

-> (SELECT * FROM artist WHERE artist_name = "John Coltrane");

Empty set (0.10 sec)

Since the subquery isn’t true—no rows are returned because John Coltrane isn’t in our database—no results are returned by the outer query.

The NOT EXISTS clause does the opposite. Imagine you want a list of all producers if you don’t have an artist called New Order in the database. Here it is:

mysql> SELECT * FROM producer WHERE NOT EXISTS

-> (SELECT * FROM artist WHERE artist_name = "New Order");

Empty set (0.16 sec)

This time, the inner query is true but the NOT EXISTS clause negates it to give false. Since it’s false, the outer query doesn’t produce results.

You’ll notice that the subquery begins with SELECT * FROM artist. It doesn’t actually matter what you select in an inner query when you’re using the EXISTS clause, since it’s not used by the outer query anyway. You can select one column, everything, or even a constant (as in SELECT "cat" from artist), and it’ll have the same effect. Traditionally, though, you’ll see most SQL authors write SELECT * by convention.

Correlated subqueries

So far, it’s difficult to imagine what you’d do with the EXISTS or NOT EXISTS clauses. This section shows you how they’re really used, illustrating the most advanced type of nested query that you’ll typically see in action.

Let’s think about a realistic information need you might want to answer from the music database. Suppose you want a list of all artists who’ve produced a self-titled album. You can do this easily with a join query, which we recommend you try to think about before you continue. You can also do it with the following nested query that uses a correlated subquery:

mysql> SELECT artist_name FROM artist WHERE EXISTS

-> (SELECT * FROM album WHERE album_name = artist_name);

Empty set (0.28 sec)

There’s no output because there are no self-titled albums. Let’s add an artist with a self-titled album and try again:

mysql> INSERT INTO artist VALUES (7, "The Beatles");

Query OK, 1 row affected (0.13 sec)

mysql> INSERT INTO album VALUES (7, 1, "The Beatles");

Query OK, 1 row affected (0.14 sec)

Now the query:

mysql> SELECT artist_name FROM artist WHERE EXISTS

-> (SELECT * FROM album WHERE album_name = artist_name);

+-------------+

| artist_name |

+-------------+

| The Beatles |

+-------------+

1 row in set (0.17 sec)

So, the query works; now, we just need to understand how!

Let’s examine the subquery in our previous example. You can see that it lists only the album table in the FROM clause, but it uses a column from the artist table in the WHERE clause. If you run it in isolation, you’ll see this isn’t allowed:

mysql> SELECT * FROM album WHERE album_name = artist_name;

ERROR 1054 (42S22): Unknown column 'artist_name' in 'where clause'

However, it’s legal when executed as a subquery because tables listed in the outer query are allowed to be accessed in the subquery. So, in this example, the current value of artist_name in the outer query is supplied to the subquery as a constant, scalar value and compared to the album name. If the album name matches the artist name, the subquery is true, and so the outer query outputs a row. Consider two cases that illustrate this more clearly:

§ When the artist_name being processed by the outer query is New Order, the subquery is false because SELECT * FROM album WHERE album_name = "New Order" doesn’t return any rows, and so the artist row for New Order isn’t output as an answer.

§ When the artist_name being processed by the outer query is The Beatles, the subquery is true because SELECT * FROM album WHERE album_name = "The Beatles" returns at least one row. Overall, the artist row for The Beatles is output as an answer.

Can you see the power of correlated subqueries? You can use values from the outer query in the inner query to evaluate complex information needs.

We’ll now explore another example using EXISTS. Let’s try to find all artists from whom we own at least two albums. To do this with EXISTS, we need to think through what the inner and outer queries should do. The inner query should produce a result only when the condition we’re checking is true; in this case, it should produce output when the artist has at least two albums in the database. The outer query should produce the artist name whenever the inner query is true. Here’s the query:

mysql> SELECT artist_name FROM artist WHERE EXISTS

-> (SELECT * FROM album WHERE artist.artist_id = album.artist_id

-> GROUP BY artist.artist_id HAVING COUNT(*) >= 2);

+-------------+

| artist_name |

+-------------+

| New Order |

| Miles Davis |

+-------------+

2 rows in set (0.12 sec)

This is yet another query where nesting isn’t necessary and a join would suffice, but let’s stick with this version for the purpose of explanation. Have a look at the inner query: you can see that the WHERE clause ensures only album rows for the artist being referenced by the outer query—the current artist—are considered by the subquery. The GROUP BY clause clusters the rows for that artist, but only if there are at least two albums. Therefore, the inner query only produces output when there are at least two albums for the current artist. The outer query is straightforward: it outputs an artist’s name when the subquery produces output.

Here’s one more example before we move on and discuss other issues. We’ve already shown you a query that uses IN and finds producers who are also engineers:

mysql> SELECT producer_name FROM producer WHERE producer_name

-> IN (SELECT engineer_name FROM engineer);

+---------------+

| producer_name |

+---------------+

| George Martin |

+---------------+

1 row in set (0.06 sec)

Let’s rewrite the query to use EXISTS. First, think about the subquery: it should produce output when there’s an engineer with the same name as a producer.

Second, think about the outer query: it should return the producer’s name when the inner query produces output. Here’s the rewritten query:

mysql> SELECT producer_name FROM producer WHERE EXISTS

-> (SELECT * FROM engineer WHERE producer_name = engineer_name);

+---------------+

| producer_name |

+---------------+

| George Martin |

+---------------+

1 row in set (0.06 sec)

Again, you can see that the subquery references the producer_name column, which comes from the outer query.

Correlated subqueries can be used with any nested query type. Here’s the previous IN query rewritten with an outer reference:

mysql> SELECT producer_name FROM producer WHERE producer_name

-> IN (SELECT engineer_name FROM engineer WHERE

-> engineer_name = producer_name);

+---------------+

| producer_name |

+---------------+

| George Martin |

+---------------+

1 row in set (0.14 sec)

The query is more convoluted than it needs to be, but it illustrates the idea. You can see that the producer_name in the subquery references the producer table from the outer query. This query can also be rewritten to use an equals instead of IN:

mysql> SELECT producer_name FROM producer WHERE producer_name

-> = (SELECT engineer_name FROM engineer WHERE

-> engineer_name = producer_name);

+---------------+

| producer_name |

+---------------+

| George Martin |

+---------------+

1 row in set (0.01 sec)

This works because the subquery returns one scalar value—there’s only one engineer and producer with each name—and so the column subquery operator IN isn’t necessary. Of course, if names are duplicated, you’d need to use IN, ANY, or ALL instead.

Nested Queries in the FROM Clause

The techniques we’ve shown all use nested queries in the WHERE clause. This section shows you how they can alternatively be used in the FROM clause. This is useful when you want to manipulate the source of the data you’re using in a query.

The producer and engineer tables store the number of years that a person has been producing and engineering, respectively. If you want that value in months, there are several ways you can obtain it. One way—which we’ll show you in Chapter 8—is to use a date and time function to do the conversion. Another way is to do some math in the query; one option in this class is to do it with a subquery:

mysql> SELECT producer_name, months FROM

-> (SELECT producer_name, years*12 AS months FROM producer) AS prod;

+---------------+--------+

| producer_name | months |

+---------------+--------+

| Phil Spector | 432 |

| George Martin | 480 |

| Tina Weymouth | 240 |

| Chris Frantz | 240 |

| Ed Kuepper | 180 |

+---------------+--------+

5 rows in set (0.05 sec)

Focus on what follows the FROM clause: the subquery uses the producer table and returns two columns. The first column is the producer_name; the second column is aliased as months, and is the years value multiplied by 12. The outer query is straightforward: it just returns theproducer_name and the month value created through the subquery. Note that we’ve added the table alias as prod for the subquery. When we use a subquery as a table, that is, we use a SELECT FROM operation on it—this “derived table” must have an alias—even if we don’t use the alias in our query. MySQL complains if we omit the alias:

mysql> SELECT producer_name, months FROM

-> (SELECT producer_name, years*12 AS months FROM producer);

ERROR 1248 (42000): Every derived table must have its own alias

Here’s another example, where we’ll find out the average number of albums that we own by each artist. Let’s begin by thinking through the subquery. It should return the number of albums that we own by each artist. Then, the outer query should average the values to give the answer. Here’s the query:

mysql> SELECT AVG(albums) FROM

-> (SELECT COUNT(*) AS albums FROM artist INNER JOIN album

-> USING (artist_id) GROUP BY artist.artist_id) AS alb;

+-------------+

| AVG(albums) |

+-------------+

| 2.0000 |

+-------------+

1 row in set (0.00 sec)

You can see that the inner query joins together artist and album, and groups the albums together by artist so you can get a count for each artist. If you run it in isolation, here’s what happens:

mysql> SELECT COUNT(*) AS albums FROM artist INNER JOIN album

-> USING (artist_id) GROUP BY artist.artist_id;

+--------+

| albums |

+--------+

| 7 |

| 1 |

| 2 |

| 1 |

| 1 |

| 1 |

| 1 |

+--------+

7 rows in set (0.00 sec)

Now, the outer query takes these counts—which are aliased as albums—and averages them to give the final result. This query is the typical way that you apply two aggregate functions to one set of data. You can’t apply aggregate functions in cascade, as in AVG(COUNT(*)); it won’t work:

mysql> SELECT AVG(COUNT(*)) FROM album INNER JOIN artist

-> USING (artist_id) GROUP BY artist.artist_id;

ERROR 1111 (HY000): Invalid use of group function

With subqueries in FROM clauses, you can return a scalar value, a set of column values, more than one row, or even a whole table. However, you can’t use correlated subqueries, meaning that you can’t reference tables or columns from tables that aren’t explicitly listed in the subquery. Note also that you must alias the whole subquery using the AS keyword and give it a name, even if you don’t use that name anywhere in the query.

User Variables

Often you’ll want to save values that are returned from queries. You might want to do this so that you can easily use a value in a later query. You might also simply want to save a result for later display. In both cases, user variables solve the problem: they allow you to store a result and use it later.

Let’s illustrate user variables with a simple example. The following query finds the name of an artist and saves the result in a user variable:

mysql> SELECT @artist:=artist_name FROM artist WHERE artist_id = 1;

+----------------------+

| @artist:=artist_name |

+----------------------+

| New Order |

+----------------------+

1 row in set (0.05 sec)

The user variable is named artist, and it’s denoted as a user variable by the @ character that precedes it. The value is assigned using the := operator. You can print out the contents of the user variable with the following very short query:

mysql> SELECT @artist;

+-----------+

| @artist |

+-----------+

| New Order |

+-----------+

1 row in set (0.00 sec)

You can explicitly set a variable using the SET statement without a SELECT. Suppose you want to initialize a counter to 0:

mysql> SET @counter := 0;

Query OK, 0 rows affected (0.11 sec)

You should separate several assignments with a comma, or put each in a statement of its own:

mysql> SET @counter := 0, @age:=23;

Query OK, 0 rows affected (0.00 sec)

The most common use of user variables is to save a result and use it later. You’ll recall the following example from earlier in the chapter, which we used to motivate nested queries (which are certainly a better solution for this problem). We want to find the name of the track that was played most recently:

mysql> SELECT MAX(played) FROM played;

+---------------------+

| max(played) |

+---------------------+

| 2006-08-15 14:33:57 |

+---------------------+

1 row in set (0.00 sec)

mysql> SELECT track_name FROM track INNER JOIN played

-> USING (artist_id, album_id, track_id)

-> WHERE played = "2006-08-15 14:33:57";

+------------+

| track_name |

+------------+

| New Blues |

+------------+

1 row in set (0.31 sec)

You can use a user variable to save the result for input into the following query. Here’s the same query pair rewritten using this approach:

mysql> SELECT @recent := MAX(played) FROM played;

+-------------------------+

| @recent := MAX(played) |

+-------------------------+

| 2006-08-15 14:33:57 |

+-------------------------+

1 row in set (0.00 sec)

mysql> SELECT track_name FROM track INNER JOIN played

-> USING (artist_id, album_id, track_id)

-> WHERE played = @recent;

+------------+

| track_name |

+------------+

| New Blues |

+------------+

1 row in set (0.44 sec)

This can save you cutting and pasting, and it certainly helps you avoid typing errors.

Here are some guidelines on using user variables:

§ User variables are unique to a connection: variables that you create can’t be seen by anyone else, and two different connections can have two different variables with the same name.

§ The variable names can be alphanumeric strings and can also include the period (.), underscore (_), and dollar ($) characters.

§ Variable names are case-sensitive in MySQL versions earlier than version 5, and case-insensitive from version 5 onward.

§ Any variable that isn’t initialized has the value NULL; you can also manually set a variable to be NULL.

§ Variables are destroyed when a connection closes.

§ You should avoid trying to both assign a value to a variable and use the variable as part of a SELECT query. Two reasons for this are that the new value may not be available for use immediately in the same statement, and a variable’s type is set when it’s first assigned in a query; trying to use it later as a different type in the same SQL statement can lead to unexpected results.

Let’s look at the first issue in more detail using the new variable @aid. Since we haven’t used this variable before, it’s empty. Now, let’s show the artist_id for artists who have an entry in the album table. Instead of showing it directly, we’ll assign the artist_id to the @aidvariable. Our query will show the variable twice: once before the assignment operation, once as part of the assignment operation, and once afterwards:

mysql> SELECT @aid, @aid:=artist.artist_id, @aid FROM artist,album

-> WHERE album.artist_id=@aid;

Empty set (0.00 sec)

This returns nothing; since there’s nothing in the variable to start with, the WHERE clause tries to look for empty artist_id values. If we modify the query to use artist.artist_id as part of the WHERE clause, things work as expected:

mysql> SELECT @aid, @aid:=artist.artist_id, @aid FROM artist,album

-> WHERE album.artist_id=artist.artist_id;

+------+------------------------+------+

| @aid | @aid:=artist.artist_id | @aid |

+------+------------------------+------+

| | 1 | 1 |

| 1 | 1 | 1 |

| 1 | 1 | 1 |

| 1 | 1 | 1 |

| 1 | 1 | 1 |

| 1 | 1 | 1 |

| 1 | 1 | 1 |

| 1 | 2 | 2 |

| 2 | 3 | 3 |

| 3 | 3 | 3 |

| 3 | 4 | 4 |

| 4 | 5 | 5 |

| 5 | 6 | 6 |

+------+------------------------+------+

13 rows in set (0.01 sec)

Now that @aid isn’t empty, the initial query will produce some results:

mysql> SELECT @aid, @aid:=artist.artist_id, @aid FROM artist,album

-> WHERE album.artist_id=@aid;

+------+------------------------+------+

| @aid | @aid:=artist.artist_id | @aid |

+------+------------------------+------+

| 6 | 1 | 1 |

| 1 | 1 | 1 |

| 1 | 2 | 2 |

+------+------------------------+------+

3 rows in set (0.01 sec)

It’s best to avoid such circumstances where the behavior is not guaranteed and is hence unpredictable.

Transactions and Locking

When a database is concurrently accessed by several users, you have to consider how you may be affected if other users change the data that you’re accessing, and how changes you make may affect other users. For example, you might get the wrong value for the total sales so far this year if new sales are being added to the database while you’re adding up the sales figures.

Locks can be applied to prevent concurrent users from interacting destructively with one other’s data. A read lock allows you to prevent other users from changing data while you’re reading and processing the data, while a write lock tells other users that the data is being changed and that they should not read or modify it. For example, you need locks to avoid problems with reports when one user is trying to produce a report while another user changes the data the report is derived from.

In some cases, you want all or none of a series of operations to succeed. For example, if you want to travel from Melbourne to Seattle via Los Angeles, you need to have a seat on the flight from Melbourne to Los Angeles, and a seat on the connecting flight from Los Angeles to Seattle. Having a confirmed seat on just one leg of the route is no use to you; you can’t fly without confirmed seats on both legs.

Transactions allow you to batch together SQL statements as an indivisible set that either succeeds or has no effect on the database. This means you can start a transaction and then issue a series of SQL statements. At the conclusion, you have the option of committing (saving) the transaction to the database or rolling back (canceling) the transaction.

By default, MySQL operates in AUTOCOMMIT mode, where each update is treated as an atomic transaction of its own, and changes are automatically committed. If this mode is disabled, or a transaction is explicitly started, changes aren’t commited to the database unless you execute a COMMITor ROLLBACK instruction.

Locking and transaction support is complex, and you need to make choices about the degree of isolation needed between users and the trade-offs involved in implementing them for your application. This is a difficult and advanced topic that’s mostly outside the scope of this book, but in the next section we discuss how the main table types supported by MySQL allow locking and transactions. We also include a simple transaction example in Transaction examples,” and we describe how simple locking can be used—and avoided—for our PHP wedding-registry application inSelecting and Deselecting Gifts” in Chapter 15.

Table Types

In the book so far, we’ve used only the default MyISAM table type. There’s a good reason behind this: you very rarely need to make any other choice in small- to medium-size applications because it’s a very fast, reliable table type for most tasks. However, at the time of writing, there are at least nine other choices you can make. This section gives you an insight into these choices, explaining briefly the pros and cons of the alternatives to MyISAM.

You can divide the MySQL table types up into two sets using a few different criteria. The most common division is transaction-safe (TST) versus non-transaction-safe (NTST):

transaction-safe tables (TSTs)

These include the InnoDB and the (no longer supported) Berkeley Database (BDB) table types. TSTs support transactions and have advanced features that allow you safely restore and recover from database failures.

Non-transaction-safe tables (NTSTs)

These include the MyISAM, Merge, and Memory (also called Heap) types described in this section. They’re less advanced than the TSTs, but that isn’t always bad. They’re typically much faster to query because there’s less overhead, and they use much less disk and memory space. They’re also much easier to understand.

We’ve avoided TSTs in this book, because you’re unlikely to need to configure, set parameters for, and use such tables for most applications.

Another consideration when choosing a table type is whether it supports foreign key constraints. With foreign-key support, you can tell MySQL that a row in a table shouldn’t exist without another matching row in another table. For example, you could use it to stop you from adding a new album for an artist who doesn’t exist. We don’t use foreign-key constraints, and instead rely on the application to do the checking, not the database. Doing the checking in the database slows everything down because MySQL needs to verify the foreign-key constraints before it modifies anything. It also prevents you from ignoring the rules for good reasons—such as improved performance—when you want to. Currently, only the InnoDB table type supports foreign-key constraints, although support is planned for MyISAM. If you’re not using the InnoDB table type and specify foreign-key constraints for a field, MySQL won’t complain, but won’t actually do anything, either. We won’t discuss foreign key constraints in further detail.

You can use the SHOW TABLE STATUS command to display technical information about how your tables are stored:

For example, if we need a transaction-safe table on this server, we can use the InnoDB table type.

If you decide you want to use a different table type, there are two ways to exercise your choice. One way to do it is in the CREATE TABLE statement. For example, you can create a new Memory table mytable as follows:

mysql> CREATE TABLE mytable (field INT(2)) type=Memory;

Query OK, 0 rows affected, 1 warning (0.08 sec)

Alternatively, you can use ALTER TABLE to adjust the type after it’s created. For example, you could convert the artist table to the InnoDB type:

mysql> ALTER TABLE artist type = InnoDB;

In both examples, you can substitute the alias ENGINE for TYPE. Of course, much like every other ALTER TABLE statement, the overhead of changing your choice can be high for large tables.

Note that there are several, rarely used table types we don’t discuss at all in this book. These include Merge (which is a variant of MyISAM used in large distributed installations), Example (a nonfunctioning type used to illustrate ideas for programmers), NDB Cluster (a high-performance type used to partition tables across many computers), Archive (a high-performance, index-free table type used for very large data collections), CSV (a table type for working with data stored as comma-separated values in text files), and Federated (a very new engine—added in MySQL 5.0.3—that’s used to store data in remote databases). You can find out more about these under “Storage Engines and Table Types” in the MySQL manual.

MyISAM

Before we discuss the alternatives, let’s focus on the default MyISAM type. It’s an all-around performer that’s designed for typical applications; it supports very fast querying and has very low overhead for changes to data. It’s also very flexible: underneath; it adapts how it stores data, depending on the structure of the tables you ask it to create. You’ll recall from Chapter 6 that we encouraged you to consider using fixed-length column types in preference to variable-length types. It was with MyISAM in mind that we made the recommendation: when you use fixed-length fields, MySQL adapts its disk-storage structures for fast data access and modification; it’s also easier to recover data from a corrupted table file if it uses fixed-length fields.

One of the key features of MyISAM is its unique way of locking tables. In brief, MyISAM locks are whole-table locks. This means that when you decide to lock a table, other users can have no access to the table at all. While this seems heavy-handed, it works fine for most typical applications, and management of the locks in this way costs very little memory and computational overhead. We’ll contrast this with other locking schemes later when we describe InnoDB and BDB tables.

Unless you can see a good reason, stick with MyISAM while you’re learning MySQL.

Memory or Heap

Prior to MySQL 4.1, the Memory table type was known as the Heap table type. Both keywords are supported, but the MySQL designers now prefer the term Memory. We’ll use the new term here, but they’re interchangeable.

The Memory table type is useful when you want to force data to be in main memory and not on disk. You do this when you want very fast access to a typically small set of data. It’s ideal, for example, for storing and finding country names, lists of states or cities, or salutations. Don’t use it for large files, as you need main memory for other tasks, such as SQL query evaluation and whatever other tasks your computer performs. Choose it when speed is essential for small tables with data that doesn’t change.

There are serious disadvantages to the Memory type that can make it annoying. The most serious is that when you stop and restart a MySQL server, the data stored in a Memory table is lost. This means you need to restore it each time you start the MySQL server, which you might do by using the SOURCE statement or by using the init-file option to cause it to load a file on startup; the former is discussed in Running the Monitor in Batch Mode” in Chapter 3 and the latter in Resetting Forgotten MySQL Passwords” in Chapter 9. This is also a good reason to ensure the data doesn’t change: use it for tasks where you have a fixed set of choices, not for tasks where you’re dynamically updating those choices. Remember that if your MySQL server goes down, you’ll lose any changes you’ve made if you’ve haven’t explicitly dumped them to a disk file.

The Memory type has one significant advantage: it’s an extremely fast environment for searching for exact matches (for example, checking if a country entered by a user matches a list of valid countries). Its list of disadvantages and limitations is longer:

§ As discussed, data is lost when the server stops. You need to reload it each time the server starts.

§ It doesn’t support TEXT or BLOB type columns, or any of their variants.

§ Prior to MySQL 4.0.2, it doesn’t support indexes on columns that contain NULL values.

§ Prior to MySQL 4.1.0, the AUTO_INCREMENT feature isn’t supported.

§ The tables are stored exclusively in memory. While this is what makes them fast, it’s a disadvantage if there are many memory-based tables, if the memory-based tables are large, or if the server needs the memory for other tasks.

InnoDB

The InnoDB type is the heavyweight, reliable, high-performance choice for large-scale, highly reliable applications. It’s similar to MyISAM but includes extra features that make it transaction-safe, reliable, and flexible for high-end applications. Choose it if you’re building an application that needs features MyISAM doesn’t have. In this book, we don’t discuss those features in detail, so it’s unlikely you’ll need to use it while you’re learning MySQL. However, note that with MySQL 4.1.5 and later, InnoDB is the default table type in Windows when you download and install a binary package from the MySQL AB web site; in practice, this has no impact on you, so don’t be too concerned whether the default is MyISAM or InnoDB while using this book.

The InnoDB table type includes the following features:

Support for transactions

This is discussed at the start of this section.

Advanced crash recovery features

The InnoDB table type uses logs, which are files that contain the actions that MySQL has taken to change the database. With the combination of a log and the database, MySQL can recover effectively from power losses, crashes, and other basic database failures. Of course, nothing can help you recover from loss of a machine, failure of a disk drive, or other catastrophic failures. For these, you need offsite backups and new hardware.

Row-level locking

We’ve explained how MyISAM locks at the table level, and the advantages and disadvantages of this. InnoDB locks at the row level, meaning that only the rows of data that are affected are unavailable to other users, which promotes better concurrency (sharing) of resources in certain circumstances. For applications that write more data than they read, or for applications that change large amounts of data when they do, InnoDB may be a better choice than MyISAM.

Foreign-key support

InnoDB is currently the only MySQL table type that supports foreign keys.

Fast, flexible indexing

The InnoDB type chooses the right data structure for the task when you create an index. It can switch from the fast, exact-match hash index to the fast, all-around B-tree index as the need arises, giving you fast searching for most applications without you having to explicitly set the index type.

The InnoDB type has the following limitations:

More features means more to understand

You need to know about transactions, foreign keys, data versioning, and other features to use it effectively.

It’s difficult to set up

It has tens of startup parameters and options, and to use it effectively, you need to understand and tune these. If you’re planning on using it, you need to know its details because that’s why you’ve chosen it over MyISAM. Tuning InnoDB requires a book of its own!

It’s disk-hungry

To support its transaction-safe and robust behavior, InnoDB needs extra disk space. MyISAM is much more compact because it doesn’t have these features.

Locking overheads

Row locking is more complex than table locking, and so it’s slower and takes more memory.

Transaction examples

Because transactions are the key feature that make InnoDB different from MyISAM, we’ll conclude this section with an introductory example that shows how they work.

Suppose you decide you want to add a new artist and album to the database. You want to ensure that either both actions succeed or neither do, and you want to carry out the process in complete isolation from other users; you don’t want to insert tracks for a peculiar artist ID if the artist_idvalues is already taken for another artist, or other people using your data until it’s finalized. To do this with a transaction, we need to first, change the table type for artist and album to InnoDB:

mysql> ALTER TABLE artist type = InnoDB;

Query OK, 7 rows affected, 1 warning (0.54 sec)

Records: 7 Duplicates: 0 Warnings: 0

mysql> ALTER TABLE album type = InnoDB;

Query OK, 14 rows affected, 1 warning (0.01 sec)

Records: 14 Duplicates: 0 Warnings: 0

With the InnoDB tables, we can now perform the following transaction:

mysql> START TRANSACTION;

Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO artist VALUES (8, "The Cure");

Query OK, 1 row affected (0.04 sec)

mysql> INSERT INTO album VALUES (8, 1, "Disintegration");

Query OK, 1 row affected (0.00 sec)

mysql> COMMIT;

Query OK, 0 rows affected (0.00 sec)

The first statement, START TRANSACTION, tells MySQL that you’re beginning a set of statements you want in isolation and to be treated as a block or atomic entity. You then execute the two statements that modify the database. At the conclusion, you tell MySQL to COMMIT—that is, end the transaction and make the changes to the database.

Transactions also allow you to abort or rollback—that is, undo everything that’s in the transaction. Let’s try an example where we do just that:

mysql> START TRANSACTION;

Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO artist VALUES (9, "The Wh");

Query OK, 1 row affected (0.01 sec)

mysql> ROLLBACK;

Query OK, 0 rows affected (0.04 sec)

mysql> SELECT * FROM artist;

+-----------+---------------------------+

| artist_id | artist_name |

+-----------+---------------------------+

| 1 | New Order |

| 2 | Nick Cave & The Bad Seeds |

| 3 | Miles Davis |

| 4 | The Rolling Stones |

| 5 | The Stone Roses |

| 6 | Kylie Minogue |

| 7 | The Beatles |

| 8 | The Cure |

+-----------+---------------------------+

8 rows in set (0.06 sec)

You can see in the second statement that we’ve misspelled the name of the band The Who, so we’ve decided to ROLLBACK the transaction. You can see that the rollback was successful, since the SELECT statement shows the artist wasn’t added.

BDB

The Berkeley Database (BDB) table type can survive the same types of database crashes as the InnoDB table type, and also has the COMMIT and ROLLBACK functionality we showed you in the previous section. The Berkeley DB software itself is developed by Sleepycat Software (http://www.sleepycat.com) and is modified to work with MySQL. While Berkeley DB is very stable and is used by over 200 million installations in a very wide range of products, the interface between MySQL and Berkeley DB is still under development—and so the MySQL BDB table type is not yet widely used. Inbuilt support for this table type was dropped from MySQL version 5.1.12 onward. If you really need to use this table type, you can add support for it as a plug-in storage table, but that’s outside the scope of this book.

This table type includes the following features:

Transaction support

See our earlier description of transactions in Transactions and Locking.”

Advanced crash-recovery features

See our earlier description in InnoDB” for a discussion of logging and recovery.

Page-level locking

We’ve explained how MyISAM and InnoDB lock at the table and row levels, respectively. The BDB locking philosophy lies somewhere between the two, locking typically a block of rows that reside in a physical disk-drive block.

Fast primary-key indexing

The primary key index is stored with the data, and MySQL can avoid accessing the data itself if you require only columns that are part of the primary key. However, this also means that you must have a primary-key index. MySQL will automatically create a hidden five-byte primary key if you don’t specify one. It’s also slower to scan all rows in a table if required for a query.

The BDB type has the following limitations:

Gamma-quality interface and limited support

The Berkeley DB engine never became a fully integrated and supported part of MySQL, and it could not be used on some non-Intel architectures. As mentioned earlier, inbuilt support for the BDB engine was officially dropped from MySQL version 5.1.12.

It’s disk-hungry, like InnoDB

With the features that make it transaction safe and robust to recover comes the cost of extra disk space for storing the information that’s needed. MyISAM is much more compact because it doesn’t have those features.

It’s difficult to set up

You generally need a compiler to generate the required program files from source code. There are plenty of startup parameters and options, and you need to understand and tune these to make effective use of the BDB engine. This is a good enough reason not to use it, unless you really know what you’re doing and why you want it.

Exercises

Selected exercises in this section concern the music database. You’ll find that the description of table structures in The Music Database” in Chapter 4 is a useful reference:

1. Write a join query that displays the name of each artist and the albums they’ve made. Alongside the album, display the number of tracks on the album.

2. Repeat Question 1, but now display only those albums that have more than 10 tracks.

3. Repeat Question 1, but write it as a nested query.

4. What are the four types of nested queries? For each type, write a sample query on the music database. Try to use different keywords in each query, selecting from ANY, ALL, EXISTS (or NOT EXISTS), and IN (or NOT IN).

5. What is the difference between an INNER JOIN, a LEFT JOIN, and a RIGHT JOIN? Does the order of tables matter in an INNER JOIN?