Adding It Up With SQL Aggregates

So far we've learned how to answer a lot of questions using T-SQL. The next important thing to learn is how to aggregate data. Aggregating data is what we do when want to find out totals, averages or counts of various data elements. Take a look at the following example:

Here we have a set of cards with various colors. We can easily answer questions like, "How many colors are there?", "How many total cards are there?" and "How many blue cards are there?". However, if we had five hundred cards, it'd be a very different story. Here's the SQL table representation of our cards:

SELECT CardID, Color FROM CARDS

Here we have our result set that represents the table of cards shown above. Now, lets say we're asked to figure out how many unique colors there are. We're asked to put all the red cards in a pile, all the green cards in another, and the blue cards in a third pile. Now we need to aggregate, or group, our cards and our SQL statement!

Aggregation Time

The dictionary defines aggregate as, "A whole formed by combining several disparate elements." Keep this in mind, since we'll be using the term frequently going forward.

Now we have our cards laid out in three piles. It's easier to say there are only three colors now. You can also easily reach conclusions about the population of cards. Red is the most prevalent. There is only one green card. Other statements like this are easier to answer visually.

We can do the same thing with T-SQL. The SQL grouping functionality is broken into two parts:

GROUP BY Clause - The GROUP BY clause tells SQL what the criteria for our grouping will be. In the example above, we grouped by color.

Aggregate Functions - These appear in the SELECT statement field list and tell SQL how to treat each field. Any field that does not appear in the GROUP BY clause has to have an aggregate function. These tell SQL if we're interested in seeing a count, a sum or an average of each collection.

The best way to understand this is simply to see it in a query:

SELECT Color, COUNT(CardID) FROM CARDS GROUP BY Color

Now, we see something different. Although we're talking about eight cards, only three rows of data came back. Much like we created three piles of colored cards, this statement also created three rows, one for each color.

If we try to run the following SQL, we'll get an error:

SELECT Color, CardID FROM CARDS GROUP BY Color

Why? How can the SQL Server display CardID if it is trying to group all results by color? The same is true of the following:

SELECT Color, COUNT(CardID) FROM CARDS

You haven't told SQL Server what you want to sort by. How can you get a count if you're asking for each row's color to be displayed as well? As you can see SQL can only help you if you properly divide the fields being returned into two categories: those that are displayed and those that are totaled up.

Slice and Dice Your Data

Now that we've seen a very simple example, let's apply it to our college class example from the previous lesson on advanced joins. We'll start with something simple and add complexity to it. This will help you grasp how powerful creating an ad hoc aggregation query can really be!

Let's start with something very simple:

SELECT COUNT(StudentID) FROM STUDENTS

3

The result? SQL returns only a single value. Three. Since all we're asking with this query is for a count of students with no grouping whatsoever, SQL just tells you how many rows there are in the student table. It's like it just stacked up all the records and kept count. Lets change it up slightly.

SELECT COUNT(StudentID), AVG(GPA) FROM STUDENTS

3

3.93

Now the query returns two values in a single row. We still get our count of students but also their average GPA. This is allowed because we're still asking for something that requires the same level of sorting. Now, let's change it up again. We're going to introduce a new field into the query that needs to be sliced up.

SELECT LastName, COUNT(StudentID), AVG(GPA) FROM STUDENTS

Will this statement run? It won't. Now that we've included the LastName, we're asking for something that must be sliced. SQL starts to figure out there are three students and their GPA. After that, it gets confused when it sees LastName, something it hasn't been told to either group by or aggregate on.

To make this work, we need to add grouping. Lets add a GROUP BY clause to our SQL:

SELECT LastName, COUNT(StudentID), AVG(GPA)
FROM STUDENTS
GROUP BY LastName

Now, you'll notice that SQL has made two separate rows. It has stacked the two people with the last name "Doe" together and the one row with the last name "Jayne" by itself. Just to see what happens, let's add the FirstName to the SELECT and GROUP BY clause as well:

SELECT FirstName, LastName, COUNT(StudentID), AVG(GPA)
FROM STUDENTS
GROUP BY LastName, FirstName

What happened? Where are our carefully prepared stacks? Well, by adding the FirstName into the slicing of the data, we sliced as thin as we can go. There is no more data to aggregate. You should now have a firm grip on how aggregation works within the context of T-SQL.

Grouping With Joins

Using the aggregate functions with joins is exactly like using them with a single table query. Just put your JOIN statements in as usual and then add the grouping. It is usually best to start out by creating the join with no aggregate code. This helps you make sure you have everything working right before you complicate it further.

Filtering With Aggregated Data

There is still one more "gotcha" to talk about with grouping data in T-SQL. What happens when we have to filter out certain elements from our results? How do we use aggregated fields as criteria? Let's look at these two items next.

Let's go back to our cards example above. Let's say we've been asked not to include the blue cards at all. A good rule of thumb with aggregating is, get the non-aggregated query right first. Once you do that, adding the groupings won't mess anything up. We'll start by filtering out the blue cards or taking them off the table, so to speak.

SELECT *
FROM CARDS
WHERE COLOR <> 'BLUE'

That takes the blue cards out of the basic results. Now we can just do our grouping as we would before.

SELECT Color, COUNT(CardID)
FROM CARDS
WHERE COLOR <> 'BLUE'
GROUP BY COLOR

It's that simple! Trying to do it all at once could introduce errors into the equation. We're almost done! Let's say we also want to get rid of any colors that have less than 2 cards. You might think that this would work:

SELECT Color, COUNT(CardID)
FROM CARDS
WHERE COLOR <> 'BLUE' AND COUNT(CardID) > 1
GROUP BY COLOR

This makes sense to us but not to SQL server. This has to do with the way SQL puts its statements together. Essentially, it doesn't know what the count of CardID is at the time that its figuring out your WHERE clause. Instead, we use a special HAVING clause.

SELECT Color, COUNT(CardID)
FROM CARDS
WHERE COLOR <> 'BLUE'
GROUP BY COLOR
HAVING COUNT(CardID) > 1

You'll see it looks almost identical to the WHERE clause. However, now SQL understands that we want it to get the totals before dropping the slices that have less than two.

You've done it! You have now mastered one of the trickier components of T-SQL. However, up until now, we haven't changed a single piece of data! We've just been retrieving it. If you had a read-only computer, it wouldn't be very useful, would it? The next lesson will show us how to start making changes to our data.