Structured Query Language/SELECT: Grouping

In this chapter we will leave the level of individual rows. We strive to find informations and statements that refer to groups of rows - at the expense of information about individual rows. In the context of SQL such 'row-groups' (or sets of rows) are build by the GROUP BY clause and further processed by the HAVING clause.

First we must establish criteria according to which the rows are assigned to groups. To do so we use the content of one or more columns of the involved table(s). If the values are identical, the rows belong to the same group. Consider the lastname in table person. In our small example we can insinuate that persons with same lastname form a family. So if we strive for informations about families we should use this column as the grouping criterion. This grouping allows us to ask questions concerning whole families, such as 'Which families are there?', 'How many families exists?', 'How many persons are in each family?'. Please note that all of them are questions about the whole group (which means the family), not about single rows (which means the person).

In the SQL syntax the criterion is specified after the key word GROUP BY and consists of one or more columnnames.

The query retrieves seven 'family names' out of the 10 rows. There are several persons with lastname 'Goldstein' or 'de Winter'.

We can retrieve the same seven 'family names' by applying the key word DISTINCT in a SELECT without GROUP BY.

SELECTDISTINCT lastname
FROM person;
-- no GROUP BY clause

What makes the difference? The DISTINCT key word is limited to remove duplicate values. It can not initiate computations on other rows and columns of the result set. In contrast, the GROUP BY additionally arranges the intermediate received rows as a number of groups and offers the possibility to get informations about each of this groups. It is even the case that within this groups all columns are available, not only the 'criterion'-column. To confirm this statement about 'all' columns we use weight which is not the 'criterion'-column.

The result shows the seven family names - as seen before - plus the average weight of every family. The weight of individual persons is not shown. (In groups with exactly one person the average weight of the group is of course identical to the single persons weight.)

We see that in our small example database there is one family with 3 members, another with 2 members and all others consist of exactly 1 member.

What is going on behind the scene during the execution of the command?

All ten rows of table person are retrieved (in the above command there is no WHERE clause).

The rows are arranged into seven groups according to the value of column lastname.

Every group with all of its rows is passed to the SELECT clause.

The SELECT builds one resulting row for every received group (in 'real world' databases each of the groups may contain thousands of rows).

In step 4 exactly one resulting row is generated per group. Because the SELECT creates only one resulting row per group, it is not possible to show values of such columns which may differ from row to row, e.g. the firstname. The SELECT can only show such values of which it is ensured that they are identical within all rows of the group: the 'criterion'-column.

-- It is not possible to show the 'firstname' of a group! 'Firstname' is an attribute of single persons.-- Within a group 'firstname' varies from row to row.-- The DBMS should recognise this problem and should issue an error message.SELECT lastname, firstname
FROM person
GROUPBY lastname;
-- A hint to users of MySQL:-- To receice correct results (the error message) you must deactivate a special performance feature by issue the command-- set sql_mode = 'ONLY_FULL_GROUP_BY'; or set it in the workbench or in the ini-file.

Nevertheless we can get information about the non-criterion-columns. But this information is more generalised. The DBMS offers a special group of functions which builds one value out of a set of rows. Consider the avg() function, which computes the arithmetic mean of numerical values. This function receives a columnname and operates on a set of rows. If our command in question contains a GROUP BY clause, the avg() function does compute one value per group - not one value per all rows as usual. So it is possible to show the result of such functions together with the values of the 'criterion'-column.

Here is an - incomplete - list of such functions: count(), max(), min(), sum(), avg(). Not all functions are of that kind, e.g. the function concat(), which concatenates two strings, operates on single rows and creates one value per row.

You know the WHERE clause. It defines which rows of a table will be part of the result set. The HAVING clause has the same meaning at the group-level. It defines which groups will be part of the result set.

-- The HAVING complements the GROUP BYSELECT...FROM...GROUPBY<columnname>HAVING<HAVING clause>; -- specify a criterion which can be applied to groups

We retrieve exclusively families with more than 1 members:

SELECT lastname
FROM person
GROUPBY lastname -- grouping over lastnameHAVINGCOUNT(*)>1; -- more than one person within the group

All families with one member are no longer part of the result.

In a second example we focus on such groups which satifies a criterion on column firstname. Consider that firstname is not the grouping-column.

-- Groups containing a person whose firstname has more than 4 characters: 5 resulting rowsSELECT lastname
FROM person
GROUPBY lastname
HAVINGMAX(LENGTH(firstname))>4; -- max() returns ONE value (the highest one) for all rows of each 'lastname'-group

The result shows the 5 families Baker, de Winter, Goldstein, Rich and Stefanos (, but not the row(s) with the long firstname).

Please note that this result is very different from the similar question to persons whose firstname has more than 4 characters:

-- Persons whose firstname has more than 4 characters: 6 resulting rows!!SELECT lastname, firstname
FROM person
WHERELENGTH(firstname)>4;
-- no GROUP BY and no HAVING. The WHERE isn't an equivalent replacement for the HAVING!!

Where is the additional row coming from? In the family de Winter there are two persons with a firstname longer than 4 characters: James and Victor. Because in the command without GROUP BY we select for persons and not for families, both rows are displayed individually.

In summary we can say that the HAVING clause decides, which groups are part of the result set and which are not.

The GROUP BY and HAVING clauses are part of the SELECT comand and we can combine them with any other clauses of the SELECT as desired. Only the order of the clauses is obligatory.

-- This is the obligatory order of clausesSELECT...FROM...WHERE...GROUPBY...HAVING...ORDERBY...
;

As mentioned the WHERE clause works on the row-level whereas the HAVING clause works on the group-level. First the WHERE is evaluated, next the GROUP BY, next the HAVING, next the ORDER BY and at the end the SELECT. Every step is based on the results of the previous one.

Finally we offer two additional examples:

-- Are there persons born on the same day?SELECT date_of_birth -- In a later chapter you will learn how to select the name of this persons.FROM person
GROUPBY date_of_birth
HAVINGCOUNT(date_of_birth)>1-- more than one on the same day?ORDERBY date_of_birth;
-- Families with long first- and lastname. Comment out some lines to see differences to the original query.SELECT lastname,COUNT(*)AS cnt
FROM person
WHERELENGTH(firstname)>4GROUPBY lastname
HAVINGLENGTH(lastname)>4ORDERBY cnt DESC, lastname
;

There are persons, which do not perform a hobby. But the nearby formulation 'count(*) = 0' will not lead to the expected result because for such persons there are no rows in table person_contact, so the DBMS cannot create any group and hence cannot display anything.

Looking for something that does NOT exist is often more difficult than looking for the existence of something. In such cases you usually have to use one of: NOT EXISTS, NOT IN,a combination of OUTER JOIN and IS NULL, a combination of OUTER JOIN and MINUS togetherwith INNER JOIN.