Wrong GROUP BY makes your queries fragile

This is probably well known issue for everyone having some MySQL experience or experience with any other SQL database. Still I see this problem in many production applications so it is worth to mention it, especially as it is connected to MySQL Performance. No it might not affect MySQL Performance per say but it limits our ability tune MySQL Performance as queries become fragile – changing execution plan leads to different query results.

So what I’m speaking about ?

Lets say you have query something like SELECT A,B,MAX(C) FROM TBL GROUP BY A – what would you expect from such query ? Column A is part of group by so its value is same for whole group. MAX(C) is also particular value for each group, while B is not part of GROUP BY and may well correspond to different values. Which one are you looking to get ? In fact this is where results becomes non-deterministic and fragile – any B from the group could be returned, while you might be expecting some particular one.

Many other DBMS and ANSI SQL Simply forbids such queries, MySQL is more permissive and will return you first B it runs into. This value however can be dependent on selected execution plan and change if you add some indexes. Also plan may change as database statistics changes and your query may become broken without any reason.

My Advice is to stay away from such group by statements, however as usually there are exceptions and such GROUP BY statements may be faster than alternatives. Assume for example we have non-normalized table (or join result) where two columns have 1-1 relationship. For example we have user id and login which are both unique. In this case running

SELECT id,login,max(login_time) FROM log GROUP BY id,login May be replaced by SELECT id,login,max(login_time) FROM log GROUP BY id which may be faster especially if GROUP BY is executed via sorting.

In this case even though we do not know which value from the group will be returned we do not care as it will be same for all rows.

Related

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master’s Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

I recommend people enable the sql mode to prevent these kinds of queries:

#

ONLY_FULL_GROUP_BY

Do not allow queries for which the SELECT list refers to non-aggregated columns that are not named in the GROUP BY clause. The following query is invalid with this mode enabled because address is not named in the GROUP BY clause:

SELECT name, address, MAX(age) FROM t GROUP BY name;

As of MySQL 5.0.23, this mode also restricts references to non-aggregated columns in the HAVING clause that are not named in the GROUP BY clause.

Um, in your example, you say that id and login are unique — I’m assuming that’s enforced with UNIQUE KEY, right? The internals manual states that GROUP BY uses sorting when there is no index…..so while you are correct in the general case, that GROUP BY may be faster if it does not have to sort, your example does not fit your point.

The query in your example is a bad one, but not because of the optimizer’s behavior. The query is poor (and I’m sure it’s straight from clients who need your help!) because you do not NEED to GROUP BY 2 elements if they have a 1:1 relationship. Folks need to pick one element in that case. I’m sure other DBMS’ don’t handle this case “right” either.

This can be summed up with a “Best practice” — “Do not GROUP BY more than you need to.”

see
http://dev.mysql.com/doc/internals/en/optimizer.html
under “GROUP BY”

In my example I’m saying login and id have one to one relationship in schema. But they are not unique in this example – This is “log” table which for example logs all accesses by users including their id and login.

I also do not understand you further comment – I specially show this as a case when columns can be accessed in group by even though they are not being grouped by.

Really it does not have to be one to one relationship, it is enough to have same value for particular value of the column in group by. For example if we have City and County and each City corresponds to only one County we could have something like:

SELECT county,city, avg(age) FROM people GROUP BY city;

Not correct according to the standard but should not be fragile as there is only one county per city so it does not matter which one will be picked.

isnt this sql statement wrong? if you are grouping by name, you get a subset of address and age. on age you are applying an aggregate operator (MAX), but which item of the subset is seleected from address?

if table name, address, age contains

meier, street a, 23
meier, street b, 26

the group-by subset should be meier, (street a, street b), (23, 26)

max means

meier, (???), 26

so i’d expect sorta subset instead of questionmark. which aggregate operator is applied here? and, isnt it fatal for development to select apples and max(price) while grouping by peaches?