Description

I would like to see an addition to the django ORM - many people has asked for aggregation. Do you think Django should have aggregation support?

I would like to add it in a similar fashion that ValuesQuerySet is implemented (a subclass to QuerySet with minor changes to how query is constructed and results are returned) - if we add a AgreggateQuerySet that would:

There are a few problems I haven't thought through and would like comments:

how to deal with grouping by foreign key? the best seems to include the whole object in the result, not just the ID. (isn't it too magical?)

how to specify ordering?

how to best present the results? originally I thought that fields in group_by will be on the same level as min or max, but Tim Chase pointed out that it could clash if the model has a field named min .

I will start working on this as soon as I am sure about the interface. Any comments and help would be greatly appreciated.

"how to deal with grouping by foreign key? the best seems to include the whole object in the result, not just the ID. (isn't it too magical?)"
Surely, aren't they lazy after all? So they will be not queried until you access them?

"how to best present the results? originally I thought that fields in group_by will be on the same level as min or max, but Tim Chase pointed out that it could clash if the model has a field named min ."
I see no collision in current result.

"how to deal with grouping by foreign key? the best seems to include the whole object in the result, not just the ID. (isn't it too magical?)"
Surely, aren't they lazy after all? So they will be not queried until you access them?

either it will be the whole object, or just an id, having a lazy object doesn't seem to have any effect here

either it will be the whole object, or just an id, having a lazy object doesn't seem to have any effect here

Returning just an ID is more harmful than returning object. When you return just the ID the people who needs the object is facing frustrating problem, fetching those objects externally. I suggest we use object, and those who need ID just uses the object.id to get it (notice: in django one shouldn't change the id fieldname, I've learnt, same way they shouldn't change the primarykey type... even if it is possible, there is stuff that simply does not work after changing it).

nice, but what field should be first in your example? name, SUM( pay ) or AVG( pay ) ??

name is first, then comes sub(pay) with descending in place and finally avg(pay), same way as in normal order_by, the leftmost has biggest priority.

For the record - I'm taking ownership of this ticket, although work probably won't commence until after the queryset refactor and newforms-admin land in trunk. There are still some design issues to sort out, and I want to make sure we have everyone's attention.

I added a patch that provides the aggregate functionality as proposed in ​http://groups.google.com/group/django-developers/browse_thread/thread/3d4039161f6d2f5e/7485e59f449f9bf4?hl=en&
No anotate function yet. This is not a final implementation, just some of the functionality. The aggregate modifier already follows joins and group by functionality can be specified by using values on the QuerySet before calling aggregate(). Aggregate is a terminal modifier.
To-Do: test aggregate(), annotate(), AggregationQueryset? (like ValuesQuerySet, so it doesn't need to be called as the last modifier), un-aliased syntax, custom

Finnally got some time to work on this. I added a new update to the patch. It is still very rough but implements both annotate and aggregate functionalities. Currently the modifiers take both *agrs and kwargs and generate the standard alias in the case of *args. When applying the modifiers to a ValuesQuerySet the field used for GROUP BY are restricted to the fields specified in 'values'. Annotate returns a new (Values)QuerySet and aggregate a dict. Currently there is a 'bug' in which I'd like to get feedback: when aggregating on both a field of the model and a foreign field the result of the aggregation on the 'local' field is 'wrong' (not the same as if aggregating only in local fields or only in foreign fields) this is because we are doing the aggregation in columns that are repeated in the SQL (due to the nature of joins). I am not sure if this should be prohibited (doing both foreign and local aggregates) because, though it provides some functionality, generates results that are unexpected and might lead to confusion. I am not going to touch on this for now as I think it needs discusion. I will continue to work in making the HAVING clause work when filtering on aggregated fields.

For the benefit of those that haven't heard - ORM aggregation support has been accepted as a 2008 Google Summer of Code project. Nicolas Lara will be doing the heavy lifting, I (Russell Keith-Magee) will be mentoring.

Status update: this ticket is now pretty much guaranteed to be on the "must have" list for v1.1.

I've been working on this patch in my own private sandbox in preparation for committing to trunk. In the interests of public disclosure of my progress, I've just pushed my updates onto ​github.

Big, flashing light warning: While the basic shape of the API has been finalized, the code itself isn't final. Changes are expected before this reaches trunk. In particular:

Return types for aggregate columns are still in flux

The output format for ValuesQuerySet is still under development

The internal structure of Aggregate classes is subject to change

The handling of Having clauses will undergo some major changes.

There will probably be other changes. In short - consider yourself warned. This is a development branch. It is subject to change. It will occasionally break. Backwards compatibility is in no way guaranteed until this ticket is committed to the Django trunk.

Aggregations are almost ready for trunk, so I would greatly appreciate any bug reports on the code as it currently stands. Pull a copy of the repository from ​my github repository; if you're not a git user, you can download a snapshot tarball from the github page.

I would be great if this feature supported Standard Deviation to do some simple statistics.

I've just updated the github repository with StdDev and Variance, including an optional flag to do sample stats instead of population stats. Be warned, however: SQlite doesn't provide these aggregates out of the box - you will need to install an SQLite extension module to get StdDev and Variance.