Description

This is really a feature request, but I'm posting it here in the hopes that it gets a little backing from the community.

I had a situation where I needed to write a query that would sum up demographics information for items ordered. I needed to count the total number of orders for each type of ethnicity, grouped by the item ordered. So the results should look something like this:

Now, I could have just as well written a raw SQL query, but I like django's ORM so much it pains me to not take advantage of it as much as I can.

My idea was for a new type of aggregate class called "CountIf()", which would act just like count, except that you could also pass filter conditions. The resulting django query might look something like this:

... which would then generate the SQL clause similar to above. Unfortunately, the task of converting the condition itself into SQL was a bit too daunting, so I took a small workaround that only required minimal SQL:

If we could figure out a way to convert django-esque filter conditions into the SQL necessary for the condition in the "sql_template" of the aggregate, that would be epic! It seems like someone with a good knowlege of the inner-workings of django.db.models.sql.query would be able to figure it out pretty quickly, but hopefully someone will let me know if this is just a pipe dream :)

Change History (47)

I'm slightly concerned that this could a little too specific. What you're proposing isn't a simple aggregate - it's a very specific combination of aggregates. I'm not convinced that it is a good idea to start embedding extremely complex aggregate clauses as part of the Django core.

However, the Django core should be exposing the tools that make it possible for end-users to write their own complex aggregates. It looks like we're most of the way there - you've been able to write a custom aggregate that writes most of the SQL. All we're missing is a clean interface for converting field lookup syntax into joins.

The SQL query interface contains utility functions that do this, but these functions aren't easily digested by end users (hell, they make my head spin whenever I look at them). There is also an advantage to keeping the SQL query interface as internal code - it means we don't have to maintain backwards compatibility guarantees. However, as a 'someday/maybe' goal, making SQL generation a clean interface is certainly something to strive for.

Anyone looking to implement this might want to look for hints in the implementation of F() clauses. This gives a pretty clean code path for lookups and joins.

Support for conditional aggregates would make Django more appealing for developers who work on projects that primarily serve up numbers. Computation off of annotated aggregates would be used quite often in those cases, i.e. metrics, segmentation analysis, etc.

I came across this same problem myself today, and spoke to Russ who pointed me to this thread.

I was planning on writing up a patch for this myself (for consideration into the core), which had support for Q() / F(), and such. However, the patch would have needed to be cross database compatible, and sadly, I just don't have the spare time to write this for all the databases.

I would be more than willing to write the MySQL aggregates patch, if others can contribute with the other supported databases and testing etc. If anyone would like to join forces on this to help get it included in the core, let me know via ticket.

Cal Leeming

PS: My use case for this +1 feature vote, is due to having 40 million rows of data, and needing to avoid using layers of filter().aggregate(), which would in turn cause multiple queries needing to be run.

Using the patch, one can do queries like (using the modeltests/aggregation models):

# The average age of author's friends, whose age is lower than the authors age.
vals = Author.objects.aggregate(Avg("friends__age", only=Q(friends__age__lt=F('age'))))
vals["friends__age__avg"] == 30.43

Every aggregate now can use the "only" kwarg. The only kwarq must be a Q object. Lookups in only are allowed only if they do not generate additional joins.

There are 3 main issues with this patch:

The biggest problem is that aggregates need to return query parameters. These parameters should be escaped by the database backend. This causes all sorts of ugliness. I have checked in various places if as_sql returns a tuple, and if so, added the parameters to the query parameters. This isn't an ideal solution, but my skills aren't good enough to work out a better solution.

Does the CASE WHEN construction work on all supported databases? I have tested this on PostgreSQL and SQLite3. I think this will work on any DB but I haven't tested it. MySQL 4.x is the main concern.

The usage of add_q in add_aggregate isn't a clean solution.

There are some new tests included. I have ran the full test suite both on PostgreSQL 8.4 and SQLite3 version 3.7.4, and this passes the test suite. Annotations should also work.

As said, this is a POC patch, and I do expect it to fail if given a complex enough query. Also there aren't nearly enough tests, and there is no documentation.

Updated patch. Now with aggregate.relabel_aliases() and .extra() fixed.

I also included the ability to do F() calculations in aggregate and annotate:

# calculate how much younger the author's friends are on average
# (only those friends included that are actually younger than the author)
vals = Author.objects.aggregate(
friends_younger_avg=Avg(F('age') - F("friends__age"),
only=Q(friends__age__lt=F('age')))
)
vals["friends_younger_avg"] == 7.29

The last part is based on work of Michael Elsdörfer (miracle2k) in ticket #10972.

I would be grateful if somebody has the time to say if this is at all wanted in Django core. It is too soon to review the patch, but on the other hand I am not willing to give it the polish it needs if this patch has no hope of getting committed.

I also have a version which allows annotating fields without aggregation: qs.field_annotate(age_x2=F('age')*2). It is even more in proof-of-concept state than this patch. The idea is from ticket #14030.

Just a quick note: I have been thinking about this, and I am now -0 about the approach taken in my patch. I would rather work on an approach that would allow creating custom aggregates more easily. Currently creating custom aggregates is painful, and in many cases downright impossible.

Conditional aggregates are really useful, but they are just a special case of an aggregate that take parameters.

In a nutshell, we have orders which have a quantity of items, a total amount for the order, tax amount, and are related to a specific customer name. I need to output something like this:
Customer, Total Items (per customer), Total Orders (per customer), Total Amount (per customer), Total Tax (per customer)
UserA,6,3,$50,$5
UserB,34,2,$1800,$100

I would think this sort of thing to be a very common use case. It can be done to the queryset, but sql is good at this sort of thing, so I think it would be better to be able to generate this type of output as a queryset. I'm proselytizing for the integrating the support for conditional aggregates. I like django much better than Excel, but excel has these tools...

I have turned my mind on the -0. The approach is not the cleanest, but the approach could be included as is (minor refactoring needed, though). Later, when there is better support for custom aggregates and aggregates that take parameters, we could use that framework to implement the conditional aggregates. The user visible API would not change.

So, I will try to clean up the patch a bit. I hope there is enough interest from core developers that we can push this forward. IMHO this is very useful feature, and there seems to be others who think so, too.

This still needs documentation, more tests and then review(s). Helping by testing and reviewing the patch is the way to get this included. Though there is not much point until I have time to update the patch. I hope to do it this weekend.

I haven't had time to do anything about this patch. And I will not have time in the near future.

There seems to be at least one change I would like to make. Now, aggregates return sql, params from their .as_sql() but some other columns return just sql. That seems a little unclear, it would be better to return sql, params always from .as_sql(). Of course, the params could be empty.

Otherwise what this feature needs is people reviewing it. Just trying it out for your problem case is enough. Report if it worked in the way you expected, and if the API was sane for your use case. Try to also include a brief description of what you tried. That would help to verify that the feature as written is actually useful. Even if there is no documentation, that should not stop you from doing this. There are some tests in the end of the patch, and those should hopefully make it clear how it works.

Again, no knowledge of the ORM is required. Just try it and see if it does something sane. A very good guide to reviewing patches is available from ​PostgreSQL wiki. Of course, this is not directly applicable to Django, but still worth a read to see for things you can check.

I've updated the patch to work with 1.6, except for a single test which isn't passing. If anyone could take a look at it, I'd appreciate it. I lack the expertise and haven't had the time to make up for it, so any help is welcome.

The problem was that source = None ended up in a WhereNode -> hence the error. This was the TODO in the add_aggreate() method.

I added a _very_ basic field type resolving strategy, and now the patch passes all tests. If I recall correctly the #14030 work does have a much better solution to the field type resolving, and it might be wise to see if that one can be worked into master first.

I'll continue to try and hack away at this, however, I don't think the current patch can be easily applied. There have been some bug fixes to aggregates that removed some stuff that the original patch was using. I'll keep the idea of Q, F expressions being what modify Sum, Avg, etc.

I think that it's worth mentioning here that work on extending the usefulness of aggregates and expressions (#14030) has led to an API capable of constructing conditional aggregates by composing different functions. The support is experimental at the moment, and the 14030 patch needs thorough review, but the implementation might be of interest.