SUM, DISTINCT, and Eagerly-Loaded Associations

Sep 22, 2016

rails

A common task in application is to list a collection of records along with a
summary such as the sum of a particular attribute across the collection. For
example, imagine we have Job model which has a numeric rate attribute. In a
jobs#index action, we might list the jobs in the body of a table, and display
the sum of the rates in the table footer.

What’s happening is a row is returned for every existing Job-Category
combination. If a Job belongs to one Category, one row is returned; if a Job
belongs to two Categories, two rows are returned, etc. (Because we are doing a
LEFT OUTER JOIN, if a Job belongs to zero Categories, we still get one row
for the Job itself.) Let’s look at a different query to see the actual result
set:

Great! Now how do we make the test pass? Well, the goal is to calculate the
SUM of the rates for each distinct job in the result set. In other words,
when considering the query that loads the Jobs, we want to calculate the SUM
for the Jobs whose IDs are in the original result set.
ActiveRecord::Base.pluck
seems like it might be useful here.

Well, we can see why it works, but I have a nagging concern about the way this
works. It’s performing two separate queries. The first query looks up the Job
IDs and loads them into an Array. The second query performs the actual SUM,
passing the Array of Job IDs in as a parameter. We only have two Jobs in the
database during the test, but what will it look like in production? What if we
have 100,000 Jobs? Well, it will perform the first query, creating an Array of
100,000 IDs, and then it will perform the second query, passing in a
100,000-element Array as a query parameter. Yikes.

The thing is, SQL is a clever language. You don’t have to perform two separate
queries, and you don’t have to construct an arbitrarily long Array in order to
perform this task, when you can just use a sub-select. The SQL for such a query
might look something like this:

SELECTSUM(rate)FROMjobsWHEREidIN(SELECTDISTINCTidFROMjobs)

So what’s going on? ActiveRecord is able to generate sub-selects. Why isn’t it
doing it here? Well, pluck is a special method; it executes immediately and
always returns an Array. We don’t actually want to use pluck in this
situation. We want to use
ActiveRecord::Base.select.

Bingo! There’s the sub-select. So does that mean we’re done? Well, not quite.

What happened to the JOIN clauses?

ActiveRecord provides two different ways for joining additional tables in a
query:
ActiveRecord::Base.joins.
and
ActiveRecord::Base.includes.
Which method you choose depends on the reason you are joining the additional
tables. You use joins when you just need to join the table in the query in
order to reference columns from that other table. You use includes when you
want to eagerly load the referenced association when constructing the result
set, say to avoid an N+1 query problem.

The joins method defaults to performing an INNER JOIN. Rails 5 added
ActiveRecord::Base.left_outer_joins
(aliased as left_joins) to perform a LEFT OUTER JOIN instead. If you are
using Rails 4.2.x or earlier, you can still perform a LEFT OUTER JOIN, but
you have to pass a SQL String to the joins call.

ActiveRecord omits the JOIN clauses from the sub-select, because we created
them using includes. We chose includes because we wanted to eagerly load
the Categories when we loaded the Job. This makes sense for the goal of loading
and displaying a collection of Jobs, but there’s no reason to load the
Categories in Job.total_rate because we aren’t instantiating the Jobs; we’re
just performing a SUM. That was good enough to make the test pass, but will
that always work?

Eager Loading and Conditions

Let’s consider a new scenario. This time, instead of calculating the total rate
for all Jobs, let’s calculate the total rate for Jobs that belong to a
particular category. Here’s a test to demonstrate:

In this case, we are filtering Jobs for those that belong to Categories with
the name “category 1”. As only one Job belongs to that Category, we expect the
total rate to be 100. Let’s run the test and see what happens.

Recall that in the previous example, ActiveRecord has simplified the sub-select
by removing the JOIN clauses. That wasn’t a problem before, but now, in this
scenario, we need the JOIN clauses to be preserved within the sub-select to
ensure the categories table is available for our search.

Earlier we talked about joins and left_outer_joins. If we used those
instead of includes, would the JOIN clauses be preserved? Lets’ modify the
test and find out.

Hey! It works. So is that enough? No. As mentioned before, the point of using
includes is to eagerly load associations. If we called
Job.left_outer_joins(:categories) from a controller action, the categories
table will be available in Job.total_rate, but the Categories will not be
eagerly loaded. If we start displaying Categories, we will set off a flurry of
additional SQL queries to load the Categories one by one — the stereotypical
N+1 query.

Okay, so we really do want to use includes. How can we get Job.total_rate
to work? Well, as it happens, an ActiveRecord::Relation has methods for
introspection into the parameters of its query. In this case, there’s an
accessor method named includes_values; each time you call includes the
associations you reference are added to includes_values.

Job.includes(:categories).includes_values# => [:categories]

WARNING The includes_values method is not mentioned in the Rails API
documentation, so it may be that it’s not meant for public use. I can’t promise
this will always work. We’re living dangerously.

What if we modified Job.total_rate to use includes_values and
left_outer_joins?

Eager Loading, Conditions, and Pagination

Even in a modest-sized application, there are times when you want to paginate a
collection to keep response times and memory usage low. If our database has
100,000 Jobs, we don’t want to instantiate them all at once. But that doesn’t
mean we don’t want to consider the entire result set in the footer. We may only
be display 25 Jobs per page, but we probably want to know the total rate for
all the matching Jobs.

The simplest way to paginate results is by adding LIMIT and OFFSET clauses
to our query. (Often this is not the most efficient
way, but it
is the most common, and so this is the approach I will be addressing below.)
The kaminari gem is my go-to gem for
adding pagination to a Rails application.

2.times{Job.create!}Job.count# => 2Job.page(1).per(1).count# => 1

So our goal is to ensure that Job.total_rate performs the SUM across all
matching jobs regardless of what page we are looking at. Let’s write a test to
see what happens when we use pagination.

Those LIMIT and OFFSET clauses are limiting the SUM. What’s a developer
to do?

Well,
ActiveRecord::Base.except
can be very useful here. The except method allows us to skip certain types of
query conditions. In this case, we want to drop the LIMIT and OFFSET
conditions. Let’s rewrite Job.total_rate one more time.

I’m now using except to drop the LIMIT and OFFSET clauses. While I’m at
it, I’m also extracting a new method, Job.summable. Job.total_rate was
becoming a little hairy for my tastes. I also moved distinct.select(:id)
because I felt the result was a bit more intuitive.