Django: Ticket #18748: Remove dupe-avoidance logic from the ORMhttps://code.djangoproject.com/ticket/18748
<p>
There is some complicated-looking and zero documented logic related to dupe avoidance in sql/query.py and sql/compiler.py. I began to wander what exactly the dupe avoidance does. It turns out it does nothing. I removed the dupe-avoidance, and all tests still pass.
</p>
<p>
The main benefit is easier to follow code. There should be a slight speed advantage, too, but it is so small it is unmeasurable. However I did confirm the removal of dupe-avoidance doesn't lead to speed problems using djangobench.
</p>
<p>
Patch can be found from: <a class="ext-link" href="https://github.com/akaariai/django/tree/dupe_away"><span class="icon">​</span>https://github.com/akaariai/django/tree/dupe_away</a>
</p>
<p>
I am marking this as RFC. I will likely commit this one soon if nobody complains.
</p>
en-usDjangohttps://www.djangoproject.com/s/img/site/hdr_logo.gifhttps://code.djangoproject.com/ticket/18748
Trac 1.2.2Alex GaynorFri, 10 Aug 2012 20:05:59 GMThttps://code.djangoproject.com/ticket/18748#comment:1
https://code.djangoproject.com/ticket/18748#comment:1
<p>
I think this may be to avoid duplicate JOINs, so it's not a correctness issue, but rather a performance issue, can you try some queries that look like they might go through that code and make sure the JOINs look ok?
</p>
TicketAnssi KääriäinenFri, 10 Aug 2012 20:33:57 GMTstage changedhttps://code.djangoproject.com/ticket/18748#comment:2
https://code.djangoproject.com/ticket/18748#comment:2
<ul>
<li><strong>stage</strong>
changed from <em>Ready for checkin</em> to <em>Accepted</em>
</li>
</ul>
<p>
Quick testing and seems this isn't the case. I added a join counter (how many times ' JOIN ' is found in the executed SQL) and ran queries, select_related, select_related_regress, many_to_many, aggregation_regress and select_related tests. All produce exactly as many joins.
</p>
<p>
Another reason to think this will not reduce joins is that the dedupe is used to mark joins to _exclude_ from reuse, so the dedupe would cause more joins to appear if anything. Although, the code is complex and it is sometimes hard to see what exactly is happening...
</p>
<p>
I will still downgrade this from RFC, this probably is a bit too large change to just push in...
</p>
TicketAlex GaynorFri, 10 Aug 2012 20:36:38 GMThttps://code.djangoproject.com/ticket/18748#comment:3
https://code.djangoproject.com/ticket/18748#comment:3
<p>
If this isn't a correctness issue, and it doesn't cause extra JOINs then I'm comfortable with you landing this, based on the analysis you've posted. I can take a look if you want, but I don't think that's necessary.
</p>
TicketAnssi KääriäinenFri, 10 Aug 2012 20:59:01 GMTstage changedhttps://code.djangoproject.com/ticket/18748#comment:4
https://code.djangoproject.com/ticket/18748#comment:4
<ul>
<li><strong>stage</strong>
changed from <em>Accepted</em> to <em>Ready for checkin</em>
</li>
</ul>
<p>
I am not too afraid re the code changes, the bigger question is if there is something the dedupe logic is doing and I just haven't spotted it. To me it seems there isn't. The code creates same amount of joins and the generated queries are correct, at least according to test suite.
</p>
<p>
Todays useless fact: There are 29176 instances of the word 'JOIN' in SQL executed by the testing framework (both with and without patch).
</p>
<p>
Marking again RFC.
</p>
TicketMalcolm TredinnickSat, 11 Aug 2012 04:07:38 GMThttps://code.djangoproject.com/ticket/18748#comment:5
https://code.djangoproject.com/ticket/18748#comment:5
<p>
Alex pinged me to take a look at this. I am *very* suspicious of removing exclusions from the "def join(...)" part. I have a memory it was important at the time. The reason for this being there may have vanished in the interim (it was 5 years ago and prior to the SQL-compiler changes), but I'll try to puzzle it out this weekend.
</p>
<p>
The reason for it would have been to avoid inadvertent merging of table joins when we need to force joining separate copies of the same table. On the other hand it's been around much longer than the real monster join bugs prior to 1.0, so the effect may be duplicated in other code, removing the need for this,
</p>
TicketAnssi KääriäinenSat, 11 Aug 2012 06:43:15 GMThttps://code.djangoproject.com/ticket/18748#comment:6
https://code.djangoproject.com/ticket/18748#comment:6
<p>
The real problem I was trying to solve when starting this was "what is this dedupe stuff really used for". The approach I used for finding that out was "lets just remove it and see if anything breaks". So, it wouldn't surprise me if something actually breaks...
</p>
TicketAnssi KääriäinenSat, 25 Aug 2012 11:44:26 GMThttps://code.djangoproject.com/ticket/18748#comment:7
https://code.djangoproject.com/ticket/18748#comment:7
<p>
I did some digging, and the dupe avoidance seems to be about cases where there are multiple paths from model A to model B, but through different columns. Something like this:
</p>
<pre class="wiki">class B:
pass
class A:
b1 = models.ForeignKey(B)
b2 = models.ForeignKey(B)
</pre><p>
However, this case is already handled in join() by having the column in the join identifier. That is, we will no accidentally reuse join through b1 for b2.
</p>
<p>
To me it seems there is no need for dupe avoidance any more.
</p>
TicketAnssi KääriäinenSun, 16 Sep 2012 21:25:41 GMThttps://code.djangoproject.com/ticket/18748#comment:8
https://code.djangoproject.com/ticket/18748#comment:8
<p>
OK, I now know what exclude was used for - when combining queries and the rhs query has two joins for the same connection we must exclude the alias used for the first connection. As an example:
</p>
<pre class="wiki">q1 = qs.filter(m2m__f=1)
q2 = qs.filter(m2m__f=1).filter(m2m__f=2)
</pre><p>
Now, we have two joins for m2m in q2, and specifically they have the same connections for the m2m joins. If we do (q1|q2), then we will reuse all joins for the first m2m join. If we don't exclude the alias used for the first m2m join, then we will again reuse the same join for the second m2m join. This would result in having just one join in the combined query. This is wrong, as the original rhs query had two joins.
</p>
<p>
I have done a fix for this in the <a class="ext-link" href="https://github.com/akaariai/django/compare/dupe_away"><span class="icon">​</span>https://github.com/akaariai/django/compare/dupe_away</a>, specifically in <a class="ext-link" href="https://github.com/akaariai/django/commit/da96154a72c4c4ca53cce198a82bb6af3f87ff64"><span class="icon">​</span>https://github.com/akaariai/django/commit/da96154a72c4c4ca53cce198a82bb6af3f87ff64</a> - note that the exclude arg is still gone, instead I use the reuse arg here.
</p>
<p>
I also reworked the .join() implementation (<a class="ext-link" href="https://github.com/akaariai/django/commit/b42280f45c7cd0fcde835ee4665e9d67327fda1d"><span class="icon">​</span>https://github.com/akaariai/django/commit/b42280f45c7cd0fcde835ee4665e9d67327fda1d</a>). I feel the code is much improved, but I can't claim to understand the existing code in full. I am afraid that there are other regressions lurking...
</p>
<p>
If anybody can review this I would much appreciate it!
</p>
TicketAnssi KääriäinenSun, 28 Oct 2012 20:42:30 GMThttps://code.djangoproject.com/ticket/18748#comment:9
https://code.djangoproject.com/ticket/18748#comment:9
<p>
I have updated the patch. It consist of two major parts:
</p>
<ul><li>dupe avoidance removal
</li><li>refactoring the signature and implementation of sql.Query.join()
</li></ul><p>
In a perfect world these should likely be two separate patches and tickets. However, I have the code ready in one patch, and I think now is a good time to commit this to master. Splitting this to two patches seems somewhat laborious for little benefit. The patch is still manageable in size.
</p>
<p>
Code available in: <a class="ext-link" href="https://github.com/akaariai/django/compare/ticket_18748"><span class="icon">​</span>https://github.com/akaariai/django/compare/ticket_18748</a>
</p>
<p>
The patch removes around 120 lines of code and adds 10 new comment lines.
</p>
<p>
Unless objections I will commit this to master.
</p>
TicketAnssi KääriäinenWed, 31 Oct 2012 06:45:35 GMTstatus changed; resolution sethttps://code.djangoproject.com/ticket/18748#comment:10
https://code.djangoproject.com/ticket/18748#comment:10
<ul>
<li><strong>status</strong>
changed from <em>new</em> to <em>closed</em>
</li>
<li><strong>resolution</strong>
set to <em>fixed</em>
</li>
</ul>
<p>
I managed to miss the "Fixed <a class="closed ticket" href="https://code.djangoproject.com/ticket/18748" title="#18748: Cleanup/optimization: Remove dupe-avoidance logic from the ORM (closed: fixed)">#18748</a>" from the commit message... So, fixed in <a class="changeset" href="https://code.djangoproject.com/changeset/68847135bc9acb2c51c2d36797d0a85395f0cd35" title="Removed dupe_avoidance from sql/query and sql/compiler.py
The dupe ...">[68847135bc9acb2c51c2d36797d0a85395f0cd35]</a>.
</p>
Ticket