Magnus Hagander's PostgreSQL blog (Entries tagged as sql)
http://blog.hagander.net/

Finding gaps in partitioned sequences
http://blog.hagander.net/archives/203-Finding-gaps-in-partitioned-sequences.html
PostgreSQL
Magnus Hagander
<p>There are an almost unlimited number of articles on the web about how to find gaps in sequences in SQL. And it doesn't have to be very hard. Doing it in a "partitioned sequence" makes it a bit harder, but still not very hard. But when I turned to a window aggregate to do that, I was immediately told "hey, that's a good example of a window aggregate to solve your daily chores, you should blog about that". So here we go - yet another example of finding a gap in a sequence using SQL.</p>
<p>I have a database that is very simply structured - it's got a primary key made out of <i>(groupid, year, month, seq)</i>, all integers. On top of that it has a couple of largish text fields and an fti field for full text search. (Initiated people will know right away which database this is). The sequence in the seq column resets to zero for each combination of <i>(groupid, year, month)</i>. And I wanted to find out where there were gaps in it, and how big they were, to debug the tool that wrote the data into the database. This is really easy with a window aggregate:</p>
<pre><code>SELECT * FROM (
   SELECT
      groupid,
      year,
      month,
      seq,
      seq-lag(seq,1) OVER (PARTITION BY groupid, year, month ORDER BY seq) AS gap
   FROM mytable
) AS t
WHERE NOT (t.gap=1)
ORDER BY groupid, year, month, seq
</code></pre>
<p>One advantage to using a window aggregate for this is that we actually get the whole row back, and not just the primary key - so it's easy enough to include all the data you need to figure something out.</p>
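<p>To see what the query returns, here is a minimal sketch that runs it against a tiny in-memory table. It assumes Python's bundled <i>sqlite3</i> module (with SQLite 3.25 or later for window function support) as a stand-in for PostgreSQL, and the sample rows are made up for illustration. Note that the first row of each partition has no predecessor, so its gap is NULL and the <strong>WHERE</strong> clause filters it out. That means a sequence that fails to start at zero would not be reported by this query.</p>

```python
import sqlite3

# Hypothetical sample data as (groupid, year, month, seq); not from the real database.
ROWS = [
    (1, 2012, 1, 0), (1, 2012, 1, 1), (1, 2012, 1, 2), (1, 2012, 1, 5),
    (1, 2012, 2, 0), (1, 2012, 2, 1),
    (2, 2012, 1, 0), (2, 2012, 1, 3), (2, 2012, 1, 4),
]

# Same query as in the post; lag() looks at the previous seq within each partition.
GAP_QUERY = """
SELECT * FROM (
   SELECT groupid, year, month, seq,
          seq - lag(seq, 1) OVER (PARTITION BY groupid, year, month ORDER BY seq) AS gap
   FROM mytable
) AS t
WHERE NOT (t.gap = 1)
ORDER BY groupid, year, month, seq
"""


def find_gaps(rows):
    """Load rows into an in-memory table and return the rows that follow a gap."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mytable (groupid INTEGER, year INTEGER, month INTEGER, "
        "seq INTEGER, PRIMARY KEY (groupid, year, month, seq))"
    )
    conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(GAP_QUERY).fetchall()
    conn.close()
    return result


gaps = find_gaps(ROWS)
for row in gaps:
    print(row)
# (1, 2012, 1, 5, 3)
# (2, 2012, 1, 3, 3)
```

<p>Each returned row is the first row <i>after</i> a gap, and the gap column is the distance from the previous seq, so a value of 3 means two sequence numbers are missing.</p>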
<p>What about performance? I don't really have a big database to test this on, so I can't say for sure. It's going to be a sequential scan, since I look at the <i>whole</i> table, and not just parts of it. It takes about 4 seconds to run over a table of about a million rows (2.7 GB), on a modest VM with no actual I/O capacity to speak of and a very limited amount of memory, returning about 100 rows. That's certainly fast enough for me in this case.</p>
<p>And as a bonus, it found me two bugs in the loading script and at least one bug in somebody else's code that I'm now waiting on to get fixed...</p>
Fri, 27 Jan 2012 16:53:52 +0000
Tags: postgresql, sql, window aggregates

Getting a range of entries centered around a point
http://blog.hagander.net/archives/147-Getting-a-range-of-entries-centered-around-a-point.html
PostgreSQL
Magnus Hagander
<p>I had a question yesterday on an internal IRC channel from one of my colleagues in Norway about a SQL query that would "for a given id value, return the 50 rows centered around the row with this id", where the id column can contain gaps (either because they were inserted with gaps, or because there are further <strong>WHERE</strong> restrictions in the query).</p>
<p>I came up with a reasonably working solution fairly quickly, but I made one mistake. For fun, I asked around a number of my PostgreSQL contacts on IM and IRC for their solutions, and it turns out that almost everybody made the exact same mistake at first. I'm pretty sure all of them, like me, would've found and fixed that issue within seconds if they were in front of a psql console. But I figured that was a good excuse to write a blog post about it.</p>
<p>The solution itself becomes pretty simple if you rephrase the problem as "for a given id value, return the 25 rows preceding and the 25 rows following the row with this id". That pretty much spells a <strong>UNION</strong> query. Thus, the solution to the problem is:</p>
<pre><code>    SELECT * FROM (
        SELECT id,field1,field2 FROM mytable WHERE id &gt;= 123456 ORDER BY id LIMIT 26
    ) AS a
UNION ALL
    SELECT * FROM (
        SELECT id,field1,field2 FROM mytable WHERE id &lt; 123456 ORDER BY id DESC LIMIT 25
    ) AS b
ORDER BY id;
</code></pre>
<p>The mistake everybody made? Forgetting that each part needs to be wrapped in a subselect (or, in PostgreSQL, at least in parentheses) in order to use <strong>LIMIT</strong>. Without that, you can't put <strong>ORDER BY</strong> or <strong>LIMIT</strong> inside the two separate parts of the query, only at the outer end of it, where they apply to the combined result. But we specifically need to apply the <strong>LIMIT</strong> to each part individually, and the <strong>ORDER BY</strong> needs to be different for the two parts.</p>
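<p>As a rough way to try this out without a PostgreSQL server, the sketch below runs the same query shape through Python's <i>sqlite3</i> module. The table contents, the single <i>field1</i> column and the <i>rows_around</i> helper are assumptions made up for the example. The ids are multiples of 3, so the id column has gaps.</p>

```python
import sqlite3


def rows_around(conn, center_id, before=25, after=25):
    """Return the row with center_id plus up to `before` rows below and `after` rows above it."""
    sql = """
        SELECT * FROM (
            SELECT id, field1 FROM mytable WHERE id >= ? ORDER BY id LIMIT ?
        ) AS a
        UNION ALL
        SELECT * FROM (
            SELECT id, field1 FROM mytable WHERE id < ? ORDER BY id DESC LIMIT ?
        ) AS b
        ORDER BY id
    """
    # The first part includes the center row itself, hence after + 1.
    return conn.execute(sql, (center_id, after + 1, center_id, before)).fetchall()


# Hypothetical test data: ids 0, 3, 6, ..., 597.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, field1 TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(i, f"row {i}") for i in range(0, 600, 3)],
)

window = rows_around(conn, 150)
ids = [row[0] for row in window]
print(len(ids), ids[0], ids[-1])
# 51 75 225
```

<p>Near either end of the table one of the two parts simply returns fewer rows, so the result is shorter than 51 rows rather than being padded from the other side.</p>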
<p>Another question I got around this was why I use <strong>UNION ALL</strong>. We know, after all, that there are no overlapping rows, so the result should be the same as for <strong>UNION</strong>. And this is exactly the reason why <strong>UNION ALL</strong> should be used, rather than a plain <strong>UNION</strong>. <i>We</i> know it - the database doesn't. A <strong>UNION</strong> query will generate a plan that requires an extra <i>unique</i> node at the top, to remove any duplicate rows. So the tip here is - <i>always</i> use <strong>UNION ALL</strong> rather than <strong>UNION</strong> whenever you <i>know</i> that the results are not overlapping.</p>
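<p>The difference in semantics is easy to check on a toy table. This is only a sketch using Python's <i>sqlite3</i> module with made-up data and a hypothetical <i>run</i> helper. It shows results, not query plans, and the planner behaviour described above is specific to PostgreSQL.</p>

```python
import sqlite3

# Hypothetical table with ids 1..10.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO mytable VALUES (?)", [(i,) for i in range(1, 11)])


def run(operator, lower_bound, upper_bound):
    """Combine 'id >= lower_bound' and 'id < upper_bound' with UNION or UNION ALL."""
    if operator not in ("UNION", "UNION ALL"):
        raise ValueError("operator must be UNION or UNION ALL")
    sql = f"""
        SELECT id FROM mytable WHERE id >= ?
        {operator}
        SELECT id FROM mytable WHERE id < ?
        ORDER BY id
    """
    return [row[0] for row in conn.execute(sql, (lower_bound, upper_bound))]


# Non-overlapping parts: both operators return the same rows.
disjoint_all = run("UNION ALL", 6, 6)
disjoint_union = run("UNION", 6, 6)
print(disjoint_all == disjoint_union)  # True

# Overlapping parts: UNION removes the duplicates, UNION ALL keeps them.
overlap_all = run("UNION ALL", 5, 7)
overlap_union = run("UNION", 5, 7)
print(overlap_all)    # [1, 2, 3, 4, 5, 5, 6, 6, 7, 8, 9, 10]
print(overlap_union)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

<p>Since the database cannot tell the first case from the second in advance, a plain <strong>UNION</strong> has to do the deduplication work every time.</p>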
<p>All things considered, this query produces a pretty quick plan even for large datasets, since it allows us to do two independent index scans, one backwards. Since there are <strong>LIMIT</strong> nodes on the scans, they will stop running as soon as they have produced the required number of rows, which is going to be very small compared to the size of the table. This is the query plan I got on my test data:</p>
<pre><code> Sort  (cost=54.60..54.73 rows=51 width=86)
   Sort Key: id
   -&gt;  Append  (cost=0.00..53.15 rows=51 width=86)
         -&gt;  Limit  (cost=0.00..35.09 rows=26 width=51)
               -&gt;  Index Scan using mytable_pk on mytable  (cost=0.00..55425.06 rows=41062 width=51)
                     Index Cond: (id &gt;= 100000)
         -&gt;  Limit  (cost=0.00..17.04 rows=25 width=51)
               -&gt;  Index Scan Backward using mytable_pk on mytable  (cost=0.00..56090.47 rows=82306 width=51)
                     Index Cond: (id &lt; 100000)
</code></pre>
<p>And yes, the final <strong>ORDER BY</strong> is still needed if we want the total result to come out in the correct order. With the default query plan, it will come out in the wrong order after the <i>append</i> node. But it's important to remember that by the specification the database is free to return the rows in <i>any order it chooses</i> unless there is an explicit <strong>ORDER BY</strong> in the query. The rows may otherwise be returned in a completely different order between different runs, depending on the size/width of the table and other parameters.</p>
Fri, 05 Jun 2009 13:26:00 +0000
Tags: limit, postgresql, sql, union