Getting an advance list of RowIDs for a query result set

Getting an advance list of RowIDs for a query result set

For a complex query you can often get a list of the base table RowIDs very quickly with a simple query and then use an array of those values (along with the carray virtual table) to retrieve sections of data from the complex query almost instantly. I've been doing this for a while but would love to find a way of doing it automatically. The problem is as follows

and I want a list of the BaseTbl's rowid's for all records returned by the above query in
the order defined by ORDER BY. I also want it in the fastest possible way. In other words I
want

SELECT BaseTbl.RowID
FROM BaseTbl
jointype_1 Tbl1 on comparison_1
jointype_2 Tbl2 on comparison_2
.
.
jointype_n Tbln on comparison_n
WHERE ....
ORDER BY ....

with all redundant tables removed.

The question is, how to determine what tables are redundant?

Obviously any table involved in the WHERE or ORDER BY has to be retained aa does any table
acting as an intermediate to join those tables back to the BaseTbl.

Of the other tables I'm thinking I can discard table_i provided it's joined to the BaseTbl by a table_i unique index and comparison_i covers all columns of that index. It's fairly obvious if jointype_i == left join but I'm thinking it works also for inner join. Can anyone give me a counter example?

Anyone think of other ways?

I'm also wondering if the result produced by 'explain query plan' could simplify the coding although I know use of that isn't recommended as the structure of it is not guaranteed.

but the first way will be faster and, as the record count - ID.size() - is
known , it would allow a
record counter to be shown, vertical scrollbar to be sized, last known
record to be relocated etc.

Anyway, what I'm trying to do is find a way of doing the same thing for more
complicated
queries. Given a select where a 'base table' is attached to lookup tables
how can I determine which of the lookup tables can be removed from the table
such that

Re: Getting an advance list of RowIDs for a query result set

> Given a select where a 'base table' is attached to lookup tables
> how can I determine which of the lookup tables can be removed from the table
> such that
>
> select BaseTbl.RowID from ... where ... order by ...
>
> will find the set of records that represents the original query.

That optimization could be done at the level of the SQL engine. You wouldn’t want to do it inside your own code since that would make your code extremely complicated. So just execute the query without trying to optimize it and see what happens. Does it run fast enough for your users ?

Re: Getting an advance list of RowIDs for a query result set

Simon, I’ve no users. I’ve been teaching myself c++ (and SQLite) during a lengthy illness so this is just experimentation for me.

This subject touches on a previous question of mine you were involved in regarding redundant tables. The following may jog your memory

create table TblA(A integer primary key, B int, C int);

create table TblB(B integer primary key, BX int);

create table TblC(C integer primary key, CX int);

explain query plan select A from TblA left join TblB using (B) left join TblC using (C) where BX=?;

0|0|0|SCAN TABLE TblA

0|1|1|SEARCH TABLE TblB USING INTEGER PRIMARY KEY (rowid=?)

explain query plan select A from TblA left join TblB using (B) left join TblC using (C) where CX=?;

0|0|0|SCAN TABLE TblA

0|1|1|SEARCH TABLE TblB USING INTEGER PRIMARY KEY (rowid=?)

0|2|2|SEARCH TABLE TblC USING INTEGER PRIMARY KEY (rowid=?)

In the first explain SQLite drops the trailing redundant table but in the second explain it doesn’t drop the middle redundant table.

As TblB is included in the second query it must surely run slower than if it were omitted.

I’m not complaining about the SQLite optimiser failing to spot the redundancy as it’s got to deal with a variety of queries far removed from my narrow experiment. Checking for such redundancies would likely slow down prepares and, when it comes down to it, anyone including TblB in the second query is only getting what they asked for.

I do think though that it’s possible to write code to remove these redundancies so as to get the vector of RowIDs as fast as possible. So far I’ve been splitting SQL into ‘RowSQL’ (returns RowIDs involved in correct order) and ‘ColSQL’ (returns columns requested in original SQL for the requested range as shown in my second post) but I’ve only been doing it visually via knowledge of the tables. What I’m trying to do is write a function to automatically ‘split’ the sql into RowSQL and ColSQL. I’ll make another post later showing where I’m at with that.

> Given a select where a 'base table' is attached to lookup tables
> how can I determine which of the lookup tables can be removed from the table
> such that
>
> select BaseTbl.RowID from ... where ... order by ...
>
> will find the set of records that represents the original query.

That optimization could be done at the level of the SQL engine. You wouldn’t want to do it inside your own code since that would make your code extremely complicated. So just execute the query without trying to optimize it and see what happens. Does it run fast enough for your users ?

Re: Getting an advance list of RowIDs for a query result set

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.

>-----Original Message-----
>From: sqlite-users [mailto:sqlite-users-
>[hidden email]] On Behalf Of x
>Sent: Saturday, 25 November, 2017 10:08
>To: SQLite mailing list
>Subject: Re: [sqlite] Getting an advance list of RowIDs for a query
>result set
>
>Simon, I’ve no users. I’ve been teaching myself c++ (and SQLite)
>during a lengthy illness so this is just experimentation for me.
>
>
>
>This subject touches on a previous question of mine you were involved
>in regarding redundant tables. The following may jog your memory
>
>
>
>create table TblA(A integer primary key, B int, C int);
>
>create table TblB(B integer primary key, BX int);
>
>create table TblC(C integer primary key, CX int);
>
>
>
>explain query plan select A from TblA left join TblB using (B) left
>join TblC using (C) where BX=?;
>
>0|0|0|SCAN TABLE TblA
>
>0|1|1|SEARCH TABLE TblB USING INTEGER PRIMARY KEY (rowid=?)
>
>
>
>explain query plan select A from TblA left join TblB using (B) left
>join TblC using (C) where CX=?;
>
>0|0|0|SCAN TABLE TblA
>
>0|1|1|SEARCH TABLE TblB USING INTEGER PRIMARY KEY (rowid=?)
>
>0|2|2|SEARCH TABLE TblC USING INTEGER PRIMARY KEY (rowid=?)
>
>
>
>In the first explain SQLite drops the trailing redundant table but in
>the second explain it doesn’t drop the middle redundant table.
>
>
>
>As TblB is included in the second query it must surely run slower
>than if it were omitted.
>
>
>
>I’m not complaining about the SQLite optimiser failing to spot the
>redundancy as it’s got to deal with a variety of queries far removed
>from my narrow experiment. Checking for such redundancies would
>likely slow down prepares and, when it comes down to it, anyone
>including TblB in the second query is only getting what they asked
>for.
>
>
>
>I do think though that it’s possible to write code to remove these
>redundancies so as to get the vector of RowIDs as fast as possible.
>So far I’ve been splitting SQL into ‘RowSQL’ (returns RowIDs involved
>in correct order) and ‘ColSQL’ (returns columns requested in original
>SQL for the requested range as shown in my second post) but I’ve only
>been doing it visually via knowledge of the tables. What I’m trying
>to do is write a function to automatically ‘split’ the sql into
>RowSQL and ColSQL. I’ll make another post later showing where I’m at
>with that.
>
>________________________________
>From: sqlite-users <[hidden email]> on
>behalf of Simon Slavin <[hidden email]>
>Sent: Saturday, November 25, 2017 1:26:00 PM
>To: SQLite mailing list
>Subject: Re: [sqlite] Getting an advance list of RowIDs for a query
>result set
>
>
>
>On 25 Nov 2017, at 1:15pm, curmudgeon <[hidden email]> wrote:
>
>> Given a select where a 'base table' is attached to lookup tables
>> how can I determine which of the lookup tables can be removed from
>the table
>> such that
>>
>> select BaseTbl.RowID from ... where ... order by ...
>>
>> will find the set of records that represents the original query.
>
>That optimization could be done at the level of the SQL engine. You
>wouldn’t want to do it inside your own code since that would make
>your code extremely complicated. So just execute the query without
>trying to optimize it and see what happens. Does it run fast enough
>for your users ?
>
>Simon.
>_______________________________________________
>sqlite-users mailing list
>[hidden email]>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users>_______________________________________________
>sqlite-users mailing list
>[hidden email]>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: Getting an advance list of RowIDs for a query result set

Update the omit-table-from-left-join optimization so that it can omit tables from the middle of the join as well as the end.

Deals with the case I mentioned in my last post but this one is also interesting

Fix a problem preventing the planner from identifying scans that visit at most one row in cases where that property is guaranteed by a unique, not-null, non-IPK column that is the leftmost in its table

The way I was going to tackle the problem of redundant tables was as follows.

SELECT BaseTbl.RowID
FROM BaseTbl
jointype_1 Tbl1 on comparison_1
jointype_2 Tbl2 on comparison_2
.
.
jointype_n Tbln on comparison_n
WHERE ....
ORDER BY ....

I proceed as follows

1. Omit a table join from the SQL and try preparing it.
2. If it prepares OK then the table isn’t involved in the WHERE or ORDER BY.
3. If it’s joined to the BaseTbl by an integer primary key or FULLY joined by a unique index then the table is redundant. By FULLY I mean ALL fields of the index are included in the join as otherwise the BaseTbl’s relationship with it is of a ONE TO MANY nature which means the BaseTbl RowID won’t uniquely identify a row of the original query. [Two things I’m unsure about are a) how nulls affect unique index joins and b) how to deal with tables that aren’t directly linked to the BaseTbl (i.e. they’re linked via an intermediate table)].
4. If the table is needed reintroduce it into the SQL.

Do this in turn for each of the joins.

I’m wondering if the second of those trunk changes is in any way related to what I’m trying to do. The above is a bit long winded and not easy to code so it would be great if the SQLite query optimizer did it all for me.

To me the time taken to grab a grid page of data is negligible if you know where to look for it on disc. I tend therefore to time queries by how fast I can get all the BaseTbl RowIDs into a vector. The biggest table in my database has 2.4 million rows and yet wait cursors are a very rare sight.

Re: Getting an advance list of RowIDs for a query result set

>
> I proceed as follows
>
>
> 1. Omit a table join from the SQL and try preparing it.
> 2. If it prepares OK then the table isn’t involved in the WHERE or
> ORDER BY.
> 3. If it’s joined to the BaseTbl by an integer primary key or
> FULLY joined by a unique index then the table is redundant. By FULLY
> I mean ALL fields of the index are included in the join as otherwise
> the BaseTbl’s relationship with it is of a ONE TO MANY nature which
> means the BaseTbl RowID won’t uniquely identify a row of the
> original query. [Two things I’m unsure about are a) how nulls affect
> unique index joins and b) how to deal with tables that aren’t
> directly linked to the BaseTbl (i.e. they’re linked via an
> intermediate table)].
> 4. If the table is needed reintroduce it into the SQL.
>
> Do this in turn for each of the joins.
>

If step 3 is xxx-ed and only left-joins remain to be considered then
the SQLite3 engine is likely to fullfill this optimization in its
next release.
I tested this on the current (2017-11-17) pre-release snapshot. As
far as I see any outer joined table may be discarded from the query
plan if only rowid from the base table is selected. Using SELECT
DISTINCT applies this also to one-to-many joins.

> I’m wondering if the second of those trunk changes is in any way
> related to what I’m trying to do. The above is a bit long winded and
> not easy to code so it would be great if the SQLite query optimizer
> did it all for me.
>
It looks relevant, but I did not test that.

> To me the time taken to grab a grid page of data is negligible if
> you know where to look for it on disc. I tend therefore to time
> queries by how fast I can get all the BaseTbl RowIDs into a vector.
> The biggest table in my database has 2.4 million rows and yet wait
> cursors are a very rare sight.

Re: Getting an advance list of RowIDs for a query result set

>If step 3 is xxx-ed and only left-joins remain to be considered then
>the SQLite3 engine is likely to fullfill this optimization in its
>next release.
>I tested this on the current (2017-11-17) pre-release snapshot. As
>far as I see any outer joined table may be discarded from the query
>plan if only rowid from the base table is selected. Using SELECT
>DISTINCT applies this also to one-to-many joins.

Thanks E. Pasma. Fixing the left join optimisation will be enough for me and will allow me to delete a fair amount of difficult code. My confusion over the inner joins arose out of trying to find a way of ensuring the BaseTbl RowIDs wouldn’t contain any duplicates. Somewhere along the line I forgot SQLite would have to still check the record existed in the inner joined table.

Re: Getting an advance list of RowIDs for a query result set

x wrote:

>
> From: E.Pasma<mailto:[hidden email]>
> Sent: 26 November 2017 17:30
> To: SQLite mailing list<mailto:[hidden email]>
> Subject: Re: [sqlite] Getting an advance list of RowIDs for a query
> result set
>
>> If step 3 is xxx-ed and only left-joins remain to be considered then
>> the SQLite3 engine is likely to fullfill this optimization in its
>> next release.
>> I tested this on the current (2017-11-17) pre-release snapshot. As
>> far as I see any outer joined table may be discarded from the query
>> plan if only rowid from the base table is selected. Using SELECT
>> DISTINCT applies this also to one-to-many joins.
>
> Thanks E. Pasma. Fixing the left join optimisation will be enough
> for me and will allow me to delete a fair amount of difficult code.
> My confusion over the inner joins arose out of trying to find a way
> of ensuring the BaseTbl RowIDs wouldn’t contain any duplicates.
> Somewhere along the line I forgot SQLite would have to still check
> the record existed in the inner joined table.

Re: Getting an advance list of RowIDs for a query result set

So if I build a view that includes look-ups in other tables, the
optimizer may skip these at places where not selected. However only if
the look-ups are written as outer joins. Then it may be good practice
allways doing that. For instance:

Re: Getting an advance list of RowIDs for a query result set

>So if I build a view that includes look-ups in other tables, the
>optimizer may skip these at places where not selected. However only if
>the look-ups are written as outer joins. Then it may be good practice
>allways doing that. For instance:

Yeah, I was thinking always to use left joins when it doesn’t make any difference. I’m using a C++ wrapper for SQLite I wrote myself. It contains a QueryGrid type that automatically splits the SQL into RowSQL and ColSQL as described earlier. It is still possible to build a vector of BaseTbl.RowIDs when the original query contains inner joins provided all columns of the unique index used in the join are satisfied (see ***Example). If it’s unable to do that (due to 1 to many joins) it computes ‘select count(*) from .....’ and sets ColSQL to ‘select ColList from ... limit ?1, ?2’ so that it still fits in with the ‘fetch a range of data’ modus operandi and also allows a record counter and vertical scrollbar positioning.

I’ve still to write code to check the query contains no joins that are ‘1 to many’. I don’t suppose there’s an easy way of determining this from explain or explain query plan? At the moment all i can think of is to scrape the index name from the explain query plan Detail column, check the number of variables involved (i.e. number of ? marks) and compare with the number of columns in the index.

Re: Getting an advance list of RowIDs for a query result set

Op 27 nov 2017, om 20:51 heeft x het volgende geschreven:

>> So if I build a view that includes look-ups in other tables, the
>> optimizer may skip these at places where not selected. However only
>> if
>> the look-ups are written as outer joins. Then it may be good practice
>> allways doing that. For instance:
>
>> create view vtrack as
>> select trackname, artistname
>> from track
>> left join artist ON trackartist=artistid -- note left join
> ;
> ~
> ~
>
> Yeah, I was thinking always to use left joins when it doesn’t make
> any difference. I’m using a C++ wrapper for SQLite I wrote myself.
> It contains a QueryGrid type that automatically splits the SQL into
> RowSQL and ColSQL as described earlier. It is still possible to
> build a vector of BaseTbl.RowIDs when the original query contains
> inner joins provided all columns of the unique index used in the
> join are satisfied (see ***Example). If it’s unable to do that (due
> to 1 to many joins) it computes ‘select count(*) from .....’ and
> sets ColSQL to ‘select ColList from ... limit ?1, ?2’ so that it
> still fits in with the ‘fetch a range of data’ modus operandi and
> also allows a record counter and vertical scrollbar positioning.
>
> ***Example
>
> select ColList
> from BaseTbl
> left join Tbl_1 on Tbl_1.X = BaseTbl.X
> Inner join Tbl_2 on Tbl2.Y = BaseTbl.Y
> left join Tbl_3 on Tbl3.Z = BaseTbl.Z
> where BaseTbl.Col=?
>
> Provided Tbl_2 has unique index on Y the RowID list could be had from
>
> Select BaseTbl.RowID
> from BaseTbl
> left join Tbl_1 on Tbl_1.X = BaseTbl.X
> Inner join Tbl_2 on Tbl2.Y = BaseTbl.Y
> left join Tbl_3 on Tbl3.Z = BaseTbl.Z
> where BaseTbl.Col=?
>
> which the SQLite query optimiser would presumably reduce to
>
> Select BaseTbl.RowID
> from BaseTbl
> Inner join Tbl_2 on Tbl2.Y = BaseTbl.Y
> where BaseTbl.Col=?
>
> I’ve still to write code to check the query contains no joins that
> are ‘1 to many’. I don’t suppose there’s an easy way of determining
> this from explain or explain query plan? At the moment all i can
> think of is to scrape the index name from the explain query plan
> Detail column, check the number of variables involved (i.e. number
> of ? marks) and compare with the number of columns in the index.

What about changing the remaining inner join to left join

Select BaseTbl.RowID
from BaseTbl
left join Tbl_2 on Tbl2.Y = BaseTbl.Y
where BaseTbl.Col=?

and see if the SQLiter optimizer now leaves Tbl_2 out from the query
plan. It will only do that if it is not a 1-to-n join.
But that leaves you with the change that the join is 1-to-0 so to say.
I understood that was a show stopper.

Re: Getting an advance list of RowIDs for a query result set

>and see if the SQLiter optimizer now leaves Tbl_2 out from the query
>plan. It will only do that if it is not a 1-to-n join.

If Tbl_2 isn’t involved in the columns, where or order by then changing it to left join will mean it will definitely be left out so I don’t get what you mean E.Pasma.

I’ve now realised what caused my earlier confusion regarding inner joins. I was getting mixed up with foreign keys which do offer a way of finding redundant inner joins.

If a Tbl (which is not involved in the query columns, where or order by) is inner joined to BaseTbl on all the columns of a unique index and it turns out there is a foreign key matching that join then the Tbl can be left out if the BaseTbl columns are defined as NOT NULL. If they are allowed to be NULL then it can still be left out provided the NOT NULL condition(s) is added to the where.

Select BaseTbl.RowID
from BaseTbl
where BaseTbl.Col=? and BaseTbl.Y IS NOT NULL

Could someone confirm I’ve got the above right?

If I’m correct, I wonder if the optimiser takes (or could take) this into account. As lookup tables are probably the main use of foreign keys you’d think there’d be no shortage of such joins. That said, I suppose if you’re aware of what you want the optimiser to do it wouldn’t offer any advantage over left joining and adding the appropriate NOT NULL constraint(s) to the WHERE.

Re: Getting an advance list of RowIDs for a query result set

E.Pasma wrote

>> What about changing the remaining inner join to left join
>
>> Select BaseTbl.RowID
>> from BaseTbl
>> left join Tbl_2 on Tbl2.Y = BaseTbl.Y
>> where BaseTbl.Col=?
>
>> and see if the SQLiter optimizer now leaves Tbl_2 out from the query
>> plan. It will only do that if it is not a 1-to-n join.

I replied

> If Tbl_2 isn’t involved in the columns, where or order by then
> changing it to left join will mean it will definitely be left out so
> I don’t get what you mean E.Pasma.

I’m talking bollocks again. The left join will only be omitted if it has a 1
to 1 relationship with BaseTbl so E.Pasma is correct.

I’m now wondering if you omit the WHERE & ORDER BY and run the following