For
this project,
you will explore,
albeit briefly,
IBM DB2's query optimizer (v11)
and how it optimizes SQL queries.
The DB2 server you will use
—
db2ledu,
and associated client
db2leduc
—
is the default DB2 server set up on the “PRISM”
eecs.yorku.ca domain.
(It is running DB2 v11.)

This hosts the database stl.
In particular,
your task is to
explore when the optimizer switches from
one plan to another
for the same templated query,
as reduction factors are changed.
These are the tipping points in the plan space.

You may work in teams of up to two people.

The STL Database

The
StL data-warehouse is for St Lawrence Booksellers, Inc.
It records sales by customer and book from the years 1997 to 2003.
StL is a small, boutique web-based bookseller.

The StL Data-warehouse Schema

The
schema of StL-DW consists of three tables:

Customer

cust#

INTEGER

country

VARCHAR(20)

province

VARCHAR(20)

city

VARCHAR(20)

Book

book#

INTEGER

language

VARCHAR(20)

genre

VARCHAR(20)

publisher

VARCHAR(40)

price

DECIMAL(5, 2)

Purchase

cust#

INTEGER

book#

INTEGER

when

DATE

qnty

INTEGER

sale

DECIMAL(6, 2)

The underlined attributes show the primary keys.
In the language of data-warehousing,
Customer
and
Book
are dimension tables,
and
Purchase
is the fact table.
Customer has 100,000 rows,
Book has 50,000 rows,
and
Purchase has 9,997,850 rows.
There is a foreign key
from Purchase to Customer
and from Purchase to Book.

In Purchase,
cust#,
book#, and
when
are the dimension columns.
The measure column is sale.
It reports how much the sale was for.
Each sale is equal
to the book's price × qnty.
The column qnty indicates how many copies of this book
was bought in that given purchase.
In most cases, qnty = 1.
Note that sale has already accommodated qnty!
In your queries,
you do not have to multiply by qnty.

Since StL is a web-based book-seller,
we mail the books to the customers' addresses.
So one can think of "customer" as a location.
Call that dimension location then.
That dimension offers a natural roll-up:
country >
state >
city >
cust#.

The next dimension is book.
Books have three attributes of interest:
language
(what language it is written in),
genre
(what category it is in, e.g., humor, politics, etc.),
and
publisher (what company publishes it).
These form three dimensions on book,
and offer another natural cube
(treating Book itself as a small fact table).

The last dimension is time,
which is represented by the attribute when
in Purchase.
It is of type DATE.
So it does not show the time of a purchase,
but reports the day of purchase.
Notice there is no table Time,
but we could pretend there is a virtual table
for Time.
E.g.,

Time

year

YEAR

month

MONTH

day

DAY

Thus, the time dimension provides for a natural roll-up.
Standard SQL provides many functions for dealing with DATE
and TIMESTAMP values.
In fact,
we can roll up when in many ways.

There are no NULL values in the data warehouse.

Indexes

The
clustered indexes are on the primary keys as follows.

customer:
cust#

Book:
book#

Purchase:
cust#,
book#,
when

Table Purchase
has two additional unclustered indexes:

book#

when

Accessing the database

The
database is accessible via the “db2” command
(which is the client)
on any of the PRISM machines
(departmentally managed machines
in the eecs.yorku.ca) domain;
e.g., red.eecs.yorku.ca.

The database is named stl.
So that is what you need to connect to.
E.g.,

% db2 "connect to stl"

The three tables are in a schema named stl.
So to access them, one needs to prepend
‘stl.’
in front of the table name.
E.g.,

% db2 "select count(*) from stl.purchase"

(See the other instruction pages for DB2
off of the class homepage for more help.)

The StL-DW is reasonably large.
Most of the (correct) queries
for this project take a few seconds to a minute or so
to execute.
Be patient.
For real DWs,
query execution time can be hours.
Design your queries carefully.
Do them in steps,
and debug along the way.
Also reuse (parts of) queries
once you have them working
in subsequent challenges.

Profiling Query Plans

We
shall use the tool db2expln
to explore query plans
that the DB2 optimizer creates.

You need to be able to “see”
commands in DB2's install.
From any shell you are using DB2,
run

% source ~db2leduc/cshrc.V11

This sets environment variables and command paths
to see the tools in the instance of the DB2 system
running on PRISM.

The following command will dynamically prepare
the query in file sql_file,
and output and explanation of DB2's plan for it to the shell.

% db2expln -d stl -terminal -z \; -f sql_file

This assumes the SQL query in sql_file
is terminated with a semicolon, as is convention.
(The “-z \;”
tells db2expln this.
Otherwise, db2expln assumes
the query text to be all on one line in the file.
The
“\”
simply escapes the semicolon for shell,
as semicolon has a special meaning in shell.)

To output into a file, say

% db2expln -d stl -o output_file -z \; -f sql_file

To prepare and explain a query just typed on the command line:

% db2expln -d stl -o output_file -z \;
-q "query text here"

Tasks

Design
the following queries.
For each,
you are finding a query with two variations:
the query is the same except
the values of a predicate is
—
or predicates are
—
different
between the two versions.
You are wanting that DB2's query optimizer finds different
query plans for the two versions.
E.g.,

select sum(sale) as total
from stl.purchase
where when between '01/01/2001' and '12/31/2001';

and

select sum(sale) as total
from stl.purchase
where when between '12/01/2001' and '12/31/2001';

Design queries with variants for the following.

Design a simple, one-relation query that
with two variants that result in different access paths.

Design a query that
uses aggregation
that requires an explicit sort for the aggregation
in the plan for one variant,
but does not need the sort
for the aggregation in the other variant
—
that is, it does the aggregation on the fly.

Design a query that
results in a plan that uses an index-nested loop join
for one variant, but not for the other variant.

Design a query that
results in plans with different join orders
for the two variants.

Design another query of your choice
where the two variants result in very different plans.
Explain why the optimizer is likely choosing
the plan for each variant.

Deliverables

Make
a directory named tipping for your project.

Within the directory,
put the SQL for your queries into files named
oneA.sql,
oneB.sql,
twoA.sql,
twoB.sql,
threeA.sql,
threeB.sql,
fourA.sql,
fourB.sql,
fiveA.sql,
and
fiveB.sql,
where one, two, three,
four, and five
correspond to the queries in the above section,
and A and B for each represent the two variants
for each.

Save the results of db2expl for each in
oneA.plan,
oneB.plan,
twoA.plan,
twoB.plan,
threeA.plan,
threeB.plan,
fourA.plan,
fourB.plan,
fiveA.plan,
and
fiveB.plan,
for the queries, respectively.

Write a text file named report.txt
Explain why each of the three query variations
result in different query plans.
Keep the report simple
(say, under 1,000 words).