CPSC 404, Laks V.S. Lakshmanan

What you will learn from this lecture
• What is a query plan?
• What options exist for choosing query plans?
• Example plans and cost estimation
• How are reduction factors estimated and used?

Overview of Query Optimization
• Plan: Tree of RA ops, with choice of algo for each op.
  – Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for its next output tuples, it 'pulls' on its inputs and computes them.
• Two main issues:
  – For a given query, what plans are considered?
    ◦ Algorithm to search plan space for cheapest (estimated) plan.
  – How is the cost of a plan estimated?
• Ideally: Want to find best plan. Practically: Avoid terrible plans!
• We will study the System R approach. (Pioneered by IBM.)
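
The 'pull' interface can be illustrated with a minimal sketch that uses Python generators as operators. The operator names (scan, select, project) and the three sample Songs tuples are assumptions made up for illustration; they are not taken from the lecture or from any particular DBMS.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def scan(table: Iterable[Row]) -> Iterator[Row]:
    """Leaf operator: hands out stored tuples one at a time."""
    for row in table:
        yield row


def select(child: Iterator[Row], pred: Callable[[Row], bool]) -> Iterator[Row]:
    """Selection: pulls on its input and passes on qualifying tuples."""
    for row in child:
        if pred(row):
            yield row


def project(child: Iterator[Row], cols: List[str]) -> Iterator[Row]:
    """Projection: pulls on its input and keeps only the listed columns."""
    for row in child:
        yield {c: row[c] for c in cols}


# Hypothetical sample data, for illustration only.
songs = [
    {"sid": 1, "genre": "jazz", "year": 1999},
    {"sid": 2, "genre": "rock", "year": 2005},
    {"sid": 3, "genre": "pop", "year": 2010},
]

# Plan tree: project(genre) <- select(year > 2000) <- scan(Songs)
plan = project(select(scan(songs), lambda r: r["year"] > 2000), ["genre"])

# Pulling on the root pulls on select, which in turn pulls on scan.
first = next(plan)
rest = list(plan)
```

Nothing is computed until the root is pulled, and no intermediate result is stored, which is what makes pipelining between operators possible.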

Highlights of System R Optimizer
• Impact:
  – Most widely used currently; works well for < 10 joins.
• Cost estimation: Approximate art at best.
  – Statistics, maintained in system catalogs, are used to estimate cost of operations and result sizes.
  – Considers a combination of CPU and I/O costs.
  – This lecture covers only cost and intermediate result size estimation.
• Plan Space: Too large, must be pruned.
  – Only the space of left-deep plans is considered.
    ◦ Left-deep plans allow the output of each operator to be pipelined into the next operator without storing it in a temporary relation.
  – Cartesian products are avoided.
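
To get a feel for how the two pruning rules shrink the plan space, here is a brute-force sketch that lists left-deep join orders and discards any order that would need a Cartesian product. This is not the System R search algorithm, only an enumeration for illustration; the relation names A, B, C and their join predicates are hypothetical.

```python
from itertools import permutations


def left_deep_orders(relations, join_edges):
    """Yield left-deep join orders in which every newly added relation
    shares a join predicate with some relation already in the plan."""
    edges = {frozenset(e) for e in join_edges}
    for order in permutations(relations):
        joined = {order[0]}
        ok = True
        for rel in order[1:]:
            # No predicate linking rel to the plan so far means a Cartesian product.
            if not any(frozenset((rel, prev)) in edges for prev in joined):
                ok = False
                break
            joined.add(rel)
        if ok:
            yield order


# Hypothetical query: A joins with B, and B joins with C.
relations = ["A", "B", "C"]
join_edges = [("A", "B"), ("B", "C")]

orders = list(left_deep_orders(relations, join_edges))
for o in orders:
    print(" JOIN ".join(o))
```

With three relations there are 3! = 6 left-deep orders; the two that start by pairing A with C are dropped because no predicate connects them.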

Motivating Example (Plan 0)

    SELECT S.genre
    FROM Ratings R, Songs S
    WHERE R.sid=S.sid AND R.uid=50 AND S.year>2000

[Figure: RA tree and the corresponding plan. In both, Songs and Ratings are joined on sid=sid, with the selections uid=50 and year > 2000 and the projection on genre above the join. The plan annotates the join as (Simple Nested Loops) and the operators above it as (On-the-fly).]

• Cost: 500+500*1000 I/Os
• By no means the worst plan!
• Misses several opportunities: selections could have been 'pushed' earlier, no use is made of any available indexes, etc.
• Goal of optimization: To find more efficient plans that compute the same answer.

Nested Loop Join Plan (contd.)
• The cost 500 + 500 * 1000 for NLJ corresponds to putting Songs (the smaller table) in the outer for loop.
• Which of the above I/Os will be sequential and which random? Why?
• When does the randomness of I/O make a big impact?
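
The effect of the outer/inner choice can be checked with a short calculation. The sketch below assumes the cost formula implied by the slide (read the outer once, and scan the inner once per outer page) and page counts of 500 for Songs and 1000 for Ratings, as read off the expression 500 + 500 * 1000. The function name is made up for illustration.

```python
def nlj_cost(outer_pages: int, inner_pages: int) -> int:
    """I/O cost of a nested loop join that scans the inner table once per
    outer page: outer_pages + outer_pages * inner_pages."""
    return outer_pages + outer_pages * inner_pages


SONGS_PAGES = 500     # assumed from the slide's cost expression
RATINGS_PAGES = 1000  # assumed from the slide's cost expression

songs_outer = nlj_cost(SONGS_PAGES, RATINGS_PAGES)    # 500 + 500 * 1000
ratings_outer = nlj_cost(RATINGS_PAGES, SONGS_PAGES)  # 1000 + 1000 * 500

print(songs_outer, ratings_outer)
```

The product term is the same either way; only the single scan of the outer table changes, so putting the smaller table outside saves 500 I/Os here.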

No Indexes (contd.)
  – The cost would be _____ if nested loop join was used.
  – The cost would be _____ if we materialized T2, pipelined T1, and used NLJ.
  – And it would be _____ if T1 was materialized and T2 pipelined.

Alternative Plan 1 (contd.)
• Block-based NLJ: group the pages of one table into chunks of k pages, and read the other table page by page for every chunk of the first table.
  – Note: the "block" in "block-based" refers to a chunk.
• If we used block-based nested loop (BNL) join (e.g., scan T2 for every 3-page chunk of T1), join cost = 10+4*250, total cost = 1010 + 1760 = 2770. (Why 4 times?)
• If we 'push' projections, T1 has only sid, T2 only sid and genre:
  – Assume fields of equal size (Ratings). Sizes: 10, 10, 15, 15 bytes (Songs).
  – T1 fits in ceil(10/4) = 3 pages, while T2 fits in 250/2 = 125 pages; cost of BNL drops to under 250 I/Os, total < 2000.
  – The cost will actually be ____.
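
The BNL join cost above can be reproduced with a small helper. This is a sketch under the assumption that the outer table is read once and the inner table is scanned once per chunk; the function name is hypothetical, and the page counts (10, 250, 3, 125) and the 1760 I/Os for the steps before the join are the slide's own numbers.

```python
import math


def bnl_join_cost(outer_pages: int, inner_pages: int, chunk_pages: int) -> int:
    """I/O cost of block nested loop join: read the outer once, and scan
    the inner once for every chunk of the outer."""
    chunks = math.ceil(outer_pages / chunk_pages)
    return outer_pages + chunks * inner_pages


# T1 (10 pages) as outer in 3-page chunks, T2 (250 pages) as inner.
chunks_t1 = math.ceil(10 / 3)           # number of scans of T2
join_cost = bnl_join_cost(10, 250, 3)   # 10 + 4 * 250
total_cost = join_cost + 1760           # 1760: cost of the earlier steps, per the slide

# With projections pushed: T1 in 3 pages (a single chunk), T2 in 125 pages.
pushed_join_cost = bnl_join_cost(3, 125, 3)

print(chunks_t1, join_cost, total_cost, pushed_join_cost)
```

A 10-page outer split into 3-page chunks needs four chunks (the last one holds a single page), which is why T2 is scanned four times.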

Alternative Plans 2 With Indexes
• Total cost = 1210 I/Os when alternative 1 is used. What about alternative 2?
• If, in the previous example, the index on uid was unclustered, the cost would be ____.
• Would we still consider using index-based NLJ? Why (not)?
• If we used BNL join, would pushing projection on the outer relation help? Why?

Cost Estimation
• For each plan considered, must estimate cost:
  – Must estimate cost of each operation in plan tree.
    ◦ Depends on input cardinalities.
    ◦ We've already discussed how to estimate the cost of operations (sequential scan, index scan, joins, etc.)
  – Must estimate size of result for each operation in tree!
    ◦ Use information about the input relations.
    ◦ For selections and joins, assume independence of predicates.
• We'll discuss the System R cost estimation approach.
  – Inexact, but works OK in practice.
  – More sophisticated techniques known now.

Size Estimation and Reduction Factors
• Consider a query block:

    SELECT attribute list
    FROM relation list
    WHERE term1 AND ... AND termk

• Maximum # tuples in result is the product of the cardinalities of relations in the FROM clause.
• Reduction factor (RF) associated with each term reflects the impact of the term in reducing result size. Result cardinality = Max # tuples * product of all RFs.
  – Implicit assumption that terms are independent!
  – Term col=value has RF 1/NKeys(I), given index I on col
  – Term col1=col2 has RF 1/MAX(NKeys(I1), NKeys(I2))
  – Term col>value has RF (High(I)-value)/(High(I)-Low(I))
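
The three reduction factor rules can be applied to the motivating query as a worked sketch. Every catalog statistic below (tuple counts, NKeys values, and the Low/High range for year) is a made-up assumption for illustration; the lecture does not give these numbers, and the function names are hypothetical.

```python
from math import prod


def rf_eq_value(nkeys: int) -> float:
    """Term col = value: RF = 1 / NKeys(I)."""
    return 1 / nkeys


def rf_eq_cols(nkeys1: int, nkeys2: int) -> float:
    """Term col1 = col2: RF = 1 / MAX(NKeys(I1), NKeys(I2))."""
    return 1 / max(nkeys1, nkeys2)


def rf_gt_value(high: float, low: float, value: float) -> float:
    """Term col > value: RF = (High(I) - value) / (High(I) - Low(I))."""
    return (high - value) / (high - low)


def estimate_cardinality(cardinalities, reduction_factors) -> float:
    """Max # tuples (product of cardinalities) times the product of all RFs,
    assuming the terms are independent."""
    return prod(cardinalities) * prod(reduction_factors)


# Hypothetical statistics (not from the lecture).
ratings_tuples, songs_tuples = 100_000, 40_000
rfs = [
    rf_eq_cols(40_000, 40_000),    # R.sid = S.sid
    rf_eq_value(1_000),            # R.uid = 50, assuming 1,000 distinct uids
    rf_gt_value(2020, 1980, 2000), # S.year > 2000, assuming years span 1980..2020
]

estimate = estimate_cardinality([ratings_tuples, songs_tuples], rfs)
print(round(estimate))  # roughly 50 tuples under these assumed statistics
```

The estimate is only as good as the independence assumption and the statistics behind it; correlated predicates would make the true result size differ from this product.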

Summary
• Query optimization is an important task in a relational DBMS.
• Must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
• Two parts to optimizing a query:
  – Consider a set of alternative plans.
    ◦ Must prune search space; typically, left-deep plans only.
  – Must estimate cost of each plan that is considered.
    ◦ Must estimate size of result and cost for each plan node.
    ◦ Key issues: Statistics, indexes, operator implementations.