For the products with a certain product.parent, I need to select minimal and maximal price within the given site.region, and return first 10 products ordered by the minimal price.

Each parent has 100 to 10,000 products, and there are 10,000,000 records in price table.

We have a classical OLAP task here: a fact table (price) and two dimension tables (product and site).

The task would be almost trivial if we were given the exact values of product and site. This way, we could build a composite index on these fields and value, filter on the exact values and get the first 10 values from the index.

However, we are given not the values of the dimensions themselves but those of the higher levels (parent for product and region for site).

Since the values of the levels are not stored in the facts table, we cannot index them. We need to join the dimension tables and filter on them.

This article describes four ways to join the tables, their efficiency varying depending on the density of the dimension values.

Since the only algorithm to make the joins MySQL is capable of is nested loops, basically, we need to define the order in which the tables would be joined.

There are three tables and, hence, 3! = 6 permutations that define all possible join orders:

product

site

price

product

price

site

site

price

product

site

product

price

price

product

site

price

site

product

However, two dimension tables are completely independent on each other. This means that if they come one after another in the join their order does not actually matter: they both will be searched for independent values. This reduces the number of actual combinations:

product/site

price

product

price

site

site

price

product

price

product/site

The joins must be designed so that the tables with most selective conditions go first. This means that the join order is determined by the density of the values satisfying the criteria. The more values from the table satisfy the search criteria, the later should the table come in the join, so that by the time the join occurs, most values would have been already sifted out.

Now, let's build the queries for all types of the join. To do this, we will create sample tables:Table creation details

select `20100930_prices`.`p`.`id` AS `id`,`20100930_prices`.`p`.`parent` AS `parent`,`20100930_prices`.`p`.`name` AS `name`,min(`20100930_prices`.`r`.`value`) AS `min_value` from `20100930_prices`.`site` `s` straight_join `20100930_prices`.`price` `r` straight_join `20100930_prices`.`product` `p` where ((`20100930_prices`.`r`.`site` = `20100930_prices`.`s`.`id`) and (`20100930_prices`.`p`.`id` = `20100930_prices`.`r`.`product`) and (`20100930_prices`.`p`.`parent` = 1) and (`20100930_prices`.`s`.`region` = 30)) group by `20100930_prices`.`p`.`id` order by min(`20100930_prices`.`r`.`value`) limit 10

This query completes in 30 ms, since the conditions (in their totality) are less selective.

Product, price, site

This query is best for sparse products and dense sites:

In this query, we will replace the JOIN against the site table with an IN predicate, since we don't actually need any information from that table:

SELECT p.*, MIN(r.value) AS min_value
FROM product p
STRAIGHT_JOIN
price r
ON r.product = p.id
WHERE p.parent = 100
AND r.site IN
(
SELECT id
FROM site
WHERE region = 1
)
GROUP BY
p.id
ORDER BY
min_value
LIMIT 10

id

parent

name

min_value

44246

100

Product 44246

20.4741

54185

100

Product 54185

20.5249

241517

100

Product 241517

20.7065

131185

100

Product 131185

21.1123

61725

100

Product 61725

21.1389

80285

100

Product 80285

21.3354

22533

100

Product 22533

21.4976

226588

100

Product 226588

21.5061

114991

100

Product 114991

21.5185

5571

100

Product 5571

21.7684

10 rows fetched in 0.0004s (0.0302s)

id

select_type

table

type

possible_keys

key

key_len

ref

rows

filtered

Extra

1

PRIMARY

p

ref

PRIMARY,ix_product_parent

ix_product_parent

4

const

136

100.00

Using where; Using temporary; Using filesort

1

PRIMARY

r

ref

ix_price_product_value_id

ix_price_product_value_id

4

20100930_prices.p.id

12

100.00

Using where

2

DEPENDENT SUBQUERY

site

unique_subquery

PRIMARY,ix_site_region

PRIMARY

4

func

1

100.00

Using where

select `20100930_prices`.`p`.`id` AS `id`,`20100930_prices`.`p`.`parent` AS `parent`,`20100930_prices`.`p`.`name` AS `name`,min(`20100930_prices`.`r`.`value`) AS `min_value` from `20100930_prices`.`product` `p` straight_join `20100930_prices`.`price` `r` where ((`20100930_prices`.`r`.`product` = `20100930_prices`.`p`.`id`) and (`20100930_prices`.`p`.`parent` = 100) and <in_optimizer>(`20100930_prices`.`r`.`site`,<exists>(<primary_index_lookup>(<cache>(`20100930_prices`.`r`.`site`) in site on PRIMARY where ((`20100930_prices`.`site`.`region` = 1) and (<cache>(`20100930_prices`.`r`.`site`) = `20100930_prices`.`site`.`id`)))))) group by `20100930_prices`.`p`.`id` order by min(`20100930_prices`.`r`.`value`) limit 10

Again, this query is 30 ms.

Price, product/site

This query is best for dense products and dense sites. This is the most complex query which would require some explanation.

At the first sight, it may seem that price cannot be filtered on. However, it's not true.

What we need is 10 lowest values from price satisfying certain conditions (right products and sites, and, additionally, products should be unique).

This means that we could make use of an index on price.value, scanning it until 10 records satisfying these conditions are returned. In this case, ORDER BY and LIMIT 10 would serve as filtering conditions: the scanning would cease as soon as the limit is reached.

And of course the more dense are the products and sites, the more is the probability for the conditions to be satisfied, the sooner the query completes.

Filtering on products and sites is easy, but to ensure that the products are unique, we will use the trick described in my previous article:

Field or reference '20100930_prices.p.id' of SELECT #2 was resolved in SELECT #1
select `20100930_prices`.`p`.`id` AS `id`,`20100930_prices`.`p`.`parent` AS `parent`,`20100930_prices`.`p`.`name` AS `name`,`20100930_prices`.`r`.`value` AS `min_value` from `20100930_prices`.`price` `r` straight_join `20100930_prices`.`product` `p` where ((`20100930_prices`.`p`.`id` = `20100930_prices`.`r`.`product`) and (`20100930_prices`.`p`.`parent` = 1) and (`20100930_prices`.`r`.`id` = (select `20100930_prices`.`ri`.`id` from `20100930_prices`.`price` `ri` where ((`20100930_prices`.`ri`.`product` = `20100930_prices`.`p`.`id`) and <in_optimizer>(`20100930_prices`.`ri`.`site`,<exists>(<primary_index_lookup>(<cache>(`20100930_prices`.`ri`.`site`) in site on PRIMARY where ((`20100930_prices`.`site`.`region` = 1) and (<cache>(`20100930_prices`.`ri`.`site`) = `20100930_prices`.`site`.`id`)))))) order by `20100930_prices`.`ri`.`product`,`20100930_prices`.`ri`.`value`,`20100930_prices`.`ri`.`id` limit 1))) order by `20100930_prices`.`r`.`value`,`20100930_prices`.`r`.`id` limit 10

We see that this query completes in only 14 ms despite the fact that we used the most populated parent and region. As I said before, the LIMIT 10 served as another filtering condition.

Summary

Let's run all possible queries for all possible distributions and record the query times in a table

P, S and R stand for product, site and price in the join order.

+Inf means that the query did not complete in a reasonable time and had to be terminated.

Query time, ms

Product

Sparse (parent 100)

Dense (parent 1)

P/S,R

S,R,P

P,R,S

R,P/S

P/S,R

S,R,P

P,R,S

R,P/S

Site

Sparse (region 30)

13

27

30

+Inf

621

30

6,703

51,468

Dense (region 1)

7,679

4,427

30

1,051

+Inf

5,914

6,736

14

Conclusion

If limiting the results, MySQL allows filtering fact tables efficiently on higher level dimensions despite the fact these dimensions cannot be indexed. However, in each case the selectivity of the dimension level should be taken into account and the appropriate query should be used.

It should be noted that the queries involving a CROSS JOIN and index scan on the fact table perform intolerably poorly in the edge cases (too dense and too sparse dimensions, accordingly). On the other hand, the queries involving a join only differ in the join order which can be predicted by MySQL.

This means that when the dimension selectivity is unknown, a query using a plain join of all three tables (without forcing the join order) should be the query of choice:

The first two queries run smoothly, under 0.030 seconds, but the 3rd one takes 47 seconds. Tried to run multiple times, time no more improves. Why happens this so differently for 3rd one? I’ve run the explain extended on it, and returns the same as in the post above.