hive-dev mailing list archives

[jira] [Updated] (HIVE-9523) when columns on which tables are partitioned are used in the join condition same join optimizations as for bucketed tables should be applied

Date: Fri, 13 Feb 2015 19:45:12 GMT

[ https://issues.apache.org/jira/browse/HIVE-9523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vikram Dixit K updated HIVE-9523:
---------------------------------
Labels: gsoc2015 (was: )
> when columns on which tables are partitioned are used in the join condition same join optimizations as for bucketed tables should be applied
> --------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-9523
> URL: https://issues.apache.org/jira/browse/HIVE-9523
> Project: Hive
> Issue Type: Improvement
> Components: Logical Optimizer, Physical Optimizer, SQL
> Affects Versions: 0.13.0, 0.14.0, 0.13.1
> Reporter: Maciek Kocon
> Labels: gsoc2015
>
> For JOIN conditions that use the partitioning columns of both tables:
> ⋮
> FROM TabA JOIN TabB
> ON TabA.partCol1 = TabB.partCol2
> AND TabA.partCol2 = TabB.partCol2
> the optimizer could/should treat the join the same way as it does for bucketed tables:
> ⋮
> FROM TabC
> JOIN TabD
> ON TabC.clusteredByCol1 = TabD.clusteredByCol2
> AND TabC.clusteredByCol2 = TabD.clusteredByCol2
> and use either Bucket Map Join or, better, Sort Merge Bucket Map Join.
> This is based on the fact that, just as buckets translate to separate files, partitions essentially provide the same mapping.
> When data locality is known, the optimizer could focus on joining only the corresponding partitions rather than the whole data sets.
> #side notes:
> ⦿ Table DDL syntax currently allows partitioning and bucketing to be defined at the same time:
> CREATE TABLE
> ⋮
> PARTITIONED BY(…) CLUSTERED BY(…) INTO … BUCKETS;
> But in this case the optimizer never chooses Bucket Map Join or Sort Merge Bucket Map Join, which defeats the purpose of creating bucketed tables in such scenarios. Should that be raised as a separate bug?
> ⦿ Partitioning and bucketing are currently two separate things but serve the same purpose. Shouldn't the concepts be merged (explicit/implicit partitions)?
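
The idea in the description above (join only the corresponding partitions rather than the whole data sets) can be illustrated with a small sketch. This is not Hive's implementation and not what the optimizer does today; it is a toy in-memory model in Python. The function names partition_table and partition_wise_join, the column values, and the sample rows are all hypothetical. It assumes the join condition is equality on every partition column, so any two rows in partitions with the same key already satisfy the join.

```python
from collections import defaultdict
from itertools import product


def partition_table(rows, part_cols):
    """Group rows (dicts) by the values of their partition columns.

    Hypothetical stand-in for a partitioned table: each key plays the
    role of one partition directory.
    """
    partitions = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in part_cols)
        partitions[key].append(row)
    return dict(partitions)


def partition_wise_join(parts_a, parts_b):
    """Join only partitions whose keys match on both sides.

    Partitions present on one side only are never read or compared,
    which is the saving the issue asks the optimizer to exploit.
    """
    joined = []
    for key in sorted(parts_a.keys() & parts_b.keys()):
        # Equal partition keys imply the equality join condition holds
        # for every pair of rows drawn from the two partitions.
        for row_a, row_b in product(parts_a[key], parts_b[key]):
            joined.append((row_a, row_b))
    return joined


# Hypothetical sample data; column names follow the issue's example.
part_cols = ["partCol1", "partCol2"]
tab_a = [
    {"partCol1": "2015", "partCol2": "01", "val": "a1"},
    {"partCol1": "2015", "partCol2": "02", "val": "a2"},
]
tab_b = [
    {"partCol1": "2015", "partCol2": "01", "val": "b1"},
    {"partCol1": "2014", "partCol2": "12", "val": "b0"},
]

result = partition_wise_join(
    partition_table(tab_a, part_cols),
    partition_table(tab_b, part_cols),
)
print([(a["val"], b["val"]) for a, b in result])
```

In this model only the partition with key ("2015", "01") exists on both sides, so it is the only one whose rows are paired; the other two partitions are skipped without a row-level comparison.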
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)