Query more than one data source:

Query more than one data source Mobile users may wish to combine information from more than one remote databases
E.g., A vegetarian visits Hong Kong and looks for some restaurants recommended by both HK tourist office and HK vegetarian community

Download all relations?:

Download all relations? Download both relations (HK tourist office and HK vegetarian directory) onto the mobile device and evaluate the join on the device locally
Won’t be able to hold the large amount of data from the remote databases (for most mobile devices)
The transfer cost is very high though the result size is very small

Outline:

A simple late-projection strategy:

A simple late-projection strategy Traditional distributed query processing techniques like semi-join involves:
Shipping of join columns and (whole) tuples
Across the trusted distributed nodes directly
In high selective join, most tuples fails to contribute to the final result
Semi-join d/l the non-key attributes which may not be included in the result
Download and join the distinct values of join keys only (Do not download the non-key attributes)
Only tuples belong to join result entails downloading the rest of non-key attributes

Step 6:

Block-merge join (BMJ):

Block-merge join (BMJ) Late-projection still insufficient if the whole join column cannot fit into the memory of mobile devices
Applying sort-merge join, with the sorting part on the servers
1 block of ordered join keys are downloaded from each server and join them locally, until one block is exhausted
Each block must
Cover same data range
Sorted in same order (e.g., both in ascending order)
Small enough to be resided in memory
Each block can be downloaded by using ROWNUM or LIMIT SQL statements

Ship-data-as-queries join (SDAQ):

Ship-data-as-queries join (SDAQ) If R1<< R2, transfer cost can be reduced by:
Download the join column of R1 to the mobile device: SELECT Name FROM R1
Send the join keys to R2 in form of SQL selection queries (e.g., if two results returned in step 1): SELECT Name FROM R2 WHERE Name in (‘Beta Food’,‘Ceta Food’))
The result from R2 are the joined keys
Very few results returned

Can we do even better?:

Can we do even better? Block-merge join (BMJ) can handle the limited memory problem, BUT download all join keys essentially
Ship-data-as-queries (SDAQ) can do better ONLY if the sizes of two relations differ much
Pay small overhead 
Build histograms that capture the data distribution of target relations
Join space pruning
Bucket-base joining

Pruning the data space:

Pruning the data space

Constructing Histogram:

Constructing Histogram Problem
Mobile devices are not able to receive those histograms (they are some internal data structures in remote databases)
Solution
Constructing some queries that build histogram through SQL

String Histogram:

String Histogram Using the SUBSTRING function
SELECT SUBSTRING(Name,1,1) AS Bucket, COUNT(Name) As Count FROM R1 GROUP BY SUBSTRING(Name,1,1) HAVING COUNT(Name) > 0

Numeric Histogram:

Numeric Histogram Using ROUND function
SELECT ROUND(Cost/(D/G)) AS Bucket, COUNT(ROUND(Cost/(D/G)) As Count FROM R2 GROUP BY ROUND(Cost/(D/G) HAVING COUNT(ROUND(Cost/(D/G)) > 0
G is granularity that specifies the number of bucket
D is the numeric domain

A Bucket-base Approach:

A Bucket-base Approach So far we know that:
SDAQ is good when input relations have large size difference
But, BMJ is better when two input relations have similar size (Why?)
Histogram helpful to prune join space
“Which method is better?”
The histogram can do more!
The histogram already partitioned the data space in form of buckets
Depends on the data distribution of each bucket, assign the best action to them adaptively

Direct Join (~BMJ):

Direct Join (~BMJ)

Ship-join-keys (~SDAQ):

Ship-join-keys (~SDAQ)

Recursive Partitioning:

Recursive Partitioning

Recursive Partitioning:

Recursive Partitioning Further breaks a partition into more refined ones and further request histogram for it
Hoping some sub-buckets are being pruned in future
Or hoping cheap ship-join-keys join can be applied on some future sub-buckets

Recursive Partitioning:

Recursive Partitioning SELECT SUBSTRING(Name,1,2) AS Bucket, COUNT(Name) AS Count FROM R2 WHERE SUBSTRING(Name,1,1)=‘A’ GROUP BY SUBSTRING(Name,1,2) HAVING COUNT(Name) > 0

Which action is the best for each bucket? The cost model!:

Which action is the best for each bucket? The cost model! The largest amount of data that can be transferred in one packet is called MTU (Maximum Transfer Unit)
The largest segment of TCP data that can be transmitted is called MSS (Maximum Segment Size)
MTU = MSS + BH (BH is the size of headers)
To transfer B bytes data, the actual number of bytes to be transferred is:

C3 : Recursive Partitioning :

C3 : Recursive Partitioning Have to estimate the cost of:
Ask for finer histograms for that bucket from R1
Ask for finer histograms for that bucket from R2
For each pair of (future/virtual) sub-buckets, each of them may execute direct-join, ship-join-key or recursive partitioning again

Recursive Partitioning C3:

Recursive Partitioning – Optimistic Estimation:

Recursive Partitioning – Optimistic Estimation Optimistically assume that buckets in next level are all being pruned
It will hold if the data distribution in the two datasets is very different
Since all future (next-level) sub-buckets are being pruned, they would NOT have any actions
Therefore:
C3(αi,βi)= Ch (G,R1) + Ch (G,R2) + CRP (αi,βi)

Recursive Partitioning – Linear Interpolation Estimation:

Recursive Partitioning – Linear Interpolation Estimation More accurate. Higher computational cost
Exploit the histogram in current level to estimate the distribution of next level We DON’T have
histograms in
this level We have
histograms in
this level

RAMJ Algorithm:

RAMJ Algorithm Recursive and Adaptive Mobile Join

Real Data Experiment:

Real Data Experiment Real Data Set 1
DBLP
Join relations “Conference” (235K tuples) and “Journal” (125K tuples) in order to find the set of publications that have the same conference and journal title
3836 publications have same title in both conference and journal paper
SELECT R1.Title
FROM Conference R1, Journal R2
WHERE R1.Title = R2.Title

Real Data Experiment:

Real Data Experiment Real Data Set 2
Restaurants Data Set
Crawled from www.restaurantrow.com
Join relation “Steak” (4573 tuples) and “Vegetarian” (2098 tuples) in order to find the set of restaurants that offer both steak and vegetarian dishes (163 joined)
SELECT R1.Name
FROM Steak R1, Vegetarian R2
WHERE R1.Name = R2.Name

Synthetic Data Experiment:

Synthetic Data Experiment Result:

Synthetic Data Experiment Result

The impact of data skew:

The impact of data skew

The impact of memory size:

The impact of memory size

Conclusions and Future Work:

Conclusions and Future Work Identify the requirements and limitation on evaluating ad-hoc join on mobile devices
A recursive and adaptive algorithm – RAMJ
Extension to multi-way joins and multi-attributes
Extension to Top-K-Join
Existing approaches on Top-K problem ONLY works on collaborative database

Approach 2:Distributed query processing?:

Approach 2: Distributed query processing? Not work on our problem!
DB1 and DB2 are non-collaborative
Would not accept “data structures” as input (e.g., a “join column” in semi-join or a “hash-table” in bloom-join
Accept SQL only
Semi-join is worked by some modifications:
Send J and K through the mobile device
 High transfer cost

Other Join Types:

Other Join Types Equi-join with selection constraints
E.g., We are interested in restaurants appear in both datasets and the expense is less than $20
SELECT R1.Name, R1.Address, R2.Cost
FROM R1, R2
WHERE R1.Name = R2.Name
AND R2.Cost < 20
Add this condition in histogram construction:
SELECT SUBSTRING(Name,1,1) AS Bucket,
COUNT(Name) As Count
FROM R1
WHERE R2.Cost < 20
GROUP BY SUBSTRING(Name,1,1)
HAVING COUNT(Name) > 0

Iceberg Semi-join:

Iceberg Semi-join Find all restaurants in R1 which are recommended by at least 10 users in a discussion group R2
Properties:
Equi-join between R1 and R2
Results are comes from R1 only
Condition is applied on R2 only (>10 users)
SELECT SUBSTRING(Name,1,1) AS Bucket,
COUNT(Name) As Count
FROM R2
GROUP BY SUBSTRING(Name,1,1)
HAVING COUNT(Name) > t