hive-user mailing list archives

Perhaps you can first create a temp table that contains only the records that will match? See
the UNION ALL trick at
http://www.mail-archive.com/hive-user@hadoop.apache.org/msg01906.html
________________________________
From: Brad Ruderman <bruderman@radiumone.com>
To: user@hive.apache.org
Sent: Monday, July 29, 2013 11:38 AM
Subject: Best Performance on Large Scale Join
Hi All-
I have 2 tables:
CREATE TABLE users (
a bigint,
b int
)
CREATE TABLE products (
a bigint,
c int
)
Each table has about 8 billion records (roughly 2k files total mappers). I want to know the
most performant way to do the following query:
SELECT u.b,
p.c,
count(*) as count
FROM users u
INNER JOIN products p
ON u.a = p.a
GROUP BY u.b, p.c
Right now the reducing is killing me. Any suggestions on improving performance? Would a mapbucket
join be optimal here?
Thanks,
Brad