hive-user mailing list archives

On Mon, Jan 31, 2011 at 11:08 AM, <hive1@gmx.de> wrote:
> Hello,
>
> I would like to build reporting with Hive on something like tracking data.
> I want to query the raw data, which is about 2 GB or more a day, with Hive.
> This already works for me, no problem.
> I also want to cascade the reporting data down to something like client and date,
> in Hive something like partitioned by (client String, date String).
> That means I have multiple aggregation levels. I would like to do all levels in
> Hive to have a consistent reporting source.
> And here is the thing: might it be a problem if this leads to many small files?
> An aggregation level such as client/date might produce files of about 1 MB each,
> around 1000 of them a day.
> Is this a problem? I have read about the "too many open files" problem with Hadoop.
> And might this lead to bad Hive/MapReduce performance?
> Maybe someone has some clues about that...
>
> Thanks in advance
> labtrax
>
You probably do not want to partition on a high-cardinality column
such as client_id. Many small partitions are bad for the NameNode,
which keeps the metadata for every file in memory, and bad for
MapReduce performance. With 1000 client ids you get 1000+ files per
day, and that becomes trouble over a long period of time.
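
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The 1000 files per day and the 64 buckets come from this thread. The one-year retention, the date-only partitioning for the bucketed layout, and the figure of about 150 bytes of NameNode heap per object are assumptions (the last one is a commonly quoted rule of thumb, not a measurement).

```python
# Rough NameNode object counts for the two layouts discussed in this thread.
# Assumptions (not from the original mail): 365 days of retention, the
# bucketed table is partitioned by date only, and ~150 bytes of NameNode
# heap per file/block/directory object.

DAYS = 365                # assumed retention period
CLIENTS_PER_DAY = 1000    # from the question: ~1000 files of ~1 MB per day
BUCKETS = 64              # from the reply: 64 buckets on client_id
BYTES_PER_OBJECT = 150    # assumed rule-of-thumb heap cost per object


def namenode_objects(files_per_day, dirs_per_day, days=DAYS):
    """Count files, blocks and leaf partition directories."""
    files = files_per_day * days
    blocks = files            # assumes every file fits in a single HDFS block
    dirs = dirs_per_day * days
    return files + blocks + dirs


# Layout A: one partition directory (with one file) per client per day.
per_client = namenode_objects(CLIENTS_PER_DAY, CLIENTS_PER_DAY)

# Layout B: one partition directory per day holding 64 bucket files.
bucketed = namenode_objects(BUCKETS, 1)

print("per-client partitions:", per_client, "objects,",
      per_client * BYTES_PER_OBJECT // 1024, "KiB of heap")
print("64 buckets per day:   ", bucketed, "objects,",
      bucketed * BYTES_PER_OBJECT // 1024, "KiB of heap")
```

The absolute heap numbers for one table are small; the point is the ratio between the layouts, which grows with every additional table, aggregation level and year of history.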
One option is to bucket the table on client_id into 64 buckets. Hive
can use the buckets to prune the amount of data that has to be
scanned. It is a compromise between many small files and a few
really large files.
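
The idea behind bucketing can be sketched as follows. This is an illustration only: the hash mimics Java's String.hashCode for ASCII input, which is an assumption and not necessarily the exact function Hive applies, and the client ids are made up.

```python
# Sketch of hash bucketing: every client_id maps to one of 64 bucket files.
# The hash function is an assumption (Java String.hashCode style), used here
# only to show the principle; the client ids are hypothetical.

NUM_BUCKETS = 64


def java_style_hash(s):
    """32-bit polynomial hash, h = 31 * h + ord(ch), as in Java strings."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep 32 bits, like a Java int
    return h


def bucket_for(client_id, num_buckets=NUM_BUCKETS):
    """Mask off the sign bit, then take the modulus to pick a bucket."""
    return (java_style_hash(client_id) & 0x7FFFFFFF) % num_buckets


# 1000 hypothetical clients end up in at most 64 files per day.
clients = ["client_%04d" % i for i in range(1000)]
files_touched = {bucket_for(c) for c in clients}
print(len(clients), "clients ->", len(files_touched), "bucket files")
```

Since a given client_id always lands in the same bucket, a lookup for a single client would need to read, at best, 1 of the 64 files for a day instead of all of them.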
Generally you want big files so Hadoop can use brute-force table scans.
Edward